10-03-2010, 06:46 AM   #76
Manichean
Quote:
Originally Posted by da_jane View Post
If I want to remove multiple lines of text, do I enclose my reg expressions in parentheses and then separate the sets by a ?
If I understand you correctly, you want to do something like
Code:
(set1)?(set2)
That would match the strings "set2" and "set1set2", but not "set1set1set2": the question mark is interpreted as a quantifier applied to the first group, making that group optional (zero or one occurrence). There are two ways you can match multiple lines; both depend on you knowing where the linebreaks are (or could be).
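Before looking at the two approaches, the behaviour of the question mark in the pattern above can be checked with a short Python sketch. The helper name fully_matches is made up for this example; the pattern itself is the one from the code box.

```python
import re

# Pattern from the post: an optional first group followed by a required second group.
pattern = re.compile(r"(set1)?(set2)")

def fully_matches(text):
    """Return True if the whole string (not just a part of it) matches."""
    return pattern.fullmatch(text) is not None

for candidate in ("set2", "set1set2", "set1set1set2"):
    print(candidate, fully_matches(candidate))

# An unanchored search still finds a match inside the longer string,
# but only the trailing part of it is matched.
print(pattern.search("set1set1set2").group(0))
```

The distinction between matching the whole string and finding a match somewhere inside it matters for search-and-replace: only the matched part is replaced.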
The first way is to use whitespace: "\s*?" will match a linebreak, since on Windows a linebreak is stored as two whitespace characters (carriage return and linefeed). Put that in the place(s) where you expect a linebreak to occur. It will let the whole expression match even if there isn't any linebreak, which is why I used the "*" quantifier.
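A small Python sketch of this first approach, in case it helps. The phrase "Scanned by ExampleScanner" and all sample strings are made up for illustration; only the placement of "\s*?" comes from the explanation above.

```python
import re

# Hypothetical phrase to remove; "\s*?" sits where a linebreak might occur.
pattern = re.compile(r"Scanned by\s*?ExampleScanner")

samples = {
    "windows linebreak": "Scanned by\r\nExampleScanner",
    "unix linebreak": "Scanned by\nExampleScanner",
    "plain space": "Scanned by ExampleScanner",
    "no whitespace": "Scanned byExampleScanner",
}

# Every variant matches, because "*" also allows zero whitespace characters.
results = {name: pattern.search(text) is not None for name, text in samples.items()}
for name, matched in results.items():
    print(f"{name}: {matched}")

# Removing the phrase from a longer (made-up) text:
cleaned = pattern.sub("", "Chapter 1\r\nScanned by\r\nExampleScanner\r\nIt was a dark night.")
print(repr(cleaned))
```

Note that the linebreaks around the removed phrase stay in the text; they would have to be included in the expression if they should go as well.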
The second way would be to use the dot and the dotall flag: put ".*?" where you expect linebreaks to occur, and add the flag construct "(?s)" to your whole expression, preferably at the very start (recent versions of Python's re module reject global flags placed anywhere else). The same remarks as for the first way apply, plus a caveat: be careful when using this, as the dot doesn't only match whitespace, but any character. You might accidentally match more than you intended to.
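The caveat can be seen in a short Python sketch. The text and the "Page header" / "Page footer" phrases are invented for this example; the intent is to remove a header that is directly followed by a footer on the next line.

```python
import re

# Made-up text: a header line, real content, then a header directly followed by a footer.
text = "Page header\nFirst paragraph of real content.\nPage header\nPage footer\n"

# Without the flag, the dot stops at the linebreak, so nothing matches.
no_flag = re.search(r"Page header.*?Page footer", text)

# With "(?s)" the dot also matches linebreaks...
with_flag = re.search(r"(?s)Page header.*?Page footer", text)

# ...but the match starts at the first header and swallows the paragraph as well.
dot_result = re.sub(r"(?s)Page header.*?Page footer", "", text)

# The whitespace variant only bridges whitespace, so the paragraph survives.
ws_result = re.sub(r"Page header\s*?Page footer", "", text)

print(no_flag)
print(repr(with_flag.group(0)))
print(repr(dot_result))
print(repr(ws_result))
```

Even the non-greedy ".*?" begins at the first place the expression can start, so it is worth testing such an expression on text that contains the start phrase more than once.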

For the record, I think you got confused at the point where I was talking about flags, right? If you have suggestions how to improve the tutorial so that doesn't happen, I'd be glad to hear them.
11-10-2010, 03:29 AM   #77
bucsie
Hi

Could you please explain the regular expression used for detecting headers?

Quote:
(?i)(?<=<hr>)((\s*<a name=\d+></a>((<img.+?>)*<br>\s*)?\d+<br>\s*.*?\s*)|(\s*<a name=\d+></a>((<img.+?>)*<br>\s*)?.*?<br>\s*\d+))(?=<br>)
I am not a newbie with regex-es, but it's difficult to follow, and there are a couple of things I don't understand, like what is this for: (?<= or this one: (?=
11-10-2010, 06:09 AM   #78
kacir
Quote:
Originally Posted by bucsie View Post
I am not a newbie with regex-es, but it's difficult to follow, and there are a couple of things I don't understand, like what is this for: (?<= or this one: (?=
Those funny non-standard RE constructs are described at
http://docs.python.org/library/re.html
Code:

(?i)   # switch ignorecase on
(?<=<hr>)   # lookbehind assertion. It means the following RE can only match if <hr> precedes it. The <hr> string will not be part of the resulting match
(      # beginning of the main RE
   (   # beginning of Group A RE
      \s*<a name=\d+></a>
      (
         (<img.+?>)*   # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here (see the first post for explanation)
         <br>
         \s*
      )?
      \d+
      <br>
      \s*
      .*?   # *? is a non-greedy quantifier. It means match as few characters as possible
      \s*
   )   # end of Group A RE
   |   # Group A OR Group B will be matched
   (   # beginning of Group B RE
      \s*
      <a name=\d+></a>
      (
         (<img.+?>)*   # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here
         <br>
         \s*
      )?
      .*?   # *? is a non-greedy quantifier. It means match as few characters as possible
      <br>
      \s*
      \d+
   )   # end of Group B RE
)      # end of the main RE
(?=<br>)   # lookahead assertion. It means that the preceding RE can only match if it is followed by a <br>. The <br> string will not be part of the resulting match
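
For anyone who wants to try the breakdown, here is a hedged Python sketch that compiles the one-line expression as quoted and a commented version using re.VERBOSE, then runs both on two invented HTML snippets. The sample strings and variable names are assumptions made for this example. One detail differs from the breakdown above: in verbose mode unescaped spaces are ignored, so the space in "<a name=" is written as "\ ", and the ignorecase flag is passed as an argument instead of "(?i)".

```python
import re

# One-line expression exactly as quoted in the thread.
HEADER_ONE_LINE = re.compile(
    r"(?i)(?<=<hr>)((\s*<a name=\d+></a>((<img.+?>)*<br>\s*)?\d+<br>\s*.*?\s*)"
    r"|(\s*<a name=\d+></a>((<img.+?>)*<br>\s*)?.*?<br>\s*\d+))(?=<br>)"
)

# Commented version; re.VERBOSE ignores bare whitespace, hence the escaped space.
HEADER_VERBOSE = re.compile(r"""
    (?<=<hr>)                      # lookbehind: <hr> must precede, but is not consumed
    (
        (                          # Group A: digits first, then text
            \s*<a\ name=\d+></a>
            ((<img.+?>)*<br>\s*)?
            \d+<br>\s*.*?\s*
        )
        |
        (                          # Group B: text first, then digits
            \s*<a\ name=\d+></a>
            ((<img.+?>)*<br>\s*)?
            .*?<br>\s*\d+
        )
    )
    (?=<br>)                       # lookahead: <br> must follow, but is not consumed
""", re.VERBOSE | re.IGNORECASE)

# Invented sample input, one snippet per alternative.
digits_first = "<hr>\n<a name=2></a>12<br>\nChapter title<br>\nBody text"
text_first = "<hr>\n<a name=3></a>Chapter title<br>\n13<br>\nBody text"

for html in (digits_first, text_first):
    one_line = HEADER_ONE_LINE.search(html)
    verbose = HEADER_VERBOSE.search(html)
    print(repr(one_line.group(0)), one_line.group(0) == verbose.group(0))
    # The assertions are not part of the match: <hr> sits before it, <br> after it.
    print(repr(html[:one_line.start()]), repr(html[one_line.end():one_line.end() + 4]))
```

The second print line shows what the lookbehind and lookahead are for: the text before the match ends with <hr> and the text after it starts with <br>, yet neither is included in what would be removed.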


Last edited by kacir; 11-10-2010 at 06:14 AM.
11-10-2010, 08:55 AM   #79
mediax
This thread (or an edited version of it) would make a good sticky

11-10-2010, 09:03 AM   #80
Manichean
The first post is available in the user manual, which is why I haven't thought it necessary to sticky this. However, if someone feels differently, I won't mind.
11-11-2010, 08:37 AM   #81
mediax
RTBM, for God's sake - RTBM !!!