10-03-2010, 06:46 AM | #76 | |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Quote:
Code:
(set1)?(set2)
The first way is to use whitespace: "\s*?" will match a linebreak, since on Windows a linebreak is stored as two whitespace characters (carriage return and linefeed). Put that in the place(s) where you expect a linebreak. Because I used the "*" quantifier, the whole expression will still match even if there isn't any linebreak.
The second way is to use the dot together with the dotall flag: put ".*?" where you expect linebreaks, and append or prepend the flag construct "(?s)" to your whole expression. The same remarks as for the first way apply, plus a caveat: be careful with this, as the dot doesn't only match whitespace, it matches any character. You might accidentally match more than you intended to.
For the record, I think you got confused at the point where I was talking about flags, right? If you have suggestions on how to improve the tutorial so that doesn't happen, I'd be glad to hear them.
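To make the difference between the two ways concrete, here is a small Python sketch. The sample strings and the "Chapter ... The Beginning" pattern are made up for illustration and are not from the tutorial. One assumption worth flagging: the flag is placed at the very start of the expression, because current Python versions (3.11 and later) reject a global flag such as "(?s)" anywhere else.

```python
import re

# Hypothetical samples: a heading split by a Windows linebreak (CR + LF),
# the same heading with no linebreak, and one with extra text in between.
windows_text = "Chapter 1\r\nThe Beginning"
one_line_text = "Chapter 1The Beginning"
noisy_text = "Chapter 1\r\n(draft) The Beginning"

# Way 1: "\s*?" soaks up the linebreak, or nothing at all if there is none.
whitespace_re = re.compile(r"Chapter \d+\s*?The Beginning")

# Way 2: "(?s)" turns on dotall, so "." also matches "\n".
dotall_re = re.compile(r"(?s)Chapter \d+.*?The Beginning")

# Same pattern without the flag, for comparison: "." stops at "\n".
no_flag_re = re.compile(r"Chapter \d+.*?The Beginning")


def matches(pattern, text):
    """Return True if the compiled pattern is found anywhere in text."""
    return pattern.search(text) is not None


if __name__ == "__main__":
    for name, pattern in [("whitespace", whitespace_re),
                          ("dotall", dotall_re),
                          ("no flag", no_flag_re)]:
        print(name, [matches(pattern, t)
                     for t in (windows_text, one_line_text, noisy_text)])
```

The third sample shows the caveat: the whitespace version refuses to skip over "(draft)", while the dotall version happily swallows it.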
|
11-10-2010, 03:29 AM | #77 | |
Connoisseur
Posts: 71
Karma: 674766
Join Date: Sep 2010
Device: Kindle
|
Hi
Could you please explain the regular expression used for detecting headers? Quote:
|
11-10-2010, 06:09 AM | #78 | |
Wizard
Posts: 3,455
Karma: 10484861
Join Date: May 2006
Device: PocketBook 360, before it was Sony Reader, cassiopeia A-20
|
Quote:
http://docs.python.org/library/re.html
Code:
(?i)  # switch ignorecase on
(?<=<hr>)  # lookbehind assertion. The following RE can only match if <hr> precedes it. The <hr> string will not be part of the resulting match
(  # beginning of the main RE
  (  # beginning of Group A RE
    \s*<a name=\d+></a>
    (
      (<img.+?>)*  # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here (see the first post for explanation)
      <br>
      \s*
    )?
    \d+
    <br>
    \s*
    .*?  # *? is a non-greedy quantifier. It means match as few characters as possible
    \s*
  )  # end of Group A RE
  |  # Group A OR Group B will be matched
  (  # beginning of Group B RE
    \s*
    <a name=\d+></a>
    (
      (<img.+?>)*  # +? is a non-greedy quantifier. It means match as few characters as possible, but at least one character. I personally would have written (<img[^>]+>)* here
      <br>
      \s*
    )?
    .*?  # *? is a non-greedy quantifier. It means match as few characters as possible
    <br>
    \s*
    \d+
  )  # end of Group B RE
)  # end of the main RE
(?=<br>)  # lookahead assertion. The preceding RE can only match if it is followed by a <br>. The <br> string will not be part of the resulting match
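If you want to try the annotated expression yourself, here is a rough Python sketch, not necessarily how calibre runs it. Two things differ from the listing above and are my own assumptions: the flags are passed to re.compile instead of using an inline "(?i)", and the space in "<a name" is escaped, because verbose mode ignores unescaped whitespace in the pattern. The name find_headers and the sample HTML strings are hypothetical.

```python
import re

# The header expression from the listing, compiled in verbose mode so the
# layout and comments can stay. Note the escaped space in "<a\ name".
HEADER_RE = re.compile(r"""
    (?<=<hr>)                  # lookbehind: <hr> must precede the match
    (
      (                        # Group A: number first, then title
        \s*<a\ name=\d+></a>
        ((<img.+?>)*<br>\s*)?
        \d+<br>\s*.*?\s*
      )
      |
      (                        # Group B: title first, then number
        \s*<a\ name=\d+></a>
        ((<img.+?>)*<br>\s*)?
        .*?<br>\s*\d+
      )
    )
    (?=<br>)                   # lookahead: <br> must follow the match
""", re.IGNORECASE | re.VERBOSE)


def find_headers(html):
    """Return the text of every header match found in html."""
    return [m.group(0) for m in HEADER_RE.finditer(html)]


# Hypothetical inputs, one per alternative, plus one without a leading <hr>.
sample_a = "<hr><a name=12></a>7<br>\nThe Journey Begins<br>"
sample_b = "<hr><a name=13></a>The Journey Begins<br>\n7<br>"
sample_no_hr = "<p><a name=12></a>7<br>\nThe Journey Begins<br>"

if __name__ == "__main__":
    for sample in (sample_a, sample_b, sample_no_hr):
        print(find_headers(sample))
```

Neither the leading <hr> nor the trailing <br> shows up in the returned text, which is exactly what the lookbehind and lookahead are for.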
Last edited by kacir; 11-10-2010 at 06:14 AM. |
|
11-10-2010, 08:55 AM | #79 |
Readaholic
Posts: 255
Karma: 1058454
Join Date: Jul 2009
Location: Swindon, UK
Device: Sony PRS-T2 (previously 505 and 650)
|
This thread (or an edited version of it) would make a good sticky
|
11-10-2010, 09:03 AM | #80 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
The first post is available in the user manual, which is why I haven't thought it necessary to sticky this. However, if someone feels differently, I won't mind.
|
11-11-2010, 08:37 AM | #81 |
Readaholic
Posts: 255
Karma: 1058454
Join Date: Jul 2009
Location: Swindon, UK
Device: Sony PRS-T2 (previously 505 and 650)
|
RTBM, for God's sake - RTBM !!!
|