02-05-2012, 01:56 PM | #1 |
Sigil developer
Posts: 1,274
Karma: 1101600
Join Date: Jan 2011
Location: UK
Device: Kindle PW, K4 NT, K3, Kobo Touch
|
Regex examples
I'd like to collect regular expressions (PCRE format, as introduced in Sigil 0.5.0) for common or difficult issues, and maybe add them to the FAQ. Partly this is so I have a list to refer to when needed, but also to gather what has probably already been mentioned in this forum, and to find out whether there is a way to do a particular replacement.
For instance, is there a regex to do other types of replacement but only inside body tags? Is there one that matches only the actual text, i.e. words that are not part of a tag name or attribute? Or one that matches only words that are part of a tag name or attribute? If you have suggestions for these cases, or any other useful regular expressions, please post them. |
02-05-2012, 08:15 PM | #2 |
Connoisseur
Posts: 54
Karma: 37363
Join Date: Aug 2011
Location: Istanbul
Device: EBW1150, Nook STR
|
Matches regex inside body element and inside character data only.
(The first negative lookahead (the character-data requirement) works, although the specification does not require the greater-than sign in character data to be escaped. You have to save the epub at least once in Sigil and then reload it so that all greater-than signs are escaped to &gt;, or else you might miss some matches.) (The second negative lookahead will not work if your document has more than one body element. Sigil allows this, but the W3C validator reports an error for such documents. I do not know the strict specification for multiple body elements.) Code:
(?s)regex(?![^<>]*>)(?!.*<body[^>]*>)
(If your document uses single quotes (apostrophes) somewhere as attribute value delimiters instead of double quotes, again save and reload to change them all to double quotes, so that this regex works reliably. Saving and reloading also escapes all quotes inside attribute values to &quot;, so that your elements stay well-formed. Reloading also escapes all greater-than signs; otherwise you risk matching something inside character data.) Code:
regex(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)
Edit 2: Added clarification in bold. Edit 3: Slight simplification in the second code. Last edited by Timur; 02-05-2012 at 08:35 PM. |
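For readers who want to try the first pattern outside Sigil, here is a minimal sketch in Python. It assumes Python's `re` module behaves like PCRE for these particular constructs (negative lookahead and the `(?s)` flag), and it uses the word `class` as a stand-in for the `regex` placeholder. The sample document is made up for illustration.

```python
import re

# "class" stands in for the "regex" placeholder in the pattern above.
pattern = re.compile(r'(?s)class(?![^<>]*>)(?!.*<body[^>]*>)')

# Made-up sample: "class" appears in the head, as an attribute name,
# as an attribute value, and once in body text.
html = (
    '<html><head><title>class</title></head>'
    '<body><p class="class">A class act</p></body></html>'
)

# Collect a little context around each match to show where it landed.
hits = [html[m.start() - 2:m.end() + 4] for m in pattern.finditer(html)]
print(hits)
```

Only the occurrence in the body text is reported. The one in the title is rejected because a body start tag still follows it, and the two inside the p tag are rejected because a closing > follows them without an intervening <.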
02-05-2012, 08:32 PM | #3 |
Guru
Posts: 973
Karma: 2458402
Join Date: Aug 2010
Location: St. Louis
Device: Kindle Keyboard, Nook HD+
|
This one uses the old format of Sigil regex, but I find it very useful. I use it on a document that started out as plain text, so it is not broken up into chapters but the chapters are labeled. The regex lets me find the labels, highlight them, and add the break marks.
It works for books with Chapter I, Chapter II, and so on (Roman numerals) or with Chapter 1, Chapter 2. (And of course, use it in Code View.)
Search for: CHAPTER [0-9XVI]+
Replace with: <hr class="sigilChapterBreak" /><h3>\0</h3>
On occasion it will find a phrase like "chapter in" in the text, but that's pretty rare (just check the TOC before having it split). Last edited by JeremyR; 02-05-2012 at 08:36 PM. |
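The same search and replace can be tried in Python as a rough sketch. Two assumptions: the sample text is invented, and the whole-match reference is written `\g<0>` because that is how Python's `re.sub` spells it (the `\0` form above is what the post gives for Sigil).

```python
import re

# Invented sample: plain text with labeled but unsplit chapters.
text = "CHAPTER I\nIt was a dark night.\nCHAPTER 12\nThe storm passed."

# Wrap every chapter label in a heading and put a Sigil chapter break before it.
marked = re.sub(
    r'CHAPTER [0-9XVI]+',
    r'<hr class="sigilChapterBreak" /><h3>\g<0></h3>',
    text,
)
print(marked)
```

Note that the character class only covers digits and the numerals X, V and I, so a label such as CHAPTER XL would be matched only up to the X.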
02-06-2012, 07:26 AM | #4 |
♫
Posts: 661
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
|
02-06-2012, 07:29 AM | #5 |
♫
Posts: 661
Karma: 506380
Join Date: Aug 2010
Location: Germany
Device: Kobo Aura / PB Lux 2 / Bookeen Frontlight / Kobo Mini / Nook Color
|
I'm collecting the regular expressions I often use here. I guess at least some of them will be interesting for others too:
http://ws64.com/regex/ (this page is a living document, so it may change at any time, and of course, use at your own risk!) |
02-20-2012, 06:10 PM | #6 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...
I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics". So I'm looking for an expression that will find:
Code:
<span class="italics">This is three words</span>
Code:
<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>
but not:
Code:
<span class="italics">one</span> |
02-20-2012, 06:23 PM | #7 | |
Well trained by Cats
Posts: 30,445
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
Code:
<span class="italics">(\w+){2,}</span> |
|
02-20-2012, 07:48 PM | #8 | |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Code:
<span class="italics">Three weeks later</span></p>
Code:
<span class="italics">Well, dammit, it’s been two days.</span> |
|
02-20-2012, 09:06 PM | #9 |
Evangelist
Posts: 432
Karma: 1720909
Join Date: Mar 2011
Device: Voyage, K3
|
What about:
Code:
<span class="italics">"?\w+,?\.?\s
Last edited by tilia; 02-20-2012 at 09:22 PM. Reason: Typo |
02-20-2012, 09:23 PM | #10 |
Well trained by Cats
Posts: 30,445
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Code:
(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>) |
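A quick way to see what this pattern does and does not catch is to run it over the sample spans posted earlier in the thread. This is only a sketch in Python, on the assumption that its `re` module treats the fixed-width lookbehind, the lookahead and `\w` the same way Sigil's PCRE does for these inputs.

```python
import re

pattern = re.compile(r'(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)')

# Sample spans taken from earlier posts in the thread.
samples = [
    '<span class="italics">This is three words</span>',
    '<span class="italics">Well, dammit, it’s been two days.</span>',
    '<span class="italics">one</span>',
]

# True where the pattern finds a multi-word phrase inside the span.
results = [bool(pattern.search(s)) for s in samples]
for found, sample in zip(results, samples):
    print(found, sample)
```

The plain multi-word span is found and the one-word span is skipped, but the span containing commas, an apostrophe and a full stop is skipped too, because `\w` does not match punctuation.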
02-20-2012, 11:14 PM | #11 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).
Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word. Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway. |
02-21-2012, 12:17 AM | #12 | |
Well trained by Cats
Posts: 30,445
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
The pattern matches 1 to n cases of a word followed by a space, and then a single word with no space after it. I don't know whether [:punct:] will match an em dash or an ellipsis. |
|
02-21-2012, 12:35 AM | #13 |
Grand Sorcerer
Posts: 24,905
Karma: 47303822
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
How about:
Code:
<span class="italics">\w+\s+.*</span> |
02-21-2012, 01:45 AM | #14 |
Connoisseur
Posts: 54
Karma: 37363
Join Date: Aug 2011
Location: Istanbul
Device: EBW1150, Nook STR
|
@davidfor: Add (?U) in front of your regexp for lazy matching.
|
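The difference between greedy and lazy matching can be illustrated with a small Python sketch. One caveat: Python's `re` module has no `(?U)` option, so the lazy version is written with the `.*?` quantifier instead, which should select the same spans here. The sample paragraph is invented.

```python
import re

# Invented sample with three italics spans on one line.
html = (
    '<p><span class="italics">one</span> then '
    '<span class="italics">two words</span> and '
    '<span class="italics">three more words</span></p>'
)

# Greedy: .* runs on to the last </span> on the line.
greedy = re.findall(r'<span class="italics">\w+\s+.*</span>', html)

# Lazy: .*? stops at the first </span> that completes a match.
lazy = re.findall(r'<span class="italics">\w+\s+.*?</span>', html)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern reports a single long match that swallows two spans and the text between them, while the lazy pattern reports each multi-word span separately. Both skip the one-word span.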
02-21-2012, 08:13 AM | #15 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Correct me if I'm wrong, davidfor, but won't that still exclude instances where the very first word contains an apostrophe or any other non-word character? It gets me very close, certainly, but I just don't think what I'm looking for is going to be based on "\w". The potential for non-word characters being present in the words (including the first one) is just too great.
I'm not so much concerned with the greediness of the expression (as I'm not blindly replacing anything with it) as I am with seeing every single non-one-word occurrence, regardless of whether that occurrence contains non-word characters. |
|