Regex examples

meme · 02-05-2012, 01:56 PM

I'd like to see if I can collect Regular Expressions (PCRE format as introduced in Sigil 0.5.0) used for common or difficult issues, and maybe add them to the FAQ, etc. Partly so I can have a list to refer to when needed, but also to collect a lot of what's probably already been mentioned in this forum. And maybe to find out whether there are needed replacements that can't be done with a regex at all.

For instance, is there a regex to do other types of replacement but only inside body tags?

Is there one only for the actual text - words not part of a tag name or attribute? Words that are only part of a tag name or attribute?

If you have any suggestions for the above cases, or any other useful Regex expressions please post them.

Timur · 02-05-2012, 08:15 PM

Matches regex inside body element and inside character data only.
(The first negative lookahead (the character data requirement) works, although the specification does not require the greater-than sign in character data to be escaped. But you have to save the epub at least once in Sigil and then reload it to escape all greater-than signs to &gt;, or else you might miss some matches.)
(The second negative lookahead will not work if your document has more than one body element. Sigil allows this, but the W3C validator reports an error for such documents. I do not know the strict specifications for multiple body elements.)

Code:

(?s)regex(?![^<>]*>)(?!.*<body[^>]*>)
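For anyone who wants to try this pattern outside Sigil, here is a minimal sketch in Python. It assumes Python's `re` module treats this particular pattern the same way Sigil's PCRE does; the sample markup and the search word `old` are made up for illustration.

```python
import re

# Hypothetical sample: "old" appears in the head, in an attribute value,
# and in the body text. Only the last one should be replaced.
html = (
    '<html><head><title>old</title></head>'
    '<body class="old"><p>old text</p></body></html>'
)

# The pattern above with "old" substituted for "regex".
# (?![^<>]*>)        rejects a match that is followed by a tag-closing ">",
#                    i.e. a match inside a tag
# (?!.*<body[^>]*>)  rejects a match that still has a <body ...> start tag
#                    after it, i.e. a match before the body
pattern = re.compile(r'(?s)old(?![^<>]*>)(?!.*<body[^>]*>)')

result = pattern.sub('new', html)
print(result)
```

The title and the class attribute are left alone, and only the word inside the paragraph changes.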

Matches regex only inside attribute values.
(If your document uses single quotes (apostrophes) anywhere as the attribute value delimiter instead of double quotes, again, save and reload to change them all to double quotes, so that this regex works reliably. Saving and reloading also escapes all quotes inside attribute values to &quot;, so that your elements stay well-formed. Reloading also escapes all greater-than signs; otherwise you risk matching something inside character data.)

Code:

regex(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)
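A quick way to check the attribute-only pattern is the Python sketch below. Again this assumes Python's `re` handles the pattern like PCRE, and the sample element and the search word `old` are invented for the test.

```python
import re

# Hypothetical sample: "old" appears once in an attribute value and once in text.
html = '<p class="old" title="x">old text</p>'

# The pattern above with "old" substituted for "regex".
# (?=[^<]*>)  requires a ">" before the next "<", so the match is inside a tag
# (?!(?:[^<"]*"[^<"]*")+\s*/?>)  rejects the match when the quotes between it
#             and the end of the tag pair up, which would put it outside a value
pattern = re.compile(r'old(?=[^<]*>)(?!(?:[^<"]*"[^<"]*")+\s*/?>)')

result = pattern.sub('new', html)
print(result)
```

Only the class value changes; the word in the paragraph text stays as it is.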

Edit: Typo.
Edit 2: Added clarification in bold.
Edit 3: Slight simplification in the second code.

JeremyR · 02-05-2012, 08:32 PM

This one uses the old format of Sigil Regex, but I find it very useful. Basically I use it on a document that started out as plain text, so it isn't broken up into chapters but the chapters are labeled. It finds the labels, highlights them and adds the break marks.

For books with Chapter I, Chapter II, and so on (Roman Numerals) or with Chapter 1, Chapter 2. (And of course, use it in code view)

Search for

CHAPTER [0-9XVI]+

And replace with

<hr class="sigilChapterBreak" /><h3>\0</h3>

On occasion it will find a phrase like "chapter in" in the text, but that's pretty rare (and just check the TOC before having it split)
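The same search and replace can be sketched in Python to see the false positive in action. Two things here are assumptions, not part of the post: case-insensitive matching (which the "chapter in" remark suggests), and the trailing `\b` in the second pattern, which is a hypothetical refinement. Note that Python writes the whole-match backreference as `\g<0>` rather than `\0`.

```python
import re

# Hypothetical sample text with two real headings and one ordinary phrase.
text = "CHAPTER I\nThe first chapter in the book.\nCHAPTER 12\nMore text."

# In Python, \g<0> stands for the whole match (a bare \0 would be a NUL character).
replacement = r'<hr class="sigilChapterBreak" /><h3>\g<0></h3>'

# Pattern as posted, matched case-insensitively (assumption).
loose = re.sub(r'CHAPTER [0-9XVI]+', replacement, text, flags=re.IGNORECASE)

# Hypothetical refinement: \b demands that the numeral ends at a word boundary,
# so "chapter in" no longer matches on its "i".
strict = re.sub(r'CHAPTER [0-9XVI]+\b', replacement, text, flags=re.IGNORECASE)

print(loose)
print(strict)
```

Checking the TOC before splitting is still a good idea, since the refinement only removes this one kind of false positive.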

WS64 · 02-06-2012, 07:26 AM

Quote:

Originally Posted by meme

Is there one only for the actual text - words not part of a tag name or attribute?

F: (>[^<]*)old
R: \1new
(If the text contains > or < this will go wrong, but Sigil cleans the code up so it should work.)
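One thing to watch with this pair: `[^<]*` is greedy, so a single pass only replaces the last "old" in each run of text. The Python sketch below shows that and repeats the replacement until nothing changes. The helper name `replace_in_text` and the sample markup are made up for illustration.

```python
import re

def replace_in_text(html: str, old: str, new: str) -> str:
    """Repeat the (>[^<]*)old -> \\1new replacement until no match is left."""
    # Guard against the obvious case where the loop would never finish.
    if old in new:
        raise ValueError("replacement contains the search text")
    pattern = re.compile(r'(>[^<]*)' + re.escape(old))
    count = 1
    while count:
        # A function replacement keeps any backslashes in `new` literal.
        html, count = pattern.subn(lambda m: m.group(1) + new, html)
    return html

sample = '<p class="old">old and old</p>'

# One pass, exactly as posted: only the last "old" in the text is replaced.
one_pass = re.sub(r'(>[^<]*)old', r'\1new', sample)

# Repeated passes catch the remaining occurrence; the attribute is untouched.
all_passes = replace_in_text(sample, 'old', 'new')

print(one_pass)
print(all_passes)
```

In an editor the equivalent is simply running the replace again until it reports no more matches.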

WS64 · 02-06-2012, 07:29 AM

I'm collecting the regex expressions I often use here, I guess at least some of them are interesting for others too:
http://ws64.com/regex/ (this page lives, so there might be changes anytime, and of course, use at your own risk!)

DiapDealer · 02-20-2012, 06:10 PM

I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:

Code:

<span class="italics">This is three words</span>

or:

Code:

<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>

but not:

Code:

<span class="italics">one</span>

I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.

theducks · 02-20-2012, 06:23 PM

Quote:

Originally Posted by DiapDealer

I can usually bang my head against something long enough to figure it out, but I'm giving up and looking for help...

I'm looking for an expression that will locate span tags (of a specific class) that enclose more than one word. Let's just say the span class is "italics".

So I'm looking for an expression that will find:

Code:

<span class="italics">This is three words</span>

or:

Code:

<span class="italics">This is a great big bunch of words, maybe even a whole dang paragraph</span>

but not:

Code:

<span class="italics">one</span>

I know I'm probably missing something obvious and am going to feel like a complete idiot when I see the answer. I don't even really care what, exactly, gets highlighted, as I won't be using it to replace anything... only to manually eyeball the occurrence.

Keyword Quantifier

Code:

<span class="italics">(\w+){2,}</span>

2 or more

DiapDealer · 02-20-2012, 07:48 PM

Quote:

Originally Posted by theducks

Keyword Quantifier

Code:

<span class="italics">(\w+){2,}</span>

2 or more

That seems to be finding all occurrences of <span class="italics"></span> that enclose 2 or more word characters. And it's still returning one-word instances, while skipping things like:

Code:

<span class="italics">Three weeks later</span></p>

And definitely skipping multiple word instances that contain punctuation and/or quotes:

Code:

<span class="italics">Well, dammit, it’s been two days.</span>

What else ya got?
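The behaviour described here can be reproduced with a short Python sketch (assuming Python's `re` agrees with PCRE on this pattern, and using made-up sample spans). The quantifier `{2,}` repeats the group `(\w+)`, and since `\w+` can match a single character, the whole thing means "two or more word characters", not "two or more words".

```python
import re

pattern = re.compile(r'<span class="italics">(\w+){2,}</span>')

def found(markup: str) -> bool:
    """True if the pattern matches anywhere in the markup."""
    return pattern.search(markup) is not None

# "one" matches: the group can split it as "on" + "e", which is two repetitions.
one_word = found('<span class="italics">one</span>')

# Spaces are not word characters, so a real multi-word span is skipped.
three_words = found('<span class="italics">Three weeks later</span>')

# Only a single word character fails the "2 or more" test.
single_letter = found('<span class="italics">a</span>')

print(one_word, three_words, single_letter)
```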

tilia · 02-20-2012, 09:06 PM

What about:

Code:

<span class="italics">"?\w+,?\.?\s

theducks · 02-20-2012, 09:23 PM

Code:

(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)

DiapDealer · 02-20-2012, 11:14 PM

Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway.

theducks · 02-21-2012, 12:17 AM

Quote:

Originally Posted by DiapDealer

Sorry theducks. That's some nice lookaheads/lookbehinds you have going on there, but I need it to include multi-word phrases that may have punctuation (“”‘’.,?!:;-—…).

Tilia's takes me to a lot more of what I'm looking for, but seems to skip multi-word phrases that have an apostrophe in the very first word.

Basically, I've accidentally blown up a lot of the italic spans in a document of mine. I'd like to be able to reliably find every instance of <span class="italics">(.*?)</span> EXCEPT instances where there's only one word in the span. That would narrow what I need to check from 700 instances down to... well... I'm not exactly sure of the number (obviously), but it would be a heck of a lot less than 700, anyway. ;)

The trick I found:
match 1 to n cases of a word followed by a space, AND then a single word with no space. I don't know if [:punct:] will find mdash and ellipsis.
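Here is a minimal Python sketch of that trick, using the lookbehind/lookahead version posted earlier. It assumes Python's `re` accepts the pattern as PCRE does (the lookbehind is fixed-width, so it should), and the sample spans are made up.

```python
import re

# One or more "word + space" groups, then a final word with no space,
# all sitting directly between the opening and closing span tags.
pattern = re.compile(r'(?<=<span class="italics">)(\w+ ){1,}(\w+)(?=</span>)')

def matched_text(markup: str):
    """Return the matched text, or None when the pattern finds nothing."""
    m = pattern.search(markup)
    return m.group(0) if m else None

# The lookarounds keep the tags out of the match, so only the words are selected.
plain = matched_text('<span class="italics">This is three words</span>')

# A single word has no "word + space" part, so it is skipped as intended.
single = matched_text('<span class="italics">one</span>')

# Punctuation is neither \w nor a space, so this multi-word span is missed.
punctuated = matched_text('<span class="italics">Well, dammit, it’s been two days.</span>')

print(plain, single, punctuated)
```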

davidfor · 02-21-2012, 12:35 AM

How about:

Code:

<span class="italics">\w+\s+.*</span>

That seems to work in my tests. There is an issue with greediness as I happened to have a paragraph with two multiword italic sections in my test book. The search worked but it selected the two italic sections and everything between them. But it didn't find any of the single word italics.

Timur · 02-21-2012, 01:45 AM

@davidfor: Add (?U) in front of your regexp for lazy matching.
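In PCRE, (?U) swaps the default greediness of the quantifiers. Python's `re` module has no such flag, so the sketch below spells the lazy quantifier out as `.*?` to show the same effect. The sample paragraph is made up.

```python
import re

# Hypothetical paragraph with two multi-word italic sections.
paragraph = (
    '<p><span class="italics">first phrase</span> plain '
    '<span class="italics">second phrase</span></p>'
)

# Greedy, as posted: .* runs on to the LAST </span> in the line.
greedy = re.findall(r'<span class="italics">\w+\s+.*</span>', paragraph)

# Lazy: .*? stops at the FIRST </span>, so each span is found separately.
lazy = re.findall(r'<span class="italics">\w+\s+.*?</span>', paragraph)

print(len(greedy), len(lazy))
```

The greedy search returns one long match covering both spans and the text between them, while the lazy one returns the two spans on their own.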

DiapDealer · 02-21-2012, 08:13 AM

Correct me if I'm wrong, davidfor, but won't that still exclude instances where the very first word contains an apostrophe, or any other non-word character? It gets me very close, certainly, but I just don't think what I'm looking for is going to be based on "\w". The potential for too many non-word characters being present in the words (including the first one) is just too great.

I'm not so much concerned with the greediness of the expression (as I'm not blindly replacing anything with it) as I am with seeing every single non-one-word occurrence... regardless if that occurrence contains non-word characters or not.
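One way to get away from "\w" is to test for whitespace between two non-space characters instead of testing for words. The Python sketch below is only an illustration of that idea, not a pattern anyone in this thread posted. It assumes the span contains no nested tags (it relies on `[^<]`), and the sample markup is made up.

```python
import re

# Optional leading whitespace, a run of non-space characters, at least one
# whitespace character, then at least one more non-space character.
# [^<] keeps the match from running past the end of the span.
pattern = re.compile(r'<span class="italics">\s*[^<\s]+\s+[^<\s][^<]*</span>')

# Hypothetical paragraph: two one-word spans and two multi-word spans,
# with apostrophes and curly quotes in the first words.
doc = (
    '<p><span class="italics">one</span> '
    '<span class="italics">it’s</span> '
    '<span class="italics">it’s been two days.</span> '
    '<span class="italics">“Well, dammit”</span></p>'
)

matches = pattern.findall(doc)
for m in matches:
    print(m)
```

Only the two multi-word spans are reported; the one-word spans are skipped even when the word contains an apostrophe.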

