My RegEx isn't removing page numbers and a fixed string as I hoped

winterminute · 12-17-2010, 04:13 PM

The person who built the PDF I'm using used a trial version of some XML formatter, which stamps some text on every page. The text is hidden in the PDF, but it shows up when I convert to ePUB. I figured I could just remove it using a RegEx on the Header/Footer, but no luck.

Code:

String:
<a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation)  http://www.antennahouse.com</a><br>

RegEx:
<a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation)  http://www.antennahouse.com</a><br>

I'd also like to remove page numbers and page titles. Here's an example:

Code:

String:
<A name=13></a><IMG src="index-13_1.jpg"><br>Title <br>11 <br>

RegEx:
<A name=[0-9][0-9][0-9]></a><IMG src="index-[0-9][0-9][0-9]_1.jpg"><br>Title <br>[0-9][0-9][0-9] <br>

Did I completely misunderstand how regular expressions work?

kiwidude · 12-17-2010, 04:18 PM

Replace [0-9][0-9][0-9] with \d+

Your current expression says you must have exactly three digits - either use [0-9]+ or \d+ to say you want one or more digits.

Also, you might need to escape that minus sign after index - use \- instead of -

i.e. your final regex is:

Code:

<A name=\d+></a><IMG src="index\-\d+_1.jpg"><br>Title <br>\d+ <br>
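To see the difference between the two patterns, here is a minimal sketch, assuming the tool uses Python-style regular expressions (Python's `re` module). The variable names are made up for illustration; the sample string and patterns are the ones quoted above.

```python
import re

# Sample footer from the first post: note the two-digit numbers.
sample = '<A name=13></a><IMG src="index-13_1.jpg"><br>Title <br>11 <br>'

# Original attempt: each [0-9][0-9][0-9] demands exactly three digits.
three_digits = re.compile(
    r'<A name=[0-9][0-9][0-9]></a><IMG src="index-[0-9][0-9][0-9]_1.jpg">'
    r'<br>Title <br>[0-9][0-9][0-9] <br>'
)

# Suggested fix: \d+ accepts one or more digits.
one_or_more = re.compile(
    r'<A name=\d+></a><IMG src="index\-\d+_1.jpg"><br>Title <br>\d+ <br>'
)

# Same fix without escaping the hyphen, for comparison.
plain_hyphen = re.compile(
    r'<A name=\d+></a><IMG src="index-\d+_1.jpg"><br>Title <br>\d+ <br>'
)

print(three_digits.search(sample) is not None)  # False
print(one_or_more.search(sample) is not None)   # True
print(plain_hyphen.search(sample) is not None)  # True
```

In Python's `re`, a hyphen outside a character class is already literal, so `\-` is harmless but not required there; other regex engines may differ.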

kiwidude · 12-17-2010, 04:22 PM

Forgot to answer your other query. Make sure you escape any regex special characters in your pattern, like the periods and brackets in your example. So anywhere you see a . or a ( or ) put a \ in front of it in your regex so it becomes \. \( \) etc.
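A short sketch of why the escaping matters, again assuming Python-style regular expressions (Python's `re` module). The variable names are hypothetical, and `re.escape` is used here only as a convenient way to escape a whole literal string; escaping by hand works the same way.

```python
import re

# The fixed string from the first post (two spaces before the second URL).
text = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
        '(Evaluation)  http://www.antennahouse.com</a><br>')

# Used as-is, the parentheses form a group, so the pattern looks for
# "Formatter Evaluation" with no literal brackets and fails to match.
unescaped = re.compile(text)

# Escaping by hand: \. for periods, \( and \) for brackets.
by_hand = re.compile(
    r'<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter '
    r'\(Evaluation\)  http://www\.antennahouse\.com</a><br>'
)

# re.escape() escapes every special character in a literal string.
automatic = re.compile(re.escape(text))

print(unescaped.search(text) is not None)  # False
print(by_hand.search(text) is not None)    # True
print(automatic.search(text) is not None)  # True
```

The unescaped periods are not what breaks the match here, since `.` matches any character including a literal period; they just make the pattern looser than intended. The unescaped brackets are the real problem.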

winterminute · 12-17-2010, 04:40 PM

Wow - That was a quick reply. Thank you. I made the changes, but still no dice. Turns out that the IMG tag only exists in the first few pages and I also realized that I can combine both the fixed string and the title/page.

So, here's what I'm working with:

Code:

<a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation)  http://www.antennahouse.com</a><br><hr><A name=14></a>Title <br>12 <br>

and here's my modified RegEx:

Code:

<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter \(Evaluation\)  http://www\.antennahouse\.com</a><br><hr><A name=\d+></a>Title <br>\d+ <br>

However, the Test button doesn't appear to do anything, which I assume means my RegEx is wrong. Am I missing some special chars? Is there a list of what needs to be escaped?

kiwidude · 12-17-2010, 06:35 PM

Expression looks fine to me. When you hit the test button, what it should do is highlight in yellow all the places it found a match. So if you scroll the window to where you know it should find a match then you should see it turn yellow instantly when you click Test.

If you mean it does nothing in the epub output, make sure you remembered to actually tick the "Remove headers" checkbox itself, not just set the regular expression. I have made that mistake a few times...

ldolse · 12-17-2010, 08:54 PM

You should try replacing all empty space with \s+, \s*, or \s. I find that most times it won't work if you actually try to leave the empty space in.
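As a rough illustration of the whitespace advice, assuming Python-style regular expressions (Python's `re` module): `\s` matches spaces, tabs, and line breaks such as CRLF. The two sample strings are hypothetical variants of the footer from the thread, and the position of the line breaks is an assumption.

```python
import re

# Hypothetical variants of the same footer; where the CRLFs fall is assumed.
one_line = '<hr><A name=14></a>Title <br>12 <br>'
with_crlf = '<hr>\r\n<A name=14></a>Title <br>12 <br>\r\n'

# Pattern with literal spaces, as typed in the thread.
literal_spaces = re.compile(r'<hr><A name=\d+></a>Title <br>\d+ <br>')

# Same pattern with the empty space replaced by \s* or \s+.
flexible = re.compile(r'<hr>\s*<A\s+name=\d+></a>Title\s*<br>\d+\s*<br>')

print(literal_spaces.search(one_line) is not None)   # True
print(literal_spaces.search(with_crlf) is not None)  # False
print(flexible.search(one_line) is not None)         # True
print(flexible.search(with_crlf) is not None)        # True
```

`\s*` also allows no whitespace at all, while `\s+` requires at least one whitespace character, so `\s+` is the safer choice wherever a gap must exist.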

It's also a good idea to build the regex in pieces so you can use the test function at each stage. If something goes wrong it's easier to figure it out that way.
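Building the expression in pieces can be sketched like this, assuming Python-style regular expressions (Python's `re` module). The helper `first_failing_piece` and the sample `page` text are hypothetical; in particular, the CRLF after `<hr>` is an assumption made for the example.

```python
import re

# Hypothetical page text: the combined footer from the thread, with an
# assumed CRLF after the <hr>.
page = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
        '(Evaluation)  http://www.antennahouse.com</a><br><hr>\r\n'
        '<A name=14></a>Title <br>12 <br>')

# The full expression, split into pieces that are tested cumulatively.
pieces = [
    r'<a href="http://www\.antennahouse\.com">',
    r'Antenna House XSL Formatter \(Evaluation\)',
    r'  http://www\.antennahouse\.com</a><br><hr>',
    r'<A name=\d+></a>',
    r'Title <br>\d+ <br>',
]


def first_failing_piece(pieces, text):
    """Return the index of the first piece that breaks the match, or None."""
    for count in range(1, len(pieces) + 1):
        # Join the first `count` pieces and test them as one pattern.
        if re.search(''.join(pieces[:count]), text) is None:
            return count - 1
    return None


print(first_failing_piece(pieces, page))  # 3
```

Here the match breaks when the fourth piece (index 3) is added, which points at whatever sits between `<hr>` and `<A name=...>` in the text: in this made-up sample, the line break.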

winterminute · 12-19-2010, 10:55 PM

Alright, I'm an idiot. There were some CRLFs in the text that I was stripping out while editing, so my RegEx didn't account for them. ldolse's suggestion to get it working in smaller pieces is what led me to find that the line breaks were where the mistake was.

Thanks for all the help.



