12-17-2010, 05:13 PM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2010
Device: NookColor, maybe
|
My RegEx isn't doing what I hoped to remove page numbers and a fixed string
The person who built the PDF I'm using used a trial version of some XML formatter that spits out some text on every page. The text is hidden in the PDF, but it shows up when I convert to ePUB. I figured I could just remove it using a RegEx on the Header/Footer, but no luck.
Code:
String: <a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation) http://www.antennahouse.com</a><br>
RegEx: <a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation) http://www.antennahouse.com</a><br>
Code:
String: <A name=13></a><IMG src="index-13_1.jpg"><br>Title <br>11 <br>
RegEx: <A name=[0-9][0-9][0-9]></a><IMG src="index-[0-9][0-9][0-9]_1.jpg"><br>Title <br>[0-9][0-9][0-9] <br>
Did I completely misunderstand how regular expressions work?
12-17-2010, 05:18 PM | #2 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Replace [0-9][0-9][0-9] with \d+
Your current expression says you must have exactly three digits. Use either [0-9]+ or \d+ to say you want one or more digits. Also, you can escape the hyphen after index, i.e. use \- instead of -. (A hyphen is only special inside a [...] character class, so the escape is harmless here but not strictly required.) Your final regex is:
Code:
<A name=\d+></a><IMG src="index\-\d+_1.jpg"><br>Title <br>\d+ <br>
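To see the difference outside calibre, here is a minimal sketch using Python's re module. It assumes the Python regex flavor and uses the sample line from the first post; the variable names are only for illustration.

```python
import re

# Sample header line copied from the first post.
sample = '<A name=13></a><IMG src="index-13_1.jpg"><br>Title <br>11 <br>'

# Original attempt: each [0-9][0-9][0-9] demands exactly three digits,
# but the sample only has two-digit numbers.
three_digits = re.compile(
    r'<A name=[0-9][0-9][0-9]></a><IMG src="index-[0-9][0-9][0-9]_1.jpg">'
    r'<br>Title <br>[0-9][0-9][0-9] <br>'
)

# Suggested fix: \d+ accepts one or more digits.
one_or_more = re.compile(
    r'<A name=\d+></a><IMG src="index\-\d+_1.jpg"><br>Title <br>\d+ <br>'
)

print(three_digits.search(sample) is None)     # the three-digit version finds nothing
print(one_or_more.search(sample) is not None)  # the \d+ version matches
```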
12-17-2010, 05:22 PM | #3 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Forgot to answer your other query. Make sure you escape any regex special characters in your regex, such as the periods and brackets in your example. So anywhere you see a . or a ( or ) put a \ in front of it so it becomes \. \( \) etc.
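If you would rather not escape by hand, Python's re.escape() can generate the escaped form of a fixed string. This is only a sketch, assuming the Python regex flavor; whether you paste the result into calibre or type the backslashes yourself is up to you.

```python
import re

# The fixed string from the first post.
fixed = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
         '(Evaluation) http://www.antennahouse.com</a><br>')

# Used unescaped, "(Evaluation)" is read as a capture group matching the
# text "Evaluation" without the literal parentheses, so the search fails.
unescaped_hit = re.search(fixed, fixed)

# re.escape() puts a backslash in front of every regex special character.
escaped_pattern = re.escape(fixed)
escaped_hit = re.search(escaped_pattern, fixed)

print(unescaped_hit is None)
print(escaped_hit is not None)
print(escaped_pattern)
```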
|
12-17-2010, 05:40 PM | #4 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2010
Device: NookColor, maybe
|
Wow - That was a quick reply. Thank you. I made the changes, but still no dice. Turns out that the IMG tag only exists in the first few pages and I also realized that I can combine both the fixed string and the title/page.
So, here's what I'm working with:
Code:
String: <a href="http://www.antennahouse.com">Antenna House XSL Formatter (Evaluation) http://www.antennahouse.com</a><br><hr><A name=14></a>Title <br>12 <br>
Code:
RegEx: <a href="http://www\.antennahouse\.com">Antenna House XSL Formatter \(Evaluation\) http://www\.antennahouse\.com</a><br><hr><A name=\d+></a>Title <br>\d+ <br>
12-17-2010, 07:35 PM | #5 |
Calibre Plugins Developer
Posts: 4,688
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Expression looks fine to me. When you hit the test button, what it should do is highlight in yellow all the places it found a match. So if you scroll the window to where you know it should find a match then you should see it turn yellow instantly when you click Test.
If you mean it does nothing in the epub output, make sure you remembered to actually tick the "Remove headers" checkbox itself, not just set the regular expression. I have made that mistake a few times... |
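As a sanity check outside the GUI, the expression from post #4 does match the sample string as posted, which suggests looking at the actual input rather than the pattern. A minimal sketch, assuming the Python regex flavor and that the posted string is a faithful copy of the markup:

```python
import re

# Sample text and pattern as posted in #4.
sample = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
          '(Evaluation) http://www.antennahouse.com</a><br><hr>'
          '<A name=14></a>Title <br>12 <br>')
pattern = re.compile(
    r'<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter '
    r'\(Evaluation\) http://www\.antennahouse\.com</a><br><hr>'
    r'<A name=\d+></a>Title <br>\d+ <br>'
)

match = pattern.search(sample)
print(match is not None)

# Removing the match should leave nothing behind.
remaining = pattern.sub('', sample)
print(repr(remaining))
```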
12-17-2010, 09:54 PM | #6 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
You should try replacing every stretch of whitespace with \s+, \s*, or \s. I find that most times it won't work if you leave a literal space in the expression.
It's also a good idea to build the regex in pieces so you can use the test function at each stage. If something goes wrong it's easier to figure it out that way. |
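Both suggestions can be tried together in a few lines of Python. This is a sketch only: the page text below is made up to show spacing that differs from what you see on screen, and the piece names are hypothetical.

```python
import re

# Hypothetical page text: no space after "Title", two spaces after the number.
page = '<hr><A name=14></a>Title<br>12  <br>'

# Build the pattern in small pieces so each one can be tested on its own.
anchor = r'<A name=\d+></a>'
title = r'Title\s*<br>'
number = r'\d+\s*<br>'

for label, piece in [('anchor', anchor), ('title', title), ('number', number)]:
    print(label, bool(re.search(piece, page)))

# Whitespace-tolerant pattern assembled from the tested pieces.
full = re.compile(r'<hr>' + anchor + title + number)

# Same pattern with literal single spaces, as in post #4.
literal = re.compile(r'<hr><A name=\d+></a>Title <br>\d+ <br>')

print(full.search(page) is not None)
print(literal.search(page) is None)
```

If one piece prints False, that piece is where the expression and the text disagree.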
12-19-2010, 11:55 PM | #7 |
Junior Member
Posts: 3
Karma: 10
Join Date: Dec 2010
Device: NookColor, maybe
|
Alright, I'm an idiot. There were some CRLFs that I was stripping out while editing. ldolse had a good suggestion about getting it working in smaller pieces, and that was what led me to see that the line breaks were where the mistake was.
Thanks for all the help.
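For anyone hitting the same thing, here is a small sketch of how stripped line breaks make an otherwise correct expression fail. The exact positions of the CRLFs in the real file are an assumption; the raw text below just places one after each <br> and <hr>.

```python
import re

# Hypothetical raw markup with CRLF line breaks (positions are assumed).
raw = ('<a href="http://www.antennahouse.com">Antenna House XSL Formatter '
       '(Evaluation) http://www.antennahouse.com</a><br>\r\n'
       '<hr>\r\n<A name=14></a>Title <br>\r\n12 <br>\r\n')

# The pattern from post #4, written as if there were no line breaks.
stripped = re.compile(
    r'<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter '
    r'\(Evaluation\) http://www\.antennahouse\.com</a><br><hr>'
    r'<A name=\d+></a>Title <br>\d+ <br>'
)

# Same pattern with \s* wherever a line break may sit.
tolerant = re.compile(
    r'<a href="http://www\.antennahouse\.com">Antenna House XSL Formatter '
    r'\(Evaluation\) http://www\.antennahouse\.com</a><br>\s*<hr>\s*'
    r'<A name=\d+></a>Title <br>\s*\d+ <br>\s*'
)

print(repr(raw))                     # repr() makes the \r\n sequences visible
print(stripped.search(raw) is None)  # fails because of the hidden CRLFs
cleaned = tolerant.sub('', raw)
print(repr(cleaned))
```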