09-07-2010, 08:51 AM | #1 |
Addict
Posts: 202
Karma: 10802
Join Date: Sep 2010
Device: Kindle Paperwhite, iPhone 5, iPad Air, Nexus 7
|
Regex to remove header from PDF
I have been reading through a few posts about using Regex to remove headers and footers.
I have successfully managed to remove page numbers and static strings. But the particular PDF I am using uses the chapter title as the header, so the header text changes from chapter to chapter. How do I write a regex that matches any one of several phrases/strings? |
09-07-2010, 09:12 AM | #2 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Try this:
<br>\s*(title1|title2|title3|title\s+with\s+spaces)\s*<br>
changing title1, title2 and so on to whatever your chapter titles are.
You can also add <hr>\s* to the beginning or \s*<hr> to the end (depending on whether it's a header or a footer) to tie the match more accurately to the page headers. If you tie it to the <hr> tag you might be able to get away with something like this:
<br>\s*(\w+\s*)+\s*<br>\s*<hr>
Use the test function with some of those examples to see if you can get what you need. http://www.regular-expressions.info/ is the best place to read up on how to use regex.
Edit: here's a sample regex I used for a file which also had chapter title headers:
Code:
((Castello\s|The\s(Phleg|nun|night|prince\sof\smus|garden|secret\spalac)|Epilogu|Prefac|Four\scarnival|Amalf|La\sSiren|Marriage\sto|Montevergin|Spaccanapol|A\sstiletto|Gesualdo\sC)[^<]+<br>\s*)?(\d|[xvi])+<br>\s*(The\sD\s*e\s*v\s*i\s*l\s*[^<]+<br>)?\s*((Bh|27)[^<]+<br>\s*){4,4}\s*<hr>\s*<A name=\d+></a>
Last edited by ldolse; 09-07-2010 at 09:29 AM. |
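If you would rather not type the alternation by hand, here is a rough sketch of the same idea using Python's re module. The chapter titles, the sample HTML and the helper name build_header_pattern are all made up for illustration, and this runs outside the converter, so treat it as a way to experiment with the pattern rather than as how the conversion itself works.

```python
import re

# Hypothetical chapter titles; substitute the ones from your own book.
CHAPTER_TITLES = ["Preface", "The Garden", "A Long Road Home"]


def build_header_pattern(titles):
    # Escape each word, then allow any run of whitespace between words,
    # which is the "title\s+with\s+spaces" idea from the post above.
    alternatives = "|".join(
        r"\s+".join(re.escape(word) for word in title.split())
        for title in titles
    )
    # The surrounding <br> tags keep the pattern from eating the same
    # words when they occur inside the book text.
    return re.compile(r"<br>\s*(" + alternatives + r")\s*<br>")


# Made-up sample of converted HTML: one page header, one in-text mention.
sample = (
    "end of a sentence<br>\n"
    "The Garden<br>\n"
    "<hr>\n"
    "next page, where The Garden is mentioned in passing<br>"
)

cleaned = build_header_pattern(CHAPTER_TITLES).sub("", sample)
print(cleaned)
```

Only the standalone header line between the two <br> tags is removed; the mention of the title in the running text is left alone because it is not preceded by a <br>.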
09-07-2010, 09:35 AM | #3 |
Addict
Posts: 202
Karma: 10802
Join Date: Sep 2010
Device: Kindle Paperwhite, iPhone 5, iPad Air, Nexus 7
|
Thanks. I take it | means OR, so I just type out all the chapter headings.
|
09-07-2010, 10:17 AM | #4 |
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
|
Yes, the | means OR. It is a regex operator and part of the regular expression itself, which lets a single expression match any one of a number of strings.
|
09-07-2010, 11:08 AM | #5 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Correct, and as itimpi noted, you need to use it correctly in the context of a regex, which primarily means surrounding all the OR'd items with parentheses for that particular operator.
Make sure to include the <br> tags in your pattern as well (at a minimum) so that you don't delete words from the book text. |
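To see why the parentheses matter, here is a small Python sketch with invented sample text. Without the group, the | splits the whole expression in two, so each half is only anchored by one <br> and can bite into the book text.

```python
import re

# Made-up sample: "Epilogue" appears in a sentence, "Preface" is a page header.
text = "She read the Epilogue<br>\nfirst.<br>\nPreface<br>\n<hr>"

# Without parentheses this reads as: (<br>\s*Preface) OR (Epilogue\s*<br>)
ungrouped = re.compile(r"<br>\s*Preface|Epilogue\s*<br>")

# With parentheses the OR only covers the titles, so both <br> tags are required.
grouped = re.compile(r"<br>\s*(Preface|Epilogue)\s*<br>")

ungrouped_result = ungrouped.sub("", text)
grouped_result = grouped.sub("", text)

print(repr(ungrouped_result))  # the word "Epilogue" has been cut out of the sentence
print(repr(grouped_result))    # only the standalone "Preface" header line is gone
```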
Tags |
footer, header, pdf, regex |
|