Regex to remove header from PDF

neonbible · 09-07-2010, 08:51 AM

I have been reading through a few posts about using Regex to remove headers and footers.

I have successfully managed to remove page numbers and static strings. But the particular PDF I am using uses the chapter title as the header, so the header text changes from chapter to chapter.

How do I write a regex that matches a phrase/string?

ldolse · 09-07-2010, 09:12 AM

Try this:
 <br>\s*(title1|title2|title3|title\s+with\s+spaces)\s*<br> - replacing title1, title2, etc. with whatever your chapter titles are.

You can also add <hr>\s* or \s*<hr> to the beginning or end (depending on whether it's header or footer), to more accurately tie it just to the page headers. If you tie it to the <hr> tag you might be able to get away with something like this:

 <br>\s*(\w+\s*)+\s*<br>\s*<hr>

Use the test function with some of those examples to see if you can get what you need.
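If you want to try a title pattern outside the converter's test function first, here is a minimal Python sketch. The chapter titles and the sample HTML are invented for illustration, and it assumes the title sits on its own line between <br> tags directly before the <hr> page break. Your converted file may be laid out differently, so check the actual HTML before relying on it.

```python
import re

# Hypothetical chapter titles; replace with the running headers of your own book.
TITLES = ["Preface", "The Garden", "A Long Night"]

# Escape each word and allow any run of whitespace between the words of a title.
alternatives = "|".join(
    r"\s+".join(re.escape(word) for word in title.split()) for title in TITLES
)

# Title on its own line, bounded by <br> tags, directly before the <hr> page break.
footer_re = re.compile(r"<br>\s*(?:" + alternatives + r")\s*<br>\s*<hr>")

# Invented sample of converter output (assumption, not taken from a real file).
sample = (
    "She walked into the garden and sat down.<br>\n"
    "The Garden<br>\n"
    "<hr>\n"
    "It was a long night for everyone.<br>\n"
)

# Delete every match, the way a header/footer removal step would.
cleaned = footer_re.sub("", sample)
print(cleaned)
```

The lowercase "the garden" and "a long night" in the body text are left alone, because the pattern is case-sensitive and needs the surrounding tags.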

http://www.regular-expressions.info/ is the best place to read up on how to use regex.

edit - here's a sample regex I used for a file which also had chapter title headers:

Code:

((Castello\s|The\s(Phleg|nun|night|prince\sof\smus|garden|secret\spalac)|Epilogu|Prefac|Four\scarnival|Amalf|La\sSiren|Marriage\sto|Montevergin|Spaccanapol|A\sstiletto|Gesualdo\sC)[^<]+<br>\s*)?(\d|[xvi])+<br>\s*(The\sD\s*e\s*v\s*i\s*l\s*[^<]+<br>)?\s*((Bh|27)[^<]+<br>\s*){4,4}\s*<hr>\s*<A name=\d+></a>

I believe in this case it was a footer. <A name=\d+></a> also shows up at every page break, so including it in the pattern is another way to tie the regex to the header/footer.
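As a rough illustration of tying a pattern to the page-break markup, the sketch below matches a page-number line only when it is followed by <hr> and the <A name=...></a> anchor. The sample HTML is made up, and the pattern is a cut-down piece of the regex above rather than something tested on a real book.

```python
import re

# Page number (digits, or roman numerals built from x, v, i) on its own line,
# followed by the <hr> page break and the per-page anchor tag.
page_footer_re = re.compile(r"(?:\d|[xvi])+<br>\s*<hr>\s*<A name=\d+></a>")

# Invented sample of converter output (assumption).
sample = (
    "He counted 12 steps to the door.<br>\n"
    "12<br>\n"
    "<hr>\n"
    "<A name=13></a>The next page starts here.<br>\n"
)

# The "12" inside the sentence is kept because no <br>/<hr>/anchor follows it.
cleaned = page_footer_re.sub("", sample)
print(cleaned)
```

Note that deleting the match also deletes the anchor tag itself, which is what the longer regex above does too.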

neonbible · 09-07-2010, 09:35 AM

Thanks. I take it | means OR, so I just type out all the chapter headings.

itimpi · 09-07-2010, 10:17 AM

Yes, | means OR. It is a regex operator, part of the regular expression itself, and it lets a single expression match any one of several strings.

ldolse · 09-07-2010, 11:08 AM

Correct, and as itimpi noted, you need to use it correctly in the context of a regex, which primarily means surrounding all the OR'd items with parentheses for that particular operator.

Make sure to include the <br> tags in your pattern as well (at a minimum) so that you don't delete words from the book text.
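A small Python sketch of why both the parentheses and the <br> tags matter. The sample text and the two titles are invented for illustration. It counts how many times each variant matches a snippet that has one real header line plus a sentence that happens to mention the same word.

```python
import re

# Invented sample: one running header line, then body text using the same word.
sample = (
    "<br>\n"
    "Epilogue<br>\n"
    "It is all explained in the Epilogue<br>\n"
    "of this book.<br>\n"
)

patterns = {
    # No tags at all: matches the word anywhere in the text.
    "bare": r"Preface|Epilogue",
    # Tags but no parentheses: parsed as (<br>\s*Preface) OR (Epilogue\s*<br>).
    "ungrouped": r"<br>\s*Preface|Epilogue\s*<br>",
    # Parentheses limit the OR to the titles; both <br> tags are always required.
    "grouped": r"<br>\s*(Preface|Epilogue)\s*<br>",
}

results = {
    name: sum(1 for _ in re.finditer(pattern, sample))
    for name, pattern in patterns.items()
}
print(results)
```

Only the grouped variant matches once, on the header line. The other two also hit the word inside the sentence, which is exactly the text you want to keep.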
