Remove Footer - Page 3

mago55 · 01-18-2010, 09:52 AM

What do you mean by "trash headers"? Then it is still possible to convert PDF to HTML with Mobipocket, then manually delete all the headers and footers, and finally convert it to LRF with Calibre... is that right?

It sounds like a lot of work, but I really can't understand anything about how Calibre removes headers; it's all too much like maths for my poor mind!

Just in case someone can give me a hand with that, here is what I'm trying to get rid of: I have many PDF files with this kind of header: "file:///D|/Documents%20and%20Settings/harry/Desktop/New%20Folder...20The%20Paths%20of%20Darkness%201%2 0-%20The%20Silent%20Blade.txt"
and this kind of footer: "file:///D|/Documents%20and%20Settings/harry/Deskto...20of%20Darkness%201%20-%20The%20Silent%20Blade.txt (1 of 275)3/13/2004 12:03:17 AM".
Those appear in the middle of the text when I convert it to LRF, which is very annoying.
Any ideas how to remove that with Calibre? Otherwise I will have to convert the PDF to HTML and remove those manually on every single page.
Thanks again!
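
For reference, lines like these can be matched with a single regular expression. The following is a minimal Python sketch, not Calibre's own code. The sample text and the file name `Example.txt` are made up (the real paths in the post are truncated), and it assumes each header or footer sits on its own line with no spaces inside the path.

```python
import re

# Hypothetical sample modelled on the header/footer quoted above.
# "Example.txt" is an invented file name, not the one from the post.
sample = (
    "He drew the blade.\n"
    "file:///D|/Documents%20and%20Settings/harry/Desktop/Example.txt "
    "(1 of 275)3/13/2004 12:03:17 AM\n"
    "file:///D|/Documents%20and%20Settings/harry/Desktop/Example.txt\n"
    "The night was quiet.\n"
)

# A whole line that starts with file:/// followed by a run of non-space
# characters, optionally followed by "(N of M)" and a date/time stamp.
FILE_LINE = re.compile(
    r"^file:///\S+"
    r"(?:\s*\(\d+ of \d+\)\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)?"
    r"[ \t]*\n",
    re.MULTILINE,
)

cleaned = FILE_LINE.sub("", sample)
print(cleaned)
```

If the converted text wraps the path or inserts spaces into it, the `\S+` part would need loosening, so it is worth checking the pattern against a real excerpt first.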

DoctorOhh · 01-18-2010, 10:16 AM

Quote:

Originally Posted by mago55

What do you mean by "trash headers"? Then it is still possible to convert PDF to HTML with Mobipocket, then manually delete all the headers and footers, and finally convert it to LRF with Calibre... is that right?

By trash headers I'm referring to the kind you list below. Using mobipocket creator I have been successful when converting a pdf to html in eliminating the below kind of headers.

Quote:

Originally Posted by mago55

"file:///D|/Documents%20and%20Settings/harry/Desktop/New%20Folder...20The%20Paths%20of%20Darkness%201%2 0-%20The%20Silent%20Blade.txt"
and this kind of footer: "file:///D|/Documents%20and%20Settings/harry/Deskto...20of%20Darkness%201%20-%20The%20Silent%20Blade.txt (1 of 275)3/13/2004 12:03:17 AM".

My success in eliminating the above kind of header was why I recalled it as being an out and out success. Mobipocket Creator should eliminate the above header while converting it to html (it did for me last night when I tested it.) Then use wordpad or other software to clean it up.

mago55 · 01-18-2010, 12:47 PM

Thanks, i will try that out and see what happens...

ac4lt · 01-18-2010, 04:30 PM

Quote:

Originally Posted by poodlemama

OK let me ask if this is the correct process...

I have a PDF book... I add it to my Calibre library... I go to convert ebook and update all metadata and then go to Page Setup... make sure input and output are correct... go to Structure Detection... click on "Remove Footer" and click on the wizard tool... type in (\d+\s*) and see the highlighted page numbers and codes... push OK and then click OK to start the conversion... what am I missing?

That sounds like what I've tried as well.

From the testing I've done (which is limited and not very thorough), any time my regex has referenced a "<p>" or "</p>" it fails even though the wizard shows it should succeed.

aerosol_grey · 01-28-2010, 08:37 AM

This is quite similar to something mentioned above but for me the string

\d+(<p>)

has just worked (for all but one instance!). Leaving out the brackets or putting everything in brackets seemed to fail.
This did mess up the position of some of the chapter headings but there are a lot less of those to fix!

concern · 02-05-2010, 03:55 PM

<p> and </p> just don't seem to want to match in the regexp. Even when I use a blanket expression such as this: Text.*$, if the $ envelops a <p> or </p> it doesn't work. I have to explicitly make sure that $ does not match one of these, so I use: Text.*More text. As long as "More text" finishes just before the <p></p> it is all good.

Any ideas when this will be fixed? Removing headers and footers at the moment is more about guesswork and experimentation.
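
A side note on the `Text.*More text` workaround: a minimal Python sketch with an invented sample string shows one thing to watch for. A greedy `.*` runs to the last "More text" it can find, which can swallow body text; the lazy form `.*?` stops at the first. This only illustrates standard regex behaviour in Python's `re` module and makes no claim about what Calibre does internally.

```python
import re

# Invented sample: a header paragraph followed by a body paragraph that
# happens to contain the same closing phrase.
sample = "<p>Text of the header More text</p><p>Body. More text follows.</p>"

# Greedy: matches from "Text" up to the LAST "More text" in the string.
greedy = re.sub(r"Text.*More text", "", sample)

# Lazy: matches from "Text" up to the FIRST "More text".
lazy = re.sub(r"Text.*?More text", "", sample)

print(greedy)  # <p> follows.</p>
print(lazy)    # <p></p><p>Body. More text follows.</p>
```

In both cases the paragraph tags themselves are left alone, which matches the approach of stopping the expression just before the tags.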

aerosol_grey · 02-13-2010, 08:01 AM

I agree totally that it is currently a case of trial and error. My expression above worked for me, but in the wizard it didn't highlight the text I wanted to remove! I have a feeling that if the text contains a </p><p>, then finding a term to match the phrase and then replacing the two paragraph markers with (<p>) will work, provided that it is not itself enclosed in parentheses. I haven't tried this though, as I only convert one book at a time and then read it.

It would seem that the problem comes from the fact that the view presented in the wizard does not correspond to the point where the expression is applied! To save messing around, most of the time I convert the file 'as is' to RTF format and remove the oddities using wildcards in MS Word's 'Find & Replace'. Then I tidy the format by making titles into Heading 1 format and chapters into Heading 2 format, and then use Calibre again to convert the RTF to the required format. It's more roundabout but works out quicker.

kristarella · 03-09-2010, 08:35 AM

I also can't remove headers and footers. They get highlighted in the regex wizard just fine, but they are never removed. It may be that I'm trying to use paragraph tags, which folks are saying don't work properly, but there is no point trying to remove some books' footers without </p> because if it's just the page number in the footer or book title in the header, the </p> is the only thing differentiating it from any other number/title in the book.

For now I am using Sigil to edit the epub output from Calibre and the regex find/replace works quite well in Sigil. However, Sigil is extremely slow at processing and making changes, so it'd be much better if it could be fixed in Calibre. =)

matthias · 03-09-2010, 09:17 AM

I know I'm repeating myself, but last week I converted several PDF files (with Calibre 0.6.43) with page numbers in all the different ways there are. For most of them, the following regex worked:
(I know it won't be highlighted in the wizard, but when you convert it, you will notice that it works anyway)

Code:

(<p>\s*\d+\s*<p>)

this regex will remove every pagenumber that stands by itself in a row.

If there is something like "Page 3" (having the HTML syntax of "<p>Page 3 </p>"), you have to adjust your regex, too:

Code:

(<p>Page\s*\d+\s*<p>)

If it's becoming more difficult, you can use the wizard to verify your results, but in general you have to replace the closing tag </p> with a "normal" <p> to get it to work in the conversion.
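
What the first pattern does and does not match can be checked outside Calibre. This is a minimal Python sketch with invented sample strings. It only shows how the regex behaves in Python's `re` module and assumes nothing about the text Calibre actually feeds to it.

```python
import re

# The page-number pattern from this post, copied as written.
BARE_NUMBER = re.compile(r"(<p>\s*\d+\s*<p>)")

# Invented samples: one where every paragraph marker is an opening tag,
# one where paragraphs are closed with </p>.
html_open = "<p>Some text<p> 17 <p>More text<p>1984<p>End"
html_closed = "<p>Some text</p><p> 17 </p><p>More text</p>"

open_hits = BARE_NUMBER.findall(html_open)
closed_hits = BARE_NUMBER.findall(html_closed)

print(open_hits)    # ['<p> 17 <p>', '<p>1984<p>']
print(closed_hits)  # []
```

Two things follow from this: the pattern finds nothing when the number is followed by `</p>`, and it treats any number alone in its paragraph (such as a year) as a page number.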

kristarella · 03-12-2010, 01:08 AM

Quote:

Originally Posted by matthias

I know I'm repeating myself, but last week I converted several PDF files (with Calibre 0.6.43) with page numbers in all the different ways there are. For most of them, the following regex worked:
(I know it won't be highlighted in the wizard, but when you convert it, you will notice that it works anyway)

Code:

(<p>\s*\d+\s*<p>)

this regex will remove every pagenumber that stands by itself in a row.

If there is something like "Page 3" (having the HTML syntax of "<p>Page 3 </p>"), you have to adjust your regex, too:

Code:

(<p>Page\s*\d+\s*<p>)

If it's becoming more difficult, you can use the wizard to verify your results, but in general you have to replace the closing tag </p> with a "normal" <p> to get it to work in the conversion.

Thanks matthias, I will bookmark this and give it a go next time. I usually try to do more than page numbers, I do headers with book and chapter names (usually have some formatting that differentiates them from other text) and sometimes document name/date header/footers. It's hard to know if you're nuking them without the highlighting. Will try though.

matthias · 03-12-2010, 06:18 AM

Hello kristarella,

To make the wizard highlight it AND remove it, you'll have to modify your regex a little bit, but it works. Simply use the following regex:

Code:

(<p>Page\s*\d+\s*</*p>)

OR

Code:

(<p>Page\s*\d+\s*<(/)*p>)

(notice that both will work in the same way)

The only thing changed since my last post is the "/*", which allows the wizard to highlight the found term even if the "/" is in the tag, but it also removes the found term if the "/" is not there, so it'll work for the preview and for the conversion.

I hope this helps to make your conversions as comfortable as possible.
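
The effect of `/*` can be verified with a short Python sketch (sample strings are invented for illustration). Both patterns accept the tag with or without the slash, because `*` means "zero or more" of the preceding item.

```python
import re

# The two patterns from this post, copied as written.
PLAIN = re.compile(r"(<p>Page\s*\d+\s*</*p>)")
GROUPED = re.compile(r"(<p>Page\s*\d+\s*<(/)*p>)")

samples = [
    "<p>Page 3 </p>",     # closing tag
    "<p>Page 3 <p>",      # opening tag in the closing position
    "<p>Page 12</p>",     # no space before the tag
    "<p>Page three</p>",  # no digits, should not match
]

# For each sample, record whether each pattern matches the whole string.
results = [
    (bool(PLAIN.fullmatch(s)), bool(GROUPED.fullmatch(s))) for s in samples
]
for s, r in zip(samples, results):
    print(s, r)
```

Since `/*` also accepts several slashes in a row, `/?` (zero or one) would be a slightly tighter way to write the same idea.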

jernej · 05-08-2010, 04:35 PM

Hello,
matthias, I tried your suggestion to remove the header with the following regex:

(</*p>\s*</*p>\s*\d+\s*</*p>\s.*</*p>\s*</*p>\s*</*p>)

The header looks like this:

</p><p> 1 </p><p> Chapter 1: Hello </p><p> </p><p> </p><p>

The wizard successfully highlights all the page headers, but the conversion still doesn't work.
Any ideas?
Thanks!

flinx1 · 05-20-2010, 10:50 AM

In the project Gutenberg books (html version), there's some header and footer text in the file that are formatted using the <pre>texthere<pre>. I'd like to remove them, but can't figure out the regex formatting. Help?
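
One possible starting point, as a minimal Python sketch: it assumes the blocks are really closed with `</pre>` (the post shows `<pre>` twice) and uses an invented sample document. The non-greedy `.*?` together with `re.DOTALL` lets the match span several lines but stop at the first closing tag.

```python
import re

# Assumption: each block opens with <pre> and closes with </pre>.
# DOTALL lets "." cross line breaks; ".*?" stops at the first </pre>.
PRE_BLOCK = re.compile(r"<pre>.*?</pre>\s*", re.DOTALL | re.IGNORECASE)

# Invented sample, not taken from a real Project Gutenberg file.
html = (
    "<pre>\nHeader boilerplate\nline two\n</pre>\n"
    "<h1>Title</h1>\n"
    "<p>Story text.</p>\n"
    "<pre>Footer boilerplate</pre>\n"
)

cleaned = PRE_BLOCK.sub("", html)
print(cleaned)
```

Note that this removes every `<pre>` block, so a book that also uses `<pre>` for verse or tables would need a narrower pattern, for example one that includes a phrase from the boilerplate.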

ehupp · 07-17-2010, 03:54 PM

Working off of earlier regex expressions I'm still having a problem as many others are removing a page number at the bottom of a pdf file when converting. I am converting to mobi for a kindle and have used the following regex to remove it.

My test output was as follows:
Last line of text on page. <br> 3<br>

So I took the preceding expression of (<p>\s*\d+\s*<p>)
and changed it to (<br>\s*\d+\s*<br>)

When tested, the <br>3<br> became highlighted.
When converted it appears to have worked. Will need to scan through text to verify though.

Hope this helps anyone else trying to fix this issue.

Eric.

Lonas · 07-21-2010, 05:48 AM

What we really need is that the 'remove footer' dialogue displays the string in the same version where the regex is applied to. Everything else is only guessing and will never completely solve the problem.

So I beg the programmer of this feature to fix this issue finally (shouldn't be that hard from an outside perspective anyway), it's really annoying in an otherwise great program.

