Old 01-18-2010, 09:52 AM   #31
mago55
Junior Member
mago55 began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Nov 2009
Device: sony ebook PR-505
What do you mean by "trash headers"? So it is still possible to convert the PDF to HTML with Mobipocket, then manually delete all the headers and footers, and finally convert it to LRF with Calibre... is that right?

It sounds like a lot of work, but I really can't understand anything about how Calibre removes headers; it's all too much like maths for my poor mind!

Just in case someone can give me a hand with that, here is what I'm trying to get rid of. I have many PDF files with this kind of header: "file:///D|/Documents%20and%20Settings/harry/Desktop/New%20Folder...20The%20Paths%20of%20Darkness%201%2 0-%20The%20Silent%20Blade.txt"
and this kind of footer: "file:///D|/Documents%20and%20Settings/harry/Deskto...20of%20Darkness%201%20-%20The%20Silent%20Blade.txt (1 of 275)3/13/2004 12:03:17 AM".
These appear in the middle of the text when I convert to LRF, which is very annoying.
Any ideas how to remove them with Calibre? Otherwise I will have to convert the PDF to HTML and remove them manually on every single page.
Thanks again!
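
For illustration only, here is a minimal Python sketch of the kind of pattern that would match header and footer lines like these. It assumes each one sits on a line of its own and starts with file:///. The sample path is invented, and whether Calibre's header/footer fields see the text in this form is an assumption, not something confirmed in this thread.

```python
import re

# Hypothetical sample; the path is invented for illustration.
SAMPLE = (
    "file:///D|/books/example.txt\n"
    "It was a dark night.\n"
    "file:///D|/books/example.txt (1 of 275)3/13/2004 12:03:17 AM\n"
    "The story continues.\n"
)

# Assumption: every header/footer is a whole line beginning with "file:///".
FILE_URL_LINE = re.compile(r"^file:///.*\n?", re.MULTILINE)


def strip_file_url_lines(text: str) -> str:
    """Delete every line that starts with file:///, including its newline."""
    return FILE_URL_LINE.sub("", text)


print(strip_file_url_lines(SAMPLE))
```

Anchoring on the file:/// prefix avoids having to describe the page counter or the timestamp in the footer at all.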
Old 01-18-2010, 10:16 AM   #32
DoctorOhh
US Navy, Retired
DoctorOhh ought to be getting tired of karma fortunes by now.

Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
Quote:
Originally Posted by mago55 View Post
What do you mean by "trash headers"? So it is still possible to convert the PDF to HTML with Mobipocket, then manually delete all the headers and footers, and finally convert it to LRF with Calibre... is that right?
By trash headers I mean the kind you list below. Mobipocket Creator has successfully eliminated that kind of header for me when converting a PDF to HTML.

Quote:
Originally Posted by mago55 View Post
"file:///D|/Documents%20and%20Settings/harry/Desktop/New%20Folder...20The%20Paths%20of%20Darkness%201%2 0-%20The%20Silent%20Blade.txt"
and this kind of footer: "file:///D|/Documents%20and%20Settings/harry/Deskto...20of%20Darkness%201%20-%20The%20Silent%20Blade.txt (1 of 275)3/13/2004 12:03:17 AM".
My success in eliminating that kind of header is why I remembered it as an out-and-out success. Mobipocket Creator should eliminate the above header while converting to HTML (it did for me when I tested it last night). Then use WordPad or other software to clean up the result.
Old 01-18-2010, 12:47 PM   #33
mago55
Junior Member
mago55 began at the beginning.
 
Posts: 6
Karma: 10
Join Date: Nov 2009
Device: sony ebook PR-505
Thanks, i will try that out and see what happens...
Old 01-18-2010, 04:30 PM   #34
ac4lt
Connoisseur
ac4lt began at the beginning.
 
Posts: 61
Karma: 36
Join Date: Jan 2010
Location: Reston, Virginia, US
Device: ipad
Quote:
Originally Posted by poodlemama View Post
OK let me ask if this is the correct process...

I have a PDF book... I add it to my Calibre library... I go to Convert E-book and update all metadata, then go to Page Setup... make sure input and output are correct... go to Structure Detection... click on "Remove Footer" and click on the wizard tool... type in (\d+\s*</p><p>) and see the highlighted page numbers and codes... push OK and then click OK to start the conversion... what am I missing?
That sounds like what I've tried as well.

From the testing I've done (which is limited and not very thorough), any time my regex has referenced a "<p>" or "</p>" it fails even though the wizard shows it should succeed.
Old 01-28-2010, 08:37 AM   #35
aerosol_grey
Junior Member
aerosol_grey began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Sony prs-505
This is quite similar to something mentioned above but for me the string

\d+(<p>)

has just worked (for all but one instance!). Leaving out the brackets or putting everything in brackets seemed to fail.
This did mess up the position of some of the chapter headings, but there are a lot fewer of those to fix!
Old 02-05-2010, 03:55 PM   #36
concern
Junior Member
concern began at the beginning.
 
Posts: 5
Karma: 10
Join Date: Jan 2010
Device: Kindle 2
<p> and </p> just don't seem to want to match in the regex. Even with a blanket expression such as Text.*$, if the part matched up to the $ includes a <p> or </p>, it doesn't work. I have to explicitly make sure the match does not include one of these, so I use: Text.*More text. As long as "More text" finishes just before the <p></p>, it is all good.

Any ideas when this will be fixed? Removing headers and footers at the moment is more about guesswork and experimentation.
Old 02-13-2010, 08:01 AM   #37
aerosol_grey
Junior Member
aerosol_grey began at the beginning.
 
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Sony prs-505
I agree totally that it is currently a case of trial and error. My expression above worked for me, but in the wizard it didn't highlight the text I wanted to remove! I have a feeling that if the text contains a </p><p>, then finding a term to match the phrase and replacing the two paragraph markers with (<p>) will work, provided that it is not itself enclosed in parentheses. I haven't tried this though, as I only convert one book at a time and then read it.

It would seem that the problem comes from the fact that the view presented in the wizard does not correspond to the text at the point where the expression is applied! To save messing around, most of the time I convert the file 'as is' to RTF and remove the oddities using wildcards in MS Word's 'Find & Replace'. Then I tidy the formatting by making titles Heading 1 and chapters Heading 2, and use Calibre again to convert the RTF to the required format. It's more roundabout but works out quicker.
Old 03-09-2010, 08:35 AM   #38
kristarella
Junior Member
kristarella began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Mar 2010
Device: iPhone
I also can't remove headers and footers. They get highlighted in the regex wizard just fine, but they are never removed. It may be that I'm trying to use paragraph tags, which folks are saying don't work properly, but there is no point trying to remove some books' footers without </p> because if it's just the page number in the footer or book title in the header, the </p> is the only thing differentiating it from any other number/title in the book.

For now I am using Sigil to edit the epub output from Calibre and the regex find/replace works quite well in Sigil. However, Sigil is extremely slow at processing and making changes, so it'd be much better if it could be fixed in Calibre. =)
Old 03-09-2010, 09:17 AM   #39
matthias
Enthusiast
matthias has a spectacular aura about
 
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
I know I'm repeating myself, but last week I converted (with Calibre 0.6.43) several PDF files with page numbers in all the different ways there are. For most of them, the following regex worked:
(I know it won't be highlighted in the wizard, but when you convert, you will notice that it works anyway)

Code:
(<p>\s*\d+\s*<p>)
This regex will remove every page number that stands by itself on a line.

If there is something like "Page 3" (with the HTML syntax "<p>Page 3 </p>"), you have to adjust your regex, too:

Code:
(<p>Page\s*\d+\s*<p>)
If it gets more difficult, you can use the wizard to verify your results, but in general you have to replace the closing tag with a "normal" (opening) tag to get it to work in the conversion.
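
As a rough illustration of why the wizard and the conversion can disagree, the sketch below runs the first pattern with Python's re module against two invented strings: one with the closing tag the wizard displays, and one with the opening tag the conversion step apparently sees. Both sample strings are assumptions made for the example, not output captured from Calibre.

```python
import re

# The page-number pattern from the post above.
PAGE_NUMBER = re.compile(r"(<p>\s*\d+\s*<p>)")

# Assumed sample strings, not taken from Calibre itself.
as_shown_in_wizard = "end of page </p><p>12 </p><p>start of next page"
as_seen_by_conversion = "end of page <p>12 <p>start of next page"

# No match here: the tag after the number is </p>, not <p>.
print(PAGE_NUMBER.search(as_shown_in_wizard))

# Matches "<p>12 <p>", so the substitution removes it.
print(PAGE_NUMBER.search(as_seen_by_conversion))
print(PAGE_NUMBER.sub("", as_seen_by_conversion))
```

Note that the match swallows the <p> that opens the following paragraph as well, which may be why some layouts shift after the removal.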

Last edited by matthias; 03-09-2010 at 09:24 AM.
Old 03-12-2010, 01:08 AM   #40
kristarella
Junior Member
kristarella began at the beginning.
 
Posts: 3
Karma: 10
Join Date: Mar 2010
Device: iPhone
Quote:
Originally Posted by matthias View Post
I know I'm repeating myself, but last week I converted (with Calibre 0.6.43) several PDF files with page numbers in all the different ways there are. For most of them, the following regex worked:
(I know it won't be highlighted in the wizard, but when you convert, you will notice that it works anyway)

Code:
(<p>\s*\d+\s*<p>)
This regex will remove every page number that stands by itself on a line.

If there is something like "Page 3" (with the HTML syntax "<p>Page 3 </p>"), you have to adjust your regex, too:

Code:
(<p>Page\s*\d+\s*<p>)
If it gets more difficult, you can use the wizard to verify your results, but in general you have to replace the closing tag with a "normal" (opening) tag to get it to work in the conversion.
Thanks matthias, I will bookmark this and give it a go next time. I usually try to do more than page numbers, I do headers with book and chapter names (usually have some formatting that differentiates them from other text) and sometimes document name/date header/footers. It's hard to know if you're nuking them without the highlighting. Will try though.
Old 03-12-2010, 06:18 AM   #41
matthias
Enthusiast
matthias has a spectacular aura about
 
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
Hello kristarella,

To make the wizard highlight it AND remove it, you'll have to modify your regex a little bit, but it works. Simply use the following regex:

Code:
(<p>Page\s*\d+\s*</*p>)
OR

Code:
(<p>Page\s*\d+\s*<(/)*p>)
(Note that both work in the same way.)

The only thing changed since my last post is the "/*", which lets the wizard highlight the found term when the "/" is in the tag, while still removing the term when the "/" is not there. So it works for both the preview and the conversion.

I hope this helps you get your conversions done as comfortably as possible.
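
A quick way to check the claim that both variants accept either tag form is to run them in Python's re module. This is only a sketch with invented sample strings; it says nothing about what Calibre feeds to the pattern internally.

```python
import re

# Both variants from the post above.
VARIANT_A = re.compile(r"(<p>Page\s*\d+\s*</*p>)")
VARIANT_B = re.compile(r"(<p>Page\s*\d+\s*<(/)*p>)")

# Assumed samples: the same footer ending in </p> and in <p>.
with_closing_tag = "<p>Page 3 </p>"
with_opening_tag = "<p>Page 3 <p>"

for pattern in (VARIANT_A, VARIANT_B):
    for sample in (with_closing_tag, with_opening_tag):
        # fullmatch succeeds only if the pattern covers the whole sample.
        print(pattern.pattern, repr(sample), bool(pattern.fullmatch(sample)))
```

Since "/*" means zero or more slashes, it would also accept something like "<//p>". Writing "/?" (zero or one slash) would be a slightly tighter way to say the same thing, though that variant was not tried in this thread.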

Last edited by matthias; 03-12-2010 at 06:20 AM.
Old 05-08-2010, 04:35 PM   #42
jernej
Junior Member
jernej began at the beginning.
 
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: none
Hello,
matthias, I tried your suggestion to remove the header with the following regex:

(</*p><p>\s*</*p><p>\s*\d+\s*</*p><p>\s.*</*p><p>\s*</*p><p>\s*</*p><p>)

The header looks like this:

</p><p>
1 </p><p>
Chapter 1: Hello </p><p>
</p><p>
</p><p>


The wizard successfully highlights all the page headers, but the conversion still doesn't work.
Any ideas?
Thanks!
Old 05-20-2010, 10:50 AM   #43
flinx1
Member
flinx1 began at the beginning.
 
Posts: 11
Karma: 10
Join Date: May 2009
Device: Sony PRS-505
In the Project Gutenberg books (HTML version), there is some header and footer text in the file that is formatted using <pre>texthere</pre>. I'd like to remove it, but can't figure out the regex. Help?
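
One possible starting point, sketched in Python's re module: a non-greedy match from <pre> to the next </pre>, with "." allowed to cross line breaks. This assumes the blocks are closed with </pre> and that every such block is unwanted. The sample markup is invented, and it is untested in Calibre's own header/footer fields.

```python
import re

# Assumption: each unwanted block is wrapped in <pre> ... </pre>.
# re.DOTALL lets "." cross line breaks; ".*?" stops at the first </pre>.
PRE_BLOCK = re.compile(r"<pre>.*?</pre>", re.DOTALL)

# Invented sample, not real Project Gutenberg markup.
SAMPLE = "<pre>\nlicence header\n</pre><p>Chapter 1</p><pre>licence footer</pre>"


def strip_pre_blocks(html: str) -> str:
    """Remove every <pre>...</pre> block, whatever it contains."""
    return PRE_BLOCK.sub("", html)


print(strip_pre_blocks(SAMPLE))
```

In a single regex field, the same behaviour is written in Python syntax as (?s)<pre>.*?</pre>. Be aware that it removes every <pre> block, including any that belong to the book itself.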
Old 07-17-2010, 03:54 PM   #44
ehupp
Junior Member
ehupp began at the beginning.
 
Posts: 1
Karma: 10
Join Date: Jul 2010
Device: kindle
Working off the earlier regexes: like many others, I was still having a problem removing the page number at the bottom of each page when converting a PDF file. I am converting to MOBI for a Kindle and used the following regex to remove it.

My test output was as follows:
Last line of text on page. <br>
3<br>

So I took the earlier expression (<p>\s*\d+\s*<p>)
and changed it to (<br>\s*\d+\s*<br>)

When tested, the <br>3<br> was highlighted.
When converted, it appears to have worked. I will need to scan through the text to verify, though.

Hope this helps anyone else trying to fix this issue.

Eric.
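
The substitution described above can be reproduced in Python's re module as a small sketch. The sample string is the test output quoted in the post, joined into one string; how Calibre applies the pattern internally is not shown here.

```python
import re

# The <br> pattern from the post above.
BR_PAGE_NUMBER = re.compile(r"(<br>\s*\d+\s*<br>)")

# The test output quoted above, as a single string.
SAMPLE = "Last line of text on page. <br>\n3<br>\n"

# Replace every match with nothing, as the footer removal would.
cleaned = BR_PAGE_NUMBER.sub("", SAMPLE)
print(repr(cleaned))
```

Both <br> tags are part of the match, so the line break before the page number disappears along with the number. Any other number that stands alone between two <br> tags would be removed too, which is a good reason to scan the result afterwards.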
Old 07-21-2010, 05:48 AM   #45
Lonas
Junior Member
Lonas began at the beginning.
 
Posts: 5
Karma: 10
Join Date: May 2010
Device: iPad
What we really need is for the 'remove footer' dialogue to display the text in the same form that the regex is actually applied to. Everything else is only guessing and will never completely solve the problem.

So I beg the programmer of this feature to finally fix this issue (it shouldn't be that hard, at least from an outside perspective); it's really annoying in an otherwise great program.