#31 |
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2009
Device: sony ebook PR-505
|
What do you mean by "trash headers"? So it is still possible to convert the PDF to HTML with Mobipocket, manually delete all the headers and footers, and finally convert it to LRF with Calibre. Is that right?
It sounds like a lot of work, but I really can't follow how Calibre removes headers; it's all too much like maths for my poor mind! In case someone can give me a hand, here is what I'm trying to get rid of. I have many PDF files with this kind of header: "file:///D|/Documents%20and%20Settings/harry/Desktop/New%20Folder...20The%20Paths%20of%20Darkness%201%20-%20The%20Silent%20Blade.txt" and this kind of footer: "file:///D|/Documents%20and%20Settings/harry/Deskto...20of%20Darkness%201%20-%20The%20Silent%20Blade.txt (1 of 275)3/13/2004 12:03:17 AM". These appear in the middle of the text when I convert to LRF, which is very annoying. Any ideas how to remove them with Calibre? Otherwise I will have to convert the PDF to HTML and remove them manually on every single page. Thanks again! |
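For anyone who wants to test a pattern for this kind of header and footer outside Calibre first, here is a minimal Python sketch. It assumes the residue is always a percent-encoded file:/// URL ending in .txt, optionally followed by "(N of M)" and a date/time stamp. The sample path is made up, since the real ones above are truncated, and the function name is only illustrative.

```python
import re

# Assumed shape: file:///<no spaces>.txt, optionally followed by
# "(N of M)" and a US-style date and 12-hour time stamp.
HEADER_FOOTER = re.compile(
    r"file:///\S+?\.txt"
    r"(?:\s*\(\d+ of \d+\)\s*\d{1,2}/\d{1,2}/\d{4} \d{1,2}:\d{2}:\d{2} [AP]M)?"
)

def strip_print_residue(text):
    """Remove every header/footer occurrence, leaving the story text."""
    return HEADER_FOOTER.sub("", text)

# Hypothetical sample, not the real file path from the post.
sample = (
    "file:///D|/books/Some%20Book.txt\n"
    "He drew his blade.\n"
    "file:///D|/books/Some%20Book.txt (1 of 275)3/13/2004 12:03:17 AM\n"
    "The night was quiet.\n"
)
cleaned = strip_print_residue(sample)
```

If the pattern behaves on a sample like this, the same expression is a reasonable starting point for the header/footer field, though the thread suggests the markup Calibre actually sees may differ.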
#32 | ||
US Navy, Retired
Posts: 9,865
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Nexus 7
|
Quote:
Quote:
|
||
#33 |
Junior Member
Posts: 6
Karma: 10
Join Date: Nov 2009
Device: sony ebook PR-505
|
Thanks, I will try that out and see what happens...
|
#34 | |
Connoisseur
Posts: 61
Karma: 36
Join Date: Jan 2010
Location: Reston, Virginia, US
Device: ipad
|
Quote:
From the testing I've done (which is limited and not very thorough), any time my regex has referenced a "<p>" or "</p>" it fails even though the wizard shows it should succeed. |
|
#35 |
Junior Member
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Sony prs-505
|
This is quite similar to something mentioned above, but for me the string \d+(<p>) has just worked (for all but one instance!). Leaving out the brackets, or putting everything in brackets, seemed to fail. It did mess up the position of some of the chapter headings, but there are far fewer of those to fix! |
#36 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jan 2010
Device: Kindle 2
|
<p> and </p> just don't seem to want to match in the regexp. Even when I use a blanket expression such as this: Text.*$, if the $ envelops a <p> or </p> it doesn't work. I have to explicitly make sure that $ does not match one of these, so I use: Text.*More text. As long as "More text" finishes just before the <p></p> it is all good.
Any ideas when this will be fixed? Removing headers and footers at the moment is more about guesswork and experimentation. |
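One possible factor, offered as a guess rather than anything confirmed in this thread: in Python-style regular expressions, "." does not cross line breaks and "$" only matches at the very end of the text unless flags say otherwise. If the paragraph tags sit on separate lines in the text the expression is applied to, Text.*$ would behave as in this small sketch (the sample string is invented).

```python
import re

# Invented sample: the footer text is followed by tags on separate lines.
text = "Footer Text 12\n</p>\n<p>Next paragraph"

# Default flags: "." stops at the newline and "$" needs the end of the string.
default_match = re.search(r"Text.*$", text)

# MULTILINE lets "$" match at the end of each line.
multiline_match = re.search(r"Text.*$", text, re.MULTILINE)

# DOTALL lets "." run across newlines, so the match swallows the tags too.
dotall_match = re.search(r"Text.*$", text, re.DOTALL)
```

Here the default search finds nothing, the MULTILINE search stops at the end of the footer line, and the DOTALL search runs to the end of the text.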
#37 |
Junior Member
Posts: 7
Karma: 10
Join Date: Jan 2010
Device: Sony prs-505
|
I agree totally that it is currently a case of trial and error. My expression above worked for me, but in the wizard it didn't highlight the text I wanted to remove! I have a feeling that if the text contains a </p><p>, then finding a term to match the phrase and replacing the two paragraph markers with (<p>) will work, provided that it is not itself enclosed in parentheses. I haven't tried this, though, as I only convert one book at a time and then read it. The problem seems to come from the fact that the view presented in the wizard does not correspond to the text at the point where the expression is applied! To save messing around, most of the time I convert the file 'as is' to RTF and remove the oddities with wildcards in MS Word's 'find & replace'. Then I tidy the formatting by making titles Heading 1 and chapters Heading 2, and use Calibre again to convert the RTF to the required format. It's more roundabout but works out quicker. |
#38 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2010
Device: iPhone
|
I also can't remove headers and footers. They get highlighted in the regex wizard just fine, but they are never removed. It may be that I'm trying to use paragraph tags, which folks are saying don't work properly, but there is no point trying to remove some books' footers without </p> because if it's just the page number in the footer or book title in the header, the </p> is the only thing differentiating it from any other number/title in the book.
For now I am using Sigil to edit the epub output from Calibre and the regex find/replace works quite well in Sigil. However, Sigil is extremely slow at processing and making changes, so it'd be much better if it could be fixed in Calibre. =) |
#39 |
Enthusiast
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
|
I know I'm repeating myself, but last week I converted several PDF files (with Calibre 0.6.43) that had page numbers in all sorts of different forms. For most of them, the following regex worked. (I know it won't be highlighted in the wizard, but when you convert, you will notice that it works anyway.)
Code:
(<p>\s*\d+\s*<p>)
If there is something like "Page 3" (with the HTML syntax "<p>Page 3 </p>"), you have to adjust your regex too:
Code:
(<p>Page\s*\d+\s*<p>)
Last edited by matthias; 03-09-2010 at 09:24 AM. |
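A quick way to see why this pattern is not highlighted in the wizard is to try it in Python on two made-up strings: one with unclosed <p> tags and one with properly closed paragraphs. This is only a sketch. Both sample strings are assumptions about what the markup might look like at different stages, not output captured from Calibre.

```python
import re

# Pattern from the post: a bare number between two *opening* <p> tags.
PAGE_NUMBER = re.compile(r"(<p>\s*\d+\s*<p>)")

# Hypothetical markup where paragraphs are not closed.
unclosed = "<p>end of page text<p> 12 <p>start of next page"
# Hypothetical markup where every paragraph is closed with </p>.
closed = "<p>end of page text</p><p> 12 </p><p>start of next page</p>"

# Removing the match mimics what the header/footer option does with it.
stripped = PAGE_NUMBER.sub("", unclosed)

# The same pattern finds nothing when the number is followed by </p>.
matches_closed = PAGE_NUMBER.search(closed) is not None
```

The pattern removes the page number from the first string and leaves the second untouched, which fits the reports above that the wizard view and the converted text do not always agree.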
#40 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2010
Device: iPhone
|
Quote:
|
|
#41 |
Enthusiast
Posts: 25
Karma: 4212
Join Date: Nov 2009
Location: South Tyrol, Italy
Device: Sony Reader PRS-505
|
Hello kristarella,
To make the wizard highlight it AND remove it, you'll have to modify your regex a little, but it works. Simply use the following regex:
Code:
(<p>Page\s*\d+\s*</*p>)
Code:
(<p>Page\s*\d+\s*<(/)*p>)
The only thing changed since my last post is the "/*", which makes the slash optional. The wizard can then highlight the matched term when the "/" is in the tag, and the term is still removed when the "/" is not there, so it works for both the preview and the conversion. I hope this helps make your conversions as comfortable as possible.
Last edited by matthias; 03-12-2010 at 06:20 AM. |
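The effect of the "/*" can be checked in Python with a few invented strings. This sketch only shows the matching behaviour of the expression itself and says nothing about which form of the markup Calibre sees at each stage.

```python
import re

# "/*" means "zero or more slashes", so both <p> and </p> are accepted.
PAGE_FOOTER = re.compile(r"(<p>Page\s*\d+\s*</*p>)")

samples = [
    "<p>Page 3 </p>",      # closed paragraph, as shown in the wizard
    "<p>Page 3 <p>",       # unclosed form, as in the earlier regex
    "<p>Page three </p>",  # no digits, so this should not match
]
results = [bool(PAGE_FOOTER.fullmatch(s)) for s in samples]
```

The first two samples match and the third does not, so one expression covers both the closed and the unclosed form.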
#42 |
Junior Member
Posts: 4
Karma: 10
Join Date: Oct 2009
Device: none
|
Hello,
matthias, I tried your suggestion to remove a header with the following regex:
(</*p><p>\s*</*p><p>\s*\d+\s*</*p><p>\s.*</*p><p>\s*</*p><p>\s*</*p><p>)
The header looks like this:
</p><p> 1 </p><p> Chapter 1: Hello </p><p> </p><p> </p><p>
The wizard successfully highlights all the page headers, but the conversion still doesn't work. Any ideas? Thanks! |
#43 |
Member
Posts: 11
Karma: 10
Join Date: May 2009
Device: Sony PRS-505
|
In the Project Gutenberg books (HTML version), there's some header and footer text in the file that is formatted using <pre>texthere</pre>. I'd like to remove it, but I can't figure out the regex. Help?
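One way to approach this, sketched in Python under the assumption that each block opens with <pre> and closes with </pre>: match lazily from the opening tag to the nearest closing tag and let "." cross line breaks. The function name and sample markup are made up for illustration.

```python
import re

# Lazy match from <pre ...> to the nearest </pre>; DOTALL lets "." span
# multiple lines, IGNORECASE also catches <PRE>.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre>", re.DOTALL | re.IGNORECASE)

def drop_pre_blocks(html):
    """Remove every <pre>...</pre> block from the given HTML string."""
    return PRE_BLOCK.sub("", html)

# Invented sample with a header block and an upper-case footer block.
sample = "<pre>\nLicense header\n</pre><h1>Title</h1><p>Body</p><PRE>footer</PRE>"
cleaned = drop_pre_blocks(sample)
```

Note that this removes every preformatted block, so a book that uses <pre> for poetry or tables in the body would lose that content as well.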
|
#44 |
Junior Member
Posts: 1
Karma: 10
Join Date: Jul 2010
Device: kindle
|
Working from the earlier regexes, I was still having the same problem as many others: removing a page number at the bottom of each page of a PDF file when converting. I am converting to MOBI for a Kindle and used the following regex to remove it.
My test output was as follows:
Last line of text on page. <br> 3<br>
So I took the earlier expression (<p>\s*\d+\s*<p>) and changed it to (<br>\s*\d+\s*<br>). When tested, the <br>3<br> became highlighted, and when converted it appears to have worked. I will need to scan through the text to verify, though. Hope this helps anyone else trying to fix this issue. Eric. |
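The <br> variant can be tried in Python on the test output quoted above. This is a small sketch: the text after the page number and the second sample are invented to show what is and is not removed.

```python
import re

# A bare number with a <br> on each side, as in the post.
PAGE_NUMBER_BR = re.compile(r"(<br>\s*\d+\s*<br>)")

# First part is the test output from the post; the rest is invented.
sample = "Last line of text on page. <br> 3<br>First line of next page."
cleaned = PAGE_NUMBER_BR.sub("", sample)

# A number inside a sentence has no <br> directly before it, so it stays.
prose = "He left in 1984<br>and never came back."
untouched = PAGE_NUMBER_BR.sub("", prose)
```

Because the match includes both <br> tags, the line break around the page number disappears along with it.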
#45 |
Junior Member
Posts: 5
Karma: 10
Join Date: May 2010
Device: iPad
|
What we really need is for the 'remove footer' dialogue to display the text in the same form that the regex is actually applied to. Everything else is only guessing and will never completely solve the problem.
So I beg the programmer of this feature to finally fix this issue (from an outside perspective it shouldn't be that hard). It's really annoying in an otherwise great program. |