08-06-2010, 11:54 AM | #1 |
Nameless Being
|
clean HTML or PDF before mobi conversion in Calibre
I recently bought a DVD which contains 1000+ books. Having preordered a Kindle 3, I'd like to create mobi files of some of this content, but have run into some formatting challenges. I'd appreciate any advice on how to resolve these and automate the solution as much as possible.
I have found two methods to export content from the DVD so far.
1. Create a PDF of the book. Most formatting is preserved in the PDF, but importing it into Calibre results in garbled text, missing text, doubled text and other formatting issues.
2. Copy the book content to a text editor. The formatting is always lost this way. Also, every heading in the text now appears twice, like this:
Chapter 1
(Chapter 1)
instead of this:
Chapter 1
Option 1 appears to be the only way to retain most of the text's formatting, but it is only useful if the content is copied over from the PDF to a text editor or extracted some other way. Copying and pasting from the PDF into Word 2003 and saving as a filtered web page causes a new problem: the last few lines of each PDF page now appear twice in the HTML file.
Option 2 produces OK content except for the double header and missing formatting. I found that using wildcards in "find and replace" in Word can help here:
find: (*)^13(\(\1\))
replace: \1
The options for using wildcards and for bold text in the replace field also need to be activated. However, the problem with this search-and-replace solution is that Word freezes when it's applied to documents longer than 5 pages!
Is there another way to get rid of the second header in brackets and bold the first one?
I have attached a sample pdf and htm file, and would appreciate any help on how to handle these files to produce clean mobi files.
Edit: it seems I can't upload htm files... |
08-06-2010, 12:07 PM | #2 |
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
|
Do you have any idea what format the books are stored in on the DVD? As you have discovered, PDF conversion is often far from ideal, so other formats are normally preferred.
|
08-06-2010, 02:03 PM | #3 |
Nameless Being
|
The books are all in a massive 5 GB nfo file.
Do you think there is a way to manipulate a text file that big? |
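On the question of file size: a minimal sketch, assuming the content can be exported as plain text (the nfo file itself may well be a binary database, which this would not handle). Reading line by line keeps memory use roughly constant regardless of file size. The function names and the `transform` callback are hypothetical, chosen only for illustration.

```python
from typing import Callable, Iterable, Iterator


def transform_lines(lines: Iterable[str],
                    transform: Callable[[str], str]) -> Iterator[str]:
    """Lazily apply `transform` to each line; nothing is held in memory."""
    for line in lines:
        yield transform(line)


def process_large_file(src_path: str, dst_path: str,
                       transform: Callable[[str], str],
                       encoding: str = "utf-8") -> int:
    """Copy src_path to dst_path one line at a time, applying `transform`.

    Returns the number of lines written. The encoding is an assumption;
    undecodable bytes are replaced rather than raising an error.
    """
    count = 0
    with open(src_path, "r", encoding=encoding, errors="replace") as src, \
            open(dst_path, "w", encoding=encoding) as dst:
        for line in transform_lines(src, transform):
            dst.write(line)
            count += 1
    return count


# Example: strip trailing whitespace from every line of a large export.
# process_large_file("export.txt", "clean.txt", lambda s: s.rstrip() + "\n")
```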
08-06-2010, 02:18 PM | #4 |
Guru
Posts: 897
Karma: 950683
Join Date: Oct 2009
Device: Kobo Libra2
|
If this is a CD of public domain books, such as Huckleberry Finn, you can find much nicer copies in various formats on the internet.
For example, Huckleberry Finn can be found already in mobi format, right here on mobileread: https://www.mobileread.com/forums/sho...ht=huckleberry -Marcy |
08-06-2010, 07:27 PM | #5 |
Nameless Being
|
Yes, I've seen decent mobi versions of this book in various places. It was the book I had open when I needed to create an example pdf. However, there's lots of content on the DVD that I can't find elsewhere.
So, back to the question of identifying double text entries like
Chapter 1
(Chapter 1)
and changing them to
Chapter 1
If I can restore the header formatting, I can certainly live with losing the other formatting. Anybody know how to automate this? |
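One way to automate this outside Word is a regular expression with a backreference, the same idea as the (*)^13(\(\1\)) wildcard search. This is a minimal sketch, assuming the copied text has been saved as plain text with the heading on one line and the bracketed duplicate on the next; the function name is made up for illustration.

```python
import re

# A full line of text, followed on the next line by the same text in
# parentheses. (?P=title) is a backreference to the first line.
DUP_HEADING = re.compile(
    r"^(?P<title>[^\n]+)\n\((?P=title)\)[ \t]*$",
    re.MULTILINE,
)


def collapse_duplicate_headings(text: str) -> str:
    """Replace 'Heading\\n(Heading)' with just 'Heading'."""
    return DUP_HEADING.sub(lambda m: m.group("title"), text)


sample = "Chapter 1\n(Chapter 1)\nIt was a quiet morning.\n"
print(collapse_duplicate_headings(sample))
```

Because the pattern matches any line that is repeated in brackets, it does not depend on the headings being named "Chapter N". It will not match if the two copies differ, for example by trailing spaces on the first line.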
08-06-2010, 07:48 PM | #6 | |
Grand Sorcerer
Posts: 6,216
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
Quote:
find: (Chapter [0-9]{1,3})^13(\(\1\))
replace: \1
which would get fewer hits. Also, as an aside, you may find it more useful to have your Replace specify formatting of Style "Heading 2" (or any other of the built-in Heading styles) rather than Bold. If you do this, when you use Calibre to convert to mobi you will be able to use the Heading 2 tags to specify a TOC. If you don't like the way Heading 2 looks in Word then modify the built-in Style first. |
|
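The Heading 2 idea can also be applied without Word. The following is a rough sketch, assuming plain-text input in which each heading is followed by its bracketed duplicate on the next line; it uses that duplicate as the signal that a line is a heading and writes simple HTML with h2 tags. The function name and the one-paragraph-per-line output are illustrative choices, not anything prescribed in this thread.

```python
import html


def text_to_html(text: str) -> str:
    """Turn 'Heading\\n(Heading)' pairs into <h2> and other lines into <p>."""
    lines = [line.strip() for line in text.splitlines()]
    out = []
    i = 0
    while i < len(lines):
        line = lines[i]
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if line and nxt == f"({line})":
            # Heading plus its bracketed duplicate: keep one copy as <h2>.
            out.append(f"<h2>{html.escape(line)}</h2>")
            i += 2
        elif line:
            out.append(f"<p>{html.escape(line)}</p>")
            i += 1
        else:
            i += 1  # skip blank lines
    return "\n".join(out)


print(text_to_html("Chapter 1\n(Chapter 1)\nIt was a quiet morning.\n"))
```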
08-06-2010, 07:57 PM | #7 |
Grand Sorcerer
Posts: 6,216
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
... or you could try the Find/Replace in 2 passes.
Pass 1:
Find: \(Chapter [0-9]{1,3}\)
Replace: (leave empty)
Pass 2:
Find: (Chapter [0-9]{1,3})^13
Replace: \1 (with chosen formatting) |
08-06-2010, 11:22 PM | #8 |
Junior Member
Posts: 5
Karma: 12
Join Date: Jun 2010
Location: Houston, TX
Device: Kindle DX, iPhone
|
Mark, the only folks I know still using Infobase text databases are Deseret Book and the LDS Church. Is this by any chance GL2005? If so, PM me. I have a much better way to extract the database's contents using a local web service.
|
08-07-2010, 05:03 AM | #9 |
Nameless Being
|
@jackie_w: The difficult part is that for lots of books, the chapter or paragraph names vary a lot. So I guess that forces me to always use (*) at the beginning of the string. And perhaps this is exactly what freezes Word on large docs. What I could do is identify all the books with headings that have a "chapter 1, chapter 2" structure, correct those with your 2 strings, and correct the rest manually.
I'll give that a try, thanks.
@Oboe Joe: Yes, I'm using LDS Library 2005 & 2009. PM sent |
12-25-2010, 10:37 PM | #10 |
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2010
Device: Blackberry/itouch
|
Best way to convert PDF to epub
Calibre will try to convert it for you, but it will usually have lots of formatting issues.
I have tried a lot of ways to convert PDFs to other formats, and this is by far the best.
1. Open your PDF in Adobe and click Edit --> Copy File to Clipboard.
2. Open Word or WordPad and paste the file in. This will get it into Word with the formatting intact.
3. Save as an RTF, and use Calibre to convert to whatever format you want.
I use Mobi, but my kids all have iPod touches so they need EPUB for Stanza. Once you have it in RTF, Calibre can handle any other conversions you need. |
|
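The final conversion step can be scripted with calibre's command-line tool, ebook-convert, which takes an input file and an output file and picks the formats from the extensions. A minimal sketch, assuming calibre is installed and ebook-convert is on the PATH; the helper names and file names are hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def build_convert_command(src: str, dst: str,
                          extra_args: Optional[Sequence[str]] = None) -> List[str]:
    """Build the ebook-convert argument list without running it."""
    return ["ebook-convert", src, dst, *(extra_args or [])]


def convert(src: str, dst: str,
            extra_args: Optional[Sequence[str]] = None) -> None:
    """Run ebook-convert; raises if the tool is missing or the run fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    subprocess.run(build_convert_command(src, dst, extra_args), check=True)


# Examples (file names are placeholders):
# convert("book.rtf", "book.mobi")
# convert("book.rtf", "book.epub")
# For HTML input with <h2> headings, a TOC can be requested with an XPath:
# convert("book.html", "book.mobi", ["--level1-toc", "//h:h2"])
```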