Old 08-06-2010, 11:54 AM   #1
mark235
Nameless Being
 
clean HTML or PDF before mobi conversion in Calibre

I recently bought a DVD which contains 1000+ books. Having preordered a Kindle 3, I'd like to create mobi files of some of this content, but have run into some formatting challenges. I'd appreciate any advice on how to resolve these and automate the solution as much as possible.

I have found two methods to export content from the dvd so far.
1. Create a PDF of the book. Most formatting is preserved in the PDF, but importing it in Calibre results in garbled text, missing text, double text and other formatting issues.
2. Copy the book content to a text editor. The formatting is always lost this way. Also, all the headings in the text now also appear twice, like this:
Chapter 1
(Chapter 1)
instead of this:
Chapter 1

Option one appears to be the only way to retain most of the text's formatting, but it is only useful if the content is copied over from the PDF to another text editor or extracted some other way. Copying and pasting from the PDF into Word 2003 and saving as a filtered webpage causes a new problem: the last few lines of each PDF page now appear twice in the HTML file.

Option 2 produces ok content except for the double header and missing formatting. I found that using wildcards in "find and replace" in Word can help here:

find: (*)^13(\(\1\))
replace: \1
The options for using wildcards and using bold text in the replace field also need to be activated. However, the problem with this search and replace solution is that Word freezes when it's applied to documents longer than 5 pages! Is there another way to get rid of the 2nd header in brackets and bold the first one?

I have attached a sample pdf and htm file, and would appreciate any help on how to handle these files to produce clean mobi files.
edit, seems I can't upload htm files...
Attached Files
File Type: pdf Adventures of Huckleberry Finn.pdf (8.3 KB, 422 views)
Old 08-06-2010, 12:07 PM   #2
itimpi
Wizard
Posts: 4,553
Karma: 950151
Join Date: Nov 2008
Device: Sony PRS-950, iphone/ipad (Marvin/iBooks/QuickReader)
Do you have any idea what format the books are stored in on the DVD? As you have discovered, PDF conversion is often far from ideal, so other formats are normally preferred.
Old 08-06-2010, 02:03 PM   #3
mark235
Nameless Being
 
the books are all in a massive 5 GB nfo file.
Do you think there is a way to manipulate a text file that big?
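
On the size question: a file that large can usually be handled by streaming it one line at a time instead of opening it in an editor. Here is a minimal Python sketch of that idea. It assumes the content has already been exported as plain text (a raw Infobase database may well not be line-oriented text), and the function names and file paths are made up for illustration. It also drops any line that is just the previous line wrapped in brackets.

```python
from typing import Iterable, Iterator


def drop_bracketed_repeats(lines: Iterable[str]) -> Iterator[str]:
    """Yield lines, skipping any line that is the previous kept line in parentheses."""
    previous = None
    for line in lines:
        text = line.rstrip("\r\n")
        if previous and text == f"({previous})":
            continue  # e.g. "(Chapter 1)" directly after "Chapter 1"
        previous = text
        yield line


def clean_file(src_path: str, dst_path: str, encoding: str = "utf-8") -> None:
    """Stream src_path to dst_path so only one line is held in memory at a time."""
    # The encoding is an assumption; undecodable bytes are replaced, not fatal.
    with open(src_path, encoding=encoding, errors="replace") as src, \
         open(dst_path, "w", encoding=encoding) as dst:
        dst.writelines(drop_bracketed_repeats(src))
```

Something like `clean_file("export.txt", "export_clean.txt")` would then work on a file of any size, since memory use does not grow with the input.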
Old 08-06-2010, 02:18 PM   #4
Marcy
Guru
Posts: 897
Karma: 950683
Join Date: Oct 2009
Device: Kobo Libra2
If this is a CD of public domain books, such as Huckleberry Finn, you can find much nicer copies in various formats on the internet.

For example, Huckleberry Finn can be found already in mobi format, right here on mobileread:

https://www.mobileread.com/forums/sho...ht=huckleberry

-Marcy
Old 08-06-2010, 07:27 PM   #5
mark235
Nameless Being
 
Yes, I've seen decent mobi versions of this book in various places. It was the book I had open when I needed to create an example PDF. However, there's lots of content on the DVD that I can't find elsewhere.

So as for the question on identifying double text entries like

Chapter 1
(Chapter 1)

and changing them to

Chapter 1

If I can restore the header formatting, I can certainly live with losing the other formatting. Anybody know how to automate this?
Old 08-06-2010, 07:48 PM   #6
jackie_w
Grand Sorcerer
Posts: 6,216
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
Quote:
Originally Posted by mark235
I found that using wildcards in "find and replace" in Word can help here:

find: (*)^13(\(\1\))
replace: \1
The options for using wildcards and using bold text in the replace field also need to be activated. However, the problem with this search and replace solution is that Word freezes when it's applied to documents longer than 5 pages! Is there another way to get rid of the 2nd header in brackets and bold the first one?
The first part of your Find: string is very general. It would match every line. Does Word behave any better if you make it more specific? e.g.

find: (Chapter [0-9]{1,3})^13(\(\1\))
replace: \1

which would get fewer hits.

Also, as an aside, you may find it more useful to have your Replace apply the Style "Heading 2" (or any other of the built-in Heading styles) rather than Bold. If you do this, when you use Calibre to convert to mobi you will be able to use the Heading 2 tags to specify a TOC. If you don't like the way Heading 2 looks in Word, modify the built-in Style first.
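
If Word keeps choking, the same idea can be tried outside Word. The following is only a rough Python sketch, not a tested workflow: it assumes the book is a plain-text file with "Chapter N" headings, treats an HTML `<h2>` tag as the counterpart of Word's "Heading 2" style, and uses a made-up function name (`text_to_html`).

```python
import html
import re

# Same pattern as the Word search above: "Chapter" plus 1 to 3 digits.
HEADING = re.compile(r"Chapter [0-9]{1,3}")


def text_to_html(text: str) -> str:
    """Turn plain text into simple HTML, tagging chapter headings as <h2>."""
    lines = text.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        if HEADING.fullmatch(line):
            out.append(f"<h2>{html.escape(line)}</h2>")
            # Skip the bracketed repeat, e.g. "(Chapter 1)", if it follows.
            if i + 1 < len(lines) and lines[i + 1].strip() == f"({line})":
                i += 1
        elif line:
            out.append(f"<p>{html.escape(line)}</p>")
        i += 1
    return "\n".join(out)
```

The output is a bare HTML fragment with one `<p>` per non-empty line, so any other formatting is lost, but the headings are tagged for TOC detection.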
Old 08-06-2010, 07:57 PM   #7
jackie_w
Grand Sorcerer
Posts: 6,216
Karma: 16534894
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
... or you could try the Find/Replace in 2 passes.

Pass 1:
Find: \(Chapter [0-9]{1,3}\)
Replace:

Pass 2:
Find: (Chapter [0-9]{1,3}^13)
Replace: \1 (with chosen formatting)
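
For comparison, here is how the same two passes might look in Python's `re` module, in case Word is not up to the job. This is a sketch under assumptions: the text uses `\n` line endings, and the `<b>...</b>` wrapper merely stands in for whatever formatting is chosen. The function name `two_pass` is invented for the example.

```python
import re


def two_pass(text: str, fmt: str = "<b>{}</b>") -> str:
    """Remove bracketed chapter repeats, then wrap the remaining headings."""
    # Pass 1: delete whole lines such as "(Chapter 12)", newline included.
    text = re.sub(r"^\(Chapter [0-9]{1,3}\)\n?", "", text, flags=re.MULTILINE)
    # Pass 2: wrap each remaining "Chapter N" line in the chosen formatting.
    return re.sub(
        r"^(Chapter [0-9]{1,3})$",
        lambda m: fmt.format(m.group(1)),
        text,
        flags=re.MULTILINE,
    )
```

Unlike the Word version, pass 1 here removes the line break too, so no empty paragraph is left behind.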
Old 08-06-2010, 11:22 PM   #8
Oboe Joe
Junior Member
Posts: 5
Karma: 12
Join Date: Jun 2010
Location: Houston, TX
Device: Kindle DX, iPhone
Mark, the only folks I know still using Infobase text databases are Deseret Book and the LDS Church. Is this by any chance GL2005? If so, PM me. I have a much better way to extract the database's contents using a local web service.
Old 08-07-2010, 05:03 AM   #9
mark235
Nameless Being
 
@jackie_w: The difficult part is that for lots of books, the chapter or paragraph names vary a lot. So I guess that forces me to always use (*) at the beginning of the string, and perhaps this is exactly what freezes Word on large docs. What I could do is identify all the books with headings that have a "chapter 1, chapter 2" structure, correct those with your 2 strings, and correct the rest manually.
I'll give that a try, thanks
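
For the books where the heading names vary, the general "(*)" match can also be written as a regular expression outside Word. A minimal Python sketch, assuming plain text with `\n` line endings; the names `DUPLICATE` and `collapse_duplicates` and the `template` parameter are invented for illustration:

```python
import re

# A whole line, followed by a line holding the same text in parentheses.
DUPLICATE = re.compile(r"^(?P<title>[^\n]+)\n\((?P=title)\)$", re.MULTILINE)


def collapse_duplicates(text: str, template: str = "{}") -> str:
    """Replace 'Title' + '(Title)' line pairs with a single formatted title."""
    return DUPLICATE.sub(lambda m: template.format(m.group("title")), text)
```

Calling `collapse_duplicates(text, "<h2>{}</h2>")` would tag every de-duplicated heading at the same time. Lines in brackets that do not repeat the line above them are left alone.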

@Oboe Joe: Yes, I'm using LDS Library 2005 & 2009. PM sent
Old 12-25-2010, 10:37 PM   #10
jpather
Junior Member
Posts: 1
Karma: 10
Join Date: Dec 2010
Device: Blackberry/itouch
Best way to convert PDF to epub

Calibre will try to convert it for you, but it will usually have lots of formatting issues.

I have tried a lot of ways to convert PDFs to other formats, and this is by far the best.

Open your PDF in Adobe, click Edit --> Copy File to Clipboard.
Open Word or WordPad, and paste the file in.
This will get it into Word with the formatting intact.
Save as an RTF, and use Calibre to convert to whatever format you want.

I use Mobi, but my kids all have iPod touches, so they need EPUB for Stanza. Once you have it in RTF, Calibre can handle any other conversions you need.
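
If there are many RTF files, the last step can be scripted with calibre's command-line tool, `ebook-convert`, which takes an input file and an output file. Below is a small Python sketch of that. The helper names are made up, it assumes `ebook-convert` is on the PATH, and it passes no extra conversion options.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: str, target_ext: str = "mobi") -> list:
    """Build the ebook-convert call: same file name, new extension."""
    src = Path(source)
    return ["ebook-convert", str(src), str(src.with_suffix("." + target_ext))]


def convert(source: str, target_ext: str = "mobi") -> None:
    """Run ebook-convert on one file, raising if the tool or the run fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is calibre installed?")
    subprocess.run(build_convert_command(source, target_ext), check=True)
```

For example, `convert("book.rtf")` would write `book.mobi` next to the source, and `convert("book.rtf", "epub")` would write `book.epub`.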