03-12-2008, 08:11 AM | #1 |
Other
Posts: 143
Karma: 644
Join Date: Jan 2008
Location: Norway
Device: Cybook, Kindle
|
PDF extraction – what is the best tool?
When converting PDFs to MobiPocket for my Cybook I have so far used MobiPocket Creator and Adobe Acrobat v6.
I think the best result is achieved if I export the PDF to HTML from Acrobat and then convert the HTML file to .prc using MobiPocket Creator, instead of converting directly from PDF to .prc in MobiPocket Creator. As far as I know, I could also use BookDesigner for this task. The conversion is never perfect and there are always issues with formatting. How do you extract your PDFs? What is the best current tool/process? Will I achieve better results if I update Acrobat to the latest version? |
03-12-2008, 09:04 AM | #2 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
For me the best tool is ABBYY PDF Transformer. As I remember, it is about $99. It creates MS Word documents that can be edited and loaded into BookDesigner.
|
03-29-2008, 10:55 AM | #3 | |
Addict
Posts: 230
Karma: 334908
Join Date: Oct 2006
Device: multiple
|
Quote:
My other favorite is Gemini, by Iceni, a British software company. Its PDF-to-HTML output is the best I have seen, but you do pay a price: $159 when I bought it. It's absolutely top-of-the-line. Gemini and UnPDF are the only two programs out there I would recommend for this task. |
|
03-29-2008, 12:15 PM | #4 |
reader
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
|
|
03-29-2008, 01:41 PM | #5 |
Other
Posts: 143
Karma: 644
Join Date: Jan 2008
Location: Norway
Device: Cybook, Kindle
|
I downloaded the demo version of Gemini, and I agree that it is a great tool that works better than both Adobe Acrobat and Mobipocket Creator.
Thanks! |
03-29-2008, 05:40 PM | #6 |
Addict
Posts: 230
Karma: 334908
Join Date: Oct 2006
Device: multiple
|
|
04-14-2008, 07:04 PM | #7 |
Junior Member
Posts: 3
Karma: 10
Join Date: Mar 2008
Device: Palm TX
|
I have had great luck with the different converters from ABC Amber ( http://www.processtext.com/ ).
They have a PDF converter that converts to almost any format you can think of, for only $12.95. I use the company's "MS Lit" converter almost every day, as so many ebooks are released in Lit format, which my Palm TX cannot read. Hope it helps! |
04-24-2008, 06:45 PM | #8 |
Resident Curmudgeon
Posts: 76,008
Karma: 134368292
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
ABC Amber Lit converter doesn't work well as it's based on a buggy version of ConvertLIT.
|
04-25-2008, 02:17 PM | #9 |
Guru
Posts: 860
Karma: 4380
Join Date: Feb 2008
Location: Almada, Portugal
Device: Cybook Gen3, Sony PRS 505, Kindle DXG and Samsung Galaxy Note
|
Acrobat Pro 8.0 (export as text and/or Word) and OmniPage Pro 16 (OCR the PDF file and save as text or Word).
|
04-25-2008, 04:12 PM | #10 |
Wizard
Posts: 1,244
Karma: 3439432
Join Date: Feb 2008
Device: Amazon Kindle Paperwhite (300ppi), Samsung Galaxy Book 12
|
The best tool I've found for this is Marcel Weiher's TextLightning.app, available from www.metaobject.com (ob. discl.: I was a beta-tester). Although it's a Mac OS X app, it's also available for Linux and could probably be compiled for Windows using the recently improved Windows support in GNUstep (www.gnustep.org).
William |
08-20-2009, 08:10 AM | #11 |
Junior Member
Posts: 9
Karma: 10
Join Date: Sep 2008
Device: Cybook Gen3 eBoo
|
LRF conversion seems to have been removed from Docudesk UnPDF Professional version 3.0. Can anyone confirm? I've downloaded trials of versions 2 and 3, and this seems to be the case...
|
08-24-2009, 05:54 AM | #12 | |
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2009
Device: none
|
Quote:
Since you want to retain the formatting, I think converting PDF to Word or HTML, and then to .prc, could be a good choice. Still, I think there will be formatting problems once a file has been converted two or more times with different tools. |
|
08-24-2009, 08:46 PM | #13 |
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
I extract PDFs to Word docs (or RTF; the file is the same from my viewpoint), and then edit the Word doc. If I were more fluent in HTML, I'd extract to that--and expect to spend the same amount of time editing the HTML file as I spend on the average PDF-to-Word conversion.
I generally have to fix the page sizes & margins, remove text boxes, change pictures to inline with text, and do odd things to get rid of the page numbers & headers. Then I fix the paragraph settings, starting by making them all single-spaced and removing the right & left margin indents, if any; if it's reasonable, I change them all to the same before & after amounts and justification. Then I set the font: make it all one font, use find & replace to fix the sizes, and make sure it's all 100% size, not condensed or expanded.
I'd expect HTML files to work better if you normalize the fonts, remove the extra "div" sections and "align" tags, and get rid of tables that force the page structure.
Basic novels should transfer nicely. Of course, basic novels probably transfer fine from the original PDF straight to Mobi. It's when there are other formatting aspects that the conversion breaks down, and none of the auto-converters shines as the best one, because PDF wasn't designed to be a convert-from format. |
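For anyone who goes the HTML route, the "div" and "align" cleanup mentioned above could be scripted. What follows is only a rough sketch in Python, not something from the post: the function name simplify_html is made up for illustration, it relies on regular expressions rather than a real HTML parser, and it does not touch fonts or layout tables. It assumes reasonably well-formed converter output.

```python
import re

# Hypothetical helper, not part of any converter mentioned in this thread.
DIV_TAG = re.compile(r"</?div\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]+>")
ALIGN_ATTR = re.compile(
    r"""\s+align\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+)""", re.IGNORECASE
)


def simplify_html(html: str) -> str:
    """Strip <div> wrappers and align attributes, keeping the wrapped content."""
    # Remove opening and closing <div> tags but leave what they enclose.
    html = DIV_TAG.sub("", html)
    # Inside each remaining tag, delete any align="..." attribute.
    # Text outside of tags is never modified.
    return ANY_TAG.sub(lambda m: ALIGN_ATTR.sub("", m.group(0)), html)


sample = '<div class="page"><p align="justify">Chapter text</p></div>'
cleaned = simplify_html(sample)
```

Because it is regex-based, this will be confused by a ">" inside an attribute value, so the result still needs a look by eye before it goes into a .prc converter.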
09-26-2009, 03:12 PM | #14 |
Groupie
Posts: 162
Karma: 24658
Join Date: Sep 2009
Device: PRS-505
|
Hi Elfwreck,
I posted in another thread regarding this, but you seem to have a lot of experience with PDF->Word conversions. You outlined a lot of postprocessing that you do. Does your converter insert paragraph breaks at the end of a page even if a sentence is continued on the next? If so, do you go in and manually delete every spurious paragraph break for each page? I can't figure out if there is any software smart enough not to include these breaks at the end of a page, or if there is an easy way to correct for it. Thanks! |
09-26-2009, 05:13 PM | #15 | ||
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
Quote:
Quote:
Otherwise, I look for ways to identify paragraph breaks in the wrong places. This starts with removing unwanted page breaks; sometimes I remove them all (replace with a space); sometimes I try to keep them before chapter breaks, if chapter headers have identifiable typographical features that I can search for.
Then: Search for [any letter]^p (or [any letter][space]^p), replace with [find what text]qqq, then replace ^pqqq with [space]. This doesn't work if some paragraphs are supposed to end with letters instead of punctuation (like tables), so it may involve some checking & manual touch-up. And it won't catch cases where a sentence ends at the bottom of one page and the first line of the next page is supposed to be part of the same paragraph. Sometimes I can search for tabs or first-line indentation--often, anything that's not indented is either a chapter header or should be part of the previous page.
So, semi-manual: search, then manually fix. It gets faster with practice. It's always a bit choppy, and never as good as a page-by-page QC, although I find it plenty acceptable for personal reading. Since most of the PDFs I convert this way are either not legal to distribute, or only of interest to a very limited crowd (I convert legal rulings from PDF to neatly-formatted Word docs for friends), I've not had to develop anything that works more smoothly.
Last edited by Elfwreck; 09-26-2009 at 05:15 PM. |
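The same search-and-replace idea can be tried outside Word on a plain-text export. This is a minimal sketch in Python, not the method from the post itself. It assumes each paragraph break is a single newline (standing in for Word's ^p) and that page breaks arrive as form-feed characters; the function name join_broken_paragraphs is made up for illustration. A regular expression capture group keeps the letter in place, so the temporary qqq marker is not needed.

```python
import re

PAGE_BREAK = "\f"  # assumption: page breaks show up as form feeds in the export

# A letter, an optional trailing space, then a paragraph break.
# [^\W\d_] matches word characters that are not digits or underscores.
LETTER_THEN_BREAK = re.compile(r"([^\W\d_]) ?\n")


def join_broken_paragraphs(text: str) -> str:
    """Merge paragraphs that were split in the middle of a sentence."""
    # Step 1: remove page breaks by replacing each one with a space.
    text = text.replace(PAGE_BREAK, " ")
    # Step 2: a break that directly follows a letter is treated as spurious,
    # so the letter is kept and the break becomes a single space.
    return LETTER_THEN_BREAK.sub(r"\1 ", text)


raw = "It was a dark and stormy\nnight.\nThe rain fell in\ntorrents.\n"
fixed = join_broken_paragraphs(raw)
```

The caveats from the post apply unchanged: lines that legitimately end in a letter (headings, table cells) get merged into the next line, and a break that follows punctuation is never touched, so a manual pass is still needed afterwards.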
||
|