02-06-2013, 07:57 PM | #1 |
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
|
Best format to extract text from: speed vs accuracy
Good folk.
I have been given the task of extracting the text from thousands of ebooks. Some of these are available in several formats (epub, lit, mobi, pdf, etc.). For the purpose of extracting the text (Unicode): 1. Which source format gives the most accurate extraction? 2. Which source format is fastest to extract from? Preliminary experiments so far point to epub/mobi as the best and pdf as the worst in terms of accuracy. Does anyone have any experience with this? Thank you all in advance. |
02-06-2013, 08:31 PM | #2 | |
Well trained by Cats
Posts: 30,121
Karma: 57500000
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
I know you can open (explode) an EPUB pretty easily (just use Tweak, which is built in). The HTML is also accessible. If you HAVE Acrobat, the PDF might not be so bad. |
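For anyone who wants to see what "exploding" an EPUB amounts to: an EPUB is a zip archive with (X)HTML files inside, so the text can be pulled out with nothing but the Python standard library. This is only a rough sketch, not how Calibre or Tweak does it. The function name `epub_to_text` is made up for illustration, and it reads files in archive order, which is an assumption (the real reading order lives in the book's OPF spine).

```python
import zipfile
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects visible character data, skipping <head>, <script> and <style>."""

    SKIP = {"script", "style", "head"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside a skipped element.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def epub_to_text(source):
    """Return the text of every (X)HTML file in an EPUB (hypothetical helper).

    `source` is a path or a binary file object. Files are processed in
    archive order (an assumption, see the note above the code).
    """
    chunks = []
    with zipfile.ZipFile(source) as zf:
        for name in zf.namelist():
            if name.lower().endswith((".xhtml", ".html", ".htm")):
                parser = _TextCollector()
                parser.feed(zf.read(name).decode("utf-8", errors="replace"))
                parser.close()
                chunks.append("\n".join(parser.parts))
    return "\n\n".join(chunks)
```

Calling `epub_to_text("some_book.epub")` returns one Unicode string. It does nothing about DRM, and it assumes the content files are UTF-8.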
|
02-06-2013, 08:38 PM | #3 | |
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
|
Quote:
What is Tweak? I've been playing with ebook-convert. |
|
02-06-2013, 09:45 PM | #4 | |
Well trained by Cats
Posts: 30,121
Karma: 57500000
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Quote:
This tool allows you to unpack the book's pieces to make (small) edits, then put them back together when done, maintaining the original structure. For bigger edits (adding/removing chapters...), Sigil is easier for the novice to intermediate user. |
|
02-06-2013, 10:24 PM | #5 | |
null operator (he/him)
Posts: 20,772
Karma: 27405072
Join Date: Mar 2012
Location: Sydney Australia
Device: none
|
Quote:
AFAIK what you see in the EPUB viewer is what you'll get in the TXT output file, but without any formatting/styling or images. The important settings are the TXT Output settings. Given that EPUB is Calibre's native format, I would anticipate it might be faster. If you don't have access to PDF editing software like Acrobat, Nitro, etc. to do the conversions, then you could try
I suggest you steer clear of the "Free PDF to ..." converters unless you get a specific recommendation - as in the case of MobiCreator. BR Last edited by BetterRed; 02-06-2013 at 10:29 PM. |
|
02-06-2013, 10:41 PM | #6 |
Resident Curmudgeon
Posts: 75,114
Karma: 131686272
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Calibre can convert to TXT. Just dump your eBooks into Calibre (not PDF) and batch convert to TXT. You can leave it running overnight. You don't have to care which is faster as it will just do it while you are not at the computer. I don't know the maximum you can queue at one time, but you could do it with Calibre.
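If you would rather script the overnight run than queue it in the GUI, something along these lines might work with the `ebook-convert` command-line tool mentioned earlier in the thread. It is a sketch under assumptions: the preference order (EPUB/MOBI first, PDF last) is just the one suggested in this thread, files in the same folder with the same name are treated as the same book, and the helper names (`pick_sources`, `build_command`, `convert_all`) are made up for illustration.

```python
import subprocess
from pathlib import Path

# Assumed preference order, taken from this thread: EPUB/MOBI first, PDF last.
PREFERRED = [".epub", ".mobi", ".lit", ".pdf"]


def pick_sources(files):
    """For each book, keep only its most preferred available format."""
    best = {}
    for path in map(Path, files):
        ext = path.suffix.lower()
        if ext not in PREFERRED:
            continue
        # Same folder + same stem is assumed to mean "same book".
        key = path.with_suffix("")
        current = best.get(key)
        if current is None or PREFERRED.index(ext) < PREFERRED.index(current.suffix.lower()):
            best[key] = path
    return sorted(best.values())


def build_command(src, out_dir):
    """Build `ebook-convert INPUT OUTPUT`; the .txt extension selects TXT output."""
    out_file = Path(out_dir) / (Path(src).stem + ".txt")
    return ["ebook-convert", str(src), str(out_file)]


def convert_all(library_dir, out_dir):
    """Convert one file per book and return the sources that failed."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    all_files = (p for p in Path(library_dir).rglob("*") if p.is_file())
    failed = []
    for src in pick_sources(all_files):
        result = subprocess.run(build_command(src, out_dir))
        if result.returncode != 0:
            failed.append(src)
    return failed
```

`convert_all("books", "txt_out")` needs Calibre installed with `ebook-convert` on the PATH. Books with the same file name in different folders would overwrite each other's output, so check for that before a big run.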
|
02-07-2013, 12:54 AM | #7 |
Junior Member
Posts: 3
Karma: 10
Join Date: Feb 2013
Device: None
|
Thank you all for the answers and leads.
|