05-28-2021, 05:33 AM | #1 |
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
|
Archive.org ePub
I noticed that a lot (I could almost safely say 100%) of ePubs from Archive.org from the last 2-3 years hang ADE and give various errors in the Calibre 4.xx I use. They do not load in ADE (I have just closed an instance of ADE that had been trying to load such an ePub for 3 hours and was completely unresponsive).
It doesn't matter whether the book is old or new (printing year); the books I downloaded years ago work OK (they still have some quirks, but rather inoffensive ones, solved by reconverting to epub in calibre) and the recent ones don't. I spent some hours reading the various complaints about Archive.org. Many people seem unhappy with the PDFs, but the bad quality (incompatibility) of the ePubs has not been raised that often (there is one single thread about it, here). The ADE is 4.5.11.something and has no problems whatsoever displaying the good ePubs I have (mine, made with Sigil and calibre, or from other trusted sources). I did not dare load them on my hardware eReaders for fear of bricking them. Has anyone encountered this issue? |
05-28-2021, 07:52 AM | #2 |
the rook, bossing Never.
Posts: 12,349
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
I find that many archive dot org epub or mobi or PDF are poor unproofed OCR of scans.
Also they don't care about copyright. Likely that's why you have a problem. Their ebooks are often rubbish quality. So I no longer download from there, only using it to find archives of defunct websites. Use sites such as gutenberg.org and here that have human curated and proofed genuine public domain books, or buy on Smashwords, Kobo, Amazon etc. |
05-28-2021, 10:36 PM | #3 | |||
Wizard
Posts: 2,304
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
Quote:
You'd get better and more accurate results by downloading the PDFs and running your own OCR. I wrote about some of that here: "Optimize PDFs from archive.org for E-Ink devices" and just last month: "Tutorial-from Paper Book to Ebook PDF - 400 pages in 4 hours". I wouldn't touch Archive.org "EPUBs" with a ten-foot pole though. To call those actual EPUBs is a travesty. What? This is complete hogwash. Last edited by Tex2002ans; 05-28-2021 at 10:42 PM. |
|||
05-29-2021, 09:38 AM | #4 |
the rook, bossing Never.
Posts: 12,349
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
I mean the bad OCRed scans are the source of the problem. Not copyright. The Open Library and other copyright shenanigans at Archive have nothing to do with the ghastly mobi/epub quality. They have been scanning paper books themselves for about 12 years, as well as sourcing from Google, Microsoft and uploaders. The problem is that none of it is human curated or proofed. It's automated.
I just set up a Linux box with a 20-year-old Epson Perfection 1200 on SCSI and Tesseract and gocr* last night. The newish funky colour laser printer-copier-scanner is not obviously better and is also downstairs. I have some 1890s to 1920s books, but likely I'm more interested in OCR of PD PDFs already scanned elsewhere. Yes, I know about ABBYY FineReader. But I don't have it. I couldn't find any sort of SCSI adaptor for the laptop. I used to have a PCMCIA card and a laptop that could take them. [* Xsane seems to want gocr, but 15 years ago I would have saved the scans, adjusted them in PaintShopPro and used the OCR on files. I can't imagine why I would do it from inside Xsane, even though I have a sheetfeeder] Last edited by Quoth; 05-29-2021 at 09:44 AM. |
05-29-2021, 11:07 AM | #5 |
Guru
Posts: 732
Karma: 10216666
Join Date: Jul 2017
Device: Boox Nova 2
|
One issue I've had with their PDFs is that they don't do any sort of correction for yellowed pages, so on a B&W eReader they can look like serious junk, with banding in the background. Other than that they're fine.
I also can't really blame them for automating this stuff: they just have way too much content, and often it's the only place to get it on the web. I needed a chapter from some 70-year-old niche book recently and they had it; the only other option was a university library 6 hours away that was closed anyway due to COVID. |
05-29-2021, 08:52 PM | #6 | |||
Wizard
Posts: 2,304
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
But the great thing about Archive.org is they release all the source files. So if you have problems with the B&W PDF, then instead download the:
If you check out Post #4+#6 in that Tutorial thread, I showed the why/how. You can then use Scan Tailor Advanced in order to correct "yellowed pages" -> B&W. Using that allows you to tweak all the variables to get a much better/cleaner B&W image. * * * And they're always tweaking their workflows. Like in December 2020, they rescanned/rereleased the entire "Computerworld" magazine from microfilm: https://blog.archive.org/2020/12/30/...age-microfilm/ Microfilm scanning technology has gotten much better since it was first digitized, so now a much higher quality release is available. Quote:
Like GrannyGrump's conversion of the original Sweeney Todd story: "The String of Pearls": https://www.mobileread.com/forums/sh...d.php?t=299744 https://archive.org/details/stringof...e/n13/mode/2up I think that book was locked away in Oxford University, one of the only copies left in the world, and it's not even available to the public. Now because of Archive.org, the entire world can read it. Quote:
99.9999% accuracy on a few hundred (maybe thousands) of books per year on Gutenberg. vs. 99% OCR accuracy on millions of books. (And all original source files are available.) And the scope is different too: Sure, you get the nice ebooks (I really wish Gutenberg released the original PDFs though)... But Archive.org is actually about making the works available/searchable. (NOT automating perfect ebooks. Those converted formats are just a side addition.) Last edited by Tex2002ans; 05-29-2021 at 08:57 PM. |
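To put those two percentages in per-book terms, here is a rough back-of-the-envelope sketch. It assumes the accuracy figures are per character and that a book is about 500,000 characters long; both are illustrative assumptions, not numbers from Gutenberg or Archive.org.

```python
def expected_errors(characters, accuracy):
    """Expected number of wrong characters for a given per-character accuracy."""
    return characters * (1.0 - accuracy)


# Assumed book length, purely for illustration.
BOOK_CHARACTERS = 500_000

if __name__ == "__main__":
    for label, accuracy in (("proofed", 0.999999), ("raw OCR", 0.99)):
        errors = expected_errors(BOOK_CHARACTERS, accuracy)
        print(f"{label}: about {errors:.1f} wrong characters per book")
```

Under those assumptions the raw OCR text carries thousands of wrong characters per book while the proofed text carries less than one, which is why the automated EPUBs read so badly even though the underlying scans remain useful for search.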
|||
05-31-2021, 06:31 AM | #7 |
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
|
So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi Concerning PDFs from archive.org, here is a short list (the most relevant) of the threads I consulted before posting this question: https://www.mobileread.com/forums/sh...ht=archive.org https://www.mobileread.com/forums/sh...ht=archive.org The errors in calibre are many, and while some repeat across epubs ("stock" errors), a good deal are new (non-repetitive, "guests"). The book is salvageable if the PDF was reasonably well OCRed; unfortunately, most of the time it was not. It's not the copyright, not the DRM, not the PDF, but rather the defective format of the epub. |
05-31-2021, 09:21 AM | #8 |
the rook, bossing Never.
Posts: 12,349
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
|
05-31-2021, 09:42 AM | #9 | |
Grand Sorcerer
Posts: 6,750
Karma: 86234863
Join Date: Nov 2011
Location: Charlottesville, VA
Device: Kindles
|
Quote:
You don’t say what version of ADE you are using and on which platform. I suspect that the problem is the result of using outdated software on a modern file. |
|
05-31-2021, 09:52 AM | #10 |
Grand Sorcerer
Posts: 6,750
Karma: 86234863
Join Date: Nov 2011
Location: Charlottesville, VA
Device: Kindles
|
I tried the book using ADE 2.0.1 under Windows 10. It didn’t lock up but it did fail to work properly. Paging forward through the book caused it to skip around in the content, frequently jumping back to the beginning.
The book content is in a single fairly large HTML file. That might be too large for the old ADE version to process. Added: I used calibre to convert from EPUB to EPUB and tested the resulting file in ADE 2.0.1. When I disabled splitting of large HTML files, the resulting EPUB failed in ADE the same way as the original EPUB. Enabling splitting resulted in seven smaller HTML files in the EPUB, and that worked properly with ADE. This confirms that the large HTML file (996 KB) in the original EPUB is what causes the problem for ADE 2.0.1. Last edited by jhowell; 05-31-2021 at 11:24 AM. Reason: Add more info |
05-31-2021, 12:37 PM | #11 | |
Wizard
Posts: 2,304
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
I'm betting the problem is the monolithic HTML file: ~900 KB. If you have an older ereader, that would crash (can only handle files of ~300 KB). Like you also figured out, a simple Calibre EPUB->EPUB with file splitting should take care of that issue. Also, the book is laid out in two-column format. Usually, that's incredibly hard to OCR correctly. OCR might think both columns are a single line, so you get half-left/half-right sentences, making the ebook completely unreadable. According to the metadata, looks like they ran it through FineReader 8.0. I ran it through FineReader 12 for you, then created a very rough EPUB. This one should be more accurate + will at least not have all the headers/footers clogging up the text. Note: This book's font also had very low-hanging+round 'g's. OCR thought they were 'O's on their own line, so you'll see lots of those randomly appearing within the EPUB. Last edited by Tex2002ans; 05-31-2021 at 12:45 PM. |
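The EPUB->EPUB conversion with splitting can also be scripted through calibre's command-line tool. This is a minimal sketch, assuming `ebook-convert` is on the PATH and that `--flow-size` (in KB) is the EPUB output option controlling size-based splitting; confirm with `ebook-convert --help` for your calibre version. The file names and the 260 KB value are illustrative assumptions.

```python
import shutil
import subprocess


def split_command(src, dst, flow_size_kb=260):
    """Build the ebook-convert call that re-splits large HTML files.

    A flow size of 0 is understood to disable size-based splitting.
    """
    return ["ebook-convert", src, dst, "--flow-size", str(flow_size_kb)]


def convert(src, dst, flow_size_kb=260):
    """Run the conversion, failing early if calibre is not installed."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("calibre's ebook-convert was not found on PATH")
    subprocess.run(split_command(src, dst, flow_size_kb), check=True)


if __name__ == "__main__":
    # Hypothetical file names; only prints the command that would be run.
    print(" ".join(split_command("original.epub", "split.epub")))
```

Keeping the command builder separate from the runner makes it easy to check what will be executed before touching any files.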
|
05-31-2021, 12:53 PM | #12 | |
the rook, bossing Never.
Posts: 12,349
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
Quote:
I feel I wasted a lot of time and download cap trying to read epubs & mobi from Archive before I realised what they are at (automatic on demand from unproofed PDF OCR layer). If it's too big (like a multicolumn magazine) I'd use the 10" Lenovo tablet or even the laptop if it's not too many pages. |
|
06-01-2021, 03:55 AM | #13 | |||
Fanatic
Posts: 563
Karma: 403106
Join Date: Aug 2014
Device: PRS-T1
|
Quote:
Quote:
Quote:
The ADE 4.5.11.187212 I have reads it perfectly, as far as I see it. |
|||
Tags |
archive.org, epub, error |