Archive.org ePub

Ghitulescu · 05-28-2021, 05:33 AM

I noticed that a lot (I could almost safely say 100%) of ePubs from archive.org from the last 2-3 years or so appear to hang ADE and give various errors in the Calibre 4.xx I use. They do not load in ADE (just now I closed an instance of ADE that had been running for 3 hours trying to load such an ePub and was unresponsive the whole time).

It doesn't matter whether the book is old or new (by printing year); the books I downloaded years ago work OK (they still have some quirks, but rather harmless ones, solved by reconverting to EPUB in calibre) and the recent ones don't.

I spent some hours reading the various complaints about archive.org; it appears that many are unhappy with the PDFs, but the bad quality (incompatibility) of the ePubs has not been raised that often (there is one single thread about it, here).

The ADE is 4.5.11.something and has no problems whatsoever displaying good ePubs I have (mine, made with sigil and calibre, or from other trusted sources).
I did not dare load them on my hardware eReaders for fear of bricking them.

Has anyone encountered this issue?

Quoth · 05-28-2021, 07:52 AM

I find that many archive dot org epub or mobi or PDF are poor unproofed OCR of scans.
Also they don't care about copyright. Likely that's why you have a problem. Their ebooks are often rubbish quality.
So I no longer download from there, only using it to find archives of defunct websites.

Use sites such as gutenberg.org and here that have human-curated and proofed genuine public domain texts, or buy on Smashwords, Kobo, Amazon etc.

Tex2002ans · 05-28-2021, 10:36 PM

Quote:

Originally Posted by Ghitulescu

I noticed that a lot (I could almost safely say 100%) of ePubs from archive.org from the last 2-3 years or so appear to hang ADE and give various errors in the Calibre 4.xx I use.

Can you link to some that are broken?

Quote:

Originally Posted by Ghitulescu

I spent some hours reading the various complaints about archive.org; it appears that many are unhappy with the PDFs

??? First I'm hearing about this. What are the problems with their PDFs?

Quote:

Originally Posted by Ghitulescu

but the bad quality (incompatibility) of the ePubs has not been raised that often (there is one single thread about it, here).

Because the EPUBs (and MOBI, TXT, [...]) are auto-converted by OCR based on the PDF scans.

You'd get better and more accurate results by downloading the PDFs and running your own OCR.

I wrote about some of that here:

"Optimize PDFs from archive.org for E-Ink devices"

and just last month:

"Tutorial-from Paper Book to Ebook PDF - 400 pages in 4 hours"

I wouldn't touch Archive.org "EPUBs" with a ten foot pole though. To call those actual EPUBs is a travesty.

Quote:

Originally Posted by Quoth

Also they don't care about copyright. Likely that's why you have a problem.

What? This is complete hogwash.

Quoth · 05-29-2021, 09:38 AM

I mean the badly OCRed scans are the source of the problem. Not copyright. The Open Library and other copyright shenanigans at Archive have nothing to do with the ghastly mobi/epub quality. They have been scanning paper books themselves for about 12 years, as well as sourcing from Google, Microsoft and uploaders. The problem is that none of it is human curated or proofed. It's automated.

I just set up a Linux box with a 20 year old Epson Perfection 1200 on SCSI and Tesseract and gocr* last night. The newish funky colour laser printer-copier-scanner is not obviously better and is also downstairs.
I have some 1890s to 1920s books, but likely I'm more interested in OCR of PD PDFs already scanned elsewhere.

Yes, I know about ABBYY FineReader. But I don't have it.

I couldn't find any sort of SCSI adaptor for the laptop. I used to have a PCMCIA card and a laptop that could take them.

[* Xsane seems to want gocr, but 15 years ago I would have saved the scans, adjusted in PaintShopPro and used the OCR on files. I can't imagine why I do it from inside Xsane, even though I have a sheetfeeder]

salamanderjuice · 05-29-2021, 11:07 AM

One issue I've had with their PDFs is they don't do any sort of correction for yellowed pages so on a B&W eReader they can look like serious junk with banding in the background. Other than that it's fine.

I also can't really blame them for automating this stuff, they just have way too much content and often it's the only place to get it on the web. I needed a chapter from some 70 year old niche book recently and they had it, only other option was a university library 6 hours away that was closed anyways due to COVID.

Tex2002ans · 05-29-2021, 08:52 PM

Quote:

Originally Posted by salamanderjuice

One issue I've had with their PDFs is they don't do any sort of correction for yellowed pages so on a B&W eReader they can look like serious junk with banding in the background. Other than that it's fine.

Yep, their automatic Color->B&W doesn't work well for all books. (Though most do perfectly fine.)

But the great thing about Archive.org is they release all the source files.

So if you have problems with the B&W PDF, then instead download the:

Color PDF
Original source images [JPEG2000]

If you check out Post #4+#6 in that Tutorial thread, I showed the why/how.

You can then use Scan Tailor Advanced in order to correct "yellowed pages" -> B&W. Using that allows you to tweak all the variables to get a much better/cleaner B&W image.

* * *

And they're always tweaking their workflows.

Like in December 2020, they rescanned/rereleased the entire "Computerworld" magazine from microfilm:

https://blog.archive.org/2020/12/30/...age-microfilm/

Microfilm scanning technology has gotten much better since it was first digitized, so now a much higher quality release is available.

Quote:

Originally Posted by salamanderjuice

I also can't really blame them for automating this stuff, they just have way too much content and often it's the only place to get it on the web. I needed a chapter from some 70 year old niche book recently and they had it, only other option was a university library 6 hours away that was closed anyways due to COVID.

Like GrannyGrump's conversion of the original Sweeney Todd story: "The String of Pearls":

https://www.mobileread.com/forums/sh...d.php?t=299744
https://archive.org/details/stringof...e/n13/mode/2up

I think that book was locked away in Oxford University, one of the only copies left in the world, and it's not even available to the public.

Now because of Archive.org, the entire world can read it.

Quote:

Originally Posted by Quoth

The problem is that none of it is human curated or proofed. It's automated.

Yeah, but the scale is on a completely different level.

99.9999% accuracy on a few hundred (maybe thousands) of books per year on Gutenberg.

vs.

99% OCR accuracy on millions of books. (And all original source files are available.)
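To put those two percentages in perspective, here is a rough back-of-the-envelope sketch. It assumes the figures are per-character accuracy and that a page holds about 2,000 characters; both are assumptions for illustration, not numbers from this thread.

```python
def expected_errors(accuracy_percent, chars=2000):
    """Expected number of wrong characters among `chars` characters.

    `chars=2000` is an assumed page length, used only for illustration.
    """
    return chars * (100 - accuracy_percent) / 100


# Wrong characters per assumed 2,000-character page:
print(expected_errors(99))       # automated OCR figure quoted above
print(expected_errors(99.9999))  # proofread figure quoted above
```

Under those assumptions the difference is roughly twenty typos on every page versus one typo every few hundred pages.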

And the scope is different too:

Sure, you get the nice ebooks (I really wish Gutenberg released the original PDFs though)...

But Archive.org is actually about making the works available/searchable. (NOT automating perfect ebooks. Those converted formats are just a side addition.)

Ghitulescu · 05-31-2021, 06:31 AM

So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi

Concerning PDFs from archive.org, here is a short list (most relevant) of the threads I have consulted before posting this question:
https://www.mobileread.com/forums/sh...ht=archive.org
https://www.mobileread.com/forums/sh...ht=archive.org

The errors in calibre are many, and while some repeat across epubs ("stock" errors), a good deal are new (non-repetitive, "guests"). The book is salvageable if the PDF was OCRed reasonably well; unfortunately, mostly it was not.

It's not the copyright, not the DRM, not the PDF but rather the defective format of the epub.

Quoth · 05-31-2021, 09:21 AM

Quote:

Originally Posted by Ghitulescu

not the PDF but rather the defective format of the epub.

It's because the epub is from a scan with bad OCR.
Use the PDF, or the image and if need be do your own OCR. The epub/mobi on archive org are rubbish.

jhowell · 05-31-2021, 09:42 AM

Quote:

Originally Posted by Ghitulescu

So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi

While it does contain numerous OCR errors, that book appears to be a properly structured EPUB 3. It passes EpubCheck with no errors.
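For a quick sanity check before reaching for EpubCheck, a minimal Python sketch could test just two container-level rules: that the first ZIP entry is an uncompressed `mimetype` file containing `application/epub+zip`, and that `META-INF/container.xml` is present. The function name `basic_epub_problems` is made up for illustration, and this is in no way a substitute for EpubCheck, which checks far more.

```python
import zipfile


def basic_epub_problems(path):
    """Return a list of container-level problems found in the EPUB at `path`.

    Hypothetical helper: it checks only the mimetype entry and the presence
    of META-INF/container.xml, nothing else.
    """
    problems = []
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()
        if not infos or infos[0].filename != "mimetype":
            problems.append("first ZIP entry is not 'mimetype'")
        else:
            if infos[0].compress_type != zipfile.ZIP_STORED:
                problems.append("'mimetype' entry is compressed")
            if zf.read("mimetype") != b"application/epub+zip":
                problems.append("'mimetype' content is not application/epub+zip")
        if "META-INF/container.xml" not in zf.namelist():
            problems.append("META-INF/container.xml is missing")
    return problems
```

An empty list only means these two checks passed; a file can pass them (and EpubCheck) and still be unreadable because of the OCR.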

You don’t say what version of ADE you are using and on which platform. I suspect that the problem is the result of using outdated software on a modern file.

jhowell · 05-31-2021, 09:52 AM

I tried the book using ADE 2.0.1 under Windows 10. It didn’t lock up but it did fail to work properly. Paging forward through the book caused it to skip around in the content, frequently jumping back to the beginning.

The book content is in a single fairly large HTML file. That might be too large for the old ADE version to process.

Added: I used calibre to convert from EPUB to EPUB and tested the resulting file in ADE 2.0.1. When I disabled splitting of large HTML files, the resulting EPUB failed in ADE the same way as the original EPUB. Enabling splitting resulted in seven smaller HTML files in the EPUB, and that worked properly with ADE. This confirms that the large HTML file (996KB) in the original EPUB is what causes the problem for ADE 2.0.1.
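To see whether a given EPUB has the same kind of oversized HTML file without opening it in a reader, something like the following Python sketch could list the large ones. An EPUB is a ZIP archive, so the standard `zipfile` module is enough. The function name `oversized_html` and the 300 KB default are assumptions for illustration, not a documented ADE limit.

```python
import zipfile

HTML_SUFFIXES = (".html", ".xhtml", ".htm")


def oversized_html(path, limit_kb=300):
    """Return (name, size_in_kb) for each HTML file in the EPUB over `limit_kb`.

    The 300 KB default is only an assumed rule of thumb for older readers.
    """
    hits = []
    with zipfile.ZipFile(path) as zf:
        for info in zf.infolist():
            if info.filename.lower().endswith(HTML_SUFFIXES):
                size_kb = info.file_size / 1024  # uncompressed size
                if size_kb > limit_kb:
                    hits.append((info.filename, round(size_kb)))
    return hits
```

Any file this reports is a candidate for splitting during a calibre EPUB to EPUB conversion.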

Tex2002ans · 05-31-2021, 12:37 PM

Quote:

Originally Posted by Ghitulescu

So, this is one of the offenders (but almost every single epub locks my ADE)
https://archive.org/details/russoturkishwari01hozi

Yep, you most likely figured it out.

I'm betting the problem is the monolithic HTML file: ~900 KB. If you have an older ereader, that would crash it (they can only handle files of ~300 KB).

Like you also figured out, a simple Calibre EPUB->EPUB with file splitting should take care of that issue.
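For those who prefer the command line to the calibre GUI, the same EPUB->EPUB conversion could be scripted roughly like this. It is a sketch that assumes calibre's `ebook-convert` tool is on the PATH. The `--flow-size` option of calibre's EPUB output sets the size in KB above which HTML files get split. The helper names and the file names in the usage note are placeholders.

```python
import subprocess


def split_epub_command(src, dst, flow_size_kb=260):
    """Build an ebook-convert command that re-converts EPUB to EPUB.

    --flow-size asks calibre's EPUB output to split HTML files larger than
    the given size in KB; 260 is used here as an assumed starting value.
    """
    return ["ebook-convert", src, dst, "--flow-size", str(flow_size_kb)]


def split_epub(src, dst, flow_size_kb=260):
    """Run the conversion; requires calibre's ebook-convert on the PATH."""
    subprocess.run(split_epub_command(src, dst, flow_size_kb), check=True)
```

Calling `split_epub("original.epub", "split.epub")` would write a new file and leave the original untouched.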

Also, the book is laid out in two-column format. Usually, that's incredibly hard to OCR correctly. OCR might think both columns are a single line, so you get half-left/half-right sentences, making the ebook completely unreadable.
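A toy illustration of that two-column failure, in Python: the same four lines read straight across the page versus one column at a time. The sample text is invented and the function names are hypothetical; real OCR layout analysis is far more involved than this.

```python
def read_across(left, right):
    """Naive order: treat each printed row as one line spanning both columns."""
    return [f"{a} {b}" for a, b in zip(left, right)]


def read_by_column(left, right):
    """Column-aware order: the whole left column, then the whole right one."""
    return list(left) + list(right)


# Invented sample text, not taken from the book discussed in the thread.
left_column = ["The army crossed", "the river at dawn"]
right_column = ["Supplies ran low", "by the third week"]

print(" ".join(read_across(left_column, right_column)))
print(" ".join(read_by_column(left_column, right_column)))
```

The first line printed splices the two columns together mid-sentence, which is the half-left/half-right effect described above.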

According to the metadata, it looks like they ran it through FineReader 8.0.

I ran it through FineReader 12 for you, then created a very rough EPUB. This one should be more accurate + will at least not have all the headers/footers clogging up the text.

Note: This book's font also had very low-hanging+round 'g's. OCR thought they were 'O's on their own line, so you'll see lots of those randomly appearing within the EPUB.

Quoth · 05-31-2021, 12:53 PM

Quote:

Originally Posted by jhowell

The book content is in a single fairly large HTML file. That might be too large for the old ADE version to process.
* * *
Added: I used calibre to convert from EPUB to EPUB and tested the resulting file in ADE 2.0.1. When I disabled splitting of large HTML files, the resulting EPUB failed in ADE the same way as the original EPUB. Enabling splitting resulted in seven smaller HTML files in the EPUB, and that worked properly with ADE. This confirms that the large HTML file (996KB) in the original EPUB is what causes the problem for ADE 2.0.1.

Because it's simply the automated OCR layer automatically converted to epub with no rules to find breaks and create separate files. If I can't find a real ebook of a PD text on Archive I download the PDF. The 7.8" Mars with the autocrop on margins is better for PDFs than 9.7" DXG, kindle PW3 or Kobo Libra. Much faster too.
I feel I wasted a lot of time and download cap trying to read epubs & mobi from Archive before I realised what they are (generated automatically on demand from the unproofed PDF OCR layer).
If it's too big (like a multicolumn magazine) I'd use the 10" Lenovo tablet or even the laptop if it's not too many pages.

Ghitulescu · 06-01-2021, 03:55 AM

Quote:

Originally Posted by jhowell

You don’t say what version of ADE you are using and on which platform. I suspect that the problem is the result of using outdated software on a modern file.

Quote:

Originally Posted by Ghitulescu

The ADE is 4.5.11.something and has no problems whatsoever displaying good ePubs I have (mine, made with sigil and calibre, or from other trusted sources).

ADE 4.5.11.187212 (strangely, I cannot directly copy this information and had to write it down by hand as before Gutenberg).

Quote:

Originally Posted by Tex2002ans

I'm betting the problem is the monolithic HTML file: ~900 KB. If you have an older ereader, that would crash it (they can only handle files of ~300 KB).

Like you also figured out, a simple Calibre EPUB->EPUB with file splitting should take care of that issue.

[...]

I ran it through FineReader 12 for you, then created a very rough EPUB. This one should be more accurate + will at least not have all the headers/footers clogging up the text.

Well, there is a night-and-day difference, I would say. Thank you for your effort.

The ADE 4.5.11.187212 I have reads it perfectly, as far as I see it.



