Old Today, 03:56 AM   #1
Shohreh
Groupie
Posts: 164
Karma: 248528
Join Date: Jan 2016
Device: none
Question [SOLVED] Extract HTML files, fix, re-import?

Hello,

I need to fix typos in an EPUB file that actually originated from the "OCR layer" in a PDF. The EPUB was created by Abbyy Finereader.

For some reason, LibreOffice can't open the file* while Sigil can after complaining**… but the latter doesn't seem to have an English/French spellchecker.

As an alternative, and provided LO's spellchecker is good enough, I'm thinking of 1) extracting the HTML files from the EPUB, 2) opening them in LO to fix errors, and 3) re-importing them into the EPUB to replace the originals.

Before I experiment, should it work?

Thank you.


* LibreOffice: "The file 'output.epub' is corrupt and therefore cannot be opened. LibreOffice can try to repair the file. The corruption could be the result of document manipulation or of structural document damage due to data transmission. We recommend that you do not trust the content of the repaired document. Execution of macros is disabled for this document." followed by "The file 'output.epub' could not be repaired and therefore cannot be opened."

** Sigil: "Warning: This EPUB had HTML files that were not well formed or are missing a DOCTYPE, html, head or body elements. They were automatically fixed based on your Preference setting to Clean on Open."

--
Edit: Done. I found LibreOffice's spellchecker to be very good, and simple to use: Hit F7 to display the dialog box and run it, edit the typo in the text, click back on the dialog to hit its Resume button, repeat.

To edit the files:
  1. Make a copy of the original EPUB, replacing its .epub extension with .zip
  2. Unzip the HTML files
  3. Edit in LO
  4. In the zip, update the HTML files
  5. Rename the extension from .zip to .epub
  6. Voilà!

---
Edit: False hope. The file crashes when opened in my e-reader. After opening it in Sigil, OKing the error message mentioned above, saving a new version, and opening that in my e-reader… it still crashes. Could there be some references that must be updated in e.g. toc.ncx, content.opf, and/or .\META-INF\container.xml?

Are those references in toc.ncx generated by hashing the files, so that they must be regenerated somehow after editing?

Code:
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="fr-FR">
<head><meta name="dtb:uid" content="Id_BC4AD4AE-749C-47E3-A5CE-DC60A911137F_"/></head>
<docTitle><text>Some title</text></docTitle>
<navMap>
<navPoint playOrder="1" id="Id_C492A9F4-4D63-47EE-9234-532796CC95EA_"><navLabel><text>Préface</text></navLabel><content src="main-1.xhtml"/></navPoint>
<navPoint playOrder="2" id="Id_E03E00CF-3CDD-48E5-8311-8F6DE332770F_"><navLabel><text>Avant-propos</text></navLabel><content src="main-2.xhtml"/></navPoint>
<navPoint playOrder="3" id="Id_C4FDB976-C07F-448A-BE62-4560285EEEAA_"><navLabel><text>PREMIÈRE PARTIE</text></navLabel><content src="main-3.xhtml"/></navPoint>
<navPoint playOrder="4" id="Id_A88231ED-D7FC-4804-8A24-4351BE895135_"><navLabel><text>Blah</text></navLabel><content src="main-4.xhtml"/></navPoint>
<navPoint playOrder="5" id="Id_C1C631F1-2D1C-485A-B140-AE634C0473CE_"><navLabel><text>TROISIÈME PARTIE</text></navLabel><content src="main-5.xhtml"/></navPoint>
<navPoint playOrder="6" id="Id_3248876C-83BE-4A77-9C11-44E6BC0C1AFB_"><navLabel><text>Blah</text></navLabel><content src="main-6.xhtml"/></navPoint>
<navPoint playOrder="7" id="Id_D6E9CD24-5C8D-4CCC-8F32-4793DEA2717B_"><navLabel><text>Blah</text></navLabel><content src="main-7.xhtml"/></navPoint>
<navPoint playOrder="8" id="Id_E57324AE-A971-4076-A55C-C5CB583F7831_"><navLabel><text>Blah</text></navLabel><content src="main-8.xhtml"/></navPoint>
<navPoint playOrder="9" id="Id_64E0916D-A264-4267-892B-857FDBF415F7_"><navLabel><text>Blah</text></navLabel><content src="main-9.xhtml"/></navPoint>
<navPoint playOrder="10" id="Id_F38A984B-B739-4CF2-B943-B87D94524516_"><navLabel><text>Table des figures et tableaux</text></navLabel><content src="main-10.xhtml"/></navPoint>
</navMap></ncx>
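
In the NCX above, each content src is an ordinary relative path and each id is just an XML identifier, so nothing in it is derived from the file contents. A minimal Python sketch to check that every src still points at a file inside the EPUB could look like this. The function name and the default location of toc.ncx at the root of the archive are assumptions for illustration; in many EPUBs the NCX lives in a subfolder.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile

NCX_NS = "{http://www.daisy.org/z3986/2005/ncx/}"


def missing_ncx_targets(epub_path, ncx_name="toc.ncx"):
    """Return the content src values in the NCX that match no ZIP entry.

    ncx_name is an assumption: pass the real path of the NCX inside the
    archive if it is not stored at the root.
    """
    with zipfile.ZipFile(epub_path) as zf:
        names = set(zf.namelist())
        root = ET.fromstring(zf.read(ncx_name))

    base = posixpath.dirname(ncx_name)
    missing = []
    for content in root.iter(NCX_NS + "content"):
        # Drop any "#fragment" so only the file part is compared.
        src = content.get("src", "").split("#")[0]
        target = posixpath.normpath(posixpath.join(base, src))
        if target not in names:
            missing.append(src)
    return missing
```

An empty list means every navPoint target is present under its expected name. It says nothing about whether the XHTML inside those files is still valid.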
---
Edit: epubcheck shows (non-fatal) errors in the EPUB generated by Abbyy, and fatal errors after I 1) extracted, 2) edited (in LibreOffice), and 3) replaced the XHTML files in the EPUB :-/ The file displays OK in SumatraPDF, but crashes my e-reader.

Code:
-------- EPUB from PDF by Abbyy:
<repInfo uri="From.Abbyy.epub">
	<format>application/epub+zip</format>
	<version>2.0.1</version>
	<status>Not well-formed</status>
	<messages>
		 <message id="PKG-005" severity="error">PKG-005, ERROR, [The mimetype file has an extra field of length 17. The use of the extra field feature of the ZIP format is not permitted for the mimetype file.], /c:/From.Abbyy.epub</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: element "a" not allowed here; expected the element end-tag or element "address", "blockquote", "del", "div", "dl", "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ins", "noscript", "ns:svg", "ol", "p", "pre", "script", "table" or "ul" (with xmlns:ns="http://www.w3.org/2000/svg")], main.xhtml (17-135)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: element "ul" not allowed here; expected the element end-tag or element "li"], main.xhtml (20-193)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: The "a" element cannot contain any nested "a" elements.], main-2.xhtml (48-22)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote2"], main-2.xhtml (8-526)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote8"], main-2.xhtml (17-476)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote8"], main-2.xhtml (17-1374)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote3"], main-4.xhtml (12-817)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote15"], main-6.xhtml (18-1233)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote5"], main-7.xhtml (11-419)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote5"], main-7.xhtml (13-139)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote24"], main-7.xhtml (30-759)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote27"], main-7.xhtml (33-426)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote23"], main-9.xhtml (24-880)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote1"], main-13.xhtml (10-354)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote6"], main-13.xhtml (24-1021)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote20"], main-13.xhtml (54-252)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote25"], main-14.xhtml (21-278)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote44"], main-14.xhtml (43-1912)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote60"], main-15.xhtml (43-573)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote90"], main-15.xhtml (65-1401)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote12"], main-17.xhtml (19-1466)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: Duplicate "footnote14"], main-17.xhtml (24-1397)</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: value of attribute "id" is invalid; must be an XML name without colons], toc.ncx (6-69)</message>
	</messages>
</repInfo>

-------- EPUB after editing in LibreOffice:
<repInfo uri="From.Abbyy.EDITED.epub">
	<format>application/epub+zip</format>
	<version>2.0.1</version>
	<status>Not well-formed</status>
	<messages>
		 <message id="RSC-016" severity="error">RSC-016, FATAL, [Fatal Error while parsing file: The entity "nbsp" was referenced, but not declared.], main-3.xhtml (59-33)</message>
		 <message id="HTM-004" severity="error">HTM-004, ERROR, [Irregular DOCTYPE: found "", expected "&lt;!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" 
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"&gt;".], main-1.xhtml</message>
		 <message id="RSC-005" severity="error">RSC-005, ERROR, [Error while parsing file: elements from namespace "" are not allowed], main-1.xhtml (2-7)</message>
	</messages>
</repInfo>
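
The fatal RSC-016 message comes from the fact that XML itself only predefines amp, lt, gt, quot and apos; any other named entity such as nbsp has to be declared by a DTD. One possible workaround, sketched below in Python, is to rewrite named entities as numeric character references, which need no declaration. The function name is hypothetical, and this only addresses the entity error, not the missing DOCTYPE or the namespace error reported for main-1.xhtml.

```python
import re
from html.entities import name2codepoint

# The five entities that XML predefines; these must be left untouched.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(xhtml_text):
    """Replace HTML named entities (e.g. &nbsp;) with numeric references."""

    def swap(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # keep unknown or predefined entities as-is
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", swap, xhtml_text)
```

It would be applied to the text of each edited XHTML file before the files go back into the EPUB.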

Last edited by Shohreh; Today at 09:00 AM.
Old Today, 11:48 AM   #2
KevinH
Sigil Developer
Posts: 8,018
Karma: 5449552
Join Date: Nov 2009
Device: many
Just use Sigil to open the original epub. You can then run the SpellCheck dialog in multiple languages at once (if the xhtml properly uses lang attributes). And yes, Sigil uses the same hunspell spellchecker system that LibreOffice uses, and it comes with both English and French dictionaries.

And you can run epubcheck as a plugin and use its Validation Window to find and fix errors.

Alternatively, you can use any unzip tool to open any epub, then edit the files to your heart's content. Then use Sigil's FolderIn plugin to load the epub's files, run the epubcheck plugin, find and fix errors, then save the epub properly (mimetype file stored first and uncompressed, then the remaining files added).
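
For anyone who wants to see what "saved properly" means outside Sigil, here is a minimal Python sketch of that packing order using the standard zipfile module. The function name and folder layout are assumptions for illustration; it is not how Sigil or FolderIn does it internally, and it does no validation.

```python
import os
import zipfile


def pack_epub(src_dir, epub_path):
    """Zip an unpacked EPUB folder with the mimetype entry first and stored.

    epub_path should be outside src_dir, otherwise the archive being
    written would be picked up by the directory walk.
    """
    with zipfile.ZipFile(epub_path, "w") as zf:
        # First entry: uncompressed, no extra field, no trailing newline.
        zf.writestr(
            zipfile.ZipInfo("mimetype"),
            b"application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        for folder, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(folder, name)
                arc = os.path.relpath(full, src_dir).replace(os.sep, "/")
                if arc == "mimetype":
                    continue  # already written above
                zf.write(full, arc, compress_type=zipfile.ZIP_DEFLATED)
```

Writing the mimetype from a fresh ZipInfo rather than from the file on disk keeps that entry free of the ZIP extra field that epubcheck flags as PKG-005.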

But just using Sigil to do the spellchecking and editing is easier.

Last edited by KevinH; Today at 12:10 PM.
Old Today, 01:27 PM   #3
Shohreh
Groupie
Posts: 164
Karma: 248528
Join Date: Jan 2016
Device: none
Thanks.

For some reason, Sigil's spellcheck dialog remains empty. I'll read up on how to use it.

Turns out LibreOffice can generate an EPUB, but can't read or edit one. A workaround that avoids Sigil is simply to create a DOCX from the PDF, work on it in LO, and then output a new EPUB.
Old Today, 01:33 PM   #4
KevinH
Sigil Developer
Posts: 8,018
Karma: 5449552
Join Date: Nov 2009
Device: many
Have you set the default Spellcheck languages in Preferences? Does your OPF have the correct dc:language metadata set? Does your xhtml properly use xml:lang and lang attributes on the html tag?
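
A quick way to answer the last question for a given file is to look at the attributes on the root html element. The Python sketch below does that with the standard library; the function name is hypothetical, and it assumes the file parses as XML (a file with undeclared entities such as &nbsp; would raise a parse error first).

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def html_lang_attributes(xhtml):
    """Return (lang, xml:lang) from the root element, None where absent."""
    root = ET.fromstring(xhtml)
    return root.get("lang"), root.get(XML_LANG)
```

If either value comes back as None, that attribute is missing from the html tag in that file.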
Old Today, 02:24 PM   #5
Shohreh
Groupie
Posts: 164
Karma: 248528
Join Date: Jan 2016
Device: none
Yes to the first question.

There's no OPF in the EPUB file.

I'll google for a tutorial on how to set things up, as I don't know much about Sigil, and never used its spellchecker.
Attached thumbnail: F6FDDFB7-3BD0-46AB-A387-8527AEF3A01D.png (89.8 KB)
Old Today, 02:36 PM   #6
Karellen
Wizard
Posts: 1,283
Karma: 6700678
Join Date: Sep 2021
Location: Australia
Device: Kobo Libra 2
Quote:
Originally Posted by Shohreh View Post
There's no OPF in the EPUB file.
What about that last file in the list: content.opf?
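
If in doubt about which file is the OPF, META-INF/container.xml names it in the full-path attribute of its rootfile element. A small Python sketch that follows that pointer and lists the dc:language values could look like this; the function name is hypothetical, and error handling is kept to a minimum.

```python
import xml.etree.ElementTree as ET
import zipfile

CONTAINER_NS = "{urn:oasis:names:tc:opendocument:xmlns:container}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def opf_languages(epub_path):
    """Return (path of the OPF inside the EPUB, list of dc:language values)."""
    with zipfile.ZipFile(epub_path) as zf:
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = container.find(".//" + CONTAINER_NS + "rootfile")
        if rootfile is None:
            raise ValueError("container.xml has no rootfile element")
        opf_path = rootfile.get("full-path")
        package = ET.fromstring(zf.read(opf_path))

    languages = [el.text for el in package.iter(DC_NS + "language")]
    return opf_path, languages
```

An empty language list would mean the OPF has no dc:language metadata at all.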
All times are GMT -4. The time now is 03:23 PM.

