Extract HTML files, fix, re-import?

Shohreh · Today, 03:56 AM

Hello,

I need to fix typos in an EPUB file that originated from the "OCR layer" of a PDF. The EPUB was created by ABBYY FineReader.

For some reason, LibreOffice can't open the file* while Sigil can after complaining**… but the latter doesn't seem to have an English/French spellchecker.

As an alternative, and provided LO's spellchecker is good enough, I'm thinking of 1) extracting the HTML files from the EPUB, 2) opening them in LO to fix errors, and 3) re-importing them to replace the HTML files in the EPUB.

Before I experiment, should it work?

Thank you.

* LibreOffice: "The file 'output.epub' is corrupt and therefore cannot be opened. LibreOffice can try to repair the file. The corruption could be the result of document manipulation or of structural document damage due to data transmission. We recommend that you do not trust the content of the repaired document. Execution of macros is disabled for this document." followed by "The file 'output.epub' could not be repaired and therefore cannot be opened."

** Sigil: "Warning: This EPUB had HTML files that were not well formed or are missing a DOCTYPE, html, head or body elements. They were automatically fixed based on your Preference setting to Clean on Open."

--
Edit: Done. I found LibreOffice's spellchecker to be very good, and simple to use: Hit F7 to display the dialog box and run it, edit the typo in the text, click back on the dialog to hit its Resume button, repeat.

To edit the files:

1) Make a copy of the original EPUB, replacing its .epub extension with .zip
2) Unzip the HTML files
3) Edit them in LO
4) In the zip, update the HTML files
5) Rename the extension from .zip to .epub
Voilà!
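
For what it's worth, one thing that can go wrong at the re-zipping step: the EPUB container format expects the "mimetype" file to be the first entry in the archive and to be stored uncompressed, and a general-purpose zip tool may not preserve that. Whether this is what trips up my e-reader is only a guess. Here is a minimal Python sketch of repacking an unzipped EPUB folder with that constraint respected; the function name repack_epub and the folder/file names are made up for illustration.

```python
import zipfile
from pathlib import Path


def repack_epub(src_dir, out_path):
    """Zip an unpacked EPUB folder, writing 'mimetype' first and uncompressed.

    src_dir  -- folder holding the unzipped EPUB (hypothetical name)
    out_path -- .epub file to create; keep it outside src_dir
    """
    src = Path(src_dir)
    with zipfile.ZipFile(out_path, "w") as zf:
        # The container format wants this entry first and stored, not deflated.
        zf.writestr(
            zipfile.ZipInfo("mimetype"),
            "application/epub+zip",
            compress_type=zipfile.ZIP_STORED,
        )
        # Every other file can be compressed normally.
        for path in sorted(src.rglob("*")):
            arcname = path.relative_to(src).as_posix()
            if path.is_file() and arcname != "mimetype":
                zf.write(path, arcname, compress_type=zipfile.ZIP_DEFLATED)
```

Usage would be something like repack_epub("output_unzipped", "output_fixed.epub"), with both names being placeholders. Writing the result outside the source folder avoids the new archive being picked up while the folder is walked.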

---
Edit: False hope. The file crashes when opened in my e-reader. After opening it in Sigil, accepting the error message mentioned above, saving a new version, and opening that in my e-reader… it still crashes. Could there be some references that must be updated in e.g. toc.ncx, content.opf, and/or .\META-INF\container.xml?
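
As a rough way to check those references, here is a Python sketch that follows META-INF/container.xml to the OPF file and reports manifest entries whose target is not in the archive. It assumes the usual layout (a rootfile element with a full-path attribute, and manifest items with an href attribute); the function name missing_manifest_items is made up. It only checks that files exist, not that their content is valid.

```python
import posixpath
import xml.etree.ElementTree as ET
import zipfile
from urllib.parse import unquote

CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}
OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}


def missing_manifest_items(epub_path):
    """Return manifest targets (archive paths) that are absent from the EPUB."""
    with zipfile.ZipFile(epub_path) as zf:
        names = set(zf.namelist())
        # container.xml points at the package (OPF) file.
        container = ET.fromstring(zf.read("META-INF/container.xml"))
        opf_path = container.find(".//c:rootfile", CONTAINER_NS).get("full-path")
        opf = ET.fromstring(zf.read(opf_path))
        # Manifest hrefs are relative to the folder holding the OPF file.
        base = posixpath.dirname(opf_path)
        missing = []
        for item in opf.iterfind("opf:manifest/opf:item", OPF_NS):
            target = posixpath.normpath(posixpath.join(base, unquote(item.get("href"))))
            if target not in names:
                missing.append(target)
        return missing
```

An empty list would suggest the manifest still matches the files in the archive, in which case the problem is more likely elsewhere (for instance in how the archive was zipped or in the edited markup).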

Are those references in toc.ncx generated by hashing the files, so that they must be regenerated somehow after editing?

Code:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1" xml:lang="fr-FR">
<head><meta name="dtb:uid" content="Id_BC4AD4AE-749C-47E3-A5CE-DC60A911137F_"/></head>
<docTitle><text>Some title</text></docTitle>
<navMap>
<navPoint playOrder="1" id="Id_C492A9F4-4D63-47EE-9234-532796CC95EA_"><navLabel><text>Préface</text></navLabel><content src="main-1.xhtml"/></navPoint>
<navPoint playOrder="2" id="Id_E03E00CF-3CDD-48E5-8311-8F6DE332770F_"><navLabel><text>Avant-propos</text></navLabel><content src="main-2.xhtml"/></navPoint>
<navPoint playOrder="3" id="Id_C4FDB976-C07F-448A-BE62-4560285EEEAA_"><navLabel><text>PREMIÈRE PARTIE</text></navLabel><content src="main-3.xhtml"/></navPoint>
<navPoint playOrder="4" id="Id_A88231ED-D7FC-4804-8A24-4351BE895135_"><navLabel><text>Blah</text></navLabel><content src="main-4.xhtml"/></navPoint>
<navPoint playOrder="5" id="Id_C1C631F1-2D1C-485A-B140-AE634C0473CE_"><navLabel><text>TROISIÈME PARTIE</text></navLabel><content src="main-5.xhtml"/></navPoint>
<navPoint playOrder="6" id="Id_3248876C-83BE-4A77-9C11-44E6BC0C1AFB_"><navLabel><text>Blah</text></navLabel><content src="main-6.xhtml"/></navPoint>
<navPoint playOrder="7" id="Id_D6E9CD24-5C8D-4CCC-8F32-4793DEA2717B_"><navLabel><text>Blah</text></navLabel><content src="main-7.xhtml"/></navPoint>
<navPoint playOrder="8" id="Id_E57324AE-A971-4076-A55C-C5CB583F7831_"><navLabel><text>Blah</text></navLabel><content src="main-8.xhtml"/></navPoint>
<navPoint playOrder="9" id="Id_64E0916D-A264-4267-892B-857FDBF415F7_"><navLabel><text>Blah</text></navLabel><content src="main-9.xhtml"/></navPoint>
<navPoint playOrder="10" id="Id_F38A984B-B739-4CF2-B943-B87D94524516_"><navLabel><text>Table des figures et tableaux</text></navLabel><content src="main-10.xhtml"/></navPoint>
</navMap></ncx>
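
Looking at the listing, the id values look like GUID-style unique identifiers rather than hashes of the file contents, and each navPoint is tied to its file only through the src path. If that reading is right, editing text inside main-N.xhtml should not invalidate toc.ncx as long as the file names stay the same. Here is a small Python sketch, under that assumption, that lists the src targets and flags any that are missing from the archive; the function names are made up.

```python
import posixpath
import xml.etree.ElementTree as ET

NCX_NS = {"ncx": "http://www.daisy.org/z3986/2005/ncx/"}


def ncx_targets(ncx_bytes):
    """Return the file part of every navPoint's content src, in document order."""
    root = ET.fromstring(ncx_bytes)
    contents = root.iterfind(".//ncx:navPoint/ncx:content", NCX_NS)
    # Drop any '#fragment' so only the file name is compared.
    return [c.get("src").split("#", 1)[0] for c in contents]


def missing_targets(ncx_bytes, ncx_path, archive_names):
    """Return src targets that do not resolve to a name in the archive.

    ncx_path      -- path of toc.ncx inside the EPUB, e.g. "toc.ncx"
    archive_names -- names in the EPUB, e.g. from zipfile.ZipFile.namelist()
    """
    base = posixpath.dirname(ncx_path)
    names = set(archive_names)
    return [
        t for t in ncx_targets(ncx_bytes)
        if posixpath.normpath(posixpath.join(base, t)) not in names
    ]
```

With the EPUB opened via zipfile.ZipFile, this would be called as missing_targets(zf.read("toc.ncx"), "toc.ncx", zf.namelist()), adjusting the path if toc.ncx sits in a subfolder.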

Last edited by Shohreh; Today at 05:33 AM.
