Can't convert from zip/html to epub

Blrp · 08-02-2019, 04:29 PM

I have a .zip that contains a bunch of .html files. When I add it as a book in Calibre and convert to .epub, it just converts index.html to epub and ignores the rest. When I go to cmd and enter

Code:

ebook-convert index.html book.epub

it just says

Code:

IgnoreFile(u'blah.html is a binary file',)

a bunch of times and gives me the same result as before.

The .zip is from Runeberg, just go here and download the one with "All HTML files".

jackie_w · 08-02-2019, 05:27 PM

I'm guessing you added the downloaded .zip file to the calibre GUI as a book format.

Instead, unzip the .zip file on your PC then add just the index.html file as a book to calibre GUI. calibre will use index.html to pull in all the other .html files and create its own zip book format. Once you've done that do a calibre zip-to-whatever conversion.

I just did a zip-to-epub conversion and it converted OK for me - even if the epub does look a bit primitive due to no styling.

theducks · 08-02-2019, 09:46 PM

If the zip is not converting properly, I suspect the contents were added from different paths than index.html specified, so it can't find stuff where it thinks it is supposed to be.

1) Fix the paths in index
or
2) Add each file to an editor session, setting the order (if needed) in the file list. Re-link images as needed.

DNSB · 08-03-2019, 01:04 AM

I downloaded the file, converted it to epub and opened it in Sigil, which was not very happy with it since quite a few files were not in the manifest's spine section. Looking at that segment in content.opf showed 4 files. I used the Modify Epub plugin to add unmanifested files to the manifest and that cleared up those errors.

The easier answer was to unpack the .zip file into a temp directory and then add the index.html file to calibre. Calibre will then parse that file and drag in the other files which are referenced. This file opened with a couple of minor errors from k9.html where there is a chunk of text wrapped in <blockquote></blockquote> tags without a block tag. Simply adding a <p> and </p> corrected that issue.

Blrp · 08-03-2019, 07:35 AM

Thanks for the help, guys. The files were put in the ePub in a seemingly random order; this was easy (though tedious) to fix through editing, but can I get the conversion to sort them properly? The files appear in the correct order when sorted by name in Windows (e.g. k9 comes before k10), and they also appear in the correct order in index.html, but Calibre decided the most logical order was k54 -> k53 -> k52c -> k0a -> k0b -> k1a -> ... -> k7 -> k33c -> k43b -> k34 -> ...
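For reference, the by-name ordering described here (k9 before k10) is usually called natural sort order: runs of digits are compared as numbers rather than character by character. Below is a minimal Java sketch of such a comparator. The class name `NaturalOrder` and the field `NATURAL` are made up for illustration and are not part of calibre or Windows, and this is only an approximation of how Explorer sorts names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class NaturalOrder {

    static boolean isAsciiDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // Drops leading zeros from a run of digits, keeping at least one digit.
    static String stripLeadingZeros(String digits) {
        int k = 0;
        while (k < digits.length() - 1 && digits.charAt(k) == '0') k++;
        return digits.substring(k);
    }

    // Compares names run by run: digit runs by numeric value,
    // everything else character by character (case-insensitive).
    static final Comparator<String> NATURAL = (a, b) -> {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            if (isAsciiDigit(a.charAt(i)) && isAsciiDigit(b.charAt(j))) {
                int si = i, sj = j;
                while (i < a.length() && isAsciiDigit(a.charAt(i))) i++;
                while (j < b.length() && isAsciiDigit(b.charAt(j))) j++;
                String na = stripLeadingZeros(a.substring(si, i));
                String nb = stripLeadingZeros(b.substring(sj, j));
                // With leading zeros gone, a longer run is a larger number.
                int c = na.length() != nb.length()
                        ? Integer.compare(na.length(), nb.length())
                        : na.compareTo(nb);
                if (c != 0) return c;
            } else {
                int c = Character.compare(
                        Character.toLowerCase(a.charAt(i)),
                        Character.toLowerCase(b.charAt(j)));
                if (c != 0) return c;
                i++;
                j++;
            }
        }
        // One name is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    };

    public static void main(String[] args) {
        List<String> names = new ArrayList<>(Arrays.asList(
                "k10.html", "k9.html", "k0b.html", "k33c.html", "k0a.html", "k7.html"));
        names.sort(NATURAL);
        System.out.println(names);
        // prints [k0a.html, k0b.html, k7.html, k9.html, k10.html, k33c.html]
    }
}
```

A plain `String` comparison would put k10.html before k9.html, which is why a comparator like this is needed when rebuilding the order from file names alone.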

kovidgoyal · 08-03-2019, 12:16 PM

You will want to change the setting for adding the files from depth-first order to breadth-first order; see the note at https://manual.calibre-ebook.com/faq...specific-order

Blrp · 08-03-2019, 04:39 PM

Didn't work, and when I said "seemingly random order" I guess I shouldn't have hedged my statement. I imported index.html into Calibre twice with depth-first, then twice with breadth-first, and the resulting ePub had a different order every time. I then converted the last entry again to confirm that the problem lies not with adding the .html but with converting to ePub.

The resulting five ePubs started (after index) with k33c, k53, k33c, k54, k53... and I converted once again and got k52c, so there's definitely a pattern here. I think one I deleted started with k34. Also, the two that started with k33c had different orders after that.

kovidgoyal · 08-03-2019, 11:42 PM

I find it extremely hard to believe that converting to EPUB would randomize file order; if that were the case, there would literally be millions of bug reports about it. But feel free to attach a file that shows this behavior on conversion and I will take a look.

Blrp · 08-04-2019, 06:08 AM

Alright, I unzipped the first zip (nilsholg-html.zip) and added index.html to Calibre; the result was the second zip which I converted to ePub twice with different results (and with a weirdly large size difference). This was with the breadth-first setting.

kovidgoyal · 08-04-2019, 07:49 AM

This is because these are not HTML files but HTML fragments; for example, k0a.html contains:

Code:

<h1>Den kristna dagvisan</h1>

<p>Den signade dag, som vi nu här se
<br>av himmelen till oss nedkomma,
<br>han blive oss säll, han låte sig te
<br>oss allom till glädje och fromma.
<br>Ja, Herren, den högste, oss alla i dag
<br>för synder och sorger bevare.

<p>Men såsom en fågel mot himmelens höjd
<br>sig lyfter på lediga vingar,
<br>han lovar sin gud, är glad och förnöjd,
<br>när han över jorden sig svingar,
<br>så lyfter sig själen i hjärtelig fröjd
<br>till himlen med lovsång och böner.

<p>Ack, låtom oss lova och bedja vår Gud,
<br>när stunderna växla och skrida,
<br>så skola vi stärkas att hålla hans bud
<br>och vaka och tåligen lida.
<br>Ja, låtom oss verka med allvar och flit,
<br>så länge oss dagen förunnas.

<p>sv. Ps. 424: 1, 5, 6

calibre will look for an <html> tag and, if it does not find one, will assume the file is not HTML and not add it to the spine. However, since index.html is added to the spine by virtue of being the top-level file, the EPUB conversion will follow the links in it and auto-fix the HTML fragments, making them proper files. However, this happens in random order.

So you need to either fix the files to be proper HTML yourself (just adding an opening <html> tag to the start of each file should be enough), or edit the OPF file inside the calibre-produced zip file, add all the extra HTML files to the <spine> section, and then convert.
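The first option can be scripted. Below is a minimal Java sketch, assuming the unzipped .html files sit together in one directory that is passed as the first argument. The class name `WrapHtmlFragments` and the method `wrapIfFragment` are made up for illustration. It wraps each fragment in a small <html><body> shell, which is slightly more than the bare opening tag mentioned above. The files are read and written as ISO-8859-1 only because that round-trips every byte unchanged, not because the actual encoding is known.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class WrapHtmlFragments {

    // Returns the text unchanged if it already has an <html> tag,
    // otherwise wraps it in a minimal <html><body> shell.
    static String wrapIfFragment(String text) {
        if (text.toLowerCase().contains("<html")) {
            return text;
        }
        return "<html><body>\n" + text + "\n</body></html>\n";
    }

    public static void main(String[] args) throws IOException {
        // Assumption: the directory holding the unzipped files is the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        // ISO-8859-1 maps every byte to one char and back, so the original
        // bytes survive whatever encoding the files really use.
        Charset charset = StandardCharsets.ISO_8859_1;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.html")) {
            for (Path file : files) {
                String original = new String(Files.readAllBytes(file), charset);
                String wrapped = wrapIfFragment(original);
                if (!wrapped.equals(original)) {
                    Files.write(file, wrapped.getBytes(charset));
                }
            }
        }
    }
}
```

Run it on a copy of the unzipped directory, then add index.html to calibre as before. Files that already contain an <html> tag, such as a complete index.html, are left untouched.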

Blrp · 08-06-2019, 11:19 AM

Alright, thanks for the help. I made a quick and dirty java program for editing content.opf in the zip file. I'll drop it here just in case someone with the same problem happens to find this thread.

Spoiler:

Code:

import java.io.BufferedWriter;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.file.FileSystem;
import java.nio.file.FileSystems;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class EditContentOPF {
	
	static String readFileInZip(String pathToZip, String pathInZip, String charsetName) throws IOException {
		try (ZipFile zip = new ZipFile(pathToZip)) {
			ZipEntry entry = zip.getEntry(pathInZip);
			if (entry == null) throw new RuntimeException("content.opf not found");
			try (InputStream is = zip.getInputStream(entry)) {
				try (@SuppressWarnings("resource") Scanner s = new Scanner(is, charsetName).useDelimiter("\\A")) { // eclipse bug? try-with should not have resource leak
					return s.hasNext() ? s.next() : "";
				}
			}
		}
	}
	
	static String editContentFileText(String text) throws ParserConfigurationException, SAXException, IOException, TransformerException {
		DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
		builder.setErrorHandler(null);
		Document document = builder.parse(new InputSource(new StringReader(text)));
		Element packageNode = document.getDocumentElement();
		
		// get list of items to reference in spine
		Node manifest = packageNode.getElementsByTagName("manifest").item(0);
		NodeList manifestItems = ((Element)manifest).getElementsByTagName("item");
		List<Element> manifestHtmlItems = new ArrayList<>();
		for (int i = 0; i < manifestItems.getLength(); i++) {
			Element item = (Element)manifestItems.item(i);
			String href = item.getAttribute("href");
			if (href.endsWith(".html")) // endsWith also copes with an href that has no '.'
				manifestHtmlItems.add(item);
		}
		
		// we expect index.html to be first in the manifest
		Element indexItem = (Element)manifestHtmlItems.get(0);
		if (!indexItem.getAttribute("href").equals("index.html"))
			throw new RuntimeException("index.html is not first");
		String indexId = indexItem.getAttribute("id");
		manifestHtmlItems.remove(0);
		
		// we expect index.html to be the only item referenced in the spine
		Node spine = packageNode.getElementsByTagName("spine").item(0);
		NodeList spineChildren = ((Element)spine).getChildNodes();
		List<Element> spineNonTextChildren = new ArrayList<>();
		for (int i = 0; i < spineChildren.getLength(); i++) {
			if (spineChildren.item(i).getNodeType() != Node.TEXT_NODE)
				spineNonTextChildren.add((Element)spineChildren.item(i));
		}
		if (spineNonTextChildren.size() != 1)
			throw new RuntimeException("unexpected number of nodes in spine");
		if (!spineNonTextChildren.get(0).getAttribute("idref").equals(indexId))
			throw new RuntimeException("index.html is not referenced in spine");
		
		// add references in spine
		for (Element item : manifestHtmlItems) {
			String id = item.getAttribute("id");
			if (id.isEmpty())
				throw new RuntimeException("item has no id, href=" + item.getAttribute("href"));
			Element itemref = document.createElement("itemref");
			itemref.setAttribute("idref", id);
			spine.appendChild(itemref);
		}
		
		// turn document back into string
		StringWriter writer = new StringWriter();
		TransformerFactory.newInstance().newTransformer().transform(new DOMSource(document), new StreamResult(writer));
		return writer.toString();
	}
	
	static void writeToFileInZip(String pathToZip, String pathInZip, String text, String charsetName) throws IOException {
		String[] lines = text.split("\\R");
		try (FileSystem fs = FileSystems.newFileSystem(Paths.get(pathToZip), (ClassLoader) null)) { // cast avoids an ambiguous overload on Java 13+
			Path fullPath = fs.getPath(pathInZip);
			Files.delete(fullPath);
			try (BufferedWriter bw = new BufferedWriter(new OutputStreamWriter(Files.newOutputStream(fullPath), charsetName))) {
				for (String line : lines) {
					bw.write(line);
					bw.newLine();
				}
			}
		}
	}
	
	public static String input() {
		try (Scanner sc = new Scanner( // this yields a scanner of System.in that, when closed, does not close System.in
				new FilterInputStream(System.in) {
					@Override
					public void close() throws IOException {}
				})) {
			return sc.nextLine();
		}
	}
	
	public static void main(String[] args) throws ParserConfigurationException, SAXException, IOException, TransformerException {
		System.out.print("enter zip path: ");
		String pathToZip = input();
		String charsetName = "utf-8";
		String pathInZip = "content.opf";
		
		String contentFileText = readFileInZip(pathToZip, pathInZip, charsetName);
		
		String editedContentFileText = editContentFileText(contentFileText);
				
		writeToFileInZip(pathToZip, pathInZip, editedContentFileText, charsetName);
	}
}
