PDF to ePub - impossible to unwrap line breaks on this file?

barbiedolphin · 05-15-2024, 05:56 PM

This is a very important historical Bulgarian translation of the Bible which I haven't been able to find anywhere else as a single file. Thankfully it's not OCR and even has a table of contents, but the line breaks seem impossible to unwrap upon conversion, no matter the settings.

Can anyone help? Ideally I would've just said "screw it" and cleaned up the line breaks with regex into a raw TXT file, but then I'd lose the table of contents.

Any help would be very much appreciated, no matter how hacky the workaround! Looking forward to finally uploading this one to archive.org, so no one else has to scour the web to find it, nor torture themselves converting it.

BetterRed · 05-15-2024, 07:22 PM

Try this - it's an MS Word PDF->DOCX conversion. It's in a ZIP because I can't upload DOCXs bigger than 5MB at MR.

BR

barbiedolphin · 05-16-2024, 08:15 AM

Thanks for the reply! I tried converting this DOCX into ePub, and while a lot of lines did get unwrapped properly:
- the numbered verses seemed to be treated and converted as a numbered list (which some readers render as bullet points)
- some of the verses merged with one another (too much unwrapping)

I also tried replacing the dot in each verse number (e.g. "13.") with a similar-looking Unicode dot so it wouldn't be treated as a numbered list, but that didn't seem to work.
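For anyone wanting to reproduce the substitution described above, here is a minimal Python sketch. It assumes the verses sit one per line in plain text and start with a number, a dot and a space (as in "13. "). The choice of U+2024 (ONE DOT LEADER) as the look-alike and the name `disguise_verse_dots` are assumptions for illustration; the post doesn't say which character was used. As noted, the swap did not stop the list detection in this case.

```python
import re

# U+2024 ONE DOT LEADER looks like a full stop but is a different code point.
# Which look-alike was actually tried is not stated; this is an assumed choice.
LOOKALIKE_DOT = "\u2024"

# A verse number: 1-3 digits at the start of a line, then "." and whitespace.
VERSE_NUMBER = re.compile(r"^(\d{1,3})\.(?=\s)", flags=re.MULTILINE)


def disguise_verse_dots(text: str) -> str:
    """Swap the dot after each leading verse number for the look-alike."""
    return VERSE_NUMBER.sub(lambda m: m.group(1) + LOOKALIKE_DOT, text)
```

Only a dot that directly follows a line-initial number is touched, so sentence-ending full stops stay as they are.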

Just now I also found this particular text in various exotic formats:
http://eubible.com/download/download.htm
They seem to be viewable with software called "Simple Bible Reader", with which I was able to convert one into a "LOGOS Import File" - basically a DOCX with weird verse numbering, which I fixed using regex.
I then converted that DOCX into ePub, which went flawlessly!

BetterRed · 05-16-2024, 09:56 PM

Glad you found a solution

BR

barbiedolphin · 05-19-2024, 07:35 PM

Ughhh, it turned out the weird file formats I found/converted actually contained various text issues. So ultimately, I did need to copy the text from the original ugly PDF and wrangle it with regex and calibre until I finally got what I needed (incl. a table of contents).
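The exact regexes weren't posted, so what follows is only a rough Python sketch of that kind of cleanup, not the actual steps used. It assumes every verse starts on its own line with a number, a dot and a space (as in "13. "), and that chapter headings look like "Глава 1"; that heading pattern and the name `unwrap_to_html` are hypothetical. Wrapped lines are joined onto the verse they continue, and headings become `<h2>` elements so a converter has something to build a table of contents from.

```python
import html
import re

# Verse numbers as described in the thread: "13." followed by whitespace.
VERSE_START = re.compile(r"^\d{1,3}\.\s")
# Assumed heading shape ("Глава" = "Chapter"); adjust to the real text.
HEADING = re.compile(r"^Глава\s+\d+$")


def unwrap_to_html(raw: str) -> str:
    """Join hard-wrapped lines into one paragraph per verse."""
    blocks: list[tuple[str, str]] = []  # (tag, text) pairs
    can_join = False  # False right after a heading or at the very start
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines, e.g. from page breaks
        if HEADING.match(line):
            blocks.append(("h2", line))
            can_join = False
        elif VERSE_START.match(line) or not can_join:
            blocks.append(("p", line))  # a new verse starts a new paragraph
            can_join = True
        else:
            tag, text = blocks[-1]
            blocks[-1] = (tag, text + " " + line)  # continuation of a verse
    return "\n".join(f"<{tag}>{html.escape(text)}</{tag}>" for tag, text in blocks)
```

Because a new paragraph starts only at a verse number or after a heading, verses can't merge into one another. The catch is that a wrapped line which happens to begin with something like "40. " would be split off as a new verse, so the result still needs a read-through.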

Here's where it ultimately ended up, alongside other editions:
https://archive.org/details/bibliya-...o-izdanie-1885
