Old 05-15-2024, 05:56 PM   #1
barbiedolphin
Member
barbiedolphin is on a distinguished road
 
Posts: 21
Karma: 50
Join Date: Jan 2019
Device: none
PDF to ePub - impossible to unwrap line breaks on this file?

This is a very important historical Bulgarian translation of the Bible, which I haven't been able to find anywhere else as a single file. Thankfully it's not OCR and even has a table of contents, but the line breaks seem impossible to unwrap on conversion, no matter the settings.

Can anyone help? Normally I would've just said "screw it" and used regex to clean up the line breaks in a raw TXT file, but then I'd lose the table of contents.
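
For readers unfamiliar with the regex cleanup mentioned here, a minimal Python sketch of that kind of unwrapping might look like the following. It assumes the extracted text uses a single newline for a wrapped line and a blank line between paragraphs; the function name `unwrap` is hypothetical, and this is not necessarily what the poster ran.

```python
import re

def unwrap(text: str) -> str:
    """Join hard-wrapped lines; blank lines stay as paragraph breaks."""
    # Re-join words split by a hyphen at a line end: "begin-\nning" -> "beginning".
    # Caveat: this also removes the hyphen from genuinely hyphenated words
    # that happen to break at the end of a line.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn a lone newline (one that is not part of a blank line) into a space.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    return text
```

As the post says, the result is flat text: any table of contents has to be rebuilt separately.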

Any help would be very much appreciated, no matter how hacky the workaround! Looking forward to finally uploading this one to archive.org, so no one else has to scour the web to find it, nor torture themselves converting it.
Attached Files
File Type: pdf Tsarigradskata-Bibliya 1914.pdf (10.92 MB, 80 views)

Last edited by barbiedolphin; 05-15-2024 at 06:04 PM.
Old 05-15-2024, 07:22 PM   #2
BetterRed
null operator (he/him)
BetterRed ought to be getting tired of karma fortunes by now.
 
Posts: 20,997
Karma: 27620706
Join Date: Mar 2012
Location: Sydney Australia
Device: none
Try this - it's an MS Word PDF->DOCX conversion. It's in a ZIP because I can't upload DOCXs bigger than 5MB at MR.

BR
Attached Files
File Type: zip Tsarigradskata-Bibliya 1914.zip (5.99 MB, 106 views)
Old 05-16-2024, 08:15 AM   #3
barbiedolphin
Thanks for the reply! I tried converting this DOCX into ePub, and while a lot of lines did get unwrapped properly:
- the numbered verses seemed to be treated and converted as a numbered list (which some readers render as bullet points)
- some of the verses merged with one another (too much unwrapping)

I also tried replacing the dot in each verse number (e.g. "13.") with a similar-looking Unicode dot so it wouldn't be treated as a numbered list, but that didn't seem to work.
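
To illustrate the substitution described above, here is a rough Python sketch. The post does not say which character was used, so U+2024 (ONE DOT LEADER) is an assumption, and `disguise_verse_dots` is a hypothetical name. As noted, this approach did not help in this case, so treat it as a picture of the attempt, not a fix.

```python
import re

# Assumption: the post doesn't name the lookalike character; U+2024 is one candidate.
ONE_DOT_LEADER = "\u2024"

def disguise_verse_dots(text: str) -> str:
    """Replace the '.' after a verse number at the start of a line."""
    # (?m) makes ^ match at every line start; the lookahead requires
    # whitespace after the dot, so a decimal like "3.5" is never touched.
    return re.sub(r"(?m)^(\d+)\.(?=\s)", r"\1" + ONE_DOT_LEADER, text)
```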

Just now I also found this particular text in various exotic formats:
http://eubible.com/download/download.htm
They seem to be viewable with software called "Simple Bible Reader", with which I was able to convert one into a "LOGOS Import File" - basically a DOCX with weird verse numbering, which I fixed using regex.
I then converted that DOCX into ePub, which went flawlessly!
Attached Files
File Type: docx simple bible reader conversion.docx (1.84 MB, 26 views)
File Type: docx simple bible reader conversion fixed.docx (1.54 MB, 37 views)
File Type: epub Цариградската .epub (1.55 MB, 32 views)

Last edited by barbiedolphin; 05-16-2024 at 10:31 AM.
Old 05-16-2024, 09:56 PM   #4
BetterRed
Glad you found a solution

BR
Old 05-19-2024, 07:35 PM   #5
barbiedolphin
Ughhh, it turned out the weird file formats I found/converted actually contained various text issues. So ultimately, I did need to copy the text from the original ugly PDF and wrangle it with regex and calibre until I finally got what I needed (incl. a table of contents).
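
The exact regexes used aren't given in the thread. As a hedged sketch of one way to unwrap text copied from a PDF without merging verses, the Python below starts a new verse only on lines that begin with a number and a dot, and appends every other line to the current verse. The "13. " verse format comes from earlier in the thread; the name `unwrap_verses` and the sample text are hypothetical.

```python
import re

# Assumption: each verse begins a line as "<number>. ", e.g. "13. ".
VERSE_START = re.compile(r"^\d+\.\s")

def unwrap_verses(text: str) -> list[str]:
    """Return one string per verse, joining wrapped continuation lines."""
    verses: list[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if VERSE_START.match(line) or not verses:
            verses.append(line)        # a new verse begins here
        else:
            verses[-1] += " " + line   # continuation of the current verse
    return verses
```

Example usage:

```python
sample = "1. First verse starts\nand continues here.\n2. Second verse."
print(unwrap_verses(sample))
```

Book and chapter headings would need their own rule (otherwise they get appended to the preceding verse), and they are also what a table of contents would be built from.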

Here's where it ultimately ended up, alongside other editions:
https://archive.org/details/bibliya-...o-izdanie-1885
Attached Files
File Type: epub Цариградска Би.epub (1.61 MB, 35 views)
File Type: zip Цариградска Би .zip (1.30 MB, 43 views)