PDF line unwrap

miquel · 05-18-2010, 03:43 PM

Hi,
I wanted to give a hand in auto-detecting line breaks, headers and footers in PDFs, so I've been tinkering with the code.

Now, am I a bit slow, or is opt_unwrap_factor picked up in the gui, and never carried over for conversion?

Should this be happening at ./ebooks/conversion/preprocess.py:252 ?

Thanks!
Miquel

Starson17 · 05-18-2010, 03:57 PM

Quote:

Originally Posted by miquel

Now, am I a bit slow, or is opt_unwrap_factor picked up in the gui, and never carried over for conversion?

Should this be happening at ./ebooks/conversion/preprocess.py:252 ?

I'm not 100% sure what you're asking. FYI, here is where unwrap_factor and opt_unwrap_factor are used in the code (courtesy of UltraEdit; recent code, but not the latest).

Code:

Find 'unwrap_factor' in ' src\calibre\ebooks\conversion\preprocess.py' :
 src\calibre\ebooks\conversion\preprocess.py/252:         if getattr(self.extra_opts, 'unwrap_factor', 0.0) > 0.01:
 src\calibre\ebooks\conversion\preprocess.py/253:             length = line_length(html, getattr(self.extra_opts, 'unwrap_factor'))
Found 'unwrap_factor' 2 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\html\input.py' :
 src\calibre\ebooks\html\input.py/266:         OptionRecommendation(name='unwrap_factor', recommended_value=0.0,
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\pdb\pdf\reader.py' :
 src\calibre\ebooks\pdb\pdf\reader.py/24:         setattr(self.options, 'unwrap_factor', 0.5)
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\pdf\input.py' :
 src\calibre\ebooks\pdf\input.py/25:         OptionRecommendation(name='unwrap_factor', recommended_value=0.5,
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input.py' :
 src\calibre\gui2\convert\pdf_input.py/17:             ['no_images', 'unwrap_factor'])
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input.ui' :
 src\calibre\gui2\convert\pdf_input.ui/23:       <cstring>opt_unwrap_factor</cstring>
 src\calibre\gui2\convert\pdf_input.ui/41:     <widget class="QDoubleSpinBox" name="opt_unwrap_factor">
Found 'unwrap_factor' 2 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input_ui.py' :
 src\calibre\gui2\convert\pdf_input_ui.py/23:         self.opt_unwrap_factor = QtGui.QDoubleSpinBox(Form)
 src\calibre\gui2\convert\pdf_input_ui.py/24:         self.opt_unwrap_factor.setMaximum(1.0)
 src\calibre\gui2\convert\pdf_input_ui.py/25:         self.opt_unwrap_factor.setSingleStep(0.01)
 src\calibre\gui2\convert\pdf_input_ui.py/26:         self.opt_unwrap_factor.setProperty("value", 0.5)
 src\calibre\gui2\convert\pdf_input_ui.py/27:         self.opt_unwrap_factor.setObjectName("opt_unwrap_factor")
 src\calibre\gui2\convert\pdf_input_ui.py/28:         self.gridLayout.addWidget(self.opt_unwrap_factor, 0, 1, 1, 1)
 src\calibre\gui2\convert\pdf_input_ui.py/32:         self.label_2.setBuddy(self.opt_unwrap_factor)
Found 'unwrap_factor' 8 time(s).
Search complete, found 'unwrap_factor' 16 time(s). (7 files.)
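
For anyone following along, the check at preprocess.py lines 252-253 can be pictured roughly like this. This is only a minimal sketch: it assumes line_length returns the line length found at the given fraction of the sorted line lengths, and that lines are separated by <br> tags. The names line_length_at and ExtraOpts are made up for illustration and are not calibre's actual code.

```python
import re


def line_length_at(html, factor):
    """Return the line length found at `factor` of the way through the sorted lengths."""
    # Hypothetical stand-in for line_length(); assumes <br> separates lines.
    lines = [l.strip() for l in re.split(r'<br\s*/?>', html) if l.strip()]
    if not lines:
        return 0
    lengths = sorted(len(l) for l in lines)
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]


class ExtraOpts:
    # Hypothetical options object; 0.5 mirrors the PDF input default in the listing.
    unwrap_factor = 0.5


opts = ExtraOpts()
html = 'short<br>a medium line<br>the longest line of all<br>'
length = None
# Same guard as in the listing: only compute a length when the factor is set.
if getattr(opts, 'unwrap_factor', 0.0) > 0.01:
    length = line_length_at(html, getattr(opts, 'unwrap_factor'))
```

With a factor of 0.0 (the HTML input default in the listing) the guard is false and no unwrapping length is computed at all.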

miquel · 05-19-2010, 10:48 AM

Hey, thanks for checking!
Yes, I grepped with the same results. The only place where the unwrap_factor property seems to be read is in preprocess.py. Problem is, I've added print statements around that and they don't get shown. That's why I'm asking if it's used in practice.

I'm going to do some more homework then, and see why it never gets to the print statements when converting from PDF. I just wanted to make sure unwrapping hadn't been disabled for some reason, and I was on a wild goose chase!

Miquel

Starson17 · 05-19-2010, 02:14 PM

Quote:

Originally Posted by miquel

I've added print statements around that and they don't get shown. That's why I'm asking if it's used in practice.

AFAIK, it's still used, although I can't confirm whether you've got the right variable name. I assume you're using calibre-debug -g to start Calibre and see your print statements?

miquel · 05-21-2010, 04:37 PM

Oh darn... Mea culpa... I wasn't starting with calibre-debug. The unwrap code is indeed called, and I have to do a better job of reading the documentation :S

Thanks a lot!

miquel · 05-23-2010, 06:05 PM

Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think!
Patch's here: http://bugs.calibre-ebook.com/ticket/5597

Starson17 · 05-23-2010, 06:38 PM

Quote:

Originally Posted by miquel

Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think!
Patch's here: http://bugs.calibre-ebook.com/ticket/5597

Read Kovid's comments on your ticket regarding the new pdf conversion engine.

DoctorOhh · 05-23-2010, 07:25 PM

Quote:

Originally Posted by miquel

Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think!
Patch's here: http://bugs.calibre-ebook.com/ticket/5597

I have no knowledge in this area, but could the method this fellow took in creating this extension for OpenOffice.org's Writer be applied to cleaning up PDF file conversions?

From his page:

Quote:

Do you have problems with
texts having unwanted
line breaks like
this one?

This happens because there are some unwanted paragraph marks along the text. If we take the text from a PDF, inevitably we will get a paragraph mark at each end of line.

Now, or you delete them one by one with a lot of patience, or you can use the macro MyTXTcleaner that will do the work for you.

miquel · 05-25-2010, 08:51 AM

Hey there,
Yup, I got Kovid's comment, but haven't had a chance to look into the new pdf conversion engine yet. I'll try and port this functionality there, if the new engine doesn't already support it. Also, there might be lessons to be learned or reused from MyTXTcleaner (thx dwanthny).

The other thing I was looking into was autodetecting the regex for headers and footers, btw, which is another match for this engine.

Talk to you soon with more news!
Miquel
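
One way the header and footer autodetection mentioned above could work is to look for first and last lines that repeat on most pages once digits are generalized. The sketch below is an assumption about the approach, not the submitted patch: edge_pattern, find_repeated_edges and the min_share cutoff are hypothetical names and values, and pages are assumed to arrive as lists of text lines.

```python
import re
from collections import Counter


def edge_pattern(line):
    """Turn a line into a regex in which every run of digits becomes \\d+."""
    parts = re.split(r'\d+', line.strip())
    return r'\d+'.join(re.escape(part) for part in parts)


def find_repeated_edges(pages, min_share=0.6):
    """Return (header_patterns, footer_patterns) that recur on most pages."""
    heads = Counter(edge_pattern(p[0]) for p in pages if p)
    feet = Counter(edge_pattern(p[-1]) for p in pages if p)
    needed = min_share * len(pages)  # hypothetical cutoff
    headers = {pat for pat, n in heads.items() if n >= needed}
    footers = {pat for pat, n in feet.items() if n >= needed}
    return headers, footers


pages = [
    ['My Book', 'text on page one', 'Page 1'],
    ['My Book', 'text on page two', 'Page 2'],
    ['My Book', 'text on page three', 'Page 3'],
]
headers, footers = find_repeated_edges(pages)
```

The resulting patterns could then be offered as a starting point for the header and footer regex options rather than applied blindly, since a chapter opening with the same words on several pages would also match.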

Starson17 · 05-25-2010, 09:55 AM

Quote:

Originally Posted by miquel

Talk to you soon with more news!
Miquel

Contributions are always very welcome. PDF conversion is one area that certainly needs work. There are many users looking forward to the new PDF conversion engine (although I think the new 0.7.x release is even more important).

kovidgoyal · 05-25-2010, 11:00 AM

As it's going to be a little while before I can work on the new engine, if you want to work on it, feel free.

Starson17 · 05-25-2010, 11:18 AM

Quote:

Originally Posted by kovidgoyal

As it's going to be a little while before I can work on the new engine, if you want to work on it, feel free.

I was hoping to encourage Miquel to work on it

Look at how well it worked with Charles - the fence got whitewashed and all I had to do was point to where it needed a few touchups.

I don't have much confidence that PDF conversion will ever be very good. I'd personally choose to work on something that I thought would have a decent chance of being successful, rather than something that will always have problems. Perhaps I'm wrong about how successful a PDF conversion can be, but I've disliked PDFs for a very long time.

kovidgoyal · 05-25-2010, 11:23 AM

Yeah, PDF conversions will never get to the point of, say, LIT or MOBI conversions. But the existing level can be improved significantly by doing context-aware line unwrapping. Which means using things like the statistics for line lengths on the whole page, font size changes, spacing between lines, etc. to detect whether a line should be unwrapped or not.

And by doing all this you get header and footer and multicolumn support for free.
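
To make the line-length statistics idea concrete, here is a minimal sketch that uses only one of the signals listed above: a line much shorter than the typical line on the page is treated as the end of a paragraph, and every other line is joined to the next. The function name unwrap_lines and the 0.9 factor are assumptions for illustration. Font size changes and line spacing are not modeled here.

```python
from statistics import median


def unwrap_lines(lines, factor=0.9):
    """Join soft-wrapped lines; a line shorter than factor * median ends a paragraph."""
    stripped = [l.strip() for l in lines]
    lengths = [len(l) for l in stripped if l]
    if not lengths:
        return []
    threshold = median(lengths) * factor  # hypothetical cutoff
    paragraphs, current = [], []
    for text in stripped:
        if not text:
            # A blank line always closes the current paragraph.
            if current:
                paragraphs.append(' '.join(current))
                current = []
            continue
        current.append(text)
        if len(text) < threshold:
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


page = [
    'Do you have problems with texts',
    'having unwanted line breaks in',
    'them?',
    'This is a second paragraph that',
    'also wraps.',
]
result = unwrap_lines(page)
```

A paragraph whose last line happens to fill the full width is wrongly joined to the next one under this rule, which is where the extra context (font size, spacing) would have to come in.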

miquel · 05-26-2010, 04:25 PM

Hello? Right here guys

Like it or not, there are plenty of PDF books out there, and conversion is really not up to par today, so who cares whose fence it is?
Anyway, I already said I'd have a look

miquel · 05-26-2010, 05:18 PM

OK Kovid, I'd like to confirm a couple of things with you, please
The new pdf engine:

1. Takes the pdf file, and passes it to the C plugin implementation of PDF reflow. That returns an xml with the pdf's draw commands (a pdf in xml if you will)

2. PDFDocument takes the xml and generates the html that's used as a base for conversion

3. The rest of ebook conversion takes the html into whatever other format is needed

My plan would then be to hack into PDFDocument, take the xml, do the unwrapping and header+footer detection, and end up making the html there.

Is that what you had in mind? Or did you intend the reflow plugin to, you know, reflow (i.e. unwrap) the PDF? I personally prefer pdfreflow being a pdf-to-xml-that-we-can-work-on-in-python converter.

Did I get it right? What did you have in mind?
Thanks!
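
The three steps above could be sketched as follows. Everything about the XML here is invented for illustration: the element names, the top attribute and SAMPLE_XML are not the real output of the reflow plugin, and pages_from_xml and page_to_html are hypothetical helpers rather than PDFDocument methods. The sketch joins lines by vertical spacing only.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical stand-in for the XML returned in step 1.
SAMPLE_XML = """<pdfreflow>
  <page number="1">
    <text top="40">Chapter One</text>
    <text top="80">It was a dark and</text>
    <text top="95">stormy night.</text>
  </page>
</pdfreflow>"""


def pages_from_xml(xml_text):
    """Step 2a: read each page as a list of (top, text) tuples."""
    root = ET.fromstring(xml_text)
    return [[(float(t.get('top')), t.text or '') for t in page.iter('text')]
            for page in root.iter('page')]


def page_to_html(lines, max_gap=20.0):
    """Step 2b: unwrap lines with a small vertical gap and emit <p> elements."""
    paragraphs, current, last_top = [], [], None
    for top, text in lines:
        # A gap larger than max_gap (hypothetical value) starts a new paragraph.
        if current and last_top is not None and top - last_top > max_gap:
            paragraphs.append(' '.join(current))
            current = []
        current.append(text.strip())
        last_top = top
    if current:
        paragraphs.append(' '.join(current))
    return ''.join('<p>%s</p>' % escape(p) for p in paragraphs)


pages = pages_from_xml(SAMPLE_XML)
html_out = page_to_html(pages[0])
```

Step 3 would then hand html_out to the rest of the conversion pipeline unchanged, which is the appeal of keeping the plugin as a plain PDF-to-XML converter and doing the unwrapping in Python.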
