05-18-2010, 03:43 PM | #1 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
PDF line unwrap
Hi,
I wanted to give a hand in auto-detecting line breaks, headers and footers in PDFs, so I've been tinkering with the code. Now, am I a bit slow, or is opt_unwrap_factor picked up in the GUI and never carried over to the conversion? Should this be happening at ./ebooks/conversion/preprocess.py:252? Thanks! Miquel |
05-18-2010, 03:57 PM | #2 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
Find 'unwrap_factor' in ' src\calibre\ebooks\conversion\preprocess.py' :
src\calibre\ebooks\conversion\preprocess.py/252: if getattr(self.extra_opts, 'unwrap_factor', 0.0) > 0.01:
src\calibre\ebooks\conversion\preprocess.py/253: length = line_length(html, getattr(self.extra_opts, 'unwrap_factor'))
Found 'unwrap_factor' 2 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\html\input.py' :
src\calibre\ebooks\html\input.py/266: OptionRecommendation(name='unwrap_factor', recommended_value=0.0,
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\pdb\pdf\reader.py' :
src\calibre\ebooks\pdb\pdf\reader.py/24: setattr(self.options, 'unwrap_factor', 0.5)
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\ebooks\pdf\input.py' :
src\calibre\ebooks\pdf\input.py/25: OptionRecommendation(name='unwrap_factor', recommended_value=0.5,
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input.py' :
src\calibre\gui2\convert\pdf_input.py/17: ['no_images', 'unwrap_factor'])
Found 'unwrap_factor' 1 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input.ui' :
src\calibre\gui2\convert\pdf_input.ui/23: <cstring>opt_unwrap_factor</cstring>
src\calibre\gui2\convert\pdf_input.ui/41: <widget class="QDoubleSpinBox" name="opt_unwrap_factor">
Found 'unwrap_factor' 2 time(s).
----------------------------------------
Find 'unwrap_factor' in ' src\calibre\gui2\convert\pdf_input_ui.py' :
src\calibre\gui2\convert\pdf_input_ui.py/23: self.opt_unwrap_factor = QtGui.QDoubleSpinBox(Form)
src\calibre\gui2\convert\pdf_input_ui.py/24: self.opt_unwrap_factor.setMaximum(1.0)
src\calibre\gui2\convert\pdf_input_ui.py/25: self.opt_unwrap_factor.setSingleStep(0.01)
src\calibre\gui2\convert\pdf_input_ui.py/26: self.opt_unwrap_factor.setProperty("value", 0.5)
src\calibre\gui2\convert\pdf_input_ui.py/27: self.opt_unwrap_factor.setObjectName("opt_unwrap_factor")
src\calibre\gui2\convert\pdf_input_ui.py/28: self.gridLayout.addWidget(self.opt_unwrap_factor, 0, 1, 1, 1)
src\calibre\gui2\convert\pdf_input_ui.py/32: self.label_2.setBuddy(self.opt_unwrap_factor)
Found 'unwrap_factor' 8 time(s).
Search complete, found 'unwrap_factor' 16 time(s). (7 files.) |
|
05-19-2010, 10:48 AM | #3 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
Hey, thanks for checking!
Yes, I grepped with the same results. The only place where the unwrap_factor property seems to be read is in preprocess.py. Problem is, I've added print statements around that and they don't get shown. That's why I'm asking if it's used in practice. I'm going to do some more homework then, and see why it never gets to the print statements when converting from PDF. I just wanted to make sure unwrapping hadn't been disabled for some reason, and that I wasn't on a wild goose chase! Miquel |
05-19-2010, 02:14 PM | #4 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
AFAIK, it's still used, although I can't confirm whether you've got the right variable name. I assume you're using calibre-debug -g to start Calibre and see your print statements?
|
05-21-2010, 04:37 PM | #5 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
Oh darn... Mea culpa... I wasn't starting with calibre-debug. The unwrap code is indeed called, and I have to do a better job of reading the documentation :S
Thanks a lot! |
05-23-2010, 06:05 PM | #6 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
Hi again,
Starson17, thanks a lot for your help on this thread. I've submitted a patch with a different approach to line unwrapping, let's see what people think! Patch's here: http://bugs.calibre-ebook.com/ticket/5597 |
05-23-2010, 06:38 PM | #7 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
|
|
05-23-2010, 07:25 PM | #8 | ||
US Navy, Retired
Posts: 9,867
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
|
Quote:
From his page: Quote:
|
||
05-25-2010, 08:51 AM | #9 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
Hey there,
Yup, I got Kovid's comment, but haven't had a chance to look into the new pdf conversion engine yet. I'll try and port this functionality there, if the new engine doesn't already support it. Also, there might be lessons to be learned or reused from MyTXTcleaner (thx dwanthny). The other thing I was looking into was autodetecting the regex for headers and footers, btw, which is another match for this engine. Talk to you soon with more news! Miquel |
05-25-2010, 09:55 AM | #10 |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Contributions are always very welcome. PDF conversion is one area that certainly needs work. There are many users looking forward to the new PDF conversion engine (although I think the new 0.7.x release is even more important).
|
05-25-2010, 11:00 AM | #11 |
creator of calibre
Posts: 44,499
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
As it's going to be a little while before I can work on the new engine, if you want to work on it, feel free.
|
05-25-2010, 11:18 AM | #12 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Look at how well it worked with Charles - the fence got whitewashed and all I had to do was point to where it needed a few touchups. I don't have much confidence that PDF conversion will ever be very good. I'd personally choose to work on something that I thought would have a decent chance of being successful, rather than something that will always have problems. Perhaps I'm wrong about how successful a PDF conversion can be, but I've disliked PDFs for a very long time. |
|
05-25-2010, 11:23 AM | #13 |
creator of calibre
Posts: 44,499
Karma: 24495778
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yeah, PDF conversions will never get to the point of, say, LIT or MOBI conversions. But the existing level can be improved significantly by doing context-aware line unwrapping, which means using things like the statistics for line lengths on the whole page, font size changes, spacing between lines, etc. to detect whether a line should be unwrapped or not.
And by doing all this you get header and footer and multicolumn support for free. |
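To make the line-length idea concrete, here is a minimal sketch of unwrapping driven by a line-length statistic. It is not calibre's implementation: the function names percentile_length and unwrap are made up for illustration, it works on plain text lines rather than HTML, and it ignores font size and line spacing entirely. The factor parameter is only assumed to play a role similar to unwrap_factor.

```python
def percentile_length(lines, factor):
    """Return the line length found at the given fraction (0..1) of the
    sorted lengths of all non-empty lines. Hypothetical helper."""
    lengths = sorted(len(line.strip()) for line in lines if line.strip())
    if not lengths:
        return 0
    index = min(int(len(lengths) * factor), len(lengths) - 1)
    return lengths[index]


def unwrap(lines, factor=0.5):
    """Join lines into paragraphs. A line at least as long as the threshold
    is assumed to have been wrapped and is joined with the next one; a
    shorter line or a blank line is assumed to end a paragraph."""
    threshold = percentile_length(lines, factor)
    paragraphs = []
    current = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(' '.join(current))
                current = []
            continue
        current.append(line)
        if len(line) < threshold:
            # Short line: treat it as the last line of the paragraph.
            paragraphs.append(' '.join(current))
            current = []
    if current:
        paragraphs.append(' '.join(current))
    return paragraphs


if __name__ == '__main__':
    page = [
        "The quick brown fox jumps over the lazy",
        "dog and keeps running through the field",
        "until dusk.",
        "A second paragraph starts here and also",
        "wraps once.",
    ]
    for paragraph in unwrap(page, factor=0.5):
        print(paragraph)
```

A heuristic like this breaks down on pages with many short lines (verse, tables, dialogue), which is where the extra signals mentioned above, such as font size changes and spacing between lines, would have to come in.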
05-26-2010, 04:25 PM | #14 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
Hello? Right here guys
Like it or not, there are plenty of PDF books out there, and conversion is really not up to par today, so who cares whose fence it is? Anyway, I already said I'd have a look |
05-26-2010, 05:18 PM | #15 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Location: Heidelberg, Germany
Device: Amazon Kindle 2
|
OK Kovid, I'd like to confirm a couple of things with you, please
The new pdf engine:
1. Takes the pdf file, and passes it to the C plugin implementation of PDF reflow. That returns an xml with the pdf's draw commands (a pdf in xml if you will)
2. PDFDocument takes the xml and generates the html that's used as a base for conversion
3. The rest of ebook conversion takes the html into whatever other format is needed
My plan would then be to hack into PDFDocument, take the xml, do the unwrapping and header+footer detection, and end up making the html there. Is that what you had in mind? Or did you intend the reflow plugin to, you know, reflow (ie unwrap) the pdf? I personally prefer pdfreflow being a pdf-to-xml-that-we-can-work-on-in-python converter. Did I get it right? What did you have in mind? Thanks! |
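For the header+footer detection part of that plan, a rough sketch of one possible approach follows. It assumes each page has already been reduced to a list of text lines, which is not what the reflow xml looks like, and every name in it (detect_repeated, strip_headers_footers, min_share) is hypothetical rather than part of calibre. The idea is simply that a first or last line which repeats on most pages, once page numbers are masked, is probably a header or footer.

```python
import re
from collections import Counter


def _normalize(line):
    # Mask digit runs so "Page 12" and "Page 13" compare as equal.
    return re.sub(r'\d+', '#', line.strip())


def detect_repeated(pages, position, min_share=0.6):
    """Return the normalized line that sits at `position` (0 for the first
    line, -1 for the last) on at least `min_share` of the non-empty pages,
    or None if no line repeats that often."""
    non_empty = [page for page in pages if page]
    if not non_empty:
        return None
    counts = Counter(_normalize(page[position]) for page in non_empty)
    text, count = counts.most_common(1)[0]
    return text if count / len(non_empty) >= min_share else None


def strip_headers_footers(pages, min_share=0.6):
    """Return copies of the pages with the detected header and footer
    lines removed. Pages that do not carry them are left unchanged."""
    header = detect_repeated(pages, 0, min_share)
    footer = detect_repeated(pages, -1, min_share)
    cleaned = []
    for page in pages:
        body = list(page)
        if body and header is not None and _normalize(body[0]) == header:
            body = body[1:]
        if body and footer is not None and _normalize(body[-1]) == footer:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


if __name__ == '__main__':
    pages = [
        ["My Book Title", "text a", "Page 1"],
        ["My Book Title", "text b", "Page 2"],
        ["My Book Title", "text c", "Page 3"],
    ]
    print(strip_headers_footers(pages))
```

Running the detection before unwrapping keeps header and footer text from being joined into the body paragraphs; with the real xml, the position of each line on the page would likely be a better signal than its index in a list.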