PDFRead 1.7 released

ashkulz · 04-25-2007, 09:02 AM

UPDATE: PDFRead 1.7 has been released. The changes for 1.7 and the batch conversion instructions come first, followed by the initial release announcement for 1.6.

I've released PDFRead 1.7, which has minor bug fixes and enhancements. Changes in this release:

add a "landscape-half" mode which splits a page into two even halves (gdxf's suggestion)
if the output document does not have the proper file extension, then append it automatically.
remove imagemagick and use pngnq for color reduction.
fix the problems if the PDF has an incorrect TOC referring to an invalid page. Also added option --no-toc to disable TOC generation.
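
As an illustration of the automatic extension handling in the list above, here is a minimal sketch. It is not taken from PDFRead's source, and the helper name `ensure_extension` is hypothetical.

```python
import os

def ensure_extension(output_path, extension):
    """Append `extension` (e.g. "lrf") unless the path already ends with it.

    Hypothetical helper for illustration; not PDFRead's actual code.
    """
    wanted = "." + extension.lower().lstrip(".")
    current = os.path.splitext(output_path)[1].lower()
    if current == wanted:
        return output_path
    # The extension is appended, never substituted for an existing one.
    return output_path + wanted

print(ensure_extension("sample", "lrf"))
print(ensure_extension("sample.lrf", "lrf"))
```

A path that already carries the right extension is returned unchanged, so calling the helper twice is harmless.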

Also, batch conversion can now be done on Windows for all PDFs in a folder.

Download the file attached to the linked post and rename it to pdfread-batch.bat
Open the renamed file and change the set OPT= line to use the appropriate profile. If you have installed PDFRead in a non-default location, change the set LOC= line too.
Copy the batch file into the directory containing the PDFs you want to convert, and double-click on it. Please do not put the directory anywhere under the Desktop or My Documents, as that can cause problems. Put it somewhere in the root of your drive (C:, D:).
The filename will be used as the book title, so be sure to name files properly. Please ensure that the filename does not contain special characters not present in UTF-8. An ebook will be created with the same name but with the given extension (i.e. sample.pdf => sample.lrf).

In case you want to customize further:

Do a normal conversion with your custom params for a single file and copy the command line options to a text file. Some advice on how to copy the options from the window:

Quote:

Originally Posted by alex_d

To copy text from a CMD window, right-click on the title bar (the bar that has the X and minimize buttons), choose Properties, and then enable QuickEdit mode. This lets you highlight text and copy it by right-clicking on it. Copy everything, even if you have to scroll up.

Copy the command line parameters and replace the set OPT= mentioned above. Do NOT include the input filename, the title (-t option) or the call to pdfread, just the options. The value should be valid command line options.

People on OS X/Linux can hack together a similar script very easily, so I won't bother to post it. If you do want such a script, let me know.
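
For OS X/Linux users, a rough sketch of such a script follows. It only builds the command lines, one per PDF in a folder, and uses the filename as the title as described above. The `-p`, `-o` and `-t` options are the ones that appear in the command line posted later in this thread; the function name `build_commands` and the assumption that a `pdfread` executable is on the PATH are mine, so adjust the first element if you run `python pdfread.py` instead.

```python
import os

def build_commands(folder, options, extension="lrf"):
    """Return one pdfread argument list per PDF found in `folder`.

    `options` plays the role of the OPT= line in the Windows batch file.
    Hypothetical helper; assumes `pdfread` is on the PATH.
    """
    commands = []
    for name in sorted(os.listdir(folder)):
        base, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # ignore anything that is not a PDF
        source = os.path.join(folder, name)
        target = os.path.join(folder, base + "." + extension)
        # The filename (without extension) doubles as the book title.
        commands.append(
            ["pdfread"] + list(options) + ["-o", target, "-t", base, source]
        )
    return commands
```

Each returned list can be passed to `subprocess.call` to run the conversion, or printed first to check what would be executed.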

Original announcement follows

After a long wait, PDFRead 1.6 has been released. You can download from PDFRead @ SourceForge.

The focus of this release has been to rewrite the code for better maintainability. It can now be easily integrated into other tools. PDFRead now has a plugin-based architecture, which will allow new features to be added easily -- which I've already done for this release.

Lots of new image processing options have been added to PDFRead. unpaper integration ensures that bad scans will be cleaned up properly. The new cropping algorithm removes whitespace very aggressively, even from the middle of the page, without any loss of content. All images are now run through an edge-enhancement filter, which is the same one used by both rbmake and RasterFarian.
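
To give a feel for what cropping whitespace "even from the middle of the page" can mean, here is a toy sketch. It is not PDFRead's algorithm: the page representation (a list of rows of grey values), the function name `crop_whitespace` and the `max_gap` parameter are all assumptions made for illustration, and only rows are handled.

```python
def crop_whitespace(rows, white=255, max_gap=1):
    """Trim blank margins and shrink long blank bands inside the page.

    `rows` is a list of pixel rows (lists of grey values). Assumed
    representation for illustration only.
    """
    def is_blank(row):
        return all(pixel >= white for pixel in row)

    # Trim blank rows at the top and bottom of the page.
    start, end = 0, len(rows)
    while start < end and is_blank(rows[start]):
        start += 1
    while end > start and is_blank(rows[end - 1]):
        end -= 1

    # Keep at most `max_gap` consecutive blank rows in the middle.
    result, gap = [], 0
    for row in rows[start:end]:
        if is_blank(row):
            gap += 1
            if gap > max_gap:
                continue
        else:
            gap = 0
        result.append(row)
    return result
```

No row containing a dark pixel is ever dropped, which is the sense in which the content is preserved.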

Support for the TIFF and IMGLIST input formats has been added. The IMGLIST format is a simple text file containing a list of images which are to be considered as a single document.
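
A minimal sketch of reading such a list is shown below. It assumes one image path per line with blank lines ignored; that layout is my assumption, as is the function name `read_image_list`.

```python
def read_image_list(text):
    """Return the image paths named in an IMGLIST-style text, in order.

    Assumes one path per line; blank lines are skipped.
    """
    pages = []
    for line in text.splitlines():
        path = line.strip()
        if path:
            pages.append(path)
    return pages

sample = "page-001.png\npage-002.png\n\npage-003.png\n"
print(read_image_list(sample))
```

The order of the lines is kept, since it determines the page order of the resulting document.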

Batch support is not directly present for Windows, but can be achieved via a batch file. The command line used to convert each book (using the current settings) is printed before conversion. You can then copy this to tweak your conversion settings. Users of Linux/OS X are assumed to be familiar with the command-line, and the batch support can be achieved by scripting.

You can also specify a range of pages for conversion. This has the side-effect of giving a preview feature, as specifying the same page as the start and end page will run the processing only for that page.
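
The page-range behaviour can be sketched as follows. This is an illustration only: the function name `select_pages` and its parameters are hypothetical, and page numbers are assumed to be 1-based.

```python
def select_pages(total, first=1, last=None):
    """Return the 1-based page numbers to process, clamped to the document."""
    if last is None:
        last = total
    first = max(1, first)
    last = min(total, last)
    return list(range(first, last + 1))

# Same start and end page: only that page is processed (the "preview" case).
print(select_pages(10, first=3, last=3))
```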

The Windows GUI has been revamped: there are now tooltips everywhere, and there is no "advanced" page anymore. If you do want to control those parameters, please use the command line directly.

Lots of other minor tweaks have gone into this release.

The detailed changelog for this release:

revamped the Windows GUI: added tooltips, preview feature and show the command line options when executed (useful for batch execution).
add support for TIFF and a list of page images for input.
add unpaper support for image cleanup.
add extremely aggressive whitespace detection, even in the middle of the page text.
added an edge-enhancement filter, similar to rbmake and RasterFarian.
allow all processing stages to be selectively disabled.
allow a page range to be specified for conversion.
tweak the prs-500 profile to rotate right instead of left (thanks gdxf)
add an optional step to optimize generated PNG images via OptiPNG.
removed the dependency on xpdf.
removed the autocontrast and ghostscript cropping features (no longer useful).
fix problem where the IMP file was not created if the latest eBook Publisher was not installed.
complete overhaul of the code for better maintainability.

Some screenshots of the effect of the various image processing options are also attached.

Azayzel · 04-25-2007, 10:27 AM

Wow, you've been quite busy! Have to give this one a whirl and see how things turn out. I have a watermarked PDF I hope PDFRead works well on, we'll see. Thanks!

kovidgoyal · 04-25-2007, 01:11 PM

Cool now that I've finished the HTML,TXT -> LRF converters, I can look into integrating PDFRead into libprs500. There is one concern: http://www.py2exe.org/index.cgi/Py2E...ssInteractions

Would you be willing to fix that in your code?

EDIT: More information http://sourceforge.net/tracker/index...70&atid=105470

ashkulz · 04-25-2007, 01:21 PM

Quote:

Originally Posted by kovidgoyal

Cool now that I've finished the HTML,TXT -> LRF converters, I can look into integrating PDFRead into libprs500. There is one concern: http://www.py2exe.org/index.cgi/Py2E...ssInteractions

Would you be willing to fix that in your code?

I don't see how it affects pdfread. I use explicit pipes when I'm calling other executables (gs, convert, etc), so there shouldn't be a problem in pdfread. If you mean that you may have problems when calling pdfread as a console application, I'd suggest not doing it that way. Just import pdfread and call the convert() function -- and you're set to go. You should also replace the variable P_STREAM in common.py with any valid stream -- it currently points to sys.stdout, so there's only one place where you need to replace streams.
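
For readers unfamiliar with the "explicit pipes" approach, here is a minimal sketch using the standard `subprocess` module. It is not pdfread's actual code: the helper name `run_with_pipes` is hypothetical, and the Python interpreter stands in for an external tool such as gs or convert.

```python
import subprocess
import sys

def run_with_pipes(args, input_bytes=b""):
    """Run `args` with stdin, stdout and stderr all attached to pipes."""
    process = subprocess.Popen(
        args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    # communicate() feeds stdin, then reads both outputs until the child exits.
    out, err = process.communicate(input_bytes)
    return process.returncode, out, err

# Stand-in child process: the current Python interpreter.
code, out, err = run_with_pipes([sys.executable, "-c", "print('ok')"])
print(code, out)
```

Because all three standard streams are given explicitly, the child does not depend on whatever handles the parent process happens to have.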

kovidgoyal · 04-25-2007, 01:39 PM

OK...anyway I just realized that this bug has been squashed in python 2.5.1

ashkulz · 04-26-2007, 10:04 AM

For those on Windows, there's a quick way to convert all PDFs in a folder.

Download the attached file and rename it as pdfread-batch.bat
Open up the renamed file, and change the set OPT= line to use the appropriate profile. You may also have to change the EXT= line if you are using a different profile. In case you have installed in a non-default location, change the set LOC= line too.
Copy the batch file into the directory containing the PDFs you want to convert, and double-click on it. The filename will be used as the book title, so be sure to name files properly. An ebook will be created with the same name but with the given extension (i.e. sample.pdf => sample.lrf).

In case you want to customize further:

Do a normal conversion with your custom params for a single file and copy the command line options to a text file. Some advice on how to copy the options from the window:

Quote:

Originally Posted by alex_d

To copy text from a CMD window, right-click on the title bar (the bar that has the X and minimize buttons), choose Properties, and then enable QuickEdit mode. This lets you highlight text and copy it by right-clicking on it. Copy everything, even if you have to scroll up.

Copy the command line parameters and replace the set OPT= mentioned above. Do NOT include the input filename, the title (-t option) or the call to pdfread, just the options. The value should be valid command line options.

People on OS X/Linux can hack together a similar script very easily, so I won't bother to post it. If you do want such a script, let me know.

Gravitas · 04-26-2007, 10:43 AM

I'm being such a muppet (so much so that I changed my title and avatar to match), but I was trying to use this software last night and couldn't get any lrf files from it. I also couldn't get the files it did produce (png) into the folder I specified in my output path - they all went into a temp folder, even when I used the .lrf file extension in the name of the book.

I'm not usually such a muppet IT-wise (thank god, as I'm an IT Manager looking after a MPLS Citrix network over 36 sites with 600 users) so I reckon it's the excitement of finally getting my hands on my Reader tomorrow, that is shortcircuiting my brain.

Any idea what I'm doing wrong? - I have every confidence that you guys will get me using this stuff properly, as you all sorted me out using BD

oh, and I'm using Windows

ashkulz · 04-26-2007, 12:15 PM

Gravitas: did you use the prs500 or the prs500-l profile? The LRF is produced only if the profile is one of the above two. Otherwise, depending on the profile it will produce output targeted for another device.

If it still doesn't work, can you post a screenshot of the settings before pressing Convert and the explorer view of the output folder?

Don't worry, we all have those days every now and then

Gravitas · 04-26-2007, 12:19 PM

I was using the prs500 profile. I'll have another go when I get home and post some screenies.

EDIT

Ok here are my screenies, I'm sure I've done something blindingly obviously wrong


kovidgoyal · 04-26-2007, 06:15 PM

Works pretty well for me. Minor point:
the spelling is 'portrait', not 'potrait' (-m option)

kovidgoyal · 04-26-2007, 07:22 PM

Hmm problems
The following cmdline causes an exception

Code:

python pdfread.py -p prs500 -o /home/kovid/temp/test.lrf -t 'Guide to NumPy' -a 'Travis Oliphant' -f lrf -i pdf -m potrait  /home/kovid/documents/text/notes/NumPy/numpybook.pdf --last-page=2

Creating BBeB file ... Traceback (most recent call last):
  File "/home/kovid/build/pdfread-1.6/pdfread.py", line 204, in <module>
    main()
  File "/home/kovid/build/pdfread-1.6/pdfread.py", line 90, in main
    delete = output.generate(input.toc)
  File "/home/kovid/build/pdfread-1.6/output.py", line 211, in generate
    imagenum = toc_map[int(page_)]
KeyError: 12

Probably because the TOC refers to pages not included.

Also, this is my first time rasterizing a PDF (I usually have access to the LaTeX sources). Is the font rasterization always so bad? I've attached samples to show you what I mean.

gdxf · 04-27-2007, 12:43 AM

I followed the batch mode instructions to run batch conversion in Windows, but encountered this notice in the command line:

"Unable to determine total number of pages in document
Please enter number of pages: "

When I put in a page number, it results in a blank lrf file.

Here is what the screen says:

"Unable to determine total number of pages in document
Please enter number of pages: 1

Temporary directory: c:\docume~1........

Page 1/1: EXTRACT RASTERIZE BLANK

Creating BBeB file ... done.
Unable to determine total number of pages in document
Please enter number of pages: 1

Temporary directory: c:\docume~1\.........

Page 1/1: EXTRACT RASTERIZE BLANK

Creating BBeB file ... done.
Press any key to continue . . ."

Quote:

Originally Posted by ashkulz

For those on Windows, there's a quick way to convert all PDFs in a folder.

ashkulz · 04-27-2007, 05:40 AM

Okay, I've discovered the problem that bit Gravitas and kovidgoyal. The PDF file is incorrect, as it contains a TOC reference for a page that doesn't exist. I've fixed that, and will be making another release tomorrow.
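
One way to guard against such TOC entries is sketched below. This is not the actual fix that went into the release; only the name `toc_map` and the `int(page_)` lookup come from the traceback posted earlier, while the function name `map_toc` and the shape of the TOC data (title, page pairs) are assumptions.

```python
def map_toc(toc, toc_map):
    """Translate (title, page) TOC entries to image numbers.

    Entries whose page is missing from `toc_map` are skipped instead of
    raising KeyError. The data layout is assumed for illustration.
    """
    entries = []
    for title, page in toc:
        imagenum = toc_map.get(int(page))
        if imagenum is None:
            continue  # TOC points at a page that was not converted
        entries.append((title, imagenum))
    return entries

# Page 12 is referenced by the TOC but only pages 1 and 2 were converted.
toc = [("Preface", "1"), ("Chapter 1", "12")]
toc_map = {1: 0, 2: 1}
print(map_toc(toc, toc_map))
```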

Gravitas · 04-27-2007, 06:07 AM

Quote:

Originally Posted by ashkulz

Okay, I've discovered the problem that bit Gravitas and kovidgoyal. The PDF file is incorrect, as it contains a TOC reference for a page that doesn't exist. I've fixed that, and will be making another release tomorrow.

What a star

ashkulz · 04-27-2007, 12:06 PM

Okay, I've released 1.7. Changes in this release:

add a "landscape-half" mode which splits a page into two even halves (gdxf's suggestion)
if the output document does not have the proper file extension, then append it automatically.
remove imagemagick and use pngnq for color reduction.
fix the problems if the PDF has an incorrect TOC referring to an invalid page. Also added option --no-toc to disable TOC generation.
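
As a rough picture of the "landscape-half" idea, the sketch below computes two crop boxes that together cover the page. It is not PDFRead's code: the function name `split_in_half` is hypothetical, and splitting along the page height (top half and bottom half) is my assumption about how the halves are taken.

```python
def split_in_half(width, height):
    """Return two (left, top, right, bottom) boxes covering the page halves.

    Assumes the split runs across the page height; for an odd height the
    bottom half is one row taller.
    """
    middle = height // 2
    top_half = (0, 0, width, middle)
    bottom_half = (0, middle, width, height)
    return top_half, bottom_half

print(split_in_half(600, 800))
```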

If you are on OS X or Linux, please recheck the installation instructions -- there have been changes since the last release.

EDIT: I'm going away for the weekend (it's a long weekend), so I may not respond quickly for a few days



