Looking for K2pdfopt Help

CWNichols · 04-24-2023, 05:22 AM

Hello,

I've got a scan of an old book in PDF that I'd like to read on my Kindle (5th Gen.). For whatever reason, once it is on the device the pages appear blurry or blank (based on a post by Willus in another thread, I suspect it's because the PDF may be composed of JPX images). When I run it through K2pdfopt I can view the pages. The only problem is that no matter what setting adjustments I make to the conversion, I still get some pages that are cut up in random ways (see attached sample, converted with setting bp 3, -1).

Is there a way to adjust the settings in K2pdfopt so that it simply converts the pdf to display as it is successfully doing, but without making any other crop or margin type adjustments at all?

thanks so much!

VladimirS · 04-25-2023, 06:42 AM

Which conversion parameters did you apply?

CWNichols · 04-25-2023, 06:48 AM

Quote:

Originally Posted by VladimirS

Which conversion parameters did you apply?

Hi VladimirS, thanks for the reply. The only setting I changed was the one indicated in the OP (break page 3, -1).

Thanks!

VladimirS · 04-26-2023, 04:19 AM

You can always make an exact copy of the original book using:

k2pdfopt -mode copy filename.pdf

This will make an exact copy, without any modifications, cropping, or similar adjustments.

CWNichols · 04-26-2023, 06:27 AM

Quote:

Originally Posted by VladimirS

You can always make an exact copy of the original book using:

k2pdfopt -mode copy filename.pdf

This will make an exact copy, without any modifications, cropping, or similar adjustments.

Wow, that actually seems to have worked!! I did notice it affected the color (increased contrast, which is not a big deal). Is it possible to copy to black and white? And also to add OCR?

Thanks so much for your help!!

VladimirS · 04-27-2023, 04:42 AM

No problem, we are here to help each other.

If you do not want the contrast changed and you want to add OCR, you can run k2pdfopt as follows:

k2pdfopt -mode copy -cmax 1.0 -g 1.0 -ocr filename.pdf

CWNichols · 04-28-2023, 06:54 AM

Quote:

Originally Posted by VladimirS

No problem, we are here to help each other.

If you do not want the contrast changed and you want to add OCR, you can run k2pdfopt as follows:

k2pdfopt -mode copy -cmax 1.0 -g 1.0 -ocr filename.pdf

Hi Vlad, I guess I should've pointed out that I'm using the executable form of K2pdfopt (in other words, I simply input "mo", hit enter, then "copy" and hit enter). Therefore I am unable to enter the commands as you've relayed them. Nonetheless, I have managed to figure out how to enter all those options in this form.

The only issue I'm having is with the OCR. It almost seems as if only every other page is navigable by cursor on the Kindle (the cursor gets stuck on the outside of the page, in spite of the text being highlightable on the desktop). Any thoughts on how to resolve that? I did notice there are several OCR options; I've been using mupdf with default settings.

Cheers mate

Steven

VladimirS · 04-29-2023, 05:50 AM

OCR is getting better as time goes by, but it has its limits.

A good OCR job depends on the original photo (the individual PDF page).
The page needs to be flat and free of distortions, with good contrast between
text and background. Book scanning and scanning artefacts
make the OCR job more difficult: many small imperfections, shades of background ...

If I find a possible solution, I will let you know. Cheers

willus · 04-29-2023, 09:45 PM

Hi all--sorry to have missed this thread. It's been a busy week. Thank you VladimirS for your help.

A few things:
1. To force no contrast change, actually -cmax -1 is better than -cmax 1, but it hardly matters.
2. If you want an exact replica, also turn off sharpening: -s-
3. Is this being run on Linux? Mac? What version? For the latest versions of k2pdfopt, you can actually just directly enter the command-line arguments into the "Enter option above" prompt. It should work.
4. Are you doing tesseract OCR? You might try different detection options and see if one works better than another. E.g. -ocrd p will entirely use Tesseract's own algorithm for finding text on the page, whereas -ocrd l will have k2pdfopt submit the OCR graphics line by line.

If you want to PM me the PDF source you are trying to convert, I'd be happy to recommend options.
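
As a rough sketch of how the suggestions so far might be combined, the script below only prints two candidate command lines (it does not run k2pdfopt), one per OCR detection option, so the two outputs can be compared on the same scan. The flags are the ones mentioned in this thread (`-g 1.0` comes from VladimirS's earlier command); the helper name `replica_cmd` and the file names are placeholders, and how well `-ocrd` behaves together with `-mode copy` on a particular scan is an assumption to check.

```shell
#!/bin/sh
# Print (not run) a k2pdfopt command line for a near-exact copy with Tesseract OCR.
# $1 = OCR detection option named in the thread: "p" or "l"
# $2 = input PDF (placeholder name), $3 = output PDF (placeholder name)
replica_cmd() {
    printf 'k2pdfopt -mode copy -cmax -1 -g 1.0 -s- -ocr t -ocrd %s %s -o %s\n' "$1" "$2" "$3"
}

# One candidate per detection option, to compare on the same source file.
for d in p l; do
    replica_cmd "$d" filename.pdf "out_ocrd_${d}.pdf"
done
```

Per point 3 above, on recent versions the same options should also be accepted when typed at the "Enter option above" prompt instead of on a shell command line.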

CWNichols · 05-07-2023, 02:22 PM

Quote:

Originally Posted by VladimirS

OCR is getting better as time goes by, but it has its limits.

A good OCR job depends on the original photo (the individual PDF page).
The page needs to be flat and free of distortions, with good contrast between
text and background. Book scanning and scanning artefacts
make the OCR job more difficult: many small imperfections, shades of background ...

If I find a possible solution, I will let you know. Cheers

Thanks again Vlad

CWNichols · 05-07-2023, 02:27 PM

Quote:

Originally Posted by willus

Hi all--sorry to have missed this thread. It's been a busy week. Thank you VladimirS for your help.

A few things:
1. To force no contrast change, actually -cmax -1 is better than -cmax 1, but it hardly matters.
2. If you want an exact replica, also turn off sharpening: -s-
3. Is this being run on Linux? Mac? What version? For the latest versions of k2pdfopt, you can actually just directly enter the command-line arguments into the "Enter option above" prompt. It should work.
4. Are you doing tesseract OCR? You might try different detection options and see if one works better than another. E.g. -ocrd p will entirely use Tesseract's own algorithm for finding text on the page, whereas -ocrd l will have k2pdfopt submit the OCR graphics line by line.

If you want to PM me the PDF source you are trying to convert, I'd be happy to recommend options.

Hi Willus, thank you very much for taking the time to help and provide those additional tips. I am using the executable version (that gives the "enter option above" prompt) on Mac OS 10.15.7 (Catalina).

It seems I was getting the best results with mupdf OCR.

Will PM you the specific pdf I'm seeking to convert in this particular instance.

Thanks again!

willus · 05-17-2023, 12:50 AM

I found these settings worked pretty well:

Code:

k2pdfopt -p 50-52 -mode copy -c- -m .25,.42,.25,1 -t -om 0.25 -as -ocr t input.pdf -o output.pdf

The -p 50-52 just tries it out on pages 50-52 as a trial since the book is very long (you can remove this once you want to try it on the whole book).
The -mode copy defaults to copying the source page size.
The -m arguments ignore the left 0.25 inches, the top 0.42 inches, the right 0.25 inches, and the bottom 1 inch of each source page.
The -t trims to the text.
The -om adds a small blank border to the output edges (0.25 in).
The -as auto-straightens (de-skews) each page.
The -ocr t uses Tesseract OCR.
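
The trial-then-full-run workflow described above could be wrapped roughly as follows. This is only a sketch: it prints the command lines rather than running k2pdfopt, the helper name `convert_cmd` is made up for illustration, and `input.pdf` and `output.pdf` are the placeholder names from the post.

```shell
#!/bin/sh
# Print (not run) the k2pdfopt command from the post above.
# $1 = optional page range for a trial run (e.g. 50-52); omit it for the whole book.
convert_cmd() {
    pages=""
    if [ -n "${1:-}" ]; then
        pages="-p $1 "
    fi
    printf 'k2pdfopt %s-mode copy -c- -m .25,.42,.25,1 -t -om 0.25 -as -ocr t input.pdf -o output.pdf\n' "$pages"
}

convert_cmd 50-52   # trial on pages 50-52
convert_cmd         # full book: same options without -p
```

Checking the trial pages on the device first, and only then dropping `-p`, avoids waiting on an OCR pass over the whole book just to find out that the margins need adjusting.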
