05-12-2023, 03:22 PM | #1 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
[SOLVED] Hard trimming PDFs?
Hello,
I need to hard-trim PDFs, i.e. the stuff outside the mediabox should really be gone from the output file. I tried the following, but they only perform visual trimming, i.e. the page is displayed as expected but the data is actually still in the file: Code:
cpdf.exe -crop "0 0 400pt 600pt" input.pdf 1-50 -o output.pdf input.pdf
pdfcpu.exe box add -- "media:[0 0 400 600]" input.pdf output.pdf
mutool.exe trim -b mediabox -o output.pdf input.pdf
Thank you.
Last edited by Shohreh; 05-14-2023 at 09:49 AM. |
05-12-2023, 04:48 PM | #2 |
Grand Sorcerer
Posts: 6,224
Karma: 16536676
Join Date: Sep 2009
Location: UK
Device: Kobo: KA1, ClaraHD, Forma, Libra2, Clara2E. PocketBook: TouchHD3
|
With the caveat that I haven't attempted to crop PDFs for more than 5 years ...
At that time I seem to remember that GhostScript had the ability (commandline only) to, using your terms, "hard trim" a PDF which had previously been "visually trimmed" with some other utility (I used to use briss for the visual trimming). Here's an old example I noted at the time: Code:
gswin64c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=hard_trim.pdf visual_trim.pdf |
05-12-2023, 05:52 PM | #3 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Calibre still displays the cropped data in the EPUB, so it's still in the file, but I'll look into GS.
Briss got stuck on that ~400 page PDF, which is partly why I tried CLI apps. Thank you. -- Edit: Code:
mutool.exe pages soft.cropped.pdf 20
soft.cropped.pdf:
<page pagenum="20">
<MediaBox l="0" b="0" r="424" t="600" />
<CropBox l="0" b="0" r="424" t="600" />
<Rotate v="0" />
</page>
mutool.exe pages hard_trim.pdf 20
hard_trim.pdf:
<page pagenum="20">
<MediaBox l="0" b="0" r="424" t="600" />
<CropBox l="0" b="0" r="424" t="600" />
<Rotate v="0" />
</page>
Last edited by Shohreh; 05-12-2023 at 06:18 PM. |
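Identical MediaBox/CropBox values say nothing about whether the hidden content was dropped. A rough spot check, sketched here in Python with PyMuPDF, is to list the words that text extraction still reports inside a region such as the header band. The function names `words_in_region` and `leftover_words` are made up for this illustration, and the check only covers text that PyMuPDF's extraction reports for the page, so an empty result is a hint rather than proof.

```python
def words_in_region(words, region):
    """Return the words whose bounding box overlaps `region`.

    `words` uses the tuple layout of PyMuPDF's page.get_text("words"):
    (x0, y0, x1, y1, text, block_no, line_no, word_no).
    `region` is (x0, y0, x1, y1) in the same coordinates (origin top-left).
    """
    rx0, ry0, rx1, ry1 = region
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        # Two rectangles overlap when they overlap on both axes.
        if x0 < rx1 and x1 > rx0 and y0 < ry1 and y1 > ry0:
            hits.append(text)
    return hits


def leftover_words(pdf_path, page_index, region):
    """List words still reported inside `region` on one page (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so words_in_region works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        return words_in_region(page.get_text("words"), region)
    finally:
        doc.close()
```

For example, `leftover_words("hard_trim.pdf", 19, (0, 0, 424, 50))` would list any words still found in the top 50 points of the page at index 19; the file name and numbers are only placeholders taken from this thread.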
05-13-2023, 06:37 AM | #4 |
Addict
Posts: 303
Karma: 2228060
Join Date: Dec 2013
Location: LaVernia, Texas
Device: kindle epub readers on android
|
pdfjam works well for me on Linux. It is command-line driven in a terminal; there is no GUI.
|
05-13-2023, 09:09 AM | #5 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
How would you use it to permanently remove (not just hide) headers and footers?
https://github.com/rrthomas/pdfjam
What about adding "redaction annotations" to each page, and then having those sections entirely removed from the PDF?
https://pspdfkit.com/guides/processo...tion/overview/
Yet another possibility: what about a script in PyMuPDF that would read each page, create a new one that's cropped, and save that into a new PDF?
https://pypdf2.readthedocs.io/en/3.0...nsforming.html
Last edited by Shohreh; 05-13-2023 at 09:12 AM. |
05-13-2023, 09:43 AM | #6 |
the rook, bossing Never.
Posts: 12,352
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
I've used k2pdfopt, ImageMagick and also the GIMP.
|
05-13-2023, 01:07 PM | #7 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
I'd rather convert the PDF to EPUB with Calibre since my e-reader doesn't handle PDFs very well. I tried k2pdfopt, and didn't like it.
Besides, if the only thing is to remove the headers and footers, it's worth investigating. |
05-13-2023, 02:11 PM | #8 |
the rook, bossing Never.
Posts: 12,352
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
PDFs vary in ability to convert. If convertible at all, Word, Writer or other tools are far better than Calibre for PDFs. Then convert a docx to epub.
Unless you OCR, all you can do with an image-based PDF is crop, resize, and adjust contrast/brightness/bit-depth. I'd only use Calibre to catalogue and transfer existing PDFs as PDFs to ereaders or tablets that can manage them. |
05-13-2023, 03:46 PM | #9 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
The book I'm playing with converts just fine to EPUB.
The only thing I'd need to get a near-perfect EPUB is removing headers and footers… which is the perfect occasion to dig in and understand why people bother with regex in the HTML at all if you can just remove the data from the source PDF before running Calibre. There's got to be a way to either remove everything that's outside the mediabox, or mark some sections as redaction annotations and remove them all. |
05-14-2023, 09:48 AM | #10 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
PyMuPDF to the rescue…
Code:
#https://artifex.com/blog/advanced-text-manipulation-using-pymupdf
import fitz
doc = fitz.open("original.pdf")
page = doc[18]
#print(page.get_text())
rect = fitz.Rect(0,0,424,50)
page.add_redact_annot(rect)
page.apply_redactions()
doc.save("redacted.pdf")
#ebook-convert.exe redacted.pdf redacted.epub
Another useful tool would be a PDF viewer that lets the user select a rectangle and display its coordinates, ready to be copy-pasted into the command line.
Last edited by Shohreh; 05-14-2023 at 10:11 AM. |
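Short of a viewer that reports coordinates, one possible workaround is to let PyMuPDF print the bounding boxes of the text blocks near the top and bottom of a page, then pick the redaction rectangle from those numbers. This is only a sketch: `edge_blocks` and `print_edge_blocks` are names invented here, and the 60-point margin is an arbitrary assumption to adjust per book.

```python
def edge_blocks(blocks, page_height, margin):
    """Pick text blocks lying entirely within `margin` points of the top or bottom edge.

    `blocks` uses the tuple layout of PyMuPDF's page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type).
    """
    picked = []
    for x0, y0, x1, y1, text, *_ in blocks:
        near_top = y1 <= margin
        near_bottom = y0 >= page_height - margin
        if near_top or near_bottom:
            picked.append(((x0, y0, x1, y1), text.strip()))
    return picked


def print_edge_blocks(pdf_path, page_index, margin=60):
    """Print candidate header/footer blocks of one page (hypothetical helper)."""
    import fitz  # PyMuPDF, imported lazily so edge_blocks works without it

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        blocks = page.get_text("blocks")
        for box, text in edge_blocks(blocks, page.rect.height, margin):
            print(box, repr(text))
    finally:
        doc.close()
```

Calling `print_edge_blocks("original.pdf", 18)` would print one line per candidate block, with its rectangle as left, top, right, bottom.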
05-14-2023, 02:37 PM | #11 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Parse the whole PDF, ignoring the first page of each chapter:
Code:
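import fitz

doc = fitz.open("original.pdf")
rect = fitz.Rect(0, 0, 424, 50)  # header band: left, top, right, bottom
# Flatten into a set: a range nested inside a list is never matched by "in"
exclude = set(range(1, 14)) | {17, 24, 97, 155, 186, 232, 258, 297, 322, 343, 404}
# Page indexes are 0-based, so the last valid index is doc.page_count - 1
for index in range(1, doc.page_count):
    if index not in exclude:
        page = doc[index]
        page.add_redact_annot(rect)
        page.apply_redactions()
doc.save("redacted.pdf")
|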
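A note on how Python treats a `range` inside a list, since the exclusion test depends on it: the `range` counts as one element, so `in` compares the page index with the range object itself and never looks inside it. The short snippet below, with the page list shortened for illustration, shows the difference between that and a flattened set.

```python
# A range placed inside a list is a single element of that list.
exclude_list = [range(1, 14), 17, 24]
print(5 in exclude_list)    # False: 5 is compared with the range object itself
print(17 in exclude_list)   # True

# Flattening into a set makes every excluded index a member.
exclude_set = set(range(1, 14)) | {17, 24}
print(5 in exclude_set)     # True
print(13 in exclude_set)    # True
print(14 in exclude_set)    # False: range(1, 14) stops at 13
```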
05-14-2023, 05:45 PM | #12 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
And do the same for the footer on all the pages:
Code:
import fitz

doc = fitz.open("original.pdf")
"""
#Until a PDF viewer comes along… here's how to find where to draw a box around the header+footer
#left,top,right,bottom
rect = fitz.Rect(0,560,424,600)
page = doc[13]
shape = page.new_shape()
shape.draw_rect(rect)
shape.finish(color=None)
shape.commit()
"""
rect_header = fitz.Rect(0,0,424,50)
rect_footer = fitz.Rect(0,560,424,600)
# Flatten into a set: a range nested inside a list is never matched by "in"
exclude = set(range(1, 14)) | {17, 24, 97, 155, 186, 232, 258, 297, 322, 343, 404}
for index in range(1, doc.page_count):
    print(index)
    page = doc[index]
    #remove header only on non-chapter pages
    if index not in exclude:
        page.add_redact_annot(rect_header)
    #remove footer on all pages
    page.add_redact_annot(rect_footer)
    page.apply_redactions()
doc.save("redacted.pdf")
|
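The rectangles above are hard-coded for a 424 x 600 point page. A possible variation, assuming the header and footer bands have the same height on every page, is to derive both rectangles from each page's own size. `header_footer_boxes` and `redact_bands` are hypothetical names, and unlike the script above this sketch also processes the page at index 0.

```python
def header_footer_boxes(width, height, header_h, footer_h):
    """Return (header, footer) boxes as (left, top, right, bottom) tuples."""
    header = (0, 0, width, header_h)
    footer = (0, height - footer_h, width, height)
    return header, footer


def redact_bands(src, dst, header_h=50, footer_h=40, keep_header=frozenset()):
    """Redact the footer on every page and the header on pages not in keep_header."""
    import fitz  # PyMuPDF, imported lazily so header_footer_boxes works without it

    doc = fitz.open(src)
    try:
        for index, page in enumerate(doc):
            header, footer = header_footer_boxes(
                page.rect.width, page.rect.height, header_h, footer_h
            )
            if index not in keep_header:
                page.add_redact_annot(fitz.Rect(*header))
            page.add_redact_annot(fitz.Rect(*footer))
            # Remove the content covered by the redaction annotations.
            page.apply_redactions()
        doc.save(dst)
    finally:
        doc.close()
```

With the defaults, a 424 x 600 page gives the same two rectangles as the script above; `keep_header` would take the set of chapter-opening page indexes.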
05-14-2023, 06:07 PM | #13 | |
Junior Member
Posts: 6
Karma: 55624
Join Date: May 2023
Location: France
Device: Kobo by Fnac Nia 6" 8 Go
|
Quote:
|
|
05-14-2023, 06:41 PM | #14 |
Groupie
Posts: 181
Karma: 304158
Join Date: Jan 2016
Device: none
|
Cutting out.
Looks like adding and applying redaction annotations is the way to go. The commands above do work… but only on the screen: the data is still in the file, and thus included in the EPUB generated by Calibre. |
05-16-2023, 03:09 PM | #15 |
Junior Member
Posts: 6
Karma: 55624
Join Date: May 2023
Location: France
Device: Kobo by Fnac Nia 6" 8 Go
|
Have you tried resizing the code to make it converge with what you are explaining? I think the problem might be that you have a slightly old device. Or maybe you didn't upgrade... I'd really like to know how it's going
|
|