Hard trimming PDFs?

Shohreh · 05-12-2023, 02:22 PM

Hello,

I need to hard-trim PDFs, i.e. the stuff outside the mediabox should really be gone from the output file.

I tried the following, but they only perform visual trimming, i.e. the page is displayed as expected but the data is actually still in the file:

Code:

cpdf.exe -crop "0 0 400pt 600pt" input.pdf 1-50 -o output.pdf input.pdf

pdfcpu.exe box add -- "media:[0 0 400 600]" input.pdf output.pdf

mutool.exe trim -b mediabox -o output.pdf input.pdf

Is there a tool, preferably open-source, that supports hard-trimming?

Thank you.

jackie_w · 05-12-2023, 03:48 PM

With the caveat that I haven't attempted to crop PDFs for more than 5 years ...

At that time I seem to remember that GhostScript had the ability (commandline only) to, using your terms, "hard trim" a PDF which had previously been "visually trimmed" with some other utility (I used to use briss for the visual trimming).

Here's an old example I noted at the time:

Code:

gswin64c.exe -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dNOPAUSE -dQUIET -dBATCH -sOutputFile=hard_trim.pdf visual_trim.pdf

It may be too old to be useful, but I offer it as an option to look into if you wish.

Shohreh · 05-12-2023, 04:52 PM

Calibre still displays the cropped data in the EPUB, so it's still in the file, but I'll look into GS.

Briss got stuck on that ~400 page PDF, which is partly why I tried CLI apps.

Thank you.

--
Edit:

Code:

mutool.exe pages soft.cropped.pdf 20
soft.cropped.pdf:
<page pagenum="20">
<MediaBox l="0" b="0" r="424" t="600" />
<CropBox l="0" b="0" r="424" t="600" />
<Rotate v="0" />
</page>

mutool.exe pages hard_trim.pdf 20
hard_trim.pdf:
<page pagenum="20">
<MediaBox l="0" b="0" r="424" t="600" />
<CropBox l="0" b="0" r="424" t="600" />
<Rotate v="0" />
</page>

rjwse@aol.com · 05-13-2023, 05:37 AM

pdfjam works well for me on Linux. It is command-line driven and runs in a terminal; there is no GUI.

Shohreh · 05-13-2023, 08:09 AM

How would you use it to permanently remove (not just hide) headers and footers?

https://github.com/rrthomas/pdfjam

What about adding "redaction annotations" in each page, and then have those sections entirely removed from the PDF?

https://pspdfkit.com/guides/processo...tion/overview/

Yet another possibility: What about a script in PyMuPDF that would read each page, create a new one that's cropped, and save that into a new PDF?

https://pypdf2.readthedocs.io/en/3.0...nsforming.html

Quoth · 05-13-2023, 08:43 AM

I've used k2pdfopt, ImageMagick and also GIMP.

Shohreh · 05-13-2023, 12:07 PM

I'd rather convert the PDF to EPUB with Calibre since my e-reader doesn't handle PDFs very well. I tried k2pdfopt, and didn't like it.

Besides, if the only thing is to remove the headers and footers, it's worth investigating.

Quoth · 05-13-2023, 01:11 PM

PDFs vary in how well they convert. If a PDF is convertible at all, Word, Writer or other tools are far better than Calibre for the job. Then convert the resulting docx to EPUB.
Unless you OCR, all you can do with an image-based PDF is crop, resize, and adjust contrast/brightness/bit-depth.

I'd only use Calibre to catalogue existing PDFs and transfer them as PDFs to e-readers or tablets that can manage them.

Shohreh · 05-13-2023, 02:46 PM

The book I'm playing with converts just fine to EPUB.

The only thing I'd need to get a near perfect EPUB is removing headers and footers… which is the perfect occasion to dig and understand why people bother with regex in the HTML at all if you can just remove the data from the source PDF before running Calibre.

There's got to be a way to either remove everything that's outside the mediabox, or mark some sections as redaction annotations and remove them all.

Shohreh · 05-14-2023, 08:48 AM

PyMuPDF to the rescue…

Code:

#https://artifex.com/blog/advanced-text-manipulation-using-pymupdf
import fitz

doc = fitz.open("original.pdf")
page = doc[18]

#print(page.get_text())
rect = fitz.Rect(0,0,424,50)
page.add_redact_annot(rect)
page.apply_redactions()

doc.save("redacted.pdf")

#ebook-convert.exe redacted.pdf redacted.epub

Hard to believe no ready-to-use command-line tool can add and delete redaction annotations.

Another useful tool would be a PDF viewer that lets the user select a rectangle and display its coordinates, ready to be copy-pasted into the command line.
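Until such a viewer turns up, the word coordinates that PyMuPDF reports can serve both purposes: finding where the header sits, and confirming afterwards that the text is really gone. The sketch below is only an illustration. `words_in_rect` and `check_pdf` are hypothetical helper names, and it assumes the tuple layout returned by `page.get_text("words")`, namely `(x0, y0, x1, y1, word, block_no, line_no, word_no)`.

```python
def words_in_rect(words, rect):
    """Return the words whose bounding box overlaps rect.

    words: tuples shaped like PyMuPDF's page.get_text("words") output,
           i.e. (x0, y0, x1, y1, text, ...).
    rect:  (x0, y0, x1, y1), same coordinate system as the words.
    """
    rx0, ry0, rx1, ry1 = rect
    hits = []
    for x0, y0, x1, y1, text, *_ in words:
        # Two boxes overlap when they overlap on both axes.
        if x0 < rx1 and x1 > rx0 and y0 < ry1 and y1 > ry0:
            hits.append(text)
    return hits


def check_pdf(path, page_index, rect):
    """Hypothetical wrapper: list the words of one page that touch rect."""
    import fitz  # PyMuPDF, imported lazily so the helper above has no dependency

    with fitz.open(path) as doc:
        return words_in_rect(doc[page_index].get_text("words"), rect)
```

Calling `check_pdf("original.pdf", 18, (0, 0, 424, 50))` and then the same on `redacted.pdf` should show the header words in the first result and an empty list in the second, if the redaction removed the text rather than hiding it.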

Shohreh · 05-14-2023, 01:37 PM

Parse the whole PDF, ignoring the first page of each chapter:

Code:

import fitz

doc = fitz.open("original.pdf")
rect = fitz.Rect(0,0,424,50)
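# Page indexes are 0-based. A range inside a list is a single element, so
# unpack it into a set to make "index not in exclude" work for pages 1-13.
exclude = {*range(1, 14), 17, 24, 97, 155, 186, 232, 258, 297, 322, 343, 404}
# Valid indexes stop at page_count - 1; index 0 is skipped, as before.
for index in range(1, doc.page_count):
    if index not in exclude:
        page = doc[index]
        page.add_redact_annot(rect)
        page.apply_redactions()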
doc.save("redacted.pdf")
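One thing to watch with an exclusion list like this: a `range` object placed inside a list is a single element, so a membership test against a page index never matches the pages that range covers. A minimal sketch of flattening a mix of ranges and single page indexes into a set follows; `expand_pages` is a hypothetical helper name, not part of PyMuPDF.

```python
def expand_pages(*items):
    """Flatten a mix of ints and range objects into a set of page indexes."""
    pages = set()
    for item in items:
        if isinstance(item, range):
            pages.update(item)  # add every index the range covers
        else:
            pages.add(item)     # a single page index
    return pages


# The pitfall: the list holds the range object itself, not its members.
nested = [range(1, 14), 17, 24]
flat = expand_pages(range(1, 14), 17, 24)

print(5 in nested)  # False: 5 is compared with the range object
print(5 in flat)    # True: the range was expanded
```

A set also makes each `in` test a constant-time lookup, which is a small bonus on a book of several hundred pages.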

Shohreh · 05-14-2023, 04:45 PM

And do the same for the footer on all the pages:

Code:

import fitz
import sys

doc = fitz.open("original.pdf")

"""
#Until a PDF viewer comes along… here's how to find where to draw a box around the header+footer
#left,top,right,bottom
rect = fitz.Rect(0,560,424,600)
page = doc[13]
shape = page.new_shape()
shape.draw_rect(rect)  #snake_case name; drawRect is the deprecated spelling
shape.finish(color=None)
shape.commit()
"""

rect_header = fitz.Rect(0,0,424,50)
rect_footer = fitz.Rect(0,560,424,600)
#Unpack the range into a set; a range nested in a list never matches an int
exclude = {*range(1, 14), 17, 24, 97, 155, 186, 232, 258, 297, 322, 343, 404}
for index in range(1,doc.page_count):
	print(index)
	page = doc[index]

	#remove header only on non-chapter pages
	if index not in exclude:
		page.add_redact_annot(rect_header)

	#remove footer on all pages
	page.add_redact_annot(rect_footer)
	page.apply_redactions()
	
doc.save("redacted.pdf")
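The rectangles above are hard-coded for a 424 x 600 pt page. If a PDF mixes page sizes, the bands could instead be derived from each page's dimensions. This is a sketch under assumptions: `band_rects` is a hypothetical helper, and the 50 pt header and 40 pt footer heights are simply the values implied by the rectangles in the script.

```python
def band_rects(width, height, header_h=50, footer_h=40):
    """Return (header, footer) rectangles as (x0, y0, x1, y1) tuples.

    Coordinates follow the top-left origin used in the script above:
    the header hugs y = 0 and the footer hugs y = height.
    """
    header = (0, 0, width, header_h)
    footer = (0, height - footer_h, width, height)
    return header, footer


header, footer = band_rects(424, 600)
print(header)  # (0, 0, 424, 50)
print(footer)  # (0, 560, 424, 600)
```

Inside the loop, the tuples could be passed along as `fitz.Rect(*header)` after calling `band_rects(page.rect.width, page.rect.height)` for each page.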

Kromaa · 05-14-2023, 05:07 PM

Quote:

Originally Posted by Shohreh

Hello,

I need to hard-trim PDFs, i.e. the stuff outside the mediabox should really be gone from the output file.

I tried the following, but they only perform visual trimming, i.e. the page is displayed as expected but the data is actually still in the file:

Code:

cpdf.exe -crop "0 0 400pt 600pt" input.pdf 1-50 -o output.pdf input.pdf

pdfcpu.exe box add -- "media:[0 0 400 600]" input.pdf output.pdf

mutool.exe trim -b mediabox -o output.pdf input.pdf

Is there a tool, preferably open-source, that supports hard-trimming?

Thank you.

I'm not sure I understand. Are you looking to cut out pieces of text from the pdf, or just select a new dimension?

Shohreh · 05-14-2023, 05:41 PM

Cutting out.

Looks like adding and applying redaction annotations is the way to go.

The commands above do work… but only on the screen: The data's still in the file, and thus included in the EPUB generated by Calibre.

Kromaa · 05-16-2023, 02:09 PM

Quote:

Originally Posted by Shohreh

Cutting out.

Looks like adding and applying redaction annotations is the way to go.

The commands above do work… but only on the screen: The data's still in the file, and thus included in the EPUB generated by Calibre.

Have you tried resizing the code to make it converge with what you are explaining? I think the problem might be that you have a slightly old device. Or maybe you didn't upgrade... I'd really like to know how it's going

