06-28-2008, 08:51 AM | #1 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2008
Device: iRex iLiad
|
Yet another PDF cropping tool
Hi,
This is my first post and I thought I'd share a script I use for cropping PDFs, so that they'll display better on my iLiad. I'm pretty sure a lot of people on here have done this or something similar, but I created the script a month or two ago after finding that some journal papers didn't want to be cropped using the normal pdfcrop tool. The script is based heavily on the cropping section of the example on the pyPdf homepage. To run the script, you will need Python and pyPdf (available for most Linux distros)... With that installed, just copy the code below into a file and make it executable. Code:
#! /usr/bin/python
import getopt, sys
from pyPdf import PdfFileWriter, PdfFileReader

def usage():
    print """sjvr767\'s PDF Cropping Script.
Example:
my_pdf_crop.py -s -p 0.5 -i input.pdf -o output.pdf
my_pdf_crop.py --skip --percent 0.5 --input input.pdf --output output.pdf
\n
REQUIRED OPTIONS:
-p\t--percent The factor by which to crop. Must be positive and less than or equal to 1.
-i\t--input The path to the file to be cropped.
\n
OPTIONAL:
-s\t--skip Skip the first page. Output file will not contain the first page of the input file.
-o\t--output Specify the name and path of the output file. If none specified, the script appends \'cropped\' to the file name.
"""
    sys.exit(0)

def cut_length(dictionary, key, factor):
    # The cut is a quarter of the "removed" fraction of the given coordinate.
    # float() guards against Decimal-based values coming out of pyPdf.
    cut_factor = 1-factor
    cut = float(dictionary[key])*cut_factor
    cut = cut / 4
    return cut

def new_coords(dictionary, key, cut):
    return abs(float(dictionary[key])-cut)

try:
    opts, args = getopt.getopt(sys.argv[1:], "sp:i:o:", ["skip", "percent=", "input=", "output="])
except getopt.GetoptError, err:
    # print help information and exit:
    print str(err) # will print something like "option -a not recognized"
    usage()
    sys.exit(2)

skipone = 0
for a in opts[:]:
    if a[0] == '-s' or a[0]=='--skip':
        skipone = 1

factor = 0.8 #default scaling factor
for a in opts[:]:
    if a[0] == '-p' or a[0]=='--percent':
        try:
            factor = float(a[1])
        except ValueError: # float() raises ValueError for non-numeric text
            print "Factor must be a number."
            sys.exit(2)

input_file = None #no default input file
for a in opts[:]:
    if a[0] == '-i' or a[0]=='--input':
        if a[1][-4:]=='.pdf':
            input_file = a[1]
        else:
            print "Input file must be a PDF."
            sys.exit(2) #exit if no appropriate input file
if input_file is None:
    print "Please specify an input file."
    sys.exit(2) #exit if no appropriate input file

output_file = "%s_cropped.pdf" %input_file[:-4] #default output
for a in opts[:]:
    if a[0] == '-o' or a[0]=='--output':
        if a[1][-4:]=='.pdf':
            output_file = a[1]
        else:
            print "Output file must be a PDF." # falls back to the default name

input1 = PdfFileReader(file(input_file, "rb"))
output = PdfFileWriter()
outputstream = file(output_file, "wb")
pages = input1.getNumPages()

# Page indices start at 0, so getPage(1) is the second page. Its size is
# used for every page, which means a one-page PDF will fail here.
ref = input1.getPage(1).mediaBox
top_right = {'x': ref.getUpperRight_x(), 'y': ref.getUpperRight_y()}
top_left = {'x': ref.getUpperLeft_x(), 'y': ref.getUpperLeft_y()}
bottom_right = {'x': ref.getLowerRight_x(), 'y': ref.getLowerRight_y()}
bottom_left = {'x': ref.getLowerLeft_x(), 'y': ref.getLowerLeft_y()}

cut = cut_length(top_right, 'x', factor)
new_tr = (new_coords(top_right, 'x', cut), new_coords(top_right, 'y', cut))
new_br = (new_coords(bottom_right, 'x', cut), new_coords(bottom_right, 'y', cut))
new_tl = (new_coords(top_left, 'x', cut), new_coords(top_left, 'y', cut))
new_bl = (new_coords(bottom_left, 'x', cut), new_coords(bottom_left, 'y', cut))

first = 0
if skipone == 1:
    first = 1 # leave out the first page
for i in range(first, pages):
    page = input1.getPage(i)
    page.mediaBox.upperLeft = new_tl
    page.mediaBox.upperRight = new_tr
    page.mediaBox.lowerLeft = new_bl
    page.mediaBox.lowerRight = new_br
    output.addPage(page)

output.write(outputstream)
outputstream.close()
Code:
./my_pdfcrop.py -p 0.8 -i input.pdf |
08-16-2008, 12:08 PM | #2 |
Zealot
Posts: 119
Karma: 603
Join Date: May 2008
Location: Oslo, Norway
Device: irex iliad
|
This is interesting. Could you add the code as a file attachment?
|
08-24-2008, 07:39 AM | #3 |
Zealot
Posts: 119
Karma: 603
Join Date: May 2008
Location: Oslo, Norway
Device: irex iliad
|
I have now tried to crop a pdf, but it doesn't crop the left side of the document. Furthermore, it takes some time guessing the correct percentage.
|
08-29-2008, 05:21 AM | #4 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2008
Device: iRex iLiad
|
Hi,
Sorry that I haven't put up a file yet, I have a few other things that need to be addressed immediately (i.e. my dissertation). Yes, the percentage is tricky (since it isn't exactly a true percentage). Personally, I only use this script on files which do not crop using Heiko Oberdiek's "pdfcrop" (available on most Linux systems via the command "pdfcrop"). These tend to be papers from JSTOR, hence the skip first page option. Thank you for trying it though, I will try my best to address the "left-crop" issue as soon as I have time. |
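To make the "not exactly a true percentage" remark concrete, here is a minimal sketch of the arithmetic behind -p, pulled out of the script into standalone functions. The helper name `cropped_size`, the assumption that the page origin is (0, 0), and the 612 by 792 point page are illustrative assumptions, not part of the script. It trims the same cut from all four sides, as the later versions of the script do, and is written so that it also runs on Python 3.

```python
def cut_length(width, factor):
    """Same arithmetic as the script: a quarter of the 'removed' width."""
    return float(width) * (1 - factor) / 4


def cropped_size(width, height, factor):
    """Return (new_width, new_height) after trimming `cut` from all four sides.

    Hypothetical helper for illustration; assumes the page origin is (0, 0).
    """
    cut = cut_length(width, factor)
    return width - 2 * cut, height - 2 * cut


if __name__ == "__main__":
    w, h = 612.0, 792.0  # assumed US Letter page, in points
    for factor in (1.0, 0.8, 0.5):
        new_w, new_h = cropped_size(w, h, factor)
        print("factor %.1f: width x%.3f, height x%.3f"
              % (factor, new_w / w, new_h / h))
```

With a factor of 0.8 the page keeps 90% of its width, since 1 - 2 × (0.2 / 4) = 0.9. The height loses the same number of points rather than the same fraction, which is part of why guessing the right value takes a few tries.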
09-23-2008, 12:07 PM | #5 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2008
Device: iRex iLiad
|
Quote:
Before I give the code, I'd like to say that when I get time I will do a proper update of this. There are a few features I want to implement, such as splitting pages in half and then scaling those to A4. That should enlarge the doc quite a bit. Here is the code: Code:
#! /usr/bin/python
import subprocess
import getopt, sys
import find_lines # helper module that is not included in this post
from pyPdf import PdfFileWriter, PdfFileReader

def usage():
    print """sjvr767\'s PDF Cropping Script.
Example:
my_pdf_crop.py -s -p 0.5 -i input.pdf -o output.pdf
my_pdf_crop.py --skip --percent 0.5 --input input.pdf --output output.pdf
\n
REQUIRED OPTIONS:
-p\t--percent The factor by which to crop. Must be positive and less than or equal to 1.
-i\t--input The path to the file to be cropped.
\n
OPTIONAL:
-s\t--skip Skip the first page. Output file will not contain the first page of the input file.
-o\t--output Specify the name and path of the output file. If none specified, the script appends \'cropped\' to the file name.
"""
    sys.exit(0)

def cut_length(dictionary, key, factor):
    cut_factor = 1-factor
    cut = float(dictionary[key])*cut_factor
    cut = cut / 4
    return cut

def new_coords(dictionary, key, cut):
    return abs(float(dictionary[key])-cut)

def new_coords2(ty, lx, rx, by, cut):
    # Move every edge inwards by the same cut.
    new_ty = ty - cut
    new_by = by + cut
    new_lx = lx + cut
    new_rx = rx - cut
    top_left = {'x': new_lx, 'y': new_ty}
    bottom_left = {'x': new_lx, 'y': new_by}
    bottom_right = {'x': new_rx, 'y': new_by}
    top_right = {'x': new_rx, 'y': new_ty}
    return {'tr': top_right, 'tl': top_left, 'bl': bottom_left, 'br': bottom_right}

try:
    opts, args = getopt.getopt(sys.argv[1:], "sp:i:o:ch", ["skip", "percent=", "input=", "output=", "column", "half"])
except getopt.GetoptError, err:
    # print help information and exit:
    print str(err) # will print something like "option -a not recognized"
    usage()
    sys.exit(2)

skipone = 0
for a in opts[:]:
    if a[0] == '-s' or a[0]=='--skip':
        skipone = 1

factor = 0.8 #default scaling factor
for a in opts[:]:
    if a[0] == '-p' or a[0]=='--percent':
        try:
            factor = float(a[1])
        except ValueError: # float() raises ValueError for non-numeric text
            print "Factor must be a number."
            sys.exit(2)

input_file = None #no default input file
for a in opts[:]:
    if a[0] == '-i' or a[0]=='--input':
        if a[1][-4:]=='.pdf':
            input_file = a[1]
        else:
            print "Input file must be a PDF."
            sys.exit(2) #exit if no appropriate input file
if input_file is None:
    print "Please specify an input file."
    sys.exit(2) #exit if no appropriate input file

output_file = "%s_cropped.pdf" %input_file[:-4] #default output
for a in opts[:]:
    if a[0] == '-o' or a[0]== '--output':
        if a[1][-4:]=='.pdf':
            output_file = a[1]
        else:
            print "Output file must be a PDF." # falls back to the default name

col = 0
for a in opts[:]:
    if a[0] == '-c' or a[0]=='--column':
        col = 1

half = 0
for a in opts[:]:
    if a[0] == '-h' or a[0]=='--half':
        half = 1

input1 = PdfFileReader(file(input_file, "rb"))
output = PdfFileWriter()
outputstream = file(output_file, "wb")
pages = input1.getNumPages()

# Page indices start at 0, so getPage(1) is the second page.
ref = input1.getPage(1).mediaBox
top_right = {'x': ref.getUpperRight_x(), 'y': ref.getUpperRight_y()}
ty = float(ref.getUpperLeft_y())
lx = float(ref.getUpperLeft_x())
rx = float(ref.getLowerRight_x())
by = float(ref.getLowerRight_y())
print ty, lx, rx, by

cut = cut_length(top_right, 'x', factor)
newCoords = new_coords2(ty, lx, rx, by, cut)
new_tr = (newCoords['tr']['x'], newCoords['tr']['y'])
new_tl = (newCoords['tl']['x'], newCoords['tl']['y'])
new_br = (newCoords['br']['x'], newCoords['br']['y'])
new_bl = (newCoords['bl']['x'], newCoords['bl']['y'])
print new_tl[1], new_tl[0], new_bl[1], new_bl[0]

if skipone == 0 and col == 0 and half == 0:
    for i in range(0, pages):
        page = input1.getPage(i)
        page.mediaBox.upperLeft = new_tl
        page.mediaBox.upperRight = new_tr
        page.mediaBox.lowerLeft = new_bl
        page.mediaBox.lowerRight = new_br
        output.addPage(page)
elif skipone == 0 and col == 0 and half == 1:
    for i in range(0, pages-2):
        page = input1.getPage(i)
        page.mediaBox.upperLeft = new_tl
        page.mediaBox.upperRight = new_tr
        page.mediaBox.lowerLeft = new_bl
        page.mediaBox.lowerRight = new_br
        # Write the cropped page to a temporary PDF and render it to an
        # image with ImageMagick's convert, so a cut line can be searched for.
        temp_output = PdfFileWriter()
        temp_output.addPage(page)
        tos = file("temp.pdf", "wb")
        temp_output.write(tos)
        tos.close()
        cmd = 'convert temp.pdf -density 8400 -colorspace Gray -contrast -contrast -contrast -colors 16 temp.gif'
        subprocess.call(cmd, shell=True)
        height = find_lines.find_hline('temp.gif', 5, 80)
        # Upper half of the page.
        page1 = input1.getPage(i)
        page1.mediaBox.upperLeft = new_tl
        page1.mediaBox.upperRight = new_tr
        page1.mediaBox.lowerLeft = (new_tl[0], new_tl[1]-height)
        page1.mediaBox.lowerRight = (new_tr[0], new_tr[1]-height)
        output.addPage(page1)
        # Lower half of the page. Note: this appears to be the same page
        # object as page1, so these assignments may overwrite the box above.
        page2 = input1.getPage(i)
        page2.mediaBox.upperLeft = (new_tl[0], new_tl[1]-height)
        page2.mediaBox.upperRight = (new_tr[0], new_tr[1]-height)
        page2.mediaBox.lowerLeft = new_bl
        page2.mediaBox.lowerRight = new_br
        output.addPage(page2)
elif skipone == 1 and col == 0 and half == 0:
    for i in range(1, pages):
        page = input1.getPage(i)
        page.mediaBox.upperLeft = new_tl
        page.mediaBox.upperRight = new_tr
        page.mediaBox.lowerLeft = new_bl
        page.mediaBox.lowerRight = new_br
        output.addPage(page)

output.write(outputstream)
outputstream.close()
|
09-23-2008, 12:21 PM | #6 |
Addict
Posts: 350
Karma: 705
Join Date: Dec 2006
Location: Mumbai, India
Device: Kindle 1/REB 1200
|
sjvr767: I was going to work on this very idea, but you beat me to it.
First, there are two very good projects which already implement this: pdfcrop and pdfcrop.pl (the latter has a very good fork at pdfcrop2). All of them have the same disadvantage: they detect the bounding box using ghostscript (which is very good and accurate) but then they don't update the PDF in-place: they re-create the PDF using pdftex or other software. I'd already done a proof-of-concept that it worked using pyPdf [I've contributed to it in the past] but other projects (notably ebookutils) took my time. Would you be interested in taking it further using gs? The command line to generate a bbox is Code:
gs -dBATCH -dSAFER -dNOPAUSE -dUseCropBox -sDEVICE=bbox <input.pdf>
EDIT: just saw that you posted this much earlier. My apologies. |
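As a rough sketch of how the gs suggestion could feed a pyPdf-based cropper: the bbox device reports one `%%BoundingBox:` and one `%%HiResBoundingBox:` line per page, and those lines can be parsed into numbers. The function names `parse_bboxes` and `run_gs_bbox` are made up for this example, and the sample text in the usage part is invented, not real output from any particular PDF. It is written so that it also runs on Python 3.

```python
import re
import subprocess

# One HiResBoundingBox line per page: left, bottom, right, top in points.
BBOX_RE = re.compile(
    r"^%%HiResBoundingBox:\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)\s+(-?[\d.]+)",
    re.MULTILINE,
)


def parse_bboxes(gs_output):
    """Return one (left, bottom, right, top) tuple of floats per page."""
    return [tuple(float(v) for v in m.groups())
            for m in BBOX_RE.finditer(gs_output)]


def run_gs_bbox(pdf_path):
    """Run the command from the post and return everything gs printed.

    Assumes `gs` is on the PATH. stdout and stderr are merged so it does
    not matter which stream the bbox lines arrive on.
    """
    cmd = ["gs", "-dBATCH", "-dSAFER", "-dNOPAUSE", "-dUseCropBox",
           "-sDEVICE=bbox", pdf_path]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return out.decode("ascii", "replace")


if __name__ == "__main__":
    # Invented sample text in the shape the bbox device prints.
    sample = ("%%BoundingBox: 71 88 541 721\n"
              "%%HiResBoundingBox: 71.999998 88.199997 540.599984 720.089978\n")
    print(parse_bboxes(sample))
```

Each tuple could then be written to that page's mediaBox corners in the same way the cropping script sets them, which would avoid guessing a factor at all.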
09-28-2008, 12:29 PM | #7 | |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2008
Device: iRex iLiad
|
Quote:
The main thing I want to do is create a programme that will take a page and cut it into two pages. Those "half-pages" can then be rescaled using a PDF printer (probably as landscape)... Think how well that will display on the iLiad! Half an A4 page is about the same size as the iLiad's screen, so it should work nicely.
The trick, however, is to cut the page in such a way that you do not cut through a sentence. I tried converting PDF pages to images and then using the Python Imaging Library to analyze the color composition of areas at and near the middle. If the area was mostly white, then it was fine to cut there... It almost worked, but the results were quite inconsistent. Some pages were cut cleanly near the middle, others were cut a third of the way down, etc.
The idea that I have now is to export each page of the PDF as an SVG file. Since SVG is an XML-based format, one can then simply copy elements with y coordinates above or below a certain value to separate SVG files. Then print those files as PDFs, and merge all of them back into one PDF.
Unfortunately I haven't had the time to really sit and code the above. I've never worked with parsing XML in Python, so I first have to learn how to do that... Any suggestions would be welcome. BTW, I'm not a programmer... I only code for fun. |
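For what it's worth, the "cut where the page is mostly white" check described above can be sketched without any imaging library, working on a plain list of pixel rows. This is not the code from the earlier attempt; the function name `find_cut_row`, the whiteness threshold of 250 and the minimum gap of 3 rows are all assumptions chosen for the example. Getting the pixel rows out of a rendered page (for example with the Python Imaging Library) is left out.

```python
def find_cut_row(rows, white=250, min_gap=3):
    """Return the row index at which to split the page, or None.

    `rows` is a list of rows, each a list of grayscale values (0-255).
    A row counts as blank when every pixel is >= `white`. Among runs of
    at least `min_gap` blank rows, the one whose centre lies closest to
    the vertical middle of the page wins.
    """
    middle = len(rows) / 2.0
    best = None   # (distance from middle, centre row)
    start = None  # first row of the blank run currently being scanned
    # A non-blank sentinel row closes any run that reaches the bottom.
    for y, row in enumerate(list(rows) + [[0]]):
        blank = all(v >= white for v in row)
        if blank and start is None:
            start = y
        elif not blank and start is not None:
            if y - start >= min_gap:
                centre = (start + y - 1) // 2
                candidate = (abs(centre - middle), centre)
                if best is None or candidate < best:
                    best = candidate
            start = None
    return None if best is None else best[1]


if __name__ == "__main__":
    # Toy "page": 20 rows of 4 pixels with three blank bands.
    blank_rows = set([4, 5, 6, 9, 10, 11, 12, 17, 18, 19])
    page = [[255] * 4 if y in blank_rows else [0, 255, 0, 255]
            for y in range(20)]
    print(find_cut_row(page))
```

Requiring a run of several blank rows, rather than a single one, is one way to avoid cutting in the thin gap between two lines of the same paragraph. Whether that would fix the inconsistent results mentioned above is not something this sketch can show.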
|
02-14-2009, 08:04 AM | #8 |
Junior Member
Posts: 5
Karma: 10
Join Date: Jun 2008
Device: iRex iLiad
|
small update
Hi there,
Sorry this has taken so long. I have a lot going on in my life right now, but I managed to do a bit of code clean-up (not much) and I added the ability to specify manual cropping in addition to the proportional cropping. Therefore, you can now tweak the cropping slightly. For example, I have a paper called "systemic_risk.pdf", Code:
./my_crop.py -s -p 0.7 -i systemic_risk.pdf -o systemic_risk2.pdf
Code:
./my_crop.py -s -p 0.7 -i systemic_risk.pdf -o systemic_risk2.pdf -m "15 50 0 50"
BTW, the script now outputs the dimensions of the first page... You can use that in order to give yourself an idea as to how much to crop manually. Also, you can do pure manual cropping by specifying -p 1. If you have pyPdf, just cut and paste the following code into a file called "my_crop.py" and make it executable: Code:
#! /usr/bin/python
import getopt, sys
from pyPdf import PdfFileWriter, PdfFileReader

def usage():
    print """sjvr767\'s PDF Cropping Script.
Example:
my_pdf_crop.py -s -p 0.5 -i input.pdf -o output.pdf
my_pdf_crop.py --skip --percent 0.5 --input input.pdf --output output.pdf
\n
REQUIRED OPTIONS:
-p\t--percent The factor by which to crop. Must be positive and less than or equal to 1.
-i\t--input The path to the file to be cropped.
\n
OPTIONAL:
-s\t--skip Skip the first page. Output file will not contain the first page of the input file.
-o\t--output Specify the name and path of the output file. If none specified, the script appends \'cropped\' to the file name.
-m\t--margin Specify additional absolute cropping, for fine tuning results.
\t-m "left top right bottom"
"""
    sys.exit(0)

def cut_length(dictionary, key, factor):
    cut_factor = 1-factor
    cut = float(dictionary[key])*cut_factor
    cut = cut / 4
    return cut

def new_coords(dictionary, key, cut, margin, code = "tl"):
    # code names the corner: "tl", "tr", "bl", anything else is bottom right.
    # Left and bottom edges move up/right (+), right and top edges move
    # down/left (-), each by the proportional cut plus the manual margin.
    if code == "tl":
        if key == "x":
            return abs(float(dictionary[key])+(cut+margin["l"]))
        else:
            return abs(float(dictionary[key])-(cut+margin["t"]))
    elif code == "tr":
        if key == "x":
            return abs(float(dictionary[key])-(cut+margin["r"]))
        else:
            return abs(float(dictionary[key])-(cut+margin["t"]))
    elif code == "bl":
        if key == "x":
            return abs(float(dictionary[key])+(cut+margin["l"]))
        else:
            return abs(float(dictionary[key])+(cut+margin["b"]))
    else:
        if key == "x":
            return abs(float(dictionary[key])-(cut+margin["r"]))
        else:
            return abs(float(dictionary[key])+(cut+margin["b"]))

try:
    opts, args = getopt.getopt(sys.argv[1:], "sp:i:o:m:", ["skip", "percent=", "input=", "output=", "margin="])
except getopt.GetoptError, err:
    # print help information and exit:
    print str(err) # will print something like "option -a not recognized"
    usage()
    sys.exit(2)

skipone = 0
for a in opts[:]:
    if a[0] == '-s' or a[0]=='--skip':
        skipone = 1

factor = 0.8 #default scaling factor
for a in opts[:]:
    if a[0] == '-p' or a[0]=='--percent':
        try:
            factor = float(a[1])
        except ValueError: # float() raises ValueError for non-numeric text
            print "Factor must be a number."
            sys.exit(2)

input_file = None #no default input file
for a in opts[:]:
    if a[0] == '-i' or a[0]=='--input':
        if a[1][-4:]=='.pdf':
            input_file = a[1]
        else:
            print "Input file must be a PDF."
            sys.exit(2) #exit if no appropriate input file
if input_file is None:
    print "Please specify an input file."
    sys.exit(2) #exit if no appropriate input file

output_file = "%s_cropped.pdf" %input_file[:-4] #default output
for a in opts[:]:
    if a[0] == '-o' or a[0]=='--output':
        if a[1][-4:]=='.pdf':
            output_file = a[1]
        else:
            print "Output file must be a PDF." # falls back to the default name

margin = {"l": 0, "t": 0, "r": 0, "b": 0}
for a in opts[:]:
    if a[0] == '-m' or a[0]=='--margin':
        if a[1]!= None:
            m_temp = a[1].strip("\"").split()
            margin["l"] = float(m_temp[0])
            margin["t"] = float(m_temp[1])
            margin["r"] = float(m_temp[2])
            margin["b"] = float(m_temp[3])
        else:
            print "Error"

input1 = PdfFileReader(file(input_file, "rb"))
output = PdfFileWriter()
outputstream = file(output_file, "wb")
pages = input1.getNumPages()

# Page indices start at 0, so getPage(1) is the second page. Its size is
# used for every page, which means a one-page PDF will fail here.
ref = input1.getPage(1).mediaBox
top_right = {'x': ref.getUpperRight_x(), 'y': ref.getUpperRight_y()}
top_left = {'x': ref.getUpperLeft_x(), 'y': ref.getUpperLeft_y()}
bottom_right = {'x': ref.getLowerRight_x(), 'y': ref.getLowerRight_y()}
bottom_left = {'x': ref.getLowerLeft_x(), 'y': ref.getLowerLeft_y()}
print('Page dim.\t%f by %f' %(top_right['x'], top_right['y']))

cut = cut_length(top_right, 'x', factor)
new_tr = (new_coords(top_right, 'x', cut, margin, code = "tr"), new_coords(top_right, 'y', cut, margin, code = "tr"))
new_br = (new_coords(bottom_right, 'x', cut, margin, code = "br"), new_coords(bottom_right, 'y', cut, margin, code = "br"))
new_tl = (new_coords(top_left, 'x', cut, margin, code = "tl"), new_coords(top_left, 'y', cut, margin, code = "tl"))
new_bl = (new_coords(bottom_left, 'x', cut, margin, code = "bl"), new_coords(bottom_left, 'y', cut, margin, code = "bl"))

first = 0
if skipone == 1:
    first = 1 # leave out the first page
for i in range(first, pages):
    page = input1.getPage(i)
    page.mediaBox.upperLeft = new_tl
    page.mediaBox.upperRight = new_tr
    page.mediaBox.lowerLeft = new_bl
    page.mediaBox.lowerRight = new_br
    output.addPage(page)

output.write(outputstream)
outputstream.close()
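To see what a given -m value does before running it on a real file, the margin handling in the script above can be sketched in isolation. The helper names `parse_margin` and `crop_box` and the 612 by 792 point page are assumptions for this example. The arithmetic follows the script's `new_coords`, minus the `abs()` calls, and the sketch also runs on Python 3.

```python
def parse_margin(text):
    """Parse a "left top right bottom" string into a dict of floats."""
    parts = text.strip('"').split()
    if len(parts) != 4:
        raise ValueError("expected four numbers: left top right bottom")
    left, top, right, bottom = (float(p) for p in parts)
    return {"l": left, "t": top, "r": right, "b": bottom}


def crop_box(box, factor, margin):
    """Return the cropped (left, bottom, right, top) box.

    `box` is (left, bottom, right, top) in points. The proportional cut
    is derived from the right-hand x coordinate, as in the script, and
    the manual margins are added on top of it.
    """
    left, bottom, right, top = box
    cut = float(right) * (1 - factor) / 4
    return (left + cut + margin["l"],
            bottom + cut + margin["b"],
            right - cut - margin["r"],
            top - cut - margin["t"])


if __name__ == "__main__":
    page = (0, 0, 612, 792)  # assumed page size, in points
    # -p 1 disables the proportional cut, leaving pure manual cropping.
    print(crop_box(page, 1.0, parse_margin("15 50 0 50")))
    print(crop_box(page, 0.7, parse_margin("15 50 0 50")))
```

Unlike the script, `parse_margin` rejects input that does not contain exactly four numbers instead of failing later with an IndexError; that check is an addition of this sketch.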
|