PDF samples

kovidgoyal · 09-18-2009, 04:26 PM

Hi all,

I'm starting work on a new PDF conversion engine for calibre that will hopefully handle header and footer extraction and multiple column extraction as well.

I'm asking for a few sample PDF files that I can use as a test corpus. I'd appreciate it if you could just extract a few pages with different typographical features and make a new PDF file with them.

Note that this new engine will not handle mathematics/tables/vector diagrams, etc. so don't provide samples for those.

Also this is a bit of a long term project, so don't expect results too quickly.

Elfwreck · 09-18-2009, 05:09 PM

Would you prefer a single PDF with extracts from different works, or separate PDFs?

kovidgoyal · 09-18-2009, 05:11 PM

Separate PDFs, as the algorithm is likely to take into account overall document structure as well. I just don't want very large PDFs, and if they're copyrighted, it's best to just extract a small subsection.

Elfwreck · 09-18-2009, 05:32 PM

Sample PDFs, all Creative Commons or other ok-to-distribute files.

They range from simple mainstream novel PDFs to nightmarish magazine formatting. Some have pictures; some have links.

ANAT annual report 2008: Colored text, multi-column layout. (CC)
James Boyle's The Public Domain: Changes in margins & leading; lists (CC)
Helen Keller's essay I Learned to Speak: Non-crucial font & margin changes; should convert well. (PD text)
Lenz v Universal: standard legal document layout (PD)
Lowry Pei's For Adam: Novel with irregular page breaks & nonstandard headers; should convert well, might be worth noting how the spacing carried over. (CC)
TWC-1: Journal; columns with pictures (permission to share)
Wick's Houses of the Blooded: RPG game book; nightmarish; not expected to convert well at all. (permission to share)

Metadata's likely all over the place. Some have it; some don't.

kovidgoyal · 09-18-2009, 05:34 PM

Thanks, will come in handy.

neilmarr · 09-19-2009, 09:30 AM

Hello there, Kovid: I just sent you an email, but in case it gets lost: I head up the editorial team at a small indie publishing house that's covered all its paperback novels with PDF versions for the past eight years (we have about 120 titles at www.bewrite.net). If you think it would be of any help, please drop me a line and I'll send as many PDF ebooks as you might need to experiment with. Most are straight text, all well-formatted, some use more than one font and a few carry inside illustrations. All are front covered.

Cheers and good luck. Neil

darkmonk · 09-20-2009, 04:20 PM

Here's my contribution, a book I've tried several times to convert using xpath to remove headers/footers. Just left a few sections with different things. If you want the full book, it's here. It's legal to distribute, too, I think.
I'm really happy you're finally trying to do this, but I'd still like it if you kept the option to specify what header/footer to remove.

Pablo · 09-20-2009, 04:48 PM

Quote:

Originally Posted by darkmonk

Here's my contribution, a book I've tried several times to convert using xpath to remove headers/footers. Just left a few sections with different things. If you want the full book, it's here. It's legal to distribute, too, I think.
I'm really happy you're finally trying to do this, but I'd still like it if you kept the option to specify what header/footer to remove.

What an interesting book! I wonder if it is legal to reformat and redistribute it. It doesn't include any license information.

kovidgoyal · 09-20-2009, 04:54 PM

@darkmonk: Well, the new engine won't use text content, but rather position on page to detect headers and footers. And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.
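
For readers wondering what "position on page" detection could look like, here is a rough Python sketch. It is not calibre's code: the `Line` record, the function names, and the thresholds (`margin`, `min_share`, `tol`) are all made up for illustration. The idea is to flag vertical positions near the top or bottom edge that are occupied on most pages, regardless of what the text says.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical record for one extracted text line (not a calibre type)."""
    page: int    # zero-based page index
    top: float   # distance from the top of the page, in points
    text: str


def detect_margin_bands(lines, page_height, margin=0.12, min_share=0.6, tol=2.0):
    """Return bucketed vertical positions that recur in the page margins.

    A position counts as a header/footer band if a line sits there, inside the
    top or bottom `margin` fraction of the page, on at least `min_share` of
    the pages. All thresholds are illustrative guesses.
    """
    pages = {ln.page for ln in lines}
    # One entry per (bucket, page) so a page is never counted twice.
    seen = {
        (round(ln.top / tol), ln.page)
        for ln in lines
        if ln.top < page_height * margin or ln.top > page_height * (1 - margin)
    }
    per_bucket = Counter(bucket for bucket, _ in seen)
    return {b for b, n in per_bucket.items() if n >= min_share * len(pages)}


def strip_headers_footers(lines, page_height, tol=2.0, **kwargs):
    """Drop every line whose vertical position falls in a detected band."""
    bands = detect_margin_bands(lines, page_height, tol=tol, **kwargs)
    return [ln for ln in lines if round(ln.top / tol) not in bands]


# Tiny synthetic document: four pages, each with a running header and footer.
sample = []
for p in range(4):
    sample.append(Line(p, 40.0, f"Chapter {p + 1}"))       # header, text varies
    sample.append(Line(p, 200.0, f"body text {p}"))
    sample.append(Line(p, 770.0, f"{p + 1}"))              # page number footer
sample.append(Line(0, 50.0, "A one-off title near the top"))  # only on page 0

cleaned = strip_headers_footers(sample, page_height=800.0)
print([ln.text for ln in cleaned])
```

Note that the header text differs on every page, so a text-matching rule would miss it, while the one-off title near the top survives because it only appears on a single page.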

user_none · 09-20-2009, 05:58 PM

Quote:

Originally Posted by kovidgoyal

@darkmonk: Well, the new engine won't use text content, but rather position on page to detect headers and footers. And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.

It might be a good idea to keep it. Just re-define it as remove content since it can actually match any text in the document. This could be helpful for poor PDF conversion for instance where the headers and footers are interspersed in the middle of the text in say Mobi or Epub files.

acidzebra · 09-20-2009, 06:13 PM

Quote:

Originally Posted by kovidgoyal

And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.

I hope there will still be a "show advanced/all features" button or something like that.

kovidgoyal · 09-20-2009, 06:25 PM

Quote:

Originally Posted by user_none

It might be a good idea to keep it. Just re-define it as remove content since it can actually match any text in the document. This could be helpful for poor PDF conversion for instance where the headers and footers are interspersed in the middle of the text in say Mobi or Epub files.

Yeah that's what remove content will do
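
A minimal sketch of what a generic text-matching "remove content" pass might look like, assuming it is driven by regular expressions. The function name `remove_content` and the example pattern are hypothetical, not the actual calibre option or its syntax.

```python
import re


def remove_content(text, patterns):
    """Delete every match of each regex in `patterns` from `text`.

    Patterns are applied in order with re.MULTILINE, so `^` and `$` anchor
    at line boundaries. Purely illustrative; not calibre's implementation.
    """
    for pattern in patterns:
        text = re.sub(pattern, "", text, flags=re.MULTILINE)
    return text


# A footer that a poor conversion left stranded in the middle of a sentence.
converted = "It was a dark\nPage 12 of 300\nand stormy night.\n"
print(remove_content(converted, [r"^Page \d+ of \d+\n"]))
```

Because the match is on text rather than position, this also works on Mobi or Epub input where the original page layout is already gone.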

aleks · 09-22-2009, 11:27 AM

Hi Kovid,

Here is something with columns for you to work on.

Kozak

mrmikel · 09-23-2009, 07:40 AM

Here is a PDF from the UK, the early history of their air force. I have tried and tried to convert this, and everything I did was unsuccessful without endless manual work, so good luck!



