09-18-2009, 04:26 PM | #1 |
creator of calibre
Posts: 44,564
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
PDF samples
Hi all,
I'm starting work on a new PDF conversion engine for calibre that will hopefully handle header and footer extraction, and multiple-column extraction as well. I'm asking for a few sample PDF files that I can use as a test corpus. I'd appreciate it if you could extract a few pages with different typographical features and make a new PDF file from them. Note that this new engine will not handle mathematics, tables, vector diagrams, etc., so don't provide samples for those. Also, this is a bit of a long-term project, so don't expect results too quickly. |
09-18-2009, 05:09 PM | #2 |
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
Would you prefer a single PDF with extracts from different works, or separate PDFs?
|
09-18-2009, 05:11 PM | #3 |
creator of calibre
Posts: 44,564
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Separate PDFs, as the algorithm is likely to take overall document structure into account as well. I just don't want very large PDFs, and if they're copyrighted, it's best to extract only a small subsection.
|
09-18-2009, 05:32 PM | #4 |
Grand Sorcerer
Posts: 5,185
Karma: 25133758
Join Date: Nov 2008
Location: SF Bay Area, California, USA
Device: Pocketbook Touch HD3 (Past: Kobo Mini, PEZ, PRS-505, Clié)
|
Sample PDFs, all Creative Commons or other ok-to-distribute files.
They range from simple mainstream novel PDFs to nightmarish magazine formatting. Some have pictures; some have links.
Metadata's likely all over the place. Some have it; some don't. Last edited by Elfwreck; 09-18-2009 at 05:37 PM. |
09-18-2009, 05:34 PM | #5 |
creator of calibre
Posts: 44,564
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Thanks, will come in handy.
|
09-19-2009, 09:30 AM | #6 |
neilmarr
Posts: 7,215
Karma: 6000059
Join Date: Apr 2009
Location: Monaco-Menton, France
Device: sony
|
Hello there, Kovid: I just sent you an email, but in case it gets lost: I head up the editorial team at a small indie publishing house that has issued PDF versions of all its paperback novels for the past eight years (we have about 120 titles at www.bewrite.net). If you think it would be of any help, please drop me a line and I'll send as many PDF ebooks as you might need to experiment with. Most are straight text, all are well formatted, some use more than one font, and a few carry inside illustrations. All have front covers.
Cheers and good luck. Neil Last edited by neilmarr; 09-19-2009 at 09:31 AM. Reason: to check my email address was included in sig line |
09-20-2009, 04:20 PM | #7 |
Connoisseur
Posts: 58
Karma: 12
Join Date: Jan 2009
Device: none
|
Here's my contribution: a book I've tried several times to convert, using xpath to remove headers/footers. I've left in just a few sections with different features. If you want the full book, it's here. It's legal to distribute, too, I think.
I'm really happy you're finally trying to do this, but I'd still like it if you kept the option to specify what header/footer to remove. |
09-20-2009, 04:48 PM | #8 | |
Guru
Posts: 971
Karma: 4999999
Join Date: Mar 2009
Location: Rosario, Argentina
Device: SONY PRS-T2, Kindle Paperwhite 11th gen
|
Quote:
|
|
09-20-2009, 04:54 PM | #9 |
creator of calibre
Posts: 44,564
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
@darkmonk: Well, the new engine won't use text content, but rather position on the page, to detect headers and footers. And I will probably remove the current remove header/footer option and replace it with a generic "remove content" option.
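To illustrate the position-based idea, here is a minimal sketch. It is not calibre's code: the `Block` type, the `find_margin_blocks` function, and all thresholds are hypothetical, and it assumes text blocks have already been extracted with their vertical coordinates. A block is flagged when it lies in the top or bottom band of the page and a block at roughly the same height appears on most pages, whatever its text says.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical text block; coordinates are measured from the top of the page."""
    page: int
    top: float
    bottom: float
    text: str


def find_margin_blocks(blocks, page_height, band=0.1, min_fraction=0.6, tolerance=2.0):
    """Return blocks that look like headers/footers judged by position alone.

    band:         fraction of the page height treated as header/footer zone
    min_fraction: share of pages that must have a block at the same height
    tolerance:    bucket size used to treat nearby heights as equal
    """
    pages = {b.page for b in blocks}
    if not pages:
        return []

    def in_band(b):
        # Entirely inside the top band, or starting inside the bottom band.
        return b.bottom <= page_height * band or b.top >= page_height * (1 - band)

    def bucket(b):
        return round(b.top / tolerance)

    candidates = [b for b in blocks if in_band(b)]

    # Count how many distinct pages have a candidate in each height bucket.
    pages_per_bucket = Counter(k for k, _ in {(bucket(b), b.page) for b in candidates})

    threshold = min_fraction * len(pages)
    return [b for b in candidates if pages_per_bucket[bucket(b)] >= threshold]


if __name__ == "__main__":
    sample = []
    for p in range(3):
        sample.append(Block(p, 20, 30, f"Running title {p}"))  # header, text varies
        sample.append(Block(p, 100, 700, "body text"))
        sample.append(Block(p, 760, 770, str(p + 1)))           # page number
    for b in find_margin_blocks(sample, page_height=792):
        print(b.page, b.text)
```

Because the text is never compared, a page number that changes on every page is still caught, which is something a text-matching rule has to spell out explicitly.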
|
09-20-2009, 05:58 PM | #10 |
Sigil & calibre developer
Posts: 2,487
Karma: 1063785
Join Date: Jan 2009
Location: Florida, USA
Device: Nook STR
|
It might be a good idea to keep it. Just redefine it as "remove content", since it can actually match any text in the document. This could be helpful for poor PDF conversions, for instance, where the headers and footers are interspersed in the middle of the text in, say, Mobi or Epub files.
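A rough sketch of what such a generic "remove content" rule could look like, assuming the rule is a regular expression applied to the converted text. The `remove_content` function and the sample strings are made up for illustration and are not calibre's actual option.

```python
import re


def remove_content(text, pattern):
    """Delete every match of a user-supplied regular expression.

    re.MULTILINE lets ^ and $ anchor at line boundaries, so a rule can
    target whole lines such as stray running headers or page numbers.
    """
    return re.sub(pattern, "", text, flags=re.MULTILINE)


if __name__ == "__main__":
    # A badly converted passage with a footer and header stuck mid-sentence.
    converted = (
        "The first half of a sentence\n"
        "Page 12\n"
        "My Book Title\n"
        "continues after the page break.\n"
    )
    rule = r"^(?:Page \d+|My Book Title)\n"
    print(remove_content(converted, rule))
```

Since the pattern is matched against the text and not tied to a page position, the same rule works on files where the header and footer have already been flowed into the body.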
|
09-20-2009, 06:13 PM | #11 |
Liseuse Lover
Posts: 869
Karma: 1035404
Join Date: Jul 2008
Location: Netherlands
Device: PRS-505
|
|
09-20-2009, 06:25 PM | #12 | |
creator of calibre
Posts: 44,564
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
|
|
09-22-2009, 11:27 AM | #13 |
Connoisseur
Posts: 89
Karma: 205
Join Date: Jul 2006
Location: Upstate NY
Device: Rocket eBook & Sony Reader
|
Hi Kovid,
Here is something with columns for you to work on. Kozak |
09-23-2009, 07:40 AM | #14 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
some pdf to work on
Here is a pdf from the UK, the early history of their air force. I have tried and tried to convert this, and everything I did was unsuccessful without endless manual work, so good luck!
|
|