04-14-2008, 02:26 AM | #1 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
Teasing 2: extract snippets/tag PDFs
I figured Rio shouldn't be the only tease around here so here's my contribution. It's a proof of concept that seems to be working, but is a tease because it uses a language that few are likely to use here (it uses R, a scripting language I am more familiar with than any other so can get things done with it).
Several people have expressed a desire for a way to highlight or annotate journal articles etc directly on the iliad. OK, we can annotate and underline using scribbles, but we can't currently do much with the results, except just read them; there's no way to search them for example. Here's an approach that might get part way there. I can now mark up a PDF on the iliad using scribbles then run my script on the result (on my PC) and it will extract snippets of text that have been marked up. If I mark up the text with an L-shape the result is stored as a snippet; if I use an inverted L-shape the result is stored as a tag. See the attached PDF, which when processed produced the following results: Code:
SNIPPETS:
Europe has already said it will press the G7 to demand more disclosure from banks on their investments as the credit crunch spreads from the financial sector to the household and corporate sectors.
How bad will a whole? This n financial servi A housi property Square Mile, GDP IMF FS
TAGS:
chancellor
I use Code:
pdftohtml -xml
TODO:
Last edited by daudi; 04-14-2008 at 05:32 AM. Reason: Clarified point about annotation, etc. |
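The post does not say how the script tells an L-shape from an inverted L, so the following is only a rough sketch of one possible heuristic, not the author's method. It assumes a stroke is available as a list of (x, y) points in screen coordinates (y grows downward); the function name `classify_stroke` and the `edge_fraction` parameter are made up for illustration. The idea is that an L has its horizontal arm along the bottom of its bounding box, so most of its points sit near the bottom edge, while an inverted L has them near the top edge.

```python
def classify_stroke(points, edge_fraction=0.15):
    """Guess whether a stroke is an L ('snippet') or an inverted L ('tag').

    points: list of (x, y) tuples, y growing downward (an assumption).
    Returns None when the stroke is too short or ambiguous.
    """
    if len(points) < 3:
        return None
    ys = [y for _, y in points]
    top, bottom = min(ys), max(ys)
    height = bottom - top
    if height == 0:
        return None  # a flat line has no vertical arm
    band = height * edge_fraction
    # Count the points lying close to the top and bottom edges of the bounding box.
    near_top = sum(1 for y in ys if y - top <= band)
    near_bottom = sum(1 for y in ys if bottom - y <= band)
    if near_bottom > near_top:
        return "snippet"  # horizontal arm along the bottom: L-shape
    if near_top > near_bottom:
        return "tag"      # horizontal arm along the top: inverted L
    return None


# Hand-made strokes for illustration only.
l_shape = [(10, 10), (10, 30), (10, 50), (10, 70), (30, 70), (50, 70), (70, 70)]
inverted_l = [(10, 70), (10, 50), (10, 30), (10, 10), (30, 10), (50, 10), (70, 10)]
print(classify_stroke(l_shape), classify_stroke(inverted_l))
```

Counting points near the edges, rather than looking at drawing direction, means a mirrored L is treated the same way, but real scribbles are noisier than these samples and would need a more forgiving test.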
04-14-2008, 10:15 AM | #2 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Wow! Please keep us posted. I'm very interested in this functionality. And I know a Java programmer who might help.
I've used R, but I had no idea you could do stuff like this with it. |
04-14-2008, 11:57 AM | #3 |
Grand Sorcerer
Posts: 19,832
Karma: 11844413
Join Date: Jan 2007
Location: Tampa, FL USA
Device: Kindle Touch
|
Yes, it would be nice if all the publishers would put their content in DocBook XML (or another open format), which would make conversion to any format simple. I'm not sure why there is a pursuit of new "standards" when there are already 3 that I can think of, ODF and OpenXML being the other two.
BOb |
04-14-2008, 12:15 PM | #4 | |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
Quote:
The key thing is that the output of pdftohtml -xml is an xml file with the co-ordinates of each line of text. If there is a way this can be done using java alone then it will be easy to implement the rest and have the whole thing platform independent. I use R for my work almost every day (not necessarily very deeply) and so I'm more familiar with it than anything else. The R code to do this uses none of the things that make R special; I'm just using it as a scripting language. I've used R for all sorts of things unrelated to statistics. I suppose one good reason for using R here is that it is so easy to use its plotting facilities to plot the scribbles. This helped with debugging. |
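As a minimal sketch of that idea in Python (not the script from this thread), the XML can be read with the standard library and the text lines that fall inside a marked region picked out by their coordinates. The sample below is hand-written in the shape pdftohtml -xml produces as far as I know (`page` elements containing `text` elements with `top`, `left`, `width` and `height` attributes); the function names and the region are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written sample imitating pdftohtml -xml output (an assumption, not real output).
SAMPLE = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<text top="100" left="80" width="400" height="16" font="0">Europe has already said it will press the G7</text>
<text top="120" left="80" width="380" height="16" font="0">to demand <b>more</b> disclosure from banks</text>
<text top="400" left="80" width="300" height="16" font="0">An unrelated line further down</text>
</page>
</pdf2xml>"""


def text_lines(xml_string):
    """Return (page, (left, top, width, height), text) for every text line."""
    root = ET.fromstring(xml_string)
    lines = []
    for page in root.iter("page"):
        page_no = int(page.get("number"))
        for t in page.iter("text"):
            box = tuple(int(t.get(k)) for k in ("left", "top", "width", "height"))
            # itertext() also collects text nested inside <b> or <i> children.
            lines.append((page_no, box, "".join(t.itertext())))
    return lines


def lines_in_region(lines, page_no, region):
    """Pick lines whose vertical centre lies inside region (left, top, right, bottom)."""
    r_left, r_top, r_right, r_bottom = region
    picked = []
    for page, (left, top, width, height), text in lines:
        centre_y = top + height / 2
        overlaps_x = left < r_right and left + width > r_left
        if page == page_no and r_top <= centre_y <= r_bottom and overlaps_x:
            picked.append(text)
    return picked


snippet = " ".join(lines_in_region(text_lines(SAMPLE), 1, (70, 90, 500, 140)))
print(snippet)
```

Using the vertical centre of each line, rather than requiring the whole line to be inside the region, tolerates a scribble that clips the top or bottom of a line slightly.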
|
04-22-2008, 09:00 AM | #5 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
extract snippets/tag PDFs: proof of concept for testing
I have now moved my code to python and added some additional functionality.
<caveat> I do not know how to write good python code, so for those who do know how to write good, pythonic code I would like to inform you that I will not be liable for any rehabilitation or psychotherapy fees that you may incur as a result of reading this code. Really, it is very, very nasty in places (that's "places" as in "most places"). If you have questions about the design you'll have to wait until I have designed it. This is a series of hacks that developed a life of their own. </caveat>
At the moment I have not bothered too much about making this work on different operating systems. It depends on pdftohtml, which is now part of the poppler project. There is a darwin port for mac and it apparently compiles out of the box on cygwin. Someone has created a windows version, but has wrapped a GUI around it, and I don't know if it can be used from the command line. If it can work from the command line this python code would still need to be tweaked a little (but not much). I had to download and compile the latest version of the poppler-utils on my office machine (running ubuntu dapper) but was able to use the repo version at home on ubuntu feisty (or gutsy?). All versions report version 0.36, which is a bit of a pain, because they ain't the same AFAICT.
To use this script, mark up a PDF on the iliad. L-shapes select text as snippets; inverted L-shapes are intended to select single words as tags. The default is to use a different colour to the default pen colour so that you can make notes with one colour and select text with another. Copy the PDF container folder to your PC (or connect via USB or samba) then run snippets on it. The extracted text is saved in two files, snippets and tags, which are stored within the container directory.
Make sure the script is executable with Code:
chmod +x snippets
The options at the moment are: Code:
snippets [-hbk] [-p <path-to-pdftohtml>] [-c <colour>] <directory>
-h print the help message
-c <colour> the colour (color) that identifies strokes that mark up areas to be extracted as snippets or tags. Should be one of: #000000, #555555, or the other two colours. Need to add them to this list. The default (i.e. if you do not specify a colour with this option) is #555555, which is the colour next to black (second from the right when selecting colours on the iliad).
-k keep the full xml output of pdftohtml. Default is to delete it.
-p <path-to-pdftohtml> path to pdftohtml (in case you need to specify a custom version)
-b use a brute-force approach to cleaning up XML that is not well-formed. If the XML output from pdftohtml is not well-formed you'll probably get a "mismatched tag" error.
<directory> input container directory
EXAMPLE: snippets -b -c "#000000" test.pdf
PROBLEMS: I have had a few problems along the way, and getting the script to this stage took me much longer than I anticipated. I had to learn about a number of things that were new to me (e.g. how to work with XML, the difference between MediaBox and CropBox in PDFs, etc.). Some of the problems remain unresolved, or are dealt with in a brutal manner. In particular I have had problems with unicode and with characters below ascii code 32 that appear in the XML output from pdftohtml. The Guardian TopStories.pdf has a few of these (that appear as ^C which is ETX (???) and ^B etc). The -b option activates some code that attempts to deal with some problems, mainly tags that are in the wrong order, but I have not been able to get those ^C things sorted. The script does run, however, on most of the PDFs I have tried.
I am now going to actually start using the blasted thing and see what else needs fixing. My intention is to use this approach to extract text and have a version of the multi-directory search tool to search snippets and tags on the iliad.
I'd be grateful if people could try this out and provide feedback. [Note: you'll want to remove the .txt extension from the script]
[Edit 2008-04-22] Minor edits to script. Note also that the script does not handle files and containers with spaces in the names. Some quoting is needed in several places.
[Edit 2008-04-23] Added option to extract images of selected regions (text or embedded images) using imagemagick. Also creates a simple HTML file for displaying the extracts.
Last edited by daudi; 04-23-2008 at 11:39 AM. Reason: Added option to extract images and create simple HTML |
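On the ^C and ^B characters: XML 1.0 does not allow characters below code 32 other than tab, line feed and carriage return, so a strict parser rejects any file containing them. One possible workaround, sketched here in Python and not taken from the attached script, is to delete those characters from the text before parsing. The function name `strip_control_chars` and the sample string are made up for illustration, and this does nothing about the separate problem of mismatched tags.

```python
import re
import xml.etree.ElementTree as ET

# Characters below 32 that XML 1.0 forbids (tab, LF and CR are allowed and kept).
_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_control_chars(xml_text):
    """Remove control characters that make the XML ill-formed."""
    return _ILLEGAL.sub("", xml_text)


# Hand-made sample containing ETX (^C) and STX (^B).
raw = '<page><text top="1" left="2">Top\x03 Stories\x02</text></page>'

try:
    ET.fromstring(raw)
    parsed_raw = True
except ET.ParseError:
    parsed_raw = False  # the parser refuses the raw text

root = ET.fromstring(strip_control_chars(raw))
print(parsed_raw, repr(root.find("text").text))
```

Dropping the characters loses whatever they stood for in the PDF, so if they turn out to be mis-encoded ligatures or hyphens, mapping them to replacements would be a better fix than deletion.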
04-22-2008, 09:48 AM | #6 |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
I probably won't be able to look at it until this weekend, but I'll try to give it a good workout then.
|
04-22-2008, 10:23 AM | #7 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
OK. I look forward to hearing your comments and ideas. A couple of things that I might start thinking about are:
One other thing to note: this rewrites the snippets and tags files each time. So, if you manually add other tags to the tags file (I was originally thinking about exporting keywords from jabref) they'll get wiped out next time you process the folder. Last edited by daudi; 04-22-2008 at 10:49 AM. Reason: Added idea to change from write to append |
04-22-2008, 10:32 AM | #8 | |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Quote:
|
|
04-22-2008, 10:46 AM | #9 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
Originally I did have it appending it, but then I ended up with duplications if I read an article again later and made further snippets. This could be handled by looking at the coordinates of the snippets already taken, but that started to look like too much hard work at the time (that was when I was still battling with bad XML and unicode issues). I'll add it to the list above and have a think about it.
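A rough sketch of the coordinate check, assuming each snippet is kept as a record with a page number and a bounding box. The record layout, the `merge_snippets` name and the `tolerance` value are hypothetical; the real snippets file may store things differently.

```python
def merge_snippets(existing, new, tolerance=5):
    """Append only those new snippets whose region is not already present.

    Each snippet is a dict with 'page', 'box' (left, top, width, height)
    and 'text'. Two snippets count as the same region when they are on the
    same page and every box value differs by at most `tolerance`.
    """
    def same_region(a, b):
        return a["page"] == b["page"] and all(
            abs(x - y) <= tolerance for x, y in zip(a["box"], b["box"])
        )

    merged = list(existing)
    for snip in new:
        if not any(same_region(snip, old) for old in merged):
            merged.append(snip)
    return merged


# Made-up records for illustration.
old = [{"page": 1, "box": (80, 100, 400, 36), "text": "first pass"}]
new = [
    {"page": 1, "box": (82, 101, 399, 36), "text": "first pass"},   # same region again
    {"page": 2, "box": (80, 300, 200, 16), "text": "second pass"},  # genuinely new
]
merged = merge_snippets(old, new)
print(len(merged))
```

Because existing records are kept untouched and only unseen regions are added, anything added by hand to an existing record would survive a second run.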
|
04-23-2008, 11:38 AM | #10 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
Still ugly, but does more
I've just added an option to extract images of the selected areas (these can be text or images in the document) and also create a simple HTML file to show the images and extracted text.
Code:
-i extract images of selected areas. You need to have imagemagick installed and on your path. If this proves to be useful I'll need to add ways of specifying more parameters for image creation. This also produces a rudimentary HTML file that links the images and snippets (currently in the order they were made).
Code:
snippets -i test.pdf
Note that image extraction requires imagemagick. Note also that I need to deal with spaces in file paths, so this will not work with PDFs with spaces in the file names. This should not be hard to do; I just need to get around to doing it. Last edited by daudi; 04-24-2008 at 03:16 AM. Reason: fixed typo |
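One way to sidestep the spaces problem, sketched in Python under assumptions and not taken from the script: build the ImageMagick command as a list of arguments and hand it to `subprocess.run` without a shell, so no quoting is needed. The helper names, the `density` default and the assumption that the box is already in pixels at that density are all made up for illustration; `-density`, `-crop WxH+X+Y`, `+repage` and the `file.pdf[N]` page selector are standard ImageMagick usage.

```python
import subprocess


def crop_command(pdf_path, page_no, box, out_path, density=108):
    """Build an ImageMagick command that crops one region of one PDF page.

    box is (left, top, width, height) in pixels at the given density
    (an assumption). page_no is 1-based; ImageMagick counts pages from 0.
    """
    left, top, width, height = box
    return [
        "convert",
        "-density", str(density),
        f"{pdf_path}[{page_no - 1}]",
        "-crop", f"{width}x{height}+{left}+{top}",
        "+repage",
        out_path,
    ]


def extract_image(pdf_path, page_no, box, out_path):
    """Run the crop; requires ImageMagick on the path (not called here)."""
    subprocess.run(crop_command(pdf_path, page_no, box, out_path), check=True)


# Paths with spaces are passed through as single arguments.
cmd = crop_command("my papers/test file.pdf", 1, (80, 100, 400, 36), "snip 1.png")
print(cmd)
```

Each list element reaches the program as one argument regardless of spaces, which is why no shell quoting is involved.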
04-25-2008, 04:42 AM | #11 | |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
Quote:
The snippet extraction plus search tool means the following work flow is possible:
I am also starting to think about extending the logic of the searches and having a "combine" entry. It should be possible to make several searches and keep each set of results and then combine them (like the ovid command line syntax). This would mean you could:
I also need to give some thought to integration of the snippets with my main PC-based bibliography tool. Last edited by daudi; 04-25-2008 at 04:48 AM. |
|
04-26-2008, 09:21 PM | #12 | |
fruminous edugeek
Posts: 6,745
Karma: 551260
Join Date: Oct 2006
Location: Northeast US
Device: iPad, eBw 1150
|
Quote:
Can we categorize snippets when we capture them? Or afterward, when reviewing them in the HTML version? |
|
04-27-2008, 03:00 AM | #13 | ||
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
Quote:
The snippets could, however, easily be edited at a later stage as they have a very simple structure. Here's one: Quote:
[BTW, in the extract above notice that the hyphen and 'fi' ligature have been converted to '?' because I need help to understand encoding schemes] Last edited by daudi; 04-27-2008 at 04:02 AM. Reason: fixed a typo |
||
04-27-2008, 04:12 AM | #14 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
It might not even need an application to handle snippets. jabref is very flexible and it is easy to add a file link within a bibliography reference that links to the snippet. That way you can easily open the snippet from the bibliography entry. You can't search the snippet from within jabref (yet) but I'll have a think about how that might be possible. As much as possible I would like to keep things integrated with my full bibliography manager.
In fact it might not be too hard at all. I think we could add a custom field for snippets to jabref and have a way to import/update that field from the snippets file in the PDF container folder. It should be possible to either create a custom import/export filter to keep the two in sync or create a script that does this. (The new version of jabref is going to have a plugin system that would make it easier for someone to create a java plugin that could integrate this more elegantly.) Doing this would mean that it would be possible to use jabref to manage references and to do quite powerful searches (including searches of the snippets) and still have the ability to do searches on snippets directly on the iliad. But, again, we need to be careful about what happens if the PDF is processed again, e.g. if it is read again at a later date, perhaps for a different purpose, and more text is marked up. We'd need to keep track of existing snippets and not wipe out the categories that had been added externally. This can be done, but means more work. Last edited by daudi; 04-27-2008 at 04:39 AM. |
05-01-2008, 01:04 PM | #15 |
Addict
Posts: 281
Karma: 904
Join Date: Oct 2007
Location: Kent, UK
Device: iRex iLiad, Psion 5MX, nokia n800
|
snippet search now works
I've added a snippet search tool to the multi-directory search tool. This allows you to search within snippets and produces results that show the title of the original file plus a couple of lines of the context of the match (in the description). The description starts with the page number in the original PDF where the matching snippet comes from. I was not able to make this open on the matching page (at least not nicely), but this allows you to see which page to jump to once you open the PDF (via the result set).
This tool uses the same config tool to decide where to search so you could set up directories of PDFs and then choose to only search within some. Here's an example of the result of a search. |
|
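For readers who want a feel for the result format described in this post, here is a rough Python sketch, not the actual search tool. It assumes snippets are available as records with a source title, page number and text (a hypothetical layout) and returns, for each match, the title plus a description that starts with the page number followed by a little context around the match.

```python
def search_snippets(snippets, query, context=40):
    """Case-insensitive substring search over snippet records.

    Each record is a dict with 'title', 'page' and 'text' (assumed layout).
    The description starts with the page number, then the matching context.
    """
    needle = query.lower()
    results = []
    for snip in snippets:
        pos = snip["text"].lower().find(needle)
        if pos == -1:
            continue
        start = max(0, pos - context)
        end = min(len(snip["text"]), pos + len(query) + context)
        description = f"p.{snip['page']}: {snip['text'][start:end]}"
        results.append({"title": snip["title"], "description": description})
    return results


# Made-up records for illustration.
records = [
    {"title": "TopStories.pdf", "page": 3,
     "text": "disclosure from banks as the credit crunch spreads"},
    {"title": "Other.pdf", "page": 1, "text": "nothing relevant here"},
]
hits = search_snippets(records, "Credit Crunch")
print(hits)
```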