08-29-2009, 02:42 AM | #1 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
pacify.py (Text reformatter / RTF extractor)
Updated download as of 2009-08-29 12:27 EST
Updated download as of 2009-08-29 23:20 EST

Very much work in progress, and rather unpolished... but it's already turning out to be incredibly helpful to me in some eBook preparation, so I thought I would share it here.

Suggested use:

Produce HTML from text:
pacify.py -i input.txt -pcq

Produce LaTeX from text:
pacify.py -i input.txt -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

Produce HTML from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcql gppro

Produce LaTeX from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

The RTF extraction isn't very sophisticated yet... but should work fine with simple, straightforward RTF files.

- Ahi
Last edited by ahi; 08-31-2009 at 01:06 PM. |
08-29-2009, 08:55 AM | #2 |
GuteBook/Mobi2IMP Creator
Posts: 2,958
Karma: 2530691
Join Date: Dec 2007
Location: Toronto, Canada
Device: REB1200 EBW1150 Device: T1 NSTG iLiad_v2 NC Device: Asus_TF Next1 WPDN
|
Thanks for this code. It looks like it does a very nice job of an arduous task.
I ran the python script with just -r and it worked, but with -pqr it crashed yielding this error message:
Code:
E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify>pacify.py -i 157.txt -pqr
pacify v0.2 - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)
Right-trimming lines...
Parsing...
line break: nr (2643)
paragraph break: nrnr (823)
...: (46)
Removing intraparagraph linebreaks...
Fixing erroneous paragraph breaks...
Traceback (most recent call last):
  File "E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify\pacify.py", line 802, in <module>
    main()
  File "E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify\pacify.py", line 111, in main
    theTome = parfixTome(theTome)
  File "E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify\pacify.py", line 440, in parfixTome
    if strLowerAlpha.find(theTome[idx+2][0:1]) > -1:
IndexError: list index out of range
Code:
# Importing py2exe registers the "py2exe" command with distutils.
import py2exe
from distutils.core import setup

setup(
    name = 'Pacify',
    description = 'Text reformatter / RTF extractor - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)',
    version = '0.2',
    author = 'ahi',
    author_email = 'http://www.paxlibrorum.com/contact/',
    # Build pacify.py as a console (command-line) executable.
    console = ['pacify.py'],
    options = {"py2exe": {"typelibs": [('{1103EA00-3A0C-11D3-A6F6-00104B2947FB}',0,1,0)]}},
)
Last edited by nrapallo; 08-29-2009 at 09:07 AM. Reason: added smart quotes version of output.txt |
08-29-2009, 12:28 PM | #3 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Thanks, nrapallo!
I'll take a look at your crash report and see what's going on. I'm getting semi-regular crashes (always on certain files) myself--crashes that I suspect could be fixed by a bit more thorough/clever preprocessing of files.

And yes, I'll make an .exe of it for the next upload.

This script is a step short of my crazy (and formerly aired) idea of turning text files into databases with walkable nodes representing all words, sentences, paragraphs, et al.

The next things I am going to try to get working are 1) part/chapter/section title detection and 2) poetry/quotation detection.

I see both of those working in one of two modes: an overzealous automatic mode (which assumes anything that might be a title or a quotation *is* one, and leaves the user/bookmaker to restore formatting if it isn't), or an interactive one, where Python informs the user of the match it thinks it found and lets the user instruct it how to handle said potential match, e.g.:
Code:
Potential title match:

-2: and so she left.
-1:
 0: III. On the way to Istanbul
 1:
 2: The friar did not hesitate to purchase a ticket on the next ship, perhaps because

Encode line 0 as [P]art/H1, [C]hapter/H2, [S]ection/H3 or [I]gnore?
Enter choice: _
- Ahi |
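A rough sketch of how the two modes described above might share one code path. This is not pacify.py's code: the Roman-numeral heuristic, the function names (`find_title_candidates`, `show_context`, `encode_titles`) and the `choose` callback are all assumptions made for illustration, and it is written for Python 3.

```python
import re

# Hypothetical heuristic: a short line that starts with a Roman numeral and a
# period, with blank lines on both sides, is treated as a potential title.
TITLE_RE = re.compile(r"^[IVXLC]+\.\s+\S.{0,60}$")

# Maps the user's answer to the heading level named in the prompt.
HEADING_TAGS = {"p": "h1", "c": "h2", "s": "h3"}


def find_title_candidates(lines):
    """Return indexes of lines that look like part/chapter/section titles."""
    hits = []
    for idx, line in enumerate(lines):
        before = lines[idx - 1].strip() if idx > 0 else ""
        after = lines[idx + 1].strip() if idx + 1 < len(lines) else ""
        if TITLE_RE.match(line.strip()) and not before and not after:
            hits.append(idx)
    return hits


def show_context(lines, idx, radius=2):
    """Format the candidate with `radius` lines of context on either side."""
    rows = []
    for offset in range(-radius, radius + 1):
        pos = idx + offset
        text = lines[pos] if 0 <= pos < len(lines) else ""
        rows.append(f"{offset:>2}: {text}")
    return "\n".join(rows)


def encode_titles(lines, choose):
    """Ask choose(context) for 'p', 'c', 's' or 'i' on every candidate."""
    out = list(lines)
    for idx in find_title_candidates(lines):
        choice = choose(show_context(lines, idx)).strip().lower()[:1]
        tag = HEADING_TAGS.get(choice)
        if tag:  # any other answer (including 'i') leaves the line untouched
            out[idx] = f"<{tag}>{lines[idx].strip()}</{tag}>"
    return out
```

The interactive mode would pass something like `lambda ctx: input(ctx + "\nEnter choice: ")` as `choose`, while the overzealous automatic mode would pass a function that always answers `"c"`.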
08-29-2009, 12:41 PM | #4 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
I will figure out the cause of the error... but I should perhaps note, nrapallo (in case it is unclear either to you or others), that the -p option corrects erroneous paragraph breaking, not systematic paragraph breaking.
The program, regardless of whether the -p option is used, fixes systematic paragraph breaking like:
Code:
Here I am! I travelled yesterday for four hours in a train. It's a
funny sensation, isn't it? I never rode in one before.

College is the biggest, most bewildering place--I get lost whenever I
leave my room. I will write you a description later when I'm feeling
less muddled; also I will tell you about my lessons. Classes don't
begin until Monday morning, and this is Saturday night. But I wanted
to write a letter first just to get acquainted.
into:
Code:
Here I am! I travelled yesterday for four hours in a train. It's a funny sensation, isn't it? I never rode in one before.
College is the biggest, most bewildering place--I get lost whenever I leave my room. I will write you a description later when I'm feeling less muddled; also I will tell you about my lessons. Classes don't begin until Monday morning, and this is Saturday night. But I wanted to write a letter first just to get acquainted.

While the -p option is good to use (once it works reliably) on all files "just in case" (and since it reports to the user what it changes, you'll know if it corrects something in error)... a file that has no such systematic paragraph errors could be nicely processed with:

pacify.py -i input.txt -cq

Doing so with 157.txt yields the attached.

At first look, it seems to work rather nicely, smartening up all single quotes without interfering with/being confused by apostrophes... though if and when you find it messed up somewhere in this file, nrapallo, do let me know.

It would almost certainly get it wrong if there were a word like 'tis that began with a single quote--though since there are not many such words, it's not unreasonable for my program to keep a list of those so it knows to treat them correctly.

- Ahi |
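For what it's worth, the "keep a list of words like 'tis" idea can be sketched in a few lines. This is only an illustration under assumptions, not how pacify.py actually does it: the word list `APOSTROPHE_WORDS` is a short made-up sample, the function name `smarten_single_quotes` is hypothetical, and only single quotes are handled.

```python
import re

# Hypothetical, deliberately short list of words that begin with an
# apostrophe; a real list would need to be longer.
APOSTROPHE_WORDS = ("tis", "twas", "twere", "em", "til", "cause")

LSQUO, RSQUO = "\u2018", "\u2019"

# A straight quote at the start of one of the listed words.
_LEADING_APOSTROPHE = re.compile(
    r"(?<!\w)'(?=(?:%s)\b)" % "|".join(APOSTROPHE_WORDS), re.IGNORECASE
)


def smarten_single_quotes(text):
    """Replace straight single quotes with curly ones."""
    # 1. Known apostrophe-first words ('tis, 'twas, ...) get an apostrophe.
    text = _LEADING_APOSTROPHE.sub(RSQUO, text)
    # 2. A quote between two word characters is an apostrophe (isn't, I'm).
    text = re.sub(r"(?<=\w)'(?=\w)", RSQUO, text)
    # 3. A remaining quote at the start of a word opens a quotation.
    text = re.sub(r"(?<!\w)'(?=\w)", LSQUO, text)
    # 4. Whatever is left closes a quotation (or is a trailing apostrophe).
    return text.replace("'", RSQUO)
```

The order of the steps matters: the word list has to be consulted before the generic "quote at the start of a word" rule, otherwise 'tis would come out with an opening quote. A quotation that happens to open with one of the listed words would still be handled wrongly, which is the kind of case where reporting changes to the user helps.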
08-29-2009, 01:27 PM | #5 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Corrected file uploaded...
... it was actually cutting off final lines, and the error nrapallo found was related to that. Seems OK now.
- Ahi
Last edited by ahi; 08-30-2009 at 12:20 AM. |
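As an aside on the `theTome[idx+2]` lookahead in the traceback: one way to sidestep both the IndexError and the lost final lines is to look back at what has already been emitted instead of looking ahead. The sketch below is an assumption-laden illustration, not the actual fix: it treats the text as a plain list of paragraph strings (pacify.py's real `theTome` structure is not shown in the thread), and the names `fix_paragraph_breaks` and `SENTENCE_END` are made up.

```python
# Characters that may legitimately end a paragraph (an assumption).
SENTENCE_END = ".!?\"'"


def fix_paragraph_breaks(paragraphs):
    """Merge a paragraph into the previous one when the break looks erroneous.

    A break is treated as erroneous when the paragraph starts with a
    lowercase letter and the previous one does not end a sentence.
    """
    fixed = []
    for para in paragraphs:
        continues = (
            bool(fixed)
            and para[:1].islower()          # "" is never lowercase
            and fixed[-1][-1:] not in SENTENCE_END
        )
        if continues:
            fixed[-1] = fixed[-1] + " " + para
        else:
            fixed.append(para)
    return fixed
```

Because every paragraph is either appended or merged, nothing at the end of the list can be dropped, and no index ever reaches past the last element.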
08-30-2009, 12:34 AM | #6 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Updated pacify.py ... see attached or first post.
---
Now pacify.py by default produces HTML... of sorts. Don't worry, I'll add back in the functionality to output UTF-8 plaintext when I get around to it.

The caveat being: no <p> tags are produced (although if you look at the source, the text is separated by linebreaks, so you can add in the <p> tags easily enough with some clever search and replace). Also, footnotes extracted from RTF files are enclosed in <footnote>some text here</footnote> for the sake of simplicity--this will be fixed. Oh, and presently only formatting from RTF is picked up... so _emphasized phrase_ style formatting is not yet recognized.
---
The input is autodetected as either .txt or .rtf based on the file extension.

Many RTFs work well... but I am regularly encountering ones that prove problematic. Since RTF seems to be a rather large and unwieldy specification, I am not sure how likely I am to be able to guarantee the accuracy of conversion. If anybody has advice on how I can make my RTF parser cleverly ignore stuff that it doesn't care about, I'd be grateful. It does alright so far... but since I do not yet understand how I could opt to only process text that shows visibly (as opposed to metadata), I am actively filtering out metadata one RTF command at a time... doubtless the wrong way to do it, I know.

The output defaults to HTML unless the -l switch (LaTeX) is used. The LaTeX switch now requires an argument... though it currently only supports -l gppro. Also, you should provide the title (-T "..."), author (-A "lastname, firstname"), and optionally subtitle (-S "...") for the generated LaTeX document to have a nice title page. Optionally you can also specify your name (-I "...") for an "Ex Libris ..." inscription at the bottom of the title page.

Some parts of the program are a bit more robust now... so you are less likely to encounter errors, but they will almost certainly still happen if the file is very messy (or, I suppose, just very different from the ones I have tested with).
Comments, reports, suggestions are appreciated.
---
Suggested use:

Produce HTML from text:
pacify.py -i input.txt -pcq

Produce LaTeX from text:
pacify.py -i input.txt -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

Produce HTML from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcql gppro

Produce LaTeX from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

- Ahi |
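On the question of ignoring RTF content that is not visible text: in RTF, a group that opens with `{\*` marks a destination that a reader may skip if it does not understand it, and a handful of named destinations (font table, colour table, stylesheet, document info, pictures) never contribute body text. Skipping whole groups is usually simpler than filtering one control word at a time. The sketch below is a minimal illustration of that idea and not a full RTF parser: the `SKIP_DESTINATIONS` set is a small sample rather than a complete list, the cp1252 decoding of `\'hh` escapes is an assumption, `\uN` Unicode escapes are not handled, and the function name `rtf_to_text` is made up.

```python
import re

# Destinations whose contents never appear as body text. This is a small
# illustrative subset, not a complete list.
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict"}

TOKEN_RE = re.compile(
    r"\\([a-zA-Z]+)(-?\d+)? ?"   # control word, optional number, optional space
    r"|\\'([0-9a-fA-F]{2})"      # hex-escaped byte, e.g. \'e9
    r"|\\([^a-zA-Z])"            # control symbol, e.g. \{ \} \\ \*
    r"|([{}])"                   # group delimiters
    r"|([^\\{}\r\n]+)"           # plain text
    r"|[\r\n]+"                  # raw newlines carry no meaning in RTF
)


def rtf_to_text(rtf):
    """Return the visible text of a simple RTF document."""
    out = []
    stack = []        # saved "skipping" state for each open group
    skipping = False  # True while inside a destination we do not want
    for match in TOKEN_RE.finditer(rtf):
        word, _num, hexbyte, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif symbol == "*" or word in SKIP_DESTINATIONS:
            skipping = True   # ignore everything up to the end of this group
        elif skipping:
            continue
        elif word in ("par", "line"):
            out.append("\n")
        elif word == "tab":
            out.append("\t")
        elif hexbyte:
            out.append(bytes.fromhex(hexbyte).decode("cp1252", errors="replace"))
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)  # escaped literal character
        elif text:
            out.append(text)
        # any other control word (formatting etc.) is silently dropped
    return "".join(out)
```

Preserving italic/bold or footnotes would mean acting on words such as `\i`, `\b` and `\footnote` instead of dropping them, but the group stack that tracks the skipping state stays the same.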
08-31-2009, 03:51 AM | #7 |
Guru
Posts: 869
Karma: 2676800
Join Date: Sep 2008
Location: Taranaki - NZ
Device: Kobo Aura H2O, Kobo Forma
|
I haven't tried this script yet, but I will at some point when I have some text files that need working on.
Just an idea for the distant future - and I don't know if this is even feasible - but a lot of text files do not have any markup whatsoever. It would be so timesaving if a script/program could be written that could automate the task of adding italics/emphasis to text. Stuff like internal dialogue (I doubt this is possible he mused) and telepathic type conversations. Maybe even ship/aircraft/<insert vehicle> names. Yeah, I know. I'm probably dreaming... |
08-31-2009, 12:36 PM | #8 | ||
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Sherman, assuming there are reasonably consistent rules that define what you want in italics, this might not even be hard.
The example you gave though is far too vague. Quote:
Quote:
The other issue is that "mused" doesn't necessarily indicate internal dialogue. He could have been musing aloud to somebody else.

Having said that... if you wanted all sentences that end with ", s/he thought" and ", s/he thought to him/herself" and ", s/he wondered." that's doable.

The problem is the potential for considerable variety.

The only way I see this being doable is via a two-pass method. On the first pass, the program produces a list of sentences that it believes (based on whatever sort of pattern matching) to be candidates for italicizing as internal dialogue. The user would then go over this list and take out all false positives, then rerun the program so that the italicization takes place based on the previously produced and now corrected list file.

Not sure how reliable this would be for the specific thing you are proposing it for... but the general principle might work for other similar tasks.

- Ahi |
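A minimal sketch of that two-pass idea, under loudly stated assumptions: the attribution pattern covers only a few endings of the kind listed above, sentences are split naively on ., ! and ? (so abbreviations will confuse it), and the names `find_candidates` and `italicize` are made up for illustration. The list file the user would edit between passes is represented here by a plain Python list.

```python
import re

# Hypothetical pattern for endings such as ", he thought." or
# ", she thought to herself." or ", he wondered."
ATTRIBUTION_RE = re.compile(
    r",\s+(?:s?he|I)\s+"
    r"(?:thought(?:\s+to\s+(?:him|her|my)self)?|wondered|mused)\s*[.!?]$",
    re.IGNORECASE,
)

# Naive sentence splitter: text up to and including ., ! or ?
SENTENCE_RE = re.compile(r"[^.!?]+[.!?]+")


def find_candidates(text):
    """Pass 1: list unquoted sentences that end with a thought attribution."""
    candidates = []
    for match in SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if '"' in sentence:
            continue  # contains quotation marks, so treat it as spoken dialogue
        if ATTRIBUTION_RE.search(sentence):
            candidates.append(sentence)
    return candidates


def italicize(text, approved):
    """Pass 2: wrap the thought part of each approved sentence in <i> tags."""
    for sentence in approved:
        thought, comma, attribution = sentence.rpartition(",")
        text = text.replace(sentence, f"<i>{thought}</i>{comma}{attribution}")
    return text
```

In use, the output of `find_candidates` would be written out one sentence per line, the user would delete the false positives, and the surviving lines would be read back in and handed to `italicize`.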
||
08-31-2009, 10:01 PM | #9 |
Guru
Posts: 869
Karma: 2676800
Join Date: Sep 2008
Location: Taranaki - NZ
Device: Kobo Aura H2O, Kobo Forma
|
Hmm, when there's a sentence (or more than one) that includes "I", sans quotes of any type, this might not be so hard. The program would have to check it against the entire paragraph though to ensure it is not part of a conversation.
Also, if a sentence ends in something like "...s/he/<character name> thought.", and again that sentence is not detected within quotes, chances are it could be internal dialogue. A harder one to catch is this sort of situation: "...That's the sixth servant he's sent screaming so far today.", where that was the end of some internal dialogue. There may or may not be preceding internal dialogue with that, but I doubt it would be so simple for an automatic script or program to catch. |
08-31-2009, 10:22 PM | #10 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
If you post a 4-5 paragraph sample text, I can tell you how readily scriptable it would be.
- Ahi |
09-01-2009, 05:55 AM | #11 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
I seriously doubt this can be automated in any way. Recognizing sentences and parsing their meaning goes far beyond simple scripting and into artificial intelligence.
Take, for instance, a text written in first person: It was a dark night, I could hardly see her face, and I wondered what she thought. It has no quotes, it includes "I" and it ends with "she thought"... |
09-01-2009, 08:51 AM | #12 | |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
Quote:
Not that I'm even sure though what part of the sentence you quoted ought to be italicized as "internal dialogue"... just the last third? All of it? None of it (it being more narration than internal dialogue)?
- Ahi |
|
09-01-2009, 08:58 AM | #13 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
|
09-01-2009, 09:54 AM | #14 |
Wizard
Posts: 1,790
Karma: 507333
Join Date: May 2009
Device: none
|
|
09-01-2009, 10:23 AM | #15 |
frumious Bandersnatch
Posts: 7,536
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
|
No, it was an example of a likely false positive with the rules above.
Granted, you can eliminate false positives with your two-pass method, but there could be literally hundreds of them, often many more than real "internal dialogue" phrases. As for false negatives, I often find dialogues (internal or not) that just omit the "he said", "she thought", etc. One should also look for "he said to himself" or "he wondered", or "he secretly admitted", etc. An automated tool can be of some help, but the danger is that the user comes to rely solely on the tool, which can be worse than just leaving the "internal dialogues" unformatted. Similarly, when I see curly quotes wrongly oriented I would prefer they had been left as straight quotes instead. |
|