pacify.py (Text reformatter / RTF extractor)

ahi · 08-29-2009, 01:42 AM

Updated download as of 2009-08-29 12:27 EST
Updated download as of 2009-08-29 23:20 EST

Very much work in progress, and rather unpolished... but it's already turning out to be incredibly helpful to me in some eBook preparation, so I thought I would share it here.

Suggested use:

Produce HTML from text:
pacify.py -i input.txt -pcq

Produce LaTeX from text:
pacify.py -i input.txt -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

Produce HTML from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcq

Produce LaTeX from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

The RTF extraction isn't very sophisticated yet... but should work fine with simple, straightforward RTF files.

- Ahi

nrapallo · 08-29-2009, 07:55 AM

Thanks for this code. It looks like it does a very nice job of an arduous task.

I ran the python script with just -r and it worked, but with -pqr it crashed yielding this error message:

Code:

E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify>pacify.py -i 157.txt -pqr
pacify v0.2 - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)

Right-trimming lines...

Parsing...

  line break: nr (2643)
  paragraph break: nrnr (823)
  ...:  (46)

Removing intraparagraph linebreaks...

Fixing erroneous paragraph breaks...

Traceback (most recent call last):
  File "E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify\pacify.py", line 802, in <module>
    main()
  File "E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify\pacify.py", line 111, in main
    theTome = parfixTome(theTome)
  File "E:\ebooks\Coding_Mobi2IMP_PDFRead\pacify\pacify.py", line 440, in parfixTome
    if strLowerAlpha.find(theTome[idx+2][0:1]) > -1:
IndexError: list index out of range

Could you also create a Windows executable, pacify.exe, for those here that don't do Python? I did create one myself using py2exe and this setup.py code (used in PDFRead):

Code:

import py2exe
from distutils.core import setup

setup(
    name = 'Pacify',
    description = 'Text reformatter / RTF extractor - Copyright 2009 Pax Librorum (www.PaxLibrorum.com)',
    version = '0.2',
    author = 'ahi',
    author_email = 'http://www.paxlibrorum.com/contact/',
    console = ['pacify.py'],
    options = {"py2exe": {"typelibs": [('{1103EA00-3A0C-11D3-A6F6-00104B2947FB}',0,1,0)]}},
)

ahi · 08-29-2009, 11:28 AM

Thanks, nrapallo!

I'll take a look at your crash report and see what's going on. I'm getting semi-regular crashes (always on certain files) myself--crashes that I suspect could be fixed by a bit more thorough/clever preprocessing of files.

And yes, I'll make an .exe of it for the next upload.

This script is a step short of my crazy (and formerly aired) idea of turning text files into databases with walkable nodes representing all words, sentences, paragraphs, et al.

The next things I am going to try to get working are 1) part/chapter/section title detection and 2) poetry/quotation detection.

I see both of those working either in an overzealous automatic mode (which assumes anything that might be a title or a quotation *is* one, and the user/bookmaker will restore formatting if it isn't) or in an interactive one, where Python informs the user of the match it thinks it found, and lets the user instruct it how to handle said potential match.

e.g.:

Code:

Potential title match:

-2:        and so she left.
-1:        
 0:        III. On the way to Istanbul
 1:        
 2:        The friar did not hesitate to purchase a ticket on the next ship, perhaps because

   Encode line 0 as [P]art/H1, [C]hapter/H2, [S]ection/H3 or [I]gnore?
   Enter  choice: _

I'm also hoping I'll get around to tidying up the code a bit at some point...
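
To make the idea a bit more concrete, here is a rough sketch of how the candidate detection and the [P]/[C]/[S]/[I] choice could fit together. It is only an illustration: the pattern (Roman numeral, period, short title, blank lines on both sides) and the names TITLE_RE, find_title_candidates and encode_title are assumptions of mine, not what pacify.py actually does.

```python
import re

# Hypothetical pattern: a Roman numeral, a period, then a short title.
# The real detection rules are not settled yet; this is only a guess.
TITLE_RE = re.compile(r"^[IVXLCDM]+\.\s+\S.{0,70}$")

# Mapping from the interactive choices to heading tags.
HEADING_TAGS = {"P": "h1", "C": "h2", "S": "h3"}


def find_title_candidates(lines):
    """Return indexes of lines that look like titles: a TITLE_RE match
    with a blank line (or the file boundary) on both sides."""
    hits = []
    for idx, line in enumerate(lines):
        before_blank = idx == 0 or not lines[idx - 1].strip()
        after_blank = idx == len(lines) - 1 or not lines[idx + 1].strip()
        if before_blank and after_blank and TITLE_RE.match(line.strip()):
            hits.append(idx)
    return hits


def encode_title(line, choice):
    """Wrap the line in a heading tag for P/C/S; leave it alone for I."""
    tag = HEADING_TAGS.get(choice.upper())
    return "<%s>%s</%s>" % (tag, line.strip(), tag) if tag else line
```

In interactive mode the program would show each candidate with its neighbouring lines and pass the user's answer to encode_title; the overzealous automatic mode would simply pick a default choice for every hit.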

- Ahi

ahi · 08-29-2009, 11:41 AM

I will figure out the cause of the error... but I should perhaps note, nrapallo (in case it is unclear either to you or others), that the -p option corrects erroneous paragraph breaking, not systematic paragraph breaking.

The program, regardless of whether the -p option is used, fixes systematic paragraph breaking like:

Code:

Here I am!  I travelled yesterday for four hours in a train.  It's a

funny sensation, isn't it?  I never rode in one before.



College is the biggest, most bewildering place--I get lost whenever I

leave my room.  I will write you a description later when I'm feeling

less muddled; also I will tell you about my lessons.  Classes don't

begin until Monday morning, and this is Saturday night.  But I wanted

to write a letter first just to get acquainted.
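
The systematic case above could be handled along these lines. This is a minimal sketch under one assumption of mine (not taken from pacify.py): the most frequent run of newlines in the file is the ordinary line break, and any longer run is a real paragraph break. The name normalize_breaks is hypothetical.

```python
import re
from collections import Counter


def normalize_breaks(text):
    """Collapse systematic extra blank lines.

    Assumption: the most frequent run of newlines is the ordinary line
    break, and any longer run marks a paragraph break.
    """
    runs = re.findall(r"\n+", text)
    if not runs:
        return text
    line_break = Counter(runs).most_common(1)[0][0]

    def fix(match):
        # Ordinary line breaks become a space; longer runs become one blank line.
        return " " if len(match.group(0)) <= len(line_break) else "\n\n"

    return re.sub(r"\n+", fix, text)
```

On a double-spaced file like the sample, the two-newline runs are the most common, so they are joined with spaces, while the four-newline runs survive as paragraph breaks.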

What -p would fix would be if the same lines were thus:

Code:

Here I am!  I travelled yesterday for four hours in a train.  It's a

funny sensation, isn't it?  I never rode in one before.



College is the biggest, most bewildering place--I get lost whenever I

leave my room.  I will write you a description later when I'm feeling


less muddled; also I will tell you about my lessons.  Classes don't

begin until Monday morning, and this is Saturday night.  But I wanted

to write a letter first just to get acquainted.

The -p option would detect that the "paragraph" that ends with "... I'm feeling" and the following one that starts with "less muddled; also ..." are almost certainly supposed to be a single paragraph.
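
As a rough illustration of that kind of check (a sketch only; the heuristic and the name fix_paragraph_breaks are assumptions, and the real -p logic may differ), one could merge a paragraph into the previous one whenever the previous one does not end in sentence-closing punctuation and the new one starts with a lowercase letter:

```python
import string


def fix_paragraph_breaks(paragraphs):
    """Merge a paragraph into the previous one when the break looks accidental.

    Heuristic (an assumption for illustration): the previous paragraph does
    not end in sentence-closing punctuation and the next one starts with a
    lowercase ASCII letter.
    """
    closers = ".!?\"'"
    fixed = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty items so indexing below is always safe
        if (fixed
                and fixed[-1][-1] not in closers
                and para[0] in string.ascii_lowercase):
            fixed[-1] = fixed[-1] + " " + para
        else:
            fixed.append(para)
    return fixed
```

Because the loop only ever looks back at what it has already collected, it never needs to peek past the end of the list.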

While the -p option is good to use (once it works reliably) on all files "just in case" (and since it reports to the user what it changes, you'll know if it corrects something in error)... a file that has no such erroneous paragraph breaks could be nicely processed with:

pacify.py -i input.txt -cq

Doing so with 157.txt yields the attached. At first look, it seems to work rather nicely, smartening up all single quotes without interfering with/being confused by apostrophes... though if and when you find it messed up somewhere in this file, nrapallo, do let me know. It would almost certainly get it wrong if there was a word like 'tis that began with a single quote--though since there are not many such words, it's not unreasonable for my program to keep a list of those so it knows to treat them correctly.
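
A minimal sketch of the word-list idea, for illustration only: the list contents, the classification rules and the name smarten_single_quotes are my assumptions here, not the program's current code.

```python
import re

# Hypothetical starter list of words that begin with an apostrophe.
LEADING_APOSTROPHE_WORDS = ("tis", "twas", "twere", "em")

OPEN_QUOTE, CLOSE_QUOTE = "\u2018", "\u2019"
WORD_RE = re.compile(r"[A-Za-z]+")


def smarten_single_quotes(text):
    """Replace straight single quotes with curly ones."""
    out = []
    for idx, char in enumerate(text):
        if char != "'":
            out.append(char)
            continue
        prev = text[idx - 1] if idx > 0 else " "
        following = WORD_RE.match(text, idx + 1)
        word = following.group(0).lower() if following else ""
        if prev.isalnum():
            # Mid-word or word-final: apostrophe or closing quote (same glyph).
            out.append(CLOSE_QUOTE)
        elif word in LEADING_APOSTROPHE_WORDS:
            # Known word such as 'tis: an apostrophe, not an opening quote.
            out.append(CLOSE_QUOTE)
        elif prev.isspace() or prev in "([{\"-":
            out.append(OPEN_QUOTE)
        else:
            out.append(CLOSE_QUOTE)
    return "".join(out)
```

The list lookup uses the whole following word, so 'em is treated as an apostrophe while a quoted word like 'emphasis' still gets an opening quote.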

- Ahi

ahi · 08-29-2009, 12:27 PM

Corrected file uploaded...

... it was actually cutting off final lines, and the error nrapallo found related to that.

Seems ok now.
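
For anyone curious about the traceback above: the failing line looks two items ahead (theTome[idx+2]) without checking that such an item exists. A bounds-checked lookahead could be sketched like this. It assumes theTome is a list of strings, and the helper name starts_lowercase is hypothetical, not the fix that went into the uploaded file.

```python
import string

LOWER = set(string.ascii_lowercase)


def starts_lowercase(tome, idx):
    """True when item idx exists and begins with a lowercase ASCII letter.

    The bound is checked first, so looking past the end of the list
    returns False instead of raising IndexError. Using a set also means
    an empty item counts as "not lowercase", whereas str.find("")
    returns 0 and would be read as a match.
    """
    return 0 <= idx < len(tome) and tome[idx][0:1] in LOWER
```

A call such as starts_lowercase(theTome, idx + 2) is then safe on the last lines of a file.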

- Ahi

ahi · 08-29-2009, 11:34 PM

Updated pacify.py ... see attached or first post.

---

Now pacify.py by default produces HTML... of sorts. Don't worry, I'll add back in the functionality to output UTF-8 plaintext when I get around to it.

The caveat being: no <p> tags are produced (although if you look at the source, it is separated by linebreaks, so you can add in the <p> tags easily enough with some clever search and replace).
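
If you would rather script that search and replace, something like the following would do. It is a sketch that assumes each paragraph sits on its own line in the output; add_paragraph_tags is just an illustrative name.

```python
import re


def add_paragraph_tags(body):
    """Wrap each linebreak-separated chunk of text in <p>...</p>.

    Assumes paragraphs in the output are separated by one or more
    newlines and contain no newlines themselves.
    """
    chunks = [chunk.strip() for chunk in re.split(r"\n+", body)]
    return "\n".join("<p>%s</p>" % chunk for chunk in chunks if chunk)
```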

Also, footnotes extracted from RTF files are enclosed in <footnote>some text here</footnote> for the sake of simplicity--this will be fixed.

Oh, and presently only formatting from RTF is picked up... so presently _emphasized phrase_ style formatting is not recognized.
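
Until that is supported, underscore emphasis can be converted separately. A minimal sketch, assuming the underscores hug the phrase and sit at word boundaries (EMPHASIS_RE and underscores_to_italics are hypothetical names, not part of pacify.py):

```python
import re

# Underscores must hug the phrase and sit at word boundaries, so that
# identifiers such as some_file_name are left alone.
EMPHASIS_RE = re.compile(
    r"(?<![A-Za-z0-9_])_(?=\S)([^_\n]+?)(?<=\S)_(?![A-Za-z0-9_])")


def underscores_to_italics(text):
    """Turn _emphasized phrase_ into <i>emphasized phrase</i>."""
    return EMPHASIS_RE.sub(r"<i>\1</i>", text)
```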

---

The input is autodetected as either .txt or .rtf based on the file extension. Many RTFs work well... but I am regularly encountering ones that prove problematic. Since RTF seems to be a rather large and unwieldy specification, I am not sure how likely I am to be able to guarantee the accuracy of conversion.

If anybody has advice on how I can make my RTF parser cleverly ignore stuff that it doesn't care about, I'd be grateful. It does alright so far... but since I do not yet understand how I could opt to only process text that shows visibly (as opposed to metadata) I am actively filtering out metadata one rtf command at a time... doubtless the wrong way to do it, I know.
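
One common approach, sketched here under stated assumptions rather than as a drop-in fix: RTF marks optional destinations with \* at the start of a group, so a parser can track brace depth and skip any such group whole, along with a short list of known non-text destinations. The list below is a partial, illustrative subset, hex escapes are assumed to be Windows-1252, and visible_text is a hypothetical name.

```python
import re

# Destinations that never hold body text. Partial, illustrative subset.
SKIP_DESTINATIONS = {"fonttbl", "colortbl", "stylesheet", "info", "pict",
                     "header", "footer"}

TOKEN_RE = re.compile(
    r"\\'([0-9a-fA-F]{2})"      # hex-escaped byte such as \'e9
    r"|\\([a-z]+)(-?\d+)? ?"    # control word, optional number, optional space
    r"|\\([^a-z])"              # control symbol such as \* \{ \} \\
    r"|([{}])"                  # group open / close
    r"|([^\\{}]+)")             # run of plain text


def visible_text(rtf):
    """Return the visible body text of a simple RTF string."""
    out = []
    stack = []          # saved "skipping" state of the enclosing groups
    skipping = False    # True while inside a group we do not care about
    star = False        # True right after \*, the ignorable-destination mark
    for match in TOKEN_RE.finditer(rtf):
        hexval, word, _arg, symbol, brace, text = match.groups()
        if brace == "{":
            stack.append(skipping)
        elif brace == "}":
            skipping = stack.pop() if stack else False
        elif symbol == "*":
            star = True
            continue
        elif skipping:
            pass
        elif hexval:
            # Assumption: the document uses the Windows-1252 code page.
            out.append(bytes.fromhex(hexval).decode("cp1252", "replace"))
        elif word:
            if star or word in SKIP_DESTINATIONS:
                skipping = True
            elif word in ("par", "line"):
                out.append("\n")
        elif symbol in ("{", "}", "\\"):
            out.append(symbol)
        elif text:
            out.append(text.replace("\r", "").replace("\n", ""))
        star = False
    return "".join(out)
```

Skipping by group rather than by individual command means unknown metadata is dropped without having to list every control word, though formatting and footnote handling would still need their own cases.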

The output defaults to HTML unless the -l switch (LaTeX) is used. The LaTeX switch now requires an argument... currently only supports -l gppro though.

Also, you should provide the title (-T "..."), author (-A "lastname, firstname"), and optionally subtitle (-S "...") for the generated LaTeX document to have a nice title page. Optionally you can also specify your name (-I "...") for an "Ex Libris ..." inscription at the bottom of the title page.

Some parts of the program are a bit more robust now... so you are less likely to encounter errors, but they will almost certainly still happen if the file is very messy (or, I suppose, just very different from the ones I have tested with).

Comments, reports, suggestions are appreciated.

---

Suggested use:

Produce HTML from text:
pacify.py -i input.txt -pcq

Produce LaTeX from text:
pacify.py -i input.txt -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

Produce HTML from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcq

Produce LaTeX from RTF (preserving italic/bold formatting + footnotes):
pacify.py -i input.rtf -pcql gppro -T "Title of Book" -A "Lastname, Firstname" -S "a jolly good tale" -I "Ahi"

- Ahi

sherman · 08-31-2009, 02:51 AM

I haven't tried this script yet, but I will do at some point when I have some text files that need working on.

Just an idea for the distant future - and don't know if this is feasible even - but a lot of text files do not have any markup whatsoever. It would be so timesaving if a script/program could be written that could automate the task of adding italics/emphasis to text. Stuff like internal dialogue (I doubt this is possible he mused) and telepathic type conversations. Maybe even ship/aircraft/<insert vehicle> names.

Yeah, I know. I'm probably dreaming...

ahi · 08-31-2009, 11:36 AM

Sherman, assuming there are reasonably consistent rules that define what you want in italics, this might not even be hard.

The example you gave though is far too vague.

Quote:

I doubt this is possible he mused

to

Quote:

I doubt this is possible he mused

There is nothing indicating that "he mused" is not just a natural part of the sentence... as in: "She always admired the way he mused." Not the most sensible statement... but certainly you wouldn't want any part of that italicized as internal dialogue.

The other issue is that "mused" doesn't necessarily indicate internal dialogue. He could have been musing aloud to somebody else.

Having said that... if you wanted all sentences that end with ", s/he thought" and ", s/he thought to him/herself" and ", s/he wondered." that's doable. The problem is the potential for considerable variety.

The only way I see this being doable is via a method where, upon first pass, the program produces a list of sentences that it believes (based on whatever sort of pattern matching) to be candidates for italicizing as internal dialogue.

The user would then go over this list, and take out all false positives, and rerun the program for the italicization to take place based on the previously produced and now corrected list file.
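
A rough sketch of that two-pass flow, for illustration only. The trigger pattern covers just the ", s/he thought" and ", s/he wondered" endings mentioned above, and the names CANDIDATE_RE, find_candidates and italicize_approved are hypothetical.

```python
import re

# Hypothetical trigger: a sentence ending in ", he/she thought" or
# ", he/she wondered", optionally followed by "to himself/herself".
CANDIDATE_RE = re.compile(
    r"[^.!?]*,\s+s?he\s+(?:thought|wondered)(?:\s+to\s+(?:him|her)self)?[.!?]")


def find_candidates(text):
    """Pass 1: list sentences that might be internal dialogue."""
    return [m.group(0).strip() for m in CANDIDATE_RE.finditer(text)]


def italicize_approved(text, approved):
    """Pass 2: italicize the part before the last comma of each sentence
    the user left in the list."""
    for sentence in approved:
        thought, _sep, tag = sentence.rpartition(",")
        text = text.replace(sentence, "<i>%s</i>,%s" % (thought, tag))
    return text
```

The output of find_candidates would be written to a list file; after the user deletes the false positives, the remaining lines are read back and handed to italicize_approved.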

Not sure how reliable this would be for the specific thing you are proposing it for... but the general principle might work for other similar tasks.

- Ahi

sherman · 08-31-2009, 09:01 PM

Hmm, when there's a sentence (or more than one) that includes "I", sans quotes of any type, this might not be so hard. The program would have to check it against the entire paragraph though to ensure it is not part of a conversation.

Also, if a sentence ends in something like "...s/he/<character name> thought.", and again that sentence is not detected within quotes, chances are it could be internal dialogue.

A harder one to catch is this sort of situation: ...That's the sixth servant he's sent screaming so far today., where that was the end of some internal dialogue. There may or may not be preceding internal dialogue with that, but I doubt it would be so simple for an automatic script or program to catch.

ahi · 08-31-2009, 09:22 PM

If you post a 4-5 paragraph sample text, I can tell you how readily scriptable it would be.

- Ahi

Jellby · 09-01-2009, 04:55 AM

I seriously doubt this can be automated in any way. Recognizing sentences and parsing their meaning goes far beyond simple scripting and into artificial intelligence.

Take, for instance, a text written in first person:

It was a dark night, I could hardly see her face, and I wondered what she thought.

It has no quotes, it includes "I" and it ends with "she thought"...

ahi · 09-01-2009, 07:51 AM

Quote:

Originally Posted by Jellby

I seriously doubt this can be automated in any way. Recognizing sentences and parsing their meaning goes far beyond simple scripting and into artificial intelligence.

Take, for instance, a text written in first person:

It was a dark night, I could hardly see her face, and I wondered what she thought.

It has no quotes, it includes "I" and it ends with "she thought"...

Oh, there's no chance of being able to automate this universally. But if somebody has a specific book in mind that perhaps has a particular sort of MO for internal dialogue, it could be gotten right for that.

Not that I'm even sure though what part of the sentence you quoted ought to be italicized as "internal dialogue"... just the last third? All of it? None of it (it being more narration than internal dialogue)?

- Ahi

Jellby · 09-01-2009, 07:58 AM

Quote:

Originally Posted by ahi

Not that I'm even sure though what part of the sentence you quoted ought to be italicized as "internal dialogue"... just the last third? All of it? None of it (it being more narration than internal dialogue)?

None of it, of course.

ahi · 09-01-2009, 08:54 AM

Quote:

Originally Posted by Jellby

None of it, of course.

Well then, Old Chap... it wasn't a very good example of internal dialogue that should be detected to be italicized, but probably wouldn't be.

- Ahi

Jellby · 09-01-2009, 09:23 AM

No, it was an example of a likely false positive with the rules above.

Granted, you can eliminate false positives with your two pass method, but there could be literally hundreds of them, often many more than real "internal dialogue" phrases.

As for false negatives, I often find dialogues (internal or not) that just omit the "he said", "she thought", etc. words. One should also look for "he said to himself" or "he wondered", or "he secretly admitted", etc.

An automated tool can be of some help, but the danger is letting the user rely solely on the tool, which can be worse than just leaving the "internal dialogues" unformatted. Similarly, when I see curly quotes wrongly oriented I would prefer they had been left as straight quotes instead.



