11-12-2007, 03:15 PM | #1 |
Addict
Posts: 208
Karma: 582
Join Date: Aug 2006
Device: Zire71
|
Text tool for formatting Gutenberg text files
a.k.a.
What "Cleaning Up" Do Project Gutenberg Texts Need, Part 2

Here is the download link for the Txt4EBook tool.

The tool is written in Java, so you'll need the latest Java 6 software installed on your system. For downloads and more info, go to Sun's Java download page.

The program file is already configured to run as long as the OS has Java installed. In general, that means you can start it either:
1) by double-clicking the txt4ebook.jar program file icon in a GUI file manager, or
2) by running this command in a console: java -jar txt4ebook.jar

Either method should work on most machines. If you have problems, consult Sun's help pages. Again, you need the latest VERSION 6 of Java!!!

I only created it the other week, so it doesn't even have a version number yet. I'll try to incorporate more functionality based on your comments, but don't expect too much. It is only a side project for me, and my time is limited.

Its primary goal is simply to process a text file, not to convert it to a more advanced format like HTML. So my goals are very modest: do whatever processing is necessary to prepare a Gutenberg text file for a reader device (including text-to-voice reading software). That means simple manipulations. Still, I will include the ability to add custom-defined manipulations so that you can process ANY text file for ANY purpose (keeping the processor more or less general purpose). However, the defaults are preset for Gutenberg texts based on my preferences. At some point I'll try to add other preferences and/or the ability to load/save user preferences.

Anyway, so much for now. This version simply reformats paragraphs by removing the extra line breaks within them. There is also an optional paragraph indentation setting. Next I'll add tab processing and custom regular expression filters (for removing things like Page XXX).

I hope you find it useful.

P.S.: I am using the latest Cybook reader, so the default settings are geared for it.
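For readers curious what "removing extra line breaks" can look like in code, here is a minimal sketch. It is not the actual Txt4EBook source: the class and method names are hypothetical, it assumes paragraphs are separated by blank lines, and it uses a modern JDK (Java 8 or later) rather than Java 6.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of paragraph unwrapping; not the real Txt4EBook code. */
final class ParagraphUnwrapper {

    /**
     * Joins the hard-wrapped lines of each paragraph into a single line.
     * Assumption: paragraphs are separated by one or more blank lines.
     * Each paragraph is prefixed with indent (pass "" for no indentation).
     */
    static String unwrap(String text, String indent) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : text.split("\\r?\\n", -1)) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                // A blank line closes the paragraph collected so far.
                flush(current, paragraphs, indent);
            } else {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(trimmed);
            }
        }
        flush(current, paragraphs, indent);
        return String.join("\n\n", paragraphs);
    }

    /** Stores the collected paragraph, if any, and resets the buffer. */
    private static void flush(StringBuilder current, List<String> out, String indent) {
        if (current.length() > 0) {
            out.add(indent + current);
            current.setLength(0);
        }
    }

    public static void main(String[] args) {
        String sample = "It was the best\nof times.\n\nIt was the worst\nof times.\n";
        System.out.println(unwrap(sample, "  "));
    }
}
```

Note that this sketch trims every line, so it also merges blocks whose lines start with spaces; a real tool may prefer to leave indented blocks alone.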
Last edited by bob_ninja; 11-12-2007 at 03:18 PM. |
11-12-2007, 08:30 PM | #2 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Very interesting.
I used it with a rather large PG file and it did rather well for the most part. It still missed converting many sections of block text. While it would not combine text blocks that had indents as the first few characters, the ones I mention were flush left without indentation. One of the first features you need to add is an option to change the name of the output file rather than assume that we all want it put back on top of the original file. (As you said, this is your setting.) |
11-12-2007, 10:27 PM | #3 | ||||
Addict
Posts: 208
Karma: 582
Join Date: Aug 2006
Device: Zire71
|
Here is an example I believe you refer to:
Quote:
The default minimum paragraph length is 300 characters, or almost 4 lines at 80 characters per line. So the tool determined that the section above is not really a paragraph and didn't process it. At some point I could/should add a more sophisticated semantic analyzer that would be smarter at distinguishing paragraphs from other sections. For now, this simple check will have to do. So try reducing the value to a lower number. Perhaps I should use a smaller default. Here are some examples of sections that are NOT paragraphs and should not be processed: Quote:
Quote:
Quote:
So, in summary, reduce the minimum paragraph length if you find that some paragraphs are not processed but you want them to be. |
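The minimum-length check described above could be sketched roughly as follows. This is only an illustration: the names are hypothetical, and measuring the length as the total character count of the block (line breaks included) is an assumption, since the post does not say exactly how the tool counts.

```java
/** Hypothetical sketch of the minimum paragraph length check; not the real tool. */
final class MinLengthFilter {

    /** Default from the post: 300 characters, almost 4 lines of 80 characters. */
    static final int DEFAULT_MIN_PARAGRAPH_LENGTH = 300;

    /** Assumption: a block counts as a paragraph if it has at least minLength characters. */
    static boolean isParagraph(String block, int minLength) {
        return block.length() >= minLength;
    }

    /** Joins the lines of a block only if it passes the length check. */
    static String process(String block, int minLength) {
        if (!isParagraph(block, minLength)) {
            return block; // too short: likely a heading, verse, or table, so leave it alone
        }
        return block.trim().replaceAll("\\s*\\r?\\n\\s*", " ");
    }

    public static void main(String[] args) {
        String block = "one two three\nfour five six\nseven";
        // With the default of 300 the block is left untouched;
        // with a lower threshold such as 20 its lines are joined.
        System.out.println(process(block, DEFAULT_MIN_PARAGRAPH_LENGTH));
        System.out.println(process(block, 20));
    }
}
```

Lowering the threshold makes more short blocks eligible for joining, which is the trade-off described above: fewer missed paragraphs, but a higher chance of merging lines that were meant to stay separate.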
||||
11-12-2007, 11:42 PM | #4 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Thanks for the pointers. My editors don't show backup copies so I missed the backup that your program made. (I also had another copy just in case.)
It does go a long way. I have historically used Stingo's Word Macro, which only reacts to double <CR>s. I will work with it more. I see a lot of potential there. |
11-13-2007, 07:47 AM | #5 | |
Addict
Posts: 208
Karma: 582
Join Date: Aug 2006
Device: Zire71
|
Quote:
Still, I'll add the option for a different output file. |
|
11-13-2007, 12:28 PM | #6 |
creator of calibre
Posts: 44,380
Karma: 23766374
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
From a technical perspective, I haven't looked at your code, but I suggest you create an internal object model of the txt file so that it becomes easy to support different output formats in the future. It will be a little slower, but I think it's worth it.
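To illustrate the idea, here is one possible shape for such an object model. It is a sketch under assumptions, not anything from the tool: the class names, the two block kinds, and the HTML renderer are all hypothetical, and it targets a modern JDK (Java 8 or later).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of a text file as a list of typed blocks. */
final class TextDocument {

    /** Assumed block kinds: flowing paragraphs and blocks kept verbatim. */
    enum Kind { PARAGRAPH, VERBATIM }

    static final class Block {
        final Kind kind;
        final String text;

        Block(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    private final List<Block> blocks = new ArrayList<>();

    TextDocument add(Kind kind, String text) {
        blocks.add(new Block(kind, text));
        return this;
    }

    /** Plain-text output: blocks separated by one blank line. */
    String toPlainText() {
        List<String> out = new ArrayList<>();
        for (Block b : blocks) {
            out.add(b.text);
        }
        return String.join("\n\n", out);
    }

    /** A second output format built from the same model. */
    String toHtml() {
        StringBuilder sb = new StringBuilder();
        for (Block b : blocks) {
            String tag = (b.kind == Kind.PARAGRAPH) ? "p" : "pre";
            sb.append('<').append(tag).append('>')
              .append(escape(b.text))
              .append("</").append(tag).append(">\n");
        }
        return sb.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        TextDocument doc = new TextDocument()
                .add(Kind.PARAGRAPH, "Tom & Jerry")
                .add(Kind.VERBATIM, "a < b");
        System.out.println(doc.toPlainText());
        System.out.print(doc.toHtml());
    }
}
```

The parsing step fills the model once, and each output format becomes a separate method over the same blocks, which is where the extra memory and time cost comes from.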
|
|