03-20-2007, 09:05 PM | #1 |
Enthusiast
Posts: 49
Karma: 11
Join Date: Mar 2007
Device: prs500
|
Is there a conversion program which will do this?
I own and use both an ebookwise 1150 and a Sony Reader, the latter of which I have had for almost three weeks now.
I have been unable to find a conversion program (or programs) that successfully creates ebooks for either reader. I have tried ebook librarian, BBeB binder and Book Designer 4. Not bad programs, but not perfect. Can anyone tell me if there is a conversion program or programs that will do the following:
* Convert pdf to lrf and/or imp.
* Do this without the annoying line/page breaks which leave one or two words on a line, or every other page blank except for a few lines of text.
* Strip out page numbering which causes numbers to appear throughout the text.
* Allow me to use S, M, L text size on the Sony reader and still have good "flowing" text formatting.
* Create a TOC.
* Allow me to set margins at top/bottom/edges.
* Keep paragraph formatting intact.
I have been playing with Book Designer for a couple of days but have not found a way to make ebooks as I would like to see them (like the ones from CONNECT - what software do they use I wonder?). Does anyone know if it is possible to do the above with Book Designer? If not I will stop wasting my time with it. Any suggestions would be most welcome. Thanks. |
03-20-2007, 09:57 PM | #2 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
I do not know of any one package that does all the things you ask. Almost everything you mention can be accomplished with a combination of packages.
While BookDesigner (I assume that you have the BD4 release and have applied all updates through the March 15, 2007 update) can convert directly to IMP and LRF, the results are sometimes suboptimal and a bit of reformatting is required. No package that I have ever found can strip out the page headers, page numbers and page footers of a PDF during conversion. For that I have used Adobe Acrobat Professional to crop the pages to exclude these areas and then extracted the remaining text. In most cases this works. On one project the page headers still showed up in the resulting extraction.
I use ABC Amber PDF Converter. I tend to extract the text and then do my editing in Word. I use Stingo's Word Macro from the MobileRead Wiki Conversion page to remove the end-of-line paragraph marks. This becomes a bit dicey when there is only an indent to signal a new paragraph, and certain pre-actions must be performed to replace the indent with another paragraph mark. Once the paragraphs are set as units rather than a collection of lines the reflowing will work fine.
As I understand the program, BD5 allows for the setting of margins and can produce a TOC. I have not pushed at the TOC function much but the default seems to work. It seems rather strange to want to set the margins and have reflow while wanting to keep the paragraph formatting intact. The line breaks and the number of paragraphs per page cannot stay the same when the text size changes. Also, the PDF that was used as a source is most likely either letter or A4 sized. The only answer there would be to load the PDF directly on the Reader, which would produce lettering too small to read.
BD5 can produce files in both IMP and LRF from the same source file. I think everything you need for formatting the output is in BD5; you may just have to dig a little more at the documentation. As for what Sony uses, they invented and control the format, so their tool is an internally generated one. |
03-21-2007, 11:09 AM | #3 |
Member
Posts: 17
Karma: 24
Join Date: Mar 2007
Device: PRS-500
|
When I end up with an ebook where the formatting is totally cheesed, I usually end up converting it to text, and then writing a quick Perl or PHP script to re-format it for me. I had a book recently (Brother Odd I believe) which had hard line-breaks, page numbers and such throughout the text. Seven lines of scripting completely fixed the formatting, including re-joining paragraphs which spanned page breaks, indenting paragraphs, fixing chapter marks, etc. Doing that sort of thing isn't too difficult once you get the hang of it. Since I run into that infrequently I typically just modify the script for the book instead of writing a full-blown tool to handle a myriad of formatting issues.
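A cleanup script of that kind might look roughly like the sketch below. It is written in Python rather than Perl or PHP, and it is not the poster's script. The function name `clean_hard_wrapped` and its rules are assumptions for illustration: a line holding only digits is taken to be a page number, blank lines are taken to be page-break debris, and an indented line is taken to start a new paragraph. Each book would probably need its own adjustments.

```python
def clean_hard_wrapped(text: str) -> str:
    """Drop page-number lines and rejoin hard-wrapped lines into paragraphs.

    Hypothetical helper. Assumptions: digit-only lines are page numbers,
    blank lines carry no meaning, and an indent marks a new paragraph.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.isdigit():
            continue  # blank line or page number: ignore, paragraph carries on
        if line[0] in " \t" and current:
            # An indented line starts a new paragraph, so close the old one.
            paragraphs.append(" ".join(current))
            current = []
        current.append(stripped)
    if current:
        paragraphs.append(" ".join(current))
    # One blank line between paragraphs in the output.
    return "\n\n".join(paragraphs)
```

Because blank lines and page numbers are skipped rather than treated as breaks, a paragraph that spans a page break comes out as one unit.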
|
03-21-2007, 08:18 PM | #4 | |
Enthusiast
Posts: 49
Karma: 11
Join Date: Mar 2007
Device: prs500
|
Quote:
What I mean is that I want lines like this:
"I want lines like this" he said.
"Do you?" I said.
Rather than:
"I want lines like this" he said. "Do you?" I said.
and yet still have flowable re-sizable text for ordinary sentences. Is this possible? I tried the macro you mentioned but it will not do this. It just bungs all the text together in a big block. |
|
03-21-2007, 09:05 PM | #5 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
As I mentioned before, it is always a bit dicey when there are no specific end-of-paragraph symbols. These are most commonly double paragraph marks when you are converting from text that also has a paragraph mark at the end of each line (as Gutenberg does, and which is also the default for Stingo's macro). Another common paragraph marker is an indent at the start of a paragraph. In that case you need to adjust the macro to seek a paragraph mark followed by either a tab or a series of spaces as the end-of-paragraph mark. Otherwise there is no way for any script to know what is simply an end-of-line marker and what is an end-of-paragraph mark. In that type of case the only solution is manual editing to add the additional paragraph mark prior to running the macro.
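For anyone working outside Word, the same rule can be sketched in Python. This is not Stingo's macro; the name `join_lines_keep_paragraphs` is made up for illustration, and it assumes that a paragraph ends either at a blank line or at a line break followed by a tab or spaces. A stray leading space on an ordinary line would be mistaken for a paragraph start.

```python
import re

# Assumed paragraph signals: one or more blank lines, or a line break
# followed by an indent (tab or spaces).
PARA_BREAK = re.compile(r"\n(?:[ \t]*\n)+[ \t]*|\n[ \t]+")

def join_lines_keep_paragraphs(text: str) -> str:
    """Join hard-wrapped lines but keep one line break per paragraph."""
    text = text.replace("\r\n", "\n")
    paragraphs = []
    for chunk in PARA_BREAK.split(text):
        # Remaining line breaks inside a chunk are plain end-of-line breaks:
        # collapse them, and any runs of spaces, into single spaces.
        joined = " ".join(chunk.split())
        if joined:
            paragraphs.append(joined)
    return "\n".join(paragraphs)
```

If the source has neither blank lines nor indents, nothing here can tell the two kinds of break apart, and the manual editing described above is still needed.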
|
03-22-2007, 10:29 AM | #6 |
Reborn Paper User
Posts: 8,616
Karma: 15446734
Join Date: May 2006
Location: Que Nada
Device: iPhone8, iPad Air
|
With a macro you can twist around the functions of Word.
In a search and replace, Word recognizes the difference between single and double paragraph marks. Knowing this eases the process. First, open 'Find and Replace' from the Edit menu and pick a special character that does not exist in your text, like 'column break' (it's under 'More' and then 'Special'). Replace 'double paragraph marks' with that particular character. Then erase every 'single paragraph mark' (that is, replace with nothing). Then come back and switch that special character to a 'single paragraph mark'. When you record your macro from the 'Tools' menu, all of that can be added to the text formatting part. Last edited by yvanleterrible; 03-22-2007 at 10:32 AM. |
03-22-2007, 10:07 PM | #7 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Sometimes (most times) the simple removal of paragraph marks as yvan describes will leave the last word of a line joined to the first word of the next line. I always replace each single paragraph mark with a space and then replace the double spaces with single spaces.
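The whole sequence, with the space fix included, can also be written as a short Python sketch for those not using Word. The function name `unwrap_with_placeholder` and the choice of the NUL character as the stand-in for the 'column break' are assumptions for illustration only.

```python
def unwrap_with_placeholder(text: str, placeholder: str = "\x00") -> str:
    """Placeholder swap: protect double breaks, join lines, restore breaks."""
    if placeholder in text:
        # The stand-in character must not already occur in the text.
        raise ValueError("placeholder already occurs in the text")
    text = text.replace("\r\n", "\n")
    text = text.replace("\n\n", placeholder)  # 1. park the real paragraph breaks
    text = text.replace("\n", " ")            # 2. single breaks become spaces
    while "  " in text:                       # 3. squeeze double spaces
        text = text.replace("  ", " ")
    return text.replace(placeholder, "\n")    # 4. restore the paragraph breaks
```

One caveat: a run of three line breaks leaves a leading space on the following paragraph, so a final trim may be wanted.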
|
03-24-2007, 10:32 PM | #8 |
Enthusiast
Posts: 49
Karma: 11
Join Date: Mar 2007
Device: prs500
|
The texts I want to convert seem to have a paragraph break at the end of every line. I can't find any double ones at all. Sorry guys, but this one seems hard to crack!
I see that http://manybooks.net// has hundreds of Gutenberg books all perfectly formatted for the Sony reader. How do they do it I wonder??? |
03-24-2007, 11:39 PM | #9 | |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Quote:
Although a lot of the Gutenberg texts have double paragraph marks to signal the end of a paragraph, sometimes you just have to go through the book line-by-line to make it right. Often -- if you search hard enough -- you might find a better formatted edition. Sometimes you just have to do it yourself. |
|
03-25-2007, 02:34 AM | #10 |
eNigma
Posts: 503
Karma: 1335
Join Date: Dec 2006
Location: The Philippines
Device: HTC G1 Android FBReader
|
Post an example?
Can someone attach an example of a file that contains the line-termination problem under discussion? I think Plowman has the right idea, but without an example to play with it is hard to be sure of a solution.
|
03-25-2007, 04:18 PM | #11 |
Reborn Paper User
Posts: 8,616
Karma: 15446734
Join Date: May 2006
Location: Que Nada
Device: iPhone8, iPad Air
|
It's easy to find double paragraph marks: just do the find thing. When you do it, it will bring up a dialog box showing the number of times it found them. Then you do a replace. Sometimes it can be a 'paragraph mark, space, paragraph mark'.
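The same count can be made outside Word with a few lines of Python. This is only a sketch; the name `count_paragraph_breaks` and the three patterns it checks are assumptions about what a paragraph signal might look like in a given file.

```python
import re

def count_paragraph_breaks(text: str) -> dict:
    """Count the usual candidates for a 'real' paragraph break."""
    text = text.replace("\r\n", "\n")
    return {
        # paragraph mark directly followed by another one
        "double": len(re.findall(r"\n\n", text)),
        # paragraph mark, spaces or tabs, paragraph mark
        "with_spaces": len(re.findall(r"\n[ \t]+\n", text)),
        # paragraph mark followed by an indented line
        "indented": len(re.findall(r"\n[ \t]+\S", text)),
    }
```

If all three counts come back as zero, the file carries no mechanical paragraph signal at all.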
@mogui You can make a quick example for yourself. Take any PDF and do the 'convert to text'. Then open the document with Word; it's riddled with them. @RWood You're right, it's replace with a space. Writing scripts is totally 'alienite' to most of us; we have to make do with the tools already offered. |
03-26-2007, 01:10 AM | #12 |
eNigma
Posts: 503
Karma: 1335
Join Date: Dec 2006
Location: The Philippines
Device: HTC G1 Android FBReader
|
Special characters
I do file conversions frequently, and by various means. I fixed my old broken TRGpro, so now I use it for reading again. I am in China so I can't just go to CompUSA or Fry's and pick up a new device. Though I have plenty of e-books in Palm format, most of my files are in ".pdf" or ".txt" formats. I have found Plucker to be a wonderful way to achieve readable formatting of those files with very little effort. I do not need to futz with line endings and such.
But when I was formatting text for my 24 character line MP4 player I became familiar with the necessary gyrations which I will outline here. I converted the Palm TX manual from pdf to txt to use here as an example. It is freely downloadable. I saved it as text using Acrobat. I brought it up in PSPad, an excellent freeware programmer's editor. Then I was able to view the txt file in hex mode. The first little bit looks like this:
"User Guide 0D0A 0D0A 0D0A 0C 0D0A Copyright and"
The ASCII codes are:
0D0A = carriage return, line feed (CRLF).
0C = form feed (FF), or page break.
The CRLF is what you are calling a paragraph mark. It commonly shows in most editors as a paragraph symbol. This symbol is not a part of the ASCII character set. The form feed is a page break. Often the FF is close to a number or a repeated string. This is a good clue for the identification of text you might wish to remove.
My problem was to reformat the text to fit a 24 character line and reflow the text at word breaks. To do this I followed the following steps:
1/ Replace all CRLF sequences with <*>CRLF.
2/ Select all the text in my editor (PFE in this case) and reflow it.
3/ Replace all <*>CRLF sequences with CRLF.
Some good free programmer's editors are: PSPad, ConTEXT and PFE32.
Now the text has the proper paragraph formatting and text breaks occur between words. In other words, it is readable. Now if I wish, I can replace CRLFCRLF sequences with CRLF to eliminate extra line spacing. The form feed "0C" character can be replaced by a space or a CRLF sequence -- your choice. I usually replace tabs with 2 spaces and later crunch spaces down by repeatedly replacing SPSP with SP.
There is a small free conversion utility called storymaster that will do the above operations quite satisfactorily. The site will link you to a free download of html2txt that includes the source code, in case you are starting with an html file.
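The reflow steps above can be sketched in Python with the standard `textwrap` module. This is not the attached script or storymaster. The function name `reflow` is made up for illustration, and it assumes, as in the steps above, that every CRLF in the source marks the end of a paragraph. Form feeds are treated as line breaks here, which is one of the two choices mentioned.

```python
import textwrap

def reflow(text: str, width: int = 24) -> str:
    """Re-wrap each paragraph to `width` columns, breaking between words."""
    # Normalise CRLF to LF and treat a form feed (page break) as a line break.
    text = text.replace("\r\n", "\n").replace("\f", "\n")
    wrapped = []
    for para in text.split("\n"):
        # split() + join() collapses tabs and runs of spaces into one space.
        words = " ".join(para.split())
        if words:  # empty lines are dropped, removing the extra line spacing
            wrapped.append(textwrap.fill(words, width=width))
    return "\n".join(wrapped)
```

By default `textwrap.fill` splits any single word longer than the width; passing `break_long_words=False` would keep such words whole at the cost of an over-long line.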
Of course the problem stated at the beginning of this thread is more complex, which is why I do not simply give you the link to storymaster and be done with it. This thread discusses scripting. Windows has a nice scripting facility built in. It is called Windows Scripting Host, or WSH. With it you can use Visual Basic scripts or Javascripts and run them just like regular programs. Microsoft is quite happy to teach you all about it here. Help with VBScript is only a click away. You can find lots of script examples on the web, so you don't have to start from scratch. I never do.
I am attaching a Visual Basic WSH script I used to reformat text to 24 character lines for my MP4 player. I have added the extension ".txt" to the filename so the uploader would accept it. Remember to virus-scan all executables before running them. Open it in Notepad or your favorite editor. Remove the ".txt" extension before running it. It is not perfect. I never bothered to clean it up. I adapted it from a script that converted html to text. I have chopped out irrelevant code. I have commented out the part of it that is involved with string replacement. You can use that section for your own experiments.
Back in the old days, running Unix systems, we used to use AWK for all our text reformatting needs. It is a dream come true for changing text strings, and it does not take long to learn. After all, programmers can use it. Whenever we wanted to tell programmers from other people we would just point to something. Ordinary people would look where we pointed. The programmers would always look at our finger. So, if they can do it you can do it. This forum is populated with highly intelligent people.
This site offers sample AWK scripts and will help one to learn to use AWK. It refers to AWK as a programming language. I am sorry. It is really a simple command line utility. Here are some simple one line examples of how to manipulate text. Once you get used to it you will love it.
In summary, it is essential to be able to see the raw data in your files to understand what you need to do. I like hex editors for this. PSPad has a hex viewing mode. Find an ASCII chart you like and link to it. Then you can understand what the character codes are.
Many programming editors will allow you to do search and replace using character codes. Sometimes they look like this: "\f" (FF), or "\r\n" (CRLF; some editors accept plain "\n"). Sometimes they are hex codes like this: "0x0c" (FF) or "0x0d0x0a" (CRLF). See the help files for your editor under "regular expressions". By experimenting with different replacement sequences you can learn what needs to be done. Then you are ready to use a script or a tool like AWK.
If you create a useful script or AWK command line, please consider posting it here so we can all learn. Last edited by mogui; 03-26-2007 at 05:11 AM. |
03-26-2007, 08:10 AM | #13 |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Also, some files that start in a Unix-like environment often have just an LF character at the end of the line, expecting the system to provide the CR portion. I have seen some files that must have been printer files that also contain Backspace (BS) characters used to provide bold as in (to use the mogui representation): "this is mBSyBS first time." which would print as "this is my first time."
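Both quirks can be cleaned up before any reflowing. The Python sketch below is only an illustration. The name `normalize_printer_text` is made up, and it assumes the usual printer overstrike convention in which a character is followed by a backspace and then printed again, so the character before each backspace can simply be dropped.

```python
import re

def normalize_printer_text(text: str) -> str:
    """Unify line endings and remove backspace overstrikes."""
    # CRLF and lone CR both become LF, so later steps see one kind of break.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Drop each "character + backspace" pair; repeat in case pairs are stacked.
    previous = None
    while previous != text:
        previous = text
        text = re.sub(r"[^\x08\n]\x08", "", text)
    # Remove any backspace left with nothing in front of it.
    return text.replace("\x08", "")
```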
|
03-26-2007, 09:07 AM | #14 |
Reborn Paper User
Posts: 8,616
Karma: 15446734
Join Date: May 2006
Location: Que Nada
Device: iPhone8, iPad Air
|
You lost me and I won't even try. I've never done any prog. and I have no time to learn a language just for this. The tools offered by Word are simple, easy to use and do the job. Thanks anyway.
|
03-26-2007, 11:32 AM | #15 | |
Technogeezer
Posts: 7,233
Karma: 1601464
Join Date: Nov 2006
Location: Virginia, USA
Device: Sony PRS-500
|
Quote:
I for one am very glad to put those days far behind me and just be a user of tools rather than a creator of tools. Coding is a young person's game and my eyes are not what they once were. |
|