09-12-2010, 02:03 PM | #1 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
Chapter detection when only digits - regex needed
Hi, I need help.
I'm trying to convert a load of RTFs to EPUBs, and most of them have chapters marked only by digits on their own line, followed by a title on the next line. Rather than adding 'Chapter ' by hand throughout the RTFs, can anyone show me a regex that will recognise a chapter heading that is only numbers? None of the RTFs have line or page numbers (they have been removed), so only chapter numbers appear on a line of their own. I've added re:test(., '\d|\d\d',i) to the chapter detection regex, and that does split the chapters correctly, but it also splits on any paragraph with a single or double digit. It would also be nice if it could automatically assign an <h#> tag. Any help is appreciated. |
09-12-2010, 02:51 PM | #2 | |
reader
Posts: 6,975
Karma: 5183568
Join Date: Mar 2006
Location: Mississippi, USA
Device: Kindle 3, Kobo Glo HD
|
Quote:
Code:
re:test(.,'^\d+$',i)
Last edited by wallcraft; 09-12-2010 at 02:54 PM. |
|
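The difference between the two patterns can be checked outside calibre. This is a minimal sketch, assuming Python's `re.search` is a fair stand-in for `re:test` on these simple patterns; the sample strings and the `is_chapter` helper are made up for illustration.

```python
import re

# Original attempt: matches any text that contains a digit anywhere.
LOOSE = re.compile(r"\d|\d\d")
# Anchored version from the reply: the whole text must consist of digits.
STRICT = re.compile(r"^\d+$")


def is_chapter(text, pattern):
    """Return True if the pattern finds a match in the text."""
    return pattern.search(text) is not None


# Hypothetical paragraph texts, as the chapter detector might see them.
samples = ["12", "7", "He was 42 years old.", "Chapter 3", ""]

for text in samples:
    print(repr(text), is_chapter(text, LOOSE), is_chapter(text, STRICT))
```

The unanchored pattern succeeds on any paragraph that merely contains a digit, which is why it split in too many places; the `^` and `$` anchors restrict the match to digit-only text.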
09-12-2010, 03:02 PM | #3 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
Thanks, that works great.
Now, is there any way to make that an <h#> entry? |
09-12-2010, 05:32 PM | #4 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Try the preprocess option. I can't remember if that case is covered for RTF, but I'm pretty sure it is.
|
09-12-2010, 05:53 PM | #5 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
I'm preprocessing anyway. Not to worry, I think I've got it covered: I'm converting the RTF to EPUB with Calibre, then using several regexes in Sigil's code view to fix things like headers, broken lines, etc.
Now that each of the chapters is split properly, the Sigil part takes a few minutes, much quicker than what I was doing before. They then just need a quick proofread to fix any other mistakes. I'm happy. |
09-13-2010, 10:58 AM | #6 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
I've just done a test to try to make chapters with only digits come out as heading tags, but have been unsuccessful. Is there any way to do that?
'Chapter ##' comes out as a heading fine, and so do 'Prologue' and 'Epilogue'. Is there any way to customise the RTF input plugin as you would with some of the other formats? (I think that may be where the heading tags are applied.) I'm able to split at the correct places with the regex given in an earlier post, but would like them to be headings as well. |
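One possible workaround is to post-process the converted XHTML rather than change the input plugin. The sketch below is not a calibre option; it assumes each chapter number ends up in its own `<p>` element, and the function name and the choice of `<h2>` are arbitrary.

```python
import re

# A paragraph whose entire content is digits. Attributes on <p> and
# whitespace around the number are allowed; \b keeps <pre> from matching.
DIGIT_PARA = re.compile(r"<p\b[^>]*>\s*(\d+)\s*</p>")


def digits_to_headings(xhtml, level=2):
    """Rewrite digit-only paragraphs as <hN> headings (hypothetical helper)."""
    return DIGIT_PARA.sub(rf"<h{level}>\1</h{level}>", xhtml)


print(digits_to_headings('<p class="calibre1">12</p>\n<p>He was 42.</p>'))
```

Paragraphs that only contain a number somewhere in running text are left alone, because the pattern requires the digits to be the paragraph's whole content.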
09-13-2010, 12:35 PM | #7 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I just looked at the preprocessing code. The single-digit chapter headings weren't included in the last check-in. The logic you want is basically done now, but I still need to check in the changes. If Kovid accepts them, it should be in the next build.
I noticed you said that even with preprocessing enabled you still needed to manually remove hard line breaks. If this is the case, please open a bug with the file; I'm trying to catch as many cases as possible without creating false positives. |
09-13-2010, 12:43 PM | #8 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
Thanks ldolse, that's nice to know; I hope it makes it into the next build.
I'll try to reduce an RTF to a few short paragraphs that still have the unrecognised line breaks. |
09-13-2010, 01:27 PM | #9 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
ldolse, I transplanted several incorrectly wrapped paragraphs into a new RTF and converted it to EPUB, and they were all then wrapped correctly, but reconverting the whole original RTF still had the mis-wrapping at those same places.
Weird. |
09-13-2010, 01:48 PM | #10 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Maybe not so weird - the unwrapping function looks at the median line length across all the lines. This works great if all the hard line breaks are in roughly the same spot, but if the lines are extremely variable (and sometimes/often long) this doesn't work out that well, since the median becomes longer than the typical broken line.
I'm going to add an option to tweak that logic, but not sure if it's going to make it in the next release as it's the first time I've attempted any GUI work. |
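The effect of variable line lengths on the median can be seen with a few made-up numbers. This is only an illustration of the statistic, not calibre's preprocessing code; `median_line_length` and both samples are hypothetical.

```python
from statistics import median


def median_line_length(lines):
    """Median length of the non-empty lines."""
    lengths = [len(line) for line in lines if line.strip()]
    return median(lengths) if lengths else 0


# Uniform wrapping: every broken line is close to 70 characters.
uniform = ["x" * n for n in (70, 69, 71, 70, 68, 30)]
# Mixed document: a few wrapped lines plus long, unwrapped paragraphs.
mixed = ["x" * n for n in (70, 69, 71, 400, 520, 610, 450)]

print(median_line_length(uniform))
print(median_line_length(mixed))
```

In the uniform sample the median sits right at the wrap width, so broken lines are easy to spot. In the mixed sample the median lands at 400, far above the roughly 70-character broken lines, so a cut-off tied to the median would miss them.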
09-13-2010, 01:57 PM | #11 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
If it helps, I fix the remaining ones in Sigil by doing a search and replace:
search '([a-z])</p>## <p>' replace '\1 ' (## = crlf*2, plus 2 spaces; the replacement ends with a space). I use a similar one for lines ending in a comma. (I also search for ones that end in a semicolon or colon, just as a check.) I don't know if the preprocessing code can include a regex, but that sort of thing may make it more complete. |
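The same search and replace can be sketched in Python for anyone who wants to batch it. This version folds the lowercase-letter and comma cases into one pattern and uses `\s*` in place of the literal CRLFs and spaces, which is an assumption about the markup; the function name is made up.

```python
import re

# A paragraph ending in a lowercase letter or a comma, followed by any
# whitespace (line breaks, indentation) and the start of the next paragraph.
BROKEN_PARA = re.compile(r"([a-z,])</p>\s*<p>")


def join_broken_paragraphs(xhtml):
    """Merge paragraphs that were split in mid-sentence (hypothetical helper)."""
    return BROKEN_PARA.sub(r"\1 ", xhtml)


sample = "<p>the quick</p>\r\n\r\n  <p>brown fox.</p>"
print(join_broken_paragraphs(sample))
```

Paragraphs that end in a full stop are untouched, so genuine paragraph breaks survive; a proofread is still needed for lists, poetry, and similar short lines.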
09-18-2010, 03:23 AM | #12 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
The preprocess function basically does the regex you're suggesting, but it analyzes the document and gets the median line length, and ties this to the unwrapping function. This is to prevent things like lists, poetry, titles, etc. from being unwrapped.
The problem is the original implementation assumed that if there were hard breaks in the document they would be universal across the file. Reality is that many files have a lot of variability, so this pushes the median line length longer than the typical broken line. Preprocessing now lets you specify the aggressiveness of the line unwrapping, i.e. make the line length cut-off shorter. It's the line unwrapping factor under structure detection. If you get a chance give it a shot and see if it solves your original problem with unwrapping. Note you may need to set it down to 0.1 or 0.2 if there is a huge amount of variability in line length.
edit - single digit chapters are covered now as well.
Last edited by ldolse; 09-18-2010 at 03:25 AM. |
09-20-2010, 02:51 PM | #13 | |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
ldolse, I've been able to test and found that the new wrapping/detection is better.
Thanks. There's one small thing I've noticed it isn't catching: if one line ends in a quote and the next line begins with something like 'he said.' e.g. Quote:
Or something similar, depending on paragraph, spacing, etc.
Last edited by Perkin; 09-20-2010 at 02:54 PM. Reason: additional info |
|
09-20-2010, 05:44 PM | #14 | ||
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Glad to hear it's working better for you.
That scenario isn't covered on purpose. That's because if the line happens to break on the quote you can't create a simple regex that can differentiate between these two scenarios: Quote:
Quote:
Last edited by ldolse; 09-20-2010 at 05:49 PM. |
||
09-20-2010, 06:20 PM | #15 |
Guru
Posts: 657
Karma: 64171
Join Date: Sep 2010
Location: Kent, England, Sol 3, ZZ9 plural Z Alpha
Device: Sony PRS-300, Kobo Aura HD, iPad (Marvin)
|
Thanks to you it's miles easier now.
To rectify those missed wraps, I use the search '</p>(.)(.) <p>([a-z])' and replace ' \3' in Sigil, with match case + minimal matching; that wraps them (and any other line that begins with a lower case letter). Could something similar be used in the preprocess code? |
|
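For reference, the lowercase-start rule from the last post can be written as a single Python substitution. This is a sketch under the assumption that the two `(.)` groups in the Sigil search stand for the line-break characters between paragraphs; `\s*` is used here instead, and the helper name and sample text are made up.

```python
import re

# End of one paragraph, optional whitespace or line breaks, then a new
# paragraph whose first character is a lowercase letter. The lookahead
# leaves that letter in place, so no backreference is needed.
LOWERCASE_START = re.compile(r"</p>\s*<p>(?=[a-z])")


def join_lowercase_continuations(xhtml):
    """Merge a paragraph into the previous one when it starts lowercase."""
    return LOWERCASE_START.sub(" ", xhtml)


sample = '<p>"Stop,"</p>\r\n  <p>he said.</p>\r\n  <p>Nobody moved.</p>'
print(join_lowercase_continuations(sample))
```

Matching is case-sensitive by default in Python, which corresponds to the "match case" setting mentioned in the post. As noted earlier in the thread, a rule like this cannot tell a broken line from a deliberate new paragraph that happens to start in lowercase, so it suits a supervised clean-up better than an automatic default.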