Chapter detection when only digits - regex needed

Perkin · 09-12-2010, 02:03 PM

Hi, I need help.

I'm trying to convert a load of RTFs to EPUBs, and most of them have chapters which are only digits on their own line, followed by a title on the next.

Rather than adding 'Chapter ' to all places in the RTFs by hand,
can anyone show me a regex that will allow a chapter that is only numbers to be recognised?

None of the RTFs have line/page numbers (they have been removed), so only chapter numbers are on a line of their own.

I've added to the detect chapters regex
re:test(., '\d|\d\d',i)
and that does split the chapters correctly, however it also splits on any paragraph with a single or double digit.

It would also be nice if it could automatically be assigned an <h#> tag.

Any help is appreciated.

wallcraft · 09-12-2010, 02:51 PM

Quote:

Originally Posted by Perkin

I've added to the detect chapters regex
re:test(., '\d|\d\d',i)
and that does split the chapters correctly, however it also splits on any paragraph with a single or double digit.

Try adding "^" for start of a line and "$" for end of a line. I would use ("+" matches one or more instances):

Code:

re:test(.,'^\d+$',i)
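
For anyone who wants to see the difference outside calibre, here is a minimal Python sketch. It assumes each paragraph's text is tested on its own, which is roughly what re:test(., ...) does with a node's text. The sample paragraphs and the function name are made up for illustration.

```python
import re

# Hypothetical paragraph texts, one string per paragraph.
paragraphs = ["1", "The Beginning", "He waited 12 days.", "23", "Room 7"]

def matching(pattern, texts):
    """Return the texts in which the pattern is found anywhere."""
    return [t for t in texts if re.search(pattern, t)]

# Unanchored: any paragraph containing a digit matches.
loose = matching(r"\d|\d\d", paragraphs)

# Anchored: only paragraphs made up entirely of digits match.
strict = matching(r"^\d+$", paragraphs)

print(loose)
print(strict)
```

The unanchored pattern picks up "He waited 12 days." and "Room 7" as well as the chapter numbers, while the anchored one keeps only "1" and "23".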

Perkin · 09-12-2010, 03:02 PM

Thanks, that works great.

Now, is there any way to make that an <h#> entry?

ldolse · 09-12-2010, 05:32 PM

Try the preprocess option, can't remember if that case is covered under rtf, but I'm pretty sure it is.

Perkin · 09-12-2010, 05:53 PM

I'm preprocessing anyway. Not to worry, I think I've got it covered. I'm converting the rtf to epub with Calibre, then using Sigil in code view with several regexes to handle things like headers, broken lines, etc.

Now with each of the chapters split properly, the sigil part takes a few minutes, much quicker than what I was doing before. They then just need a quick proofread, to fix any other mistakes.

I'm happy.

Perkin · 09-13-2010, 10:58 AM

I've just done a test to try and make chapters with only digits come out as heading tags, but have been unsuccessful. Is there any way to do that?

'Chapter ##' comes out as a heading fine, and so do 'Prologue' and 'Epilogue'.

Is there any way to customise the rtf input plugin as you would with some of the other formats?
(I think that may be where the heading tags are applied.)

I'm able to split at the correct places with the regex given in an earlier post, but would like them to be headings as well.

ldolse · 09-13-2010, 12:35 PM

I just looked at the preprocessing code - the single digit chapter headings weren't included in the last checkin - the logic you want is basically done now, but I need to check in the changes. If Kovid accepts them then it should be in the next build.

I noticed you said that even with preprocessing enabled you still needed to manually remove hard line breaks - if this is the case please open up a bug with the file. I'm trying to catch as many cases as possible without creating false positives.

Perkin · 09-13-2010, 12:43 PM

Thanks ldolse, that's nice to know, hope it makes it into the next build.

I'll try and reduce an rtf to a few short paragraphs which still have the line breaks not recognised.

Perkin · 09-13-2010, 01:27 PM

ldolse, I transplanted several paragraphs which were incorrectly wrapped into a new rtf, converted it to epub and they were all then wrapped correctly, but reconverting the whole original rtf still had the mis-wrapping at those same places.

Weird.

ldolse · 09-13-2010, 01:48 PM

Maybe not so weird - the unwrapping function looks at the median line length across all the lines. This works great if all the hard line breaks are in roughly the same spot, but if the lines are extremely variable (and sometimes/often long) this doesn't work out that well, since the median becomes longer than the typical broken line.

I'm going to add an option to tweak that logic, but not sure if it's going to make it in the next release as it's the first time I've attempted any GUI work.
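
To make the median effect concrete, here is a small Python sketch. It is not calibre's code: the line lengths are invented, and treating "at or above the median" as the test for a hard-wrapped line is a simplifying assumption.

```python
from statistics import median

# Hypothetical line lengths (in characters).
# "uniform": hard breaks all fall near column 60, plus one short title.
uniform = [60, 61, 60, 62, 60, 22, 61]
# "variable": some lines broken near 60, many long unbroken paragraphs.
variable = [58, 60, 310, 59, 420, 275, 390, 61, 505]

def wrapped_candidates(lengths):
    """Return the median and the lengths treated as hard-wrapped lines."""
    cutoff = median(lengths)
    return cutoff, [n for n in lengths if n >= cutoff]

print(wrapped_candidates(uniform))
print(wrapped_candidates(variable))
```

In the uniform case every line except the short title counts as a wrap candidate. In the variable case the median lands among the long paragraphs, so none of the lines broken around 60 characters qualify.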

Perkin · 09-13-2010, 01:57 PM

If it helps, I fix the remaining ones in Sigil by doing an S&R:
search '([a-z])</p>## <p>' replace '\1 ' (## = two CRLFs followed by two spaces; the replacement ends with a space as well).
And a similar one for lines ending in a comma.
(And do a search for ones that end in a semicolon or colon, just as a check)

I don't know if the preprocessing code can include a regex, but that sort of thing may make it more complete.
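
As a rough illustration of that search and replace outside Sigil, here is a Python sketch. It assumes the code view separates paragraphs with a closing </p>, two CRLFs, two spaces and an opening <p>; the sample fragment is made up.

```python
import re

# Hypothetical fragment as it might appear in code view.
html = (
    "<p>He walked to the</p>\r\n\r\n"
    "  <p>door and stopped.</p>\r\n\r\n"
    "  <p>Nobody was there.</p>"
)

# Join two paragraphs only where the first one ends in a lower-case letter,
# i.e. where the break most likely fell in the middle of a sentence.
joined = re.sub(r"([a-z])</p>\r\n\r\n  <p>", r"\1 ", html)

print(joined)
```

The first break is removed because "the" ends in a lower-case letter. The second stays because the paragraph before it ends in a full stop.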

ldolse · 09-18-2010, 03:23 AM

The preprocess function basically does the regex you're suggesting, but it analyzes the document and gets the median line length, and ties this to the unwrapping function. This is to prevent things like lists, poetry, titles, etc from being unwrapped.

The problem is the original implementation assumed that if there were hard breaks in the document they would be universal across the file. Reality is that many files have a lot of variability, so this pushes the median line length longer than the typical broken line.

Preprocessing now lets you specify the aggressiveness of the line unwrapping - i.e. make the line length cut-off shorter. It's the line unwrapping factor under structure detection. If you get a chance give it a shot and see if it solves your original problem with unwrapping. Note you may need to set it down to 0.1 or 0.2 if there is a huge amount of variability in line length.

edit - single digit chapters are covered now as well.

Last edited by ldolse; 09-18-2010 at 03:25 AM.
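
A minimal Python sketch of how such a factor could shorten the cut-off. This is not the preprocessing code itself: scaling the median by the factor, the punctuation check and the sample text are all assumptions made for illustration.

```python
from statistics import median

def unwrap(lines, factor=1.0):
    """Join a line to the next when it is long enough and lacks end punctuation.

    The cut-off is (median line length * factor); this scaling is an assumption.
    """
    cutoff = median(len(line) for line in lines) * factor
    out, buf = [], ""
    for line in lines:
        buf = f"{buf} {line}" if buf else line
        broken = len(line) >= cutoff and not line.endswith((".", "!", "?", '"', "'"))
        if not broken:
            out.append(buf)
            buf = ""
    if buf:
        out.append(buf)
    return out

# Hypothetical document: one hard-broken sentence among long paragraphs.
long_para = ("She read the letter twice before folding it away. " * 4).strip()
lines = ["The rain kept falling on the", "empty street below.",
         long_para, long_para, long_para]

print(len(unwrap(lines, 1.0)))
print(unwrap(lines, 0.1)[0])
```

With a factor of 1.0 the cut-off sits at the length of the long paragraphs, so the broken sentence is left alone. At 0.1 the cut-off drops below the length of the broken line and the two halves are joined.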

Perkin · 09-20-2010, 02:51 PM

ldolse, I've been able to test and found that the new wrapping/detection is better.
Thanks

There's one small thing I've noticed it isn't catching: when one line ends in a quote and the next line begins with something like 'he said.'
e.g.

Quote:

'That was nearly perfect ldolse.'
he said.

Quickly fixed in Sigil with S&R '</p>(.)(.) <p>([a-z])' -> ' \3'
Or something similar depending on paragraph, and spacing etc.

ldolse · 09-20-2010, 05:44 PM

Glad to hear it's working better for you.

That scenario isn't covered on purpose. That's because if the line happens to break on the quote you can't create a simple regex that can differentiate between these two scenarios:

Quote:

'That was nearly perfect ldolse.'
he said.

and:

Quote:

He said 'That was nearly perfect ldolse.'
This is the start of a new paragraph.

I've been thinking of adding an enhancement to preprocessing where the user can specify that there is a blank line between every paragraph, or that most/every paragraph is indented, so that at least this scenario can be covered when the document provides that much differentiation.

Perkin · 09-20-2010, 06:20 PM

Thanks to you it's miles easier now.
To rectify those missed wraps, I use the search '</p>(.)(.) <p>([a-z])' and replace ' \3' in Sigil, with match case + minimal matching. That wraps them (and any other line that begins with a lower case letter).

Can anything similar not be used in the pre-process code?
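
For reference, a Python sketch of that kind of search applied to the two scenarios quoted earlier. The markup around the paragraphs (two CRLFs and two spaces between </p> and <p>) and the function name are assumptions, and the pattern is an adaptation, not the exact Sigil expression.

```python
import re

# Hypothetical code-view fragments for the two cases discussed in the thread.
case_a = "<p>'That was nearly perfect.'</p>\r\n\r\n  <p>he said.</p>"
case_b = (
    "<p>He said 'That was nearly perfect.'</p>\r\n\r\n"
    "  <p>This is the start of a new paragraph.</p>"
)

# Join two paragraphs only when the second starts with a lower-case letter.
pattern = re.compile(r"</p>\r\n\r\n  <p>([a-z])")

def join_lowercase_starts(html):
    """Replace the paragraph break with a space, keeping the matched letter."""
    return pattern.sub(r" \1", html)

print(join_lowercase_starts(case_a))
print(join_lowercase_starts(case_b))
```

The first case is joined and the second is left alone, but only because the second paragraph happens to start with a capital. A continuation that begins with a name would be missed, and a genuine new paragraph that begins with a lower-case letter would be merged, so this is probably safer as a manual pass than as an automatic rule.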
