Fixing broken sentences.

Vanguard3000 · 01-06-2011, 01:35 AM

Hi, all. I've found out through a few other threads how to fix broken sentences left by conversions from PDF to ePub formats. Currently, I'm using:

Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1_

(The _ being a space)

I was wondering if there was a way to add something to skip over breaks where the first letter of the second line is a capital?

For example, I'd like to find this:

...blahblah</p>

<p class="calibre2">blahblah...

But not this:

...blahblah</p>

<p class="calibre2">Blahblah...

Basically, this would help me a lot while trying to fix things like scripts or screenplays, or books with multi-line chapter titles, such as:

CHAPTER 6: The Plot Thickens
Ottawa

Any help would be much appreciated. Thanks in advance.

kiwidude · 01-06-2011, 01:44 AM

Quote:

Originally Posted by Vanguard3000

Hi, all. I've found out through a few other threads how to fix broken sentences left by conversions from PDF to ePub formats. Currently, I'm using:

Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1_

(The _ being a space)

I was wondering if there was a way to add something to skip over breaks where the first letter of the second line is a capital?

For example, I'd like to find this:

...blahblah</p>

<p class="calibre2">blahblah...

But not this:

...blahblah</p>

<p class="calibre2">Blahblah...

Basically, this would help me a lot while trying to fix things like scripts or screenplays, or books with multi-line chapter titles, such as:

CHAPTER 6: The Plot Thickens
Ottawa

Any help would be much appreciated. Thanks in advance.

No problem to do. Do this:
Find: ([a-z])</p>\s+<p class="calibre2">([a-z])
Replace: \1_\2
(Replacing underscore with a space). With matching case turned on of course

Note you may also want to join sentences ending in commas, colons, etc. That is why some of the other expressions in threads here are more complex than just looking for paragraphs ending with [a-z].
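For anyone who wants to try the idea outside Sigil first, here is a minimal sketch in Python. It assumes Calibre-style markup with class "calibre2" (yours may differ) and uses Python's re module, which is case-sensitive by default, so it behaves like "matching case turned on". The function name join_broken and the sample text are made up for illustration.

```python
import re

# Assumption: Calibre-style paragraphs with class "calibre2".
# Both sides of the break must be lowercase letters for a join to happen.
BREAK = re.compile(r'([a-z])</p>\s+<p class="calibre2">([a-z])')

def join_broken(html: str) -> str:
    """Join two paragraphs when the break sits between two lowercase letters."""
    return BREAK.sub(r"\1 \2", html)

sample = (
    '<p class="calibre2">the quick brown</p>\n'
    '<p class="calibre2">fox jumped</p>\n'
    '<p class="calibre2">Ottawa</p>'
)
result = join_broken(sample)
print(result)
```

The first break is joined because "fox" starts with a lowercase letter; the break before "Ottawa" is left alone.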

cybmole · 01-06-2011, 02:25 AM

I have been using this a lot with good results. With some sources you have to inspect the code (via Sigil), as sometimes calibre2 has to be changed to calibre[some other number], and sometimes there is also a span class to search for,
e.g.

Code:

 </p>\s+<p class="calibre2"><span class="none">([a-z])

Jellby · 01-06-2011, 04:05 AM

I don't know if the regexp in Sigil allows something like "[^A-Z]" to match anything but an uppercase letter (which would match lowercase letters, as well as quote marks, parentheses, dashes...).
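A quick way to see what a negated class would change, sketched with Python's re module rather than Sigil (whether Sigil's engine accepts the same syntax is not something this confirms). The class name "calibre2" and the sample text are assumptions.

```python
import re

# Assumption: Calibre-style paragraphs with class "calibre2".
LOWER_ONLY = re.compile(r'([a-z])</p>\s+<p class="calibre2">([a-z])')
NOT_UPPER = re.compile(r'([a-z])</p>\s+<p class="calibre2">([^A-Z])')

# The second paragraph starts with a quote mark, not a letter.
sample = 'and then he said</p>\n<p class="calibre2">"maybe" and left</p>'

lower_only = LOWER_ONLY.sub(r"\1 \2", sample)  # quote mark is not in [a-z]: no join
not_upper = NOT_UPPER.sub(r"\1 \2", sample)    # quote mark is in [^A-Z]: joined

print(lower_only)
print(not_upper)
```

One thing to watch: [^A-Z] also matches "<", so a paragraph that starts with an inline tag such as a span would be joined as well.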

theducks · 01-06-2011, 10:26 AM

My un-wrap line Regex

([\w",])</p>\s+<p class="calibre2">([\w"“…])

\1 \2

Letters, commas, (curly) quotes

Not Perfect

Code:

ask
 Samuel if

This should not catch a chapter heading, but it might get (I am not a writer) stuff that is in between the heading and first paragraph.

cybmole · 01-07-2011, 05:15 AM

is it safe to wild card the calibre2 bit ?

e.g. would this work ?

([\w",])</p>\s+<p class="calibre\d+">([\w"“…])

or will that cause it to mess with titles & chapter headers ?

I see that different books have different class names. Some do not even have calibre+digit(s); they have a different naming structure, e.g. I have seen class="MsoPlainText", so maybe find
([\w",])</p>\s+<p class="[A-Za-z2-9]+">([\w"“…])

that will exclude calibre1 ?

on a related issue, I have a book with far too much space between chapter header & start of text.

the code uses 3 consecutive instances of
<p class="MsoPlainText"> </p>

how do I test for 3 consecutive instances of that line, and replace with only 1, or maybe 2 instances ?
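On the class-name question above, a rough Python sketch of what loosening the class does. The three patterns differ only in the class part; the names STRICT, CALIBRE_ANY and ANY_CLASS and the sample markup (including the heading class "calibre5") are invented for the example, and Python's re module stands in for Sigil's engine.

```python
import re

# theducks' pattern with a fixed class, then two looser variants (assumptions).
STRICT = re.compile(r'([\w",])</p>\s+<p class="calibre2">([\w"“…])')
CALIBRE_ANY = re.compile(r'([\w",])</p>\s+<p class="calibre\d+">([\w"“…])')
ANY_CLASS = re.compile(r'([\w",])</p>\s+<p class="[^"]+">([\w"“…])')

# A paragraph that ends without punctuation, followed by a heading that
# uses a different class.
sample = (
    '<p class="calibre2">It was late,</p>\n'
    '<p class="calibre2">and raining</p>\n'
    '<p class="calibre5">CHAPTER 7</p>'
)

strict_hits = len(STRICT.findall(sample))
calibre_hits = len(CALIBRE_ANY.findall(sample))
any_hits = len(ANY_CLASS.findall(sample))
print(strict_hits, calibre_hits, any_hits)
```

With the fixed class only the real broken sentence matches. Both looser variants also match the break in front of the heading, because \w accepts uppercase letters, so a wildcarded class can pull a heading into the previous paragraph.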

kiwidude · 01-07-2011, 06:31 AM

You ask "will this work". Like any regular expression related to paragraph matching, there is always the possibility of edge cases it catches that you don't want it to. The "wider" you make your regex, the more likely that is to happen. If you intend to step through each find/replace one at a time so you can undo any that you don't want, then you can experiment with it. However, I find each and every document is different depending on how many times it has been converted in the past, manual editing, what its original format was, what settings/program was used to convert it along the way, etc. So long as you don't expect to stumble onto the holy grail of regexes that fixes all the problems for every document... it doesn't exist.

In answer to your second question, yes you can do it. Just paste the text you want to find three times separated by \s+. e.g.
<p class="MsoPlainText"> </p>\s+<p class="MsoPlainText"> </p>\s+<p class="MsoPlainText"> </p>

cybmole · 01-07-2011, 07:04 AM

Quote:

Originally Posted by kiwidude

In answer to your second question, yes you can do it. Just paste the text you want to find three times separated by \s+. e.g.
<p class="MsoPlainText"> </p>\s+<p class="MsoPlainText"> </p>\s+<p class="MsoPlainText"> </p>

OK - I thought there might be a way to bracket the expression to indicate the equivalent of x3?

Then I have to put the expression once only into replace - no shorthand for that?

PS I ask only because I am trying to learn shorthand expressions, not because it will save a lot of time.
The code you have given me has worked perfectly - thanks.

Jellby · 01-07-2011, 07:32 AM

Quote:

Originally Posted by cybmole

OK - I thought there might be a way to bracket the expression to indicate the equivalent of x3?

You probably can use: (***){3}
where *** stands for whatever expression you want matched 3 times, but you have to take newlines into account. Have a look here

cybmole · 01-07-2011, 07:48 AM

thinking it through - the annoying section structure is ABABA where A is the expression, B is the line feed stuff.
so if I find (AB){2} and replace with nothing, I should end up with one instance only of expression A.

I already fixed up the text using your long hand version though so I cannot easily test that now.
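Since the original text was already fixed by hand, the (AB){2} idea can at least be checked on a made-up sample. This is a sketch in Python's re module, not Sigil; the spacer paragraph is copied as it appears in the post, and the heading and body lines are invented.

```python
import re

# A = the spacer paragraph (as posted), B = the whitespace between copies.
SPACER = '<p class="MsoPlainText"> </p>'
pattern = re.compile('(?:' + re.escape(SPACER) + r'\s+){2}')

# Structure A B A B A, sitting between a heading and the first paragraph.
text = (
    '<h2>Chapter 1</h2>\n'
    + '\n'.join([SPACER] * 3)
    + '\n<p class="MsoPlainText">First line</p>'
)

# Deleting the first two "AB" pairs leaves a single A behind.
collapsed = pattern.sub('', text)
print(collapsed)
```

On this sample the reasoning holds: two spacer-plus-whitespace pairs are removed and one spacer remains.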

kiwidude · 01-07-2011, 04:13 PM

Quote:

Originally Posted by cybmole

thinking it through - the annoying section structure is ABABA where A is the expression, B is the line feed stuff.
so if I find (AB){2} and replace with nothing, I should end up with one instance only of expression A.

I already fixed up the text using your long hand version though so I cannot easily test that now.

To be honest I just was lazy and gave the expression that I use - less regex syntax to remember as you do it. Just copy the three lines, paste into the find dialog and replace the two gaps with \s+ and you are done.

However yes I believe you could also do something like
(<p class="MsoPlainText"> </p>\s+){3}

In this case you don't have to worry about ABABAB because Sigil reformats the document anyway, so it does not matter if the last B (the spaces rendered as a newline in code view) gets replaced.
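A sketch of the bracketed form, again in Python's re module as a stand-in for Sigil. It also touches the earlier question about a shorthand on the replace side: in Python a repeated capturing group keeps its last repetition, so \1 puts back exactly one copy. Whether Sigil's engine treats \1 the same way is an assumption to test on a copy of the book. The sample markup is invented.

```python
import re

SPACER = '<p class="MsoPlainText"> </p>'  # copied as shown in the post

# Three copies, each followed by whitespace (A B A B A B).
text = (
    '<h2>Chapter 1</h2>\n'
    + (SPACER + '\n') * 3
    + '<p class="MsoPlainText">First line</p>'
)

# The group is repeated three times; \1 holds only the last repetition.
triple = re.compile('(' + re.escape(SPACER) + r'\s+){3}')
collapsed = triple.sub(r'\1', text)
print(collapsed)
```

Note that this form needs whitespace after the third copy (the "last B"); if the next tag follows the third spacer with no gap at all, the pattern will not match.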

Vanguard3000 · 01-09-2011, 02:14 AM

Awesome. Using a variety of these seems to be working well for me. Thanks a million, guys.

cybmole · 01-13-2011, 05:23 PM

Quote:

Originally Posted by theducks

My un-wrap line Regex

([\w",])</p>\s+<p class="calibre2">([\w"“…])

\1 \2

Letters, commas, (curly) quotes

Not Perfect

Code:

ask
 Samuel if

This should not catch a chapter heading, but it might get (I am not a writer) stuff that is in between the heading and first paragraph.

I have fixed up several more books & finally realised that all I should be testing is whether a "line" ends as a well-formed sentence, i.e. with a full stop, a quote, or an exclamation mark.
Anything that does not should not be followed by a </p>
Previously I'd been looking for lines that began mid-sentence, i.e. that began with a lower case letter, but really there is no need to test the 1st character of the next line; just test the previous "line" end to determine if it is a true "end".

so I am now getting good results with this
find
([Ia-z,])</p>\s*<p>
replace with \1 plus a single space

which bypasses the calibre tags issue. I could expand the range to test for digits / capitalized words but have not yet needed to.

kiwidude · 01-13-2011, 06:24 PM

Quote:

Originally Posted by cybmole

so I am now getting good results with this
find
([Ia-z,])</p>\s*<p>
replace with \1 plus a single space

which bypasses the calibre tags issue. I could expand the range to test for digits / capitalized words but have not yet needed to.

I think your post of the regex got rather mangled? Searching for <p> at the end of your regex will get you nothing on any document converted with Calibre as there is no handling of the class on the tag. And there is no replace expression displayed.

The theory of what you say is indeed what the OP on this thread was doing with their first post. However as has been mentioned before there are other "line endings" you would need to test for such as punctuation characters (colons, semi-colons, hyphens), numeric amounts etc. Your regex also wouldn't include uppercase words, foreign language characters and so on.

Also unless you step through each one then if your book includes poems laid out they will get trashed.

Expressions earlier in this thread and in others similar can improve readability of most of the paragraphs. However imho I think people do need to be reminded that the expressions in this thread will not catch "every" situation nor should they just blindly do "Replace All" because they saw a regex in a thread that someone said worked for them.
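To make the "don't blindly Replace All" point concrete, here is a sketch of a dry run in Python: it lists every place a pattern would join, with a little context on each side, and changes nothing. The pattern is the end-of-line test from above, loosened (as an assumption) to accept any attributes on the opening tag; the function name preview and the sample text are made up.

```python
import re
from typing import Iterator, Tuple

# End-of-"line" test: paragraph ends in I, a lowercase letter or a comma.
# Assumption: the next <p> may or may not carry attributes such as a class.
PATTERN = re.compile(r'([Ia-z,])</p>\s*<p(?:\s[^>]*)?>')

def preview(html: str, width: int = 20) -> Iterator[Tuple[str, str]]:
    """Yield (text before the break, text after it) for each candidate join."""
    for m in PATTERN.finditer(html):
        before = html[max(0, m.start() - width):m.start() + 1]
        after = html[m.end():m.end() + width]
        yield before, after

sample = (
    '<p>roses are red,</p>\n'
    '<p>violets are blue.</p>\n'
    '<p>He went to the</p>\n'
    '<p>shop.</p>'
)
candidates = list(preview(sample))
for before, after in candidates:
    print(repr(before), '->', repr(after))
```

The list shows two candidates: the genuinely broken sentence and the first line of the verse, which is exactly the kind of match you would want to skip by hand.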

cybmole · 01-14-2011, 02:44 AM

Quote:

Originally Posted by kiwidude

I think your post of the regex got rather mangled? Searching for <p> at the end of your regex will get you nothing on any document converted with Calibre as there is no handling of the class on the tag. And there is no replace expression displayed.

The theory of what you say is indeed what the OP on this thread was doing with their first post. ....

yes - whoops - I was working with a non calibre-processed source.
otherwise the expression should be something like
([Ia-z,])\s*
replace with
\1
trailing space after \1
Points taken about poems, & about blindly applying - I usually do a few find-replace cycles before hitting replace all, & if I do screw up I close Sigil, discarding all changes, & start over.

