Old 01-06-2011, 01:35 AM   #1
Vanguard3000
Groupie
Posts: 152
Karma: 474196
Join Date: Jan 2011
Location: Ottawa
Device: Kobo Aura H2O
Fixing broken sentences.

Hi, all. I've found out through a few other threads how to fix broken sentences left by conversions from PDF to ePub formats. Currently, I'm using:

Find: ([a-z])</p>\s+<p class="calibre2">
Replace: \1_

(The _ being a space)

I was wondering if there was a way to add something to skip over breaks where the first letter of the second line is a capital?

For example, I'd like to find this:

...blahblah</p>

<p class="calibre2">blahblah...

But not this:

...blahblah</p>

<p class="calibre2">Blahblah...

Basically, this would help me a lot while trying to fix things like scripts or screenplays, or books with multi-line chapter titles, such as:

CHAPTER 6: The Plot Thickens
Ottawa

Any help would be much appreciated. Thanks in advance.
Old 01-06-2011, 01:44 AM   #2
kiwidude
Calibre Plugins Developer
Posts: 4,678
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
No problem. Use this:
Find: ([a-z])</p>\s+<p class="calibre2">([a-z])
Replace: \1_\2
(Replace the underscore with a space.) Match case must be turned on, of course; otherwise [a-z] also matches capitals.

Note that you may also want to join sentences ending in commas, colons, etc. That is why some of the other expressions in threads here are more complex than just looking for paragraphs ending with [a-z].
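
For anyone who wants to try the expression outside Sigil first, here is a minimal sketch using Python's re module. This is an assumption for illustration only: Sigil's own regex engine may behave differently in places, and the sample markup and the name join_broken are made up.

```python
import re

# Hypothetical sample of PDF-converted markup; only the class name comes from the thread.
SAMPLE = (
    '<p class="calibre2">the sentence was</p>\n\n'
    '<p class="calibre2">broken here.</p>\n\n'
    '<p class="calibre2">CHAPTER 6: The Plot Thickens</p>\n\n'
    '<p class="calibre2">Ottawa</p>'
)

# Lowercase letter, paragraph break, lowercase letter.
# No re.IGNORECASE flag is passed, which corresponds to "match case" being on.
BROKEN = re.compile(r'([a-z])</p>\s+<p class="calibre2">([a-z])')


def join_broken(html: str) -> str:
    """Join two paragraphs when the break falls between lowercase letters."""
    return BROKEN.sub(r'\1 \2', html)


fixed = join_broken(SAMPLE)
print(fixed)
```

The broken sentence is joined, while the two-line chapter title is left alone because "Ottawa" starts with a capital.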

Last edited by kiwidude; 01-06-2011 at 01:48 AM.
Old 01-06-2011, 02:25 AM   #3
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
I have been using this a lot with good results. With some sources you have to inspect the code (via Sigil), as sometimes calibre2 has to be changed to calibre[some other number], and sometimes there is also a span with a class to search for,
e.g.
Code:
 </p>\s+<p class="calibre2"><span class="none">([a-z])
Old 01-06-2011, 04:05 AM   #4
Jellby
frumious Bandersnatch
Posts: 7,534
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
I don't know if the regexp in Sigil allows something like "[^A-Z]" to match anything but an uppercase letter (which would match lowercase letters, as well as quote marks, parentheses, dashes...).
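
As a rough illustration of the idea, here is a sketch in Python's re module, where negated character classes do work. Whether Sigil accepts the same syntax is not confirmed here, and the sample strings and the name join_unless_capital are invented for the example.

```python
import re

# Anything except an uppercase ASCII letter may start the second paragraph.
# Note that [^A-Z] also matches '<', so a paragraph that opens with a tag
# such as <span> would be joined as well.
NOT_UPPER = re.compile(r'([a-z])</p>\s+<p class="calibre2">([^A-Z])')


def join_unless_capital(html: str) -> str:
    """Join paragraphs unless the second one starts with a capital letter."""
    return NOT_UPPER.sub(r'\1 \2', html)


quoted = 'he said</p>\n<p class="calibre2">"hello" and left'
title = 'The Plot Thickens</p>\n<p class="calibre2">Ottawa'

print(join_unless_capital(quoted))
print(join_unless_capital(title))
```

The first sample is joined because the second paragraph starts with a quote mark; the second is left untouched.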
Old 01-06-2011, 10:26 AM   #5
theducks
Well trained by Cats
Posts: 30,405
Karma: 58055234
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
My un-wrap line regex:

Find: ([\w",])</p>\s+<p class="calibre2">([\w"“…])

Replace: \1 \2

It handles letters, commas, and (curly) quotes.

Not perfect:
Code:
ask
 Samuel if
This should not catch a chapter heading, but it might catch (I am not a writer) the material that sits between the heading and the first paragraph.
Old 01-07-2011, 05:15 AM   #6
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Is it safe to wildcard the calibre2 bit?

E.g. would this work?

([\w",])</p>\s+<p class="calibre\d+">([\w"“…])

Or will that cause it to mess with titles and chapter headers?

I see that different books have different class names. Some do not even use calibre plus digit(s); they have a different naming structure, e.g. I have seen class="MsoPlainText". So maybe find:
([\w",])</p>\s+<p class="[A-Za-z2-9]*">([\w"“…])

Will that exclude calibre1?
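
One way to check which class names the character class accepts is to test it in isolation. This is only a sketch in Python's re module (an assumption; Sigil's engine may differ), and the list of class names is hypothetical.

```python
import re

# The class-name part of the proposed pattern, tested on its own.
CLASS_PATTERN = re.compile(r'[A-Za-z2-9]*')


def class_is_matched(name: str) -> bool:
    """True if the whole class name consists only of letters and digits 2-9."""
    return CLASS_PATTERN.fullmatch(name) is not None


# Hypothetical class names of the kind seen in converted books.
for name in ['calibre1', 'calibre2', 'calibre10', 'calibre12', 'MsoPlainText']:
    print(name, class_is_matched(name))
```

So calibre1 is excluded, but so is any class name containing a 0 or a 1, such as calibre10 or calibre12.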

On a related issue, I have a book with far too much space between the chapter header and the start of the text.

The code uses 3 consecutive instances of
<p class="MsoPlainText">&nbsp;</p>

How do I test for 3 consecutive instances of that line and replace them with only 1, or maybe 2, instances?

Last edited by cybmole; 01-07-2011 at 05:32 AM.
Old 01-07-2011, 06:31 AM   #7
kiwidude
Calibre Plugins Developer
Posts: 4,678
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
You ask "will this work". As with any regular expression for paragraph matching, there is always the possibility of edge cases that it catches and you don't want it to. The "wider" you make your regex, the more likely that is to happen. If you intend to step through each find/replace one at a time so you can undo any that you don't want, then you can experiment with it. However, I find each and every document is different, depending on how many times it has been converted in the past, any manual editing, what its original format was, what settings/program were used to convert it along the way, etc. Just don't expect to stumble onto the holy grail of regexes that fixes all the problems for every document... it doesn't exist.

In answer to your second question, yes you can do it. Just paste the text you want to find three times, separated by \s+, e.g.
<p class="MsoPlainText">&nbsp;</p>\s+<p class="MsoPlainText">&nbsp;</p>\s+<p class="MsoPlainText">&nbsp;</p>
Old 01-07-2011, 07:04 AM   #8
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by kiwidude View Post
In answer to your second question, yes you can do it. Just paste the text you want to find three times separated by \s+. e.g.
<p class="MsoPlainText">&nbsp;</p>\s+<p class="MsoPlainText">&nbsp;</p>
\s+<p class="MsoPlainText">&nbsp;</p>
OK - I thought there might be a way to bracket the expression to indicate the equivalent of x3?

Then I have to put the expression only once into replace - no shorthand for that?

PS I ask only because I am trying to learn shorthand expressions, not because it will save a lot of time.
The code you have given me has worked perfectly - thanks.
Old 01-07-2011, 07:32 AM   #9
Jellby
frumious Bandersnatch
Posts: 7,534
Karma: 19000001
Join Date: Jan 2008
Location: Spaniard in Sweden
Device: Cybook Orizon, Kobo Aura
Quote:
Originally Posted by cybmole View Post
OK - I thought there might be a way to bracket the expression to indicate the equivalent of x3?
You probably can use: (***){3}
where *** stands for whatever expression you want matched 3 times, but you have to take newlines into account. Have a look here
Old 01-07-2011, 07:48 AM   #10
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Thinking it through - the annoying section structure is ABABA, where A is the expression and B is the line feed stuff.
So if I find (AB){2} and replace with nothing, I should end up with only one instance of expression A.

I already fixed up the text using your longhand version, though, so I cannot easily test that now.
Old 01-07-2011, 04:13 PM   #11
kiwidude
Calibre Plugins Developer
Posts: 4,678
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by cybmole View Post
Thinking it through - the annoying section structure is ABABA, where A is the expression and B is the line feed stuff.
So if I find (AB){2} and replace with nothing, I should end up with only one instance of expression A.

I already fixed up the text using your longhand version, though, so I cannot easily test that now.
To be honest I was just lazy and gave the expression that I use - less regex syntax to remember as you do it. Just copy the three lines, paste them into the find dialog, replace the two gaps with \s+ and you are done.

However, yes, I believe you could also do something like
(<p class="MsoPlainText">&nbsp;</p>\s+){3}

In this case you don't have to worry about ABABAB, because Sigil reformats the document anyway, so it does not matter if the last B (the spaces rendered as a newline in code view) gets replaced.
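
The {3} shorthand can be sketched like this in Python's re module. This is an assumption for illustration; the sample page and the name collapse_triples are made up, and Sigil's engine may differ.

```python
import re

BLANK = '<p class="MsoPlainText">&nbsp;</p>'

# re.escape guards against any regex metacharacters in the literal markup.
# The group (blank paragraph plus trailing whitespace) must repeat exactly 3 times.
TRIPLE = re.compile(r'(?:' + re.escape(BLANK) + r'\s+){3}')


def collapse_triples(html: str) -> str:
    """Replace each run of three blank paragraphs with a single one."""
    return TRIPLE.sub(BLANK + '\n', html)


# Hypothetical chapter opening with three spacer paragraphs.
page = (
    '<h2>Chapter 1</h2>\n'
    + (BLANK + '\n') * 3
    + '<p class="MsoPlainText">It began.</p>'
)
collapsed = collapse_triples(page)
print(collapsed)
```

Because the trailing whitespace is part of the repeated group, the replacement puts one newline back after the single remaining spacer.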
Old 01-09-2011, 02:14 AM   #12
Vanguard3000
Groupie
Posts: 152
Karma: 474196
Join Date: Jan 2011
Location: Ottawa
Device: Kobo Aura H2O
Awesome. Using a variety of these seems to be working well for me. Thanks a million, guys.
Old 01-13-2011, 05:23 PM   #13
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by theducks View Post
My un-wrap line Regex

([\w",])</p>\s+<p class="calibre2">([\w"“…])

\1 \2

Letters Commas, (curly) Quotes

Not Perfect
Code:
ask
 Samuel if
This should not catch a chapter heading, but it might get (I am not a writer , ) stuff that is in between the heading and first paragraph.
I have fixed up several more books and finally realised that all I should be testing is whether a "line" ends as a well-formed sentence, i.e. with a full stop, a quote, or an exclamation mark.
Anything that does not should not be followed by a </p>.
Previously I'd been looking for lines that began mid-sentence, i.e. that began with a lower-case letter, but really there is no need to test the first character of the next line; just test the end of the previous "line" to determine whether it is a true "end".

So I am now getting good results with this:
Find:
([Ia-z,])</p>\s*<p>
Replace with: \1 plus a single space

which bypasses the calibre tags issue.

I could expand the range to test for digits and capitalized words but have not yet needed to.
Old 01-13-2011, 06:24 PM   #14
kiwidude
Calibre Plugins Developer
Posts: 4,678
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
Quote:
Originally Posted by cybmole View Post
So I am now getting good results with this:
Find:
([Ia-z,])</p>\s*<p>
Replace with: \1 plus a single space

which bypasses the calibre tags issue.

I could expand the range to test for digits and capitalized words but have not yet needed to.
I think your post of the regex got rather mangled? Searching for <p> at the end of your regex will get you nothing on any document converted with Calibre, as it does not handle the class on the <p> tag. And there is no replace expression displayed.

The theory of what you say is indeed what the OP of this thread was doing in their first post. However, as has been mentioned before, there are other "line endings" you would need to test for, such as punctuation characters (colons, semi-colons, hyphens), numeric amounts, etc. Your regex also wouldn't include uppercase words, foreign-language characters and so on.

Also, unless you step through each match, any poems laid out line by line in your book will get trashed.

Expressions earlier in this thread and in similar ones can improve the readability of most paragraphs. However, I think people do need to be reminded that the expressions in this thread will not catch "every" situation, nor should they just blindly do "Replace All" because they saw a regex in a thread that someone said worked for them.
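
One way to avoid a blind "Replace All" is to list every candidate first and look at each in context. Here is a minimal sketch in Python's re module (an assumption; it is not a Sigil feature). The pattern is in the style of those earlier in the thread, and the name preview_matches and the verse sample are invented.

```python
import re

# A pattern in the style of those earlier in the thread; the class name is book-specific.
UNWRAP = re.compile(r'([a-z,])</p>\s+<p class="calibre2">([a-z])')


def preview_matches(html: str, context: int = 20):
    """Return (offset, snippet) pairs so each candidate join can be inspected."""
    previews = []
    for m in UNWRAP.finditer(html):
        start = max(0, m.start() - context)
        end = min(len(html), m.end() + context)
        previews.append((m.start(), html[start:end]))
    return previews


# Hypothetical verse that should NOT be joined into one line.
poem = (
    '<p class="calibre2">roses are red,</p>\n'
    '<p class="calibre2">violets are blue</p>'
)
for offset, snippet in preview_matches(poem):
    print(offset, repr(snippet))
```

The verse shows up as a candidate, which is exactly the kind of match you would want to skip by hand.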

Last edited by kiwidude; 01-13-2011 at 06:30 PM.
Old 01-14-2011, 02:44 AM   #15
cybmole
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
Quote:
Originally Posted by kiwidude View Post
I think your post of the regex got rather mangled? Searching for <p> at the end of your regex will get you nothing on any document converted with Calibre as there is no handling of the class on the <p> tag. And there is no replace expression displayed.

The theory of what you say is indeed what the OP on this thread was doing with their first post. ....
Yes - whoops - I was working with a non-Calibre-processed source.
Otherwise the expression should be something like
([Ia-z,])</p>\s*<p class="calibre2">
replace with
\1
(with a trailing space after \1)
Points taken about poems and about blindly applying - I usually do a few find/replace cycles before hitting Replace All, and if I do screw up I close Sigil, discarding all changes, and start over.