Old 03-02-2018, 11:52 AM   #1
G2B
Enthusiast
Posts: 28
Karma: 10
Join Date: Feb 2018
Device: PC / iPad
Regular expressions

I am experimenting with the search & replace in calibre.


Conversion of txt files often leaves too many line breaks, even with heuristic processing.

When the </p><p class="whatever"> line breaks have small letters on both sides, they can always be removed and substituted with a space.

The search string '[a-z]</p><p class="whatever">[a-z]' finds all of those breaks, but when I replace with '[a-z] [a-z]', the conversion also replaces the letters before and after the tags with the literal string '[a-z]', which is not what I intended.

I have to include '[a-z]' in the search string to find the correct line breaks. If I don't, I'll delete every line break in the book, which imo would make it pretty much unreadable.

Is there a way to include the letters before and after the line break in the search string, but exclude them from the substitution?

TIA.

Last edited by G2B; 03-02-2018 at 11:55 AM.
Old 03-02-2018, 12:02 PM   #2
sjfan
Addict
Posts: 281
Karma: 7724454
Join Date: Sep 2017
Location: Bethesda, MD, USA
Device: Kobo Aura H20, Kobo Clara HD
Put parens around parts of the search string to form groups. Use \N in the replace string (where N is the group number, e.g. \1 or \2) to insert the content of that group.

Search for:
Code:
([a-z])</p><p class="whatever">([a-z])
Replace with:
Code:
\1 \2
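
For anyone who wants to try the grouping idea outside calibre, here is a minimal sketch using Python's standard `re` module, which uses the same `\1` / `\2` backreference syntax. The sample HTML and the function name `join_paragraphs` are made up for illustration; they are not from calibre.

```python
import re

# Pattern from the post: a lowercase letter on each side of the paragraph
# break, each wrapped in parentheses so it becomes a numbered group.
pattern = re.compile(r'([a-z])</p><p class="whatever">([a-z])')

def join_paragraphs(html):
    # \1 and \2 put the captured letters back; only the tags between them
    # are replaced by a single space.
    return pattern.sub(r'\1 \2', html)

# Illustrative sample: the first break is inside a sentence, the second is not.
sample = ('<p class="whatever">the quick</p><p class="whatever">brown fox.</p>'
          '<p class="whatever">Next one.</p>')
result = join_paragraphs(sample)
print(result)
```

The break after "fox." is left alone because the character before `</p>` is a period, not a lowercase letter.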
Old 03-02-2018, 12:24 PM   #3
deback
Book E d i t o r
Posts: 432
Karma: 288184
Join Date: May 2015
Device: Laptop
Search for this:

([a-z])</p>\s+<p class="whatever">([a-z])

The \s+ in the middle matches the whitespace (such as a newline) between the closing </p> and the opening <p> tag. Without it, the search will not find breaks where the two tags sit on separate lines.

Replace with this:

\1 \2

You might also want to run a second search that drops the ([a-z]) at the end (and the \2 from the replacement), though, because you'll find there are a lot of lines that need to be joined where the second line doesn't start with a lowercase letter.
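
A small sketch of why the whitespace part matters, again in plain Python `re` rather than calibre itself. The sample text is invented; it simply puts the two tags on separate lines the way a converted file often does.

```python
import re

# Same search with and without the \s+ between the tags.
strict = re.compile(r'([a-z])</p><p class="whatever">([a-z])')
spaced = re.compile(r'([a-z])</p>\s+<p class="whatever">([a-z])')

# Illustrative sample: a newline separates </p> from the next <p>.
html = '<p class="whatever">the quick</p>\n<p class="whatever">brown fox</p>'

# Without \s+ the newline between the tags prevents any match.
strict_found = strict.search(html) is not None

# With \s+ the newline is consumed and the two lines are joined.
spaced_result = spaced.sub(r'\1 \2', html)
print(strict_found, spaced_result)
```

Note that `\s+` requires at least one whitespace character, so it will not match tags that touch directly; `\s*` covers both layouts.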

Last edited by deback; 03-09-2018 at 07:44 PM.
Old 03-02-2018, 08:23 PM   #4
gbm
Wizard
Posts: 2,097
Karma: 8796704
Join Date: Jun 2010
Device: Kobo Clara HD,Hisence Sero 7 Pro RIP, Nook STR, jetbook lite
I recommend using the editor for finding split paragraphs and line feeds. This is the base search and replace I use:

Search
Code:
</p>\s*<p[^>]+>([a-z])
replace:
Code:
 \1
Note the space at the beginning of the replace string.
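
As a rough illustration of what this search and replace does, here is the same pattern run through Python's `re` module. The class names `calibre1` and `calibre2` and the helper name `join_split_paragraphs` are placeholders chosen for the example.

```python
import re

# </p>, optional whitespace, then any <p ...> tag that has at least one
# attribute character, followed by a lowercase letter (captured as group 1).
pattern = re.compile(r'</p>\s*<p[^>]+>([a-z])')

def join_split_paragraphs(html):
    # The replacement starts with a space, so the joined words stay separated.
    return pattern.sub(r' \1', html)

# Illustrative sample with two different class names.
html = ('<p class="calibre1">It was a dark</p>\n'
        '<p class="calibre2">and stormy night.</p>\n'
        '<p class="calibre1">The end.</p>')
result = join_split_paragraphs(html)
print(result)
```

Because the tag is matched with `<p[^>]+>`, the class name does not have to be known in advance, but a bare `<p>` with no attributes would not match.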

For extra line feeds I use a regex function.

search:
Code:
<p class="(.*?)">(.*?)</p>|<div class="(.*?)">(.*?)</div>
Regex function:
Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    return match.group().replace('\n', ' ')

bernie
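
To see what the regex function does without opening calibre, the same idea can be sketched with a callback passed to Python's `re.sub`. This is only an approximation: the callback here takes just the match object, not the full argument list calibre passes, and `re.DOTALL` is an assumption so that `.*?` can span the line feeds being removed (in the editor, the equivalent dot-all setting would presumably need to be enabled).

```python
import re

# Same search as in the post. re.DOTALL is an assumption of this sketch:
# it lets .*? run across newlines inside a paragraph or div.
pattern = re.compile(
    r'<p class="(.*?)">(.*?)</p>|<div class="(.*?)">(.*?)</div>',
    re.DOTALL,
)

def replace(match):
    # Take the whole matched element and turn its line feeds into spaces.
    return match.group().replace('\n', ' ')

# Illustrative sample: line feeds inside the elements and one between them.
html = '<p class="a">line one\nline two</p>\n<div class="b">x\ny</div>'
result = pattern.sub(replace, html)
print(result)
```

Only line feeds inside a matched element are replaced; the one between `</p>` and `<div>` is outside any match and stays.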
Old 03-08-2018, 02:53 PM   #5
G2B

Thanks for the comment.
I tried that, and although your string only contains [a-z] (lowercase letters), the editor also takes me to line breaks where the next line starts with a capital letter [A-Z].

There seems to be something wrong with my calibre. Heuristic processing splits pages instead of removing line breaks. I have uninstalled and reinstalled, but the same problems persist.

Last edited by G2B; 03-08-2018 at 03:22 PM.
Old 03-08-2018, 04:07 PM   #6
theducks
Well trained by Cats
Posts: 30,062
Karma: 57259778
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
That happens because your search does not include anything that limits what the first line may end with.
I check for an alpha (or a comma), optionally followed by either kind of quote.

This still fails in a few cases, such as abbreviations like Mr., or where the alpha is upper case, like A.M.
I fix those few by hand.

(Sample clipped from Sigil's saved-search file, which includes additional escapes.)
Code:
70\Name=Cleanup/Joins/Join to lower
70\Find="([[:alpha:],][\"\x201d]*)</p>\\s*<p\\b[^>]*>([a-z\x201c\"])"
70\Replace=\\1 \\2
71\Name=Cleanup/Joins/Join to upper
71\Find="([[:alpha:],]\x201d*)</p>\\s*<p\\b[^>]*>([\"\x201c]*[A-Z])"
71\Replace=\\1 \\2
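
For readers not using Sigil, here is a hedged translation of the two saved searches into Python's `re` syntax, with the file-level escapes removed. Python's `re` does not support the POSIX class `[[:alpha:]]`, so this sketch substitutes `[A-Za-z]`, which is an ASCII-only approximation. The constant names and sample strings are invented for the example.

```python
import re

# "Join to lower": first line ends in a letter or comma, optionally followed
# by straight or curly closing quotes; next line starts with a lowercase
# letter or an opening quote. [A-Za-z] stands in for [[:alpha:]] (assumption).
JOIN_TO_LOWER = re.compile(
    r'([A-Za-z,]["\u201d]*)</p>\s*<p\b[^>]*>([a-z\u201c"])'
)

# "Join to upper": same ending, but the next line starts with optional
# opening quotes followed by an uppercase letter.
JOIN_TO_UPPER = re.compile(
    r'([A-Za-z,]\u201d*)</p>\s*<p\b[^>]*>(["\u201c]*[A-Z])'
)

# Illustrative samples.
lower_sample = '<p>He said,</p>\n<p class="x">\u201cstop\u201d now.</p>'
upper_sample = '<p>visited by</p>\n<p>Doctor Who.</p>'
ended_sample = '<p>The end.</p>\n<p>Next chapter</p>'

lower_result = JOIN_TO_LOWER.sub(r'\1 \2', lower_sample)
upper_result = JOIN_TO_UPPER.sub(r'\1 \2', upper_sample)
ended_result = JOIN_TO_UPPER.sub(r'\1 \2', ended_sample)
print(lower_result, upper_result, ended_result, sep='\n')
```

A paragraph that ends with a period is not joined by either pattern, which is the limit on the first line described above. The "Join to upper" search will still join a real paragraph break when the first paragraph happens to end without punctuation, so it is best stepped through match by match.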
Old 03-08-2018, 04:13 PM   #7
deback
Quote:
Originally Posted by G2B View Post
I tried that, and although your string only contains [a-z] (lowercase letters), the editor also takes me to line breaks where the next line starts with a capital letter [A-Z].
Check the Case Sensitive box, and then it will work correctly.
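
The effect of that checkbox can be imitated in Python's `re` module, where matching is case sensitive by default and `re.IGNORECASE` switches it off. Treating the flag as equivalent to leaving the box unchecked is an analogy, not a description of calibre's internals; the sample text is made up.

```python
import re

PATTERN = r'</p>\s*<p[^>]+>([a-z])'

# Illustrative sample: the second paragraph starts with a capital letter,
# so it is a real paragraph break that should NOT be joined.
html = '<p class="a">First sentence.</p>\n<p class="a">Second sentence.</p>'

# Case-insensitive search: [a-z] also matches "S", so the break is found.
insensitive = re.search(PATTERN, html, re.IGNORECASE)

# Case-sensitive search (Python's default): no match, the break is skipped.
sensitive = re.search(PATTERN, html)

print(insensitive.group(1) if insensitive else None, sensitive)
```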
Old 03-09-2018, 02:39 PM   #8
G2B
Quote:
Originally Posted by deback View Post
Check the Case Sensitive box, and then it will work correctly.
Oops. Thanks.