Saved Search/Regex Functions

eschwartz · 04-06-2014, 09:20 PM

If anyone has useful Saved Searches they would like to share, you can share them in this thread.

Generic rules to fix common problems, for example.
Or just anything clever and cool which you are proud of and want to admire.

NOTE: To make it easier to read, it would be nice if all Search & Replace fields were wrapped in the

[CODE]content goes here[/CODE]

tags.

Also, you can export the saved search as a .json, and upload it here in a zipped folder.

Moderator Notice
This thread has been made a sticky, and unlike most other sticky threads, this one is open to all who have a useful saved Search/Replace they wish to share. Do not use this thread to ask any questions. Start a new thread. Posts that don't belong here will be deleted or moved, but you are encouraged to post if you have something to share.

Please add a descriptive title to each post and explain what your Saved Search accomplishes.

mrmikel · 04-07-2014, 07:14 AM

Finding and joining broken paragraphs

([a-z])</p>whatever else is in the middle here<p>([a-z])

Replace with \1space\2

With case sensitive ticked.

Doesn't get absolutely everything, but can be used very quickly.

Potshots from people who actually know regex are welcome. I just guess and see if it works!
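For anyone who wants to try the idea outside the editor, here is a minimal sketch using Python's standard re module. It assumes that the part "in the middle" is a closing </p>, optional whitespace, and a plain opening <p>; the function name is made up for illustration.

```python
import re

# Assumption: only whitespace sits between the closing and opening tags.
BROKEN_PARAGRAPH = re.compile(r'([a-z])</p>\s*<p>([a-z])')

def join_broken_paragraphs(html):
    # \1 and \2 keep the letters on each side; the tags become one space.
    return BROKEN_PARAGRAPH.sub(r'\1 \2', html)

print(join_broken_paragraphs('<p>The cat sat on</p>\n<p>the mat.</p>'))
```

Python's re is case sensitive by default, which corresponds to ticking the case sensitive box.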

mrmikel · 04-07-2014, 07:18 AM

Uncapitalized letters after period and quote

\.” [a-z]

Case sensitive ticked.

No easy replace in this case, but at least you can find it. Remove the quote from the search to find uncapitalized sentences that have no closing quote.

arspr · 04-07-2014, 04:17 PM

Preventing line wraps around dashes in Spanish dialogues

As dashes are wrap points in HTML, dialogues in Spanish ebooks can look terrible.

Example in one line:

Code:

—Bla, Bla, Bla, —John said—. More bla, bla, bla.

Wrong:

Code:

—Bla, Bla, Bla, —John said
—. More bla, bla, bla.

Wrong:

Code:

—Bla, Bla, Bla, —
John said—. More 
bla, bla, bla.

Right:

Code:

—Bla, Bla, Bla, —John 
said—. More bla, bla, bla.

The following two searches add a <span> with a specified class around the word next to the dash. (In my example, just <span class="nw">.)

Then add the next CSS definition for this class:

Code:

.nw { white-space: nowrap; display: inline-block; text-indent: 0em;}

and you will have prevented the wrong wrapping in Spanish books.

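Edit notes. Explanation of the workaround for RMSDK:

Spoiler:

The CSS class above is a modification of my original one, which only included white-space: nowrap;. In the latest versions of RMSDK the white-space property has stopped working. (It worked, and still works, on my old Sony PRS-650.) In an ebook I was recently reading, I found that wrapping inside formulas was prevented by enclosing them in a <span> with display: inline-block; text-indent: 0em;. So I decided to add this method to my previous one, and it then also works in newer versions of RMSDK (Kobo Aura H2O with firmware 3.15.0, for example).

The no-wrapping effect is actually obtained through the display: inline-block; part. But if this protected <span> started a new line, it would inherit the text-indent value of its parent <p>. That is why the text-indent: 0em; setting is also added.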

First S&R
Search:

Code:

\x20(—|–|&mdash;|&ndash;)([^ <]+)( |</p>|</div>)

Replace:

Code:

\x20<span class="nw">\1\2</span>\3

Second S&R
Search:

Code:

\x20([^ >]+)(—|–|&mdash;|&ndash;)(\.|\.\.\.|,|;|:|…|&hellip;)?\x20

Replace:

Code:

\x20<span class="nw">\1\2\3</span>\x20
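The two searches can also be tried outside the editor. The sketch below applies them in order with Python's standard re module; the function name is made up, and a literal space is used in the replacement strings because Python's re does not accept \x20 in a replacement template.

```python
import re

# First S&R: opening dash plus the word that follows it.
FIRST = re.compile(r'\x20(—|–|&mdash;|&ndash;)([^ <]+)( |</p>|</div>)')
# Second S&R: word plus the closing dash and optional punctuation.
SECOND = re.compile(
    r'\x20([^ >]+)(—|–|&mdash;|&ndash;)(\.|\.\.\.|,|;|:|…|&hellip;)?\x20')

def protect_dashes(html):
    # Order matters: run the first search, then the second.
    html = FIRST.sub(r' <span class="nw">\1\2</span>\3', html)
    return SECOND.sub(r' <span class="nw">\1\2\3</span> ', html)

print(protect_dashes('<p>—Bla, bla, —John said—. More bla.</p>'))
```

Remember that the spans do nothing until the .nw rule is in the stylesheet.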

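Additional usage notes

Spoiler:

Yes, you need both S&R, and in that order.

Do not forget to set up the additional CSS style, or the searches are useless.

As you can see, they look for dashes and only dashes (as Unicode characters or as named entities). Some horribly formatted books use minus signs, which these searches won't catch.

The Case Sensitive and Dot All settings are probably irrelevant, but I have them OFF.

Because of the [^ <]+ and [^ >]+ parts of the searches, they are completely safe to use. I mean they won't catch and destroy code like:

Code:

—Bla, Bla, Bla, —<b>John</b> <i>said</i>—. More bla, bla, bla.

They will just ignore it. You will never get something wrong like:

Code:

—Bla, Bla, Bla, <span class="nw">—<b>John</span></b> <i><span class="nw">said</i>—.</span> More bla, bla, bla.

You'll have to fix this kind of situation manually.

Using them where dashes are used as sentence or word separators is also safe:

Code:

First sentence—Second sentence.

This situation, pretty common in English books, is also ignored.

As hinted in another thread, I've used \x20 for the starting and ending spaces needed in the regexes, in order to make them clearly visible.

Obviously there's no point in adding a <span> around the very first starting dash and word, and these searches don't do that.

A strange situation that I remember having found once or twice: if there is a CSS rule set directly on <span> tags, it will also be applied to the newly created tags. I remember suffering a

Code:

span {font-size: 1.3em;}

which I had to override with

Code:

.nw {font-size: 1em; white-space: nowrap;}

while keeping the rule where it was originally applied.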

Zajora · 05-21-2014, 10:46 AM

I have created a fair number of regex fixes. I make changes to them every so often, so I'll probably edit this post if I do. I'd use code tags, but they take up too much room. If a regex should be replaced with a space, it will say "space". If there is nothing, then (logically) it should be replaced with nothing.

Of course, there is no guarantee any of these will work properly. I always check them a number of times before doing replace all, since there are a TON of ways eBooks can have formatting that wrecks these regexes.

Scenario: Apostrophes have been replaced with double quotes
Match: (?<=\w)(“|”)(?=\w)
Replace: ’

Scenario: There is a linebreak in the middle of a character's dialogue
Match: (?<=“[^”]*)</p>\s*<p[^>]*>(?!“)
Replace: space

Scenario: A tag closes, is followed by 0+ spaces or newlines, is then reopened and is then followed by a lowercase letter
Match: </(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>(?=[a-z])
Match: (?<![".!?>*”“…~’])</(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>
Replace: space
Notes: The second one is an alternate, which I think is better, but I'm not 100% sure it covers all the cases of the former.

Scenario: "LL" Ligatures have been replaced with a single "L".
Match: (l (?=(y|s|ed|ey|ion|en|ar|ars|er|ow|et|owed|enge|age|enging|ected|egal|ections|ect|apse|ular|op|owing|ocks|ied|ier|ies|ing|ingly|ered|icit|est)(\W)))|(l (?![(–<-])(?=\W))|(?<=’)l(?=\W)|(?<= (wi|du|a|we|te|sma|ca|sti|fu|fa|chi|sha|wa|pha|se|bi|ha|ki|pu|ce|ba|ski|hi|fi|fe|he|ro|ta|i|sme|bri|sta|we))l(?=\W)
Replace: ll
Notes: This regex doesn't really work that well, but it's faster than doing it manually. I would recommend using the spellcheck afterwards and catching the most common ones. This regex is actually a bunch of individual ones chained together by ORs (|) so it's easier to see what's doing what.

Scenario: More than 1 space in a row
Match: (?<=\S) {2,}(?=\S)
Replace: space

Scenario: There are tags (which may be nested) that are either empty or just have a number in them
Match: (<[^/>]*>)+\s*\d*\s*(</[^>]*>)+
Replace:
Notes: This may remove things you'd like to keep, such as scenebreaks/whitespace, or the chapter links.

Scenario: There's a linebreak or spaces before a closing tag
Match: (?<![".!?>*”“…~’])</(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>
Replace:

For future use:

Scenario:
Match:
Replace:
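Two of the simpler fixes above, sketched in Python so they can be checked on sample text before running Replace All. The function names are made up for illustration. Note that the patterns with variable-width lookbehind, such as (?<=“[^”]*), are rejected by Python's standard re module, so only fixed-width ones are shown here.

```python
import re

def fix_apostrophes(text):
    # A curly double quote between two word characters was an apostrophe.
    return re.sub(r'(?<=\w)(“|”)(?=\w)', '’', text)

def collapse_spaces(text):
    # Two or more spaces between non-space characters become one space.
    return re.sub(r'(?<=\S) {2,}(?=\S)', ' ', text)

print(fix_apostrophes('don”t'))
print(collapse_spaces('too   many  spaces'))
```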

Section8 · 05-22-2014, 08:03 PM

I have a nook, and the only real regexes I've written are for fixing stylesheets to work around its margin bug: if "publisher defaults" are disabled, the nook doesn't handle the css "margin" setting. I've been using these to convert all 4 forms of "margin" to the equivalent margin-top, margin-right, etc. These were written for Sigil, but I *think* they work in the calibre editor.

First: find margin:
Find: margin *:

Convert margin: a (single value):
Find: margin *: *([^\s;]+)(\s*(;|}))
Replace: margin-top: \1; margin-right: \1; margin-bottom: \1; margin-left: \1\2

Convert margin: a b (2 values)
Find: margin *: *([^\s;]+) +([^\s;]+)([\s]*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \1; margin-left: \2\3

Convert margin: a b c (3 values)
Find: margin *: *([^\s;]+) +([^\s;]+) +([^\s;]+)([\s]*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \3; margin-left: \2\4

Convert margin: a b c d (4 values)
Find: margin *: *([^\s;]+) +([^\s;]+) +([^\s;]+) +([^\s;]+)(\s*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \3; margin-left: \4\5
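As a cross-check on the four conversions, here is a minimal Python sketch that expands the shorthand in one pass. It is not the Sigil/calibre searches themselves: the pattern and the function name are my own, and it assumes plain space-separated values with no comments or !important inside the declaration.

```python
import re

# Matches "margin: <values>" but not margin-top, margin-left, etc.
MARGIN = re.compile(r'(?<![\w-])margin\s*:\s*([^;}]+?)\s*(?=[;}])')

def expand_margin(css):
    def _expand(match):
        v = match.group(1).split()
        # CSS shorthand order is top, right, bottom, left.
        if len(v) == 1:
            top, right, bottom, left = v[0], v[0], v[0], v[0]
        elif len(v) == 2:
            top, right, bottom, left = v[0], v[1], v[0], v[1]
        elif len(v) == 3:
            top, right, bottom, left = v[0], v[1], v[2], v[1]
        elif len(v) == 4:
            top, right, bottom, left = v
        else:
            return match.group(0)  # leave anything unexpected untouched
        return ('margin-top: %s; margin-right: %s; '
                'margin-bottom: %s; margin-left: %s' % (top, right, bottom, left))
    return MARGIN.sub(_expand, css)

print(expand_margin('p { margin: 1em 2em; }'))
```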

user743 · 06-19-2014, 02:17 PM

Switch script links to HTML links.

Code:

<script>	AddIndex\("(.+?)", (".+?"), ".+?"\); </script>
<a href=\2>\1</a>

The first line is the search and the second is the replacement. Mode: regex, not case sensitive, dot all.
Change double quotes to single quotes if necessary.
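A quick way to check the search on a sample line is the Python sketch below. The function name and the sample AddIndex call are made up, and reading the first argument as the link text and the second as the target is an assumption taken from the replacement string. The whitespace around AddIndex is loosened to \s* here.

```python
import re

SCRIPT_LINK = re.compile(
    r'<script>\s*AddIndex\("(.+?)", (".+?"), ".+?"\);\s*</script>',
    re.IGNORECASE | re.DOTALL)  # not case sensitive, dot all

def script_to_anchor(html):
    # Group 2 still carries its double quotes, so href=\2 is quoted.
    return SCRIPT_LINK.sub(r'<a href=\2>\1</a>', html)

sample = '<script>\tAddIndex("Chapter 1", "ch1.html", "x"); </script>'
print(script_to_anchor(sample))
```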

dmonasse · 12-23-2014, 02:46 AM

The search and replace tool with regex functions is really fantastic. My small company builds mathematical ebooks from LaTeX sources. One of my problems when converting such books is that LaTeX automatically numbers chapters, sections, subsections and theorem-like assertions (theorems, propositions, lemmas, definitions, corollaries and so on). I would like to have the same numbering in my ebook.

A solution is the following:

1) When converting from LaTeX, I put chapters, sections, subsections and assertions in a <div> tag with an HTML5 data-type attribute. For example, a LaTeX section

Code:

\section{History of the Fermat-Wiles theorem}

is converted into

Code:

<div class="section" data-type="section">History of the Fermat-Wiles theorem</div>

and

Code:

\begin{theorem}Abracadabra\end{theorem}

is converted into

Code:

<div class="theorem" data-type="theorem">Abracadabra</div>

Note: I can't use the class attribute to denote the type of the div, because the conversion from HTML to ePub by Calibre modifies these attributes, and class="theorem" may be changed into class="pcalibre25". That's the reason for the data-type attribute.

2) After conversion from LaTeX to HTML (not so easy!) and from HTML to ePub (easy with Calibre), I number the whole book in the Calibre editor using the search and replace tool with a regex function.
The search pattern I use is:

Code:

<div.*?data-type="(chapter|section|subsection|theorem|proposition|lemma|definition|corollary)"[^>]*>

and the regex function may be:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if number==1: #initialization of the counts
        data['chapter']=0
        data['section']=0
        data['subsection']=0
        data['assertion']=0
    the_type=match.group(1)
    if the_type=='chapter': # begins a chapter, reinitialize the counts
        data['section']=0
        data['subsection']=0
        data['assertion']=0
        data['chapter']+=1
        return match.group()+"<span class='chapter_num'>Chapter "+str(data['chapter'])+".</span> "
    elif the_type=='section': # begins a section, reinitialize the subsection count
        data['subsection']=0
        data['section']+=1
        return match.group()+"<span class='section_num'>Section "+str(data['section'])+".</span>" 
    elif the_type=='subsection':
        data['subsection']+=1
        return match.group()+"<span class='subsection_num'>Subsection "+str(data['section'])+"."+str(data['subsection'])+".</span>"
    else: # this is an assertion
        data['assertion']+=1
        return match.group()+"<span class='assertion_num'>Assertion "+str(data['chapter'])+"."+str(data['assertion'])+".</span>"
    return ''

replace.file_order = 'spine'

Adapt the code to your needs; this is only an example. It would be nicer to replace "Assertion" with "Theorem", "Proposition", "Lemma", "Corollary" or "Definition", which is very easy to do starting from the the_type variable. I obtain the following numbering:

Code:

Chapter 1
     Section 1
         Subsection 1.1
             Assertion 1.1
             Assertion 1.2
         Subsection 1.2
             Assertion 1.3
     Section 2
         Subsection 2.1
             Assertion 1.4
             Assertion 1.5
         Subsection 2.2
             Assertion 1.6
Chapter 2
     Section 1
         Subsection 1.1
             Assertion 2.1
             Assertion 2.2
         Subsection 1.2
             Assertion 2.3
     Section 2
         Subsection 2.1
             Assertion 2.4
             Assertion 2.5

I hope this helps. Any improvement is welcome (including to my English).
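The "Theorem"/"Lemma" variant mentioned above could look like the sketch below. The replace function keeps the signature and counters from the post and only changes the label for assertions, using the_type.capitalize(). The number_html helper is hypothetical: it merely imitates the editor calling replace once per match with a running match number, so the function can be tried outside Calibre.

```python
import re

PATTERN = re.compile(
    r'<div.*?data-type="(chapter|section|subsection|theorem|proposition'
    r'|lemma|definition|corollary)"[^>]*>')

def replace(match, number, file_name, metadata, dictionaries, data,
            functions, *args, **kwargs):
    if number == 1:  # first match: initialize the counts
        data.update(chapter=0, section=0, subsection=0, assertion=0)
    the_type = match.group(1)
    css = the_type
    if the_type == 'chapter':  # a new chapter resets everything below it
        data.update(section=0, subsection=0, assertion=0)
        data['chapter'] += 1
        label = 'Chapter %d' % data['chapter']
    elif the_type == 'section':
        data['subsection'] = 0
        data['section'] += 1
        label = 'Section %d' % data['section']
    elif the_type == 'subsection':
        data['subsection'] += 1
        label = 'Subsection %d.%d' % (data['section'], data['subsection'])
    else:  # theorem-like: label with its own type instead of "Assertion"
        css = 'assertion'
        data['assertion'] += 1
        label = '%s %d.%d' % (the_type.capitalize(), data['chapter'],
                              data['assertion'])
    return "%s<span class='%s_num'>%s.</span> " % (match.group(), css, label)

replace.file_order = 'spine'

def number_html(html):
    # Hypothetical test harness, not part of Calibre.
    data, count = {}, [0]
    def _call(m):
        count[0] += 1
        return replace(m, count[0], 'sample.html', None, None, data, None)
    return PATTERN.sub(_call, html)
```

Only the replace function and the file_order line would go into the editor; the harness is just for testing.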

senhal · 07-17-2015, 02:20 AM

I'm trying to write a set of regexes that do something similar to Pepito Cleaner (an OpenOffice plugin), plus some other searches for typical OCR errors (some of them are intended for the Italian language).

So, import and test the attached regexes and let me know how they work; my goal is to correct and improve them.

Any suggestions are welcome.

Explanations:

Words inside [ ] in the saved search name suggest what the replace button will do: [del] will delete something, [man] needs manual intervention, [space] will replace something with a space, and so on. Just make a copy of your ebook and try.
The regex called "ADE verify" finds all the characters that Adobe Digital Editions doesn't show: if the search finds something, you'll need to embed fonts for everything to display in ADE.
Please forgive my English translations of the regex names, and help me improve them too.

Arjayem · 03-03-2016, 12:22 PM

Errors produced by scanning text seem to follow a predictable pattern, such as seU for sell, iUness for illness or bom for born, but nevertheless they aren't corrected by the automatic scanning software. So I created a function for the calibre editor to fix those I found most often. You'll also find I've corrected some American spellings; depending on your dictionary, these won't actually be wrong.

The code is based on the Calibre example that tidies up hyphens.

You'll need to enter the following in the Find box: >.*?<

Here's the function. Because Python relies on indentation, you may need to adjust the spacing a little before Python will accept the code:

Code:

import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def replace_word(wmatch):
        # Check if the current word exists in the dictionary
        CheckThisSpelling = wmatch.group(1)
        if dictionaries.recognized(CheckThisSpelling) == True:   
            return wmatch.group()
        else:
        #	else try to correct it - remove American spelling
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("or", "our") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)         
            NewSpelling = CheckThisSpelling + '~'
            NewSpelling = NewSpelling.replace("or~", "our") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
            NewSpelling = CheckThisSpelling + '~'
            NewSpelling = NewSpelling.replace("ors~", "our") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)    
        #	else try to correct it - remove American spelling
            NewSpelling = CheckThisSpelling + '~'
            NewSpelling = NewSpelling.replace("er~", "re") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("er", "re") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            else:
              NewSpelling = NewSpelling.replace("ree", "re") 
              if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                    
            NewSpelling = CheckThisSpelling + '~'
            NewSpelling = NewSpelling.replace("ers~", "res") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling + '~'
            NewSpelling = NewSpelling.replace("nse~", "nce") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
        #	else try to correct it - remove American spelling
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("l", "ll") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("l", "ll",1) 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("l", "~",2) 
            NewSpelling = NewSpelling.replace("~", "l",1)
            NewSpelling = NewSpelling.replace("~", "ll",1)                       
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                                 
        #	else try to correct it - remove American spelling
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("ll", "l") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("ll", "l",1) 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("ll", "~",2) 
            NewSpelling = NewSpelling.replace("~", "ll",1)
            NewSpelling = NewSpelling.replace("~", "l",1)                       
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)               
         #
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("U", "li") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("U", "ll") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)            
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("h", "li") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("H", "li") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("h", "li",1) 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("H", "li",1) 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)  
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("h", "~",2) 
            NewSpelling = NewSpelling.replace("~", "h",1)
            NewSpelling = NewSpelling.replace("~", "li",1)              
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("H", "~",2) 
            NewSpelling = NewSpelling.replace("~", "H",1)
            NewSpelling = NewSpelling.replace("~", "li",1)   
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                         
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("im", "un") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("l", "ll") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
         #
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("imi", "um") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)              
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("m", "rn") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("m", "in") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2) 
         #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("m", "hi") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)  
          #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("mn", "um") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)           
          #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("nm", "run") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
          #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("nmi", "rum") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                                                                                           
          #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("bn", "lm") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                                                                                            
          #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("ii", "h") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                                                                                            
          #	else try to correct it 
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("ii", "u") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                                                                                            
         #	
         #	else try to correct it 
            if CheckThisSpelling == 'Fd':
                return " I'd" +  wmatch.group(2)  
            if CheckThisSpelling == 'Fve':
                return " I've" +  wmatch.group(2)
            if CheckThisSpelling == 'Fm':
                return " I'm" +  wmatch.group(2)
            if CheckThisSpelling == 'Fll':
                return " I'll" +  wmatch.group(2) 
            if CheckThisSpelling == 'youVe':
                return " you've" +  wmatch.group(2)
            if CheckThisSpelling == 'YouVe':
                return " You've" +  wmatch.group(2)                   
         #	
         #	else try to correct it 
            if CheckThisSpelling == 'wren\'t':
                return " weren't" +  wmatch.group(2)              

         #	
         #	else try to correct it 
            if CheckThisSpelling == '&':
                return ' ' + chr(38) +  wmatch.group(2)  
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace(">", "y") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("j&", "fi") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("i&", "fi") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)  
            NewSpelling = CheckThisSpelling
            NewSpelling = NewSpelling.replace("l&", "fi") 
            if dictionaries.recognized(NewSpelling) == True:   
                return NewSpelling +  wmatch.group(2)                                      
                                                                              
        return wmatch.group()
        #return wmatch.group() + '1' + wmatch.group(1) + '2' + wmatch.group(2) + '3' + NewSpelling
    # Search for words 
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &amp;
    corrected = regex.sub(r'\s*([\w\>\&[[a-z]\'[a-z]]]*)([\s*\.\?\,\"\;])', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities

GOOD LUCK & HOPE ITS OF SOME USE
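For readers who want to experiment with the idea on a smaller scale, here is a table-driven sketch of the same approach: try each substitution in order and keep the first respelling the dictionary recognizes. It is not the function above. FakeDictionaries is a hypothetical stand-in that imitates only the recognized() call used in the post, and the table holds just a few of the substitutions.

```python
class FakeDictionaries:
    """Hypothetical stand-in for the editor's dictionaries object."""
    def __init__(self, words):
        self.words = set(words)

    def recognized(self, word):
        return word in self.words

# (scanned, intended) pairs, tried in order; a small subset only.
SUBSTITUTIONS = [("U", "li"), ("U", "ll"), ("m", "rn"), ("ii", "u")]

def correct_word(word, dictionaries):
    """Return a recognized respelling of word, or word itself if none."""
    if dictionaries.recognized(word):
        return word
    for scanned, intended in SUBSTITUTIONS:
        if scanned not in word:
            continue
        # Try replacing every occurrence, then only the first one.
        for candidate in (word.replace(scanned, intended),
                          word.replace(scanned, intended, 1)):
            if dictionaries.recognized(candidate):
                return candidate
    return word

words = FakeDictionaries(["sell", "illness", "born"])
print(correct_word("seU", words), correct_word("iUness", words),
      correct_word("bom", words))
```

Adding a new OCR fix then means adding one pair to the table rather than another block of code.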

PeterT · 03-03-2016, 01:06 PM

You might like to wrap your code in [code] .... [/code] tags to preserve spacing and indentation.

Arjayem · 03-04-2016, 04:07 AM

The code sample came via notepad. I keep a copy in a txt file because I've wiped one version in Calibre using the remove button which is unforgiving and next to the edit button, a design feature that it would be nice to see addressed.

phossler · 03-04-2016, 08:40 AM

When posting code or similar, at the bottom of the window is the [Go Advanced] button to show more options.

One is the [#] icon which adds the CODE tags.

Just paste or type between them and it formats nicely

theducks · 03-04-2016, 09:36 AM

you can just type ANY tag pair if you know it. It even permits lowercase entry (it auto-raises on posting)

but I really wish the MR forums in the software section, which commonly get code and error logs, defaulted to the 'Advanced' (or forum-appropriate) tool buttons

eschwartz · 03-04-2016, 10:55 AM

We have a sticky thread you can post this in.

@theducks, would you mind fixing the thread title for that sticky? I think it predated Function-Replace mode.

"Saved Search" ==> "Saved Search/Regex Functions"




05-21-2014, 10:46 AM	#5
Zajora Junior Member Posts: 1 Karma: 10 Join Date: May 2014 Device: Kindle Keyboard
I have created a fair number of regex fixes. I make changes to them every so often, so I'll probably edit this post if I do. I'd use code tags, but they take up too much room. If a regex should be replaced with a space, it will say "space". If there is nothing, then (logically) it should be replaced with nothing.

Of course, there is no guarantee any of these will work properly. I always check them a number of times before doing replace all, since there are a TON of ways eBooks can have formatting that wrecks these regexes.

Scenario: Apostrophes have been replaced with double quotes
Match: (?<=\w)(“|”)(?=\w)
Replace: ’

Scenario: There is a linebreak in the middle of a character's dialogue
Match: (?<=“[^”]*)</p>\s*<p[^>]*>(?!“)
Replace: space

Scenario: A tag closes, is followed by 0+ spaces or newlines, is then reopened and is then followed by a lowercase letter
Match: </(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>(?=[a-z])
Match: (?<![".!?>”“…~’])</(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>
Replace: space
Notes: The second one is an alternate, which I think is better, but I'm not 100% sure it covers all the cases of the former.

Scenario: "LL" Ligatures have been replaced with a single "L".
Match: (l (?=(y|s|ed|ey|ion|en|ar|ars|er|ow|et|owed|enge|age|enging|ected|egal|ections|ect|apse|ular|op|owing|ocks|ied|ier|ies|ing|ingly|ered|icit|est)(\W)))|(l (?![(–<-])(?=\W))|(?<=’)l(?=\W)|(?<= (wi|du|a|we|te|sma|ca|sti|fu|fa|chi|sha|wa|pha|se|bi|ha|ki|pu|ce|ba|ski|hi|fi|fe|he|ro|ta|i|sme|bri|sta|we))l(?=\W)
Replace: ll
Notes: This regex doesn't really work that well, but it's faster than doing it manually. I would recommend using the spellcheck afterwards and catching the most common ones. This regex is actually a bunch of individual ones chained together by ORs (|) so it's easier to see what's doing what.

Scenario: More than 1 space in a row
Match: (?<=\S) {2,}(?=\S)
Replace: space

Scenario: There are tags (which may be nested) that are either empty or just have a number in them
Match: (<[^/>]*>)+\s*\d*\s*(</[^>]*>)+
Replace:
Notes: This may remove things you'd like to keep, such as scenebreaks/whitespace, or the chapter links.

Scenario: There's a linebreak or spaces before a closing tag
Match: (?<![".!?>”“…~’])</(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>
Replace:

For future use:
Scenario:
Match:
Replace:

Last edited by Zajora; 05-21-2014 at 10:51 AM.
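Before running "replace all" on a whole book, it can help to try a pattern on a throwaway string first. Here is a minimal sketch for the apostrophe and multiple-space fixes from the post above, using Python's standard re module. The function names and sample strings are made up for illustration, and the editor's own regex engine may differ in details from plain re.

```python
import re

# Patterns from the post above, written with a plain | alternation.
# Both lookbehinds are fixed-width, so the standard re module accepts them.
APOSTROPHE = re.compile(r'(?<=\w)(“|”)(?=\w)')
MULTI_SPACE = re.compile(r'(?<=\S) {2,}(?=\S)')


def fix_apostrophes(text):
    """Turn a curly double quote sandwiched between word characters into ’."""
    return APOSTROPHE.sub('’', text)


def collapse_spaces(text):
    """Collapse a run of two or more spaces between non-space characters."""
    return MULTI_SPACE.sub(' ', text)


if __name__ == '__main__':
    # Hypothetical sample text, not taken from any real book.
    print(fix_apostrophes('“I don”t know,” she said.'))
    print(collapse_spaces('too   many  spaces'))
```

Quotes at the start of a sentence or after punctuation are left alone because they are not surrounded by word characters on both sides.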

05-22-2014, 08:03 PM	#6
Section8 Addict Posts: 264 Karma: 2121470 Join Date: Oct 2011 Location: Arlington, TX Device: Kindle PW4, Moon+ Reader on a cheap Android tablet
I have a nook, and the only real regexes I've written are for fixing stylesheets to work around its margin bug: if "publisher defaults" are disabled, the nook doesn't handle the CSS "margin" setting. I've been using these to convert all 4 forms of "margin" to the equivalent margin-top, margin-right, etc. These were written for Sigil, but I think they work in the calibre editor.

First, find margin:
Find: margin *:

Convert margin: a (single value)
Find: margin *: *([^\s;]+)(\s*(;|}))
Replace: margin-top: \1; margin-right: \1; margin-bottom: \1; margin-left: \1\2

Convert margin: a, b (2 values)
Find: margin *: *([^\s;]+) +([^\s;]+)([\s]*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \1; margin-left: \2\3

Convert margin: a, b, c (3 values)
Find: margin *: *([^\s;]+) +([^\s;]+) +([^\s;]+)([\s]*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \3; margin-left: \2\4

Convert margin: a, b, c, d (4 values)
Find: margin *: *([^\s;]+) +([^\s;]+) +([^\s;]+) +([^\s;]+)(\s*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \3; margin-left: \4\5
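The four margin conversions above follow the usual CSS shorthand order (top, right, bottom, left), so they can also be folded into one function. This is only a sketch in plain Python, not something tested on a nook: the name expand_margin and the pattern are my own assumptions, and declarations containing !important are deliberately left untouched.

```python
import re

# Matches "margin: <values>" up to, but not including, the closing ; or }.
# The lookbehind keeps it from firing on names that merely end in "margin".
MARGIN = re.compile(r'(?<![\w-])margin\s*:\s*([^;}]+?)\s*(?=[;}])')


def expand_margin(css):
    """Rewrite the margin shorthand as four explicit margin-* declarations."""
    def repl(match):
        values = match.group(1).split()
        # Leave anything unusual (0 or 5+ values, !important) unchanged.
        if not 1 <= len(values) <= 4 or any(v.startswith('!') for v in values):
            return match.group(0)
        top = values[0]
        right = values[1] if len(values) > 1 else top
        bottom = values[2] if len(values) > 2 else top
        left = values[3] if len(values) > 3 else right
        return ('margin-top: %s; margin-right: %s; '
                'margin-bottom: %s; margin-left: %s' % (top, right, bottom, left))
    return MARGIN.sub(repl, css)


if __name__ == '__main__':
    # Hypothetical stylesheet fragment for illustration.
    print(expand_margin('p { margin: 1em 0; }'))
```

Existing margin-top, margin-right and similar declarations are not matched, because the pattern requires the colon to follow the word "margin" directly.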

06-19-2014, 02:17 PM	#7
user743 Addict Posts: 243 Karma: 44444 Join Date: Mar 2014 Device: Kindle PW2 special offers removed by Amazon for FREE
Switch script links to HTML links.

Code:

Find: <script> AddIndex\("(.+?)", (".+?"), ".+?"\); </script>
Replace: <a href=\2>\1</a>

Regex mode, not case sensitive, dot all. Change double quotes to single quotes if necessary.
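For anyone who wants to check the script-link conversion outside the editor, here is a small sketch with Python's re module. The sample AddIndex call and its three arguments are invented for illustration (I am assuming the first argument is the link text and the second the target), and the replacement closes the anchor with </a>.

```python
import re

# Same search as in the post above, with "not case sensitive" and "dot all"
# expressed as flags.
SCRIPT_LINK = re.compile(
    r'<script> AddIndex\("(.+?)", (".+?"), ".+?"\); </script>',
    re.IGNORECASE | re.DOTALL,
)


def script_to_anchor(html):
    """Replace each AddIndex script call with a plain HTML link."""
    # Group 2 still carries its double quotes, so href=\2 ends up quoted.
    return SCRIPT_LINK.sub(r'<a href=\2>\1</a>', html)


if __name__ == '__main__':
    # Hypothetical input line.
    sample = '<script> AddIndex("Chapter One", "ch01.html", "toc"); </script>'
    print(script_to_anchor(sample))
```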

12-23-2014, 02:46 AM	#8
dmonasse Member Posts: 23 Karma: 10 Join Date: Apr 2014 Location: Paris Device: ipad 2, Ubuntu
A regex function to number a (mathematical) ebook

The search and replace tool with regex function is really fantastic. My small company builds mathematical ebooks from LaTeX sources. One of my problems when converting such books is that LaTeX auto-numbers chapters, sections, subsections and theorem-like assertions (theorems, propositions, lemmas, definitions, corollaries and so on). I would like to reproduce that numbering in my ebook. One solution is the following:

1) When converting from LaTeX, I put chapters, sections, subsections and assertions in a <div> tag with an HTML5 data-type attribute. For example, a LaTeX section

Code:

\section{History of the Fermat-Wiles theorem}

is converted into

Code:

<div class="section" data-type="section">History of the Fermat-Wiles theorem</div>

and

Code:

\begin{theorem}Abracadabra\end{theorem}

is converted into

Code:

<div class="theorem" data-type="theorem">Abracadabra</div>

Note: I can't use the class attribute to denote the type of the div, because the conversion from HTML to ePub by Calibre modifies these attributes, and class="theorem" may be changed into class="pcalibre25". That's the reason for the data-type attribute.

2) After conversion from LaTeX to HTML (not so easy!!!) and from HTML to ePub (easy with Calibre), I number the whole book with the Calibre editor using the search and replace tool with regex function.
The search pattern I use is:

Code:

<div.*?data-type="(chapter|section|subsection|theorem|proposition|lemma|definition|corollary)"[^>]*>

and the regex function may be:

Code:

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if number == 1:
        # initialization of the counts
        data['chapter'] = 0
        data['section'] = 0
        data['subsection'] = 0
        data['assertion'] = 0
    the_type = match.group(1)
    if the_type == 'chapter':
        # begins a chapter, reinitialize the counts
        data['section'] = 0
        data['subsection'] = 0
        data['assertion'] = 0
        data['chapter'] += 1
        return match.group() + "<span class='chapter_num'>Chapter " + str(data['chapter']) + ".</span> "
    elif the_type == 'section':
        # begins a section, reinitialize the subsection count
        data['subsection'] = 0
        data['section'] += 1
        return match.group() + "<span class='section_num'>Section " + str(data['section']) + ".</span>"
    elif the_type == 'subsection':
        data['subsection'] += 1
        return match.group() + "<span class='subsection_num'>Subsection " + str(data['section']) + "." + str(data['subsection']) + ".</span>"
    else:
        # this is an assertion
        data['assertion'] += 1
        return match.group() + "<span class='assertion_num'>Assertion " + str(data['chapter']) + "." + str(data['assertion']) + ".</span>"
    return ''

replace.file_order = 'spine'

Adapt the code according to your needs or wishes; this is only an example. It would be nicer to replace "Assertion" by "Theorem", "Proposition", "Lemma", "Corollary" or "Definition" (very easy to do starting from the "the_type" variable).

I obtain a numbering like this:

Code:

Chapter 1
  Section 1
    Subsection 1.1
      Assertion 1.1
      Assertion 1.2
    Subsection 1.2
      Assertion 1.3
  Section 2
    Subsection 2.1
      Assertion 1.4
      Assertion 1.5
    Subsection 2.2
      Assertion 1.6
Chapter 2
  Section 1
    Subsection 1.1
      Assertion 2.1
      Assertion 2.2
    Subsection 1.2
      Assertion 2.3
  Section 2
    Subsection 2.1
      Assertion 2.4
      Assertion 2.5

Hope this may help. Any improvement will be welcome (even to my bad English syntax).
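As a follow-up to the suggestion about using the_type for the label, here is a sketch of a variant that prints "Theorem", "Lemma" and so on instead of "Assertion", together with a tiny harness for trying it outside the editor. The harness run_like_replace_all is purely hypothetical: it only imitates, for a single string, the way the editor passes a 1-based match number and a shared data dict to replace(). The trailing space after every label is my own choice.

```python
import re

PATTERN = re.compile(
    r'<div.*?data-type="(chapter|section|subsection|theorem|proposition|'
    r'lemma|definition|corollary)"[^>]*>')


def replace(match, number, file_name, metadata, dictionaries, data,
            functions, *args, **kwargs):
    if number == 1:
        # First match of the run: start all counters at zero.
        data.update(chapter=0, section=0, subsection=0, assertion=0)
    the_type = match.group(1)
    if the_type == 'chapter':
        # A new chapter resets the lower-level counters.
        data.update(section=0, subsection=0, assertion=0)
        data['chapter'] += 1
        css, label = 'chapter_num', 'Chapter %d.' % data['chapter']
    elif the_type == 'section':
        data['subsection'] = 0
        data['section'] += 1
        css, label = 'section_num', 'Section %d.' % data['section']
    elif the_type == 'subsection':
        data['subsection'] += 1
        css = 'subsection_num'
        label = 'Subsection %d.%d.' % (data['section'], data['subsection'])
    else:
        # Theorem-like block: label it with its own type name.
        data['assertion'] += 1
        css = 'assertion_num'
        label = '%s %d.%d.' % (the_type.capitalize(), data['chapter'],
                               data['assertion'])
    return "%s<span class='%s'>%s</span> " % (match.group(), css, label)


replace.file_order = 'spine'


def run_like_replace_all(html):
    """Hypothetical stand-in for Replace All on one file (testing only)."""
    data = {}
    counter = [0]

    def call(match):
        counter[0] += 1
        return replace(match, counter[0], 'test.html', None, None, data, None)

    return PATTERN.sub(call, html)


if __name__ == '__main__':
    sample = ('<div class="chapter" data-type="chapter">Intro</div>'
              '<div class="theorem" data-type="theorem">Abracadabra</div>')
    print(run_like_replace_all(sample))
```

Only the replace() function and the file_order line would be pasted into the editor; the pattern goes in the search field and the harness stays on your own machine.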

03-03-2016, 01:06 PM	#11
PeterT Grand Sorcerer Posts: 13,510 Karma: 78910112 Join Date: Nov 2007 Location: Toronto Device: Libra H2O, Libra Colour	You might like to wrap your code in [code] .... [/code] tags to preserve spacing and indentation.

03-04-2016, 04:07 AM	#12
Arjayem Casual Member Posts: 5 Karma: 10 Join Date: Mar 2016 Location: UK Device: Kindle paperwhite	The code sample came via notepad. I keep a copy in a txt file because I've wiped one version in Calibre using the remove button which is unforgiving and next to the edit button, a design feature that it would be nice to see addressed.

03-04-2016, 08:40 AM	#13
phossler Wizard Posts: 1,087 Karma: 447222 Join Date: Jan 2009 Location: Valley Forge, PA, USA Device: Kindle Paperwhite	When posting code or similar, at the bottom of the window is the [Go Advanced] button to show more options. One is the [#] icon, which adds the CODE tags. Just paste or type between them and it formats nicely.

03-04-2016, 09:36 AM	#14
theducks Well trained by Cats Posts: 31,047 Karma: 60358908 Join Date: Aug 2009 Location: The Central Coast of California Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A	You can just type ANY tag pair if you know it. It even permits lowercase entry (it auto-raises on posting). But I really wish the MR (software section) forums that commonly get code and error logs defaulted to the 'Advanced' (or forum-appropriate) tool buttons.

03-04-2016, 10:55 AM	#15
eschwartz Ex-Helpdesk Junkie Posts: 19,421 Karma: 85400180 Join Date: Nov 2012 Location: The Beaten Path, USA, Roundworld, This Side of Infinity Device: Kindle Touch fw5.3.7 (Wifi only)	We have a sticky thread you can post this in. @theducks, would you mind fixing the thread title for that sticky? I think it predated Function-Replace mode. "Saved Search" ==> "Saved Search/Regex Functions"
