04-06-2014, 10:20 PM | #1 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Saved Search/Regex Functions
If anyone has useful Saved Searches they would like to share, you can share them in this thread.
Generic rules to fix common problems, for example. Or just anything clever and cool which you are proud of and want to admire.
NOTE: To make it easier to read, it would be nice if all Search & Replace fields were wrapped in the [CODE]content goes here[/CODE] tags. Also, you can export the saved search as a .json, and upload it here in a zipped folder.
Moderator Notice
This thread has been made a sticky, and unlike most other sticky threads, this one is open to all who have a useful saved Search/Replace they wish to share. Do not use this thread to ask any questions. Start a new thread. Posts that don't belong here will be deleted or moved, but you are encouraged to post if you have something to share. Please add a descriptive title to each post and explain what your Saved Search accomplishes.
Last edited by DoctorOhh; 04-08-2014 at 02:14 AM. Reason: added some formatting/sharing guidelines |
04-07-2014, 08:14 AM | #2 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Finding and joining broken paragraphs
([a-z])</p>whatever else is in the middle here<p>([a-z])
Replace with \1space\2 (that is, \1, a single space, then \2), with case sensitive ticked. It doesn't catch absolutely everything, but it can be used very quickly. Potshots from people who actually know regex are welcome. I just guess and see if it works! |
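For anyone who wants to try the idea outside the editor first, here is a minimal Python sketch. It assumes the "whatever else is in the middle" part is only whitespace and that the opening <p> tag may carry attributes; the function name is made up for illustration.

```python
import re

# Assumption: only whitespace sits between </p> and the next <p ...>.
BROKEN_PARA_RE = re.compile(r'([a-z])</p>\s*<p(?:\s[^>]*)?>([a-z])')

def join_broken_paragraphs(html):
    """Join a paragraph that ends in a lowercase letter with a following
    paragraph that starts with a lowercase letter (case sensitive)."""
    return BROKEN_PARA_RE.sub(r'\1 \2', html)

print(join_broken_paragraphs('<p>The cat sat on</p>\n<p>the mat.</p>'))
```

Paragraphs that end in punctuation or start with a capital letter are left alone, which is why this misses some cases but rarely joins the wrong ones.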
04-07-2014, 08:18 AM | #3 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
Uncapitalized letters after period and quote
\.” [a-z]
Case sensitive ticked. No easy replace in this case, but at least you can find it. Remove the quote from the pattern to find uncapitalized sentences that are not preceded by a quote. |
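A plain replace string cannot uppercase a letter, but a function can. The following is only a sketch in standalone Python (the function name is invented); the same logic could presumably be moved into the editor's function mode. It capitalizes every hit blindly, so the results still need reviewing.

```python
import re

# Same search as above, with the quote part and the letter captured separately.
LOWER_AFTER_QUOTE_RE = re.compile(r'(\.” )([a-z])')

def capitalize_after_quote(text):
    """Uppercase a lowercase letter that follows a period, a closing quote and a space."""
    return LOWER_AFTER_QUOTE_RE.sub(
        lambda m: m.group(1) + m.group(2).upper(), text)

print(capitalize_after_quote('“I am done.” she left the room.'))
```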
04-07-2014, 05:17 PM | #4 |
Dead account. Bye
Posts: 587
Karma: 668244
Join Date: Mar 2011
Device: none
|
Preventing line wraps around dashes in Spanish dialogues
As dashes are wrap points in HTML, dialogues in Spanish ebooks can look terrible.
Example in one line: Code:
—Bla, Bla, Bla, —John said—. More bla, bla, bla. Code:
—Bla, Bla, Bla, —John said —. More bla, bla, bla. Code:
—Bla, Bla, Bla, — John said—. More bla, bla, bla. Code:
—Bla, Bla, Bla, —John said—. More bla, bla, bla.
The following two searches wrap the word next to each dash in a <span> with a specified class (in my example, just <span class="nw">). Then add the following CSS definition for this class: Code:
.nw { white-space: nowrap; display: inline-block; text-indent: 0em;}
Edit notes. Explanation of the workaround for RMSDK: Spoiler:
First S&R
Search: Code:
\x20(—|–|—|–)([^ <]+)( |</p>|</div>)
Replace: Code:
\x20<span class="nw">\1\2</span>\3
Second S&R
Search: Code:
\x20([^ >]+)(—|–|—|–)(\.|\.\.\.|,|;|:|…|…)?\x20
Replace: Code:
\x20<span class="nw">\1\2\3</span>\x20
Spoiler:
Last edited by arspr; 05-07-2015 at 03:53 PM. Reason: New .nw CSS definition - Workaround for RMSDK |
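To see what the two searches do to the example line, here is a rough Python equivalent. It is a sketch under a few assumptions: the dashes are literal em/en dash characters rather than entities, a plain space is used where the searches use \x20, and the function name is invented.

```python
import re

# First S&R: a dash that opens a narrator insert, glued to the word after it.
OPENING_DASH_RE = re.compile(r' (—|–)([^ <]+)( |</p>|</div>)')
# Second S&R: a dash that closes the insert, glued to the word before it.
CLOSING_DASH_RE = re.compile(r' ([^ >]+)(—|–)(\.|\.\.\.|,|;|:|…)? ')

def protect_dialogue_dashes(html):
    """Wrap dash+word pairs in <span class="nw"> so they cannot be split by a line wrap."""
    html = OPENING_DASH_RE.sub(r' <span class="nw">\1\2</span>\3', html)
    return CLOSING_DASH_RE.sub(r' <span class="nw">\1\2\3</span> ', html)

print(protect_dialogue_dashes(
    '<p>—Bla, bla, bla, —John said—. More bla, bla, bla.</p>'))
```

The order matters: the second search skips the spans added by the first one because its word group does not allow a ">" character.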
05-21-2014, 11:46 AM | #5 |
Junior Member
Posts: 1
Karma: 10
Join Date: May 2014
Device: Kindle Keyboard
|
I have created a fair number of regex fixes. I make changes to them every so often, so I'll probably edit this post if I do. I'd use code tags, but they take up too much room. If a regex should be replaced with a space, it will say "space". If there is nothing, then (logically) it should be replaced with nothing.
Of course, there is no guarantee any of these will work properly. I always check them a number of times before doing replace all, since there are a TON of ways eBooks can have formatting that wrecks these regexes.
Scenario: Apostrophes have been replaced with double quotes
Match: (?<=\w)(“|”)(?=\w)
Replace: ’
Scenario: There is a linebreak in the middle of a character's dialogue
Match: (?<=“[^”]*)</p>\s*<p[^>]*>(?!“)
Replace: space
Scenario: A tag closes, is followed by 0+ spaces or newlines, is then reopened and is then followed by a lowercase letter
Match: </(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>(?=[a-z])
Match: (?<![".!?>*”“…~’])</(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>
Replace: space
Notes: The second one is an alternate, which I think is better, but I'm not 100% sure it covers all the cases of the former.
Scenario: "LL" Ligatures have been replaced with a single "L".
Match: (l (?=(y|s|ed|ey|ion|en|ar|ars|er|ow|et|owed|enge|age|enging|ected|egal|ections|ect|apse|ular|op|owing|ocks|ied|ier|ies|ing|ingly|ered|icit|est)(\W)))|(l (?![(–<-])(?=\W))|(?<=’)l(?=\W)|(?<= (wi|du|a|we|te|sma|ca|sti|fu|fa|chi|sha|wa|pha|se|bi|ha|ki|pu|ce|ba|ski|hi|fi|fe|he|ro|ta|i|sme|bri|sta|we))l(?=\W)
Replace: ll
Notes: This regex doesn't really work that well, but it's faster than doing it manually. I would recommend using the spellcheck afterwards and catching the most common ones. This regex is actually a bunch of individual ones chained together by ORs (|) so it's easier to see what's doing what.
Scenario: More than 1 space in a row
Match: (?<=\S) {2,}(?=\S)
Replace: space
Scenario: There are tags (which may be nested) that are either empty or just have a number in them
Match: (<[^/>]*>)+\s*\d*\s*(</[^>]*>)+
Replace:
Notes: This may remove things you'd like to keep, such as scenebreaks/whitespace, or the chapter links.
Scenario: There's a linebreak or spaces before a closing tag
Match: (?<![".!?>*”“…~’])</(?P<tag>\w+)>\s*<(?P=tag) [^/>]+>
Replace:
For future use:
Scenario:
Match:
Replace:
Last edited by Zajora; 05-21-2014 at 11:51 AM. |
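Two of the simpler fixes above can be tried in plain Python before trusting them on a whole book. This is just a sketch using the standard re module (the function names are made up); the patterns are copied from the apostrophe and multiple-space scenarios.

```python
import re

def fix_apostrophes(text):
    """A curly double quote between two word characters is really an apostrophe."""
    return re.sub(r'(?<=\w)(“|”)(?=\w)', '’', text)

def collapse_spaces(text):
    """Replace a run of two or more spaces between non-space characters with one space."""
    return re.sub(r'(?<=\S) {2,}(?=\S)', ' ', text)

print(fix_apostrophes('“I don”t know,” she said.'))
print(collapse_spaces('too   many    spaces'))
```

Real opening and closing quotes are left alone because they are never surrounded by word characters on both sides.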
05-22-2014, 09:03 PM | #6 |
Addict
Posts: 256
Karma: 2092424
Join Date: Oct 2011
Location: Arlington, TX
Device: Kindle PW4, Moon+ Reader on a cheap Android tablet
|
I have a nook, and the only real regexes I've written are for fixing stylesheets to work around its margin bug: if "publisher defaults" are disabled, the nook doesn't handle the css "margin" setting. I've been using these to convert all 4 forms of "margin" to the equivalent margin-top, margin-right, etc. These were written for Sigil, but I *think* they work in the calibre editor.
First, find margin:
Find: margin *:
Convert margin: a (single value):
Find: margin *: *([^\s;]+)(\s*(;|}))
Replace: margin-top: \1; margin-right: \1; margin-bottom: \1; margin-left: \1\2
Convert margin: a, b (2 values):
Find: margin *: *([^\s;]+) +([^\s;]+)([\s]*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \1; margin-left: \2\3
Convert margin: a, b, c (3 values):
Find: margin *: *([^\s;]+) +([^\s;]+) +([^\s;]+)([\s]*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \3; margin-left: \2\4
Convert margin: a, b, c, d (4 values):
Find: margin *: *([^\s;]+) +([^\s;]+) +([^\s;]+) +([^\s;]+)(\s*(;|}))
Replace: margin-top: \1; margin-right: \2; margin-bottom: \3; margin-left: \4\5 |
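The four searches above follow the usual CSS shorthand rule (1 value: all sides; 2: vertical, horizontal; 3: top, horizontal, bottom; 4: top, right, bottom, left). As a cross-check, here is a sketch that does all four cases in one pass in standalone Python. The function name is hypothetical, and declarations containing !important are skipped on purpose rather than guessed at.

```python
import re

# "margin:" up to the next ';' or '}', but not "margin-top:" and friends.
MARGIN_RE = re.compile(r'(?<![\w-])margin\s*:\s*([^;}]+?)\s*(?=[;}])')

def expand_margin(css):
    """Rewrite the margin shorthand as four explicit margin-* declarations."""
    def repl(m):
        values = m.group(1).split()
        if not 1 <= len(values) <= 4 or any(v.startswith('!') for v in values):
            return m.group(0)  # leave anything unusual untouched
        top = values[0]
        right = values[1] if len(values) > 1 else top
        bottom = values[2] if len(values) > 2 else top
        left = values[3] if len(values) > 3 else right
        return ('margin-top: %s; margin-right: %s; '
                'margin-bottom: %s; margin-left: %s' % (top, right, bottom, left))
    return MARGIN_RE.sub(repl, css)

print(expand_margin('p { margin: 1em 2em; }'))
```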
06-19-2014, 03:17 PM | #7 |
Addict
Posts: 243
Karma: 44444
Join Date: Mar 2014
Device: Kindle PW2 special offers removed by Amazon for FREE
|
switch script links to html links.
Code:
<script> AddIndex\("(.+?)", (".+?"), ".+?"\); </script>
Replace with: Code:
<a href=\2>\1</a>
Change double quotes to single quotes if necessary. |
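Here is the same replacement as a small Python sketch, in case it helps to test it on a sample line first. It assumes the first AddIndex argument is the link text and the second is the target, as the replacement above implies; whitespace is matched loosely and the function name is invented.

```python
import re

SCRIPT_LINK_RE = re.compile(
    r'<script>\s*AddIndex\("(.+?)",\s*(".+?"),\s*".+?"\);\s*</script>')

def script_to_link(html):
    """Turn an AddIndex(...) script call into an ordinary HTML link."""
    return SCRIPT_LINK_RE.sub(r'<a href=\2>\1</a>', html)

print(script_to_link('<script> AddIndex("Chapter 1", "ch1.html", "x"); </script>'))
```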
12-23-2014, 03:46 AM | #8 |
Member
Posts: 23
Karma: 10
Join Date: Apr 2014
Location: Paris
Device: ipad 2, Ubuntu
|
A regex function to number a (mathematical) ebook
The search and replace tool with regex function is really fantastic. My little society is building mathematical ebooks from latex sources. One of my problems for converting such books is that latex auto-numbers chapters, sections, subsections and theorem-like assertions (theorems, propositions, lemmas, definitions, corollaries and so on). I would like to do such a numbering in my ebook.
A solution is the following:
1) Converting from latex, I put chapters, sections, subsections and assertions in a <div> tag with a html5 data-type attribute. For example, a latex section Code:
\section{History of the Fermat-Wiles theorem}
becomes Code:
<div class="section" data-type="section">History of the Fermat-Wiles theorem</div>
and a theorem Code:
\begin{theorem}Abracadabra\end{theorem}
becomes Code:
<div class="theorem" data-type="theorem">Abracadabra</div>
2) After conversion from latex to html (not so easy!!!) and from html to epub (easy with Calibre), I number the whole book with the Calibre editor using the search and replace tool with regex function. The search pattern I use is: Code:
<div.*?data-type="(chapter|section|subsection|theorem|proposition|lemma|definition|corollary)"[^>]*>
and the replace function is: Code:
def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    if number==1:
        #initialization of the counts
        data['chapter']=0
        data['section']=0
        data['subsection']=0
        data['assertion']=0
    the_type=match.group(1)
    if the_type=='chapter':
        # begins a chapter, reinitialize the counts
        data['section']=0
        data['subsection']=0
        data['assertion']=0
        data['chapter']+=1
        return match.group()+"<span class='chapter_num'>Chapter "+str(data['chapter'])+".</span> "
    elif the_type=='section':
        # begins a section, reinitialize the subsection count
        data['subsection']=0
        data['section']+=1
        return match.group()+"<span class='section_num'>Section "+str(data['section'])+".</span>"
    elif the_type=='subsection':
        data['subsection']+=1
        return match.group()+"<span class='subsection_num'>Subsection "+str(data['section'])+"."+str(data['subsection'])+".</span>"
    else:
        # this is an assertion
        data['assertion']+=1
        return match.group()+"<span class='assertion_num'>Assertion "+str(data['chapter'])+"."+str(data['assertion'])+".</span>"

replace.file_order = 'spine'
The resulting numbering looks like this: Code:
Chapter 1
  Section 1
    Subsection 1.1
      Assertion 1.1
      Assertion 1.2
    Subsection 1.2
      Assertion 1.3
  Section 2
    Subsection 2.1
      Assertion 1.4
      Assertion 1.5
    Subsection 2.2
      Assertion 1.6
Chapter 2
  Section 1
    Subsection 1.1
      Assertion 2.1
      Assertion 2.2
    Subsection 1.2
      Assertion 2.3
  Section 2
    Subsection 2.1
      Assertion 2.4
      Assertion 2.5 |
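The key trick is that the data dictionary survives from one match to the next, so it can hold the counters. To play with that idea outside the editor, here is a cut-down sketch in plain Python. It only handles chapters and theorems, its replace function takes a simplified argument list (not the full calibre signature), and number_book is a made-up driver that stands in for the editor calling the function once per match.

```python
import re

# Assumption: simplified pattern with only two data-type values.
PATTERN = re.compile(r'<div[^>]*?data-type="(chapter|theorem)"[^>]*>')

def replace(match, number, data):
    if number == 1:
        # first match: initialize the counters
        data['chapter'] = 0
        data['assertion'] = 0
    if match.group(1) == 'chapter':
        # a new chapter restarts the assertion count
        data['chapter'] += 1
        data['assertion'] = 0
        label = 'Chapter %d.' % data['chapter']
    else:
        data['assertion'] += 1
        label = 'Assertion %d.%d.' % (data['chapter'], data['assertion'])
    return match.group() + '<span>' + label + '</span> '

def number_book(html):
    """Call replace() once per match, sharing one dict across all calls."""
    data = {}
    count = 0
    def wrapper(m):
        nonlocal count
        count += 1
        return replace(m, count, data)
    return PATTERN.sub(wrapper, html)

sample = ('<div data-type="chapter">A</div><div data-type="theorem">T</div>'
          '<div data-type="chapter">B</div><div data-type="theorem">U</div>')
print(number_book(sample))
```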
07-17-2015, 03:20 AM | #9 |
Connoisseur
Posts: 82
Karma: 25684
Join Date: Sep 2014
Device: Kindle NT
|
I'm trying to write some regexes in order to have something similar to pepito cleaner (an OpenOffice plugin), along with some other searches for typical OCR errors (some of them are intended for the Italian language).
So, import and test the attached regexes and let me know; my goal is to correct and improve them. Any suggestions are welcome. Explanations:
|
03-03-2016, 01:22 PM | #10 |
Casual Member
Posts: 5
Karma: 10
Join Date: Mar 2016
Location: UK
Device: Kindle paperwhite
|
Scanning OCR Errors
Errors produced by scanning text seem to follow a predictable pattern, such as seU for sell, iUness for illness, or bom for born, but nevertheless aren't corrected by the automatic scanning software. So, I created a function for the calibre editor to fix those I most commonly found. You'll also find I've corrected some American spellings; depending upon your dictionary these won't actually be wrong.
The code is based on the Calibre example that tidies up hyphens. You'll need to enter the following find expression: >.*?< Here's the function. Because Python relies on indentation, you may need to adjust it a little before Python will accept the code: Code:
import regex
from calibre import replace_entities
from calibre import prepare_string_for_xml

def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):

    def replace_word(wmatch):
        # Check if the current word exists in the dictionary
        CheckThisSpelling = wmatch.group(1)
        if dictionaries.recognized(CheckThisSpelling):
            return wmatch.group()

        # else try to correct it - remove American spelling (or -> our)
        NewSpelling = CheckThisSpelling.replace("or", "our")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        # A trailing '~' marks the end of the word, so only endings are changed
        NewSpelling = (CheckThisSpelling + '~').replace("or~", "our")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = (CheckThisSpelling + '~').replace("ors~", "our")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # else try to correct it - remove American spelling (er -> re, nse -> nce)
        NewSpelling = (CheckThisSpelling + '~').replace("er~", "re")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("er", "re")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = NewSpelling.replace("ree", "re")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = (CheckThisSpelling + '~').replace("ers~", "res")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = (CheckThisSpelling + '~').replace("nse~", "nce")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # else try to correct it - remove American spelling (l -> ll)
        NewSpelling = CheckThisSpelling.replace("l", "ll")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("l", "ll", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        # Double only the second "l"
        NewSpelling = CheckThisSpelling.replace("l", "~", 2).replace("~", "l", 1).replace("~", "ll", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # else try to correct it - remove American spelling (ll -> l)
        NewSpelling = CheckThisSpelling.replace("ll", "l")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("ll", "l", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        # Reduce only the second "ll"
        NewSpelling = CheckThisSpelling.replace("ll", "~", 2).replace("~", "ll", 1).replace("~", "l", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # else try to correct it - OCR confusions of "li" and "ll"
        NewSpelling = CheckThisSpelling.replace("U", "li")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("U", "ll")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("h", "li")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("H", "li")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("h", "li", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("H", "li", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        # Replace only the second "h" / "H"
        NewSpelling = CheckThisSpelling.replace("h", "~", 2).replace("~", "h", 1).replace("~", "li", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("H", "~", 2).replace("~", "H", 1).replace("~", "li", 1)
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # else try to correct it - OCR confusions around m, n, u, i
        NewSpelling = CheckThisSpelling.replace("im", "un")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("l", "ll")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("imi", "um")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("m", "rn")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("m", "in")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("m", "hi")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("mn", "um")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("nm", "run")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("nmi", "rum")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("bn", "lm")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("ii", "h")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("ii", "u")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # else try to correct it - contractions that OCR commonly mangles
        if CheckThisSpelling == 'Fd':
            return " I'd" + wmatch.group(2)
        if CheckThisSpelling == 'Fve':
            return " I've" + wmatch.group(2)
        if CheckThisSpelling == 'Fm':
            return " I'm" + wmatch.group(2)
        if CheckThisSpelling == 'Fll':
            return " I'll" + wmatch.group(2)
        if CheckThisSpelling == 'youVe':
            return " you've" + wmatch.group(2)
        if CheckThisSpelling == 'YouVe':
            return " You've" + wmatch.group(2)
        if CheckThisSpelling == 'wren\'t':
            return " weren't" + wmatch.group(2)

        # else try to correct it - stray symbols
        if CheckThisSpelling == '&':
            return ' ' + chr(38) + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace(">", "y")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("j&", "fi")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = CheckThisSpelling.replace("i&", "fi")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)
        NewSpelling = NewSpelling.replace("l&", "fi")
        if dictionaries.recognized(NewSpelling):
            return NewSpelling + wmatch.group(2)

        # Nothing worked: leave the word unchanged
        return wmatch.group()
        #return wmatch.group() + '1' + wmatch.group(1) + '2' + wmatch.group(2) + '3' + NewSpelling

    # Search for words
    text = replace_entities(match.group()[1:-1])  # Handle HTML entities like &
    corrected = regex.sub(r'\s*([\w\>\&[[a-z]\'[a-z]]]*)([\s*\.\?\,\"\;])', replace_word, text, flags=regex.VERSION1 | regex.UNICODE)
    return '>%s<' % prepare_string_for_xml(corrected)  # Put back required entities
Last edited by Arjayem; 03-04-2016 at 05:53 AM. |
03-03-2016, 02:06 PM | #11 |
Grand Sorcerer
Posts: 12,930
Karma: 76440364
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
You might like to wrap your code in [code] .... [/code] tags to preserve spacing and indentation.
|
03-04-2016, 05:07 AM | #12 |
Casual Member
Posts: 5
Karma: 10
Join Date: Mar 2016
Location: UK
Device: Kindle paperwhite
|
The code sample came via Notepad. I keep a copy in a txt file because I've wiped one version in Calibre using the Remove button, which is unforgiving and sits right next to the Edit button. It would be nice to see that design addressed.
|
03-04-2016, 09:40 AM | #13 |
Wizard
Posts: 1,085
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
When posting code or similar, at the bottom of the window is the [Go Advanced] button to show more options.
One is the [#] icon, which adds the CODE tags. Just paste or type between them and it formats nicely. |
03-04-2016, 10:36 AM | #14 |
Well trained by Cats
Posts: 30,542
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
You can just type ANY tag pair if you know it. It even permits lowercase entry (it is automatically uppercased on posting),
but I really wish the MR (software section) forums that commonly get code and error logs would default to the 'Advanced' (or forum-appropriate) tool buttons. |
03-04-2016, 11:55 AM | #15 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
We have a sticky thread you can post this in.
@theducks, would you mind fixing the thread title for that sticky? I think it predated Function-Replace mode. "Saved Search" ==> "Saved Search/Regex Functions" |