10-12-2020, 03:19 PM | #676 |
Guru
Posts: 771
Karma: 2297170
Join Date: Jan 2017
Location: Poland
Device: Various
|
Use:
Code:
<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[([xvi]+|\d+)\]"></a>
Last edited by BeckyEbook; 10-12-2020 at 06:17 PM. Reason: Fix |
10-12-2020, 04:45 PM | #677 |
Well trained by Cats
Posts: 30,410
Karma: 58055234
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
A WAG
I would say your OR is flawed.
Code:
<a id="Page_([xvi]+)|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)|([\d]+)\]"></a>
But it may also be simplified: use the captured value as part of the second part, title=\1
Last edited by theducks; 10-12-2020 at 04:48 PM. Reason: added simplified |
10-12-2020, 07:36 PM | #678 | |
Running with scissors
Posts: 1,557
Karma: 14325282
Join Date: Nov 2019
Device: none
|
Quote:
Removing the parentheses from the first part so that it's Page_[xvi]+|\d+ also works (although there is no capture to reuse for the title). (I thought it worked the first time I tried it but just now it did not.)
So why did my extra parentheses screw it up? I added the parentheses so that it was clear, to me at least, what the OR was for.
Last edited by hobnail; 10-12-2020 at 07:56 PM. |
|
10-12-2020, 08:18 PM | #679 |
Running with scissors
Posts: 1,557
Karma: 14325282
Join Date: Nov 2019
Device: none
|
I think I understand it; the parentheses are telling the or bar what it's working on. It's not like a regular programming language where you could say "boolean a = (b) | (c);"
So I'm guessing I could add some extra parentheses around it and it would still work, but I haven't tested it: Page_(([xvi]+)|(\d+))
And I didn't know that you could use \1 in the same regexp; I thought you could only use it in the replacement part. That's nice to know.
Last edited by hobnail; 10-12-2020 at 08:21 PM. |
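Both guesses in the post above can be checked outside the editor. This is only a sketch in Python's `re` module, which may differ in details from the regex engine in your editor; the sample tags are made up for illustration.

```python
import re

# The parentheses limit "|" to choosing between roman and arabic numerals.
# \1 inside the pattern then requires the title to repeat the captured number.
pattern = re.compile(
    r'<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[\1\]"></a>'
)

roman = '<a id="Page_xii" class="x-ebookmaker-pageno" title="[xii]"></a>'
arabic = '<a id="Page_42" class="x-ebookmaker-pageno" title="[42]"></a>'
mismatch = '<a id="Page_42" class="x-ebookmaker-pageno" title="[43]"></a>'

# True, True, False: the mismatched title is rejected by the backreference.
matches = [bool(pattern.fullmatch(s)) for s in (roman, arabic, mismatch)]

# The version with extra parentheses: group 1 is the whole page number,
# groups 2 and 3 hold the roman or the arabic alternative.
nested = re.compile(r'Page_(([xvi]+)|(\d+))')
nested_roman = nested.fullmatch('Page_xii').group(1)
nested_arabic = nested.fullmatch('Page_42').group(1)
```

With the extra parentheses the group numbers shift, so any \1 used later would refer to the outer group.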
10-12-2020, 08:19 PM | #680 | |
Bibliophagist
Posts: 39,898
Karma: 154464500
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Quote:
Basically, you can't use them as separators as you would in a mathematical expression. |
|
10-12-2020, 08:57 PM | #681 |
Well trained by Cats
Posts: 30,410
Karma: 58055234
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
My GUESS is you satisfied the FIND with the first match found (no recursive +), which is why you saw the Highlight as it was
|
10-13-2020, 01:53 AM | #682 | |
Grand Sorcerer
Posts: 24,905
Karma: 47303822
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
Code:
<a id="Page_([xvi]+)
Code:
([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)
Code:
([\d]+)\]"></a>
Code:
<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>
Last edited by davidfor; 10-13-2020 at 02:07 AM. Reason: Remember to refresh before replying.... |
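The pieces listed in the post above are what an ungrouped "|" leaves behind: each bar splits the whole expression into alternatives. A small Python sketch (an assumption on my part that Python's `re` behaves like the editor here; the sample tags are invented) shows what each version actually finds.

```python
import re

# Ungrouped: the two "|" split the expression into three separate alternatives.
ungrouped = re.compile(
    r'<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" '
    r'title="\[([xvi]+)|([\d]+)\]"></a>'
)
# Grouped: each "|" only chooses between roman and arabic page numbers.
grouped = re.compile(
    r'<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" '
    r'title="\[([xvi]+|[\d]+)\]"></a>'
)

roman_tag = '<a id="Page_xii" class="x-ebookmaker-pageno" title="[xii]"></a>'
arabic_tag = '<a id="Page_42" class="x-ebookmaker-pageno" title="[42]"></a>'

# Only the first alternative matches: the tag start up to the roman numeral.
first_hit = ungrouped.search(roman_tag).group(0)
# Only the third alternative matches: the digits in the title plus the tag end.
arabic_hit = ungrouped.search(arabic_tag).group(0)
# The grouped version matches the whole tag.
whole_hit = grouped.search(roman_tag).group(0)
```

That partial hit would explain a highlight that covers only part of the tag.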
|
01-18-2021, 04:57 AM | #683 |
Enthusiast
Posts: 30
Karma: 10
Join Date: Mar 2019
Location: Slovenia
Device: PocketBoot Inkpad 3
|
How can I transform uppercase text into lowercase text between tags with RegEx?
Example before:
Code:
<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>
Example after:
Code:
<p class="tibTrans">la ma nam dang ji dam kjil khor lha</p>
Code:
Find: <p class="tibTrans">(.*?)<\/p>
Code:
Replace: <p class="tibTrans">\L$1<\/p>
Last edited by Skydancer; 01-18-2021 at 05:30 AM. |
01-18-2021, 06:09 AM | #684 | |
Grand Sorcerer
Posts: 5,637
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
Code:
Replace: <p class="tibTrans">\L\1<\/p>
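For anyone doing the same conversion in a script rather than in the editor: Python's `re` module has no \L escape, so a replacement function does the lowercasing instead. This is a sketch of an equivalent, not the editor's own mechanism, and the function name `lowercase_tibtrans` is made up.

```python
import re

# Capture the opening tag, the text, and the closing tag separately
# so that only the text between the tags is lowercased.
TIBTRANS = re.compile(r'(<p class="tibTrans">)(.*?)(</p>)')

def lowercase_tibtrans(html: str) -> str:
    """Lowercase the text inside every <p class="tibTrans"> paragraph."""
    return TIBTRANS.sub(
        lambda m: m.group(1) + m.group(2).lower() + m.group(3), html
    )

before = '<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>'
after = lowercase_tibtrans(before)
```

The class name keeps its capital T because the tag itself sits in a separate group.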
|
|
01-20-2021, 12:19 AM | #685 | |
Running with scissors
Posts: 1,557
Karma: 14325282
Join Date: Nov 2019
Device: none
|
Quote:
|
|
11-17-2021, 12:23 PM | #686 |
Guru
Posts: 673
Karma: 4568205
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
|
Often after using Finereader for OCR, some paragraphs are split into two.
Like:
<p>This is a journey</p>
<p>into sound.</p>
which should be:
<p>This is a journey into sound.</p>
Doing a regex like this:
search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2
seems to work. But sometimes Finereader adds table-stuff:
<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>
<td>
<p>into sound.</p>
which the regex catches, destroying the table. Any way to catch only what is safe to replace? (And also catch cases where the last or first letter is a valid capitalized word, like "I"?)
Last edited by patrik; 11-17-2021 at 12:32 PM. |
11-17-2021, 01:06 PM | #687 |
Zealot
Posts: 103
Karma: 10
Join Date: Jun 2014
Location: Poland, Żory
Device: Prestigio PER3464B, Onyx Lynx, Lenovo S5000 i Tab4-8"
|
Try it out, for me it connects paragraphs. You can remove the characters you don't want, e.g. Polish characters.
search: ([[:alpha:],ą,ć,ę,ł,ń,ó,ś,ź,ż,,,;,:,-,–,—,“,”,])</p>\s*<p\b[^>]*>
replace: \1 |
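Why the whitespace-only version leaves tables alone can be shown with a small Python sketch. It makes several assumptions: the character class is cut down to ASCII because Python's `re` has no [[:alpha:]], the dot is allowed to match newlines in the first pattern, and the replacement adds a space after \1.

```python
import re

# ".*?" with DOTALL may cross anything between </p> and the next <p>, tables included.
any_gap = re.compile(r'([a-z])</p>.*?<p>([a-z])', re.DOTALL)
# "\s*" only allows whitespace between the closing and the opening paragraph tag.
whitespace_gap = re.compile(r'([a-z,;:-])</p>\s*<p\b[^>]*>')

split_paragraph = '<p>This is a journey</p>\n<p>into sound.</p>'
with_table = ('<p>This is a journey</p>\n<table border="1">\n<tbody>\n<tr>\n'
              '<td></td>\n<td>\n<p>into sound.</p>')

# The table markup between the two paragraphs is swallowed.
table_destroyed = any_gap.sub(r'\1 \2', with_table)
# No match, because "<table" follows the whitespace, so the text is unchanged.
table_kept = whitespace_gap.sub(r'\1 ', with_table)
# The plain split paragraph is still joined.
joined = whitespace_gap.sub(r'\1 ', split_paragraph)
```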
11-17-2021, 01:42 PM | #688 |
Guru
Posts: 673
Karma: 4568205
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
|
Thanks! Much better than my version.
Though, it does catch cases where there should be two paragraphs but a period is missing, not sure if it's possible to differentiate between these "valid" errors...? |
11-17-2021, 08:36 PM | #689 | ||
Wizard
Posts: 2,303
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Here's a PM I wrote a few months ago with examples:
* * *
The 3 main "joins" I currently use:
Search: -</p>\s+<p>
Replace: <--- (Completely blank)
and:
Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')
and:
Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)
1st one looks for a hyphen at the end of a paragraph:
Code:
<p>This is an ex-</p>
<p>ample.</p>
Code:
<p>This is an</p>
<p>example.</p>
<p>This is a list of one,</p>
<p>two, and three.</p>
Code:
<blockquote>
<p>This is a long quote.</p>
</blockquote>
<p>apples, Bananas, Pears...</p>
<p>and Croutons.</p>
- - -
Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:
Code:
<p>This is a list:</p>
<p>One, Two, Three</p>
<p>This is a quote:</p>
<p>“Get over here!”</p>
Code:
<p>This is a list: One, Two, Three</p>
<p>This is a quote: “Get over here!”</p>
- - -
Regex #1 Note: You have to be careful, DO NOT "REPLACE ALL". Not all hyphens are "soft hyphens". Some need to be replaced with an actual hyphen:
Code:
The proto-</p>
<p>European model of [...]
Code:
The proto-European model of [...]
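To make the soft/hard difference concrete, here is a sketch of joins #1 and #2 in Python. It assumes Python's `re` treats these patterns the same way Sigil does; the variable names are made up, and the samples are the ones from this post.

```python
import re

HYPHEN_JOIN = re.compile(r'-</p>\s+<p>')                     # join #1
NO_END_PUNCT_JOIN = re.compile(r'([^>”\?\!\.])</p>\s+<p>')   # join #2

soft = '<p>This is an ex-</p>\n<p>ample.</p>'
hard = 'The proto-</p>\n<p>European model of [...]'
unpunctuated = '<p>This is a list of one,</p>\n<p>two, and three.</p>'
complete = '<p>This is a long quote.</p>\n<p>and Croutons.</p>'

# Soft hyphen: replace with nothing, so the word is glued back together.
soft_joined = HYPHEN_JOIN.sub('', soft)
# Hard hyphen: replace with "-", so the real hyphen survives.
hard_joined = HYPHEN_JOIN.sub('-', hard)
# Join #2: keep the last character and add a space after it.
list_joined = NO_END_PUNCT_JOIN.sub(r'\1 ', unpunctuated)
# A paragraph ending in "." is not touched by join #2.
untouched = NO_END_PUNCT_JOIN.sub(r'\1 ', complete)
```

Running the same replacement on every hit is exactly the "REPLACE ALL" the note warns against, so in a real book each hyphen hit still needs a look.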
1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)
2. Replace all broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:
(Regex #1 alt)
Search: -</p>\s+<p>
Replace: -
This would get you:
Code:
<p>This is an ex-</p>
<p>ample.</p>
<p>This is an ex-ample.</p>
2013: "How do you deal with soft hyphens in OCR texts?"
Personally, I squash everything one-by-one during cleanup. Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!
And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the leftover hyphens there are correct OR actual hyphenation errors that snuck into the book, this is much easier.
Quote:
Here's the last 5 steps of my Saved Searches dealing with Finereader tables:
Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>
Clean Bold td
Search: <td>\s+<p><span class="bold">([^<]+)</span></p>\s+</td>
Replace: <td>\1</td>
Clean Italics td
Search: <td>\s+<p>(<span class="italics">[^<]+</span>)</p>\s+</td>
Replace: <td>\1</td>
Clean td
Search: <td>\s+<p>([^<]+)</p>\s+</td>
Replace: <td>\1</td>
Clean Table Headers
Search: <td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>
Replace: <th colspan="\1">\2</th>
* * *
For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.
Then I could open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.
Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:
Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.
Test out my 3 regexes. You'll be pleasantly surprised at how well it works.
Last edited by Tex2002ans; 11-18-2021 at 02:47 PM. |
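Running a group of saved searches in order, as described in the post above, can be imitated in a script. This sketch uses three of the five table searches; the list name `TABLE_CLEANUPS`, the function name, and the sample cells are assumptions made for illustration, and Python's `re` is assumed to treat these patterns like Sigil does.

```python
import re

# (search, replace) pairs taken from the table steps above, applied in order.
TABLE_CLEANUPS = [
    (r'<td style="vertical-align:[^"]+">', '<td>'),            # Table Alignment
    (r'<td>\s+<p>([^<]+)</p>\s+</td>', r'<td>\1</td>'),        # Clean td
    (r'<td colspan="([0-9]+)">\s+<p>([^<]+)</p>\s+</td>',
     r'<th colspan="\1">\2</th>'),                             # Clean Table Headers
]

def clean_table_cells(html: str) -> str:
    """Apply each search/replace pair to the text, one after another."""
    for search, replace in TABLE_CLEANUPS:
        html = re.sub(search, replace, html)
    return html

sample = ('<td style="vertical-align:top">\n<p>Total</p>\n</td>\n'
          '<td colspan="2">\n<p>Header</p>\n</td>')
cleaned = clean_table_cells(sample)
```

The order matters: the alignment step has to run first so that the plain "Clean td" search can see a bare <td>.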
||
11-18-2021, 11:05 AM | #690 |
Guru
Posts: 673
Karma: 4568205
Join Date: Jan 2010
Location: Sweden
Device: Kobo Forma
|
Tex2002ans, I'm constantly amazed by what amazing posts you write! Thank you very much! :-)
|
|