Regex examples - Page 46

BeckyEbook · 10-12-2020, 03:19 PM

Use:

Code:

<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[([xvi]+|\d+)\]"></a>

theducks · 10-12-2020, 04:45 PM

A WAG
I would say your OR is flawed

Code:

<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"></a>

You want 1 capture for either condition

But it may also be simplified
Use the captured value as part of the second part
title=\1

hobnail · 10-12-2020, 07:36 PM

Quote:

Originally Posted by BeckyEbook

Use:

Code:

<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[([xvi]+|\d+)\]"></a>

Thanks, that worked, as did theducks' answer (with the added square brackets for the title).

Removing the parentheses from the first part so that it's

Page_[xvi]+|\d+

also works (although no capture to reuse for the title). (I thought it worked the first time I tried it but just now it did not.)

So why did my extra parentheses screw it up?

I added the parentheses so that it was clear, to me at least, what the OR was for.

hobnail · 10-12-2020, 08:18 PM

Quote:

Originally Posted by hobnail

So why did my extra parentheses screw it up?

I think I understand it; the parentheses are telling the or bar what it's working on. It's not like a regular programming language where you could say "boolean a = (b) | (c);"

So I'm guessing I could add some extra parentheses around it and it would still work, but I haven't tested it; Page_(([xvi]+)|(\d+))

And I didn't know that you could use \1 in the same regexp; I thought you could only use it in the replacement part. That's nice to know.
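A minimal sketch of the two points above (a backreference inside the search pattern, and the extra nested parentheses), written with Python's re module as a stand-in for Sigil's PCRE. The sample line is the one from this thread; the variable names are only illustrative, and behaviour in Sigil itself is assumed, not verified here.

```python
import re

# Sample line taken from the thread.
line = '<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>'

# \1 inside the pattern must repeat exactly what group 1 captured,
# so the title has to carry the same page number as the id.
pattern = re.compile(
    r'<a id="Page_([xvi]+|\d+)" class="x-ebookmaker-pageno" title="\[\1\]"></a>'
)

match = pattern.fullmatch(line)
page = match.group(1) if match else None

# A line whose title disagrees with the id does not match.
mismatch = '<a id="Page_iv" class="x-ebookmaker-pageno" title="[12]"></a>'
rejected = pattern.fullmatch(mismatch) is None

# The nested form also matches; it simply adds two inner groups.
nested = re.search(r'Page_(([xvi]+)|(\d+))', line)
nested_groups = nested.groups()

print(page, rejected, nested_groups)
```

The nested version still works because the outer parentheses limit the scope of the bar; the inner ones only add captures (the unused alternative comes back empty).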

DNSB · 10-12-2020, 08:19 PM

Quote:

Originally Posted by hobnail

So why did my extra parentheses screw it up?

I added the parentheses so that it was clear, to me at least, what the OR was for.

In regex, parentheses are special characters. That's why you end up needing to escape a literal parenthesis with a \. (text) is a capturing group unless you start the text inside the parentheses with ?:, i.e. (?:text) for a non-capturing group. I seem to remember a 4th variety of parentheses but I'm not sure what flavour of regex that was in.

Basically, you can't use them as separators as you would in a mathematical expression.
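As a rough illustration of the difference between capturing, non-capturing and escaped parentheses, here is a small Python sketch. Python's re module is used as an assumption-laden stand-in for PCRE; for these three constructs the syntax is the same, but the sample strings are made up.

```python
import re

text = "Page_iv and Page_12"

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r"Page_([xvi]+|\d+)", text)

# Non-capturing group: the parentheses only scope the "|",
# so findall returns the whole match instead.
non_capturing = re.findall(r"Page_(?:[xvi]+|\d+)", text)

# Escaped parentheses match literal "(" and ")" characters.
literal = re.findall(r"\(text\)", "a (text) b")

print(capturing, non_capturing, literal)
```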

theducks · 10-12-2020, 08:57 PM

My GUESS is you satisfied the FIND with the first match found (no recursive +), which is why you saw the Highlight as it was

davidfor · 10-13-2020, 01:53 AM

Quote:

Originally Posted by hobnail

I don't understand why this isn't working; my search string is:

<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)|([\d]+)\]"></a>

When the file contains

<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>

and I click on the Find button, it highlights only

<a id="Page_i

What's wrong with my regexp?

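The "|" is basically an "or". Because it has the lowest precedence, it splits the whole pattern at each bar unless parentheses limit its scope. Your regex therefore searches for a match to any one of: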

Code:

<a id="Page_([xvi]+)

Code:

([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)

Code:

([\d]+)\]"></a>

I think you want:

Code:

<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>

The two groups are the page number and title in either of the formats.
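The effect described above can be reproduced with a short Python sketch. It assumes Python's re treats a top-level "|" the same way PCRE does (it does for this simple case); the two patterns are the ungrouped and grouped versions from this thread, without the optional backslash before the slash.

```python
import re

line = '<a id="Page_iv" class="x-ebookmaker-pageno" title="[iv]"></a>'

# Ungrouped: each "|" splits the WHOLE pattern into alternatives.
ungrouped = (r'<a id="Page_([xvi]+)|([\d]+)" class="x-ebookmaker-pageno" '
             r'title="\[([xvi]+)|([\d]+)\]"></a>')

# Grouped: each "|" only applies inside its own parentheses.
grouped = (r'<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" '
           r'title="\[([xvi]+|[\d]+)\]"></a>')

# The first alternative alone is enough, so the match ends inside the id.
partial = re.search(ungrouped, line).group(0)

# The grouped pattern covers the whole tag and captures both numbers.
full = re.search(grouped, line)

print(repr(partial))
print(full.group(0) == line, full.groups())
```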

Skydancer · 01-18-2021, 04:57 AM

How can I transform uppercase text into lowercase text between tags with RegEx?

Example
before:

Code:

<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>

after:

Code:

<p class="tibTrans">la ma nam dang ji dam kjil khor lha</p>

I tried this, but it doesn't work in Sigil:

Code:

Find: <p class="tibTrans">(.*?)<\/p>

Code:

Replace: <p class="tibTrans">\L$1<\/p>

Doitsu · 01-18-2021, 06:09 AM

Quote:

Originally Posted by Skydancer

I tried this, but it doesn't work in Sigil:

Code:

Find: <p class="tibTrans">(.*?)<\/p>

Code:

Replace: <p class="tibTrans">\L$1<\/p>

Sigil uses the PCRE regex library; you'll need to use backslashes for backreferences.

Code:

Replace: <p class="tibTrans">\L\1<\/p>
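For anyone doing the same job outside Sigil, here is a hedged Python equivalent. Python's re module has no \L case-conversion escape, so this sketch swaps in a replacement function that lowercases the captured text; the helper name lower_inner is made up for the example.

```python
import re

html = '<p class="tibTrans">LA MA NAM DANG JI DAM KJIL KHOR LHA</p>'

def lower_inner(m):
    # Rebuild the paragraph with the captured text lowercased.
    return '<p class="tibTrans">' + m.group(1).lower() + '</p>'

# Same search as in the post; the replacement is a function, not a string.
result = re.sub(r'<p class="tibTrans">(.*?)</p>', lower_inner, html)
print(result)
```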

hobnail · 01-20-2021, 12:19 AM

Quote:

Originally Posted by davidfor

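The "|" is basically an "or". Because it has the lowest precedence, it splits the whole pattern at each bar unless parentheses limit its scope. Your regex therefore searches for a match to any one of: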

Code:

<a id="Page_([xvi]+)

Code:

([\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+)

Code:

([\d]+)\]"></a>

I think you want:

Code:

<a id="Page_([xvi]+|[\d]+)" class="x-ebookmaker-pageno" title="\[([xvi]+|[\d]+)\]"><\/a>

The two groups are the page number and title in either of the formats.

Sorry, somehow I missed your reply (or maybe I forgot that I read it, also quite likely), so this is a belated thanks.

patrik · 11-17-2021, 12:23 PM

Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

Doing a regex like this:

search: ([a-z])</p>.*?<p>([a-z])
replace: \1 \2

seems to work. But sometimes Finereader adds table-stuff:

<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex catches and destroys the table.

Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)

bravosx · 11-17-2021, 01:06 PM

Try it out, for me it connects paragraphs. You can remove the characters you don't want, e.g. Polish characters.

search: ([[:alpha:],ą,ć,ę,ł,ń,ó,ś,ź,ż,,,;,:,-,–,—,“,”,])</p>\s*<p\b[^>]*>
replace: \1

patrik · 11-17-2021, 01:42 PM

Thanks! Much better than my version.

Though, it does catch cases where there should be two paragraphs but a period is missing, not sure if it's possible to differentiate between these "valid" errors...?

Tex2002ans · 11-17-2021, 08:36 PM

Quote:

Originally Posted by patrik

Often after using Finereader for OCR, some paragraphs are split into two.

Like:

<p>This is a journey</p>

<p>into sound.</p>

which should be: <p>This is a journey into sound.</p>

For ~9 years, I've been using 3 "join" regexes. They catch ~99% of broken paragraphs, but each match has to be decided on a case-by-case basis.

Here's a PM I wrote a few months ago with examples:

* * *

The 3 main "joins" I currently use:

Search: -</p>\s+<p>
Replace: <--- (Completely blank)

and:

Search: ([^>”\?\!\.])</p>\s+<p>
Replace: \1 <---- (There's a space after the '1')

and:

Search: <p>[a-z]
Replace: <---- (BLANK. Only use for FINDING, NOT REPLACING.)

1st one looks for a hyphen at the end of a paragraph:

Code:

<p>This is an ex-</p>
<p>ample.</p>

2nd one looks for any paragraph that ends in a NOT closing punctuation:

Code:

<p>This is an</p>
<p>example.</p>

<p>This is a list of one,</p>
<p>two, and three.</p>

and 3rd one looks for any leftover paragraphs STARTING with a lowercase letter:

Code:

<blockquote>
	<p>This is a long quote.</p>
</blockquote>

<p>apples, Bananas, Pears...</p>

<p>and Croutons.</p>

Those 3 should catch 99% of the broken paragraphs.
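A minimal Python sketch of the three joins, run against the example paragraphs above. It assumes the searches are anchored on the </p> ... <p> boundary shown in those examples, and it uses Python's re rather than Sigil's PCRE; the constant names are invented for illustration.

```python
import re

# 1) Hyphen at the end of a paragraph: drop hyphen and paragraph break.
SOFT_HYPHEN_JOIN = re.compile(r'-</p>\s+<p>')
# 2) Paragraph ending in something other than closing punctuation or a tag.
OPEN_END_JOIN = re.compile(r'([^>”?!.])</p>\s+<p>')
# 3) Paragraph starting with a lowercase letter (find only, never replace).
LOWERCASE_START = re.compile(r'<p>[a-z]')

sample1 = '<p>This is an ex-</p>\n<p>ample.</p>'
joined1 = SOFT_HYPHEN_JOIN.sub('', sample1)

sample2 = '<p>This is an</p>\n<p>example.</p>'
joined2 = OPEN_END_JOIN.sub(r'\1 ', sample2)  # note the space after \1

sample3 = ('<blockquote>\n\t<p>This is a long quote.</p>\n</blockquote>\n\n'
           '<p>apples, Bananas, Pears...</p>\n\n<p>and Croutons.</p>')
hits = LOWERCASE_START.findall(sample3)

print(joined1)
print(joined2)
print(hits)
```

In real cleanup each hit would be reviewed one at a time, as the post recommends; the blanket sub() calls here are only to show what each pattern touches.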

- - -

Usage Note: Then you just have to pay close attention to paragraphs that end in ':', because those all depend on the book/context:

Code:

<p>This is a list:</p>
<p>One, Two, Three</p>

<p>This is a quote:</p>
<p>“Get over here!”</p>

These could be:

Code:

<p>This is a list: One, Two, Three</p>

<p>This is a quote: “Get over here!”</p>

(More in-depth regex might also be needed for ” too, but I don't have any Saved Searches on that. Very rarely do I see those actually get split by Finereader. And usually the "lowercase regex" catches all those.)

- - -

Regex #1 Note: You have to be careful, DO NOT "REPLACE ALL".

Not all hyphens are "soft hyphens". Some need to be replaced with an actual hyphen:

Code:

The proto-</p>

<p>European model of [...]

would need to become:

Code:

The proto-European model of [...]

It's up to you when/how you want to deal with these. You can:

1. Deal with "soft" or "hard" hyphens on a case-by-case basis as you go through the book. (I find the vast majority out of Finereader are "soft", so 90%+ of the time I want the hyphen gone.)

2. Replace all broken paragraphs with a "hard" hyphen, then remove bad/inconsistent hyphens at a later stage:

(Regex #1 alt)

Search: -</p>\s+<p>
Replace: -

This would get you:

Code:

<p>This is an ex-</p>
<p>ample.</p>

<p>This is an ex-ample.</p>

Back in 2013, I wrote how to use "Spellcheck Lists" to catch bad/inconsistent hyphenation:

2013: "How do you deal with soft hyphens in OCR texts?"

Personally, I squash everything one-by-one during cleanup.

Finereader tends to introduce issues at page/line breaks (leaving footnotes in the text, tables smack-dab in the middle of split paragraphs, etc.), so this case-by-case hyphen fixing is also a great time to spot/correct those issues!

And then when I get to the Spellcheck List stage, all hyphens can be mass checked/corrected. And since the leftover hyphens there are correct OR actual hyphenation errors that snuck into the book, this is much easier.

Quote:

Originally Posted by patrik

But sometimes Finereader adds table-stuff:

<p>This is a journey</p>
<table border="1">
<tbody>
<tr>
<td></td>

<td>
<p>into sound.</p>

which the regex catches and destroys the table.

Back in 2020, I partially wrote about my "12-step Finereader Cleanup" (Sigil Saved Searches).

Here's the last 5 steps of my Saved Searches dealing with Finereader tables:

Remove Finereader 12 Table Alignment
Search: <td style="vertical-align:[^"]+">
Replace: <td>

Clean Bold td
Search: <td>\s+([^<]+)\s+</td>
Replace: <td>\1</td>

Clean Italics td
Search: <td>\s+([^<]+)\s+</td>
Replace: <td>\1</td>

Clean td
Search: <td>\s+([^<]+)\s+</td>
Replace: <td>\1</td>

Clean Table Headers
Search: <td colspan="([0-9]+)">\s+([^<]+)\s+</td>
Replace: <th colspan="\1">\2</th>
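To show what these table searches do, here is a small Python sketch that chains three of them exactly as listed (alignment removal, "Clean td", and "Clean Table Headers"). The input HTML is invented for the example, and Python's re stands in for Sigil's PCRE.

```python
import re

# Hypothetical Finereader-style cells, not taken from a real export.
html = ('<td style="vertical-align:top;">\n  Grand Total\n</td>'
        '<td colspan="2">\n  Header\n</td>')

# Remove Finereader 12 Table Alignment
step1 = re.sub(r'<td style="vertical-align:[^"]+">', '<td>', html)

# Clean td: trim the whitespace padding around plain cell text
step2 = re.sub(r'<td>\s+([^<]+)\s+</td>', r'<td>\1</td>', step1)

# Clean Table Headers: a padded colspan cell becomes a <th>
step3 = re.sub(r'<td colspan="([0-9]+)">\s+([^<]+)\s+</td>',
               r'<th colspan="\1">\2</th>', step2)

print(step3)
```

Each pattern requires whitespace on both sides of the text, so cells that are already tidy are left alone.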

* * *

For ~9 years, I've had those 12 steps stored in my Sigil Saved Searches. 99% of the Finereader HTML cruft is cleaned up and normalized.

Then I could open up a Finereader EPUB, run the group of searches, and within seconds... boom... clean code to use as a base.

Here's an example of an Archive.org book I generated through Finereader PDF -> EPUB -> 12-step cleanup:

2021: "Archive.org ePub" (Post #11)

Seconds to create that EPUB. And compared to the automatically generated "EPUB" version hosted on Archive.org, mine blows it away.

Quote:

Originally Posted by patrik

Any way to catch only what is safe to replace? (And catch where the last or first letter is a valid word with capital letter (like "I")?)

Test out my 3 regexes. You'll be pleasantly surprised at how well it works.

patrik · 11-18-2021, 11:05 AM

Tex2002ans, I'm constantly amazed at what amazing posts you write! Thank you very much! :-)

