Words Split w/ "id=" Stuff

enuddleyarbl · 01-25-2023, 10:06 AM

I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:

Code:

ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant

or, the more common, simpler variety:

Code:

ele<a id="page_330"></a>phant

The find string I'm currently using to find these artificial breaks is:

Code:

\w<[^/].+?></.+?>\w

\w matches any word character (equivalent to [a-zA-Z0-9_])
< matches the character < with index 60 decimal (3C hex or 74 octal) literally (case sensitive)
Match a single character not present in the list below [^/]
/ matches the character / with index 47 decimal (2F hex or 57 octal) literally (case sensitive)
. matches any character (except for line terminators)
+? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
></ matches the characters ></ literally (case sensitive)
. matches any character (except for line terminators)
+? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
> matches the character > with index 62 decimal (3E hex or 76 octal) literally (case sensitive)
\w matches any word character (equivalent to [a-zA-Z0-9_])

That seems to be working. But, is there some way to manage an automatic replacement? I'd assume that instead of looking for a word character snubbed up to the start of a tag (i.e., "<") and also one stuck to the ending tag (i.e., </...>), I'd look for a "word" touching those tags. But, my regex isn't good enough. Any suggestions?
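For anyone who wants to poke at the find string outside the editor, here is a minimal sketch using Python's re module. That is an assumption on my part: the Calibre editor's own regex engine may differ in details, and the third sample line is made up purely to show a case that should not match.

```python
import re

# The find string from the post, unchanged.
FIND = re.compile(r'\w<[^/].+?></.+?>\w')

samples = [
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'ele<a id="page_330"></a>phant',
    # Hypothetical sample: the tag sits between two words, not inside one.
    'an elephant <a id="page_331"></a>walked',
]

def first_hit(text):
    """Return the matched text, or None when the pattern finds nothing."""
    match = FIND.search(text)
    return match.group(0) if match else None

hits = [first_hit(s) for s in samples]
for sample, hit in zip(samples, hits):
    print(repr(hit), '<-', sample)
```

The match only covers one word character on each side of the tags, which is why it works for finding the breaks but is not enough for replacing them.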

SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:

Non-Self-Terminating Tags:

Code:

FIND: (\b\w+?)(<\w.+?></\w*?>)(\w+?\b)
REPLACE: \1\3\2

Self-Terminating Tags:

Code:

FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
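
As a rough check of the two saved searches, the sketch below runs them one after the other with Python's re module (an assumption; it is not the Calibre editor itself). The function name move_tags_after_word and the sample strings are made up for illustration.

```python
import re

# The two saved searches from the summary, with their shared replacement.
NON_SELF_TERMINATING = r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)'
SELF_TERMINATING = r'(\b\w+?)(<[^/]+?/>)(\w+?\b)'
REPLACE = r'\1\3\2'

def move_tags_after_word(html):
    """Run both searches in turn, the way the post runs both saved searches."""
    html = re.sub(NON_SELF_TERMINATING, REPLACE, html)
    return re.sub(SELF_TERMINATING, REPLACE, html)

fixed_a = move_tags_after_word('ele<a id="page_330"></a>phant')
fixed_span = move_tags_after_word(
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant')
fixed_self = move_tags_after_word('ele<a id="page_330"/>phant')
fixed_sentence = move_tags_after_word(
    'There was an ele<a id="page_2"></a>phant in the zoo.')

for result in (fixed_a, fixed_span, fixed_self, fixed_sentence):
    print(result)
```

Note that [^/] in the self-terminating search stops at the first slash, so a tag whose attributes contain a "/" (a URL, say) would probably need a different pattern.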

JSWolf · 01-25-2023, 11:18 AM

I use Diaps Editing Toolbag to remove page numbers inside the HTML. It's very easy to use. It's an editor plugin for Calibre.

enuddleyarbl · 01-25-2023, 11:30 AM

Quote:

Originally Posted by JSWolf

I use Diaps Editing Toolbag to remove page numbers inside the HTML. It's very easy to use. It's an editor plugin for Calibre.

Oops. Now that you mention it, I don't actually want to REMOVE those tags with the ids and stuff. I just want to move them outside of the word. I think I'll have to wrap the tags in a group as well and then add that to the end. Let me test it and if it works, I'll update my OP. Again.

enuddleyarbl · 01-25-2023, 11:59 AM

Nope. Moving those tags from the middle of the word is harder than REmoving them. To move them, it looks like I'm back to needing to select the whole front and rear word fragments outside the tags. And, I haven't been able to do that yet.

EDIT: Let me stick some trials in here until I figure something out. First, the OR ("|") is giving me issues with the replacement strings. So, I'm just going to work with the non-self-terminated tags. Second, it looks like I can grab some form of the front/rear word fragments with \b:

Code:

SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2

1st Capturing Group (\b\w+?)
- \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)
- \w matches any word character (equivalent to [a-zA-Z0-9_])
- +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
2nd Capturing Group (<[^/]+?></.?>)
- < matches the character < with index 60 decimal (3C hex or 74 octal) literally (case sensitive)
- Match a single character not present in the list below [^/]
- +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
- / matches the character / with index 47 decimal (2F hex or 57 octal) literally (case sensitive)
- ></ matches the characters ></ literally (case sensitive)
- . matches any character (except for line terminators)
- ? matches the previous token between zero and one times, as many times as possible, giving back as needed (greedy)
- > matches the character > with index 62 decimal (3E hex or 76 octal) literally (case sensitive)
3rd Capturing Group (\w+?\b)
- \w matches any word character (equivalent to [a-zA-Z0-9_])
- +? matches the previous token between one and unlimited times, as few times as possible, expanding as needed (lazy)
- \b assert position at a word boundary: (^\w|\w$|\W\w|\w\W)

For the non-self-terminating-tag case, that seems to work fine.
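
One thing that may be worth checking with this trial version: the ".?" in the closing tag matches at most one character, so it looks like it only fits one-letter closing tags such as </a>. A small sketch with Python's re module (an assumption, not the Calibre editor) suggests the <span> example is left untouched. The summarizing edit in the first post uses </\w*?> for the closing tag instead.

```python
import re

# The trial search and replacement from this post, unchanged.
TRIAL = r'(\b\w+?)(<[^/]+?></.?>)(\w+?\b)'
REPLACE = r'\1\3\2'

a_input = 'ele<a id="page_330"></a>phant'
span_input = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'

# </a> has a one-character tag name, so ".?" can cover it.
short_close = re.sub(TRIAL, REPLACE, a_input)
# </span> has four characters between "</" and ">", so nothing matches.
long_close = re.sub(TRIAL, REPLACE, span_input)

print(short_close)
print(long_close == span_input)
```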

Turtle91 · 01-25-2023, 12:40 PM

Quote:

Originally Posted by enuddleyarbl

EDIT Again: Not Solved. The following removes the tags, but I actually want to move them. See later in the thread. Solved, I think. First, I had to change my search for split words to include self-terminated tags. Then I just had to group the bits of words found by the search string and concatenate them in the replace string:

Code:

SEARCH: (\w)<[^/].+?></.+?>(\w)|(\w)<[^/].+?/>(\w)
REPLACE: \1\2

At a quick glance (I haven’t tested and am responding from my phone) I would say you need to provide a capture for the id= portion as well if you want to just MOVE it. So the search/replace would look something like this:

Code:

SEARCH: (\w)(<[^/].+?></.+?>)(\w)|(\w)(<[^/].+?/>)(\w)
REPLACE: \2\1\3 -or- \1\3\2

enuddleyarbl · 01-25-2023, 01:02 PM

Quote:

Originally Posted by Turtle91

At a quick glance (I haven’t tested and am responding from my phone) I would say you need to provide a capture for the id= portion as well if you want to just MOVE it. So the search/replace would look something like this:

Code:

SEARCH: (\w)(<[^/].+?></.+?>)(\w)|(\w)(<[^/].+?/>)(\w)
REPLACE: \2\1\3 -or- \1\3\2

I'll respond with the crux of my above edit. Yep. I had to grab the tag stuff (the id= stuff). Plus, I had to get the whole word fragments before and after the tags (in order to move the tags after the whole word).

Code:

SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2

Unfortunately, I still can't figure out how to handle replacement groups with an OR ("|") in the midst of the search string. So, I stuck with the non-self-terminating-tag option.
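
A possible way around the OR problem, sketched with Python's re module (an assumption; untested in the Calibre editor): when the "|" separates two complete alternatives, the groups in the second alternative are numbered 4, 5 and 6, so \1\3\2 has nothing to work with whenever that alternative is the one that matched. Putting the "|" inside the tag group keeps the numbering at 1, 2, 3 for both tag styles. The COMBINED pattern below is hypothetical, stitched together from the searches in this thread.

```python
import re

# Hypothetical combined search: the alternation lives inside group 2,
# so the groups are numbered 1, 2, 3 whichever tag style matches.
COMBINED = r'(\b\w+?)(<[^/]+?></\w*?>|<[^/]+?/>)(\w+?\b)'
REPLACE = r'\1\3\2'

paired = re.sub(COMBINED, REPLACE, 'ele<a id="page_330"></a>phant')
self_closed = re.sub(COMBINED, REPLACE, 'ele<a id="page_330"/>phant')

print(paired)
print(self_closed)
```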

EDIT: To the best of my knowledge, the above search should set the first replacement group as starting from the nearest word boundary and running to the starting "<" of the interrupting tags.
The second group should be everything from there that's in a <blah></somethingelse> pair.
The third group should start from there and run to the next word boundary.
The replacement of \1\3\2 sticks the first and last bits of the word together and then appends the tag set afterward.

EDIT 2: I had a spurious plus ("+") in the search string for the closing tag. That made it look for at least one character after the "/" and if it didn't find one inside the tag, it happily continued looking until it either found one somewhere else or ran out of paragraph. I think I've fixed it (again). Sorry.

j.p.s · 01-25-2023, 01:28 PM

This is just an idea, I have no idea whether it is actually easier to implement.

Have you tried moving the initial word fragment to after the tag?

enuddleyarbl · 01-25-2023, 03:51 PM

I'm going to go the easy route and not bother putting an OR inside the search. I'll just have two different searches for this. The first will be what I did, above, for non-self-terminated tags. This is the search string for the self-terminated tags:

Code:

(\b\w+?)(<[^/]+?/>)(\w+?\b)

The replacement string is the same for both cases:

Code:

\1\3\2

@j.p.s moving the initial word fragment to after the tag uses the same search strings as above. The difference would be the replace string:

Code:

\2\1\3
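
To see the two replace strings side by side, here is a short sketch in Python's re module (an assumption, not the Calibre editor), using the self-terminated search and a made-up sample sentence:

```python
import re

# The self-terminated-tag search from this post.
SEARCH = r'(\b\w+?)(<[^/]+?/>)(\w+?\b)'
text = 'an ele<a id="page_330"/>phant walked'

# \1\3\2 joins the word and puts the tag after it.
tag_after = re.sub(SEARCH, r'\1\3\2', text)
# \2\1\3 puts the tag in front of the rejoined word instead.
tag_before = re.sub(SEARCH, r'\2\1\3', text)

print(tag_after)
print(tag_before)
```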

Tex2002ans · 01-25-2023, 05:53 PM

Quote:

Originally Posted by enuddleyarbl

OP: I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:

Code:

ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant

or, the more common, simpler variety:

Code:

ele<a id="page_330"></a>phant

This code is bad practice anyway.

If a page break lands in the middle of a word, they should've been shoving the page numbers AFTER.

See Daisy.org: "Page Navigation":

Quote:

Where do I put the page break if a word is hyphenated across a page?

Place the page marker after the word. Do not retain the print hyphenation and insert the number in the middle of the word.

Anyway, remember to KISS (Keep It Simple, Stupid)!

My Solution

I'd tackle it using:

Find #1: (<span epub:type="pagebreak"[^>]+></span>)([\w”\?!\.]+)

Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)

Replace: \2\1

This would convert your examples into:

Code:

elephant<span epub:type="pagebreak" id="page_330" title="330"></span>

elephant<a id="page_330"></a>

You can also tweak that regex + list of punctuation as needed.

What's the Regex Doing?

Well, the 1st half is saying:

<span epub:type="pagebreak" = "Hey! Look for any spans with the pagebreak!"
[^>]+> = "then keep on grabbing everything in the span until you reach the closing bracket."

(Similar with the <a> page number version.)

What's the 2nd half doing?

\w = "Look for ANY LETTER."
” = "Look for any RIGHT QUOTATION MARK"
\? = "Look for any QUESTION MARK"
! = "Look for any EXCLAMATION POINT"
\. = "Look for any PERIOD"
+ = "Keep grabbing as many of these letters/punctuation as you can."

The Replace is saying:

\2 = "You know all those letters/punctuation we captured? Yep. Put it first."
\1 = "You know all those page <span>s or <a>s we captured? Yep. Put it after."
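
A rough Python sketch of the two finds, for anyone who wants to try them outside the editor. Assumptions: Python's re module rather than Calibre's or Sigil's engine, and Find #1 written out from the explanation above as a <span epub:type="pagebreak" ...></span> pattern. The third sample sentence is made up.

```python
import re

# Find #1 is written out from the explanation above (an assumption).
FIND_SPAN = r'(<span epub:type="pagebreak"[^>]+></span>)([\w”\?!\.]+)'
FIND_A = r'(<a id="page_\d+"></a>)([\w”\?!\.]+)'
REPLACE = r'\2\1'

span_fixed = re.sub(
    FIND_SPAN, REPLACE,
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant')
a_fixed = re.sub(FIND_A, REPLACE, 'ele<a id="page_330"></a>phant')

# A marker that already sits between two words is also matched, so it
# hops past the word that follows it.
between = re.sub(FIND_A, REPLACE, 'the zoo. <a id="page_331"></a>Next day')

for result in (span_fixed, a_fixed, between):
    print(result)
```

Unlike the searches earlier in the thread, these do not require a word fragment in front of the tag, so every page marker directly followed by a word gets moved, not only the ones splitting a word.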

enuddleyarbl · 01-25-2023, 06:32 PM

I'm assuming that the publisher just ran some automated "stick a page id in here somewhere" program that looked at a print book and, at the start of each dead-tree page, stuck a tag into the ebook. Probably 1) no human ever saw it, 2) when they made the ebook there might not have been any standards, and 3) no one ever looks back at the horrible stuff they did in the dark ages to make it better.

Of course, on the glass half-full side of things, if they finagled those page id locations to be in the next space, then when someone referred to a bit of text by page number, an ebook user might not be able to find it. Although, occasionally being half a word off shouldn't be too onerous.

EDIT: From that "Page Navigation" link you provided, inline page markers are supposed to look something like:
"<span role="doc-pagebreak" id="pg24" aria-label="24"/>"

Yet, I don't think I've ever seen anything like it. Ninety-nine percent of the time, it'll be the old <a id="pag_330"></a> method, which that document specifically says bad things about. Occasionally, I'll see something like what's done in the current book I'm editing ("<span epub:type="pagebreak" id="page_330" title="330"></span>") which seems to be making some kind of effort. At some point (probably about where I've finished re-formatting all the books in my library), I might know enough about this stuff to have realized I should have changed all those references while I was in there fooling around.

Tex2002ans · 01-25-2023, 07:11 PM

Quote:

Originally Posted by enuddleyarbl

I'm assuming that the publisher just ran some automated "stick a page id in here somewhere" program that looked at a print book and at the start of each dead-tree page, stuck a tag into the ebook.

They most likely placed the page break smack dab in the middle where it occurred in the print book.

Code:

There was an ele-

-------PAGE 2-------

phant in the zoo.

would then be converted to:

Code:

There was an ele<a id="page_2"></a>phant in the zoo.

This is a case where the page number code should've been adjusted afterwards.

Quote:

Originally Posted by enuddleyarbl

EDIT: From that "Page Navigation" link you provided, inline page markers are supposed to look something like:

Listen to it for some advice, not all.

For all the latest "Real Page Numbers" (RPNs) stuff, see my post here:

where I link to many of the previous topics.

You'll also want to type this into your favorite search engine:

Code:

RPNs Tex2002ans site:mobileread.com
page numbers Tex2002ans site:mobileread.com

where me + Hitch + Doitsu have discussed this topic to death.

- - -

For a working sample of EPUB3 page numbers, see Doitsu's sample book.

And his fantastic Sigil plugin:

[Plugin] PageList - Generates print edition page numbers

- - -

Quote:

Originally Posted by enuddleyarbl

Ninety-nine percent of the time, it'll be the old <a id="pag_330"></a> method, which that document specifically says bad things about. Occasionally, I'll see something like what's done in the current book I'm editing ("<span epub:type="pagebreak" id="page_330" title="330"></span>")

Yes.

The simple <a> was the EPUB2 method.

The <span> + epub:type="pagebreak" is the EPUB3 method.

JSWolf · 01-26-2023, 04:47 AM

Quote:

Originally Posted by enuddleyarbl

Oops. Now that you mention it, I don't actually want to REMOVE those tags with the ids and stuff. I just want to move them outside of the word. I think I'll have to wrap the tags in a group as well and then add that to the end. Let me test it and if it works, I'll update my OP. Again.

Unless there are things like footnotes or other links to these page numbers, you don't need them and removing them is the easiest way to do it.

Quoth · 01-26-2023, 05:52 AM

What he says ^^^^
If I happen to be fixing formatting I also delete all that junk. Pretty quick using global regex or the delete/edit tag tool.


