01-25-2023, 11:06 AM | #1 |
Guru
Posts: 751
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
Words Split w/ "id=" Stuff
I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:
Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
Code:
ele<a id="page_330"></a>phant
Code:
\w<[^/].+?></.+?>\w
SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:
Non-Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<\w.+?></\w*?>)(\w+?\b)
REPLACE: \1\3\2
Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
Last edited by enuddleyarbl; 02-16-2023 at 09:54 PM. Reason: Summarizing results |
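For anyone who wants to try the two saved searches outside the Calibre editor, here is a minimal sketch using Python's standard `re` module. The patterns are copied verbatim from the summarizing edit; the function name `move_tags_after_word` and the sample strings are made up for illustration, and Calibre's own regex engine may differ in small ways.

```python
import re

# The two saved searches from the summarizing edit, copied verbatim.
PAIRED = re.compile(r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)')
SELF_CLOSED = re.compile(r'(\b\w+?)(<[^/]+?/>)(\w+?\b)')


def move_tags_after_word(html: str) -> str:
    """Rejoin a split word and put the interrupting tag after it.

    Hypothetical helper: runs both searches in turn with the
    replacement \\1\\3\\2 (front fragment, rear fragment, tag).
    """
    html = PAIRED.sub(r'\1\3\2', html)
    html = SELF_CLOSED.sub(r'\1\3\2', html)
    return html


samples = [
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'an ele<a id="page_330"></a>phant in the zoo',
    'ele<span role="doc-pagebreak" id="pg24" aria-label="24"/>phant',
]
for sample in samples:
    print(move_tags_after_word(sample))
```

One thing to watch: the `.+?` in the paired-tag search can run past a `>` while it hunts for `></`, so on a line with other inline tags it may grab more than one tag. Reviewing each hit before replacing is probably safer than Replace All.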
01-25-2023, 12:18 PM | #2 |
Resident Curmudgeon
Posts: 76,415
Karma: 136564696
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
I use Diap's Editing Toolbag to remove page numbers inside the HTML. It's very easy to use. It's an editor plugin for Calibre.
|
01-25-2023, 12:30 PM | #3 |
Guru
Posts: 751
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
Oops. Now that you mention it, I don't actually want to REMOVE those tags with the ids and stuff. I just want to move them outside of the word. I think I'll have to wrap the tags in a group as well and then add that to the end. Let me test it and if it works, I'll update my OP. Again.
|
01-25-2023, 12:59 PM | #4 |
Guru
Posts: 751
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
Nope. Moving those tags from the middle of the word is harder than REmoving them. To move them, it looks like I'm back to needing to select the whole front and rear word fragments outside the tags. And, I haven't been able to do that yet.
EDIT: Let me stick some trials in here until I figure something out. First, the OR ("|") is giving me issues with the replacement strings. So, I'm just going to work with the non-self-terminated tags. Second, it looks like I can grab some form of the front/rear word fragments with
Code:
\b
Code:
SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
Last edited by enuddleyarbl; 01-25-2023 at 06:04 PM. |
01-25-2023, 01:40 PM | #5 | |
A Hairy Wizard
Posts: 3,223
Karma: 19000635
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Quote:
Code:
SEARCH: (\w)(<[^/].+?></.+?>)(\w)|(\w)(<[^/].+?/>)(\w)
REPLACE: \2\1\3 -or- \1\3\2
Last edited by Turtle91; 01-25-2023 at 01:42 PM. |
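A possible reason the OR version misbehaves with those replacement strings: each alternative has its own group numbers (1-3 for the paired tags, 4-6 for the self-terminated ones), so a replacement that only names \1, \2 and \3 has nothing to insert when the second alternative is the one that matched. The sketch below shows this with Python's standard `re` module (3.5 or later, where unset groups are substituted as empty strings). That Calibre's editor behaves the same way is an assumption, and the sample string is made up.

```python
import re

# The combined search from the post above, copied verbatim.
combined = re.compile(r'(\w)(<[^/].+?></.+?>)(\w)|(\w)(<[^/].+?/>)(\w)')

self_closed = 'ele<span id="pg24"/>phant'

# Only the second alternative matches here, so groups 1-3 are unset
# and the whole match is replaced by nothing.
lost = combined.sub(r'\1\3\2', self_closed)

# Naming all six groups covers whichever alternative matched.
kept = combined.sub(r'\1\3\2\4\6\5', self_closed)

print(lost)  # elhant
print(kept)  # elep<span id="pg24"/>hant
```

Even with all six groups named, the single-character `(\w)` groups only swap one letter across the tag, so the word stays split. Capturing the whole front and rear fragments is a separate fix.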
|
01-25-2023, 02:02 PM | #6 | |
Guru
Posts: 751
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
Code:
SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
EDIT: To the best of my knowledge, the above search should set the first replacement group as starting from the nearest word boundary and running to the starting "<" of the interrupting tags. The second group should be everything from there that's in a <blah></somethingelse> pair. The third group should start from there and run to the next word boundary. The replacement of \1\3\2 sticks the first and last bits of the word together and then appends the tag set afterward.
EDIT 2: I had a spurious plus ("+") in the search string for the closing tag. That made it look for at least one character after the "/" and if it didn't find one inside the tag, it happily continued looking until it either found one somewhere else or ran out of paragraph. I think I've fixed it (again). Sorry.
Last edited by enuddleyarbl; 01-25-2023 at 05:58 PM. |
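A quick check of this trial against the two examples from the first post, sketched with Python's standard `re` module (an assumption: Calibre's editor may not behave identically). The closing-tag part `</.?>` allows at most one character for the tag name, so it can match `</a>` but not `</span>`. The search in the summarizing edit of the first post uses `</\w*?>` instead and handles both. The variable names are made up for illustration.

```python
import re

# Trial search from this post, and the later search from the
# summarizing edit in the first post, both copied verbatim.
trial = re.compile(r'(\b\w+?)(<[^/]+?></.?>)(\w+?\b)')
final = re.compile(r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)')

anchor = 'ele<a id="page_330"></a>phant'
span = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'

for name, pattern in (('trial', trial), ('final', final)):
    for text in (anchor, span):
        # An unchanged line means the search did not match.
        print(name, '->', pattern.sub(r'\1\3\2', text))
```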
|
01-25-2023, 02:28 PM | #7 |
Grand Sorcerer
Posts: 5,527
Karma: 100606001
Join Date: Apr 2011
Device: pb360
|
This is just an idea, I have no idea whether it is actually easier to implement.
Have you tried moving the initial word fragment to after the tag? |
01-25-2023, 04:51 PM | #8 |
Guru
Posts: 751
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
I'm going to go the easy route and not bother putting an OR inside the search. I'll just have two different searches for this. The first will be what I did, above, for non-self-terminated tags. This is the search string for the self-terminated tags:
Code:
(\b\w+?)(<[^/]+?/>)(\w+?\b)
Code:
\1\3\2
Code:
\2\1\3 |
01-25-2023, 06:53 PM | #9 | ||
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
If a page break lands in the middle of a word, they should've been shoving the page numbers AFTER. See Daisy.org: "Page Navigation":
Quote:
My Solution
I'd tackle it using:
Find #1: (<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)
Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)
Replace: \2\1
This would convert your examples into:
Code:
elephant<span epub:type="pagebreak" id="page_330" title="330"></span>
elephant<a id="page_330"></a>
What's the Regex Doing?
Well, the 1st half is saying:
(Similar with the <a> page number version.)
What's the 2nd half doing?
The Replace is saying:
Last edited by Tex2002ans; 01-25-2023 at 07:05 PM. |
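The two Finds and the Replace above can be tried outside the editor with a short sketch like this one, using Python's standard `re` module. The patterns are copied verbatim from the post; the function name `push_page_markers_after` and the longer sample sentence are made up for illustration.

```python
import re

# Find #1 and Find #2 from the post, copied verbatim.
FIND_SPAN = re.compile(
    r'(<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)'
)
FIND_A = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')


def push_page_markers_after(html: str) -> str:
    """Swap each page marker with the word characters that follow it."""
    for pattern in (FIND_SPAN, FIND_A):
        # Replace: \2\1 puts the text first and the marker second.
        html = pattern.sub(r'\2\1', html)
    return html


print(push_page_markers_after(
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'
))
print(push_page_markers_after('There was an ele<a id="page_330"></a>phant.'))
```

Because the second group also accepts `”`, `?`, `!` and `.`, trailing punctuation ends up before the marker as well. These searches do not look at what comes before the marker, so a marker that sits in front of a whole word is moved behind that word too.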
||
01-25-2023, 07:32 PM | #10 |
Guru
Posts: 751
Karma: 1537886
Join Date: Sep 2013
Device: Kobo Forma
|
I'm assuming that the publisher just ran some automated "stick a page id in here somewhere" program that looked at a print book and, at the start of each dead-tree page, stuck a tag into the ebook. Probably 1) no human ever saw it, 2) when they made the ebook there might not have been any standards, and 3) no one ever looks back at the horrible stuff they did in the dark ages to make it better.
Of course, on the glass half-full side of things, if they finagled those page id locations to be in the next space, then when someone referred to a bit of text by page number, an ebook user might not be able to find it. Although, occasionally being half a word off shouldn't be too onerous.
EDIT: From that "Page Navigation" link you provided, inline page markers are supposed to look something like: "<span role="doc-pagebreak" id="pg24" aria-label="24"/>" Yet, I don't think I've ever seen anything like it. Ninety-nine percent of the time, it'll be the old <a id="page_330"></a> method, which that document specifically says bad things about. Occasionally, I'll see something like what's done in the current book I'm editing ("<span epub:type="pagebreak" id="page_330" title="330"></span>") which seems to be making some kind of effort. At some point (probably about where I've finished re-formatting all the books in my library), I might know enough about this stuff to have realized I should have changed all those references while I was in there fooling around.
Last edited by enuddleyarbl; 01-25-2023 at 07:46 PM. |
01-25-2023, 08:11 PM | #11 | |||
Wizard
Posts: 2,304
Karma: 12587727
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Code:
There was an ele-
-------PAGE 2-------
phant in the zoo.
Code:
There was an ele<a id="page_2"></a>phant in the zoo.
Quote:
For all the latest "Real Page Numbers" (RPNs) stuff, see my post here:
where I link to many of the previous topics. You'll also want to type this into your favorite search engine:
Code:
RPNs Tex2002ans site:mobileread.com
page numbers Tex2002ans site:mobileread.com
- - -
For a working sample of EPUB3 page numbers, see Doitsu's sample book. And his fantastic Sigil plugin:
- - -
Quote:
The simple <a> was the EPUB2 method. The <span> + epub:type="pagebreak" is the EPUB3 method.
Last edited by Tex2002ans; 01-25-2023 at 08:18 PM. |
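If someone does want to change the old EPUB2-style markers into the EPUB3-style ones while they're in there, a rough sketch with Python's standard `re` module might look like this. It assumes the ids always follow the `page_N` naming seen in this thread and that the target form is the `<span epub:type="pagebreak" ...>` shown in the first post; the function name `to_epub3_pagebreak` is made up for illustration.

```python
import re

# Assumption: every EPUB2 marker is an empty anchor whose id is
# "page_" followed by digits, as in the examples in this thread.
EPUB2_MARKER = re.compile(r'<a id="page_(\d+)"></a>')


def to_epub3_pagebreak(html: str) -> str:
    """Rewrite <a id="page_N"></a> as an epub:type="pagebreak" span."""
    return EPUB2_MARKER.sub(
        r'<span epub:type="pagebreak" id="page_\1" title="\1"></span>',
        html,
    )


print(to_epub3_pagebreak(
    'There was an elephant<a id="page_2"></a> in the zoo.'
))
```

The `epub:` prefix only validates if the file's `<html>` element declares `xmlns:epub="http://www.idpf.org/2007/ops"`, so this is only the inline half of the job. Page numbers such as roman numerals would need a looser pattern.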
|||
01-26-2023, 05:47 AM | #12 | |
Resident Curmudgeon
Posts: 76,415
Karma: 136564696
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
01-26-2023, 06:52 AM | #13 |
the rook, bossing Never.
Posts: 12,359
Karma: 92073397
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
What he says ^^^^
If I happen to be fixing formatting I also delete all that junk. Pretty quick using global regex or the delete/edit tag tool. |