#1 |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
Words Split w/ "id=" Stuff
I was getting tired of the Calibre Editor's spellchecker being stymied by artificially split words (IOW, the publisher added various tags for things like "id=" right in the middle of them). For instance:
Code:
ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant
Code:
ele<a id="page_330"></a>phant
Code:
\w<[^/].+?></.+?>\w
SUMMARIZING EDIT: From the material below, I've come up with a way of moving those tags from the middle of the word to the end. I've created two saved searches (one for non-self-terminating tags and one for self-terminating ones). I run both since, for some reason, some of my books use both methods:
Non-Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<\w.+?></\w*?>)(\w+?\b)
REPLACE: \1\3\2
Self-Terminating Tags:
Code:
FIND: (\b\w+?)(<[^/]+?/>)(\w+?\b)
REPLACE: \1\3\2
Last edited by enuddleyarbl; 02-16-2023 at 08:54 PM. Reason: Summarizing results |
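For anyone who wants to try the two saved searches outside the Calibre editor, here is a minimal Python sketch. It assumes the standard `re` module treats these patterns the same way the editor does, which I haven't verified; the function name `move_tags_to_word_end` and the sample strings are made up for illustration.

```python
import re

# The two saved searches from the summary, compiled as Python patterns.
PAIRED = re.compile(r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)')      # <tag ...></tag>
SELF_CLOSED = re.compile(r'(\b\w+?)(<[^/]+?/>)(\w+?\b)')      # <tag ... />
REPLACEMENT = r'\1\3\2'  # front fragment + rear fragment + tag


def move_tags_to_word_end(html: str) -> str:
    """Run both searches in turn (hypothetical helper, not part of Calibre)."""
    html = PAIRED.sub(REPLACEMENT, html)
    return SELF_CLOSED.sub(REPLACEMENT, html)


samples = [
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'ele<a id="page_330"></a>phant',
    'ele<span role="doc-pagebreak" id="pg24" aria-label="24"/>phant',
]
for sample in samples:
    print(move_tags_to_word_end(sample))
```

Each sample should come out with the word rejoined and the tag sitting right after it, so the spellchecker sees "elephant" as one word.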
#2 |
Resident Curmudgeon
Posts: 79,654
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
I use Diap's Editing Toolbag to remove page numbers inside the HTML. It's very easy to use. It's an editor plugin for Calibre.
|
#3 | |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
|
#4 |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
Nope. Moving those tags from the middle of the word is harder than REmoving them. To move them, it looks like I'm back to needing to select the whole front and rear word fragments outside the tags. And, I haven't been able to do that yet.
EDIT: Let me stick some trials in here until I figure something out. First, the OR ("|") is giving me issues with the replacement strings. So, I'm just going to work with the non-self-terminated tags. Second, it looks like I can grab some form of the front/rear word fragments with
Code:
\b
Code:
SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
Last edited by enuddleyarbl; 01-25-2023 at 05:04 PM. |
#5 | |
A Hairy Wizard
Posts: 3,346
Karma: 20171571
Join Date: Dec 2012
Location: Charleston, SC today
Device: iPhone 15/11/X/6/iPad 1,2,Air & Air Pro/Surface Pro/Kindle PW & Fire
|
Quote:
Code:
SEARCH: (\w)(<[^/].+?></.+?>)(\w)|(\w)(<[^/].+?/>)(\w)
REPLACE: \2\1\3 -or- \1\3\2
Last edited by Turtle91; 01-25-2023 at 12:42 PM. |
|
#6 | |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
Quote:
Code:
SEARCH: (\b\w+?)(<[^/]+?></.?>)(\w+?\b)
REPLACE: \1\3\2
EDIT: To the best of my knowledge, the above search should set the first replacement group as starting from the nearest word boundary and running to the starting "<" of the interrupting tags. The second group should be everything from there that's in a <blah></somethingelse> pair. The third group should start from there and run to the next word boundary. The replacement of \1\3\2 sticks the first and last bits of the word together and then appends the tag set afterward.
EDIT 2: I had a spurious plus ("+") in the search string for the closing tag. That made it look for at least one character after the "/" and, if it didn't find one inside the tag, it happily continued looking until it either found one somewhere else or ran out of paragraph. I think I've fixed it (again). Sorry.
Last edited by enuddleyarbl; 01-25-2023 at 04:58 PM. |
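One thing worth checking in this trial pattern: the closing-tag part `</.?>` allows at most one character after the "/", so it can match `</a>` but not `</span>`. The summary in post #1 uses `</\w*?>` instead, which accepts a tag name of any length. A small Python sketch to compare the two, assuming Python's `re` module behaves like the editor's regex engine for these patterns (the variable names are just for illustration):

```python
import re

# Trial pattern from this post: at most one character after "/" in the closing tag.
TRIAL = re.compile(r'(\b\w+?)(<[^/]+?></.?>)(\w+?\b)')
# Pattern from the summary in post #1: any run of word characters after "/".
SUMMARY = re.compile(r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)')

anchor_case = 'ele<a id="page_330"></a>phant'
span_case = 'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant'

for name, pattern in (("trial", TRIAL), ("summary", SUMMARY)):
    for text in (anchor_case, span_case):
        hit = pattern.search(text)
        print(name, "matches" if hit else "misses", text)
```

The trial pattern should catch the `<a>` example and miss the `<span>` one, while the summary pattern catches both.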
|
#7 |
Grand Sorcerer
Posts: 5,778
Karma: 103362673
Join Date: Apr 2011
Device: pb360
|
This is just an idea, I have no idea whether it is actually easier to implement.
Have you tried moving the initial word fragment to after the tag? |
#8 |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
I'm going to go the easy route and not bother putting an OR inside the search. I'll just have two different searches for this. The first will be what I did, above, for non-self-terminated tags. This is the search string for the self-terminated tags:
Code:
(\b\w+?)(<[^/]+?/>)(\w+?\b)
Code:
\1\3\2
Code:
\2\1\3 |
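For what it's worth, the trouble with putting an OR at the top level of the search is that each alternative gets its own group numbers: the self-terminated half ends up as groups 4, 5 and 6, so a replacement of \1\3\2 has nothing to insert. The sketch below shows that in Python's `re` module, where groups that did not take part in the match expand to empty strings, and then one possible single-search variant that keeps the OR inside group 2. Both combined patterns are my own assembly from the two searches in this thread, not something posted by anyone here, and I have not checked them in the Calibre editor.

```python
import re

sample_paired = 'ele<a id="page_330"></a>phant'
sample_self_closed = 'ele<span id="pg24"/>phant'

# OR at the top level: the second alternative's groups are 4, 5 and 6,
# so \1\3\2 refers to groups that did not take part in the match.
TOP_LEVEL_OR = re.compile(
    r'(\b\w+?)(<\w.+?></\w*?>)(\w+?\b)|(\b\w+?)(<[^/]+?/>)(\w+?\b)'
)
# OR moved inside group 2: the numbering stays 1, 2, 3 for both tag styles.
INNER_OR = re.compile(r'(\b\w+?)(<\w.+?></\w*?>|<[^/]+?/>)(\w+?\b)')

# The self-closed sample matches via groups 4-6, so the whole match is replaced by nothing.
broken = TOP_LEVEL_OR.sub(r'\1\3\2', sample_self_closed)
fixed = [INNER_OR.sub(r'\1\3\2', s) for s in (sample_paired, sample_self_closed)]

print(repr(broken))
print(fixed)
```

Running two separate saved searches, as described above, avoids the issue entirely; the inner-OR form is only an option if a single pass is preferred.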
#9 | ||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
If a page break lands in the middle of a word, they should've been shoving the page numbers AFTER. See Daisy.org: "Page Navigation":
Quote:
My Solution
I'd tackle it using:
Find #1: (<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)
Find #2: (<a id="page_\d+"></a>)([\w”\?!\.]+)
Replace: \2\1
This would convert your examples into:
Code:
elephant<span epub:type="pagebreak" id="page_330" title="330"></span>
elephant<a id="page_330"></a>
What's the Regex Doing?
Well, the 1st half is saying:
(Similar with the <a> page number version.)
What's the 2nd half doing?
The Replace is saying:
Last edited by Tex2002ans; 01-25-2023 at 06:05 PM. |
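A rough Python sketch of the two Finds and the Replace from this post, for anyone who wants to test them on a string first. It assumes Python's `re` module matches these patterns the same way the editor does; the helper name `push_page_markers_after` and the sample strings are invented for the example.

```python
import re

# Find #1 and Find #2 from the post; group 1 is the page marker,
# group 2 is the run of word characters / closing punctuation right after it.
FIND_SPAN = re.compile(r'(<span epub:type="pagebreak" [^>]+></span>)([\w”\?!\.]+)')
FIND_ANCHOR = re.compile(r'(<a id="page_\d+"></a>)([\w”\?!\.]+)')
REPLACE = r'\2\1'  # trailing text first, then the page marker


def push_page_markers_after(html: str) -> str:
    """Apply both Finds with the same Replace (hypothetical helper)."""
    html = FIND_SPAN.sub(REPLACE, html)
    return FIND_ANCHOR.sub(REPLACE, html)


samples = [
    'ele<span epub:type="pagebreak" id="page_330" title="330"></span>phant',
    'ele<a id="page_330"></a>phant. Next sentence.',
    'in the <a id="page_12"></a>zoo',
]
for sample in samples:
    print(push_page_markers_after(sample))
```

Note that, because these patterns only look at what follows the marker, they also move a marker that sits at the start of a word to the end of that word, not just markers in the middle of one.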
||
#10 |
Guru
Posts: 776
Karma: 1538394
Join Date: Sep 2013
Device: Kobo Forma
|
I'm assuming that the publisher just ran some automated "stick a page id in here somewhere" program that looked at a print book and, at the start of each dead-tree page, stuck a tag into the ebook. Probably 1) no human ever saw it, 2) when they made the ebook there might not have been any standards, and 3) no one ever looks back at the horrible stuff they did in the dark ages to make it better.
Of course, on the glass half-full side of things, if they finagled those page id locations to be in the next space, then when someone referred to a bit of text by page number, an ebook user might not be able to find it. Although, occasionally being half a word off shouldn't be too onerous.
EDIT: From that "Page Navigation" link you provided, inline page markers are supposed to look something like: "<span role="doc-pagebreak" id="pg24" aria-label="24"/>". Yet, I don't think I've ever seen anything like it. Ninety-nine percent of the time, it'll be the old <a id="page_330"></a> method, which that document specifically says bad things about. Occasionally, I'll see something like what's done in the current book I'm editing ("<span epub:type="pagebreak" id="page_330" title="330"></span>") which seems to be making some kind of effort. At some point (probably about where I've finished re-formatting all the books in my library
Last edited by enuddleyarbl; 01-25-2023 at 06:46 PM. |
#11 | |||
Wizard
Posts: 2,306
Karma: 13057279
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Code:
There was an ele-
-------PAGE 2-------
phant in the zoo.
Code:
There was an ele<a page="page_2"></a>phant in the zoo.
Quote:
For all the latest "Real Page Numbers" (RPNs) stuff, see my post here:
where I link to many of the previous topics. You'll also want to type this into your favorite search engine:
Code:
RPNs Tex2002ans site:mobileread.com
page numbers Tex2002ans site:mobileread.com
- - -
For a working sample of EPUB3 page numbers, see Doitu's sample book. And his fantastic Sigil plugin:
- - -
Quote:
The simple <a> was the EPUB2 method. The <span> + epub:type="pagebreak" is the EPUB3 method.
Last edited by Tex2002ans; 01-25-2023 at 07:18 PM. |
|||
#12 | |
Resident Curmudgeon
Posts: 79,654
Karma: 145864619
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
Quote:
|
|
#13 |
Still reading
Posts: 13,928
Karma: 103895653
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper
|
What he says ^^^^
If I happen to be fixing formatting I also delete all that junk. Pretty quick using global regex or the delete/edit tag tool. |