07-27-2014, 03:55 AM | #1 |
Guru
Posts: 677
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
|
s&r for paired tags
I've got a book file full of code like:
Code:
<p class="calibre2"><span class="none2">blah blah blah</span></p>
Is there a way I can remove these spans (the "none2" ones) to get
Code:
<p class="calibre2">blah blah blah</p>
I can remove the opening tag with a simple s&r, but then I would have orphaned </span> tags, and I can't just delete every </span> without screwing up other spans. I could change "<span class="none2">" to "<span>" and neuter them, but I really hate to leave junk code in the file.
--
PS, I know what regexes are, and have written some simple ones, but parsing HTML is a bit hairy.
Last edited by AlanHK; 07-27-2014 at 04:23 AM. |
07-27-2014, 05:18 AM | #2 |
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
If the spans are not nested the following simple regex should do the trick:
Find: <span class="none2">(.*?)</span>
Replace: \1 |
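For anyone who wants to try the pattern outside an editor first, here is a minimal sketch in Python's `re` module. The function name and the sample strings are made up for illustration. It shows the simple case working and the nested case leaving markup behind. Note that `.` does not match line breaks by default, so spans that run across lines are skipped.

```python
import re

# Pattern from the post above: non-greedy, so it stops at the FIRST </span>.
SIMPLE = re.compile(r'<span class="none2">(.*?)</span>')

def strip_simple(html: str) -> str:
    # Replace each match with its captured inner text (the \1 replacement).
    return SIMPLE.sub(r"\1", html)

simple = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
print(strip_simple(simple))
# prints: <p class="calibre2">blah blah blah</p>

# Nested case: the match ends at the inner </span>, so the inner span
# swallows the rest of the text and the outer </span> becomes its close.
nested = '<span class="none2">a <span class="sc">b</span> c</span>'
print(strip_simple(nested))
# prints: a <span class="sc">b c</span>
```

The second result is still well-formed, which is why this kind of damage is easy to miss: the inner span now covers text it never covered before.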
07-27-2014, 07:01 AM | #3 | |
Wizard
Posts: 2,303
Karma: 12126963
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
(I have done it many times, and didn't notice until later, when I was doing a few cleaning passes and wondering "why the heck is this entire paragraph in smallcaps?")
Always save versions of your EPUBs when doing larger edits like this.
For nested tags, you really need something that can actually PARSE HTML, not just regex. |
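As a rough illustration of what "actually parsing" could look like, here is a sketch using Python's standard-library `html.parser`. This is not a feature of calibre or Sigil. The class and function names are hypothetical, and it only drops spans whose class attribute is exactly "none2". It keeps a stack of open spans so each closing tag is paired with its own opening tag.

```python
from html.parser import HTMLParser

class SpanStripper(HTMLParser):
    """Re-emit the markup, omitting <span class="none2"> and its matching </span>."""

    def __init__(self, drop_class="none2"):
        super().__init__(convert_charrefs=False)  # keep entities as written
        self.drop_class = drop_class
        self.out = []
        self.span_stack = []  # one flag per open span: True means "dropped"

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            drop = dict(attrs).get("class") == self.drop_class
            self.span_stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the flag of the span this tag closes; skip it if that span was dropped.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")

def strip_spans(html, drop_class="none2"):
    parser = SpanStripper(drop_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

nested = '<p class="calibre2"><span class="none2">a <span class="sc">b</span> c</span></p>'
print(strip_spans(nested))
# prints: <p class="calibre2">a <span class="sc">b</span> c</p>
```

`html.parser` is a lenient HTML tokenizer, not an XML validator, so treat this as a starting point and still work on a copy of the EPUB.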
|
07-27-2014, 07:33 AM | #4 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
You could probably regex out only the ones that are adjacent to the P tags:
Find: <p class="calibre2"><span class="none2">(.*?)</span></p>
Replace: <p class="calibre2">\1</p>
But take a backup first. This will go wrong if you have nested spans! |
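A quick way to see both the benefit and the risk of anchoring on the P tags is to run the pattern in Python's `re` module. This is only a sketch. The function name and the second sample string are invented for the demonstration.

```python
import re

# Pattern from the post above: the span must sit directly inside the <p>
# and its close must be followed immediately by </p>.
P_SPAN = re.compile(r'<p class="calibre2"><span class="none2">(.*?)</span></p>')

def unwrap_paragraph_spans(html: str) -> str:
    return P_SPAN.sub(r'<p class="calibre2">\1</p>', html)

good = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
print(unwrap_paragraph_spans(good))
# prints: <p class="calibre2">blah blah blah</p>

# Failure case: the none2 span closes early and a DIFFERENT span ends the
# paragraph. The regex pairs the opening tag with the wrong </span>.
bad = '<p class="calibre2"><span class="none2">a</span> b <span class="sc">c</span></p>'
print(unwrap_paragraph_spans(bad))
# prints: <p class="calibre2">a</span> b <span class="sc">c</p>
```

The second result is no longer well-formed, which is the reason for the backup.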
07-27-2014, 08:50 AM | #5 | ||
Guru
Posts: 677
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
|
Quote:
That should work, thanks.
Quote:
Is there any HTML code editor that does stuff like this? |
||
07-27-2014, 09:30 AM | #6 |
Wizard
Posts: 3,720
Karma: 1759970
Join Date: Sep 2010
Device: none
|
If you want an HTML editor, try the freeware Notepad++, but don't expect it to understand ebook structure.
The calibre editor is your other go-to solution, as it is in ongoing development and you can post enhancement requests.
NB: the spans may look ugly, but they are mostly harmless. The book will render OK if you just leave them be! |
07-27-2014, 10:41 AM | #7 |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Find:
Code:
<span class="none2">((?:(?!<span).)*?)</span>
Replace:
Code:
\1
Matches nested tags as long as only the outer tag is a span. But you can be more specific if you want, by changing the lookahead.
http://regular-expressions.info/completelines.html
Last edited by eschwartz; 07-27-2014 at 10:50 AM. |
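To see how the lookahead behaves, the pattern can be tried in Python's `re` module, which supports the same `(?!...)` syntax. This is a sketch with made-up sample strings and a hypothetical function name. The key point is that `(?:(?!<span).)` consumes one character at a time but refuses to step onto the start of another `<span`.

```python
import re

# Pattern from the post above: the captured text may contain other tags,
# but never the start of another <span.
TEMPERED = re.compile(r'<span class="none2">((?:(?!<span).)*?)</span>')

def strip_tempered(html: str) -> str:
    return TEMPERED.sub(r"\1", html)

# Other tags inside the span are fine.
print(strip_tempered('<span class="none2">a <i>b</i> c</span>'))
# prints: a <i>b</i> c

# A none2 span that CONTAINS another span is left alone rather than broken.
print(strip_tempered('<span class="none2">a <span class="sc">b</span> c</span>'))
# prints: <span class="none2">a <span class="sc">b</span> c</span>

# A none2 span nested INSIDE another span is removed cleanly.
print(strip_tempered('<span class="sc">a <span class="none2">b</span> c</span>'))
# prints: <span class="sc">a b c</span>
```

Spans that are skipped this way still need a second look, by hand or with a narrower lookahead.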
07-27-2014, 12:56 PM | #8 |
Well trained by Cats
Posts: 30,441
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Nested spans are a pain, and if you START out with the code in Post 2, you will have a disaster, because that is only safe with a simple span as you show (IMHO an unnecessary one, except that it is a conversion simplifier).
process
Code:
<p class="calibre2 none2">blah blah blah</p>
I am going to give a try to eschwartz's REGEX |
07-27-2014, 01:33 PM | #9 | |
Guru
Posts: 677
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
|
Quote:
I use UltraEdit for my text, but I was hoping for more than a text editor that highlights.
Googled it: No results found for "eschwartz's REGEX" ?? |
|
07-27-2014, 02:46 PM | #10 |
Color me gone
Posts: 2,089
Karma: 1445295
Join Date: Apr 2008
Location: Central Oregon Coast
Device: PRS-300
|
If you wait long enough and use the calibre editor, he will have a scripting function that should be able to do multiple tests for different cases. The rest of us know from bitter experience with Sigil, since it has no global undo, how hard it is to consider all cases with regex (regular expressions). You can use the simple version proposed earlier, but the only way to do it safely is to use it one find at a time, so you can see when it vacuums up more text than you intended.
Look up 2-3 posts above and the regex that eschwartz proposed is there, showing find and replace. |
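The "one find at a time" check can also be approximated outside the editor by listing every match before replacing anything. The sketch below uses Python's `re` module with the simple pattern from earlier in the thread. The function name and the "suspicious" rule (captured text that contains another `<span`) are assumptions made for illustration, not something any of the editors provide.

```python
import re

SIMPLE = re.compile(r'<span class="none2">(.*?)</span>')

def preview_matches(html: str):
    """Return (offset, matched_text, suspicious) for each match, changing nothing."""
    report = []
    for m in SIMPLE.finditer(html):
        # A match that swallowed another opening span has paired the wrong tags.
        suspicious = "<span" in m.group(1)
        report.append((m.start(), m.group(0), suspicious))
    return report

page = ('<p><span class="none2">ok</span></p>'
        '<p><span class="none2">a <span class="sc">b</span> c</span></p>')

for offset, text, suspicious in preview_matches(page):
    flag = "CHECK" if suspicious else "ok"
    print(f"{flag:5} @{offset}: {text}")
```

Anything flagged is a spot where a blind Replace All would have vacuumed up the wrong closing tag.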
07-27-2014, 03:16 PM | #11 | ||
Guru
Posts: 677
Karma: 929286
Join Date: Apr 2014
Device: PW-3, iPad, Android phone
|
Quote:
Quote:
Anyway, I worked it out by first finding and fixing the spans I wanted to keep (as it happens, one) and then could delete the rest with a clear conscience.
Duh. I thought it was some kind of software. I didn't register the names next to posts. |
||
07-27-2014, 03:45 PM | #12 |
Grand Sorcerer
Posts: 12,733
Karma: 75000000
Join Date: Nov 2007
Location: Toronto
Device: Libra H2O, Libra Colour
|
You might also check out the forked version of ePub Clean plugin for calibre that has some support for removing SPANs
|
07-27-2014, 04:14 PM | #13 |
Zealot
Posts: 119
Karma: 64428
Join Date: Aug 2011
Device: none
|
If nested spans are a possibility, I like to use a search pattern similar to post #2, except I replace the stuff inside the parentheses with ([^<]*). This says to match any string of characters up to, but not including, a less-than sign. If the immediately following characters are not </span>, the entire pattern fails and no replacement is done. Otherwise, the replacement stays the same. In my experience, this leaves only a handful of paragraphs to be dealt with in another way, often by hand.
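Here is a small sketch of that variant in Python's `re` module, with a count of what is left over for manual cleanup. The function name and sample text are invented for the example. Unlike `.`, the class `[^<]` also matches line breaks, so tag-free spans that wrap across lines are caught too.

```python
import re

# Only matches spans whose content contains no tags at all.
NO_TAGS = re.compile(r'<span class="none2">([^<]*)</span>')

def strip_tag_free(html: str):
    """Return (cleaned_html, spans_removed, spans_left_for_manual_review)."""
    cleaned, removed = NO_TAGS.subn(r"\1", html)
    leftover = cleaned.count('<span class="none2">')
    return cleaned, removed, leftover

page = ('<p><span class="none2">plain</span></p>'
        '<p><span class="none2">a <i>b</i> c</span></p>')

cleaned, removed, leftover = strip_tag_free(page)
print(cleaned)
# prints: <p>plain</p><p><span class="none2">a <i>b</i> c</span></p>
print(removed, leftover)
# prints: 1 1
```

The leftover count gives a rough idea of how many paragraphs remain to be handled another way.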
|
07-27-2014, 09:05 PM | #14 | |
Ex-Helpdesk Junkie
Posts: 19,421
Karma: 85397180
Join Date: Nov 2012
Location: The Beaten Path, USA, Roundworld, This Side of Infinity
Device: Kindle Touch fw5.3.7 (Wifi only)
|
Quote:
|
|
07-28-2014, 10:26 AM | #15 | |
Wizard
Posts: 1,085
Karma: 412718
Join Date: Jan 2009
Location: Valley Forge, PA, USA
Device: Kindle Paperwhite
|
@eschwartz--
Quote:
2. Many times when I'm cleaning an epub, removing unneeded 'class=" ... " ' in <span class="..."> I'll eventually end up with a lot of <span>.....</span> constructs. It appears that your RE is better than the more simplistic RE I was using to just remove them.
Thanks |
|
|