s&r for paired tags

AlanHK · 07-27-2014, 02:55 AM

I've got a book file full of code like :

Code:

<p class="calibre2"><span class="none2">blah blah blah</span></p>

Is there a way I can remove these spans, (the "none2" ones) to get

Code:

<p class="calibre2">blah blah blah</p>

without messing up any other spans?

I can remove the opening by simple s&r, but then I would have orphaned </span>, but could not just delete </span> without screwing up other spans.

I could change "<span class="none2">" to "<span>" and neuter them, but I really hate to leave junk code in the file.

-- PS, I know what regex are, and have written some simple ones, but parsing HTML is a bit hairy.

Doitsu · 07-27-2014, 04:18 AM

If the spans are not nested the following simple regex should do the trick:

Find:<span class="none2">(.*?)</span>
Replace:\1
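
For anyone who wants to see what this pattern does before running it on a book, here is a minimal sketch in Python. Python's `re` module is an assumption (Sigil's own regex engine may differ in details), and the sample strings and the `strip_simple` name are made up for illustration.

```python
import re

# The simple pattern: lazily match from the opening tag to the NEXT </span>.
SIMPLE = re.compile(r'<span class="none2">(.*?)</span>')

def strip_simple(html: str) -> str:
    """Replace each match with the contents of its first capture group."""
    return SIMPLE.sub(r'\1', html)

# Hypothetical sample lines, one flat and one with a nested span.
flat = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
nested = '<p><span class="none2">a <span class="sc">b</span> c</span></p>'

print(strip_simple(flat))
print(strip_simple(nested))
```

On the nested sample the match ends at the inner `</span>`, so the inner span is left wrapping more text than before. That is the damage the "not nested" condition is guarding against.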

Tex2002ans · 07-27-2014, 06:01 AM

Quote:

Originally Posted by Doitsu

If the spans are not nested the following simple regex should do the trick:

This needs to be stressed. Quite often span tags ARE nested, and you might accidentally cause a lot of damage if you just do a large "Replace All".

(I have done it many times, and didn't notice until later when I was doing a few cleaning passes). Later wondering "why the heck is this entire paragraph in smallcaps?".

Always save versions of your EPUBs when doing larger edits like this.

For nested tags, you really just need something that can actually PARSE HTML, and not just Regex.
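
As a rough illustration of what "something that can parse HTML" could look like, here is a sketch using Python's standard `html.parser` module. This is not a Sigil or calibre feature; the `SpanUnwrapper` class and `unwrap_spans` function are hypothetical names, and the code assumes you only want to drop spans whose class attribute is exactly `none2`.

```python
from html.parser import HTMLParser

class SpanUnwrapper(HTMLParser):
    """Re-emit the input, dropping <span class="TARGET"> and its matching </span>."""

    def __init__(self, target_class: str):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.target = target_class
        self.out = []
        self.span_stack = []  # one entry per open span; True means "dropped"

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            classes = (dict(attrs).get("class") or "").split()
            drop = classes == [self.target]
            self.span_stack.append(drop)
            if drop:
                return
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Pop the matching open span; skip the close tag if that span was dropped.
        if tag == "span" and self.span_stack and self.span_stack.pop():
            return
        self.out.append(f"</{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")

    def handle_decl(self, decl):
        self.out.append(f"<!{decl}>")

    def handle_pi(self, data):
        self.out.append(f"<?{data}>")


def unwrap_spans(html: str, target_class: str = "none2") -> str:
    parser = SpanUnwrapper(target_class)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(unwrap_spans('<p><span class="none2">a <span class="sc">b</span> c</span></p>'))
```

Because the stack pairs every `</span>` with its own opening tag, nesting depth does not matter here. It is still only a sketch: run it on a copy of the file and compare the result before trusting it.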

cybmole · 07-27-2014, 06:33 AM

you could probably regex out only the ones that are adjacent to the P tags
find
<p class="calibre2"><span class="none2">(.*?)</span></p>
replace
<p class="calibre2">\1</p>

but take a backup first- this code will go wrong if you have nested spans!

AlanHK · 07-27-2014, 07:50 AM

Quote:

Originally Posted by Doitsu

If the spans are not nested the following simple regex should do the trick:

Find:<span class="none2">(.*?)</span>
Replace:\1

That should work, thanks.

Quote:

Originally Posted by Tex2002ans

For nested tags, you really just need something that can actually PARSE HTML, and not just Regex.

Well, Sigil can parse HTML. It highlights the tag pairs, for instance. I was hoping there were some options hidden away that I could use to do this. Too bad it doesn't give users more HTML-aware s&r than generic regex.

Is there any HTML code editor that does stuff like this?

cybmole · 07-27-2014, 08:30 AM

if you want a HTML editor try freeware notepad++, but don't expect it to understand ebook structure.
calibre editor is your other go-to solution as it is in ongoing development / you can post enhancement requests

NB the spans may look ugly but they are mostly harmless - the book will render OK if you just leave them be!

eschwartz · 07-27-2014, 09:41 AM

Find:

Code:

<span class="none2">((?:(?!<span).)*?)</span>

Replace:

Code:

\1

Using a negative lookahead we search for the LACK of a nested span, followed by any character, then repeat.

Matches nested tags as long as only the outer tag is a span. But you can be more specific if you want, by changing the lookahead.

http://regular-expressions.info/completelines.html
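
A small sketch of how this pattern behaves, written with Python's `re` module as a stand-in for Sigil's search (an assumption; the sample strings and the `strip_unnested` name are invented for the demo):

```python
import re

# Each consumed character must NOT be the start of another "<span".
TEMPERED = re.compile(r'<span class="none2">((?:(?!<span).)*?)</span>')

def strip_unnested(html: str) -> str:
    """Unwrap none2 spans that contain no further span; leave the rest alone."""
    return TEMPERED.sub(r'\1', html)

# Hypothetical samples: other tags inside are fine, a nested span is not.
plain = '<p class="calibre2"><span class="none2">blah <i>blah</i> blah</span></p>'
nested = '<p><span class="none2">a <span class="sc">b</span> c</span></p>'

print(strip_unnested(plain))
print(strip_unnested(nested))
```

The nested sample comes back unchanged, so it can be found and handled separately. Note that in Python `.` does not match a newline unless `re.DOTALL` is set, so a span that runs across lines would also be skipped by this sketch.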

theducks · 07-27-2014, 11:56 AM

Nested Spans are a pain and if you START out with the code in Post 2, you will have a disaster because that is only safe with a simple (and IMHO unnecessary, except it is a conversion simplifier) span as you show.
process

Code:

<p class="calibre2 none2">blah blah blah</p>

should work the same.

I am going to give a try to eschwartz's REGEX
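
If you would rather fold the span's class into the paragraph, as in the `calibre2 none2` example above, one possible sketch in Python follows. The `fold_span_class` name is hypothetical, and the pattern assumes the span is the paragraph's only child and contains no other span.

```python
import re

# <p class="A"><span class="B">text</span></p>  ->  <p class="A B">text</p>
# The lookahead refuses to cross any opening or closing span tag.
FOLD = re.compile(
    r'<p class="([^"]*)"><span class="([^"]*)">((?:(?!</?span).)*?)</span></p>'
)

def fold_span_class(html: str) -> str:
    """Move the class of a paragraph-wide span onto the paragraph itself."""
    return FOLD.sub(r'<p class="\1 \2">\3</p>', html)

print(fold_span_class('<p class="calibre2"><span class="none2">blah blah blah</span></p>'))
```

Whether the merged classes really render the same depends on the stylesheet, so check the CSS rules for both classes first.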

AlanHK · 07-27-2014, 12:33 PM

Quote:

Originally Posted by cybmole

if you want a HTML editor try freeware notepad++, but don't expect it to understand ebook structure.

That's what I do want. I'm surprised after all these years there isn't something that does. (HTML, not just ebooks).

I use Ultraedit for my text. But I was hoping for more than a text editor that highlights.

Quote:

Originally Posted by theducks

I am going to give a try to eschwartz's REGEX

Googled it, "No results found for "eschwartz's REGEX" "

??

mrmikel · 07-27-2014, 01:46 PM

If you wait long enough and use the calibre editor, he will have a scripting function that should be able to do multiple tests for different cases. The rest of us know from bitter experience with Sigil, since it has no global undo, how hard it is to consider all cases with regex (regular expressions). You can use the simple version proposed earlier, but the only way to do it safely is to use it one find at a time so you can see when it vacuums up more text than you intended.

Look up 2-3 posts above and the regex that eschwartz proposed is there showing find and replace.
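
One way to get some of that safety without stepping through every find is a dry run that only lists the suspicious matches. The sketch below uses Python's `re` module on a made-up `book` string; `risky_matches` is a hypothetical helper, not something Sigil or calibre provides.

```python
import re

# The simple pattern proposed earlier in the thread.
SIMPLE = re.compile(r'<span class="none2">(.*?)</span>')

def risky_matches(html: str) -> list[str]:
    """Return every match whose captured text opens another span."""
    return [m.group(0) for m in SIMPLE.finditer(html) if "<span" in m.group(1)]

# Hypothetical two-paragraph sample; only the second one is nested.
book = (
    '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
    '<p class="calibre2"><span class="none2">a <span class="sc">b</span> c</span></p>'
)

for hit in risky_matches(book):
    print("check by hand:", hit)
```

Nothing is modified here. The list just shows where a "Replace All" with the simple pattern would swallow part of a nested span.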

AlanHK · 07-27-2014, 02:16 PM

Quote:

Originally Posted by mrmikel

If you wait long enough and use the calibre editor, he will have a scripting function that should be able to do multiple tests for different cases. The rest of us know from bitter experience with Sigil, since it has no global undo, how hard it is to consider all cases with regex (regular expressions).

Just my impression that Sigil is (was?) the more code-editing tool, while Calibre the more GUI.

Quote:

Originally Posted by mrmikel

You can use the simple version proposed earlier, but the only way to do it safely is to use it one find at a time so you can see when it vacuums up more text than you intended.

Since it's almost every paragraph in a book, that's a few thousand cases. One at a time isn't an option.

Anyway, I worked it out by first finding and fixing the spans I wanted to keep (as it happens, one) and then could delete the rest with a clear conscience.
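
A quick inventory of span classes makes that "find the ones to keep first" step easier. This is a minimal Python sketch on an invented sample; `span_class_counts` is a hypothetical helper, and it assumes class attributes are written with double quotes.

```python
import re
from collections import Counter

SPAN_OPEN = re.compile(r'<span\b([^>]*)>')   # any opening span tag
CLASS_ATTR = re.compile(r'class="([^"]*)"')  # its class attribute, if present

def span_class_counts(html: str) -> Counter:
    """Count opening span tags, grouped by class attribute value."""
    counts = Counter()
    for m in SPAN_OPEN.finditer(html):
        cls = CLASS_ATTR.search(m.group(1))
        counts[cls.group(1) if cls else "(no class)"] += 1
    return counts

# Hypothetical sample text.
sample = (
    '<p><span class="none2">a</span></p>'
    '<p><span class="none2">b <span class="sc">c</span></span></p>'
    '<p><span>d</span></p>'
)
print(span_class_counts(sample))
```

If only `none2` (or `none2` plus one or two rare classes) shows up, dealing with the rare ones first leaves the bulk delete much less risky.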

Quote:

Originally Posted by mrmikel

Look up 2-3 posts above and the regex that eschwartz proposed is there showing find and replace.

Duh. I thought it was some kind of software. I didn't register the names next to posts.

PeterT · 07-27-2014, 02:45 PM

You might also check out the forked version of ePub Clean plugin for calibre that has some support for removing SPANs

signum · 07-27-2014, 03:14 PM

If nested spans are a possibility, I like to use a search pattern similar to post #2, except I replace the stuff inside the parentheses with ([^<]*). This says to match any string of characters up to, but not including, a less than sign. If the immediately following characters are not </span>, the entire pattern fails and no replacement is done. Otherwise, the replacement stays the same. In my experience, this leaves only a handful of paragraphs to be dealt with in another way, often by hand.
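
A short sketch of this variant in Python (the `re` module, the sample strings, and the `strip_text_only` name are assumptions made for illustration):

```python
import re

# Contents may not contain "<" at all, so any inner tag blocks the match.
NO_TAGS = re.compile(r'<span class="none2">([^<]*)</span>')

def strip_text_only(html: str) -> str:
    """Unwrap none2 spans whose contents are plain text with no tags."""
    return NO_TAGS.sub(r'\1', html)

# Hypothetical samples: one text-only span, one containing an <i> tag.
text_only = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
with_italic = '<p class="calibre2"><span class="none2">blah <i>blah</i></span></p>'

print(strip_text_only(text_only))
print(strip_text_only(with_italic))
```

The second sample is left as it was, which is the "handful of paragraphs" left over for another pass. Unlike `.`, the negated class `[^<]` also matches newlines in Python, so spans broken across lines are still caught.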

eschwartz · 07-27-2014, 08:05 PM

Quote:

Originally Posted by signum

If nested spans are a possibility, I like to use a search pattern similar to post #2, except I replace the stuff inside the parentheses with ([^<]*). This says to match any string of characters up to, but not including, a less than sign. If the immediately following characters are not </span>, the entire pattern fails and no replacement is done. Otherwise, the replacement stays the same. In my experience, this leaves only a handful of paragraphs to be dealt with in another way, often by hand.

That is why I prefer using a negative lookahead -- it catches that too.

phossler · 07-28-2014, 09:26 AM

@eschwartz--

Quote:

((?:(?!<span).)*?)

1. Can you explain how the negative look ahead works, including breaking down the pieces of the RE?

2. Many times when I'm cleaning an epub, removing unneeded 'class=" ... " ' in <span>, I'll eventually end up with a lot of <span> ..... </span> constructs. It appears that your RE is better than the more simplistic RE I was using to just remove them

Thanks
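
One way to read the pieces of that expression is to spell it out with comments. The sketch below uses Python's `re.VERBOSE` flag for that (an assumption; it is not how you would type it into Sigil's Find box), and checks that the commented form behaves like the one-line form on a made-up sample.

```python
import re

ANNOTATED = re.compile(r'''
    <span[ ]class="none2">   # the literal opening tag to remove
    (                        # group 1: the text to keep, referenced as \1
      (?:                    # non-capturing group, applied once per character
        (?!<span)            # negative lookahead: fail if "<span" starts here
        .                    # otherwise consume exactly one character
      )*?                    # repeat lazily, as few times as possible
    )
    </span>                  # the first closing tag after that run
''', re.VERBOSE)

# The same pattern on one line, as posted in the thread.
COMPACT = re.compile(r'<span class="none2">((?:(?!<span).)*?)</span>')

# Hypothetical sample line.
sample = '<p class="calibre2"><span class="none2">blah blah blah</span></p>'
print(ANNOTATED.sub(r'\1', sample))
print(ANNOTATED.sub(r'\1', sample) == COMPACT.sub(r'\1', sample))
```

The lookahead never consumes anything. It only vetoes the next `.` whenever the upcoming characters are `<span`, which is why the match cannot run past the start of a nested span. Changing the literal opening tag on the first line should adapt the same idea to other spans.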
