03-01-2020, 07:32 AM | #1 |
Evangelist
Posts: 487
Karma: 32554
Join Date: May 2014
Location: Canada
Device: Kobo Libra Colour
|
Help creating possible Regex-Function
**If this shouldn't be here, please move or delete, and you have my apologies**
I have been trying to understand Python to create my own regex-functions, but even after a year, I'm clueless. I hope that someone can help me create one...or tell me if what I want is even possible. I have a very long, created text, where the author included a lot of sections where each paragraph is wrapped in tags for italics. That's fine. But they also wrapped all the sections in a tag which automatically makes those paragraphs italic. I have been trying regular search and replace expressions, but I can't get anything that works whether there is one paragraph or eight paragraphs to remove the italics tags from. It's either too greedy, or not greedy enough. I would like help to do the following:
In any case, I'd appreciate being directed to an online tutorial that would be easy enough for me to understand (maybe easier than "Python for Dummies" at the rate I'm going) so I can eventually learn to do it myself. If someone would prefer to help off-forum, by private message, I don't mind that either. |
03-01-2020, 08:10 AM | #2 |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
Do it in two passes.
1. Remove the italic tags from the paragraphs. 2. Remove the section tags. If you post a sample section it will be easier to help. E.g. to remove the form tags, find <form>(.*)</form> and replace with \1 |
03-01-2020, 08:43 AM | #3 | |
Evangelist
Posts: 487
Karma: 32554
Join Date: May 2014
Location: Canada
Device: Kobo Libra Colour
|
Quote:
Here's an example: Spoiler:
And more difficult, but if possible: Spoiler:
|
|
03-01-2020, 09:04 AM | #4 |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
OK, for the 1st one:
find <p><em>(.*)</em></p> and replace with <p>\1</p>
For the 2nd one, use 2 passes. First remove the unneeded inner bits, which start with a close em tag followed by an open em tag: find </em>(.*)<em> and replace with \1. Then use a 2nd pass to change em to strong.
The trick is to use several simple expressions, not one very complicated one, and to review the results after each stage. Make a backup before risking a Replace All. If you have the patience, step through using Find and Replace to do single operations and then move on to the next candidate; that way you can skip past any you want to leave unchanged.
NB: I do all this using Sigil; syntax may be different for other tools. |
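The first pass above can be tried outside the editor in plain Python. This is only a sketch: the sample text and the function name are made up for illustration, and the pattern is a guarded variant of the one in the post (the group is not allowed to contain another em tag), so that paragraphs which are only partly italic are skipped instead of being damaged. `subn` returns the number of replacements, which helps with the "review results after each stage" advice.

```python
import re

# Guarded version of <p><em>(.*)</em></p>: the group may not contain
# another <em> or </em>, so only fully wrapped paragraphs match.
WHOLE_PARAGRAPH_EM = re.compile(r"<p><em>((?:(?!</?em>).)*)</em></p>")


def unwrap_whole_paragraph_em(text):
    """Return (new_text, number_of_paragraphs_changed)."""
    return WHOLE_PARAGRAPH_EM.subn(r"<p>\1</p>", text)


# Hypothetical sample, not taken from the book discussed in the thread.
sample = (
    "<p><em>Whole paragraph in italics.</em></p>\n"
    "<p><em>Partly</em> italic, <em>left alone.</em></p>"
)
cleaned, changed = unwrap_whole_paragraph_em(sample)
print(changed)
print(cleaned)
```

Python's `re` syntax is close to, but not identical with, the regex engines in Sigil and calibre, so the pattern may need small adjustments there.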
03-01-2020, 09:45 AM | #5 |
Well trained by Cats
Posts: 30,351
Karma: 58032210
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
If you were doing ALL italics, I would suggest the Edit Spans and Divs plugin.
Don't believe the name; it does many more tag types, ONE type at a time. (It is Diaps toolbag in Sigil.) Stop trying to do it all in ONE PASS. All you do is give Murphy a leg up. (.*?) reduces the greediness of the (.*)
<p><i>What!</i> will happen if this <i>code appears?</i></p> |
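The difference between (.*) and (.*?) on that sample line can be checked with a few lines of plain Python. This is a sketch using Python's `re` module, not Sigil itself; the variable names are made up for illustration.

```python
import re

line = "<p><i>What!</i> will happen if this <i>code appears?</i></p>"

# Greedy: (.*) runs to the LAST </i> on the line, swallowing the inner tags.
greedy = re.findall(r"<i>(.*)</i>", line)

# Lazy: (.*?) stops at the FIRST </i> it can, giving one hit per tag pair.
lazy = re.findall(r"<i>(.*?)</i>", line)

# Lazy is not a cure-all: anchoring on <p>...</p> still forces the group
# to stretch across the inner tags, leaving unbalanced markup behind.
broken = re.sub(r"<p><i>(.*?)</i></p>", r"<p>\1</p>", line)

print(greedy)
print(lazy)
print(broken)
```

The last result is the kind of surprise the sample line is warning about: the outer tags are removed but an orphaned `</i>` and `<i>` remain inside the paragraph.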
03-01-2020, 11:13 AM | #6 |
Evangelist
Posts: 487
Karma: 32554
Join Date: May 2014
Location: Canada
Device: Kobo Libra Colour
|
I think I am misunderstanding you. From what I understand, your suggestions would remove the <em></em> tags regardless of whether they are found between the <form></form> tags, which is NOT what I want. Anything outside the <form></form> tags should stay as is. The <form> tag carries formatting which already includes italics, so the additional <em> tags are unnecessary.
I'll insert another, maybe better, example for you to comment on so I can understand. Although I'm starting to think it can't be done, or I'm just missing something obvious. I'll mark tags I want to keep in blue, tags I don't want to keep in red (only those I'm asking about; <p> tags I won't touch.) Spoiler:
I don't mind doing multiple passes, but as it is, I haven't been able to do anything except check each one almost individually. That's why I thought that creating a Regex-Function was the way to go. Ideally, I had hoped to be able to have something that says: "change <p><em></em></p> to just <p></p> when between <form></form> tags". I wouldn't even mind if it was "remove all <em></em> tags when between <form></form> tags". |
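The "remove all <em></em> tags when between <form></form> tags" idea maps naturally onto a replacement function: match each whole form block, then clean only the matched text. Below is a minimal sketch in plain Python, assuming form blocks are not nested; the names `FORM_BLOCK`, `EM_TAG`, `strip_em_in_form` and `clean` are made up for illustration. It is not calibre's Regex-Function interface itself, which wraps this kind of logic in its own `replace` function; the calibre manual documents the exact signature.

```python
import re

# One whole <form>...</form> block; DOTALL lets "." cross line breaks,
# and the lazy .*? stops at the first closing tag (assumes no nesting).
FORM_BLOCK = re.compile(r"<form\b[^>]*>.*?</form>", re.DOTALL)
EM_TAG = re.compile(r"</?em>")


def strip_em_in_form(match):
    """Replacement callback: drop every em tag inside the matched block."""
    return EM_TAG.sub("", match.group(0))


def clean(html):
    """Remove em tags inside form blocks; leave everything else untouched."""
    return FORM_BLOCK.sub(strip_em_in_form, html)


# Hypothetical sample, not the text from the spoiler.
sample = (
    "<p><em>keep</em></p>\n"
    "<form>\n<p><em>one</em></p>\n<p><em>two</em></p>\n</form>\n"
    "<p><em>also keep</em></p>"
)
print(clean(sample))
```

Because the em removal only ever sees text that was matched as a form block, tags outside the blocks cannot be touched, however many paragraphs each block holds.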
03-01-2020, 11:33 AM | #7 | |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
Quote:
|
|
03-01-2020, 11:46 AM | #8 |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
PS: you bracket the text fragments you want to keep, so you can refer to them as \1 \2 \3 in the replace formula.
So (back on a PC keyboard now...) find <form>(.*)<em>(.*)</em>(.*)</form>. That finds stuff that is in em tags which are within form tags, and gives you 3 text fragments which will be preserved. Now assemble how you want it to look without the em tags, i.e. replace with <form>\1\2\3</form> |
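Each replacement with this expression removes only one em pair per form block, so it has to be repeated until nothing more matches. A rough Python sketch of that loop follows, assuming each form block sits on a single line (in Python, "." does not cross line breaks unless DOTALL is set); the function name and sample are made up for illustration. One caveat: if a single line held two form blocks, the greedy groups could stretch from the first <form> to the last </form>, so the results still need reviewing.

```python
import re

# The find expression from the post, unchanged.
PATTERN = re.compile(r"<form>(.*)<em>(.*)</em>(.*)</form>")


def remove_em_inside_form(text):
    """Repeat the replace until no em pair is left inside a form block.

    Returns (new_text, number_of_passes_that_changed_something).
    """
    passes = 0
    while True:
        text, count = PATTERN.subn(r"<form>\1\2\3</form>", text)
        if count == 0:
            return text, passes
        passes += 1


# Hypothetical sample: one em pair outside the form, two inside it.
sample = (
    "<p><em>Keep.</em></p>\n"
    "<form><p><em>One.</em></p><p><em>Two.</em></p></form>"
)
result, passes = remove_em_inside_form(sample)
print(passes)
print(result)
```

Because the first group is greedy, the last em pair in the block is removed first and the earlier ones on later passes.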
03-01-2020, 12:44 PM | #9 |
Evangelist
Posts: 487
Karma: 32554
Join Date: May 2014
Location: Canada
Device: Kobo Libra Colour
|
THANK YOU
I knew I was missing something simple. Leave it to me to complicate everything. I was using something similar to that (<form>(.*?)<em>|</em>), but as you can see, I would need multiple passes, and it would also jump to different sections on occasion. I got very annoyed and frustrated. Using your expression, even if I still have to check, it will be immensely easier. I can also tweak it to only remove at the beginning and end of paragraphs, and then go through and change the other <em> tags to <strong> tags. I'm babbling. Sorry. Thanks again! |
03-01-2020, 12:58 PM | #10 |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
Instead of \1\2\3 for the replace you can optionally put new tags either side of the \2
E.g. \1<strong>\2</strong>\3 |
03-01-2020, 01:03 PM | #11 |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
Ps I only know a small subset of what can be done with regex, mostly learned by asking here!
You got lucky in that you wanted something similar to what I had done in another book tweak. I use the Sigil editor rather than the calibre one, and I think there is a thread of regex examples in the Sigil forum. |
03-02-2020, 03:47 PM | #12 |
Resident Curmudgeon
Posts: 75,813
Karma: 134321338
Join Date: Nov 2006
Location: Roslindale, Massachusetts
Device: Kobo Libra 2, Kobo Aura H2O, PRS-650, PRS-T1, nook STR, PW3
|
However, the change from <em> to <strong> can be done with Diaps Editing Toolbag editor plugin. You do have to configure it to add in strong to what em can be replaced with. And once done, you won't need regex.
|
03-02-2020, 06:40 PM | #13 |
Not Quite Dead
Posts: 194
Karma: 654170
Join Date: Jul 2015
Device: Paperwhite 4; Galaxy Tab
|
One thing to consider in similar situations such as presented by the OP is to use CSS contextual selectors rather than subject the text to regex.
As I understand, there was a problem with italicized text within forms. A style rule could deal with that instantly: form em {font-style: normal} This would un-italicize anything within em tags which are in a form—while ignoring all other em tag content. |
03-03-2020, 02:04 AM | #14 |
Wizard
Posts: 3,305
Karma: 10259306
Join Date: May 2016
Device: kobo forma, Kobo Libra, Huawei media Tab, fire HD10, PW3 HDX8.9,
|
that's useful to know, and interesting.
I don't think I have ever seen a <form> tag when tweaking novels though. What is the proper/normal use of <form> in a book? Google tells me that <form> in HTML is used for user input forms, which I guessed would be the case, but that makes no sense in an EPUB. Last edited by stumped; 03-03-2020 at 02:09 AM. |
03-03-2020, 05:53 AM | #15 | |
Not Quite Dead
Posts: 194
Karma: 654170
Join Date: Jul 2015
Device: Paperwhite 4; Galaxy Tab
|
Quote:
|
|
|