Using the Search & Replace feature

Manichean · 01-26-2011, 06:26 PM

The search and replace feature uses regular expressions to describe what text to replace. If you need an introduction, there's one available here.

You can use the search & replace feature in the conversion options to replace strings of text with other strings. This can be used, for example, to remove headers, footers, or page numbers. Note that search & replace operates on the XHTML Calibre produces during conversion, not on the original file.
You enter a regular expression that describes the text to be replaced during the conversion. The neat part is the wizard: click on the wizard staff and you get a preview of what Calibre "sees" during the conversion process, namely the previously mentioned XHTML. Find the string you want to replace and construct your regex accordingly. Hit the button labeled "Test" and Calibre highlights the parts it would replace if you used that regex. Once you're satisfied, hit OK, enter your replacement text, and convert. If you supply an empty string as the replacement text, Calibre simply deletes the strings matching the regular expression.
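To get a feel for what happens to a match, here is a minimal sketch in plain Python rather than in Calibre itself. It assumes the replacement step behaves like Python's re.sub, and the XHTML snippet and the "Page 12" pattern are made up for illustration, not taken from a real book.

```python
import re

# Hypothetical snippet of conversion XHTML that contains a page-number paragraph.
xhtml = (
    '<p class="calibre1">Some text.</p>\n'
    '<p class="calibre1">Page 12</p>\n'
    '<p class="calibre1">More text.</p>'
)

# Match a paragraph that contains nothing but "Page <number>".
pattern = r'<p[^>]*>\s*Page\s+\d+\s*</p>\s*'

# Roughly what "Test" does: list what would be replaced.
matches = [m.group(0) for m in re.finditer(pattern, xhtml)]

# An empty replacement deletes every match.
cleaned = re.sub(pattern, '', xhtml)
print(cleaned)
```

Only the page-number paragraph is removed; the paragraphs around it are left alone.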

Practical examples

Removing header/footer strings:

Code:

"Maybe, but the cops feel like you do, Anita. What's one more dead vampire? New laws don't change that." </p><p class="calibre4">
<b class="calibre2">Generated by ABC Amber LIT Conv<a href="http://www.processtext.com/abclit.html" class="calibre3">erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">
It had only been two years since Addison v. Clark. The court case gave us a revised version of what life was

(Shamelessly ripped out of this thread.)
You want to remove the ABC Amber LIT Converter advert that's embedded in the text. To do that, you'll have to remove some of the tags as well. In this example, I'd recommend beginning with the tag <b class="calibre2">. The match then has to end with the corresponding closing tag (opening tags are <tag>, closing tags are </tag>), which is simply the next </b> in this case. (Refer to a good HTML manual or ask in the forum if you are unclear on this point.) The opening tag can be described using the regex <b.*?> and the closing tag using </b>, so we could remove everything between those tags using

Code:

<b.*?>.*?</b>

But using this expression would be a bad idea, because it removes everything enclosed by <b> tags (which, by the way, render the enclosed text in bold print), and it's a fair bet that we'd remove portions of the book this way. Instead, include the beginning of the enclosed string as well, making the regular expression

Code:

<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>

The \s with quantifiers are used here instead of the literal spaces seen in the string, to catch any variations of the string that might occur. Whenever you test a new expression, remember to check everything Calibre will remove, to make sure you don't lose any portions you want to keep. If you only check one occurrence, you might miss a mismatch somewhere else in the text. Also note that, should you accidentally remove more or fewer tags than you wanted to, Calibre tries to repair the damaged code after doing the header/footer removal.
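The difference between the two expressions can be checked outside Calibre with a small Python sketch. It assumes Calibre's matching behaves like Python's re module. The sample is shortened from the example above, and the bold "two years" is added purely as an assumption, to stand in for bold text that belongs to the book.

```python
import re

# Shortened sample; the bold "two years" is an illustrative addition.
sample = (
    '"Maybe, but the cops feel like you do, Anita." </p><p class="calibre4">\n'
    '<b class="calibre2">Generated by ABC Amber LIT Conv'
    '<a href="http://www.processtext.com/abclit.html" class="calibre3">'
    'erter, http://www.processtext.com/abclit.html</a></b></p><p class="calibre4">\n'
    'It had only been <b class="calibre2">two years</b> since Addison v. Clark.'
)

broad = r'<b.*?>.*?</b>'
narrow = r'<b.*?>\s*Generated\s+by\s+ABC\s+Amber\s+LIT.*?</b>'

# The broad expression hits the advert AND the book's own bold text.
broad_matches = re.findall(broad, sample)

# The narrow expression hits only the advert.
narrow_matches = re.findall(narrow, sample)

cleaned = re.sub(narrow, '', sample)
print(len(broad_matches), len(narrow_matches))
print(cleaned)
```

Listing every match like this, rather than looking at the first one only, is the same check as going through all the highlights in the wizard.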

Moving footnotes:

Consider a book where the footnotes are presented at the end of a paragraph or of what was once a physical page, as may be the case when the source of your book is an OCR'd paper book. An example may look like this:

Code:

<br>
Das Mädchen war über sich selbst hinausgewachsen. Johnny konnte es <br>
nur bewundern. Trotz der Angst war Kathy* nicht durchgedreht. Beide <br>
hatten genau das Richtige getan. Nicht die Helden spielen, sondern das <br>
Haus verlassen und flüchten. Kathy hatte sich einfach ein Rad <br>
<br>
* Siehe John Sinclair Nr. 1027: „Der Traum vom Schwarzen Tod“ <br>
<hr>

(From this thread) The first asterisk marks where the footnote (the text after the second asterisk) should go. We want to insert the footnote there, inside a pair of parentheses. The assumptions we make for finding the footnotes are that

the footnotes contain no markup (no HTML tags)
the footnote starts at the second asterisk and should be inserted at the position of the first asterisk
there's only one footnote per page

(These assumptions are made just for convenience's sake, to show a proof of concept. If you have a more complicated case than the one presented here, adapt your regular expression accordingly.)
The search expression would then, for example, be

Code:

(?s)\*\s*(?P<text>[^*]*?)\s*\*\s*(?P<footnote>[^<]*)

with the replacement text

Code:

(\g<footnote>) \g<text>

This would yield, after the search & replace finishes, the result

Code:

<br>
Das Mädchen war über sich selbst hinausgewachsen. Johnny konnte es <br>
nur bewundern. Trotz der Angst war Kathy(Siehe John Sinclair Nr. 1027: „Der Traum vom Schwarzen Tod“ ) nicht durchgedreht. Beide <br>
hatten genau das Richtige getan. Nicht die Helden spielen, sondern das <br>
Haus verlassen und flüchten. Kathy hatte sich einfach ein Rad <br>
<br><br>
<hr>

Of note here is that the regular expression uses named groups for backreferences. These are preferable to numbered backreferences because they are easier to read, which makes it clearer what the replacement actually does.
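As a rough check outside Calibre, the footnote move can be reproduced in Python. This sketch assumes Calibre applies the expression the way Python's re.sub does. The page text is shortened from the example above, and the numbered variant is included only for comparison.

```python
import re

# Shortened page from the example above.
page = (
    "nur bewundern. Trotz der Angst war Kathy* nicht durchgedreht. Beide <br>\n"
    "Haus verlassen und flüchten. Kathy hatte sich einfach ein Rad <br>\n"
    "<br>\n"
    "* Siehe John Sinclair Nr. 1027: „Der Traum vom Schwarzen Tod“ <br>\n"
    "<hr>"
)

# Named groups: the replacement says which part goes where.
named_search = r'(?s)\*\s*(?P<text>[^*]*?)\s*\*\s*(?P<footnote>[^<]*)'
named_replace = r'(\g<footnote>) \g<text>'

# Same expression with numbered groups: equivalent, but harder to read.
numbered_search = r'(?s)\*\s*([^*]*?)\s*\*\s*([^<]*)'
numbered_replace = r'(\2) \1'

result = re.sub(named_search, named_replace, page)
numbered_result = re.sub(numbered_search, numbered_replace, page)
print(result)
```

Both variants give the same text; the named version just keeps working unchanged if you later add another group in front.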

Last edited by Manichean; 01-28-2011 at 01:44 PM.
