An introduction to regular expressions

Manichean · 01-26-2011, 06:05 PM

The intent of this introduction is not to explain all the finer points of regular expression usage. It is meant to explain enough to handle some common tasks in Calibre, and to get new users knowledgeable enough that they can educate themselves further using the (rather technical) explanation given in the Python documentation, linked through the Calibre manual. So, let's get started.

First, a word of warning and a word of courage: This is, inevitably, going to be somewhat technical. After all, regular expressions are a technical tool for doing technical stuff. I'm going to have to use some jargon and concepts that may seem complicated or convoluted. I'm going to try to explain those concepts as clearly as I can, but I really can't do without them. That being said, don't be discouraged by the jargon, as I've tried to explain everything new. And while regular expressions themselves may seem like arcane black magic (or, to be more prosaic, a random string of mumbo-jumbo letters and signs), I promise that they are not all that complicated. Even those who understand regular expressions really well have trouble reading the more complex ones, but writing them isn't as difficult, because you construct the expression step by step. So, take a step and follow me down the rabbit hole.

What on earth is a regular expression?
A regular expression is a way to describe a particular string of characters (string for short). (Technical note: I'm using "string" here in the sense it is used in programming languages: a string of one or more characters, where characters include letters, digits, punctuation and so-called whitespace (linebreaks, tabs etc.). Keep in mind that computers have no concept of the semantics, the meaning, of the string; what you're essentially doing is describing groups of characters, not words with meaning. This includes numbers, which are considered just another type of character. Please note that generally, uppercase and lowercase characters are not considered the same, so "a" is a different character from "A" and so forth. In Calibre, regular expressions are case insensitive in the search bar, but not in the conversion options. There's a way to make every regular expression case insensitive, but we'll discuss that later.) It gets complicated because regular expressions allow for variations in the strings they match, so one expression can match multiple strings, which is why people bother using them at all. More on that in a bit.

Care to explain?
Well, that's why we're here. First, this is the most important concept in regular expressions: A string in itself is a regular expression that matches itself. That is to say, if I wanted to match the string "Hello, World!" using a regular expression, the regular expression to use would be

Code:

Hello, World!

And yes, it really is that simple. You'll notice, though, that this only matches the exact string "Hello, World!", not e.g. "Hello, wOrld!" or "hello, world!" or any other such variation.
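If you'd like to try this outside of Calibre, here is a minimal sketch using Python's re module, which is the regular expression flavour the Python documentation mentioned above describes. The sample sentences are made up for illustration.

```python
import re

pattern = "Hello, World!"

# A plain string used as a pattern matches only that exact text.
exact = re.search(pattern, "She said: Hello, World!")
wrong_case = re.search(pattern, "She said: hello, world!")

print(exact is not None)       # the exact text is found
print(wrong_case is not None)  # text that differs in case is not found
```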

That doesn't sound too bad. What's next?
Next is the beginning of the really good stuff. Remember where I said that regular expressions can match multiple strings? This is where it gets a little more complicated. Say, as a somewhat more practical exercise, the ebook you wanted to convert had a nasty footer counting the pages, like "Page 5 of 423". Obviously the page number would rise from 1 to 423, thus you'd have to match 423 different strings, right? Wrong, actually: regular expressions allow you to define sets of characters that are matched. To define a set, you put all the characters you want to be in the set into square brackets. So, for example, the set

Code:

[abc]

would match either the character "a", "b" or "c". Sets will always only match one of the characters in the set. They "understand" character ranges, that is, if you wanted to match all the lower case characters, you'd use the set

Code:

[a-z]

for lower- and uppercase characters you'd use

Code:

[a-zA-Z]

and so on. Got the idea? So, obviously, using the expression

Code:

Page [0-9] of 423

you'd be able to match the first 9 pages, thus reducing the expressions needed to three: The second expression

Code:

Page [0-9][0-9] of 423

would match all two-digit page numbers, and I'm sure you can guess what the third expression would look like. Yes, go ahead. Write it down.
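Here is a small sketch of sets at work, again assuming Python's re module. The word "cabbage" and the variable names are only there for illustration, and the third page-number expression is still left for you to write.

```python
import re

# A set matches exactly one character out of those listed.
letters = re.findall(r"[abc]", "cabbage")
print(letters)

one_digit = r"Page [0-9] of 423"
two_digit = r"Page [0-9][0-9] of 423"

# fullmatch only succeeds if the whole footer fits the expression.
five_fits_one = re.fullmatch(one_digit, "Page 5 of 423") is not None
forty_two_fits_one = re.fullmatch(one_digit, "Page 42 of 423") is not None
forty_two_fits_two = re.fullmatch(two_digit, "Page 42 of 423") is not None

print(five_fits_one, forty_two_fits_one, forty_two_fits_two)
```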

Hey, neat! This is starting to make sense!
I was hoping you'd say that. But brace yourself, now it gets even better! We just saw that using sets, we could match one of several characters at once. But you can even repeat a character or set, reducing the number of expressions needed to handle the above page number example to one. Yes, ONE! Excited? You should be! It works like this: Some so-called special characters, "+", "?" and "*", repeat the single element preceding them. (Element means either a single character, a character set, an escape sequence or a group (we'll learn about those last two later)- in short, any single entity in a regular expression.) These characters are called wildcards or quantifiers. To be more precise, "?" matches 0 or 1 of the preceding element, "*" matches 0 or more of the preceding element and "+" matches 1 or more of the preceding element. A few examples: The expression "a?" would match either "" (which is the empty string, not strictly useful in this case) or "a", the expression "a*" would match "", "a", "aa" or any number of a's in a row, and, finally, the expression "a+" would match "a", "aa" or any number of a's in a row (Note: it wouldn't match the empty string!). Same deal for sets: The expression

Code:

[0-9]+

would match any non-negative whole number, no matter how many digits it has! I know what you're thinking, and you're right: If you use that in the above case of matching page numbers, wouldn't that be the single one expression to match all the page numbers? Yes, the expression

Code:

Page [0-9]+ of 423

would match every page number in that book!
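The three quantifiers can be checked quickly with a sketch like the following, assuming Python's re module. The helper name fits is made up for this example.

```python
import re

def fits(pattern, text):
    """Return True if the whole text is matched by the pattern."""
    return re.fullmatch(pattern, text) is not None

# "?" allows 0 or 1, "*" allows 0 or more, "+" allows 1 or more.
print(fits(r"a?", ""), fits(r"a?", "a"), fits(r"a?", "aa"))
print(fits(r"a*", ""), fits(r"a*", "aaa"))
print(fits(r"a+", ""), fits(r"a+", "aaa"))

# One expression now covers every footer from page 1 to page 423.
footer = r"Page [0-9]+ of 423"
all_pages = all(fits(footer, f"Page {n} of 423") for n in range(1, 424))
print(all_pages)
```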
A note on these quantifiers: They generally try to match as much text as possible, so be careful when using them. This is called "greedy behaviour"- I'm sure you get why. It gets problematic when you, say, try to match a tag. Consider, for example, the string "<p class="calibre2">Title here</p>" and let's say you'd want to match the opening tag (the part between the first pair of angle brackets, a little more on tags later). You'd think that the expression

Code:

<p.*>

would match that tag, but actually, it matches the whole string! (The character "." is another special character. It matches anything except linebreaks, so, basically, the expression

Code:

.*

would match any single line you can think of.) Instead, try using

Code:

<p.*?>

which makes the quantifier "*" non-greedy. That expression would only match the first opening tag, as intended.
There's actually another way to accomplish this: The expression

Code:

<p[^>]*>

will match that same opening tag- you'll see why after the next section. Just note that there quite frequently is more than one way to write a regular expression.
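To compare the three expressions side by side, here is a minimal sketch in Python's re module, using the example string from above.

```python
import re

text = '<p class="calibre2">Title here</p>'

# Greedy: ".*" grabs as much as it can, up to the last ">".
greedy = re.search(r"<p.*>", text).group(0)

# Non-greedy: ".*?" stops at the first ">" it can.
lazy = re.search(r"<p.*?>", text).group(0)

# Complemented set: "[^>]*" cannot run past a ">" at all.
negated = re.search(r"<p[^>]*>", text).group(0)

print(greedy)
print(lazy)
print(negated)
```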

Well, these special characters are very neat and all, but what if I wanted to match a dot or a question mark?
You can of course do that: Just put a backslash in front of any special character and it is interpreted as the literal character, without any special meaning. This pair of a backslash followed by a single character is called an escape sequence, and the act of putting a backslash in front of a special character is called escaping that character. An escape sequence is interpreted as a single element. There are of course escape sequences that do more than just escaping special characters, for example "\t" means a tabulator. We'll get to some of the escape sequences later. Oh, and by the way, concerning those special characters: Consider any character we discuss in this introduction as having some function to be special and thus needing to be escaped if you want the literal character.
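A short sketch of escaping, assuming Python's re module and a made-up sample sentence. The re.escape function is part of Python rather than something Calibre shows you; it's included as a convenience for when you want a whole piece of text taken literally.

```python
import re

text = "Is this the end? No... not yet."

# Unescaped, "." matches any character, so every character is a hit.
any_char = re.findall(r".", text)

# Escaped, "\." and "\?" match only the literal dot and question mark.
dots = re.findall(r"\.", text)
question_marks = re.findall(r"\?", text)

# re.escape adds the backslashes for you.
escaped = re.escape("end?")

print(len(any_char), len(dots), len(question_marks), escaped)
```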

So, what are the most useful sets?
Knew you'd ask. Some useful sets are

Code:

[0-9]

matching a single number,

Code:

[a-z]

matching a single lowercase letter,

Code:

[A-Z]

matching a single uppercase letter,

Code:

[a-zA-Z]

matching a single letter and

Code:

[a-zA-Z0-9]

matching a single letter or number. You can also use an escape sequence as shorthand:

Code:

\d is equivalent to [0-9]
\w is equivalent to [a-zA-Z0-9_]
\s is equivalent to any whitespace

("Whitespace" is a term for anything that won't be printed. These characters include space, tabulator, line feed, form feed and carriage return.) As a last note on sets, you can also define a set as any character but those in the set. You do that by including the character "^" as the very first character in the set. Thus,

Code:

[^a]

would match any character excluding "a". That's called complementing the set. Those escape sequence shorthands we saw earlier can also be complemented: "\D" means any non-digit character, thus being equivalent to

Code:

[^0-9]

The other shorthands can be complemented by, you guessed it, using the respective uppercase letter instead of the lowercase one. So, going back to the example

Code:

<p[^>]*>

from the previous section, now you can see that the character set it's using tries to match any character except for a closing angle bracket.
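The shorthands and complemented sets can be tried out with a sketch like this one, assuming Python's re module and made-up sample text. One caveat to be aware of: in Python 3, "\d" and "\w" also accept digits and letters from non-English scripts unless the ASCII flag is used, so the equivalences above hold exactly only for plain ASCII text like this sample.

```python
import re

text = "Page 12:\tcalibre_2 rocks!"

digits = re.findall(r"\d", text)    # single digits
spaces = re.findall(r"\s", text)    # single whitespace characters
words = re.findall(r"\w+", text)    # runs of letters, digits and underscores

# Complemented forms: anything that is NOT in the set.
not_lower = re.findall(r"[^a-z]", "ab1c!")
not_digits = re.findall(r"\D+", "12ab34")

print(digits, spaces, words, not_lower, not_digits)
```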

But if I had a few varying strings I wanted to match, things get complicated?
Fear not, life still is good and easy. Consider this example: The book you're converting has "Title" written on every odd page and "Author" written on every even page. Looks great in print, right? But in ebooks, it's annoying. You can group whole expressions in normal parentheses, and the character "|" will let you match either the expression to its right or the one to its left. Combine those and you're done. Too fast for you? Okay, first off, we group the expressions for odd and even pages, thus getting

Code:

(Title)
(Author)

as our two needed expressions. Now we make things simpler by using the vertical bar ("|" is called the vertical bar character): If you use the expression

Code:

(Title|Author)

you'll either get a match for "Title" (on the odd pages) or you'd match "Author" (on the even pages). Well, wasn't that easy?
You can of course use the vertical bar without using grouping parentheses, as well. Remember when I said that quantifiers repeat the element preceding them? Well, the vertical bar works a little differently: The expression "Title|Author" will also match either the string "Title" or the string "Author", just as the above example using grouping. The vertical bar selects between the entire expression preceding and following it. So, if you wanted to match the strings "Calibre" and "calibre" and wanted to select only between the upper- and lowercase "c", you'd have to use the expression "(c|C)alibre", where the grouping ensures that only the "c" will be selected. If you were to use "c|Calibre", you'd get a match on the string "c" or on the string "Calibre", which isn't what we wanted. In short: If in doubt, use grouping together with the vertical bar.
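The difference between the grouped and ungrouped vertical bar shows up in a sketch like this, assuming Python's re module. The variable names are only for illustration.

```python
import re

# Either alternative is accepted wherever it occurs.
headers = re.findall(r"(Title|Author)", "Title Author Title")
print(headers)

grouped = r"(c|C)alibre"   # choose between "c" and "C", then "alibre"
ungrouped = r"c|Calibre"   # choose between "c" and "Calibre"

grouped_ok = re.fullmatch(grouped, "calibre") is not None
ungrouped_ok = re.fullmatch(ungrouped, "calibre") is not None
ungrouped_c = re.fullmatch(ungrouped, "c") is not None

print(grouped_ok, ungrouped_ok, ungrouped_c)
```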

You missed something. In the beginning, you said there was a way to make a regular expression case insensitive?
Yes, I did, thanks for paying attention and reminding me. You can tell Calibre how you want certain things handled by using something called flags. You include flags in your expression by using the special construct

Code:

(?flags go here)

where, obviously, you'd replace "flags go here" with the specific flags you want. For ignoring case, the flag is "i", thus you include "(?i)" in your expression. Thus,

Code:

(?i)test

would match "Test", "tEst", "TEst" and any case variation you could think of.
Another useful flag is "s", which lets the dot match any character at all, including the newline. If you want to use multiple flags in an expression, just put them in the same statement: "(?is)" would ignore case and make the dot match all. It doesn't matter which flag you state first, "(?si)" would be equivalent to the above. By the way, you should put flags at the very beginning of your expression. That way, they don't get mixed up with anything else.
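Both flags can be seen in action in a minimal sketch like this, assuming Python's re module and made-up sample text.

```python
import re

# "(?i)" ignores case for the whole expression.
hits = re.findall(r"(?i)test", "Test tEst TEST toast")
print(hits)

two_lines = "<p>first\nsecond</p>"

# Without the "s" flag the dot stops at the linebreak, so nothing matches.
without_s = re.search(r"<p>.*</p>", two_lines)

# With "(?is)" the dot crosses the linebreak and case is ignored as well.
with_is = re.search(r"(?is)<P>.*</P>", two_lines)

print(without_s is None, with_is.group(0) == two_lines)
```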

When I'm using one of the Search and Replace features, I'd like to refer back to a part I've previously matched. Is there a way to do that?
What you want to do is called backreferencing, and there are actually two ways to do that: "normal" backreferencing and named backreferencing.
Normal backreferencing is easy: Every group in your regular expression is numbered in the order in which it appears, and you refer back to a group with a backslash followed by the group's number, e.g. \1 for the first group. Example? Sure: Say you have some book entries with the author in LN, FN (last name, first name) order and you want to change them to FN LN. Say one of the entries is "Asimov, Isaac". Then you could search for

Code:

(\w+), (\w+)

and replace with

Code:

\2 \1

and your author entries would be changed to "Isaac Asimov".
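Outside of Calibre's Search and Replace, the same search and replace pair can be sketched with Python's re.sub function. Calibre's dialogs take the two expressions directly, so the function call here is only an assumption for illustration.

```python
import re

search = r"(\w+), (\w+)"
replace = r"\2 \1"   # second group, a space, then the first group

swapped = re.sub(search, replace, "Asimov, Isaac")
print(swapped)
```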
Named backreferencing can be useful in more complex situations where you're liable to lose track of which group number contains what. Named groups are defined by using "(?P<id>...)", where "id" is the name of the group and "..." stands for the expression inside the group. They can be referred back to using "\g<id>". Let's again consider the above example. You can define named groups by using the expression

Code:

(?P<authorlast>\w+), (?P<authorfirst>\w+)

and then use them in the replace text with

Code:

\g<authorfirst> \g<authorlast>
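As a sketch of the named version in Python's re module (again, re.sub stands in for Calibre's Search and Replace here):

```python
import re

search = r"(?P<authorlast>\w+), (?P<authorfirst>\w+)"
replace = r"\g<authorfirst> \g<authorlast>"

# Named groups can be read by name from a match...
match = re.fullmatch(search, "Asimov, Isaac")
last_name = match.group("authorlast")

# ...and referred to by name in the replace text.
swapped = re.sub(search, replace, "Asimov, Isaac")

print(last_name, "|", swapped)
```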

Well, that just about concludes the very short introduction to regular expressions. Hopefully I've shown you enough to at least get you started and to enable you to continue learning by yourself. A good starting point would be the Python documentation for regular expressions.
One last word of warning, though: Regular expressions are powerful, but also really easy to get wrong. Calibre provides really great testing possibilities to see if your expressions behave as you expect them to. Use them. Try not to shoot yourself in the foot. But should you, despite the warning, injure your foot (or any other body part), try to learn from it.

Credits:
Thanks for helping with tips, corrections and such:

ldolse
kovidgoyal
chaley
dwanthny
kacir
Starson17
Skeeve

Last edited by Manichean; 04-15-2014 at 05:15 AM. Reason: Corrected placement of flags in expression (noted by Skeeve)
