02-27-2011, 03:36 PM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
RegEx: Removing Page Numbers that have Spaces
I've tried my best to solve a problem I'm currently facing and I can't figure it out.
I'm going from PDF to ePub and I'm trying to remove page numbers. I've gone into search and replace in the PDF and I've found the following:
<hr> <A name=258></a>[page numbers with spaces] <br>
So every time a page number is listed, its digits are separated by spaces and followed by two trailing spaces. For example, an actual entry is as follows:
<hr> <A name=258></a>2 4 8 <br>
I can't figure out how to have Calibre simply find those page numbers and remove them. What I want it to do is either:
1. Search for the </a> and <br> and ignore what's in between them. That way it doesn't matter how many digits and spaces are between those two tags.
2. Tell Calibre to search for anything that has one digit, OR two digits, OR three digits. That'll get rid of everything.
I've come up with this, which clearly doesn't work:
<hr>\n<A name=\d{1,3}></a>\d\s\d\s\d\s\s<br>
The only problem is that it will only match entries with exactly three digits. I don't know how to make it search for one digit, or two, or three, or X. The <A name=???> part is easy because there are no spaces, but once spaces are introduced I can't wrap my head around it. Any help would be awesome!
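For anyone who wants to reproduce the problem outside Calibre, here is a minimal sketch in Python (whose re module Calibre's search and replace is built on). The sample markup is made up for illustration, modeled on the entry above with two trailing spaces; only the pattern is taken from the post.

```python
import re

# Sample markup modeled on the example in the post. The one- and two-digit
# entries are invented for illustration.
html = (
    "<hr>\n<A name=7></a>7  <br>\n"
    "<hr>\n<A name=42></a>4 2  <br>\n"
    "<hr>\n<A name=258></a>2 4 8  <br>\n"
)

# The pattern from the post: exactly three digits, each followed by one
# whitespace character, then one more whitespace character before <br>.
fixed = re.compile(r"<hr>\n<A name=\d{1,3}></a>\d\s\d\s\d\s\s<br>")

matches = fixed.findall(html)

# Only the three-digit entry is found; the shorter ones are skipped
# because the pattern insists on three \d tokens.
print(len(matches))
```

The \d{1,3} quantifier only applies to the name attribute. The visible page number is spelled out token by token, so its length is fixed.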
02-27-2011, 03:40 PM | #2 |
Junior Member
Posts: 2
Karma: 10
Join Date: Feb 2011
Device: Kindle
|
Well, I tried .* and that seemed to work:
<hr>\n<A name=\d{1,3}></a>.*<br>
I'm not sure why it works, so would anyone be able to explain? My understanding is that . matches any one character and * matches any of the previous character, so .* would be saying "please match any characters".
Last edited by captainslow; 02-27-2011 at 03:42 PM.
02-27-2011, 04:14 PM | #3 |
Wizard
Posts: 3,130
Karma: 91256
Join Date: Feb 2008
Location: Germany
Device: Cybook Gen3
|
Using .* can be a little dangerous, as the asterisk on its own tries to match as much text as possible (this behaviour is called greedy). Try using .*? instead, which will match as little text as possible. You could also just use a character set, for example:
Code:
<hr>\n<A name=\d{1,3}></a>[0-9 ]+<br>
A note on your understanding of ".*": The dot matches any character (except for the newline, which requires a flag), and the asterisk matches 0 or more repetitions of the previous expression. If you use the plus sign as the quantifier, you'll match 1 or more of the previous expression, and the question mark matches 0 or 1 of the previous expression (except when used after another quantifier, as in .*? above, where it makes that quantifier non-greedy).
Edit: I guess a still safer way to write the expression would be
Code:
<hr>\s+<A name=\d{1,3}></a>[0-9 ]+<br>
Last edited by Manichean; 02-27-2011 at 04:17 PM.
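To see the difference between the greedy, non-greedy, and character-set versions, here is a minimal sketch in Python. The sample string is made up: it assumes two page-number entries end up on the same line with body text between them, which is the case where greedy matching does damage. The variable names are only for illustration.

```python
import re

# Two invented page-number entries on ONE line, with body text in between.
# A space follows each <hr>, which the \s+ form of the pattern accepts.
html = (
    "<hr> <A name=12></a>1 2  <br>Some text."
    "<hr> <A name=13></a>1 3  <br>More text."
)

greedy = r"<hr>\s+<A name=\d{1,3}></a>.*<br>"
lazy = r"<hr>\s+<A name=\d{1,3}></a>.*?<br>"
char_set = r"<hr>\s+<A name=\d{1,3}></a>[0-9 ]+<br>"

# Greedy .* runs to the LAST <br> on the line, so "Some text." is deleted too.
greedy_out = re.sub(greedy, "", html)

# Non-greedy .*? stops at the first <br> after each anchor.
lazy_out = re.sub(lazy, "", html)

# The character set can only consume digits and spaces, so it cannot
# run past the page number in the first place.
set_out = re.sub(char_set, "", html)

print(repr(greedy_out))  # 'More text.'
print(repr(lazy_out))    # 'Some text.More text.'
print(repr(set_out))     # 'Some text.More text.'
```

When every entry sits on its own line, all three behave the same, because the dot does not cross newlines by default. That is likely why plain .* appeared to work.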