RegEx: Removing Page Numbers that have Spaces

captainslow · 02-27-2011, 03:36 PM

I've tried my best to solve a problem I'm currently facing and I can't figure it out.

I'm going from PDF to ePub and I'm trying to remove page numbers. I've gone into search and replace in the PDF and I've found the following:

<hr>
<A name=258></a>[page numbers with spaces]  <br>

So every time the page number is listed the numbers are separated with spaces along with two trailing spaces. For example, an actual entry is as follows:

<hr>
<A name=258></a>2 4 8  <br>

I can't figure out how to have Calibre simply find those page numbers and remove them. What I want it to do is either:

1. Search for the </a> and <br> and ignore what's in between them. That way it doesn't matter how many digits and spaces are in between those two tags.

2. Tell Calibre to search for anything that has one digit, OR two digits, OR three digits. That'll get rid of everything.

I've come up with this that clearly doesn't work:

<hr>\n<A name=\d{1,3}></a>\d\s\d\s\d\s\s<br>

The only problem with that is that it will only search for entries that it finds with three digits. I don't know how to make it search for one digit, or two, or three, or X.

The <A name=???> is easy because there are no spaces but once spaces are introduced I can't wrap my head around it. Any help would be awesome!

captainslow · 02-27-2011, 03:40 PM

Well, I tried .* and that seemed to work:

<hr>\n<A name=\d{1,3}></a>.*<br>

I'm not sure why it works so would anyone be able to explain?

My understanding is that . matches any one character and * matches any of the previous character so .* would be saying "please match any characters"

Manichean · 02-27-2011, 04:14 PM

Using .* can be a little dangerous, as the asterisk on its own tries to match as much text as possible. Try using .*? which will match as little text as possible (the other behaviour is called greedy). You could also just use a character set, for example, using

Code:

<hr>\n<A name=\d{1,3}></a>[0-9 ]+<br>

(Notice the space in the set) should work.
A note on your understanding of ".*": The dot matches any character (except for the newline, which requires a flag), and the asterisk extends matching of the previous expression by matching 0 or more of the previous expression. If you use the plus sign as quantifier, you'll match 1 or more of the previous expression, and the question mark matches 0 or 1 of the previous expression (except when used after another quantifier like above).
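The difference between greedy and lazy matching is easier to see on a small example. The sketch below uses Python's `re` module, on the assumption that its behaviour matches the engine behind Calibre's search and replace; the `sample` string is made up for illustration and puts two page-number entries on one line.

```python
import re

# Hypothetical sample: two page-number entries on a single line, so a
# greedy match has room to run past the first <br>.
sample = "<A name=1></a>1  <br>Real text<A name=2></a>2  <br>"

# Greedy: .* grabs as much as it can, so the match runs to the LAST <br>
# and swallows "Real text" along the way.
greedy = re.findall(r"</a>.*<br>", sample)

# Lazy: .*? stops at the first <br> it can reach.
lazy = re.findall(r"</a>.*?<br>", sample)

# Character set: only digits and spaces are allowed between the tags,
# so the match cannot cross into real text at all.
charset = re.findall(r"</a>[0-9 ]+<br>", sample)

print(greedy)   # one long match
print(lazy)     # two short matches
print(charset)  # the same two short matches

# The dot does not match a newline unless the DOTALL flag is set.
print(re.search(r"a.b", "a\nb"))             # None
print(bool(re.search(r"a.b", "a\nb", re.S)))  # True
```

The character set is the strictest of the three: if something other than digits and spaces sits between `</a>` and `<br>`, it simply does not match, which is usually what you want when deleting text.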

Edit: I guess a still safer way to write the expression would be

Code:

<hr>\s+<A name=\d{1,3}></a>[0-9 ]+<br>

(Notice the \s+ after the <hr>) since that will match differently encoded line breaks, while your expression only matches line breaks encoded as a single newline. Might be academic, though.
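As a rough check of the two expressions from this post, here is a minimal Python sketch, again assuming Python's `re` behaves like Calibre's engine. The `sample` text and the `strip_page_numbers` helper are hypothetical; the sample mixes one-, two- and three-digit page numbers and uses a Windows-style `\r\n` line break for the second entry.

```python
import re

# The safer expression: \s+ accepts any run of whitespace after <hr>.
PAGE_NUMBER = re.compile(r"<hr>\s+<A name=\d{1,3}></a>[0-9 ]+<br>")

# The earlier expression: requires exactly one \n right after <hr>.
NEWLINE_ONLY = re.compile(r"<hr>\n<A name=\d{1,3}></a>[0-9 ]+<br>")


def strip_page_numbers(html: str) -> str:
    """Remove every page-number entry matched by PAGE_NUMBER."""
    return PAGE_NUMBER.sub("", html)


# Hypothetical input with 1, 2 and 3 digit page numbers.
sample = (
    "end of page<hr>\n<A name=7></a>7  <br>next page "
    "more<hr>\r\n<A name=42></a>4 2  <br>and more "
    "last<hr>\n<A name=258></a>2 4 8  <br>done"
)

print(len(NEWLINE_ONLY.findall(sample)))  # misses the \r\n entry
print(len(PAGE_NUMBER.findall(sample)))   # finds all three entries
print(strip_page_numbers(sample))
```

Because `[0-9 ]+` accepts any mix of digits and spaces, the same expression covers page numbers of any length without spelling out each digit as `\d\s`.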

