Problem with regular expressions

Manichean · 01-12-2010, 07:03 AM

I'm having some trouble writing a regular expression to delete page headers in the conversion options. The page header I'm trying to delete basically looks like

Code:

<p class="calibre1">
Title</p><p class="calibre1">
Page 42 of 230</p>

so I figured the regexp needed should look like

Code:

Title</p><p class="calibre1">\nPage [0-9]* of [0-9]*

to match the part from "Title" to the total page number, which is what I want to remove. Now, this works fine if I just use the part up to "\n" or the part after it, which matches the first or the second line I want removed, respectively. But as soon as I try to cobble the two lines together, I don't get any match. I've tried every variation of \n, \s and so forth that I could think of, including slapping some * and ? behind it and fooling around with groups, but nothing seems to work.
Seeing as I've never used regular expressions before and just skimmed over the Calibre user manual to piece it together, I'm sure there's something I'm missing, but I can't figure out what it is. What I can figure out is that I somehow don't get how to match a newline. Could anyone help?

kovidgoyal · 01-12-2010, 10:31 AM

Try

Title</p><p class="calibre1">[^<]+</p>
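
For anyone who wants to try this suggestion outside Calibre: written out in full with the surrounding tags, the pattern is `Title</p><p class="calibre1">[^<]+</p>`. Calibre uses Python-style regular expressions, so a minimal Python sketch can check it. The sample text is rebuilt from the header quoted in the first post plus an invented following paragraph, and the variable names are only illustrative.

```python
import re

# Sample rebuilt from the header in the first post; the last paragraph is invented.
html = (
    '<p class="calibre1">\n'
    'Title</p><p class="calibre1">\n'
    'Page 42 of 230</p>\n'
    '<p class="calibre1">The man from the east</p>'
)

# [^<]+ is a negated character class: it matches any character except "<",
# and that includes the newline between the opening tag and "Page".
pattern = re.compile(r'Title</p><p class="calibre1">[^<]+</p>')

match = pattern.search(html)
cleaned = pattern.sub('', html)

print(repr(match.group(0)))
print(cleaned)
```

The point of the sketch is that no explicit `\n` is needed here, because the negated class crosses the line break on its own.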

Manichean · 01-26-2010, 08:55 AM

Unfortunately, that doesn't work. Same problem, it just gets confused about the linebreak.
I thought about maybe passing a flag that the string it should match is on multiple lines, but I don't know how to do this and currently, I'm too busy to figure it out. I'll post again once I find a solution.

Archon · 02-02-2011, 10:21 PM

I messed with this a little. I don't know exactly what you are looking for, but here is what I have. This should only match on a number followed by a </p> followed by the end of the line.
Search:
of \d+</p>$
Replace:
</p>

So this will find the last of your three lines, the one with the page number followed by the </p> at the end of the line, and then replace the match with only the </p>. It looks like this before:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
sides, and above the jacket collar behind, uncombed. Both beards were short and scant.

<p class="calibre1">
Title</p><p class="calibre1">
Page 42 of 230</p>

The man from the east wore a standard straight sword, the plastic
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

And now after:
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>

<p class="calibre1">
Title</p><p class="calibre1">
Page 42 </p>
<<<<<<<<<<<<<<<<<<<<<<<<<<<<

This just removes the "of XXX" page numbering part.
Is that what you were after?

Archon
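
A small Python sketch of this search and replace, using the search `of \d+</p>$` and the replacement `</p>`. One assumption to flag: in Python, `$` only matches at the end of every line when the `re.MULTILINE` flag is set. How Calibre compiles the expression is not stated in this thread, so the sketch shows both cases rather than claiming either one is what Calibre does.

```python
import re

# The "before" sample from the post above, as one string.
html = (
    '<p class="calibre1">\n'
    'Title</p><p class="calibre1">\n'
    'Page 42 of 230</p>\n'
    'The man from the east wore a standard straight sword, the plastic'
)

pattern = r'of \d+</p>$'

# Without MULTILINE, "$" only matches at the very end of the string,
# so nothing in the middle of the text is replaced.
without_flag = re.sub(pattern, '</p>', html)

# With MULTILINE, "$" also matches just before each newline.
with_flag = re.sub(pattern, '</p>', html, flags=re.MULTILINE)

print(without_flag == html)
print(with_flag)
```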

ldolse · 02-02-2011, 11:39 PM

Did you try this:

Code:

Title</p><p class="calibre1">\s*Page\s*[0-9]+\s*of\s*[0-9]+

Is it showing up correctly as matching in the regex wizard, but not actually being removed during conversion? Usually when this happens it's one of two things: either there are non-breaking spaces hiding amongst the real spaces, or there is a bug/limitation where Calibre is showing you HTML in the wizard that's not exactly the same as the HTML that is provided to the Search and Replace feature during conversion.

Edit:
Note that if non-breaking spaces are your problem, you can create a character class to include them. Instead of \s*, use this: [\s\u00a0]*
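
A hedged Python sketch of the non-breaking-space case, with the class written using the `\u` escape as `[\s\u00a0]*`. The sample header with U+00A0 characters is invented, and `header_pattern` is a hypothetical helper. In Python 3, `\s` on ordinary text patterns already matches U+00A0, so the sketch uses `re.ASCII` to stand in for an engine where `\s` covers only ASCII whitespace. Whether a given Calibre version behaves that way is an assumption, not something this thread confirms.

```python
import re

# Invented sample: non-breaking spaces (U+00A0) mixed into the header.
html = 'Title</p><p class="calibre1">\n\u00a0Page\u00a042 of\u00a0230</p>'


def header_pattern(ws):
    """Build the header regex with `ws` as the whitespace sub-pattern (hypothetical helper)."""
    return (r'Title</p><p class="calibre1">' + ws + r'Page' + ws
            + r'[0-9]+' + ws + r'of' + ws + r'[0-9]+')


# re.ASCII restricts \s to ASCII whitespace, so U+00A0 is not matched.
ascii_match = re.search(header_pattern(r'\s*'), html, flags=re.ASCII)

# Adding \u00a0 to the character class makes the match succeed again.
nbsp_match = re.search(header_pattern(r'[\s\u00a0]*'), html, flags=re.ASCII)

# Python 3 default: \s is Unicode-aware and already covers U+00A0.
unicode_match = re.search(header_pattern(r'\s*'), html)

print(ascii_match)
print(repr(nbsp_match.group(0)))
print(unicode_match is not None)
```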

Perkin · 02-03-2011, 03:40 AM

How about a nice simple

Quote:

Title.*(?=)

Which should match 'Title' and everything up to but not including the next ''

Manichean · 02-03-2011, 03:57 AM

You people do realize that this thread is about a year old? I solved that issue quite some time ago. (The solution was me stopping to be stupid, by the way.)

Perkin · 02-03-2011, 05:31 AM

I thought it was odd that you, who did the regex FAQ/guide, couldn't manage it.
I did look at the date, and thought the original post was from December.

Archon · 02-03-2011, 05:34 AM

Quote:

You people do realize that this thread is about a year old? I solved that issue quite some time ago. (The solution was me stopping to be stupid, by the way.)

It's never too late to help a brother out. :-)

BTW what was your solution (besides stopping being stupid as you say)?

Maybe we could all learn from your experience.

Archon

Manichean · 02-03-2011, 05:42 AM

The problem was that I didn't use the regex wizard to test it, basically. I tried to use Notepad++, which doesn't allow for multiline regex matching. (I only found that out while writing the guide, actually.) The reason I did that was that I felt Notepad++ would be faster than Calibre, and I didn't fully understand the wizard. Also, had I known about character classes, especially \s, I might have found a solution sooner.
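
To illustrate the point about multiline matching, here is a minimal Python sketch. The per-line loop is only a stand-in for a tool that searches one line at a time; it is an assumption made for illustration, not a description of how Notepad++ works internally. The pattern is the `\s*` version posted earlier in the thread.

```python
import re

# The header from the first post, spread over three lines.
html = (
    '<p class="calibre1">\n'
    'Title</p><p class="calibre1">\n'
    'Page 42 of 230</p>\n'
)

pattern = re.compile(
    r'Title</p><p class="calibre1">\s*Page\s*[0-9]+\s*of\s*[0-9]+'
)

# Searching line by line: no single line contains both "Title" and "Page",
# so the pattern can never match.
per_line_hits = [line for line in html.splitlines() if pattern.search(line)]

# Searching the whole text at once: \s* matches the newline between the lines.
whole_text_hit = pattern.search(html)

print(per_line_hits)
print(repr(whole_text_hit.group(0)))
```

This is why testing in the regex wizard, which sees the whole document, gives a different answer from a line-oriented search.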

Archon · 02-03-2011, 02:27 PM

Thanks for your wisdom.

I will pass that along to my PeeCee using mates.

Archon
