Regular Expression Help

smartmart · 10-16-2010, 09:46 AM

Hi, I've converted a PDF to AZW with the Amazon service, and now I want to convert it to MOBI with Calibre (so I can add metadata and a TOC).

I have a problem with chapter recognition. Every chapter starts with "Chapter XXX.", so my XPath expression is:
//*[re:test(., "chapter", "i")]

It works with the original PDF but not with the AZW.
It matches the word "chapter" in the body text (the word sometimes appears in the script), but it doesn't match the real chapter headings.

So I saved the debug output and saw that the chapter titles are not inside an HTML tag (the text is still a child of the body tag, of course):
<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>

Is this the problem?
How can i fix it?

Thx

desertgrandma · 10-16-2010, 10:34 AM

Welcome to MobileRead, smartmart

Help should be arriving soon.

BookGnome · 10-16-2010, 11:17 AM

Quote:

Originally Posted by smartmart

So I saved the debug output and saw that the chapter titles are not inside an HTML tag (the text is still a child of the body tag, of course):
<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>

I'm not sure how you need to specify it with Calibre's custom syntax, but your regex itself is flawed. Here's a working regex in Python:

Code:

>>> import re
>>> myString = '<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>'
>>> re.findall(r'Chapter \d+', myString, re.I)
['CHAPTER 1']

A lot depends on how consistent the input file is, but this should catch any instance of the word 'chapter' followed by a space and one or more digits, without regard to case. How to wrap that in Calibre's regex DSL is a question for the Calibre gurus.

ldolse · 10-16-2010, 12:22 PM

I'm terrible with xpath, but I have a hunch you're screwed trying to search for text just free floating throughout the book in the body tag.
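A minimal sketch of why that hunch is plausible, using Python's standard-library ElementTree rather than Calibre's own XPath machinery (so it illustrates the document structure, not Calibre's behaviour). An XPath step like //* selects elements only, and in the sample markup the heading is bare text sitting between two <p> elements. ElementTree exposes that text as the "tail" of the preceding <p>, so no element holds it as its own content:

```python
import xml.etree.ElementTree as ET

# Sample markup from the first post, wrapped in a <body> element.
body = ET.fromstring(
    "<body><p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p></body>"
)

# The only child elements are the two <p> tags; neither contains the heading.
for elem in body:
    print(elem.tag, repr(elem.text), repr(elem.tail))

# The heading is stored as trailing text after the first <p>.
first_p = body[0]
heading_tail = first_p.tail
print(heading_tail)
```

The heading only becomes something an element-based expression can target once it is wrapped in a tag of its own.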

Your best bet is to take the HTML from the debug output and do a find-and-replace in a text editor that supports regex search/replace.

Search for this:

Code:

(Chapter\s+\d+)

and replace it with this:

Code:

<h2>\1</h2>

Depending on the editor you use it might be $(1), or $1, or whatever instead of '\1' as I used above - check the documentation for your editor.

Then import the edited HTML file into Calibre, and have Calibre convert from the zipped HTML source instead of the PDF. Calibre's default chapter detection XPath will automatically pick up the chapters if your search and replace properly wrapped the chapter titles in <h2> tags.
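If you would rather script the search and replace than use a text editor, the same substitution can be sketched with Python's re module. The function name wrap_chapters and the sample string are only illustrative, and the case-insensitive flag is an assumption added here so that the all-caps "CHAPTER 1" from the first post is matched:

```python
import re

# "Chapter", whitespace, then one or more digits, in any letter case.
# The IGNORECASE flag is an assumption, not part of the pattern given above.
CHAPTER_RE = re.compile(r"(Chapter\s+\d+)", re.IGNORECASE)


def wrap_chapters(html: str) -> str:
    """Wrap every 'Chapter N' match in <h2> tags."""
    return CHAPTER_RE.sub(r"<h2>\1</h2>", html)


sample = "<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>"
result = wrap_chapters(sample)
print(result)
```

Note that this wraps every match, including a phrase like "chapter 3" in running text, so check the output before importing it into Calibre.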

smartmart · 10-17-2010, 04:07 AM

Quote:

Originally Posted by BookGnome

I'm not sure how you need to specify it with Calibre's custom syntax, but your regex itself is flawed. Here's a working regex in Python:

Code:

>>> import re
>>> myString = '<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>'
>>> re.findall(r'Chapter \d+', myString, re.I)
['CHAPTER 1']

A lot depends on how consistent the input file is, but this should catch any instance of the word 'chapter' followed by a space and one or more digits, without regard to case. How to wrap that in Calibre's regex DSL is a question for the Calibre gurus.

I know, I only used a broad regex for testing purposes.

Thanks ldolse, but I'm looking for a solution within Calibre (if possible) so I can reuse the settings every time.

PS: I don't use Calibre's PDF to MOBI conversion because it fails at unwrapping the lines.
It seems that every page of the PDF becomes a paragraph.

ldolse · 10-17-2010, 05:19 AM

There is no solution in Calibre to do it the way you're trying to do it. Amazon is creating a really screwy mobi file, Calibre hasn't been programmed to handle that scenario, and it's unlikely to happen anytime soon.

If Calibre isn't unwrapping the lines when you convert from PDF to MOBI, your line unwrapping factor under PDF input is set incorrectly. It seems to be set to zero on a number of users' systems. Set the line unwrapping factor to 0.45; this is the default and generally gives the best results.



