10-16-2010, 10:46 AM | #1 |
Junior Member
Posts: 2
Karma: 10
Join Date: Oct 2010
Device: Kindle
|
Regular Expression Help
Hi, I've converted a PDF to AZW with the Amazon service, and now I want to convert it to MOBI with Calibre (so I can add metadata and a TOC).
I have a problem with chapter recognition. Every chapter starts with "Chapter XXX.", so my XPath expression is: //*[re:test(., "chapter", "i")] It works with the original PDF but not with the AZW: it matches the word "chapter" where it occurs in the body text (the word sometimes appears in the script), but it doesn't match the real chapter headings. So I saved the debug output and saw that the chapter headings are not inside an HTML tag of their own (the text is still a child of the body tag, of course): <p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p> Is this the problem? How can I fix it? Thanks |
10-16-2010, 11:34 AM | #2 |
Enjoying the show....
Posts: 14,270
Karma: 10462843
Join Date: Jun 2008
Location: Arizona
Device: A K1, Kindle Paperwhite, an Ipod, IPad2, Iphone, an Ipad Mini & macAir
|
Welcome to MobileRead, smartmart
Help should be arriving soon. |
10-16-2010, 12:17 PM | #3 | |
Voracious Reader
Posts: 4
Karma: 62
Join Date: Sep 2010
Device: Kindle
|
Finding chapters with a simple regex
Quote:
Code:
>>> import re
>>> myString = '<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>'
>>> re.findall(r'Chapter \d+', myString, re.I)
['CHAPTER 1']
|
10-16-2010, 01:22 PM | #4 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
I'm terrible with xpath, but I have a hunch you're screwed trying to search for text just free floating throughout the book in the body tag.
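To illustrate where the free-floating text ends up, here is a minimal sketch using Python's standard xml.etree.ElementTree. This is not Calibre's own XPath engine, and wrapping the fragment in <body> is an assumption made so that it parses. It only shows that the heading is not the content of any element of its own, so an expression that selects elements has no heading element to pick out.

```python
import xml.etree.ElementTree as ET

# Fragment from the debug output, wrapped in <body> (an assumption)
# so that it is well-formed and can be parsed.
html = '<body><p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p></body>'
body = ET.fromstring(html)

# Text held inside each <p> element: the heading is in neither of them.
own_text = [p.text for p in body.findall('p')]

# ElementTree stores text that follows an element's closing tag as that
# element's "tail". The heading is a bare text node directly under <body>.
tails = [p.tail for p in body.findall('p')]

print(own_text)
print(tails)
```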
Your best bet is to take the HTML from the debug info and do a find/replace in a text editor with regex search/replace support. Search for this (with case-insensitive matching turned on, since the headings appear as "CHAPTER 1"):
Code:
(Chapter\s+\d+)
Replace with this:
Code:
<h2>\1</h2>
Then import the edited HTML file into Calibre, and have Calibre convert using the zipped HTML source instead of the PDF. Calibre's default chapter detection XPath will automatically pick the chapters up if your search and replace wrapped the chapter headings in <h2> tags. |
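If you would rather script the search and replace than use a text editor, a minimal Python sketch of the same substitution might look like this. The sample string is the fragment from the first post; the variable names are only illustrative, and in practice you would read the HTML from the debug output instead.

```python
import re

# Sample fragment from the first post, standing in for the debug HTML.
html = '<p> foo foo foo</p>CHAPTER 1<p> foo foo foo </p>'

# Wrap every "Chapter <number>" in <h2> tags. re.I makes the match
# case-insensitive so that "CHAPTER 1" is caught too.
wrapped = re.sub(r'(Chapter\s+\d+)', r'<h2>\1</h2>', html, flags=re.I)

print(wrapped)
```

Note that this pattern will also wrap any "chapter 12" that happens to appear inside a sentence, so it is worth checking the result before converting.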
10-17-2010, 05:07 AM | #5 | |
Junior Member
Posts: 2
Karma: 10
Join Date: Oct 2010
Device: Kindle
|
Quote:
Thanks, Idosle, but I'm looking for a solution within Calibre (if one is possible) so I can reuse the settings every time. PS: I don't use Calibre's PDF-to-MOBI conversion because the line unwrapping fails: it seems that every page of the PDF becomes a single paragraph. Last edited by smartmart; 10-17-2010 at 05:30 AM. |
|
10-17-2010, 06:19 AM | #6 |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
There is no solution in Calibre to do it the way you're trying to do it. Amazon is creating a really screwy mobi file, Calibre hasn't been programmed to handle that scenario, and it's unlikely to happen anytime soon.
If you're seeing that Calibre isn't unwrapping the lines when you use it to convert from PDF to MOBI, it means that the line unwrapping factor under PDF input is set incorrectly. It seems to be set to zero on a number of users' systems. Set the line unwrapping factor to 0.45; this is the default and generally gives the best results. |
|