05-26-2010, 12:14 AM | #46 | |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
|
Quote:
|
|
05-26-2010, 01:12 AM | #47 |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Sure, vinco,
Gimme a d/l link and I'll take a stab at it... |
05-26-2010, 03:53 AM | #48 |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
Ok, so vinco sent me the pdf he's having trouble with, and I can confirm that the previously mentioned regexes, which highlight the proper matches in the tester, don't remove those matches when converting.
I have no idea why, but it's definitely a bug. Oddly, when I used the resultant epub (which still had the page numbers) as the input, and adjusted the regex to match the page numbers and surrounding tags in the epub, it correctly removed them in the output. (vinco, this is your temporary workaround.) So why is the syntax highlighter showing the regex matches, but the converter not removing them? |
05-26-2010, 10:58 AM | #49 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
|
Thanks for the assist, tonyx3. I'll put that workaround into force for now.
|
05-26-2010, 12:37 PM | #50 |
creator of calibre
Posts: 44,805
Karma: 25490602
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Because the wizard doesn't operate on exactly the same HTML as the conversion pipeline. Typically there are whitespace differences between the two.
|
05-26-2010, 01:14 PM | #51 |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
|
Other than doing a series of conversions, do you have any workaround suggestions? I can get a copy of the PDF to you as well if interested.
|
05-26-2010, 01:27 PM | #52 |
creator of calibre
Posts: 44,805
Karma: 25490602
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
use the debug option to look at the actual intermediate html generated by the conversion process.
|
05-26-2010, 08:53 PM | #53 | |
Junior Member
Posts: 7
Karma: 10
Join Date: May 2010
Device: Nook
|
Quote:
Code:
Since nothing material was destroyed when the Eddorians were forced into the next plane of<br>
existence, their historical records also have become available. Those records-folios and tapes and<br>
playable discs of platinum alloy, resistant indefinitely even to Eddore's noxious atmosphere agree with<br>
those of the Arisians upon this point. Immediately before the Coalescence began there was one, and only<br>
<b>Page 1</b><br>
<hr>
<A name=2></a>one, planetary solar system in the Second Galaxy; and, until the advent of Eddore, the Second Galaxy<br>
was entirely devoid of intelligent life. <br>
|
|
05-27-2010, 12:19 AM | #54 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
So in this example, it looks like the problem is that the wizard can't tell the difference between a regular space and a non-breaking space, right?
That would be a problem. A 'white space difference' as Kovid said. Quote:
Is there some reason for this? I mean, I'm sure there's some reason, but is it absolutely necessary? It seems like it would be better if we were able to write and test our regexes based on the code that the conversion pipeline actually uses, to avoid errors like this one. |
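A minimal sketch of the space vs. non-breaking space mismatch described above, in Python (calibre's patterns use Python regex syntax). It assumes the intermediate HTML holds `Page&nbsp;1` where the wizard preview shows `Page 1`. Both sample strings and the names `STRICT`, `TOLERANT` and `strip_page_markers` are made up for illustration and are not taken from calibre.

```python
import re

# Hypothetical samples: what the wizard preview shows vs. what the
# conversion pipeline might actually contain (an assumption, not calibre output).
wizard_html = "only<br>\n<b>Page 1</b><br>\n<hr>\n<A name=2></a>one, planetary"
pipeline_html = "only<br>\n<b>Page&nbsp;1</b><br>\n<hr>\n<A name=2></a>one, planetary"

# Matches only a literal space between "Page" and the number.
STRICT = re.compile(r"<b>Page \d+</b><br>\s*<hr>\s*")

# Accepts any run of whitespace characters or &nbsp; entities instead.
TOLERANT = re.compile(r"<b>Page(?:\s|&nbsp;)+\d+</b><br>\s*<hr>\s*")


def strip_page_markers(pattern, html):
    """Return html with every match of pattern removed."""
    return pattern.sub("", html)


# STRICT cleans the preview text but leaves the pipeline text untouched;
# TOLERANT cleans both.
cleaned_preview = strip_page_markers(STRICT, wizard_html)
untouched = strip_page_markers(STRICT, pipeline_html)
cleaned_pipeline = strip_page_markers(TOLERANT, pipeline_html)
```

Writing the pattern against the debug output, or loosening every space to something like `(?:\s|&nbsp;)+`, avoids depending on which variant the converter actually sees.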
|
05-27-2010, 01:18 AM | #55 |
creator of calibre
Posts: 44,805
Karma: 25490602
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Well yeah, but the conversion pipeline can't be run (for various technical reasons) inside the GUI, so the GUI basically uses a trick to use an approximation of the conversion pipeline. It works fine in most cases, where you don't have unusual input files, but in some cases, like this, the approximation isn't good enough.
I could of course run the conversion pipeline in a separate process and then take the output of that into the GUI, but that is too much work. I prefer to spend the time just improving the PDF engine so it removes headers and footers automatically. |
05-27-2010, 04:07 AM | #56 | |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
I see. So calibre uses two different pdf-to-html engines?
The one used in the conversion pipeline is obviously returning different results from the one used in the regex wizard. Quote:
Unfortunately, I've never once had the defaults work on removing headers or footers from PDFs. I've always had to write my own regex. And on multiple occasions I've had them match perfectly in the preview, and then not get removed in the conversion (which is one reason I wish the preview html matched the conversion html). I'm sure PDF conversion, given the format's nature, must be one of the bigger headaches in developing the conversion system. |
|
05-27-2010, 05:54 AM | #57 |
US Navy, Retired
Posts: 9,879
Karma: 13806776
Join Date: Feb 2009
Location: North Carolina
Device: Icarus Illumina XL HD, Kindle PaperWhite SE 11th Gen
|
I believe he is referring to improving a not yet released PDF engine. One which none of us has had a chance to try yet because it isn't finished.
Last edited by DoctorOhh; 05-27-2010 at 07:47 PM. |
05-27-2010, 07:02 AM | #58 |
Connoisseur
Posts: 55
Karma: 10
Join Date: Jan 2010
Device: Nexus One
|
|
06-12-2010, 12:54 AM | #59 |
Member
Posts: 13
Karma: 954
Join Date: Jun 2010
Device: Mobipocket reader on Blackberry, XO using FBreader, Kindle
|
Hi. I've been using Calibre for a few weeks and I'm really enjoying it.
I adopted a regular expression for Adding from this thread that does a great job for my files: Code:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)
Author name - Book title (htm).zip
So this gets imported with "Book title (htm)" as the title, rather than just "Book title", and then I have to merge things manually. The part in parentheses might be (htm), or (rtf), or (txt), etc.
I can get it to ignore the parentheses by adding a [(] to the end of my regex, but that breaks adding for files that don't have a ( in them. I'm new to regex, and I've done some reading of the reference suggested inside Calibre (which is how I learned enough to put my little addition on), but I've been trying unsuccessfully to figure out a way to use the | operator.
I'd be pleased with any solution that works and, if you have the time, a brief description of why it works. My expectation is that I want to match ( or nothing, but I'm not sure how to do the nothing. I.e., is there some way to tell it to start over if a match fails? Thanks in advance. |
06-12-2010, 08:50 AM | #60 | |
Wizard
Posts: 4,004
Karma: 177841
Join Date: Dec 2009
Device: WinMo: IPAQ; Android: HTC HD2, Archos 7o; Java:Gravity T
|
Quote:
Code:
^((?P<author>([^\_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+) ([-#] ?)?(?P<series_index>[0-9.]+)?\s*-\s*)?(?P<title>[^(]+)
Last edited by Starson17; 06-12-2010 at 09:12 AM. |
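On why a title group like `[^(]+` handles both cases: a negated character class consumes everything up to the first "(" and, if there is no "(", simply runs to the end of the name, so no alternation is needed. "( or nothing" can also be written as an optional group. The Python sketch below uses a stripped-down stand-in pattern, not the full expression above. The names `SIMPLE` and `parse` are hypothetical, and it assumes the file name is matched without its extension.

```python
import re

# Simplified stand-in for the full pattern: author, " - ", title,
# then an optional "(...)" suffix such as "(htm)" or "(rtf)".
SIMPLE = re.compile(
    r"^(?P<author>[^-]+?)\s*-\s*"   # author: shortest run before the hyphen
    r"(?P<title>[^(]+?)\s*"         # title: stops before any "("
    r"(?:\([^)]*\))?$"              # "( ... )" or nothing at all
)


def parse(name):
    """Return (author, title) for a file name given without its extension."""
    m = SIMPLE.match(name)
    return (m.group("author"), m.group("title")) if m else None


with_suffix = parse("Author name - Book title (htm)")
without_suffix = parse("Author name - Book title")
```

The trailing `?` after the parenthesised group is what expresses "or nothing": the group may match zero times, so names with and without the suffix go through the same pattern.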
|
Tags |
regex, regular expressions |
|