09-26-2008, 12:47 PM | #1 |
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
|
Regular Expression Help
Hi there
Here's my problem: I have a bunch of PDF files named like these examples:
Name Surname - Name of the Series 01 - Title of the Book.pdf
or
Name Surname - Title of the Book.pdf
For the first one I use this:
(?P<author>[^_]+) - (?P<series>[^_]+) (?P<series_index>[0-9]+) - (?P<title>.+)
And for the second example I use:
(?P<author>[^_]+) - (?P<title>.+)
The problem is that the parsing cuts off the last word, so the title comes out as "Title of the".
Also, is it possible to join those two expressions so the parser can tell whether or not there is a series part in the filename (xxx - xxx instead of xxx - xxx 3 - xxx)?
The other problem is that calibre looks inside the PDF for the title and author fields, and sometimes this produces garbled text. Is there a way to override this and use only the data parsed from the filename?
Thanks in advance for any advice.
P.S. Sorry for my subpar English |
09-26-2008, 01:09 PM | #2 |
creator of calibre
Posts: 44,530
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
just put question marks after each + sign
|
09-26-2008, 01:22 PM | #3 |
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
|
Thanks, that at least fixes the "Title of the" problem ;D
So now the expressions are: (?P<author>[^_]+) - (?P<series>[^_]+) (?P<series_index>[0-9]+) - (?P<title>.+) ? and (?P<author>[^_]+) - (?P<title>.+) ?
Is there no way to make a single one smart enough to skip the series and series index if the filename is xxx - xxx.pdf? I was looking into something like (?<!...) but I can't figure it out. Thanks anyway |
09-26-2008, 01:36 PM | #4 |
creator of calibre
Posts: 44,530
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
try enclosing the series part inside another group and make that group optional with {0,1}
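A minimal Python sketch of that suggestion, assuming the filename pattern is evaluated with Python's re syntax (the (?P<name>...) groups used in this thread point that way). The series part is wrapped in a non-capturing group made optional with {0,1}. The lazy +? quantifiers and the parse helper are illustrative choices, not calibre's own code.

```python
import re

# The series and its index sit inside one non-capturing group, (?: ... ),
# and {0,1} lets that whole group be skipped when no series is present.
PATTERN = re.compile(
    r"(?P<author>[^_]+?) - "
    r"(?:(?P<series>[^_]+?) (?P<series_index>[0-9]+) - ){0,1}"
    r"(?P<title>.+)"
)


def parse(name):
    """Return the named groups for a filename stem (extension already removed)."""
    match = PATTERN.fullmatch(name)
    return match.groupdict() if match else None


with_series = parse("Name Surname - Name of the Series 01 - Title of the Book")
without_series = parse("Name Surname - Title of the Book")
print(with_series)
print(without_series)
```

When the optional group does not take part in the match, its named groups come back as None rather than as empty strings.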
|
10-01-2008, 07:25 AM | #5 |
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
|
Ok, I'm going crazy...
This is the expression I have now: (?P<author>[^_]+) - *(?P<series>[^_]*) (?P<series_index>[0-9]*) -? (?P<title>[^_].+) ?
It recognizes:
Name Surname - Name of the Series 01 - Title of the Book.pdf
and
Name Surname -   Title of the Book.pdf (notice the 3 spaces after the - )
I can't, for the love of God, get rid of those leading spaces in the expression... Can anybody help? I don't ssssspeak sssspython well... |
10-01-2008, 01:51 PM | #6 |
creator of calibre
Posts: 44,530
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
replace the spaces in your expression with \s*
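A small Python sketch of the difference, assuming Python's re module. The two patterns below are simplified stand-ins for illustration, not the expression from the post above: a literal space matches exactly one space, while \s* matches any run of whitespace, including none.

```python
import re

# A literal " - " consumes exactly one space on each side of the dash.
LITERAL_SPACES = re.compile(r"(?P<author>.+?) - (?P<title>.+)")
# \s* consumes however much whitespace is actually there.
FLEXIBLE_SPACES = re.compile(r"(?P<author>.+?)\s*-\s*(?P<title>.+)")

name = "Name Surname -   Title of the Book"  # three spaces after the dash

literal_title = LITERAL_SPACES.match(name).group("title")
flexible_title = FLEXIBLE_SPACES.match(name).group("title")
print(repr(literal_title))   # the extra spaces end up inside the title
print(repr(flexible_title))  # the extra spaces are consumed by \s*
```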
|
10-02-2008, 06:04 AM | #7 |
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
|
At last
If anybody needs it, here it is: (?P<author>[^_-]+) -?\s*(?P<series>[^_0-9-]*)(?P<series_index>[0-9]*)\s*-\s*(?P<title>[^_].+) ? |
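For anyone who wants to try this expression outside calibre, here is a rough Python check, assuming Python's re module and filename stems with the extension already removed. The parse helper and the whitespace stripping are additions for illustration only.

```python
import re

# The expression from the post above, copied unchanged.
PATTERN = re.compile(
    r"(?P<author>[^_-]+) -?\s*(?P<series>[^_0-9-]*)"
    r"(?P<series_index>[0-9]*)\s*-\s*(?P<title>[^_].+) ?"
)


def parse(name):
    """Match from the start of the name and strip stray spaces from each field."""
    match = PATTERN.match(name)
    if match is None:
        return None
    return {key: value.strip() for key, value in match.groupdict().items()}


with_series = parse("Name Surname - Name of the Series 01 - Title of the Book")
without_series = parse("Name Surname - Title of the Book")
print(with_series)
print(without_series)
```

Because the character classes exclude hyphens and digits, an author with a hyphenated name or a series name containing a number would likely not parse as intended.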
06-15-2009, 11:54 AM | #8 |
Grand Sorcerer
Posts: 6,393
Karma: 12408443
Join Date: Jun 2009
Location: Madrid, Spain
Device: Kobo Clara/Aura One/Forma,XiaoMI 5, iPad, Huawei MediaPad, YotaPhone 2
|
Hi,
I'm new here and I'm trying to organize my library. I have a problem with the regexp: I'm not able to load the <series_index>; it gets loaded into the title instead. For example, if I have "[Women of the Otherworld-8]- Personal demon", it will put:
I'm not able to change it. |
06-17-2009, 06:12 PM | #9 |
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
|
Terise,
The problem is that your filenames are not in the exact format the regex is expecting. It wants to find " - " (space dash space) between the different fields, and your filename example does not use that exact field delimiter. It would work correctly if you had "Women of the Otherworld 8 - Personal Demon". I believe the brackets will also be a problem: "\s" matches only whitespace, so the brackets are not treated as delimiters and would be captured as part of a field. |
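If renaming the files is not an option, a pattern written for the bracketed layout is another route. This is only a sketch in Python's re syntax, assuming every filename looks like "[Series-N]- Title" with no author part; the BRACKETED pattern and the parse_bracketed helper are hypothetical names, not something from calibre.

```python
import re

# \[ and \] match the literal brackets; the lazy series group stops at the
# last hyphen that is directly followed by the index digits and "]".
BRACKETED = re.compile(
    r"\[(?P<series>[^\]]+?)-(?P<series_index>[0-9]+)\]\s*-\s*(?P<title>.+)"
)


def parse_bracketed(name):
    """Return series, series_index and title, or None if the layout differs."""
    match = BRACKETED.fullmatch(name)
    return match.groupdict() if match else None


result = parse_bracketed("[Women of the Otherworld-8]- Personal demon")
print(result)
```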
08-26-2009, 03:37 AM | #10 |
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2009
Device: ipod touch
|
Hi, I've had a search through a few of these threads to see if anyone has asked this question before, but I couldn't find it, so apologies if I missed it somewhere.
A lot of my filenames are in the following format:
Surname, Firstname - Title
or
Surname, Firstname - Series # - Title
My problem is that when I start the expression with the default (?P<author>[^_]+), it puts the author details in back to front and messes up the author sort as well. How do I go about reversing the surname and the first name in the expression so that the Author field is populated correctly?
I've looked at the guide for regular expressions, but it's a bit above my head at the moment, although I'm persevering to try and wrap my head around it. |
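A match only captures pieces of the filename in the order they appear, so the swap has to happen after matching. The Python sketch below shows the idea outside calibre: capture surname and first name separately, then join them in the other order. The group names surname, firstname and rest and the split_author helper are hypothetical, and whether calibre's filename pattern accepts anything like this is not established in this thread.

```python
import re

# Surname runs up to the comma; the first name runs up to the first dash.
AUTHOR_FIRST = re.compile(
    r"(?P<surname>[^,]+),\s*(?P<firstname>[^-]+?)\s*-\s*(?P<rest>.+)"
)


def split_author(name):
    """Return ("Firstname Surname", remainder), or None if there is no match."""
    match = AUTHOR_FIRST.fullmatch(name)
    if match is None:
        return None
    author = match.group("firstname") + " " + match.group("surname")
    return author, match.group("rest")


plain = split_author("Surname, Firstname - Title")
series = split_author("Surname, Firstname - Series 3 - Title")
print(plain)
print(series)
```

A hyphenated first name would cut the match short, since the first-name group stops at the first dash.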
08-26-2009, 03:07 PM | #11 | |
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2009
Device: Windows Mobile
|
Quote:
Here's my latest RegExp: Code:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)
I tried using something like the following to define multiple orderings, but I can't reuse a group name. Then again, with all the different formats the above RegExp can handle now, it would probably match anything with a reversed order anyway. Code:
((?P<author>...) - (?P<title>...))|((?P<title>...) - (?P<author>...)) |
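In Python's re module the reused group name is rejected when the pattern is compiled. One common workaround, sketched here under the assumption that you can run your own Python code, is to keep several complete patterns and try them in order. The .+? pieces stand in for the elided "..." parts above, and PATTERNS and parse are hypothetical names. Note this handles different field counts, not reversed orderings, which a regex cannot tell apart on its own.

```python
import re


def compiles(pattern):
    """Report whether Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Placeholder bodies (.+?) replace the elided parts of the post's example.
DUPLICATE = r"((?P<author>.+?) - (?P<title>.+))|((?P<title>.+?) - (?P<author>.+))"
duplicate_ok = compiles(DUPLICATE)  # group names are redefined, so this fails

# Workaround: one pattern per layout, tried from most to least specific.
PATTERNS = [
    re.compile(r"(?P<author>[^-]+?)\s*-\s*(?P<series>[^-]+?)\s+"
               r"(?P<series_index>[0-9]+)\s*-\s*(?P<title>.+)"),
    re.compile(r"(?P<author>[^-]+?)\s*-\s*(?P<title>.+)"),
]


def parse(name):
    """Return the groups of the first pattern that matches the whole name."""
    for pattern in PATTERNS:
        match = pattern.fullmatch(name)
        if match:
            return match.groupdict()
    return None


print(duplicate_ok)
print(parse("Name Surname - Series 01 - Title"))
print(parse("Name Surname - Title"))
```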
|
08-27-2009, 01:26 PM | #12 |
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2009
Device: iPhone
|
I've been looking through the forums trying to find an answer to this one:
I'd like my ePub files to use ONLY the title. I changed the expression on the advanced tab to (?P<title>.+) but it's still adding hyphens and the author name. I know I'm missing something, but what?
Lori |
08-27-2009, 03:18 PM | #13 | |
Reader
Posts: 85
Karma: 6124
Join Date: Jul 2009
Device: PRS-505
|
Quote:
Your expression: (?P<title>.+) is not quite specific enough. The dot "." acts as a wildcard character (it can match anything) and the plus "+" acts as a multiplier. So your expression says "Match any character one or more times, and put that into the 'title' container". It's just running a little rampant. Try something like this: Code:
(?P<title>.+?) -
(?P<title> This part says that anything in the parentheses is going to be put into a container called "title" that you can use later. Calibre uses this internally to populate the various fields in its database.
.+? This part says "Match any character, repeat that, but do it lazily". The question mark at the end makes a multiplier lazy, meaning that it will only match as much as it has to. Without the ?, the multiplier is greedy: it grabs as much as it can, and you usually end up matching far more than you wanted.
) - This closes the group, and then matches the following space and the dash after that. We need that dash as a way of saying "This isn't part of what I'm looking for", which is why we place it outside of the parentheses.
This expression works on my completely boring "Book Title - nothing important.txt" filename, but you'll need to see if it fits your needs. This expression will *only* work on file names where the book title is the first thing in the file name. I don't yet have enough experience with how file names for books are usually constructed.
Last edited by sircastor; 08-27-2009 at 03:25 PM. Reason: fixed for copying |
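The greedy versus lazy behaviour described above can be checked with a few lines of Python, assuming Python's re module. The second filename is made up for illustration, to show what happens when there is more than one " - " in the name.

```python
import re

simple = "Book Title - nothing important"
two_dashes = "Book Title - Part Two - nothing important"

# With no delimiter in the pattern, .+ swallows the whole name.
whole = re.match(r"(?P<title>.+)", simple).group("title")

# Lazy: stop at the first " - " that lets the match succeed.
lazy = re.match(r"(?P<title>.+?) - ", two_dashes).group("title")

# Greedy: run to the end, then back off only as far as the last " - ".
greedy = re.match(r"(?P<title>.+) - ", two_dashes).group("title")

print(whole)
print(lazy)
print(greedy)
```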
|
08-28-2009, 03:23 AM | #14 | |
Reader
Posts: 85
Karma: 6124
Join Date: Jul 2009
Device: PRS-505
|
Quote:
Unless I'm missing something, I would skip trying to get your expression to handle different orders. |
|
08-28-2009, 03:44 AM | #15 |
Liseuse Lover
Posts: 869
Karma: 1035404
Join Date: Jul 2008
Location: Netherlands
Device: PRS-505
|
Perhaps we should make a sticky of a regex thread (or make a "ask your regex question" thread) - I know there are always a lot of questions about it; it is such a superbly powerful filtering mechanism yet very daunting and confusing for beginners.
|
Tags |
regex, regular expressions |
|