09-26-2008, 12:47 PM   #1
Azhad
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
Regular Expression Help

Hi there

Here's my problem: I have a bunch of PDF files named like these examples:

Name Surname - Name of the Series 01 - Title of the Book.pdf

or

Name Surname - Title of the Book.pdf

For the first one I use this:

(?P<author>[^_]+) - (?P<series>[^_]+) (?P<series_index>[0-9]+) - (?P<title>.+)

And for the second example I use:
(?P<author>[^_]+) - (?P<title>.+)

The problem is that the parsing cuts off the last word, so the title comes out as "Title of the".

Also, is it possible to combine those two expressions so that the parser can tell whether the filename contains a series part or not (xxx - xxx instead of xxx - xxx 3 - xxx)?

The other problem is that calibre looks inside the PDF for the title and author fields, and sometimes this results in garbled text. Is there a way to override this and use only the data parsed from the filename?

Thanks in advance for any advice.

P.S.
Sorry for my subpar English.
09-26-2008, 01:09 PM   #2
kovidgoyal
creator of calibre
Posts: 44,530
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
just put question marks after each + sign
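
For illustration, a minimal Python sketch of what the question mark changes, assuming the pattern is evaluated with Python's re module (how calibre applies the pattern internally is not shown here). The trailing $ anchor is an addition for this sketch, not part of the original expressions: without it, a lazy group at the very end stops after a single character.

```python
import re

name = "Name Surname - Name of the Series 01 - Title of the Book"

# Greedy: [^_]+ grabs as much as possible, so the author runs up to the LAST " - ".
greedy = re.match(r"(?P<author>[^_]+) - (?P<title>.+)$", name)

# Lazy: [^_]+? grabs as little as possible, so the author stops at the FIRST " - ".
lazy = re.match(r"(?P<author>[^_]+?) - (?P<title>.+)$", name)

# Caveat: a lazy group at the very end with no anchor matches one character only.
unanchored = re.match(r"(?P<author>[^_]+?) - (?P<title>.+?)", name)

print(greedy.group("author"))
print(lazy.group("author"))
print(unanchored.group("title"))
```

The lazy form is what keeps the author field from swallowing the series part of the filename.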
09-26-2008, 01:22 PM   #3
Azhad
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
Thanks, that at least fixes the "Title of the" problem ;D

So now the expressions are:

(?P<author>[^_]+) - (?P<series>[^_]+) (?P<series_index>[0-9]+) - (?P<title>.+) ?

and

(?P<author>[^_]+) - (?P<title>.+) ?

Is there no way to make a single expression smart enough to skip the series and series index if the filename is xxx - xxx.pdf? I was looking into something like (?<!...) but I can't figure it out...

Thanks anyway
09-26-2008, 01:36 PM   #4
kovidgoyal
creator of calibre
Posts: 44,530
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
try enclosing the series part inside another group and make that group optional with {0,1}
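
One way this suggestion could look, as a sketch in plain Python rather than a tested calibre setting. The non-capturing wrapper (?: ... ), the lazy quantifiers and the ^ and $ anchors are assumptions added here so that one pattern handles both filename layouts from the first post.

```python
import re

# (?: ... ){0,1} wraps the whole "series + index + separator" part so it may be absent.
PATTERN = re.compile(
    r"^(?P<author>[^_]+?) - "
    r"(?:(?P<series>[^_]+?) (?P<series_index>[0-9]+) - ){0,1}"
    r"(?P<title>.+)$"
)

with_series = PATTERN.match("Name Surname - Name of the Series 01 - Title of the Book")
without_series = PATTERN.match("Name Surname - Title of the Book")

print(with_series.groupdict())
print(without_series.groupdict())  # series and series_index are None here
```

Writing {0,1} is equivalent to putting a single ? after the group.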
10-01-2008, 07:25 AM   #5
Azhad
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
OK, I'm going crazy...

This is the expression I have now:

(?P<author>[^_]+) - *(?P<series>[^_]*) (?P<series_index>[0-9]*) -? (?P<title>[^_].+) ?

It recognizes:
Name Surname - Name of the Series 01 - Title of the Book.pdf
and
Name Surname -   Title of the Book.pdf (notice the 3 spaces after the - )

I can't, for the love of God, get the expression to stop requiring those leading spaces...
Can anybody help?
I don't ssssspeak sssspython well...
10-01-2008, 01:51 PM   #6
kovidgoyal
creator of calibre
Posts: 44,530
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
replace the spaces in your expression with \s*
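
A small sketch of the effect, assuming plain Python re and a deliberately simplified author/title pattern (not the full expression from the thread): \s* matches zero or more whitespace characters, so the separator no longer depends on an exact number of spaces.

```python
import re

# \s*-\s* accepts any amount of whitespace (including none) around the dash.
PATTERN = re.compile(r"^(?P<author>[^_-]+?)\s*-\s*(?P<title>[^_]+?)\s*$")

for name in ("Name Surname - Title of the Book",
             "Name Surname -   Title of the Book",
             "Name Surname-Title of the Book "):
    m = PATTERN.match(name)
    print(repr(m.group("author")), repr(m.group("title")))
```

Note that excluding "-" from the author group is a trade-off of this sketch: a hyphenated author name would be split at its hyphen.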
10-02-2008, 06:04 AM   #7
Azhad
Member
Posts: 23
Karma: 10
Join Date: Dec 2007
Location: Rome, Italy
Device: PRS-500, PRS-505, Milestone, Galaxy Tab
At last!
If anybody needs it, here it is:

(?P<author>[^_-]+) -?\s*(?P<series>[^_0-9-]*)(?P<series_index>[0-9]*)\s*-\s*(?P<title>[^_].+) ?
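
To check what this expression captures, here is a minimal sketch that runs it unchanged through Python's re module. The parse helper and the strip() call are additions for the sketch; whether calibre trims the captured fields itself is an assumption not confirmed in this thread.

```python
import re

# The expression exactly as posted above.
PATTERN = re.compile(
    r"(?P<author>[^_-]+) -?\s*(?P<series>[^_0-9-]*)(?P<series_index>[0-9]*)"
    r"\s*-\s*(?P<title>[^_].+) ?"
)

def parse(name):
    m = PATTERN.match(name)
    # Strip stray whitespace that the character classes may capture.
    return {k: v.strip() for k, v in m.groupdict().items()} if m else None

print(parse("Name Surname - Name of the Series 01 - Title of the Book"))
print(parse("Name Surname - Title of the Book"))
```

Because the series group excludes digits and hyphens, a series name that itself contains a digit or a hyphen would likely be split in the wrong place.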
06-15-2009, 11:54 AM   #8
Terisa de morgan
Grand Sorcerer
Posts: 6,393
Karma: 12408443
Join Date: Jun 2009
Location: Madrid, Spain
Device: Kobo Clara/Aura One/Forma,XiaoMI 5, iPad, Huawei MediaPad, YotaPhone 2
Hi,

I'm new here and I'm trying to organize my library. I have a problem with the regexp: I'm not able to load the <series_index>; it ends up in the title.

For example, if I have "[Women of the Otherworld-8]- Personal demon", it produces:
  • series: [Women of the Otherworld
  • series_index: <empty>
  • title: 8]- Personal demon

I'm not able to change it.
06-17-2009, 06:12 PM   #9
Sabardeyn
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Terisa,

The problem is your filenames are not in the exact format that the regex is expecting. It wants to find " - " (space dash space) between the different fields. Your filename example does not make use of that exact field delimiter.

It would work correctly if you had "Women of the Otherworld 8 - Personal Demon".

I believe the brackets will also be a problem: they are not whitespace, so the "\s" portion of the regex will not skip them, and they will end up inside the captured fields.
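
If renaming the files is not an option, the pattern can be adapted to the bracketed layout instead. The following is a hypothetical sketch written for the single example filename above and tested only with Python's re module, not inside calibre.

```python
import re

# Hypothetical pattern for names like "[Series-Index]- Title"; not taken from the thread.
PATTERN = re.compile(
    r"^\[(?P<series>[^\]]+?)-(?P<series_index>[0-9]+)\]\s*-\s*(?P<title>.+)$"
)

m = PATTERN.match("[Women of the Otherworld-8]- Personal demon")
print(m.groupdict())
```

The brackets are matched literally (\[ and \]) outside the named groups, so they stay out of the series and title fields.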
08-26-2009, 03:37 AM   #10
TTC
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2009
Device: ipod touch
Hi, I've had a search through a few of these threads to see if anyone has asked this question before, but I couldn't find it, so apologies if I missed it somewhere.

A lot of my filenames are in the following format:
Surname, Firstname - Title
or
Surname, Firstname - Series # - Title

My problem is that when I start the expression with the default (?P<author>[^_]+) it puts the author details in back to front and messes up the author sort as well.

How do I go about reversing the surname and the first name in the expression so that the Author field is populated correctly? I've looked at the guide for regular expressions, but it's a bit above my head at the moment, although I'm persevering to try and wrap my head around it.
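
A sketch of the idea in plain Python, under loud assumptions: "surname" and "firstname" are hypothetical group names invented for this example, and nothing in this thread confirms that calibre's filename pattern accepts them. The match only captures the two name parts; the reordering is done afterwards in ordinary code.

```python
import re

# "surname" and "firstname" are hypothetical group names for this sketch only.
PATTERN = re.compile(
    r"^(?P<surname>[^,]+),\s*(?P<firstname>[^-]+?)\s*-\s*"
    r"(?:(?P<series>[^-]+?)\s+(?P<series_index>[0-9]+)\s*-\s*)?"
    r"(?P<title>.+)$"
)

def parse(name):
    m = PATTERN.match(name)
    if m is None:
        return None
    fields = m.groupdict()
    # The match only captures text; the reordering happens here.
    fields["author"] = f"{fields.pop('firstname')} {fields.pop('surname')}"
    return fields

print(parse("Surname, Firstname - Title"))
print(parse("Surname, Firstname - Series 3 - Title"))
```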
08-26-2009, 03:07 PM   #11
GinoAMelone
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2009
Device: Windows Mobile
Quote:
Originally Posted by TTC View Post
Hi, I've had a search through a few of these threads to see if anyone has asked this question before, but I couldn't find it, so apologies if I missed it somewhere.

A lot of my filenames are in the following format:
Surname, Firstname - Title
or
Surname, Firstname - Series # - Title

My problem is that when I start the expression with the default (?P<author>[^_]+) it puts the author details in back to front and messes up the author sort as well.

How do I go about reversing the surname and the first name in the expression so that the Author field is populated correctly? I've looked at the guide for regular expressions, but it's a bit above my head at the moment, although I'm persevering to try and wrap my head around it.
What he said!!!

Here's my latest RegExp:
Code:
^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)
It can currently match any of these formats:
  • Series Name 1 - Book Title.txt
  • Author - Series Name 1 - Book Title.txt
  • Author - Series Name - 1 - Book Title.txt
  • Author - Book Title.txt
  • Series Name - 1 - Book Title.txt
The only things it can't handle right now are:
  1. The format of author's name mentioned by TTC
  2. Different ordering of the fields in the file name

I tried using something like this to define multiple orderings, but I can't reuse a group name. Then again, with all the different formats the above RegExp can already handle, it would probably match anything with a reversed order anyway.
Code:
((?P<author>...) - (?P<title>...))|((?P<title>...) - (?P<author>...))
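
The group-name restriction can be reproduced in plain Python, and one possible workaround outside a single pattern is to keep one expression per layout and try them in order. This is a sketch under the assumption that the matching is done in your own script; calibre's settings field in this thread takes a single expression. PATTERNS and parse_filename are names made up for the example.

```python
import re

# Reusing a group name inside one pattern is rejected by Python's re module.
try:
    re.compile(r"(?P<author>.+) - (?P<title>.+)|(?P<title>.+) - (?P<author>.+)")
    duplicate_allowed = True
except re.error:
    duplicate_allowed = False

# Workaround sketch: keep one pattern per layout and try them in order.
PATTERNS = [
    re.compile(r"^(?P<author>[^_]+?) - (?P<series>[^_]+?) (?P<series_index>[0-9]+) - (?P<title>.+)$"),
    re.compile(r"^(?P<author>[^_]+?) - (?P<title>.+)$"),
]

def parse_filename(name):
    for pattern in PATTERNS:
        m = pattern.match(name)
        if m:
            return m.groupdict()
    return None

print(duplicate_allowed)
print(parse_filename("Name Surname - Name of the Series 01 - Title of the Book"))
print(parse_filename("Name Surname - Title of the Book"))
```

The order of the list matters: the most specific layout goes first, because the first pattern that matches wins.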
08-27-2009, 01:26 PM   #12
lorijames
Junior Member
Posts: 2
Karma: 10
Join Date: Aug 2009
Device: iPhone
I've been looking through the forums trying to find an answer to this one:

I'd like to have my ePub files ONLY be the title. I changed the expression on the advanced tab to (?P<title>.+) but it's still adding hyphens and the author name. I know I'm missing something, but what?

Lori
08-27-2009, 03:18 PM   #13
sircastor
Reader
Posts: 85
Karma: 6124
Join Date: Jul 2009
Device: PRS-505
Quote:
Originally Posted by lorijames View Post
I've been looking through the forums trying to find an answer to this one:

I'd like to have my ePub files ONLY be the title. I changed the expression on the advanced tab to (?P<title>.+) but it's still adding hyphens and the author name. I know I'm missing something, but what?

Lori
Okay, I don't have much experience with Calibre, but I do know a lot about regex...

Your expression: (?P<title>.+) is not quite specific enough. The dot "." acts as a wildcard character (it can match any character) and the plus "+" acts as a multiplier meaning "one or more times". So your expression says "Match any character one or more times, and put that into the 'title' container". It's just running a little rampant.

Try something like this:
Code:
(?P<title>.+?) -
Here's the explanation (if you care)

(?P<title>
This part says that anything in the parentheses is going to be put into a container called "title" that you can use later. Calibre uses this internally to populate the various fields in its database.

.+?
This part says "Match any character, repeat that, but do it lazily". The question mark at the end makes a multiplier go lazy, meaning that it will only match as much as it has to. Without the ?, the multiplier goes crazy, and you usually end up matching everything, forever.

) -
This closes the group, and then matches the following space and the dash after that. We need that dash as a way of saying "This isn't part of what I'm looking for", which is why we place it outside of the parentheses.

This expression works on my completely boring "Book Title - nothing important.txt" filename, but you'll need to see if it fits your needs. This expression will *only* work on file names where the Book Title is the first thing in the file name. I don't have enough experience yet with how file names are constructed for books.
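
The difference between the lazy and the greedy form can be seen in a short Python sketch. The first filename is the one from this post; the second, with two dashes, is made up for the comparison.

```python
import re

simple = "Book Title - nothing important.txt"
two_dashes = "Book Title - Author Name - extra.txt"

# Lazy: the title stops at the first " -".
lazy_simple = re.match(r"(?P<title>.+?) -", simple)
lazy_two = re.match(r"(?P<title>.+?) -", two_dashes)

# Greedy: the title runs up to the last " -".
greedy_two = re.match(r"(?P<title>.+) -", two_dashes)

print(lazy_simple.group("title"))
print(lazy_two.group("title"))
print(greedy_two.group("title"))
```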

Last edited by sircastor; 08-27-2009 at 03:25 PM. Reason: fixed for copying
08-28-2009, 03:23 AM   #14
sircastor
Reader
Posts: 85
Karma: 6124
Join Date: Jul 2009
Device: PRS-505
Quote:
Originally Posted by GinoAMelone View Post
The only things it can't handle right now are:
  • Different ordering of the fields in the file name
I'm fairly certain this is actually not possible. The regex matches work because you know the order of the data as it's fed in. If authors were always in a predictable, recognizable format, you'd be able to identify them, but the way Calibre is fed the information means it all happens in one sweep. You have to know where to look for the author (or title, or series, etc.) as well as what you're looking for. The most basic expressions here are ones that are written to break on hyphens, because they're predictable.

Unless I'm missing something, I would skip trying to get your expression to handle different orders.
08-28-2009, 03:44 AM   #15
acidzebra
Liseuse Lover
Posts: 869
Karma: 1035404
Join Date: Jul 2008
Location: Netherlands
Device: PRS-505
Perhaps we should make a sticky of a regex thread (or make a "ask your regex question" thread) - I know there are always a lot of questions about it; it is such a superbly powerful filtering mechanism yet very daunting and confusing for beginners.
Tags
regex, regular expressions

