05-27-2009, 04:41 PM   #14
Sabardeyn
Guru
Posts: 644
Karma: 1242364
Join Date: May 2009
Location: The Right Coast
Device: PC (Calibre), Nexus 7 2013 (Moon+ Pro), HTC HD2/Leo (Freda)
Quote:
Originally Posted by darkmonk
(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-s).)*)\s-)?\s(?P<title>.*)
I noticed a typo in the regex. In the <series> group, the negative lookahead reads "\s-s", and I believe it should be "\s-\s". As written, it looks for any whitespace character followed by a literal "-s", instead of the equivalent of " - " between the <series> and <title> fields. Just FYI in case you copied and pasted this from calibre.
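
To see what the typo actually changes, here is a rough sketch using Python's re module (calibre is written in Python, so I'm assuming the same regex flavor applies). The file name is made up for illustration; the only difference between the two patterns is the lookahead inside <series>.

```python
import re

# Pattern exactly as quoted, with the "\s-s" typo in the <series> lookahead.
TYPO = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-s).)*)\s-)?\s(?P<title>.*)"
)
# Same pattern with the lookahead corrected to "\s-\s".
FIXED = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-\s).)*)\s-)?\s(?P<title>.*)"
)

# Hypothetical file name whose title contains its own " - ".
name = "Jane Doe - Saga - Book One - Part Two"

typo = TYPO.match(name).groupdict()
fixed = FIXED.match(name).groupdict()

print(typo["series"], "|", typo["title"])    # Saga - Book One | Part Two
print(fixed["series"], "|", fixed["title"])  # Saga | Book One - Part Two
```

With the typo, the <series> group is free to run across " - " and swallows part of the title. With the fix, it stops at the first " - ". On simple names with only two separators, both versions give the same result, which is probably why the typo is easy to miss.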


Try this combined regex; it should handle almost everything:
(?P<author>((?!\s-\s).)*)\s-(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?\s(?P<title>.*)

Playing around with things, I managed to insert the series index portion of Gwynevan's regex into darkmonk's regex. So far it meets all of the criteria I posted, with the following exceptions:
  • Extra " - "s within the <author> or <series name> fields mangle the import.
  • Hyphenated names like "Smith-Jones" are fine, but "Smith - Jones" mangles the import.
  • Leetspeak imports only partially, depending on the exact character combination used. (Which is fine; there is no way to account for every combination of letters and characters in Unicode.)
  • Titles generally are not affected by the above, since the regex allows all characters once the series name and index are obtained.
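
If anyone wants to check these cases outside of calibre, a quick sketch along these lines should work in plain Python. The sample file names and the parse helper are my own inventions for illustration, not anything calibre provides; the pattern is the combined regex from above, split across lines only for readability.

```python
import re

# Combined regex from this post. Adjacent raw strings are concatenated,
# so the literal space before <series_index> is preserved.
COMBINED = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)

def parse(name):
    """Hypothetical helper: return the named fields, or None if no match."""
    m = COMBINED.match(name)
    return m.groupdict() if m else None

# Made-up sample names covering the normal case and two of the exceptions.
samples = [
    "Jane Doe - Saga 2 - Book Two",       # author, series, index, title
    "Jane Doe - Book Two",                # no series: series fields are None
    "Smith-Jones - Saga 2 - Book Two",    # hyphenated author is fine
    "Smith - Jones - Saga 2 - Book Two",  # spaced hyphen splits the author
]

for s in samples:
    print(s, "->", parse(s))
```

In the last sample the author comes out as "Smith" and the series as "Jones - Saga", which is the mangling described in the list above.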

Last edited by Sabardeyn; 05-27-2009 at 05:25 PM. Reason: Clarified results of testing