Quote:
Originally Posted by darkmonk
(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-s).)*)\s-)?\s(?P<title>.*)
I noticed a typo in the regex. I believe the lookahead in the <series> group, "\s-s", should be "\s-\s". Right now you're looking for any whitespace character followed by a literal "-s", instead of the equivalent of " - " between the <series> and <title> fields. Just FYI in case you copied & pasted this from calibre.
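If anyone wants to see what the typo changes in practice, here is a rough Python sketch (Python's re module accepts the same (?P<name>...) syntax). The file name is made up, and I'm assuming the pattern is matched from the start of the name with the extension already stripped, which may not be exactly what calibre does on import.

```python
import re

# Pattern as quoted above, with "\s-s" in the <series> lookahead (the typo).
TYPO = r"(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-s).)*)\s-)?\s(?P<title>.*)"
# Same pattern with the lookahead corrected to "\s-\s".
FIXED = r"(?P<author>((?!\s-\s).)*)\s-(?:\s(?P<series>((?!\s-\s).)*)\s-)?\s(?P<title>.*)"


def fields(pattern, name):
    """Return (author, series, title) for a file name, or None if no match."""
    m = re.match(pattern, name)
    return (m.group("author"), m.group("series"), m.group("title")) if m else None


# Hypothetical file name (no extension) with an extra " - " inside the title.
name = "Jane Doe - Star Saga - The Beginning - Part Two"

print(fields(TYPO, name))
# -> ('Jane Doe', 'Star Saga - The Beginning', 'Part Two')
print(fields(FIXED, name))
# -> ('Jane Doe', 'Star Saga', 'The Beginning - Part Two')
```

With the typo, the <series> group never stops at " - ", so it runs on to the last " -" in the name. With the fix it stops at the first one.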
Try this combined regex; it should handle almost everything:
(?P<author>((?!\s-\s).)*)\s-(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?\s(?P<title>.*)
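A quick way to try the combined regex outside calibre is a few lines of Python. This is only a sketch: the file names are invented, parse() is a helper name I made up, and I'm assuming matching from the start of the name with the extension removed.

```python
import re

# The combined regex from above, split across lines for readability only.
COMBINED = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)


def parse(name):
    """Return the named groups as a dict, or None if the name does not match."""
    m = COMBINED.match(name)
    return m.groupdict() if m else None


# Hypothetical "Author - Series NN - Title" file name.
print(parse("Jane Doe - Star Saga 03 - The Beginning"))
# -> {'author': 'Jane Doe', 'series': 'Star Saga',
#     'series_index': '03', 'title': 'The Beginning'}

# No series part: the optional group is skipped and both series fields are None.
print(parse("Jane Doe - The Beginning"))
# -> {'author': 'Jane Doe', 'series': None,
#     'series_index': None, 'title': 'The Beginning'}
```

Note that the series group only matches when the series name is followed by a space and a number, so "Author - Series - Title" with no index ends up with the series folded into the title.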
Playing around with things, I managed to insert the series index portion of Gwynevan's regex into Darkmonk's regex. So far it meets all of the criteria I posted with the following exceptions:
- Extra " - "s within the <author> or <series name> fields mangle importation.
- Hyphenated names like "Smith-Jones" are fine; "Smith - Jones" mangles importation.
- Leetspeak imports only partially, depending on the exact character combination used. (Which is fine; there is no way to account for every combination of letters and characters with Unicode.)
- Titles generally are not affected by the above, since the regex allows all characters once the series name and index are obtained.
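The exceptions above can be reproduced with a short Python sketch. The file names and the parse() helper are hypothetical, and I'm assuming the pattern is applied from the start of the name without the extension; calibre's own import may differ in details.

```python
import re

# Combined regex from this thread (Darkmonk's pattern plus the series index part).
COMBINED = re.compile(
    r"(?P<author>((?!\s-\s).)*)\s-"
    r"(?:\s((?P<series>.+) (?P<series_index>\d+)((?!\s-\s).)*)\s-)?"
    r"\s(?P<title>.*)"
)


def parse(name):
    """Return (author, series, series_index, title), or None if no match."""
    m = COMBINED.match(name)
    return m.group("author", "series", "series_index", "title") if m else None


# Hyphenated author with no spaces around the hyphen: imports cleanly.
print(parse("Anna Smith-Jones - Star Saga 03 - The Beginning"))
# -> ('Anna Smith-Jones', 'Star Saga', '03', 'The Beginning')

# Spaces around the hyphen: the author is cut short and the rest lands in series.
print(parse("Anna Smith - Jones - Star Saga 03 - The Beginning"))
# -> ('Anna Smith', 'Jones - Star Saga', '03', 'The Beginning')

# Extra " - " in the title only: the title keeps everything after the index.
print(parse("Jane Doe - Star Saga 03 - The Beginning - Part Two"))
# -> ('Jane Doe', 'Star Saga', '03', 'The Beginning - Part Two')
```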