Regular Expression Help - Page 2

lorijames · 08-28-2009, 07:11 AM

Maybe I'm going about this the wrong way? What I'm having trouble with are the file names that come out when I use the conversion tool. I'm trying to find a way to change the naming convention.

For example: File Brazen.pdf comes out as Brazen - Unknown.epub
Sometimes the author's name will populate.

I'd like to eliminate the author name altogether along with the hyphen and space so it's Brazen.epub

Lori

JvdW · 08-29-2009, 03:27 PM

Just to let you know that I might have found something that might help you too.
Googling for some help I found two programs that really helped me, YMMV:
Regex Coach : http://weitz.de/regex-coach/
Kodos : http://kodos.sourceforge.net/
I found Regex Coach the better of the two, with more possibilities and better info on what is happening.

Regards,

Joop

tetujin · 11-13-2009, 10:01 AM

for people who are coming to this topic, here is another regex helper app -- web based and free, though you'll have to translate from perl regex to python regex once you're done; mostly just some syntax stuff, for instance - perl has no (?P<tag>) captures, just '()'. i use this site often to help me when i'm stuck on a particular regex problem.

http://www.gskinner.com/RegExr/

it's flash-based, but the tooltip debug info and highlighting make it very worthwhile, imo.

ooayeloo · 12-16-2009, 02:50 AM

so i'm having a problem getting calibre to read the titles & authors for my books. I realized that I have them saved in a different format than what it's asking for, but is there any way to get calibre to read my particular format without me actually having to change the 400 ebooks i have on my computer?

below are some samples on how my books are saved.

Nora Roberts_True Betrayals.lit
or
Nora Roberts_TSI 01. Dance Upon the Air.lit

is there a regular expression i can use that will allow calibre to read these titles?

thanks in advance!

itimpi · 12-16-2009, 03:39 AM

With .LIT files Calibre will normally take this information from the metadata stored inside the .LIT file at the time you add the files to the Calibre library. Is there any particular reason that you want it taken from the filename instead?

ooayeloo · 12-16-2009, 04:14 PM

For some reason, when it takes it from the metadata, it is completely off. Some of them aren't even the right titles, or the author often becomes "unknown". so i was hoping to pull it from the file name, so that it would look the way I want it to look
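
For filenames like the two samples above, a pattern that splits on the first underscore may be enough. This is only a sketch: it assumes the pattern is applied to the filename without its extension and that the named groups author and title are what calibre reads (as in the other expressions in this thread). parse_name is a made-up helper for trying the pattern outside calibre.

```python
import os
import re

# Everything before the first underscore is the author, the rest is the title.
PATTERN = re.compile(r'(?P<author>[^_]+)_(?P<title>.+)')


def parse_name(filename):
    """Hypothetical test helper: return (author, title), or None if no match."""
    stem = os.path.splitext(filename)[0]  # drop the ".lit" extension
    m = PATTERN.match(stem)
    if m is None:
        return None
    return m.group('author'), m.group('title')


print(parse_name('Nora Roberts_True Betrayals.lit'))
print(parse_name('Nora Roberts_TSI 01. Dance Upon the Air.lit'))
```

Note that the "TSI 01." prefix stays inside the title here; pulling it out into series and series_index groups would need a longer pattern.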

DedTV · 12-30-2009, 08:10 PM

Quote:

Originally Posted by sircastor

Unless I'm missing something, I would skip trying to get your expression to handle different orders.

I use Bulk Rename Utility (freeware) to rename the files so they're all consistent, which Calibre can then handle easily.
It lets you rename files using a flexible Match expression and a flexible Replacement expression, so swapping fields is easy. Since all my files are named either Surname, Firstname - whatever.txt or Firstname Lastname - whatever.txt, you can build off the comma to match only the files you want swapped. I use the regex:

Code:

^(\w+)[ ]*,[ ]*([^-]+?)[ ]*-[ ]*(.*)

with Replace set to \2 \1 - \3, so it only selects the files where the author has a comma in it for renaming and leaves the ones already formatted correctly alone. That gives a consistent naming scheme.

In case that's all just gibberish, here's a visual example:
http://i97.photobucket.com/albums/l2...Authorswap.jpg

From there the expressions in this thread work perfectly to import into Calibre.
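
The same match/replace pair can be tried outside Bulk Rename Utility. The sketch below uses Python's re module and assumes the two regex flavors agree for this particular pattern; swap_author is a hypothetical helper name, and the sample filenames are made up.

```python
import re

# Same Match expression as in the post; only names whose author part
# contains a comma can match.
MATCH = re.compile(r'^(\w+)[ ]*,[ ]*([^-]+?)[ ]*-[ ]*(.*)')
# Same Replace expression: firstname, surname, then the rest of the name.
REPLACE = r'\2 \1 - \3'


def swap_author(filename):
    """Hypothetical helper: 'Surname, Firstname - x' -> 'Firstname Surname - x'."""
    return MATCH.sub(REPLACE, filename, count=1)


print(swap_author('Tolkien, John - The Hobbit.txt'))  # -> John Tolkien - The Hobbit.txt
print(swap_author('John Tolkien - The Hobbit.txt'))   # no comma, left unchanged
```

One limit worth knowing: `^(\w+)` only accepts a single-word surname, so a name such as "Le Guin, Ursula - ..." does not match and is left untouched.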

mezme · 01-01-2010, 01:40 AM

Speaking of regex help... maybe some of the experts on here can help this beginner out

My files are formatted in the following ways:
option A -> Author ~ Title
option B -> Author ~ Title - [Series 00]
option C -> Author ~ Title - [Collection 00 - Series 00]

I have finally got the following regex to work correctly for option B and option C
(?P<author>.+?) ~ (?P<title>.+?)(\s-\s)\[(?P<series>.*)\s(?P<series_index>[0-9.]*)\]?

Here is what I get when I run the regex on option A (Author ~ Title), which is what I DO NOT want. I want it to separate the author and title even when there isn't a series:
Title = Author ~ Title
Author = No Match
Series = No Match

If I run it with (Author ~ Title - [Series]) I get the following which is what I want:
Title = Title
Author = Author
Series = Series [Series Index]

If I run it with (Author ~ Title - [Collection - Series]) I get the following which is what I want:
Title = Title
Author = Author
Series = Collection - Series [Series Index]

However, it won't recognize Option A... how can I get it to read the author and title correctly if there is NO series?

Sabardeyn · 01-04-2010, 06:48 PM

Ok, I'm not where I can get to my regex software, but as near as I can tell from the expression...

It seems that you've been greedy (the + operator) without giving back at the end. So effectively you use the entire filename for the first expression test (Author), but it fails because it is not supposed to have a tilde in it.

You need to make use of the "give back" operator to release portions of the filename to limit it to just the Author portion. Unfortunately I cannot remember the operator at the moment, nor the "phrasing" for doing so.
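
For what it's worth, the `.+?` groups in that expression are already the non-greedy ("give back") form. Another way to read the failure is that the ` - [` part is mandatory, so a name with no series cannot match at all; only the closing `\]` is optional. A minimal sketch of that idea in Python follows: the series part is wrapped in an optional non-capturing group. The trailing `$` is an addition here, so the lazy title group is forced to run to the end of the name, and parse is a hypothetical test helper, not anything calibre provides.

```python
import re

PATTERN = re.compile(
    r'(?P<author>.+?) ~ (?P<title>.+?)'
    # The whole " - [Series 00]" tail is optional, not just the "]".
    r'(?:\s-\s\[(?P<series>.*)\s(?P<series_index>[0-9.]*)\])?$'
)


def parse(name):
    """Hypothetical helper: return the named groups as a dict, or None."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None


for sample in ('Author ~ Title',
               'Author ~ Title - [Series 01]',
               'Author ~ Title - [Collection 00 - Series 00]'):
    print(sample, '=>', parse(sample))
```

With option A the series and series_index groups simply come back empty (None in Python) while author and title are still filled in.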

mezme · 01-08-2010, 12:56 AM

Thanks! after playing with it I finally got it working:

(?P<author>[^-]+)\x20-\x20(?P<title>[^-]+)(?:-\s+\[(?P<series>[^.]+?)(?P<series_index>\d+)?\])?

Successfully parses the following formats:
author - title
author - title - [series]
author - title - [series Number]
author - title - [collection - series Number]
author - title - [collection number - series Number]
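
The expression above can be checked against those formats with a few lines of Python. This is a sketch for testing outside calibre: parse is a hypothetical helper, the sample names are made up, and the strip() call is only there because the title and series groups pick up a trailing space in some of the formats.

```python
import re

PATTERN = re.compile(
    r'(?P<author>[^-]+)\x20-\x20(?P<title>[^-]+)'
    r'(?:-\s+\[(?P<series>[^.]+?)(?P<series_index>\d+)?\])?'
)


def parse(name):
    """Hypothetical helper: named groups as a dict, whitespace trimmed."""
    m = PATTERN.match(name)
    if m is None:
        return None
    return {k: (v.strip() if v else v) for k, v in m.groupdict().items()}


for sample in ('author - title',
               'author - title - [series]',
               'author - title - [series 3]',
               'author - title - [collection 2 - series 3]'):
    print(sample, '=>', parse(sample))
```

Because author and title are both `[^-]+`, a hyphen inside either one cuts the match short: 'author - sci-fi tales' gives the title 'sci'.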

Tom2112 · 01-10-2010, 10:04 PM

Does anyone have a reg ex similar to this one:

^((?P<author>([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)

But... (there's always a but) I need it to remove the comma between the author's first and last names.

Such as in this filename:

Last, First - Series 01 - Title.lit

Which gets imported like this:
Author: First Last,

THANKS in advance,
Tom

Tom2112 · 01-10-2010, 10:29 PM

This regex does the same thing, as far as I can figure:

(?P<author>.+?) - ((?P<series>.+?) (?P<series_index>[0-9]+) - )?(?P<title>.+)

They both detect a series name and index and read them properly whether they're there or not. But both still have the comma problem in the author's name.
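
A quick way to see both points (the optional series, and the comma staying in the author) is to run the shorter expression in plain Python. This is a sketch outside calibre: the extension is assumed to be stripped already, and parse is a hypothetical helper.

```python
import re

PATTERN = re.compile(
    r'(?P<author>.+?) - '
    r'((?P<series>.+?) (?P<series_index>[0-9]+) - )?'  # optional "Series 01 - "
    r'(?P<title>.+)'
)


def parse(name):
    """Hypothetical helper: return only the named groups as a dict, or None."""
    m = PATTERN.match(name)
    return m.groupdict() if m else None


print(parse('Last, First - Series 01 - Title'))
print(parse('Last, First - Title'))
```

In both cases the author group comes back as 'Last, First', comma included, so the comma has to be dealt with after the match rather than inside it.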

Tom2112 · 01-18-2010, 11:31 AM

Anyone?

rogue_ronin · 01-18-2010, 10:56 PM

My first take is that I think the problem comes down to the fact that the (?P<author>) function has to include the comma because you have to find the beginning and the end of the name -- if there were separate functions for First and Last you could exclude the comma.

Even placing the comma in its own set via parentheses, there's no obvious way to replace it with nothing -- or filter it from the match.

Now, I'm no expert. Perhaps there's a tricky way to exclude from the return a subset of that return.

I just took a look at the Calibre regex help -- I thought maybe this would do it:

Quote:

(?:...)
A non-grouping version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

I tried your more complicated regex in Calibre's test window, modifying it to "not retrieve" the comma:

Code:

^((?P<author>([^\-_0-9]+)(?:,)([^\-_0-9]+)(?=\s*-\s*)(?!\s*-\s*[0-9.]+)|\b))(\s*-\s*)?((?P<series>[^0-9\-]+)(\s*-\s*)?(?P<series_index>[0-9.]+)\s*-\s*)?(?P<title>[^\-_0-9]+)

but it returned the exact result your original regex did.

Now I realize that what it means is that in a more normal regex, the objects in such a set are not placed into the numeric variables for reuse later [ie: \1 \2 \3 or $1 $2 $3 depending on your regex flavor.] But because that comma is contained within a larger set, that larger set is returned to the label <author>.
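
That behaviour is easy to reproduce in plain Python. In the sketch below the comma sits in a non-capturing group, yet the enclosing named group still returns it. The second pattern uses separate last and first groups, which are hypothetical names (nothing in this thread suggests calibre reads them), so the recombination has to happen in code after the match.

```python
import re

name = 'Last, First'

# (?:,) is not numbered, but it is still part of what <author> spans.
whole = re.match(r'(?P<author>(\w+)(?:,)\s*(\w+))', name)
print(whole.group('author'))  # -> Last, First
print(whole.groups())         # -> ('Last, First', 'Last', 'First')

# Separate (hypothetical) groups let the comma be dropped when rebuilding.
parts = re.match(r'(?P<last>\w+),\s*(?P<first>\w+)', name)
rebuilt = '{first} {last}'.format(**parts.groupdict())
print(rebuilt)                # -> First Last
```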

BTW, your original regex found the Author as "Last, First" not "First Last," in the test window, so I cannot comment to its effectiveness.

m a r

Starson17 · 01-19-2010, 12:51 PM

Quote:

Originally Posted by rogue_ronin

My first take is that I think the problem comes down to the fact that the (?P<author>) function has to include the comma because you have to find the beginning and the end of the name -- if there were separate functions for First and Last you could exclude the comma.

I agree - there is no way to not find the comma as part of the author's name.

Quote:

BTW, your original regex found the Author as "Last, First" not "First Last," in the test window, so I cannot comment to its effectiveness.

The reason for the difference is that he has the "Swap author firstname and lastname" option checked.

I think it's an error to keep the comma in the lastname. There's no way to get rid of it that I can see short of fixing the code, so I searched the code:

lines 135 to 144 of meta.py have:

Code:

            if prefs['swap_author_names'] and mi.authors:
                def swap(a):
                    parts = a.split()
                    if len(parts) > 1:
                        t = parts[-1]
                        parts = parts[:-1]
                        parts.insert(0, t)
                    return ' '.join(parts)
                mi.authors = [swap(x) for x in mi.authors]

I'm a rank beginner in python, but I can read this. If the swap option is checked, it makes a list of strings from the author's name using the split() function, then takes the last element of that list (variable "t") and sticks it at the beginning. Presumably, split() leaves the comma at the end of the next to last element in that list (the lastname), which becomes the last element after swap() runs.

Dropping the last character of the last remaining element (parts[-1] after the parts[:-1] slice), if it is a comma, will work for simple cases. However, this area of the code could probably be improved even more. For example, the code above will change "Tolkien, J R R" to "R Tolkien, J R"
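
The behaviour described above can be checked by lifting swap() out of the excerpt and running it on its own. The function body is copied from the quoted lines; the prefs check and the mi object are left out, so this is a standalone sketch rather than calibre's code path.

```python
def swap(a):
    # Same logic as the quoted excerpt: move the last whitespace-separated
    # word to the front and leave everything else in order.
    parts = a.split()
    if len(parts) > 1:
        t = parts[-1]
        parts = parts[:-1]
        parts.insert(0, t)
    return ' '.join(parts)


print(swap('Last, First'))     # -> First Last,
print(swap('Tolkien, J R R'))  # -> R Tolkien, J R
print(swap('Nora Roberts'))    # -> Roberts Nora
```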

I have the bare minimum of skill to improve this code, but I suspect someone who is more familiar with python, ebooks and the philosophy of calibre could do better. For example, do you want to split the name at the comma, instead of at the last character string, or are commas in author names common? Do you want to do something special with "John T Smith, Jr." or "Smith, John T, Jr." or add more checkbox options or what? Anyone who wants to compile a list of various author name formats, single and multiple that might be encountered and comments on what the code should do in each case could help whoever wants to improve this code.

I'd suggest anyone who wants this improved should add a ticket and get back here to post the ticket number and their comments on exactly how the improvement should work.

BTW, If anyone wants a simple fix to their own code, adding the two lines below will do it:

Code:

            if prefs['swap_author_names'] and mi.authors:
                def swap(a):
                    parts = a.split()
                    if len(parts) > 1:
                        t = parts[-1]
                        parts = parts[:-1]
                        if parts[-1].endswith(','):
                            parts[-1]=parts[-1][:-1]
                        parts.insert(0, t)
                    return ' '.join(parts)
                mi.authors = [swap(x) for x in mi.authors]

For those who haven't ever played with source code or programming, it's not really that hard. Kovid and python have made it easy. In Windows, you just need to get one program (Bazaar) and run it once to retrieve the source code, then set an environment variable to tell calibre to use it. I do simple fixes like this for special cases.

edit:
An even simpler fix is to change the split() in the original code to split(','). This splits on the comma (assumes that the firstname and lastname are separated by a comma - as mine all are). This correctly swaps a name like "Tolkien, J R R."
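
One detail with a bare split(','): the space after the comma stays attached to the second piece ("Tolkien, J R R".split(',') gives ['Tolkien', ' J R R']), so the joined result starts with a space unless the pieces are trimmed. A minimal standalone sketch of the comma-based swap with trimming is below; swap_on_comma is a hypothetical name, and it splits on the first comma only, which is an assumption of this sketch.

```python
def swap_on_comma(a):
    """Hypothetical variant: 'Last, First Middle' -> 'First Middle Last'."""
    if ',' not in a:
        return a  # nothing to swap on; leave the name alone
    last, _, rest = a.partition(',')  # split at the first comma only
    return ' '.join(p for p in (rest.strip(), last.strip()) if p)


print(swap_on_comma('Tolkien, J R R'))      # -> J R R Tolkien
print(swap_on_comma('Last, First'))         # -> First Last
print(swap_on_comma('Nora Roberts'))        # -> Nora Roberts
print(swap_on_comma('Smith, John T, Jr.'))  # -> John T, Jr. Smith
```

The last line shows that suffixes such as "Jr." still come out in an odd place, which is the kind of case the ticket discussion above would need to settle.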
