using templates/python and custom columns to extract specific data from tags

smoothrolla · 11-10-2011, 06:03 AM

Hi Guys

I recently found out how to copy all data from one column (tags) to another custom column by using search and replace, thanks Chaley!

However I only want to copy specific data from the tags, using a template/python function so I don't have to do it manually.

I started to learn about templates and python last night and got pretty far:

ie

i created a column called #testcomposite

first i tried templates to extract only known genres from the tags column:

Code:

{#testcomposite:'list_intersection(field('tags'),'Adult, Adventure, Anthologies, Biography, Childrens, Classics, Drugs, Fantasy, Food, Football, Health, History, Historical, Horror, Humour, Inspirational, Modern, Music, Mystery, Non-Fiction, Poetry, Political, Philosophy, Psychological, Reference, Religion, Romance, Science, Science Fiction, Self Help, Short Stories, Sociology, Spirituality, Suspense, Thriller, Travel, Vampires, War, Western, Writing, Young Adult',',')'}

That worked, but it was slow, so I figured out how to create a python function by adapting the list_intersection function:

function:getgenre, 1 param

Code:

def evaluate(self, formatter, kwargs, mi, locals, val):
    list1 = val
    list2 = 'Adult, Adventure, Anthologies, Biography, Childrens, Classics, Drugs, Fantasy, Food, Football, Health, History, Historical, Horror, Humour, Inspirational, Modern, Music, Mystery, Non-Fiction, Poetry, Political, Philosophy, Psychological, Reference, Religion, Romance, Science, Science Fiction, Self Help, Short Stories, Sociology, Spirituality, Suspense, Thriller, Travel, Vampires, War, Western, Writing, Young Adult'
    separator = ','
    l1 = [l.strip() for l in list1.split(separator) if l.strip()]
    l2 = [icu_lower(l.strip()) for l in list2.split(separator) if l.strip()]
    res = []
    for i in l1:
        if icu_lower(i) in l2:
            res.append(i)
    return ', '.join(res)

called with template:

Code:

{#testcomposite:'getgenre(field('tags'))'}

So that's great, it runs real quick, and I was quite amazed I got this far

However, what i would like is something like this:
Extract all the known genres from the tags (like above)
but also if i come across a tag which contains *mystery* (like Mystery & Detective) then add genre "Mystery" to the #testcomposite column

so something like this
if tag item like '*horror*' or tag item='Scary' or tag item='Spooky' then add 'Horror' etc

Any help is appreciated, in either template or python (or both!)

PS, I am a programmer, but python and calibre are all very new to me, and python is a slightly lower-level language than I'm used to.

PPS, I'm amazed at how flexible this program is, hats off to the creator(s)!

Thanks very much!

chaley · 11-10-2011, 12:27 PM

First, you can make the python function faster by changing the code as follows:

Code:

def evaluate(self, formatter, kwargs, mi, locals, val):
    list1 = val
    l2 = ['adult', 'adventure', 'anthologies', 'biography', ..., 'young adult']
    l1 = [l.strip() for l in list1.split(',') if l.strip()]
    l1lcase = [icu_lower(l) for l in l1]
    res = set()
    for idx,item in enumerate(l1lcase):
        if item in l2:
            res.add(l1[idx])
    return ', '.join(res)

The reason to use a set for res is to avoid having the same entry in the result more than once. This will matter in the code below.

You can do the 'like' examples using something like:

Code:

    for item in l1lcase:
        if 'horror' in item or item in ['scary', 'spooky']:
            res.add('Horror')
            break
    for item in l1lcase:
        if 'mystery' in item or 'detective' in item:
            res.add('Mystery')
            break

When the "in" operator is applied as "string in string", it is a "contains" operation.

The set is necessary here because the added item might already be in the result; with a list it would end up in the result more than once.
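The two points above (substring matching with "in", and a set swallowing duplicates) can be tried outside calibre with a small sketch like the one below. It is only an illustration: plain str.lower() is used as a stand-in for calibre's icu_lower, and the tag values are made up.

```python
# Standalone illustration, not calibre code.
# Assumption: str.lower() stands in for calibre's icu_lower.
tags = ['Horror', 'Spooky', 'Mystery & Detective']
lowered = [t.lower() for t in tags]

# "string in string" is a contains (substring) test on a single tag
contains_mystery = 'mystery' in lowered[2]

# "string in list" is an exact-match membership test on the whole list
exact_mystery = 'mystery' in lowered

res_set = set()
res_list = []

# Exact-match pass: the tag 'Horror' is itself a known genre
if 'horror' in lowered:
    res_set.add('Horror')
    res_list.append('Horror')

# Mapping pass: 'scary' or 'spooky' should also produce 'Horror'
for item in lowered:
    if 'horror' in item or item in ['scary', 'spooky']:
        res_set.add('Horror')      # no effect, already present
        res_list.append('Horror')  # a list would now hold it twice
        break
```

With these sample tags the set ends up holding 'Horror' once, while the list holds it twice, which is the duplicate the set is there to prevent.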

smoothrolla · 11-10-2011, 03:22 PM

Thanks Charley!

I got the sped-up script working great, thanks for that.
I half understand it, slowly getting there

I decided the code needed reworking so that it looks for partial matches for all the genres I have provided (41 of them), followed by the new code to map scary to Horror etc (rather than having 41 for loops)

Here is the new code:

Code:

def evaluate(self, formatter, kwargs, mi, locals, val):
    list1 = val
    l2 = ['adult', 'adventure', 'anthologies', 'biography', 'childrens', 'classics', 'drugs', 'fantasy', 'food', 'football', 'health', 'history', 'historical', 'horror', 'humour', 'inspirational', 'modern', 'music', 'mystery', 'non-fiction', 'poetry', 'political', 'philosophy', 'psychological', 'reference', 'religion', 'romance', 'science', 'science fiction', 'self help', 'short stories', 'sociology', 'spirituality', 'suspense', 'thriller', 'travel', 'vampires', 'war', 'western', 'writing', 'young adult']
    l1 = [l.strip() for l in list1.split(',') if l.strip()]
    l1lcase = [icu_lower(l) for l in l1]
    res = set()
    for idx,item in enumerate(l1lcase):
        if item in l2:
            res.add(l1[idx])

    for item in l1lcase:
        for item2 in l2:
            if item2 in item:
                res.add(item2)
                break

    for item in l1lcase:
        if 'scary' in item or 'spooky' in item:
            res.add('Horror')
            break

    return ', '.join(res)

But this bit of that code adds tags in lowercase:

Code:

    
    for item in l1lcase:
        for item2 in l2:
            if item2 in item:
                res.add(item2)
                break

I tried using
res.add(titlecase(item2))

but that throws an error

Maybe I need to keep list2 in titlecase and lowercase it as I go. I'll try to figure it out, but if you can put me on the right path I would really appreciate it.

Thanks!

smoothrolla · 11-10-2011, 04:03 PM

Ok I got a solution, probably inelegant though

I create another list, in titlecase, of the tags I want to do a partial search for, as I decided I didn't want to search for them all (for example science is in science fiction, so I got both tags, which I didn't really want)

Code:

def evaluate(self, formatter, kwargs, mi, locals, val):
    list1 = val
    l2 = ['adult', 'adventure', 'anthologies', 'biography', 'childrens', 'classics', 'drugs', 'fantasy', 'food', 'football', 'health', 'history', 'historical', 'horror', 'humour', 'inspirational', 'modern', 'music', 'mystery', 'non-fiction', 'poetry', 'political', 'philosophy', 'psychological', 'reference', 'religion', 'romance', 'science', 'science fiction', 'self help', 'short stories', 'sociology', 'spirituality', 'suspense', 'thriller', 'travel', 'vampires', 'war', 'western', 'writing', 'young adult']
    l1 = [l.strip() for l in list1.split(',') if l.strip()]
    l1lcase = [icu_lower(l) for l in l1]
    res = set()
    for idx,item in enumerate(l1lcase):
        if item in l2:
            res.add(l1[idx])

    l3 = ['Adult', 'Adventure', 'Anthologies', 'Biography', 'Childrens', 'Classics', 'Drugs', 'Fantasy', 'Food', 'Football', 'Health', 'History', 'Historical', 'Horror', 'Humour', 'Inspirational', 'Modern', 'Music', 'Mystery', 'Non-Fiction', 'Poetry', 'Political', 'Philosophy', 'Psychological', 'Reference', 'Religion', 'Romance', 'Science fiction', 'Self Help', 'Short Stories', 'Sociology', 'Spirituality', 'Suspense', 'Thriller', 'Travel', 'Vampires', 'War', 'Western', 'Writing', 'Young Adult']

    for item in l1lcase:
        for item2 in l3:
            check = item2.lower()
            if check in item:
                res.add(item2)
                break

    for item in l1lcase:
        if 'scary' in item or 'spooky' in item:
            res.add('Horror')
            break

    return ', '.join(res)

need to do some more tests but its looking good

Thanks again for your help!

chaley · 11-10-2011, 04:04 PM

Quote:

Originally Posted by smoothrolla

Thanks Charley!

chaley, not charley.

Quote:

But this bit of that code adds tags in lowercase:

Code:

    for item in l1lcase:
        for item2 in l2:
            if item2 in item:
                res.add(item2)
                break

That is the point of the 'enumerate'. The arrays l1 and l1lcase are ordered and indexed the same, so

Code:

    for idx,item in enumerate(l1lcase):
        for item2 in l2:
            if item2 in item:
                res.add(l1[idx])
                break

will add the cased version of the matching tag (the item from l1) to the result.

The enumerate function yields the index and the value (a tuple in python terms), which in this case is the index and the lowercase version of the value. Because l1lcase and l1 are parallel arrays, l1[idx] gets the equivalent item for the one in l1lcase, which is the cased version.
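A small standalone sketch of the parallel-list idea, for anyone who wants to try it outside calibre. The sample tags are invented, and str.lower() is assumed as a stand-in for icu_lower.

```python
# Standalone illustration, not calibre code.
# Assumption: str.lower() stands in for calibre's icu_lower.
l1 = ['Mystery & Detective', 'HORROR']   # tags as typed
l1lcase = [t.lower() for t in l1]        # same order, same indexes
l2 = ['horror', 'mystery']               # lowercase genres to look for

# enumerate yields (index, value) pairs
pairs = list(enumerate(l1lcase))

res = set()
for idx, item in enumerate(l1lcase):
    for item2 in l2:
        if item2 in item:
            # l1[idx] is the tag exactly as it was typed,
            # item is its lowercase twin at the same index
            res.add(l1[idx])
            break
```

Note that with a partial match the value added is the whole original tag ('Mystery & Detective' here), not the bare genre name. If the bare genre name is wanted in the result, a cased genre list kept parallel to l2 would be needed as well.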

smoothrolla · 11-10-2011, 04:18 PM

Quote:

chaley, not charley.

sorry, I wasnt sure

Quote:

The enumerate function yields the index and the value (a tuple in python terms), which in this case is the index and the lowercase version of the value. Because l1lcase and l1 are parallel arrays, l1[idx] gets the equivalent item for the one in l1lcase, which is the cased version.

Great thanks, I wanted to code something like that but was unsure how in this language, very frustrating.

I came up with a slightly different way of doing it, posted a minute before your reply so not sure if you saw it. It's probably laughable, mind you

thanks

smoothrolla · 11-10-2011, 05:12 PM

I thought I would post my final approach here in case someone else finds it useful in the future

I need to add some more tags->genre mappings (like football->sports etc) but you get the idea

Code:

def evaluate(self, formatter, kwargs, mi, locals, val):
    # turn the tags into an array and create a lowercase version
    tagslist      = [l.strip() for l in val.split(',') if l.strip()]
    tagslistlcase = [icu_lower(l) for l in tagslist]

    # my list of genres i want, and create a lowercase version
    genrelist      = ['Adult', 'Adventure', 'Anthologies', 'Biography', 'Childrens', 'Classics', 'Drugs', 'Fantasy', 'Food', 'Football', 'Health', 'History', 'Historical', 'Horror', 'Humour', 'Inspirational', 'Modern', 'Music', 'Mystery', 'Non-Fiction', 'Poetry', 'Political', 'Philosophy', 'Psychological', 'Reference', 'Religion', 'Romance', 'Science', 'Science Fiction', 'Self Help', 'Short Stories', 'Sociology', 'Spirituality', 'Suspense', 'Thriller', 'Travel', 'Vampires', 'War', 'Western', 'Writing', 'Young Adult']
    genrelistlcase = [icu_lower(l) for l in genrelist]

    res = set()

    # loop through the genres
    for idx,genre in enumerate(genrelistlcase):
        # loop through the tags and see if the genre is contained in a tag
        for tag in tagslistlcase:
            if genre in tag:
                # dont add science if it was found in science fiction
                if genre != 'science' or (genre == 'science' and 'science fiction' not in tag):
                    # add to array
                    res.add(genrelist[idx])
                    break

    # final loop through the tags to look for specific tags i want to map to a genre
    for tag in tagslistlcase:
        if 'religious' in tag or 'christian' in tag:
            res.add('Religion')
        if 'children' in tag:
            res.add('Childrens')

    # join the array into a string and return
    return ', '.join(res)
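For testing the same logic outside calibre, a standalone sketch along these lines may help. It is not the function above verbatim: the names get_genres, GENRES and EXTRA_RULES are hypothetical, the genre list is shortened for readability, str.lower() is assumed as a stand-in for icu_lower, and the extra tag->genre mappings are moved into a dict so new ones can be added without more if statements.

```python
# Standalone sketch for testing outside calibre (names are hypothetical).
# Assumption: str.lower() stands in for calibre's icu_lower.
GENRES = ['Fantasy', 'Horror', 'Mystery', 'Religion',
          'Science', 'Science Fiction']   # shortened list for the example

# substring found in a tag -> genre to add
EXTRA_RULES = {
    'religious': 'Religion',
    'christian': 'Religion',
    'children': 'Childrens',
}

def get_genres(val, lower=str.lower):
    # turn the tags into a list and lowercase them
    tags = [t.strip() for t in val.split(',') if t.strip()]
    tags_lower = [lower(t) for t in tags]

    res = set()
    for genre in GENRES:
        g = lower(genre)
        for tag in tags_lower:
            if g in tag:
                # don't add science if it was found in science fiction
                if g != 'science' or 'science fiction' not in tag:
                    res.add(genre)
                break

    # map specific tag fragments to a genre
    for tag in tags_lower:
        for needle, genre in EXTRA_RULES.items():
            if needle in tag:
                res.add(genre)

    # a set has no guaranteed order, so sort for a stable result
    return ', '.join(sorted(res))
```

For example, get_genres('Science Fiction, Christian Living, Spooky') gives 'Religion, Science Fiction'. The sorted() call is a small change from the function above: joining a set directly can return the genres in a different order from one run to the next, which may or may not matter for a composite column.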

