Metadata Source Plugin, advice needed

Boilerplate4U · 03-31-2021, 04:57 AM

I made an attempt a while ago to develop some Metadata Source Plugins but gave up because I did not grasp the integration well enough. Now I'm going to make another attempt, but I'm hoping for some advice before I start:

1. Could someone please point out a really basic example with just a) input params, i.e. search title, author, etc. and b) return params, i.e. where to return stuff to calibre like description, cover, etc. Even pseudocode is OK; I just want to understand which methods should be used for input and output.
2. If you enable multiple Metadata Source Plugins, will they run in series or in parallel? The reason I'm asking is whether there is a need to create your own process manager within the plugin to handle multiple requests.

Thanks in advance!

Boilerplate4U · 03-31-2021, 06:57 AM

Found some plugins that were not too old, and they gave me some further insight. This is my guess at how the process is intended to work:

Implement method identify() from class Source:

Code:

def identify(self, log, result_queue, abort, title=None, authors=None,
             identifiers={}, timeout=30):
    ...  # run the search and put Metadata objects on result_queue

Use search parameters passed by identify() like title, authors and identifiers.

Code:
```
# search() is your own site-specific helper, not part of calibre
result = search(title, authors, identifiers)
```
Possibly save cover url for future use by download_cover()
Code:
```
self.save_cover_url = result.cover_url
```

Return result using result_queue.put(Metadata)

Code:

from calibre.ebooks.metadata.book.base import Metadata
md = Metadata(result.title, result.authors)  # authors is a list of names
result_queue.put(md)

Any thoughts?
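
A minimal sketch of the flow guessed at above, put together so that it runs outside calibre. The assumptions: `Metadata` here is a stand-in dataclass (a real plugin would import calibre's own class), `ExampleSource`, `search()` and `FAKE_CATALOGUE` are hypothetical names invented for illustration, and the "site" is an in-memory list rather than a real API. Only the `identify()` signature is taken from this thread.

```python
import logging
import queue
import threading
from dataclasses import dataclass, field


@dataclass
class Metadata:
    """Stand-in for calibre's Metadata class so the sketch runs on its own."""
    title: str
    authors: list
    identifiers: dict = field(default_factory=dict)
    comments: str = ""


# Hypothetical catalogue standing in for a real site or API.
FAKE_CATALOGUE = [
    {"title": "An Example Book", "authors": ["Jane Doe"],
     "isbn": "9780000000002", "description": "Made-up entry."},
]


class ExampleSource:
    """Same method shape as a metadata source; a real plugin subclasses Source."""

    def search(self, title, authors, identifiers):
        # Hypothetical helper: replace with the site's API call or page scrape.
        isbn = identifiers.get("isbn")
        for entry in FAKE_CATALOGUE:
            if isbn and entry["isbn"] == isbn:
                yield entry
            elif title and title.lower() in entry["title"].lower():
                yield entry

    def identify(self, log, result_queue, abort, title=None, authors=None,
                 identifiers={}, timeout=30):
        for entry in self.search(title, authors, identifiers):
            if abort.is_set():  # stop early if the caller cancelled the lookup
                return
            log.info("match: %s", entry["title"])
            result_queue.put(Metadata(
                entry["title"], entry["authors"],
                identifiers={"isbn": entry["isbn"]},
                comments=entry["description"],
            ))


results = queue.Queue()
ExampleSource().identify(logging.getLogger("example"), results,
                         threading.Event(), title="example book")
found = results.get_nowait()
print(found.title, found.identifiers)
```

The input side is just the keyword arguments of `identify()`; the output side is whatever gets put on `result_queue`. Everything site-specific lives in the `search()` helper.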

davidfor · 03-31-2021, 09:16 AM

That's pretty much it. But I'm not sure what "not too old" means in this case. All the available metadata source plugins are still valid. And where one is not, it is probably because the site has changed and no one is interested in fixing it. My usual suggestion is to pick one whose code you like and use it as a model. The complications in the plugins are not in the basic structure but in how to parse the pages to find the details. Unless you are lucky and the site has an API or some other simple structure for the data, that is where you will be spending most of the time.

Boilerplate4U · 03-31-2021, 12:12 PM

What I meant is that many of the older variants are far too complex in their structure to find the central thread. It seems like the api documents would need some simple "howtos" that explain the basic process. Some simple source code templates would be useful as well.

Scraping content using xpath is quite easy nowadays when you have helper tools like xPather.com and Google Xpath Helper. Some of the lager content sites I'm going to use also offer api:s like marcxml/marc21 and json-ld which makes things a lot easier.

When I have some time, I'm going to put together a basic HOWTO with a corresponding code example for future reference and put it on GitHub.

Btw, do you know if the metadata plugins run in parallel or in series (i.e. when multiple plugins are activated)?

davidfor · 03-31-2021, 10:06 PM

Quote:

Originally Posted by Boilerplate4U

What I meant is that many of the older variants are far too complex in their structure to find the central thread.

Which are you looking at? They all basically come down to the identify method building a search query, running it and examining the results to find the matching books. Then they get the full details for each of the matches.

The complications come in the form of how to do that search and whether you need to try various searches. Most will check for an identifier for the site, then an ISBN, and then do a search with title and author, and possibly a title-only search if that makes sense for the site.

But the interface is described in the calibre source code, in calibre/ebooks/metadata/sources/base.py.

Quote:

It seems like the api documents would need some simple "howtos" that explain the basic process. Some simple source code templates would be useful as well.

Honestly, I don't think a simple source code would help. Because it isn't simple. And it would need a simple site to scrape for it to be useful. And those do not exist. And writing anything like this means someone has to have time. I don't think that there have been enough people interested in writing the metadata source plugins, or even possible sources, to make more extensive documentation worth it. The people who are doing these plugins

Quote:

Scraping content using xpath is quite easy nowadays when you have helper tools like xPather.com and Google Xpath Helper.

I'll have to have a look at these the next time a site changes and I have to fix it.

Quote:

Some of the lager content sites I'm going to use also offer api:s like marcxml/marc21 and json-ld which makes things a lot easier.

Exactly what sort of metadata are you trying to get???????? I'm sure it is a typo, but...

What sites are you looking at? I am a bit surprised that there are large ebook sites that do not already have metadata source plugins. Unless they are non-English sites. Or maybe very specialised repositories.

You might also find problems with sites that offer APIs if they need any sort of authentication. If you are doing this for personal use, it probably won't be an issue. But some of the sites will either limit access based on a developer key or need individual access, and that makes them harder for the general user to use.

Quote:

When I have some time, I'm going to put together a basic HOWTO with a corresponding code example for future reference and put it on GitHub.

Btw, do you know if the metadata plugins run in parallel or in series (i.e. when multiple plugins are activated)?

The different sources are run in parallel. There is no interaction between each of them. They return results that calibre then handles.
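
To illustrate what "run in parallel with no interaction" means for a plugin author, here is a small sketch. calibre's actual scheduling code is not shown in this thread, so this is only a model of the idea, assuming one worker thread per source, a shared result queue and a shared abort flag. `run_sources_in_parallel`, `source_a` and `source_b` are hypothetical names.

```python
import queue
import threading


def run_sources_in_parallel(sources, title, timeout=5):
    """Run each source callable in its own thread and collect what they return."""
    results = queue.Queue()
    abort = threading.Event()
    workers = [
        threading.Thread(target=src, args=(results, abort, title), daemon=True)
        for src in sources
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join(timeout)
    abort.set()  # ask any source that is still running to stop
    collected = []
    while not results.empty():
        collected.append(results.get_nowait())
    return collected


# Two hypothetical sources; each only talks to the queue, never to the other.
def source_a(results, abort, title):
    if not abort.is_set():
        results.put(("source_a", title))


def source_b(results, abort, title):
    if not abort.is_set():
        results.put(("source_b", title))


found = run_sources_in_parallel([source_a, source_b], "Some Title")
print(sorted(found))
```

The practical upshot for the original question: the caller owns the threads, so a source only needs to do its own search and put results on the queue. There should be no need for a process manager inside the plugin just to coexist with other sources.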

Boilerplate4U · 04-02-2021, 05:26 AM

A cold pint of lager always helps!

Thanks, but base.py doesn't really provide any direct hints regarding how to grasp the workflow that currently is the essential part for me.

As for searching, I think it's quite straightforward: either use IDs like ISBN, or otherwise title/author. Some of the scientific sites I plan to use offer an API key when you register an account, which allows volume access for free (though I'm quite sure there is some kind of limit anyhow).

I believe source code samples really do matter, in the spirit of the design philosophy Specification by Example (imho).

Thanks for the info on how the plugins are executed within calibre.

Somewhat OT, but do you have any knowledge about scraping EPUBs for metadata using EPUBMetadataReader? It seems that the <dc:identifier> is not used to extract the ISBN from content.opf. Question: do you possibly have a clue as to which source file handles this?

Snippet from content.opf:

Code:

<package version="2.0" unique-identifier="bookid">
  <metadata>
    <dc:identifier id="bookid">9781783984343</dc:identifier>
    <dc:title >Reactive Programming with Scala and Akka</dc:title>
    <dc:publisher >Packt Publishing</dc:publisher>
    <dc:language >en</dc:language>
    <meta name="cover" content="cover-image"/>
  </metadata>

davidfor · 04-02-2021, 06:30 AM

Quote:

Originally Posted by Boilerplate4U

A cold pint of lager always helps!

Thanks, but base.py doesn't really provide any direct hints regarding how to grasp the workflow that currently is the essential part for me.

base.py gives you the API the metadata source plugins need to implement. How you implement it is up to you. But identify is called with the search parameters, and the results, as Metadata objects, are added to "result_queue". How that happens is up to you; it depends on the site. If you are searching only by identifiers, the search part is relatively simple. If the site offers a good API, it becomes even easier.

Quote:

As for searching, I think it's quite straightforward: either use IDs like ISBN, or otherwise title/author.

Yes, that is it. But, exactly how you do it is up to you. And the site. But most of the metadata plugins follow the pattern I outlined above:

Try the following searches in order, stopping as soon as matches are found:
1. Using the site's identifier, if it is known.
2. Using the ISBN, if it is known and the site supports it.
3. Using the title and author.
4. Using the title only.
For the close matches, get the full details and build Metadata objects to return.
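
The search order above can be sketched as a small helper. This is an illustration only: `find_matches`, `fake_search` and the identifier name `examplesite` are hypothetical, and the `search` callable stands in for whatever query the site actually supports.

```python
def find_matches(search, site_id_name, identifiers, title, authors,
                 isbn_supported=True):
    """Try progressively looser searches and return the first non-empty result."""
    identifiers = identifiers or {}
    attempts = []
    if identifiers.get(site_id_name):
        attempts.append({"site_id": identifiers[site_id_name]})
    if isbn_supported and identifiers.get("isbn"):
        attempts.append({"isbn": identifiers["isbn"]})
    if title and authors:
        attempts.append({"title": title, "authors": authors})
    if title:
        attempts.append({"title": title})
    for query in attempts:
        matches = search(**query)
        if matches:  # stop as soon as one search finds something
            return matches
    return []


calls = []


def fake_search(**query):
    # Hypothetical site: records each query and only matches on title + authors.
    calls.append(sorted(query))
    return ["book-123"] if "authors" in query else []


matches = find_matches(fake_search, "examplesite",
                       {"isbn": "9781783984343"},
                       "Reactive Programming with Scala and Akka",
                       ["Some Author"])
print(matches, calls)
```

In this run there is no site identifier, the ISBN search finds nothing, and the title-and-author search matches, so the title-only search is never issued.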

Quote:

Some of the scientific sites I plan to use offer an API key when you register an account, which allows volume access for free (though I'm quite sure there is some kind of limit anyhow).

Which can be a problem. Kovid had to remove support for WorldCat because they changed the rules on the API so that the limits were just too low for the potential number of users.

Quote:

I believe source code samples really do matter, in the spirit of the design philosophy Specification by Example (imho).

Thanks for the info on how the plugins are executed within calibre.

Somewhat OT, but do you have any knowledge about scraping EPUBs for metadata using EPUBMetadataReader? It seems that the <dc:identifier> is not used to extract the ISBN from content.opf. Question: do you possibly have a clue as to which source file handles this?

Snippet from content.opf:

Code:

<package version="2.0" unique-identifier="bookid">
  <metadata>
    <dc:identifier id="bookid">9781783984343</dc:identifier>
    <dc:title >Reactive Programming with Scala and Akka</dc:title>
    <dc:publisher >Packt Publishing</dc:publisher>
    <dc:language >en</dc:language>
    <meta name="cover" content="cover-image"/>
  </metadata>

I haven't looked at it for a long time, but I think the EPUBMetadataReader metadata reader plugin only reads the OPF file. That is where it gets the ISBN, but your example is not how an ISBN is identified. The following line is:

Code:

    <dc:identifier opf:scheme="ISBN">9781927464243</dc:identifier>

The "opf:scheme" describes how the identifier is used.
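
The difference between the two identifier lines can be checked with the standard library alone. This is a sketch, not calibre's reader: `isbns_from_opf` and `OPF_TEMPLATE` are hypothetical names, the namespace declarations on the package element are assumed (the snippet in the thread omits them), and the substring test only loosely mirrors a case-insensitive match on the scheme.

```python
import xml.etree.ElementTree as ET

OPF_NS = "http://www.idpf.org/2007/opf"
DC_NS = "http://purl.org/dc/elements/1.1/"
# The scheme may appear as a plain attribute or as opf:scheme.
SCHEME_ATTRS = ("scheme", "{%s}scheme" % OPF_NS)


def isbns_from_opf(opf_xml):
    """Return the text of every dc:identifier whose scheme mentions ISBN."""
    root = ET.fromstring(opf_xml)
    found = []
    for el in root.iter("{%s}identifier" % DC_NS):
        schemes = [el.get(attr, "") for attr in SCHEME_ATTRS]
        if any("isbn" in s.lower() for s in schemes):
            found.append((el.text or "").strip())
    return found


OPF_TEMPLATE = """<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:opf="http://www.idpf.org/2007/opf"
         version="2.0" unique-identifier="bookid">
  <metadata>
    %s
  </metadata>
</package>"""

without_scheme = OPF_TEMPLATE % (
    '<dc:identifier id="bookid">9781783984343</dc:identifier>')
with_scheme = OPF_TEMPLATE % (
    '<dc:identifier opf:scheme="ISBN">9781927464243</dc:identifier>')

print(isbns_from_opf(without_scheme))
print(isbns_from_opf(with_scheme))
```

The first document yields no ISBN because nothing labels the identifier as one; the second yields the number because of the `opf:scheme="ISBN"` attribute.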

Boilerplate4U · 04-02-2021, 06:35 AM

Could it be this one?
https://github.com/kovidgoyal/calibre/blob/master/src/calibre/ebooks/metadata/opf2.py

My guess is that the XPath search "isbn_path" fails to match because of the "and" clause, i.e. and (re:match(@scheme, "isbn", "i") or re:match(@opf:scheme, "isbn", "i")), which is in the last two lines of the snippet below:

Code:

class OPF(object):  # {{{

    MIMETYPE         = 'application/oebps-package+xml'
    NAMESPACES       = {
                        None: "http://www.idpf.org/2007/opf",
                        'dc': "http://purl.org/dc/elements/1.1/",
                        'opf': "http://www.idpf.org/2007/opf",
                       }
    META             = '{%s}meta' % NAMESPACES['opf']
    xpn = NAMESPACES.copy()
    xpn.pop(None)
    xpn['re'] = 'http://exslt.org/regular-expressions'
    XPath = functools.partial(etree.XPath, namespaces=xpn)
    CONTENT          = XPath('self::*[re:match(name(), "meta$", "i")]/@content')
    TEXT             = XPath('string()')

    metadata_path   = XPath('descendant::*[re:match(name(), "metadata", "i")]')
    metadata_elem_path = XPath(
        'descendant::*[re:match(name(), concat($name, "$"), "i") or (re:match(name(), "meta$", "i") '
        'and re:match(@name, concat("^calibre:", $name, "$"), "i"))]')
    title_path      = XPath('descendant::*[re:match(name(), "title", "i")]')
    authors_path    = XPath('descendant::*[re:match(name(), "creator", "i") and (@role="aut" or @opf:role="aut" or (not(@role) and not(@opf:role)))]')
    bkp_path        = XPath('descendant::*[re:match(name(), "contributor", "i") and (@role="bkp" or @opf:role="bkp")]')
    tags_path       = XPath('descendant::*[re:match(name(), "subject", "i")]')
    isbn_path       = XPath('descendant::*[re:match(name(), "identifier", "i") and '
                            '(re:match(@scheme, "isbn", "i") or re:match(@opf:scheme, "isbn", "i"))]')

kovidgoyal · 04-02-2021, 06:38 AM

Code:

# Needs calibre's Python environment, e.g. run the script with calibre-debug
from calibre.ebooks.metadata.meta import get_metadata
with open('/t/t.epub', 'rb') as f:
    mi = get_metadata(f, 'epub')
print(mi.identifiers)

Boilerplate4U · 04-02-2021, 06:51 AM

Thanks, but any idea regarding the isbn xpath mismatch?

kovidgoyal · 04-02-2021, 07:05 AM

That's working as intended. Read the EPUB 2 spec. You need to specify the scheme; otherwise it's just a meaningless number, not an ISBN.

Boilerplate4U · 04-02-2021, 07:10 AM

Quote:

Originally Posted by davidfor

The "opf:scheme" describes how the identifier is used.

Yeah, according to http://idpf.github.io/epub-registries/identifiers there are a number of scheme values (like DOI, ISBN, JDCN, UUID), but it seems they are usually omitted, at least in the EPUBs I've checked recently. ISBN seems to be the default when the scheme is undefined, so it should work as the fallback type in this case.

Boilerplate4U · 04-02-2021, 07:19 AM

Quote:

Originally Posted by kovidgoyal

That's working as intended. Read the EPUB 2 spec. You need to specify the scheme; otherwise it's just a meaningless number, not an ISBN.

Yes of course, but try to convince all the publishers that don't stick to the standard. Most of the EPUBs I've checked so far omit the scheme; that's the sad reality.

So one solution is to add an option to treat the identifier as an ISBN when the scheme is undefined. What do you think?
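
A sketch of what such an opt-in fallback could look like. This is not something calibre does, and the names `is_valid_isbn13` and `pick_isbn` are hypothetical. The assumption here is that an unlabelled identifier is accepted only if it has the 978/979 prefix and a correct ISBN-13 check digit, so that UUIDs and arbitrary numbers are not mistaken for ISBNs.

```python
def is_valid_isbn13(text):
    """Check length, 978/979 prefix and the ISBN-13 check digit."""
    digits = text.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    if not digits.startswith(("978", "979")):
        return False
    # Weights alternate 1, 3, 1, 3, ...; a valid number sums to a multiple of 10.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(digits))
    return total % 10 == 0


def pick_isbn(identifiers, allow_unlabelled=False):
    """identifiers: list of (scheme_or_None, value) pairs read from the OPF."""
    for scheme, value in identifiers:
        if scheme and scheme.lower() == "isbn":
            return value  # properly labelled identifiers always win
    if allow_unlabelled:
        for scheme, value in identifiers:
            if not scheme and is_valid_isbn13(value):
                return value
    return None


unlabelled = [(None, "9781783984343")]
print(pick_isbn(unlabelled))
print(pick_isbn(unlabelled, allow_unlabelled=True))
```

With the option off the unlabelled identifier is ignored, which matches the behaviour described in this thread; with it on, the number from the content.opf snippet is accepted because it passes the check.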

davidfor · 04-02-2021, 08:46 AM

Quote:

Originally Posted by Boilerplate4U

Yes of course, but try to convince all the publishers that don't stick to the standard. Most of the EPUBs I've checked so far omit the scheme; that's the sad reality.

So one solution is to add an option to treat the identifier as an ISBN when the scheme is undefined. What do you think?

That is my experience as well. But most don't have the ISBN at all. If they have an identifier at all, it is usually a UUID. When they do have ISBNs, they appear to be correctly identified according to the standards.

And I did do a quick check of 20 or so EPUBs that I have purchased. Only three or four had the ISBN, and they had it correctly identified by the scheme. The rest either had a UUID or nothing at all.

Boilerplate4U · 04-02-2021, 09:36 AM

According to the standard specifications the scheme attribute seems to be optional, which is probably the main reason why so many publishers omit it.

"The identifier element has an optional OPF scheme attribute defined by this specification. The scheme attribute names the system or authority that generated or assigned the text contained within the identifier element, for example "ISBN" or "DOI." The values of the scheme attribute are case sensitive only when the particular scheme requires it.

This specification does not standardize or endorse any particular publication identifier scheme. Specific uses of URLs or ISBNs are not yet addressed by this specification. Identifier schemes are not currently defined by Dublin Core"

Sources for the specification text quoted by Boilerplate4U (04-02-2021, 09:36 AM): https://www.w3.org/Submission/2017/SUBM-epub-packages-20170125/#sec-opf-dcidentifier and http://idpf.org/epub/20/spec/OPF_2.0_final_spec.html#AppendixA (paragraph 2.2.10).
