03-31-2021, 04:57 AM | #1 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
Metadata Source Plugin, advice needed
I tried a while ago to develop some Metadata Source Plugins but gave up because I did not understand the integration well enough. Now I'm going to make another attempt, but I'm hoping for some advice before I start:
Thanks in advance! |
03-31-2021, 06:57 AM | #2 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
I found some that were not too old, and they gave me some further insight. This is my guess at how the process is intended to work:
Any thoughts? |
03-31-2021, 09:16 AM | #3 |
Grand Sorcerer
Posts: 24,905
Karma: 47303822
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
That's pretty much it. But I'm not sure what "not too old" means in this case. All the available metadata source plugins are still valid. And for those that are not, it is probably because the site has changed and no one is interested in fixing them. My usual suggestion is to pick one whose code you like and use it as a model. The complications in the plugins are not in the basic structure; they are in how to parse the pages to find the details. Unless you are lucky and the site has an API or some other simple structure for the data, that is where you will be spending most of your time.
|
03-31-2021, 12:12 PM | #4 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
What I meant is that many of the older variants are far too complex in structure for me to find the central thread. It seems like the API documents would need some simple "howtos" that explain the basic process. Some simple source code templates would be useful as well.
Scraping content using XPath is quite easy nowadays when you have helper tools like xPather.com and Google Xpath Helper. Some of the lager content sites I'm going to use also offer APIs like marcxml/marc21 and json-ld, which makes things a lot easier. When I have some time, I'm going to put together a basic HOWTO with a corresponding code example for future reference and put it on GitHub.
Btw, do you know if the metadata plugins run in parallel or in series (i.e. when multiple plugins are activated)?
Last edited by Boilerplate4U; 03-31-2021 at 12:26 PM. |
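To illustrate why a JSON-LD response is easier to work with than a scraped page, here is a minimal sketch using only the Python standard library. The sample record, its values, and the helper names (`as_list`, `name_of`, `parse_book`) are made up for illustration and do not come from any real site or from calibre; real records may nest things differently.

```python
import json

# Made-up sample record in schema.org JSON-LD form (not from a real site).
SAMPLE_JSONLD = """
{
  "@context": "https://schema.org",
  "@type": "Book",
  "name": "An Example Title",
  "author": [
    {"@type": "Person", "name": "A. Writer"},
    {"@type": "Person", "name": "B. Editor"}
  ],
  "isbn": "9780000000002",
  "publisher": {"@type": "Organization", "name": "Example Press"},
  "inLanguage": "en"
}
"""


def as_list(value):
    """JSON-LD allows a single value or a list; normalise to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def name_of(node):
    """Return the 'name' of a nested object, or the value itself if it is plain."""
    if isinstance(node, dict):
        return node.get("name")
    return node


def parse_book(jsonld_text):
    """Pull the fields a metadata plugin typically wants out of one record."""
    record = json.loads(jsonld_text)
    authors = [name_of(a) for a in as_list(record.get("author"))]
    return {
        "title": record.get("name"),
        "authors": [a for a in authors if a],
        "isbn": record.get("isbn"),
        "publisher": name_of(record.get("publisher")),
        "language": record.get("inLanguage"),
    }


if __name__ == "__main__":
    print(parse_book(SAMPLE_JSONLD))
```

There is no page layout to reverse-engineer here: the field names are the interface, so a site redesign is less likely to break the parser than it would an XPath expression.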
03-31-2021, 10:06 PM | #5 | |||||
Grand Sorcerer
Posts: 24,905
Karma: 47303822
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
The complications come in how to do that search and whether you need to try various searches. Most will check for an identifier for the site, then an ISBN, and then do a search with title and author, and possibly a title-only search if that makes sense for the site. But the interface is described in the calibre source code, in calibre/ebooks/metadata/sources/base.py. Quote:
Quote:
Quote:
What sites are you looking at? I am a bit surprised that there are large ebook sites that do not already have metadata source plugins, unless they are non-English sites or very specialised repositories.
You might also find problems with sites that offer APIs if they need any sort of authentication. If you are doing this for personal use, it probably won't be an issue. But some of the sites will either limit access based on a developer key or need individual access, and that makes them harder for the general user to use. Quote:
|
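The lookup order described in this post (site identifier, then ISBN, then title and author, then title only) could be sketched roughly as below. This is plain Python, not calibre code: `build_queries`, `first_match`, the `site_id_key` parameter, and the `lookup` callback are hypothetical names used only to show the ordering, and a real plugin would do this inside the interface defined in base.py.

```python
def build_queries(site_id_key, identifiers, title=None, authors=None):
    """Return lookup attempts ordered from most to least specific.

    identifiers is a dict such as {"isbn": "..."}; site_id_key is the
    (hypothetical) key under which the site's own id is stored.
    """
    queries = []
    site_id = identifiers.get(site_id_key)
    if site_id:
        queries.append(("site_id", site_id))
    isbn = identifiers.get("isbn")
    if isbn:
        queries.append(("isbn", isbn))
    if title and authors:
        queries.append(("title_author", (title, tuple(authors))))
    if title:
        # Title-only search: only worth trying if it makes sense for the site.
        queries.append(("title", title))
    return queries


def first_match(queries, lookup):
    """Try each query in order and stop at the first one that returns results.

    lookup(kind, value) stands in for whatever request the site needs.
    """
    for kind, value in queries:
        results = lookup(kind, value)
        if results:
            return kind, results
    return None, []


if __name__ == "__main__":
    qs = build_queries("examplesite", {"isbn": "9781783984343"},
                       title="Some Title", authors=["Some Author"])
    print([kind for kind, _ in qs])
```

Keeping the ordering separate from the network code makes it easy to drop the title-only step for sites where it returns too many unrelated hits.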
|||||
04-02-2021, 05:26 AM | #6 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
A cold pint of lager always helps!
Thanks, but base.py doesn't really provide any direct hints on how to grasp the workflow, which is currently the essential part for me. As for searching, I think it's quite straightforward: use IDs like ISBN if available, otherwise title/author.
Some of the scientific sites I plan to use offer an API key when you register an account, which allows volume access for free (though I'm quite sure there is some kind of limit anyhow).
I believe source code samples really do matter, in the spirit of the design philosophy Specification By Example (imho).
Thanks for the info on how the plugins are executed within calibre.
Somewhat OT, but do you have any knowledge about scraping EPUBs for metadata using EPUBMetadataReader? It seems that the <dc:identifier> is not used to extract the ISBN from content.opf.
Question: Do you possibly have a clue as to which source file handles this?
Snippet from content.opf:
Code:
<package version="2.0" unique-identifier="bookid">
  <metadata>
    <dc:identifier id="bookid">9781783984343</dc:identifier>
    <dc:title >Reactive Programming with Scala and Akka</dc:title>
    <dc:publisher >Packt Publishing</dc:publisher>
    <dc:language >en</dc:language>
    <meta name="cover" content="cover-image"/>
  </metadata>
Last edited by Boilerplate4U; 04-02-2021 at 05:38 AM. |
04-02-2021, 06:30 AM | #7 | ||||
Grand Sorcerer
Posts: 24,905
Karma: 47303822
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
Quote:
Quote:
Quote:
Code:
<dc:identifier opf:scheme="ISBN">9781927464243</dc:identifier> |
||||
04-02-2021, 06:35 AM | #8 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
Could it be this one?
https://github.com/kovidgoyal/calibre/blob/master/src/calibre/ebooks/metadata/opf2.py
My guess is that the failed match in the XPath search "isbn_path" is caused by the "and" clause, i.e. and (re:match(@scheme, "isbn", "i") or re:match(@opf:scheme, "isbn", "i")), in the definition of isbn_path at the end of the excerpt below:
Code:
# Imports added so the excerpt stands alone.
import functools
from lxml import etree


class OPF(object):  # {{{

    MIMETYPE = 'application/oebps-package+xml'
    NAMESPACES = {
        None: "http://www.idpf.org/2007/opf",
        'dc': "http://purl.org/dc/elements/1.1/",
        'opf': "http://www.idpf.org/2007/opf",
    }
    META = '{%s}meta' % NAMESPACES['opf']
    # XPath cannot use a None prefix, so drop it and add the EXSLT regex namespace.
    xpn = NAMESPACES.copy()
    xpn.pop(None)
    xpn['re'] = 'http://exslt.org/regular-expressions'
    XPath = functools.partial(etree.XPath, namespaces=xpn)
    CONTENT = XPath('self::*[re:match(name(), "meta$", "i")]/@content')
    TEXT = XPath('string()')

    metadata_path = XPath('descendant::*[re:match(name(), "metadata", "i")]')
    metadata_elem_path = XPath(
        'descendant::*[re:match(name(), concat($name, "$"), "i") or (re:match(name(), "meta$", "i") '
        'and re:match(@name, concat("^calibre:", $name, "$"), "i"))]')
    title_path = XPath('descendant::*[re:match(name(), "title", "i")]')
    authors_path = XPath('descendant::*[re:match(name(), "creator", "i") and (@role="aut" or @opf:role="aut" or (not(@role) and not(@opf:role)))]')
    bkp_path = XPath('descendant::*[re:match(name(), "contributor", "i") and (@role="bkp" or @opf:role="bkp")]')
    tags_path = XPath('descendant::*[re:match(name(), "subject", "i")]')
    # Matches only <identifier> elements whose scheme attribute contains "isbn".
    isbn_path = XPath('descendant::*[re:match(name(), "identifier", "i") and '
                      '(re:match(@scheme, "isbn", "i") or re:match(@opf:scheme, "isbn", "i"))]')
|
04-02-2021, 06:38 AM | #9 |
creator of calibre
Posts: 44,416
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
from calibre.ebooks.metadata.meta import get_metadata

with open('/t/t.epub', 'rb') as f:
    mi = get_metadata(f, 'epub')
print(mi.identifiers)
|
04-02-2021, 06:51 AM | #10 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
Thanks, but any idea regarding the isbn xpath mismatch?
|
04-02-2021, 07:05 AM | #11 |
creator of calibre
Posts: 44,416
Karma: 23977332
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's working as intended. Read the EPUB 2 spec. You need to specify the scheme; otherwise it's just a meaningless number, not an ISBN.
|
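To make the point about the scheme attribute concrete, here is a small standard-library sketch, not calibre's implementation. It only mirrors the idea in `isbn_path`: an identifier counts as an ISBN only when a `scheme` or `opf:scheme` attribute says so. The function name `isbn_identifiers` and the two sample documents are made up for illustration, with namespace declarations added so the XML parses.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
OPF_NS = "http://www.idpf.org/2007/opf"


def isbn_identifiers(opf_text):
    """Return the text of every dc:identifier whose scheme mentions 'isbn'."""
    root = ET.fromstring(opf_text)
    found = []
    for elem in root.iter("{%s}identifier" % DC_NS):
        # Accept both the namespaced and the plain attribute.
        scheme = elem.get("{%s}scheme" % OPF_NS) or elem.get("scheme") or ""
        if "isbn" in scheme.lower():
            found.append((elem.text or "").strip())
    return found


# Same shape as the snippet from the thread: no scheme on the identifier.
WITHOUT_SCHEME = """<package xmlns="http://www.idpf.org/2007/opf"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:opf="http://www.idpf.org/2007/opf"
         version="2.0" unique-identifier="bookid">
  <metadata>
    <dc:identifier id="bookid">9781783984343</dc:identifier>
  </metadata>
</package>"""

# Identical, except that the scheme is declared.
WITH_SCHEME = WITHOUT_SCHEME.replace(
    '<dc:identifier id="bookid">',
    '<dc:identifier id="bookid" opf:scheme="ISBN">',
)

if __name__ == "__main__":
    print(isbn_identifiers(WITHOUT_SCHEME))
    print(isbn_identifiers(WITH_SCHEME))
```

One difference from the calibre excerpt: this sketch matches on the Dublin Core namespace, while `isbn_path` matches the element name with a regular expression, so it is more forgiving of files with missing namespace declarations.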
04-02-2021, 07:10 AM | #12 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
Yeah, according to http://idpf.github.io/epub-registries/identifiers there are a number of scheme values (like DOI, ISBN, JDCN, UUID), but it seems they are usually omitted, at least in the ones I've checked recently. ISBN seems to be the default when the scheme is undefined, so in this case it should work as the fallback type.
Last edited by Boilerplate4U; 04-02-2021 at 07:19 AM. |
04-02-2021, 07:19 AM | #13 | |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
Quote:
So one solution is to add an option to use ISBN when type is undefined. What do you think? |
|
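One way such an opt-in fallback could work is to treat a scheme-less identifier as an ISBN only if it passes the ISBN-13 check-digit test, so UUIDs and other values are not mislabelled. The sketch below is an assumption about how that might look, not existing calibre behaviour; `is_valid_isbn13` and `guess_scheme` are hypothetical names, and ISBN-10 is left out to keep it short.

```python
def is_valid_isbn13(text):
    """Check length, the 978/979 prefix, and the ISBN-13 check digit."""
    digits = text.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not (digits.isascii() and digits.isdigit()):
        return False
    if digits[:3] not in ("978", "979"):
        return False
    # Weights alternate 1, 3, 1, 3, ...; a valid number sums to a multiple of 10.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def guess_scheme(value, scheme=None):
    """Return the declared scheme if present, else 'isbn' only for a valid ISBN-13."""
    if scheme:
        return scheme.lower()
    if is_valid_isbn13(value):
        return "isbn"
    return None


if __name__ == "__main__":
    # The identifier from the content.opf snippet in this thread.
    print(guess_scheme("9781783984343"))
    print(guess_scheme("urn:uuid:123e4567-e89b-12d3-a456-426614174000"))
```

A declared scheme always wins, so files that follow the spec are unaffected; the checksum only decides the cases where the publisher left the scheme out.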
04-02-2021, 08:46 AM | #14 | |
Grand Sorcerer
Posts: 24,905
Karma: 47303822
Join Date: Jul 2011
Location: Sydney, Australia
Device: Kobo:Touch,Glo, AuraH2O, GloHD,AuraONE, ClaraHD, Libra H2O; tolinoepos
|
Quote:
And I did do a quick check of 20 or so epubs that I have purchased. Only three or four had the ISBN, and they had it correctly identified by the scheme. The rest either had a UUID or nothing at all. |
|
04-02-2021, 09:36 AM | #15 |
Enthusiast
Posts: 38
Karma: 10
Join Date: Jul 2020
Device: Kobo Clara HD
|
According to the standard specification, the scheme attribute seems to be optional. This is probably the main reason why so many publishers omit it.
"The identifier element has an optional OPF scheme attribute defined by this specification. The scheme attribute names the system or authority that generated or assigned the text contained within the identifier element, for example "ISBN" or "DOI." The values of the scheme attribute are case sensitive only when the particular scheme requires it. This specification does not standardize or endorse any particular publication identifier scheme. Specific uses of URLs or ISBNs are not yet addressed by this specification. Identifier schemes are not currently defined by Dublin Core" |
Tags |
metadata source plugin, template |
|