[GUI Plugin] Extract ISBN - Page 3

ldolse · 03-30-2011, 09:41 PM

Quote:

Originally Posted by kiwidude

My main concern would be false positives as you say from telephone numbers or similar. If you come up with something that you are confident will not suffer from that issue then I'm sure everyone would be grateful for your effort.

There is a 'check_isbn' function that is already in use in the various calibre metadata plugins that do some validation on whether a specific string of numbers is truly an ISBN vs a random string of numbers like a phone number. These get used before the metadata plugins send an ISBN to a metadata provider, but they should be good for this too.

from calibre.ebooks.metadata import check_isbn
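
For readers who have not looked at what such a check involves, here is a rough standalone sketch of ISBN checksum validation. It is not calibre's implementation of check_isbn; the function names below are made up for illustration, and only the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) checksum rules are assumed.

```python
DIGITS = "0123456789"


def isbn10_checksum_ok(s):
    """Return True if s has 10 characters and passes the ISBN-10 mod-11 rule."""
    if len(s) != 10:
        return False
    if any(c not in DIGITS for c in s[:9]) or s[9] not in DIGITS + "X":
        return False
    # Weights run 10, 9, ..., 1; a trailing X stands for the value 10.
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def isbn13_checksum_ok(s):
    """Return True if s has 13 digits and passes the ISBN-13 mod-10 rule."""
    if len(s) != 13 or any(c not in DIGITS for c in s):
        return False
    # Weights alternate 1, 3, 1, 3, ... from the left.
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(s))
    return total % 10 == 0
```

A checksum of this kind rejects most random digit strings such as phone numbers, but not all of them, so it reduces false positives rather than eliminating them.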

kiwidude · 03-31-2011, 03:49 AM

@ldolse - thx, yes I do indeed already make use of that in the plugin, so if that should be a sufficient failsafe then that is good news.

drMerry · 03-31-2011, 08:16 AM

An update.

I see that check_isbn is indeed used and works well.
I have made this version.
It works 2 times faster than the original.
I scanned 600 epubs that had no ISBN in their metadata (I did not check whether there was an ISBN inside the files).
I got 100 new ISBN numbers.

Seems nice, BUT:
I had 2 numbers that are valid but are not ISBNs.
There were ISBN numbers in the file. The numbers I found were there because of a bad epub conversion.

You cannot use \d; you have to use 0-9, because with \d calibre freezes on some files.
I have some trouble with multi-line matches.

I can detect:

NUR 123
ISBN 1234567890

and

NUR 123
ISBN 123 456.78
9

0

and

123 456.789

0

but NOT
NUR 123
1234567890

In this case 1231234567 is returned as a possible ISBN and found to be bad
(EDIT: added the 7; of course I do not get 213123456..)

Maybe someone can find a solution?
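
One possible approach to the "NUR 123 / 1234567890" case, offered only as a sketch and not as what the plugin does: strip the separators from each candidate and then try every 10-character window, so that leading digits from a previous line do not hide the real number. The regex uses [0-9] rather than \d, as reported above. The names CANDIDATE, isbn10_ok and find_isbn10 are hypothetical, and isbn10_ok is a stand-in for check_isbn. ISBN-13 is left out for brevity.

```python
import re

# A run of digits separated by optional whitespace, '-' or '.', with an
# optional trailing X. [0-9] is used instead of \d, as reported above.
CANDIDATE = re.compile(r"[0-9](?:[\s.\-]*[0-9])+(?:[\s.\-]*[Xx])?")


def isbn10_ok(s):
    """Stand-in validator: ISBN-10 mod-11 checksum only."""
    if len(s) != 10 or not all(c in "0123456789" for c in s[:9]):
        return False
    if s[9] not in "0123456789X":
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def find_isbn10(text, is_valid=isbn10_ok):
    """Return the first 10-character window that validates, or None."""
    for match in CANDIDATE.finditer(text):
        digits = re.sub(r"[\s.\-]", "", match.group()).upper()
        # Slide a 10-character window over the cleaned candidate.
        for start in range(len(digits) - 9):
            window = digits[start:start + 10]
            if is_valid(window):
                return window
    return None
```

The trade-off is that every extra window is another chance for a non-ISBN to pass the checksum, so this makes false positives somewhat more likely.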

I built in some restrictions to avoid some problems:
13 or 10 zeros is a valid ISBN, but you don't want to extract that.
I also check whether ISBN-13 numbers start with 978 or 979. If not, I do not even test validity.
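
The restrictions described above could look roughly like this. This is a sketch, not the code from the attached version, and the function name worth_validating is made up. It expects a string that has already had its separators removed.

```python
def worth_validating(digits):
    """Cheap pre-checks to run before the (more expensive) checksum test."""
    if len(digits) not in (10, 13):
        return False
    # All zeros passes the checksum but is never a real ISBN.
    if set(digits) == {"0"}:
        return False
    # ISBN-13 values are only considered with a 978 or 979 prefix.
    if len(digits) == 13 and not digits.startswith(("978", "979")):
        return False
    return True
```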

I'm bad at keeping changelogs, but here is some log info:
I changed extract_isbn_code
Added strings at the top of the file
Changed the regex
Changed loor_for_isbn_in_text

I'm not a py programmer, so maybe someone knows a better way to do the txt.replace (stripping all whitespace, including \n and \r, and removing - and .)

On the other hand, I have sometimes put an ISBN including - into the meta-info and calibre updated the info itself, so maybe only \n and \r need to be removed?
(In this case you don't even have to (and can't) test for 10 / 13 isbn, so it should go even faster.)
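
As one possible answer to the txt.replace question, here is a minimal sketch of both variants: a single regex substitution that removes whitespace, - and ., and a lighter version that only removes \r and \n. The function names are hypothetical.

```python
import re

# Any run of whitespace (including \n and \r), '.' or '-'.
SEPARATORS = re.compile(r"[\s.\-]+")


def strip_separators(text):
    """Remove all whitespace, dots and dashes in one pass."""
    return SEPARATORS.sub("", text)


def strip_newlines(text):
    """Remove only line breaks, keeping dashes and spaces as they are."""
    return text.replace("\r", "").replace("\n", "")
```

With the second variant the result can still contain dashes or spaces, so its length can no longer be compared directly against 10 or 13.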

I also included a pdf with legal isbn-ranges. If you add this check, next to the validity check, you're 99.99999% sure it is an ISBN-number

drMerry · 03-31-2011, 10:11 AM

I just tested some new ebooks.
PDF is still extremely slow.
The PDF slowness is because of the pdftohtml process. On all my PCs it uses 50% of my CPU (1 complete core). Maybe a bug in calibre?

There will be more errors if you try to index a math book or a technical manual (because of the large number of large numbers).
But that will be a problem for a minority of users (including me).
Maybe you can add an option to only check numbers with ISBN notation in front (as it is at this moment).

theducks · 03-31-2011, 11:17 AM

Some of my really old Dead Tree ™ books had SAN or ISBN's without the check digit (9 chars long) ( I think there may have been confusion at the time, that the check digit was not part of the ISBN to be printed)

I backed in the check digit by trying [0-9X] until Calibre gave me a Green ISBN-10 confirmation.
Note: (USA publishers) space or dash were the only separators I saw. never mixed. never interrupted with other characters.
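
Instead of trying [0-9X] by hand, the missing tenth character can be computed directly from the ISBN-10 mod-11 rule. This is a small sketch with a made-up function name; it assumes the 9 characters really are the first nine digits of an ISBN-10.

```python
def isbn10_check_digit(first_nine):
    """Return the check character ('0'-'9' or 'X') for nine ISBN-10 digits."""
    if len(first_nine) != 9 or not all(c in "0123456789" for c in first_nine):
        raise ValueError("expected exactly nine digits")
    # Weights run 10 down to 2 for the first nine positions.
    total = sum((10 - i) * int(c) for i, c in enumerate(first_nine))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)
```

For example, isbn10_check_digit("030640615") gives "2", which completes the number to 0306406152.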

kiwidude · 03-31-2011, 11:37 AM

drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.

In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.

As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pick up an ISBN from. It is just a tool, not a miracle worker. If your ISBNs are so badly formatted the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.

I don't want to have a whole bunch of options on this plugin, it is why I have resisted putting a menu onto it as there are too many permutations. I think of how I see people using it - they will give it a one click shot at trying to find an ISBN, and after that they will use a metadata download type lookup based on title/author matching. I really don't see them wasting a lot of time bothering making multiple attempts on the same book using different options? If it fails and they believe there "really must" be an ISBN in there, they will view the book and type it in if it means that much to them (which they will have to do for any graphical based PDFs anyways).

However that is just my opinion on how I see people using it.

If it handles 98% of the book ISBNs out there, that is still an improvement over not having it.

drMerry · 03-31-2011, 12:39 PM

Quote:

Originally Posted by kiwidude

drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning pdfs with the new pdf engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should change when not doing the whole document hopefully.

I think so.
I had a PDF of 700 pages, 163 MB.
It took more than half an hour to find out that your tagger (also with my regex) could not find an ISBN.

Quote:

Originally Posted by kiwidude

In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be for a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow try to only scan a small front/back portion of all books (not just pdf ones) that shouldn't be an issue. I will look into that.

I mean that a book with a lot of numbers has more chance of containing a number that conforms to the ISBN standard. So this could give a false positive.
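
To put a rough number on that risk, the sketch below counts how many 10-digit strings with a fixed, arbitrary prefix happen to pass the ISBN-10 checksum. The prefix "12345" and the function names are made up for illustration, and only all-digit strings (no trailing X) are considered.

```python
from itertools import product


def isbn10_ok(s):
    """ISBN-10 mod-11 checksum for an all-digit, 10-character string."""
    total = sum((10 - i) * int(c) for i, c in enumerate(s))
    return total % 11 == 0


def pass_rate(prefix="12345"):
    """Fraction of 10-digit strings starting with prefix that pass the checksum."""
    tail_len = 10 - len(prefix)
    tails = ("".join(t) for t in product("0123456789", repeat=tail_len))
    hits = sum(isbn10_ok(prefix + tail) for tail in tails)
    return hits / 10 ** tail_len
```

The rate comes out at roughly 1 in 11, so in a book full of 10-digit numbers the checksum alone will let some non-ISBNs through.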

Quote:

Originally Posted by kiwidude

As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pick up an ISBN from. It is just a tool, not a miracle worker. If your ISBNs are so badly formatted the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself and perhaps either decide to look for a better copy or edit it.

I often see PDF files with the ISBN scattered over the front page (because the OCR cannot handle the front page/picture). The rest of the document is good in this case.
This is of course an OCR error and I can understand you do not want to invest in bad OCR. Because I've seen it often in books with the ISBN on the front cover, I myself would add the newline option. Testing ISBN numbers and trying to recover a good ISBN out of iop830l|Ix would be something else.
On the other hand, if I do not add the \s in the regex, I cannot retrieve ISBN numbers with the last number right before a linefeed.

@your opinion about 98% and a lot of sub-options:
agree

drMerry · 03-31-2011, 12:42 PM

Quote:

Originally Posted by theducks

Some of my really old Dead Tree ™ books had SAN or ISBN's without the check digit (9 chars long) ( I think there may have been confusion at the time, that the check digit was not part of the ISBN to be printed)

That's a nasty one. If this is the case for a lot of books of this period, it would be a drawback.

Quote:

Originally Posted by theducks

I backed in the check digit by trying [0-9X] until Calibre gave me a Green ISBN-10 confirmation.
Note: (USA publishers) space or dash were the only separators I saw. never mixed. never interrupted with other characters.

I can confirm I have also never seen them mixed. Have you not seen dots between them?

theducks · 03-31-2011, 02:08 PM

Quote:

Originally Posted by drMerry

You have not seen dots between them?

You really expect me to remember a possible 1 or 2 out of 900+

All I remember was that my attempt to validate ISBN's in my collection Database (Paradox DOS) returned inconsistent results (90% passed)

All that checking the (failed) entries against the printed books showed was that it was not a 'fat fingering' problem

I vaguely remember some non-US published books having more than 1 ('Country specific') ISBN on the copyright page.

drMerry · 03-31-2011, 06:37 PM

Quote:

Originally Posted by theducks

You really expect me to remember a possible 1 or 2 out of 900+

You don't?
Not a real e-book reader then

Quote:

Originally Posted by theducks

All I remember, was my attempt to Validate ISBN's in my collection Database (Paradox DOS), returned inconsistent results

(90% passed)
...
I vaguely remember some non-US published books having more than 1 ('Country specific') ISBN on the copyright page.

I myself am not sure whether I've seen dots.
Spaces and dashes (---) for sure.
Every newly added character will slow down the process a bit (noticeable when a large number of pages is scanned).

But I think for speeding up the process we will have to wait for the mentioned replacement of pdftohtml.

Loeffel · 04-02-2011, 09:36 AM

Hi,
normally the extraction runs fine, but if I try to scan many ebooks at once with the auto feature, the plugin hangs and the only way to go on is to kill Calibre (I don't know at which number of books).
At first I thought it was OK, but then I saw that whenever this happens, the plugin doesn't go on; it stops at the first ebook. If I scan only this book or just a few, no problem.
I can scan about 300 books at once without problems.

kiwidude · 04-02-2011, 09:45 AM

@Loeffel - I would be surprised if it is "hanging"; I think it more likely you are hitting some large PDFs that it is taking a long time to analyse. If you run the plugin in debug mode (Ctrl+Shift+R) you should see it continuing to display output as the input converters do their thing.

Loeffel · 04-02-2011, 11:35 AM

I have no really big ebooks, but some in different formats; perhaps that's the problem. Is there any way to tell the plugin just to search the first format found?

kiwidude · 04-02-2011, 11:58 AM

Go to the customisation for the plugin, and you can set its behaviour. However I think by default the alternate search is set to only check the first format in preferred input order.

It doesn't necessarily have to be a massive PDF, but just PDFs in general will slow it down, by how much depends on the content I think moreso than size. If it has lots of graphics I think that makes it grind rather slowly. There's a few posts in this thread about it if you read back. Now 0.7.53 is out I can start experimenting more seriously with the "first 10 pages/last 5 pages" approach to scanning which hopefully will improve things.
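
The "first 10 pages/last 5 pages" idea amounts to picking a small set of page numbers before any text is extracted. The sketch below shows only that selection step; it says nothing about how calibre or the plugin actually drives the PDF engine, and the function name and parameters are hypothetical.

```python
def pages_to_scan(page_count, front=10, back=5):
    """Return the 1-based page numbers to scan, without duplicates."""
    first = range(1, min(front, page_count) + 1)
    last = range(max(page_count - back + 1, 1), page_count + 1)
    # For short documents the two ranges overlap, so merge them as a set.
    return sorted(set(first) | set(last))
```

For a 700-page PDF this selects 15 pages instead of 700, which is where any speed-up would come from.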

Loeffel · 04-02-2011, 08:39 PM

I found it. I will let it run while I'm sleeping. I will see what happened when I come back. If it just looks like or if it really f...s up.
