03-30-2011, 10:41 PM | #31 | |
Wizard
Posts: 1,337
Karma: 123455
Join Date: Apr 2009
Location: Malaysia
Device: PRS-650, iPhone
|
Quote:
from calibre.ebooks.metadata import check_isbn |
|
03-31-2011, 04:49 AM | #32 |
Calibre Plugins Developer
Posts: 4,694
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Idolse - thx, yes I do indeed already make use of that in the plugin, so if that should be a sufficient failsafe then that is good news
|
03-31-2011, 09:16 AM | #33 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
An update.
check_isbn is indeed used, and I see it works well. I have made this version; it runs about 2 times faster than the original.
I scanned 600 epubs that had no ISBN in their metadata (I did not check whether there was an ISBN inside the files) and got 100 new ISBNs. Seems nice, BUT: in 2 cases I got a number that is valid but is not an ISBN. There were ISBNs in those files; the numbers I found were there because of a bad epub conversion.
You cannot use \d; you have to use 0-9, because with \d calibre freezes on some files.
I have some trouble with multi-line matches. I can detect:
NUR 123 ISBN 1234567890
NUR 123 ISBN 123 456.78 9 0
123 456.789 0
but NOT:
NUR 123 1234567890
In this case 1231234567 is returned as a possible ISBN and found to be bad. (EDIT: added the 7. Of course I do not get 213123456..) Maybe someone can find a solution?
I built in some restrictions to avoid some problems:
13 or 10 zeros is a valid ISBN, but you don't want to extract that.
I also check whether 13-digit numbers start with 978 or 979. If not, I do not even test validity.
I'm a bad programmer when it comes to changelogs, but here is some log info:
I changed extract_isbn_code
Added strings at the top of the file
Changed the regex
Changed loor_for_isbn_in_text
I'm not a Python programmer, so if someone knows a better way to do the txt.replace (stripping all whitespace, including \n and \r, and removing - and .), please say so. On the other hand, I have sometimes put an ISBN including - into the metadata and calibre updated the info itself, so maybe only \n and \r need to be removed? (In that case you don't even have to, and can't, test for a length of 10 or 13, so it should go even faster.)
I also included a PDF with the legal ISBN ranges. If you add this check next to the validity check, you're 99.99999% sure it is an ISBN.
Last edited by kiwidude; 05-28-2012 at 11:34 AM. Reason: Remove attachment so others do not get confused |
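A minimal sketch of the restrictions described in the post above (strip separators, reject all-zero numbers, require a 978/979 prefix on 13-digit numbers, then run the checksum). It is written as plain Python so it runs outside calibre. All function names here are hypothetical and this is not the plugin's own code; inside calibre the check_isbn import quoted earlier in the thread is what the plugin reportedly relies on for validation.

```python
import re

# [0-9] is used instead of \d, following the report in the post above.
ISBN10_SHAPE = re.compile(r"[0-9]{9}[0-9X]")
ISBN13_SHAPE = re.compile(r"[0-9]{13}")


def normalize(raw):
    """Remove whitespace (including \\n and \\r), hyphens and dots."""
    return re.sub(r"[\s.\-]", "", raw).upper()


def is_valid_isbn10(s):
    """ISBN-10 checksum: weights 10..1, sum divisible by 11, X means 10."""
    if not ISBN10_SHAPE.fullmatch(s):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(s))
    return total % 11 == 0


def is_valid_isbn13(s):
    """ISBN-13 checksum: alternating weights 1 and 3, sum divisible by 10."""
    if not ISBN13_SHAPE.fullmatch(s):
        return False
    total = sum(int(c) * (1 if i % 2 == 0 else 3) for i, c in enumerate(s))
    return total % 10 == 0


def plausible_isbn(raw):
    """Return the cleaned ISBN, or None if any restriction rejects it."""
    s = normalize(raw)
    if s and set(s) == {"0"}:
        return None  # all zeros passes the checksum but is not wanted
    if len(s) == 13:
        if not s.startswith(("978", "979")):
            return None  # skip the checksum entirely for other prefixes
        return s if is_valid_isbn13(s) else None
    if len(s) == 10:
        return s if is_valid_isbn10(s) else None
    return None
```

For example, plausible_isbn("978-0-306-40615-7") gives back the 13 digits without hyphens, while thirteen zeros or a checksum-valid number with another prefix is rejected before any further work is done.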
03-31-2011, 11:11 AM | #34 |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
I just tested some new ebooks.
PDF is still extremely slow. The PDF slowness is because of the pdftohtml process, which on all my PCs uses 50% of my CPU (1 complete core). Maybe a bug in calibre?
There will be more errors if you try to index a math book or a technical manual (because of the large number of large numbers), but that will be a problem for a minority of users (including me). Maybe you can add an option to only check numbers with an ISBN notation in front (like it is at this moment).
Last edited by drMerry; 03-31-2011 at 11:14 AM. Reason: pdf-error info |
03-31-2011, 12:17 PM | #35 |
Well trained by Cats
Posts: 30,579
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
Some of my really old Dead Tree ™ books had SAN or ISBNs without the check digit (9 chars long). (I think there may have been confusion at the time about whether the check digit was part of the ISBN to be printed.)
I backed in the check digit by trying [0-9X] until Calibre gave me a green ISBN-10 confirmation. Note: (USA publishers) space or dash were the only separators I saw. Never mixed. Never interrupted with other characters. |
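Trying [0-9X] by hand works, but the ISBN-10 check digit can also be computed directly from the first 9 digits. A small sketch, with a hypothetical function name:

```python
def isbn10_check_digit(first9):
    """Compute the ISBN-10 check character for a 9-digit prefix.

    The 9 digits are weighted 10 down to 2; the check digit is the value
    that brings the weighted sum up to a multiple of 11 (10 is written X).
    """
    if len(first9) != 9 or any(c not in "0123456789" for c in first9):
        raise ValueError("expected exactly 9 ASCII digits")
    total = sum((10 - i) * int(c) for i, c in enumerate(first9))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)


def complete_isbn10(first9):
    """Append the computed check character to a 9-digit number."""
    return first9 + isbn10_check_digit(first9)
```

This only fills in the last character; it cannot tell whether the 9 digits themselves were printed or typed correctly.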
03-31-2011, 12:37 PM | #36 |
Calibre Plugins Developer
Posts: 4,694
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
drMerry - I will take a look at your changes once 0.7.53 goes out. Kovid has made some changes which will allow scanning PDFs with the new PDF engine for just a selected number of pages from the front and back. We will have to see whether that significantly improves the performance or not. My initial testing of scanning the whole document actually found that the new engine is currently slower than the existing one, but that should hopefully change when not doing the whole document.
In terms of options to check numbers on technical manuals etc, I don't see where the issue will be. You say there will be more "errors" - do you mean more matches that are rejected? The logic I have will remain the same in terms of stopping searching after finding a valid ISBN. So surely the only issue will be that a manual that does not have an ISBN but does have lots of numbers in it will run slower? If I am able to somehow scan only a small front/back portion of all books (not just PDF ones) that shouldn't be an issue. I will look into that.
As for all the variations of ISBN being split across lines - I will be honest with my selfishness and repeat my statement above that I really don't care if there are really badly scanned documents that this fails to pick up an ISBN from. It is just a tool, not a miracle worker. If your ISBNs are so badly formatted, the rest of the content of that document will surely also be dire - not getting an ISBN may force you to open it and see for yourself, and perhaps decide either to look for a better copy or to edit it.
I don't want to have a whole bunch of options on this plugin; it is why I have resisted putting a menu onto it, as there are too many permutations. I think of how I see people using it - they will give it a one click shot at trying to find an ISBN, and after that they will use a metadata download type lookup based on title/author matching. I really don't see them wasting a lot of time making multiple attempts on the same book using different options. If it fails and they believe there "really must" be an ISBN in there, they will view the book and type it in if it means that much to them (which they will have to do for any graphical based PDFs anyway).
However that is just my opinion on how I see people using it. If it handles 98% of the book ISBNs out there, that is still an improvement over not having it. |
03-31-2011, 01:39 PM | #37 | |||
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
I had a PDF of 700 pages, 163 MB. It took me more than half an hour to find out that your tagger (also with my regex) could not find an ISBN.
Quote:
Quote:
This is of course an OCR error, and I can understand that you do not want to invest in bad OCR. Because I've seen it often in books with the ISBN on the front cover, I myself would add the newline option. Testing ISBN numbers and trying to recover a good ISBN out of iop830l|Ix would be something else. On the other hand, if I do not add the \s in the regex, I cannot retrieve ISBN numbers whose last digit is right before a linefeed. Regarding your opinion about 98% and a lot of sub-options: agreed. |
|||
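One possible way around the linefeed problem mentioned above, sketched under assumptions: instead of adding \s to the pattern, the candidate can be delimited with lookarounds, so a number whose last digit sits right before a linefeed is still matched, while a newline can never join two separate numbers (the "NUR 123" followed by an ISBN on the next line case). The pattern and the function name are hypothetical and are not the regex used in the plugin. It allows at most one space, dot or hyphen between digits, and it deliberately uses [0-9] rather than \d.

```python
import re

# A digit not preceded by a digit, then 8 to 11 more digits, then a final
# digit or X not followed by a digit. Each digit may be preceded by one
# space, dot or hyphen. Newlines are not accepted as separators.
CANDIDATE = re.compile(
    r"(?<![0-9])[0-9](?:[ .\-]?[0-9]){8,11}[ .\-]?[0-9Xx](?![0-9])"
)


def find_candidates(text):
    """Return raw candidate strings; each still needs a validity check."""
    return [m.group(0) for m in CANDIDATE.finditer(text)]
```

Candidates of 11 or 12 digits can also come out of this pattern, so a length and checksum test is still needed afterwards. An ISBN that really is split across two lines is not found by this sketch, and two numbers on the same line separated by a single space are still merged into one candidate.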
03-31-2011, 01:42 PM | #38 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
Quote:
I can confirm I also have never seen it mixed. You have not seen dots between them? |
|
03-31-2011, 03:08 PM | #39 |
Well trained by Cats
Posts: 30,579
Karma: 58055868
Join Date: Aug 2009
Location: The Central Coast of California
Device: Kobo Libra2,Kobo Aura2v1, K4NT(Fixed: New Bat.), Galaxy Tab A
|
You really expect me to remember a possible 1 or 2 out of 900+?
All I remember is that my attempt to validate the ISBNs in my collection database (Paradox DOS) returned inconsistent results (90% passed). All that checking the failed entries against the printed books showed was that it was not a 'fat fingering' problem. I vaguely remember some non-US published books having more than 1 ('country specific') ISBN on the copyright page. |
03-31-2011, 07:37 PM | #40 | |
Addict
Posts: 293
Karma: 21022
Join Date: Mar 2011
Location: NL
Device: Sony PRS-650
|
You don't?
Not a real e-book reader then.
Quote:
Spaces and dashes are certain. Every newly added character will slow down the process a bit (noticeable when a large number of pages has to be scanned). But I think for speeding up the process we will have to wait for the mentioned replacement of pdftohtml. |
|
04-02-2011, 10:36 AM | #41 |
Connoisseur
Posts: 58
Karma: 10
Join Date: Mar 2011
Device: Kindle 3 3G
|
Hi,
Normally the extraction runs fine, but if I try to scan many ebooks at once with the auto feature, the plugin hangs and the only way to go on is to kill Calibre (I don't know at what number of books this starts). At first I thought it was OK, but then I saw that whenever this happens the plugin does not progress at all: it stops at the first ebook. Scanning that book alone, or just a few books, is no problem. I can scan about 300 books at once without problems. |
04-02-2011, 10:45 AM | #42 |
Calibre Plugins Developer
Posts: 4,694
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
@Loeffel - I would be surprised if it is "hanging"; I think it more likely you are hitting some large PDFs that it is taking a long time to analyse. If you run the plugin in debug mode (Ctrl+Shift+R) you should see it continuing to display output as the input converters do their thing.
|
04-02-2011, 12:35 PM | #43 |
Connoisseur
Posts: 58
Karma: 10
Join Date: Mar 2011
Device: Kindle 3 3G
|
I have no really big ebooks, but I have some in different formats; perhaps that's the problem. Is there any way to tell the plugin to search only the first format found?
|
04-02-2011, 12:58 PM | #44 |
Calibre Plugins Developer
Posts: 4,694
Karma: 2162246
Join Date: Oct 2010
Location: Australia
Device: Kindle Oasis
|
Go to the customisation for the plugin, and you can set its behaviour. However I think by default the alternate search is set to only check the first format in preferred input order.
It doesn't necessarily have to be a massive PDF; PDFs in general will slow it down, and by how much depends, I think, more on the content than on the size. If it has lots of graphics I think that makes it grind rather slowly. There are a few posts in this thread about it if you read back. Now that 0.7.53 is out I can start experimenting more seriously with the "first 10 pages/last 5 pages" approach to scanning, which hopefully will improve things. |
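The "first 10 pages/last 5 pages" idea can be sketched independently of calibre. The snippet below assumes the text of each page is already available as a list of strings, which is an assumption for illustration only; how the new PDF engine actually exposes page ranges is not shown in this thread, and the function name is hypothetical.

```python
def pages_to_scan(pages, front=10, back=5):
    """Pick the front and back pages of a book, without duplicates.

    If the book is short enough that the two ranges would touch or
    overlap, every page is returned exactly once.
    """
    if len(pages) <= front + back:
        return list(pages)
    return list(pages[:front]) + list(pages[-back:])
```

With a 700-page document this leaves 15 pages to search instead of 700, at the cost of missing an ISBN that is printed only somewhere in the middle.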
04-02-2011, 09:39 PM | #45 |
Connoisseur
Posts: 58
Karma: 10
Join Date: Mar 2011
Device: Kindle 3 3G
|
I found it. I will let it run while I'm sleeping and see what happened when I come back: whether it only looks like it hangs or whether it really f...s up.
|