Old 12-12-2015, 09:08 PM   #237
trying
Member
trying doesn't litter
 
Posts: 21
Karma: 104
Join Date: Oct 2013
Device: none
As noted by Krazykiwi, you can get 403 errors if you try to download "bare" goodreads book urls (urls ending in just the numeric book ID, with no title slug) too many times. To investigate this further, I looked into how the Goodreads Metadata Source Plugin works.

When you first try to download metadata for a book that doesn't have a "goodreads:" (or "isbn:") entry in the identifiers field, the plugin does a goodreads search and then parses the HTML response to get the first matching book's url. That url includes the title slug, so it isn't bare and shouldn't trigger the 403 error.
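Here's a rough, self-contained sketch of that search-then-parse step. The HTML fragment and the regex are illustrative only; the real plugin fetches a live search-results page and parses it with lxml:

```python
import re

# Made-up fragment of a Goodreads search-results page (not live HTML).
sample_html = '''
<a class="bookTitle" href="/book/show/2767052-the-hunger-games?from_search=true">
  <span itemprop="name">The Hunger Games</span>
</a>
'''

BASE_URL = 'https://www.goodreads.com'  # assumed value of Goodreads.BASE_URL

def first_result_url(html):
    # Grab the first bookTitle link, dropping any query string.
    m = re.search(r'class="bookTitle" href="([^"?]+)', html)
    return BASE_URL + m.group(1) if m else None

print(first_result_url(sample_html))
# https://www.goodreads.com/book/show/2767052-the-hunger-games
```

Note that the extracted url carries the "-the-hunger-games" slug, which is why the search path doesn't hit the bare-url throttle.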

The next time you try to download metadata for that book, it will have a "goodreads:" identifier, so the plugin skips the search and attempts to get metadata by directly downloading
www.goodreads.com/book/show/{TheGoodreadsID} (see __init__.py lines 114-115). That is a bare url.
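A minimal sketch of that direct-download url construction (the function name and ID are illustrative, and the BASE_URL value is assumed):

```python
BASE_URL = 'https://www.goodreads.com'  # assumed value of Goodreads.BASE_URL

def book_url(goodreads_id):
    # "Bare" url: numeric ID only, no title slug appended.
    return '%s/book/show/%s' % (BASE_URL, goodreads_id)

print(book_url('2767052'))
# https://www.goodreads.com/book/show/2767052
```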

I speculate that this problem is more noticeable because the Description/Comments/Summary metadata broke recently and a new plugin version was required. So more people have been re-downloading Goodreads metadata for books that already have a "goodreads:" identifier.

You can fix this problem by changing the identify() method in __init__.py, line 115, to automatically do what Krazykiwi was doing manually. Just add a trailing "-" to the url as in the following:

Code:
  if goodreads_id:
      matches.append('%s/book/show/%s-' % (Goodreads.BASE_URL, goodreads_id))
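With that one-character change, the plugin requests a url ending in a dash, which presumably no longer looks like a bare numeric url to Goodreads (the same trick Krazykiwi applied by hand). A quick sketch of the resulting url, with an illustrative ID and an assumed BASE_URL:

```python
BASE_URL = 'https://www.goodreads.com'  # assumed value of Goodreads.BASE_URL
goodreads_id = '2767052'                # illustrative ID

# Mirrors the patched matches.append() line above.
patched_url = '%s/book/show/%s-' % (BASE_URL, goodreads_id)
print(patched_url)
# https://www.goodreads.com/book/show/2767052-
```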
Optionally, to see when the plugin is actually redoing a search to get a book url, find this line (__init__.py, line 238):

Code:
   result_url = Goodreads.BASE_URL + first_result_url_node[0]
you can add:

Code:
   log.info('First search results book url: %s' % result_url)
I just got metadata for 290 books; it took 13 minutes, 23 seconds (about 2.77 seconds per book). 16 books failed to get metadata, but they were all "No matches found with query" errors.
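The per-book figure is just the total elapsed time divided by the book count:

```python
# Sanity check of the timing quoted above.
books = 290
elapsed_s = 13 * 60 + 23        # 13 min 23 s = 803 s
per_book = elapsed_s / books    # ~2.77 s per book
print('%d s total, %.2f s per book' % (elapsed_s, per_book))
```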

The plugin does not use the Goodreads API but instead scrapes the book's html page, so it isn't limited to 1 request per second. I'm not sure why it's so slow (a custom C# metadata downloader I wrote can grab 500+ books in a few minutes). I didn't bother to figure this out, though, since it would probably unfairly load down the goodreads servers.