10-24-2023, 10:26 PM | #31 |
Connoisseur
Posts: 72
Karma: 10
Join Date: Dec 2010
Device: Kindle Oasis
|
But with the lead story entirely missing…
|
10-24-2023, 10:28 PM | #32 |
Connoisseur
Posts: 72
Karma: 10
Join Date: Dec 2010
Device: Kindle Oasis
|
Sorry. Some stories come through correctly; others return an empty item entitled “Too many requests”.
|
11-20-2023, 06:27 AM | #33 |
creator of calibre
Posts: 44,569
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
And just for posterity, this is how far I got with reversing the JS. I can extract the encrypted key, the iv, and the encrypted data. The problem is in get_decryption_key(): for some reason the WSJ server isn't returning the decrypted key. The same request works in a browser, so I am guessing there is some cookie missing or the server does some TLS sniffing.
Code:
from html5_parser import parse
import json
from calibre import browser
from mechanize import Request
from urllib.parse import urlparse


def extract_json_data(raw_html):
    from pprint import pprint
    pprint
    root = parse(raw_html)
    d = json.loads(root.xpath('//script[@id="__NEXT_DATA__"]')[0].text)
    page_props = d['props']['pageProps']
    ed = page_props['encryptedDataHash']
    encrypted_data = ed['content']
    iv = ed['iv']
    encrypted_key = page_props['encryptedDocumentKey']
    url = root.xpath('//link[@rel="canonical"]')[0].get('href')
    return {'url': url, 'encrypted_data': encrypted_data, 'iv': iv, 'encrypted_key': encrypted_key}


def get_browser_for_wsj(*a, **kw):
    br = browser()
    br.set_cookie('wsjregion', 'na,us', '.wsj.com')
    br.set_cookie('gdprApplies', 'false', '.wsj.com')
    br.set_cookie('ccpaApplies', 'false', '.wsj.com')
    br.set_cookie('vcdpaApplies', 'false', '.wsj.com')
    br.set_cookie('regulationApplies', 'gdpr%3Afalse%2Ccpra%3Afalse%2Cvcdpa%3Afalse', '.wsj.com')
    br.set_handle_gzip(True)
    br.addheaders += [
        ('Accept', '*/*'),
        ('Accept-Language', 'en-GB,en-US;q=0.9,en;q=0.8'),
    ]
    return br


def get_decryption_key(br, data, referer):
    from pprint import pprint
    pprint
    purl = urlparse(referer)
    rq = Request('https://www.wsj.com/client', headers={
        'Cache-Control': 'max-age=0',
        'Referer': referer,
        'X-Encrypted-Document-Key': data['encrypted_key'],
        'X-Original-Host': 'www.wsj.com',
        'X-Original-Referrer': '',
        'X-Original-Url': purl.path,
    })
    br.set_debug_http(True)
    try:
        res = br.open(rq)
    except Exception as err:
        if hasattr(err, 'read'):
            raise Exception('decryption key request failed with error: {} and body: {}'.format(err, err.read().decode('utf-8', 'replace')))
        raise
    if res.code != 200:
        raise ValueError(f'decryption key request returned non OK HTTP result code: {res.code}')
    r = json.loads(res.read())
    key = r['documentKey']
    if not key:
        pprint(r)
        raise ValueError('No document key returned')


def get_wsj_article(url='https://www.wsj.com/world/middle-east/u-n-world-leaders-push-to-get-gaza-aid-flowing-after-biden-pledge-3b59283b'):
    br = get_browser_for_wsj()
    res = br.open(url)
    raw_html = res.read()
    data = extract_json_data(raw_html)
    get_decryption_key(br, data, res.geturl())


if __name__ == '__main__':
    get_wsj_article()
|
11-20-2023, 02:47 PM | #34 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
In get_decryption_key (line 43), try 'Referer': 'https://www.drudgereport.com/'.
With that I was able to get the documentKey successfully.
Last edited by unkn0wn; 11-20-2023 at 02:52 PM. |
11-21-2023, 01:24 AM | #35 |
creator of calibre
Posts: 44,569
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yeah, that works, but now the issue is how to decrypt using the key and iv. The obvious candidate, AES-CTR, doesn't seem to work.
Code:
import base64
import json
from html5_parser import parse
from mechanize import Request
from urllib.parse import urlparse

from calibre import browser


def extract_json_data(raw_html):
    from pprint import pprint
    pprint
    root = parse(raw_html)
    d = json.loads(root.xpath('//script[@id="__NEXT_DATA__"]')[0].text)
    page_props = d['props']['pageProps']
    ed = page_props['encryptedDataHash']
    # The article body and the iv are base64 text in the page; decode them to bytes
    encrypted_data = base64.standard_b64decode(ed['content'])
    iv = base64.standard_b64decode(ed['iv'])
    # The encrypted key is left as base64 text because it is sent back in a header
    encrypted_key = page_props['encryptedDocumentKey']
    url = root.xpath('//link[@rel="canonical"]')[0].get('href')
    return {'url': url, 'encrypted_data': encrypted_data, 'iv': iv, 'encrypted_key': encrypted_key}


def get_browser_for_wsj(*a, **kw):
    br = browser()
    br.set_cookie('wsjregion', 'na,us', '.wsj.com')
    br.set_cookie('gdprApplies', 'false', '.wsj.com')
    br.set_cookie('ccpaApplies', 'false', '.wsj.com')
    br.set_cookie('vcdpaApplies', 'false', '.wsj.com')
    br.set_cookie('regulationApplies', 'gdpr%3Afalse%2Ccpra%3Afalse%2Cvcdpa%3Afalse', '.wsj.com')
    br.set_handle_gzip(True)
    br.addheaders += [
        ('Accept', '*/*'),
        ('Accept-Language', 'en-GB,en-US;q=0.9,en;q=0.8'),
    ]
    return br


def get_decryption_key(br, data, referer='https://www.drudgereport.com/'):
    from pprint import pprint
    pprint
    purl = urlparse(referer)
    rq = Request('https://www.wsj.com/client', headers={
        'Cache-Control': 'max-age=0',
        'Referer': referer,
        'X-Encrypted-Document-Key': data['encrypted_key'],
        'X-Original-Host': 'www.wsj.com',
        'X-Original-Referrer': '',
        'X-Original-Url': purl.path,
    })
    br.set_debug_http(True)
    try:
        res = br.open(rq)
    except Exception as err:
        if hasattr(err, 'read'):
            raise Exception('decryption key request failed with error: {} and body: {}'.format(err, err.read().decode('utf-8', 'replace')))
        raise
    if res.code != 200:
        raise ValueError(f'decryption key request returned non OK HTTP result code: {res.code}')
    r = json.loads(res.read())
    key = r['documentKey']
    if not key:
        pprint(r)
        raise ValueError('No document key returned')
    return base64.standard_b64decode(key)


def decrypt_article(data):
    from Crypto.Cipher import AES
    from Crypto.Util import Counter
    ciphertext = data['encrypted_data']
    # ciphertext += b'\0' * (16 - len(ciphertext) % 16)
    # Pass the byte order explicitly: int.from_bytes() only defaults to 'big' on Python 3.11+
    iv_as_int = int.from_bytes(data['iv'], 'big')
    print(11111111, len(ciphertext), len(data['iv']), iv_as_int)
    # Treat the whole 16-byte iv as the initial value of a 128-bit big-endian counter
    counter = Counter.new(nbits=128, initial_value=iv_as_int)
    cipher = AES.new(data['key'], AES.MODE_CTR, counter=counter)
    return cipher.decrypt(ciphertext)


def get_wsj_article(url='https://www.wsj.com/world/middle-east/u-n-world-leaders-push-to-get-gaza-aid-flowing-after-biden-pledge-3b59283b'):
    br = get_browser_for_wsj()
    res = br.open(url)
    raw_html = res.read()
    data = extract_json_data(raw_html)
    data['key'] = get_decryption_key(br, data)
    return decrypt_article(data)


if __name__ == '__main__':
    data = get_wsj_article()
    print(data)
    print(b'content' in data)
|
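For anyone following along, here is a small standard-library sketch of what "AES-CTR with a 16-byte iv" implies for the counter blocks that get encrypted to make the keystream. The function name ctr_counter_blocks and the counter_bits parameter are made up for illustration; counter_bits mirrors the idea (as in the WebCrypto AES-CTR "length" parameter) that only the low bits of the block may count while the rest is a fixed nonce. Whether the WSJ script does anything like this is not known from this thread.

```python
def ctr_counter_blocks(iv, n_blocks, counter_bits=128):
    """Return the first n_blocks 16-byte counter blocks for AES-CTR.

    The low `counter_bits` bits of the iv are incremented as a big-endian
    integer and wrap around; the remaining high bits stay fixed (nonce).
    This is only the counter sequence, no AES is performed here.
    """
    if len(iv) != 16:
        raise ValueError('AES-CTR needs a 16-byte initial counter block')
    if not 1 <= counter_bits <= 128:
        raise ValueError('counter_bits must be between 1 and 128')
    start = int.from_bytes(iv, 'big')
    mask = (1 << counter_bits) - 1
    nonce = start & ~mask   # high bits, never change
    counter = start & mask  # low bits, incremented per block
    blocks = []
    for i in range(n_blocks):
        value = nonce | ((counter + i) & mask)
        blocks.append(value.to_bytes(16, 'big'))
    return blocks


if __name__ == '__main__':
    iv = bytes(15) + b'\xff'
    # Full 128-bit counter: the carry propagates into the next byte
    print([b.hex() for b in ctr_counter_blocks(iv, 2)])
    # 8-bit counter: the last byte wraps to zero, the other bytes are untouched
    print([b.hex() for b in ctr_counter_blocks(iv, 2, counter_bits=8)])
```

With counter_bits=128 this matches what Counter.new(nbits=128, initial_value=...) produces. The two settings only give different keystreams once the low bits wrap, so a wrong guess here would not garble the first block of output.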
11-21-2023, 04:32 AM | #36 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
In get_decryption_key, the 'X-Encrypted-Document-Key': data['encrypted_key'] value should not be base64 decoded; otherwise we will not get the documentKey.
I tried, but the decrypted output is unreadable, maybe because it's still base64 encoded. At this point I'm mostly unaware of how decryption works here.
Last edited by unkn0wn; 11-21-2023 at 04:55 AM. |
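A rough standard-library sketch of the point above: the content and iv get decoded to bytes, while the encrypted key stays a base64 string because it travels in a header. The helper names decode_payload_fields and key_request_headers are hypothetical, and taking 'X-Original-Url' from the article URL rather than from the referer is an assumption, not something confirmed in this thread. The header names themselves are the ones used in the code posted earlier.

```python
import base64
from urllib.parse import urlparse


def decode_payload_fields(page_props):
    """Split pageProps into bytes (content, iv) and text (encrypted key)."""
    ed = page_props['encryptedDataHash']
    return {
        'encrypted_data': base64.standard_b64decode(ed['content']),
        'iv': base64.standard_b64decode(ed['iv']),
        # Sent back to the server verbatim, so it stays a base64 string
        'encrypted_key': page_props['encryptedDocumentKey'],
    }


def key_request_headers(encrypted_key, article_url, referer):
    """Build the headers for the key request (hypothetical helper)."""
    if not isinstance(encrypted_key, str):
        raise TypeError('encrypted_key must be the undecoded base64 string')
    return {
        'Cache-Control': 'max-age=0',
        'Referer': referer,
        'X-Encrypted-Document-Key': encrypted_key,
        'X-Original-Host': 'www.wsj.com',
        'X-Original-Referrer': '',
        # Assumption: the path of the article itself, not of the referer
        'X-Original-Url': urlparse(article_url).path,
    }
```

Keeping the referer and the article URL as separate arguments makes it harder to end up with 'X-Original-Url' set to '/' by accident when the referer is swapped for another site.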
11-21-2023, 06:50 AM | #37 |
creator of calibre
Posts: 44,569
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Yeah, it's not base64 decoded, as it has to be sent in a header. As I said, it remains to figure out what decryption algorithm is used, either by stepping through the JS in a debugger or reversing it. From a quick read of the JS it looks like some variant of AES with a 16-byte "iv". I tried a few of the more obvious ones, like AES-256-CTR, but no luck.
|
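One cheap way to narrow down "some variant of AES" before reaching for a debugger is to look at the lengths alone. The sketch below is only a heuristic with a made-up helper name (plausible_aes_modes); it covers three common modes, ignores others such as CFB and OFB, and assumes a GCM tag of 16 bytes appended to the ciphertext, which is the default in WebCrypto but not guaranteed here.

```python
AES_BLOCK = 16  # AES block size in bytes
GCM_TAG = 16    # assumed tag length (128 bits) appended to the ciphertext


def plausible_aes_modes(ciphertext_len, iv_len):
    """Return the AES modes that are consistent with the given lengths.

    Not exhaustive and not proof of anything: it only rules modes out.
    """
    modes = []
    if iv_len == AES_BLOCK:
        # CTR uses a 16-byte counter block and accepts any ciphertext length
        modes.append('CTR')
        # CBC output is padded to a whole number of blocks
        if ciphertext_len > 0 and ciphertext_len % AES_BLOCK == 0:
            modes.append('CBC')
    # GCM accepts IVs other than 12 bytes; the tag must fit in the data
    if iv_len > 0 and ciphertext_len >= GCM_TAG:
        modes.append('GCM')
    return modes


if __name__ == '__main__':
    print(plausible_aes_modes(1000, 16))
    print(plausible_aes_modes(1024, 16))
```

If the decoded article data is not a multiple of 16 bytes long, CBC drops out, which would leave CTR and GCM style candidates to try first.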
11-24-2023, 04:56 AM | #38 |
Fanatic
Posts: 543
Karma: 82944
Join Date: May 2021
Device: kindle
|
If anyone's interested, try to figure out the decryption method used here.
Code:
{
    'url': 'https://www.wsj.com/tech/ai/openai-leadership-hangs-in-balance-as-sam-altmans-counte-rebellion-gains-steam-47276fa8',
    'encrypted_data': '',
    'iv': 'J0mre5ohZnHgK/RgHOTYhQ==',
    'encrypted_key': 'TY4XXz7TLdVFkd7pXhRZfqaRLYYdtpyCFrKnKe9EXfvaCfMOPo2dP/kC6TBmdCL7/IT7leMxY05OBv9gQkGVZgqCcI7lTLscMfvhhnmCjieb/NH3qbOkwwD+c0QXYosmf2aKYhUafSozz8ngBg6Q385j9pS36+sEfW6X3vFc/X+khJ7tChceWPIcM1JU8zs99bMomN451Vbhz6vUc+1W0bCk6hJ4yX1WGRlWRbM1vd88pEBmterZN+icij1+2g==',
    'key': 'bnQbZ9urHPWcAMC/RmPO/JyAXfKAGC6Jl7oqEjc1O+k='
}
key is the documentKey. Try to decrypt the encrypted_data to readable text/html.
|
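As a first sanity check on the sample above, the iv and key can be base64 decoded and measured with nothing but the standard library. The helper name decoded_sizes is made up for this sketch; the two strings are copied from the post.

```python
import base64

# Values copied from the sample posted above
sample = {
    'iv': 'J0mre5ohZnHgK/RgHOTYhQ==',
    'key': 'bnQbZ9urHPWcAMC/RmPO/JyAXfKAGC6Jl7oqEjc1O+k=',
}


def decoded_sizes(fields):
    """Map each field name to the byte length of its base64-decoded value."""
    return {name: len(base64.standard_b64decode(value)) for name, value in fields.items()}


sizes = decoded_sizes(sample)
print(sizes)  # {'iv': 16, 'key': 32}
```

A 32-byte key is the right size for AES-256 and a 16-byte iv is one AES block, which fits the earlier guess of an AES variant but does not by itself say which mode is in use.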
Tags |
calibre, wsj, wsj.com |
|