#46 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
#47 | |
Grand Sorcerer
Posts: 11,470
Karma: 13095790
Join Date: Aug 2007
Location: Grass Valley, CA
Device: EB 1150, EZ Reader, Literati, iPad 2 & Air 2, iPhone 7
|
Quote:
Dale |
|
#48 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Code:
pydoc str |
#49 |
Member
Posts: 10
Karma: 10
Join Date: Jun 2007
Location: Slovakia
Device: HTC Touch Diamond, Sony Reader 505
|
If I understand it correctly, rpartition divides a string into a 3-member array. This doesn't really help me that much, as I don't "speak" Python and it's different from the languages that I know. So could I ask some Python-knowledgeable person to give me the exact command for the string conversion? I assume it would cost you about 5 seconds of your life.
Thank you in advance. In return I offer (rusty) Pascal and VBScript support.
I need http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html to become http://www.sme.sk/clanok_tlac.asp?cl=3592953
replace('/c/', '/clanok_tlac.asp?cl=') is step one, but after that I'm stuck. |
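For readers in the same position, here is a small sketch of what `rpartition` does with the URL from this post. It splits at the last occurrence of the separator and returns three pieces (a tuple rather than an array). The variable names `head`, `sep`, and `tail` are only illustrative.

```python
url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'

# rpartition splits at the LAST occurrence of the separator and
# returns a 3-tuple: (text before, the separator, text after).
head, sep, tail = url.rpartition('/')

print(head)  # http://www.sme.sk/c/3592953
print(sep)   # /
print(tail)  # Ceskoslovenska-esej.html
```

So index `[0]` of the result is the URL with the trailing article slug removed, which is the part the print-view address needs.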
#50 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Ah well, here you go:
Code:
url = 'http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'.rpartition('/')[0].replace('c/', 'clanok_tlac.asp?cl=') |
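The one-liner above could also be wrapped in a small helper so it can be applied to any article link. This is only a sketch: `to_print_url` is a hypothetical name, not part of libprs500, and it assumes every article URL has the form `http://www.sme.sk/c/<id>/<slug>.html`. It matches `'/c/'` with both slashes and replaces only the first occurrence, which is a slightly more defensive variant of the `'c/'` replacement.

```python
def to_print_url(article_url):
    """Turn an sme.sk article URL into its print-view URL (hypothetical helper)."""
    # Drop the trailing slug, e.g. '/Ceskoslovenska-esej.html'.
    head = article_url.rpartition('/')[0]
    # Swap the '/c/' path segment for the print script. The third
    # argument limits the replacement to the first occurrence only.
    return head.replace('/c/', '/clanok_tlac.asp?cl=', 1)


print(to_print_url('http://www.sme.sk/c/3592953/Ceskoslovenska-esej.html'))
# http://www.sme.sk/clanok_tlac.asp?cl=3592953
```

For the example URL this gives the same result as the one-liner.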
#51 |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Text links being dropped
Kovid,
I noticed that web2lrf entirely drops words that have underlying links. This makes some articles a little hard to understand, since key words are sometimes left out. As an example, in the following article the names "David Beckham," "Adidas," and "Pepsi" are all dropped when it is converted to an LRF:
http://www.nytimes.com/2007/11/17/bu...gewanted=print
I noticed the same thing happens when downloading the HTML file and running it through html2lrf. I've attached the LRF I generated as an example.
Is there something about linked text that makes it difficult to parse? Or is this simply a bug that needs to be eliminated? Thanks a lot for your help.
BTW, I'm still trying to get some profiles made. Not knowing Python is proving to be a rather large stumbling block, however.
Last edited by JTravers; 11-21-2007 at 04:55 AM. Reason: added lrf attachment |
#52 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That's a bug, actually a regression I introduced a few versions back. It will be fixed in the next release.
|
#53 | |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Quote:
http://docs.python.org/tut/tut.html |
|
#54 | |
Groupie
Posts: 182
Karma: 1078201
Join Date: Sep 2007
Device: iPad Air 2
|
Quote:
I'm really looking forward to getting some more interesting web content onto my 505. BTW, does web2lrf only accept RSS feeds as input, or can one give it a regular webpage to process? |
|
#55 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
|
#56 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
Can you stop the processing after the HTML has been cleaned up but before the HTML file tree is removed? (Or how do you get a web2html?)
|
#57 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
web2disk
|
#58 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
Does web2disk really do the cleanup of the HTML code? If I only want the files, I suppose wget will work too. Or does web2disk do something that wget does not?
|
#59 |
creator of calibre
Posts: 44,151
Karma: 22670164
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
It's optimized for downloading websites for conversion to ebooks. It has link filters, recursion level control, and a bunch of other features.
Code:
web2disk --help |
#60 |
Grand Sorcerer
Posts: 7,452
Karma: 7185064
Join Date: Oct 2007
Location: Linköpng, Sweden
Device: Kindle Voyage, Nexus 5, Kindle PW
|
But if you run web2lrf it seems like the cleanup is done just before the conversion to another format. With --debug it says:
[INFO] convert_from.py:330: Processing 7108374.stm
[INFO] convert_from.py:283: Parsing HTML...
[INFO] convert_from.py:318: Written preprocessed HTML to /tmp/html2lrf-verbose.html
[INFO] convert_from.py:333: Converting to BBeB...
But since "web2disk bbc" is not implemented I have not been able to get the result after the preprocessing so I have not been able to check how it looks. |
Tags |
libprs500, web2lrf |
|