#1 |
Enthusiast
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
|
Understanding html input plugin
Can someone point me to the documentation or source for the HTML input plugin? I need a better understanding of what it does. Sorry for the stupid questions, but I am learning as I go. Deeply appreciative that calibre exists and of what it does. Fred
|
#2 |
creator of calibre
Posts: 44,859
Karma: 26594666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
Look for plugins/html_input.py in the source code.
|
#3 |
Enthusiast
Posts: 28
Karma: 10
Join Date: May 2010
Device: Kindle
|
So it looks to me like the function doing all the file rewriting is rewrite_links, which is defined in base.py and removes all absolute links. Then it looks for a CSS file, which cssutils parses. Is this broadly correct?
Code:
```python
def rewrite_links(root, link_repl_func, resolve_base_href=False):
    '''
    Rewrite all the links in the document.  For each link
    ``link_repl_func(link)`` will be called, and the return value
    will replace the old link.

    Note that links may not be absolute (unless you first called
    ``make_links_absolute()``), and may be internal (e.g.,
    ``'#anchor'``).  They can also be values like
    ``'mailto:email'`` or ``'javascript:expr'``.

    If the ``link_repl_func`` returns None, the attribute or
    tag text will be removed completely.
    '''
    from cssutils import parseString, parseStyle, replaceUrls, log
    log.setLevel(logging.WARN)

    if resolve_base_href:
        resolve_base_href(root)
    for el, attrib, link, pos in iterlinks(root, find_links_in_css=False):
        new_link = link_repl_func(link.strip())
        if new_link == link:
            continue
        if new_link is None:
            # Remove the attribute or element content
            if attrib is None:
                el.text = ''
            else:
                del el.attrib[attrib]
            continue
        if attrib is None:
            new = el.text[:pos] + new_link + el.text[pos+len(link):]
            el.text = new
        else:
            cur = el.attrib[attrib]
            if not pos and len(cur) == len(link):
                # Most common case
                el.attrib[attrib] = new_link
            else:
                new = cur[:pos] + new_link + cur[pos+len(link):]
                el.attrib[attrib] = new

    def set_property(v):
        if v.CSS_PRIMITIVE_VALUE == v.cssValueType and \
                v.CSS_URI == v.primitiveType:
            v.setStringValue(v.CSS_URI,
                    link_repl_func(v.getStringValue()))

    for el in root.iter():
        try:
            tag = el.tag
        except UnicodeDecodeError:
            continue

        if tag == XHTML('style') and el.text and \
                (_css_url_re.search(el.text) is not None or
                 '@import' in el.text):
            stylesheet = parseString(el.text)
            replaceUrls(stylesheet, link_repl_func)
            repl = stylesheet.cssText
            if isbytestring(repl):
                repl = repl.decode('utf-8')
            el.text = '\n'+ repl + '\n'

        if 'style' in el.attrib:
            text = el.attrib['style']
            if _css_url_re.search(text) is not None:
                try:
                    stext = parseStyle(text)
                except:
                    # Parsing errors are raised by cssutils
                    continue
                for p in stext.getProperties(all=True):
                    v = p.cssValue
                    if v.CSS_VALUE_LIST == v.cssValueType:
                        for item in v:
                            set_property(item)
                    elif v.CSS_PRIMITIVE_VALUE == v.cssValueType:
                        set_property(v)
                repl = stext.cssText.replace('\n', ' ').replace('\r', ' ')
                if isbytestring(repl):
                    repl = repl.decode('utf-8')
                el.attrib['style'] = repl
```
|
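The callback contract described in the docstring can be illustrated with a small stand-alone sketch. This is not calibre's code: it uses only the standard library's `xml.etree.ElementTree`, assumes that only `href` and `src` count as link attributes, and ignores CSS entirely. The names `rewrite_links_sketch` and `flatten_to_basename` are hypothetical, made up for this illustration. What it tries to show is that a link is changed or removed only when the callback says so, so whether an absolute link survives depends on the function that is passed in.

```python
import xml.etree.ElementTree as ET

# Assumption for this sketch: only these two attributes are treated as links.
LINK_ATTRS = ("href", "src")


def rewrite_links_sketch(root, link_repl_func):
    """Call link_repl_func on every href/src value and apply its result."""
    for el in root.iter():
        for attrib in LINK_ATTRS:
            link = el.get(attrib)
            if link is None:
                continue
            new_link = link_repl_func(link.strip())
            if new_link is None:
                # The callback asked for removal: drop the attribute
                del el.attrib[attrib]
            elif new_link != link:
                el.set(attrib, new_link)


def flatten_to_basename(link):
    """Hypothetical callback: keep anchors, drop javascript:, flatten paths."""
    if link.startswith("#"):
        return link
    if link.startswith("javascript:"):
        return None
    return link.rsplit("/", 1)[-1]


doc = ET.fromstring(
    "<html><body>"
    '<a href="#top">top</a>'
    '<a href="javascript:void(0)">js</a>'
    '<img src="/home/fred/book/images/cover.png"/>'
    "</body></html>"
)
rewrite_links_sketch(doc, flatten_to_basename)
print(ET.tostring(doc, encoding="unicode"))
```

In this sketch the internal anchor is left alone, the `javascript:` link loses its `href`, and the absolute image path is reduced to a bare file name, all decided by the callback rather than by the rewriting loop.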
#4 |
creator of calibre
Posts: 44,859
Karma: 26594666
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
|
That is the resolving of linked resources, yes. You are probably more interested in HTML parsing. Look at parse_utils.py and preprocess.py for that.
|