Post your Useful Plugin Code Fragments Here

KevinH · 12-14-2015, 01:49 PM

Please reserve this thread for plugin developers and others to share their code fragments useful for Sigil plugins. Any questions about them should be directed to the Plugin Development "sticky" thread.

Thanks!

KevinH

KevinH · 12-15-2015, 09:54 AM

Code:

    # Example of using the provided stream based QuickParser
    # to parse metadataxml (to look for cover id)
    # Also rebuilds the metadata xml in res
    ps = bk.qp
    ps.setContent(bk.getmetadataxml())
    res = []
    coverid = None
    # parse the metadataxml, store away cover_id and rebuild it
    for text, tagprefix, tagname, tagtype, tagattr in ps.parse_iter():
        if text is not None:
            # print(text)
            res.append(text)
        else:
            # print(tagprefix, tagname, tagtype, tagattr)
            if tagname == "meta" and tagattr.get("name",'') == "cover":
                coverid = tagattr["content"]
            res.append(ps.tag_info_to_xml(tagname, tagtype, tagattr))
    original_metadata = "".join(res)

rubeus · 12-15-2015, 01:21 PM

How to get width and height from an image?

You need either:

Python 3 with the PIL (Pillow) library installed

or

the bundled Python interpreter that ships with Sigil 0.9.0 and up.

Code:

from PIL import Image
from io import BytesIO

Code:

    for (img_id, href, mime) in bk.image_iter():
        # wrap the raw image bytes so PIL can open them like a file
        im = Image.open(BytesIO(bk.readfile(img_id)))
        (width, height) = im.size
        print('id={} href={} mime={} width={} height={}'.format(img_id, href, mime, width, height))

DiapDealer · 01-02-2016, 01:51 PM

Creating self-deleting temp folders with Python's contextmanager:

Code:

from contextlib import contextmanager

@contextmanager
def make_temp_directory():
    import tempfile
    import shutil
    temp_dir = tempfile.mkdtemp()
    yield temp_dir
    shutil.rmtree(temp_dir)

Then in your plugin, you can simply do something like:

Code:

with make_temp_directory() as temp_dir:
    # do stuff with things in the temp_dir
    pass

It's not perfect, but barring any untrapped errors (or platform-specific permission problems), "temp_dir" will delete itself after completion of the with statement.
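
A variant of the same idea, as a minimal sketch: wrapping the yield in try/finally lets the cleanup run even when the body of the with statement raises. The ignore_errors=True argument is an assumption added here to paper over permission hiccups; drop it if you would rather see those errors. On Python 3.2 and later, the standard library's tempfile.TemporaryDirectory offers much the same behaviour out of the box.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def make_temp_directory():
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    finally:
        # runs on normal exit and also when the with-body raises
        shutil.rmtree(temp_dir, ignore_errors=True)


# usage: the folder is removed even though the body raised
try:
    with make_temp_directory() as temp_dir:
        saved_path = temp_dir
        open(os.path.join(temp_dir, 'scratch.txt'), 'w').close()
        raise ValueError('simulated failure')
except ValueError:
    pass
still_there = os.path.exists(saved_path)
```

The exception still propagates to the caller; only the cleanup is guaranteed.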

slowsmile · 12-17-2016, 05:12 AM

Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:

Code:

import os

try:
    from sigil_bs4 import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup


def fixHTML(work_dir, file):
    output = os.path.join(work_dir, 'clean_html.htm')
    with open(file, 'rt', encoding='utf-8') as infp:
        html = infp.read()

    soup = BeautifulSoup(html, 'html.parser')

    # remove all unwanted proprietary attributes from the html file
    search_tags = ['p', 'span', 'div', 'body', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'br']
    search_attribs = ['dir', 'name', 'title', 'link', 'id', 'text', 'lang', 'clear']
    for tag in soup.findAll(search_tags):
        for attribute in search_attribs:
            # deleting an attribute the tag doesn't have is harmless in bs4
            del tag[attribute]

    # write the cleaned markup to a scratch file first
    with open(output, 'wt', encoding='utf-8') as outfp:
        outfp.write(str(soup))

    # then swap it in for the original file
    os.remove(file)
    os.rename(output, file)
    return file

DiapDealer · 12-17-2016, 08:29 AM

Quote:

Originally Posted by slowsmile

Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:

Nice example of deleting attributes from tags with bs4, but why would "id" or "lang" attributes be considered garbage (or proprietary)? Removing "id", for instance, could break a whole bunch of links in files (html toc and ncx included). Seems a very odd attribute to want to nuke ("name" should probably be converted to "id" to prevent any possible link breakage, as well).

slowsmile · 12-17-2016, 06:06 PM

The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean up process. My plugin app then regenerates a new TOC on conversion to epub. And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seem to be completely superfluous. I've never seen 'lang' used in epubs within the html code.

I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now.

DiapDealer · 12-17-2016, 07:18 PM

Quote:

Originally Posted by slowsmile

The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean up process. My plugin app then regenerates a new TOC on conversion to epub.

No problem. As I said, it's a very useful snippet for deleting attributes with bs4. I was just nervous about folks coming to think of the "id" attribute as garbage or proprietary.

Quote:

Originally Posted by slowsmile

And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seem to be completely superfluous. I've never seen 'lang' used in epubs within the html code.

Multi-language epubs (or epubs that just display other languages) can make use of it extensively. It's why Sigil's spellchecking is being enhanced to parse the lang attribute in the html. You might not ever encounter it, but it's not really that rare.

Quote:

Originally Posted by slowsmile

I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now.

It is deprecated, but it will often still "work." That's why converting "names" to "id" can be beneficial when working with cluttered/proprietary/old html.
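
A minimal sketch of that name-to-id conversion, assuming a standalone bs4 install and an invented sample string (inside a Sigil plugin you would import from sigil_bs4 instead). Note that "name" has to be passed through the attrs dictionary, because find_all's first parameter is itself called name.

```python
from bs4 import BeautifulSoup

# sample markup is made up for illustration
html = ('<p><a name="ch1"></a>Chapter 1</p>'
        '<p><a href="#ch1">back</a> <a id="keep" name="old">x</a></p>')
soup = BeautifulSoup(html, 'html.parser')

for a in soup.find_all('a', attrs={'name': True}):
    # keep an existing id; otherwise promote the name value to id
    if not a.has_attr('id'):
        a['id'] = a['name']
    del a['name']

converted = str(soup)
print(converted)
```

One simplification to be aware of: when an anchor carries both an id and a different name, links that point at the old name would still break and need rewriting separately.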

slowsmile · 12-17-2016, 09:16 PM

@DiapDealer...Thanks for the info. I was unaware that 'lang' was used that much in epubs so I guess I've learned something. I know that the html text is in utf-8 whereas I think the tag text is more or less ascii. So I'm slightly surprised that you need the 'lang' attribute everywhere in the html because I thought that utf-8 could be defined regionally for different languages within the epub html with the help of python. I guess that utf-8 isn't used like that when you use python in an html app.

Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5. I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.

Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.

Doitsu · 12-18-2016, 04:21 AM

Quote:

Originally Posted by slowsmile

Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:

BTW, bs4 returns the attributes as an attrs dictionary and if you're absolutely sure that you don't need any of them you could delete them all at once by assigning an empty dictionary to attrs.

Here's a minimalist proof-of-concept example:

Spoiler:
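
A minimal sketch of that idea (not necessarily the snippet originally posted), assuming a standalone bs4 install and a made-up sample string; inside a Sigil plugin the import would come from sigil_bs4 as in the earlier example.

```python
from bs4 import BeautifulSoup

# sample markup is invented for illustration
html = ('<p class="MsoNormal" dir="ltr" style="margin:0">'
        'Some <span lang="en-US">text</span></p>')
soup = BeautifulSoup(html, 'html.parser')

# find_all(True) matches every tag; assigning {} drops all of its attributes
for tag in soup.find_all(True):
    tag.attrs = {}

cleaned = str(soup)
print(cleaned)
```

This is the blunt version: it also wipes href, src and id, so it only suits markup where nothing needs to stay linkable.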

Quote:

Originally Posted by slowsmile

So I'm slightly surprised that you need the 'lang' attribute everywhere in the html [...]

You don't need to use lang attributes unless you create a multilingual epub book. However, if you do use them, the IDPF recommends using both lang and xml:lang attributes.

Quote:

Originally Posted by slowsmile

Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5.

The epub 2.0.1 standard is based on XHTML 1.1, and XHTML 1.1 no longer allows the use of name attributes as fragment identifiers.

Quote:

Originally Posted by slowsmile

I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while.

Just because MS Word doesn't generate XHTML 1.1 compliant output doesn't mean it's OK to use it as is, even though many epub apps can handle name attributes as fragment identifiers.

Quote:

Originally Posted by slowsmile

Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it.

Amazon does indeed support uploading ebooks with MS Word generated html files. However, IMHO, that doesn't mean that they officially condone the use of the name attribute. IIRC, the Kindle Publishing Guidelines recommend using only well-formed (X)HTML files.
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files.

DiapDealer · 12-18-2016, 05:27 AM

For the record: I wasn't supporting the use of "name" in epubs. I was suggesting that when working with alternative content that is going to be massaged into an epub, it's better to convert any "name" attributes to "id" rather than just delete them. Parsing the content for hrefs that use the "name" values as fragments should be trivial enough to determine which ones can be safely deleted.
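
A rough sketch of that check using only the standard library's html.parser. The class and function names (AnchorScan, scan_names) and the sample strings are hypothetical. As a simplification, it pools fragments from all files without checking which file each href targets, and it only looks at href attributes on "a" tags, so ncx entries would need their own pass.

```python
from html.parser import HTMLParser


class AnchorScan(HTMLParser):
    """Collect name values on <a> tags and fragments used in hrefs."""

    def __init__(self):
        super().__init__()
        self.names = set()
        self.fragments = set()

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attrs = dict(attrs)
        if attrs.get('name'):
            self.names.add(attrs['name'])
        href = attrs.get('href') or ''
        if '#' in href:
            # everything after the first '#' is the fragment identifier
            self.fragments.add(href.split('#', 1)[1])


def scan_names(html_texts):
    scanner = AnchorScan()
    for text in html_texts:
        scanner.feed(text)
    scanner.close()
    # referenced names should become ids; the rest are candidates for deletion
    return scanner.names & scanner.fragments, scanner.names - scanner.fragments


# invented sample content standing in for two html files
files = [
    '<p><a name="ch1"></a>One <a name="_GoBack"></a></p>',
    '<p><a href="part1.xhtml#ch1">Chapter 1</a></p>',
]
to_convert, to_delete = scan_names(files)
print(sorted(to_convert), sorted(to_delete))
```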

slowsmile · 12-18-2016, 05:32 AM

@Doitsu...Interesting what you say about Kindle. Theirs is a proprietary format that is closely related to epub, with some peculiar quirks, similar to the iBooks proprietary version of epub. You can do that if you are a mammoth company like those two.

Here's another piece of BS code for html that I've found very useful:

Spoiler:

Code:

# remove all anchors but preserve
# all anchors with internet links
for m in soup.findAll('a'):
    if 'href="http:' in str(m) or \
       'href="https:' in str(m) or \
       'mailto:' in str(m) or \
       '@' in str(m):
        pass
    else:
        m.replaceWithChildren()

In my conversion plugin, I've also noticed significant differences between ODF html rendered from OO and LO. One problem I had was clearing out all the myriad FONT, FACE and SIZE declarations in these two different ODF html versions.

I used this code to remove all SIZE = 3 attributes from the html because it was causing problems. Notice that OO uses an integer while LO uses a string numeric for the size value.

Spoiler:

Code:

# remove all 'size = 3' font declarations from OO or LO html
for x in soup.findAll('font'):
    if x.has_attr('size'):
        if x['size'] == "3" or x['size'] == 3:
            x.replaceWithChildren()

Both Tidy and BS have saved my bacon on many occasions. They are both remarkably useful and easy to use for processing html.

Doitsu · 12-18-2016, 06:03 AM

Quote:

Originally Posted by slowsmile

In my conversion plugin, I've also noticed significant differences between ODF html rendered from OO and LO. One problem I had was clearing out all the myriad FONT, FACE and SIZE declarations in these two different ODF html versions.

Before you re-invent the wheel, you might want to have a look at Writer2xhtml/Writer2LaTeX and my ODT import plugin.

slowsmile · 12-18-2016, 08:50 AM

I don't think that I'm re-inventing the wheel really. And even though my plugin converter will give a full conversion (upload ready), I do not regard that as its main purpose. It's just a plugin that will save you a lot of time in your conversion workflow by automatically doing all the drudge jobs like re-styling your new epub from scratch, adding metadata, adding images, creating a stylesheet etc. The plugin's main purpose is to quickly bring the plugin user to a point where he or she can just concentrate on finishing-off tasks in Sigil like final epub re-styling, embedding fonts, adding extra images, fixed layout tasks etc.

I'm also guessing that people will probably criticize the plugin and perhaps say, "Why bother when there are already good converters like Calibre, Scrivener, Jutoh etc.?" The main difference between those converters and my plugin converter is that those converters have editors, toc editors, complex settings, stylers, menus, sub-menus and pre-compiler options etc. They are complex apps that take some time to learn. The only editor my plugin app uses to style epubs is LibreOffice or OpenOffice because the plugin ports all styles -- default styles, heading styles, font styles and named styles -- to the epub stylesheet. It can do this because it also ports all in-tag styling to the CSS as well. So with my plugin all you have to do is style your ebook in LO or OO as you like and then, after filling in the metadata in the dialog window, just push the OK button and your html doc will convert to epub -- whose layout and styling should exactly mimic the layout and styling of the ODT version. The plugin also has a very simple interface which anyone can learn to use quickly.

DiapDealer · 12-18-2016, 09:50 AM

On a side-note: I sent you an email about testing your plugin on other platforms, @slowsmile. Did you receive it?
