12-14-2015, 02:49 PM | #1 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Post your Useful Plugin Code Fragments Here
Please reserve this thread for plugin developers and others to share their code fragments useful for Sigil plugins. Any questions about them should be directed to the Plugin Development "sticky" thread.
Thanks! KevinH |
12-15-2015, 10:54 AM | #2 |
Sigil Developer
Posts: 8,160
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Using the built-in Quick Parser to parse OPF Metadata
Code:
# Example of using the provided stream based QuickParser
# to parse metadataxml (to look for cover id)
# Also rebuilds the metadata xml in res
ps = bk.qp
ps.setContent(bk.getmetadataxml())
res = []
coverid = None
# parse the metadataxml, store away cover_id and rebuild it
for text, tagprefix, tagname, tagtype, tagattr in ps.parse_iter():
    if text is not None:
        # print(text)
        res.append(text)
    else:
        # print(tagprefix, tagname, tagtype, tagattr)
        if tagname == "meta" and tagattr.get("name", '') == "cover":
            coverid = tagattr["content"]
        res.append(ps.tag_info_to_xml(tagname, tagtype, tagattr))
original_metadata = "".join(res)
|
12-15-2015, 02:21 PM | #3 |
Banned
Posts: 272
Karma: 1224588
Join Date: Sep 2014
Device: Sony PRS 650
|
How to get width and height from an image?
You need:
A Python 3 interpreter with the PIL library installed, or the bundled Python interpreter that ships with Sigil 0.9.0 and up.
Code:
from PIL import Image
from io import BytesIO
Code:
for (id, href, mime) in bk.image_iter():
    # read the image bytes from the book and let PIL work out the dimensions
    im = Image.open(BytesIO(bk.readfile(id)))
    (width, height) = im.size
    print('id={} href={} mime={} width={} height={}'.format(id, href, mime, width, height))
|
01-02-2016, 02:51 PM | #4 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Creating self-deleting temp folders with Python's contextmanager:
Code:
from contextlib import contextmanager

@contextmanager
def make_temp_directory():
    import tempfile
    import shutil
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    finally:
        # remove the folder even if the with-block raises an exception
        shutil.rmtree(temp_dir)
Code:
with make_temp_directory() as temp_dir:
    # do stuff with things in the temp_dir
    pass
|
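A side note on the snippet above: on Python 3.2 and later the standard library's tempfile.TemporaryDirectory gives the same self-deleting behaviour, so a plugin running on a recent interpreter may not need the custom context manager at all. A minimal sketch, using only the standard library; the function name write_scratch_file and the file name scratch.txt are placeholders made up for this example.

```python
import os
import tempfile

def write_scratch_file(text):
    """Write text into a throwaway folder and return (folder path, text read back)."""
    with tempfile.TemporaryDirectory() as temp_dir:
        # scratch.txt is a hypothetical file name, used only for illustration
        path = os.path.join(temp_dir, "scratch.txt")
        with open(path, "w", encoding="utf-8") as fp:
            fp.write(text)
        with open(path, "r", encoding="utf-8") as fp:
            read_back = fp.read()
    # the folder and everything in it are removed once the with-block exits
    return temp_dir, read_back
```

Anything that has to survive should be copied out of the folder, or written back to the book, before the with-block ends.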
12-17-2016, 06:12 AM | #5 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
Using BeautifulSoup, here's a quick way to remove all garbage proprietary data from an html file:
Code:
import os

try:
    from sigil_bs4 import BeautifulSoup
except ImportError:
    from bs4 import BeautifulSoup

def fixHTML(work_dir, file):
    output = os.path.join(work_dir, 'clean_html.htm')
    with open(file, 'rt', encoding='utf-8') as infp:
        html = infp.read()
    soup = BeautifulSoup(html, 'html.parser')

    # remove all unwanted proprietary attributes from the html file
    search_tags = ['p', 'span', 'div', 'body', 'a', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'br']
    search_attribs = ['dir', 'name', 'title', 'link', 'id', 'text', 'lang', 'clear']
    for tag in soup.findAll(search_tags):
        for attribute in search_attribs:
            del tag[attribute]

    # write the cleaned markup to a new file, then swap it in for the original
    with open(output, 'wt', encoding='utf-8') as outfp:
        outfp.write(str(soup))
    os.remove(file)
    os.rename(output, file)
    return file
|
12-17-2016, 09:29 AM | #6 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Nice example of deleting attributes from tags with bs4, but why would "id" or "lang" attributes be considered garbage (or proprietary)? Removing "id", for instance, could break a whole bunch of links in files (html toc and ncx included). Seems a very odd attribute to want to nuke ("name" should probably be converted to "id" to prevent any possible link breakage, as well).
Last edited by DiapDealer; 12-17-2016 at 09:34 AM. |
12-17-2016, 07:06 PM | #7 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
The 'lang' and 'id' attributes are garbage in what I'm doing at the moment. I'm currently writing a plugin to convert opendoc html to epub. This means that you have to initially remove all bookmarks and the TOC from the html as part of the html clean-up process. My plugin app then regenerates a new TOC on conversion to epub. And apart from the lang declaration in the html header namespace, the lang attributes within the html code itself also seem to be completely superfluous. I've never seen 'lang' used in epubs within the html code.
I've also read that the 'name' attribute is now also deprecated, which is why 'id' should always be used in epubs now. |
12-17-2016, 08:18 PM | #8 | ||
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
Quote:
Quote:
It is deprecated, but it will often still "work." That's why converting "names" to "id" can be beneficial when working with cluttered/proprietary/old html. Last edited by DiapDealer; 12-17-2016 at 08:21 PM. |
||
12-17-2016, 10:16 PM | #9 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@DiapDealer...Thanks for the info. I was unaware that 'lang' was used that much in epubs so I guess I've learned something. I know that the html text is in utf-8 whereas I think the tag text is more or less ascii. So I'm slightly surprised that you need the 'lang' attribute everywhere in the html because I thought that utf-8 could be defined regionally for different languages within the epub html with the help of python. I guess that utf-8 isn't used like that when you use python in an html app.
Regarding the use of 'name' or 'id' -- I always use 'id' now because you will always get an error with epubcheck if you use 'name'. Although deprecated does not mean that you can't use it, it does imply that the 'name' attribute will be dropped from html sometime in the future -- perhaps when standard epub html eventually moves to HTML5. I also note that when you convert Word to HTML -- Word HTML still uses 'name' and not 'id'. So I'm guessing that the removal of 'name' from epub html will not happen for quite a while. Also, I think Kindle mobi allows the 'name' attribute (because you can upload Word filtered html to KDP) whereas vendors that use standard IDPF epubs will not allow it. Last edited by slowsmile; 12-17-2016 at 10:26 PM. |
12-18-2016, 05:21 AM | #10 | |||||
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
Here's a minimalist proof-of-concept example: Spoiler:
Quote:
Quote:
Quote:
Quote:
Based on strings found in the kindlegen binary, it also looks like KindleGen uses HTMLTidy internally to clean up all HTML files. |
|||||
12-18-2016, 06:27 AM | #11 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
For the record: I wasn't supporting the use of "name" in epubs. I was suggesting that when working with alternative content that is going to be massaged into an epub, it's better to convert any "name" attributes to "id", rather than just delete them. Parsing the content for hrefs that contain the "name" attributes as fragments should be trivial enough to determine which ones can be safely deleted.
|
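The href check described in the post above could be sketched roughly like this. It is only an illustration, not code from any plugin in this thread: it uses the standard library's html.parser rather than bs4, and the names AnchorScanner and classify_names are invented for the example. It pools the fragments from every file, so a "name" is treated as linked if any href anywhere points at a fragment with the same value.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag

class AnchorScanner(HTMLParser):
    """Collect name="..." values on <a> tags and the fragments used in hrefs."""

    def __init__(self):
        super().__init__()
        self.names = set()      # values of the name attribute on <a> tags
        self.fragments = set()  # the part after '#' in every href

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("name"):
            self.names.add(attrs["name"])
        href = attrs.get("href")
        if href:
            fragment = urldefrag(href).fragment
            if fragment:
                self.fragments.add(fragment)

def classify_names(html_docs):
    """Return (names that are link targets, names that nothing links to)."""
    names, fragments = set(), set()
    for html in html_docs:
        scanner = AnchorScanner()  # fresh parser state for each file
        scanner.feed(html)
        scanner.close()
        names |= scanner.names
        fragments |= scanner.fragments
    linked = names & fragments
    return linked, names - linked
```

The first set holds the "name" values worth converting to "id"; the second holds the ones that should be safe to drop. Because fragments are pooled across files, the check errs on the side of keeping an anchor.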
12-18-2016, 06:32 AM | #12 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
@Doitsu...Interesting what you say about Kindle. Theirs is a proprietary format that is closely related to epub with some peculiar quirks. Similar to iBooks' proprietary version of epub. You can do that if you are a mammoth company like those two.
Here's another piece of BS code for html that I've found very useful: Spoiler:
In my conversion plugin, I've also noticed significant differences between ODF html rendered from OO and LO. One problem I had was clearing out all the myriad FONT, FACE and SIZE declarations in these two different ODF html versions. I used this code to remove all SIZE = 3 attributes from the html because it was causing problems. Notice that OO uses an integer while LO uses a string numeric for the size value. Spoiler:
Both Tidy and BS have saved my bacon on many occasions. They are both remarkably useful and easy to use for processing html. Last edited by slowsmile; 12-18-2016 at 06:39 AM. |
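The code behind the spoilers above did not survive in this copy of the thread, so the following is not the poster's BeautifulSoup code. It is a rough standard-library sketch of the SIZE = 3 clean-up described in that post, assuming that the integer versus string difference shows up in the markup as size=3 versus size="3". It swaps in a regular expression for bs4, and the function name strip_font_size_3 is made up for the example.

```python
import re

# any opening <font ...> tag, in upper or lower case
_FONT_TAG = re.compile(r"<font\b[^>]*>", re.IGNORECASE)
# a size attribute whose value is exactly 3, quoted or unquoted
_SIZE_3 = re.compile(r"""\s+size\s*=\s*(?:"3"|'3'|3(?=[\s>/]))""", re.IGNORECASE)

def strip_font_size_3(html):
    """Remove size=3 / size="3" from <font> tags and leave everything else alone."""
    return _FONT_TAG.sub(lambda m: _SIZE_3.sub("", m.group(0)), html)
```

A regex like this is fine for a quick pass over machine-generated html, but it will also match a literal size=3 inside another attribute's quoted value, so a real parser is the safer choice for anything less predictable.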
12-18-2016, 07:03 AM | #13 | |
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Quote:
|
|
12-18-2016, 09:50 AM | #14 |
Witchman
Posts: 628
Karma: 788808
Join Date: May 2013
Location: Philippines
Device: Android S5
|
I don't think that I'm re-inventing the wheel really. And even though my plugin converter will give a full conversion (upload ready) I do not regard that as its main purpose. It's just a plugin that will save you a lot of time in your conversion workflow by automatically doing all the drudge jobs like re-styling your new epub from scratch, adding metadata, adding images, creating a stylesheet etc. The plugin's main purpose is to quickly bring the plugin user to a point where he or she can just concentrate on finishing-off tasks in Sigil like final epub re-styling, embedding fonts, adding extra images, fixed layout tasks etc.
I'm also guessing that people will probably criticize the plugin and perhaps say, "Why bother when there are already good converters like Calibre, Scrivener, Jutoh etc?" The main difference between those converters and my plugin converter is that those converters have editors, toc editors, complex settings, stylers, menus, sub-menus and pre-compiler options etc. They are complex apps that take some time to learn. The only editor my plugin app uses to style epubs is LibreOffice or OpenOffice because the plugin ports all styles -- default styles, heading styles, font styles and named styles -- to the epub stylesheet. It can do this because it also ports all in-tag styling to the CSS as well. So with my plugin all you have to do is style your ebook in LO or OO as you like and then, after filling in the metadata in the dialog window, just push the OK button and your html doc will convert to epub -- whose layout and styling should exactly mimic the layout and styling of the ODT version. The plugin also has a very simple interface which anyone can learn to use quickly. Last edited by slowsmile; 12-18-2016 at 10:23 AM. |
12-18-2016, 10:50 AM | #15 |
Grand Sorcerer
Posts: 28,040
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
|
On a side note: I sent you an email about testing your plugin on other platforms, @slowsmile. Did you receive it?
|
|