05-18-2024, 12:07 PM | #16 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Okay, I spent some time investigating this:
1. exiftool requires a C library and/or a command-line subprocess interface, so that is out.
2. Pillow will let you use getxmp(), but you need the "defusedxml" module. Luckily defusedxml is pure Python and small, and can easily be added to a plugin. So Pillow can be used. BUT: Pillow made the insane decision to NOT return the raw XML for post-processing. Instead it creates some horrible nested dict that contains other dicts and lists, so accessing a single element or even walking the list requires recursion and is a real pain, especially with all of the namespaces being used in the official example. Talk about an XML namespace nightmare! And as far as I can tell attribute values are lost, case is mutated, etc. It should have just returned the pure XML, since there are many XML parsers and tools like bs4 that could be used to get what is needed, especially when you do not know the exact namespaces or structure employed, and especially if you may need access to multiple language versions of the same alt text.
3. So that just leaves the following, which I threw together based on fragments I could find on the web (Stack Exchange), glued together with a few pieces of my own:
Code:
from bs4 import BeautifulSoup

filename = "test.jpg"
with open(filename, "rb") as f:
    d = f.read()

# collect every <x:xmpmeta> ... </x:xmpmeta> packet found in the raw bytes
xmp_str = b""
while d:
    xmp_start = d.find(b"<x:xmpmeta")
    xmp_end = d.find(b"</x:xmpmeta")
    if xmp_start == -1 or xmp_end == -1:
        # no further packet: stop instead of slicing with -1
        break
    # 12 = len(b"</x:xmpmeta") + 1 for the closing ">"
    xmp_str += d[xmp_start : xmp_end + 12]
    d = d[xmp_end + 12 :]

alt_text_dict = {}
xmpAsXML = BeautifulSoup(xmp_str, 'xml')
if xmpAsXML:
    node = xmpAsXML.find('AltTextAccessibility')
    if node:
        for element in node.find_all('li'):
            # print(element.prefix, element.namespace, element.name, element['xml:lang'], element.text)
            lang = element.get('xml:lang', 'x-default')
            alt_text_dict[lang] = element.text

for k, v in alt_text_dict.items():
    print(k, v)

All of this could be rewritten into a nice routine, but ... it literally walks the entire binary data file looking for particular starting strings (which depend on the x prefix namespace being defined) and ending strings. If a different prefix is used, this search will fail. This is a mess and very, very time consuming for large images.
So I will probably have to dig into the Pillow getxmp() implementation code to try to more quickly extract just the XML and not some horrible nested dictionary.
Before doing all of that, I wonder just how many epub images actually have any XMP metadata at all. Since the metadata takes up room, all image optimizers I know of (which are regularly run on images before adding them to an epub) remove it completely, so this may be an exercise in futility. Removing all metadata also prevents some image orientation issues.
So I am not sure if this is worth the work. What are people's thoughts on this?
Last edited by KevinH; 05-18-2024 at 12:53 PM. |
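To illustrate why a nested dict of dicts and lists is awkward to query, here is a minimal sketch of the kind of recursive walk it forces on the caller. The `find_key` helper and the `sample` structure are hypothetical stand-ins written for this example; the real getxmp() output may be shaped differently.

```python
def find_key(data, wanted):
    """Yield every value stored under `wanted` anywhere in nested dicts/lists."""
    if isinstance(data, dict):
        for key, value in data.items():
            if key == wanted:
                yield value
            # keep descending: the same key can appear at several depths
            yield from find_key(value, wanted)
    elif isinstance(data, list):
        for item in data:
            yield from find_key(item, wanted)


# Hypothetical stand-in for a parsed XMP packet, NOT actual getxmp() output.
sample = {
    "xmpmeta": {
        "RDF": {
            "Description": {
                "AltTextAccessibility": {
                    "Alt": {"li": {"lang": "en-US", "text": "A prince at a window."}}
                }
            }
        }
    }
}

matches = list(find_key(sample, "AltTextAccessibility"))
```

Even with such a helper, the caller still has to know how the language variants are laid out underneath the matched node, which is the part that raw XML plus a parser would have handled.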
05-18-2024, 12:46 PM | #17 |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Many thanks for looking into this, Kevin.
Based on what I have seen with the images I regularly encounter, alt-text entries in image metadata are very unusual. I work with a small non-profit publisher, creating epub versions of their print books, and the books often have images. But this is the first time I have ever seen images that contain alt-text entries in their metadata.
It was nice not having to write alt-text, but it was also not at all difficult to copy and paste the alt-text using the very user-friendly alt text feature in Access-Aide. So I do not think that we need to proceed with this feature.
Another reason not to do anything relating to alt-text in image metadata is that some of the alt-text entries that I saw more properly belonged in an extended description. See: https://kb.daisy.org/publishing/docs....html#extended
So much of what is actually needed for accessibility depends on the context in which the image is used, so the alt-text in an image's metadata might not be suitable for every context. It's probably better to work directly with each image to provide the best alt-text in the circumstances.
Thanks again for looking into this. It was very helpful because it prompted a deeper dive into this entire topic.
Jim
Last edited by oston; 05-18-2024 at 12:48 PM. |
05-18-2024, 01:45 PM | #18 |
Grand Sorcerer
Posts: 5,640
Karma: 23191067
Join Date: Dec 2010
Device: Kindle PW2
|
Since IPTC metadata seems to be less commonly used than EXIF metadata, a compromise might be grabbing the ImageDescription EXIF metadata entry with Pillow.
This requires only a few lines of code: Spoiler:
The code will return the string: A Prince looks out between the bars of a prison window. (It refers to this image provided by the OP.)
IMHO, automatically extracting some human-generated description with Access-Aide is better than extracting no description at all.
@oston, would extracting the ImageDescription information be helpful to you? |
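The collapsed spoiler is not reproduced in this copy of the thread, so the following is only a sketch of how the EXIF ImageDescription entry can be read with Pillow, not the poster's original code. The helper name `read_image_description` is made up for this example; `Image.open()`, `getexif()` and tag number 270 are the Pillow and EXIF pieces it relies on.

```python
from PIL import Image

IMAGE_DESCRIPTION = 270  # EXIF/TIFF tag number for ImageDescription


def read_image_description(source):
    """Return the EXIF ImageDescription of an image, or "" if it has none.

    `source` may be a file path or a binary file-like object.
    """
    with Image.open(source) as im:
        exif = im.getexif()
        value = exif.get(IMAGE_DESCRIPTION, "")
    # normalise to a plain string without surrounding whitespace
    return str(value).strip()
```

Calling `read_image_description("test.jpg")` on an image that carries the tag would return its description, and an empty string otherwise.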
05-18-2024, 02:24 PM | #19 | |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Quote:
I am by no means experienced enough to give a valuable answer. I am just trying to learn as much as I can about making accessible epubs. In the images I have seen, until I saw this latest set of images, I had not seen any Image Descriptions or alt-text in image meta-data. But hopefully someone who is very experienced with Image Descriptions and accessibility issues will see this and give a more informed answer. Sorry that I'm not able to be more helpful. |
|
05-18-2024, 03:58 PM | #20 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Using exif ImageDescription would be easy to add to AccessAide if that helps.
FWIW, I am just so disappointed that Pillow did not return the XML from their getxmp() method instead of a nested mess of dicts and lists. It really makes specific XMP metadata hard to access. |
05-20-2024, 08:54 AM | #21 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
The Pillow dev guys nicely gave me a snippet of code that will return the actual xml across all 4 image types that support it now. That makes Pillow the obvious best candidate. So I should be able to query for Alt Text and if not present, fall back to exif ImageDescription.
I think that might be worth adding to a future version of AccessAide. |
05-20-2024, 09:42 AM | #22 |
Connoisseur
Posts: 78
Karma: 2138296
Join Date: Nov 2016
Device: ipad, Kindle Scribe, Kobo Libra 2
|
Thanks, very much, Kevin. That will be helpful.
|
05-20-2024, 02:49 PM | #23 |
Bibliophagist
Posts: 40,475
Karma: 156982136
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
It would be handy and possibly, if the gods are kind, save me from manually adding all alt texts.
|
05-23-2024, 02:57 PM | #24 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Access-Aide version 0.9.5 has now been released. It is available via our Sigil Plugin Index as an attachment or from my GitHub repo:
https://github.com/kevinhendricks/Access-Aide
It now includes the ability to take EMPTY alt attributes and look up the image's own metadata for XMP AltTextAccessibility or, failing that, EXIF ImageDescription, to auto-fill alt attribute values. It will NOT overwrite any existing image alt value.
Hope this helps,
KevinH |
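The fill-only-when-empty rule can be sketched with the standard library alone. This is an illustration of the behaviour described in the post, not how Access-Aide implements it: `fill_empty_alts` and the `lookup` callable are hypothetical names, and treating a missing alt attribute the same as an empty one is an assumption made here.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def fill_empty_alts(xhtml, lookup):
    """Fill empty (or, by assumption, missing) alt attributes on <img> elements.

    `lookup` is a hypothetical callable mapping an image src to the text
    found in that image's metadata ("" when there is none).
    Existing non-empty alt values are never overwritten.
    """
    ET.register_namespace("", XHTML_NS)
    root = ET.fromstring(xhtml)
    for img in root.iter("{%s}img" % XHTML_NS):
        if img.get("alt", "").strip():
            continue  # author-supplied alt text wins
        text = lookup(img.get("src", ""))
        if text:
            img.set("alt", text)
    return ET.tostring(root, encoding="unicode")


page = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<img src="a.jpg" alt=""/>'
    '<img src="b.jpg" alt="Hand-written text"/>'
    "</body></html>"
)
metadata = {"a.jpg": "A prince at a prison window."}
result = fill_empty_alts(page, lambda src: metadata.get(src, ""))
```

Here only the first image receives text from the (made-up) metadata table; the second keeps its hand-written alt value.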
05-23-2024, 03:08 PM | #25 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
In case anyone else wants to add this feature to their own code, here is the sample code:
Code:
from bs4 import BeautifulSoup
from PIL import Image

# extract base language from language code (e.g. "en-US" -> "en")
def baselang(lang):
    if len(lang) > 3:
        if lang[2:3] in "-_":
            return lang[0:2]
    return None

# build a dict mapping language code -> alt text from the XMP XML
def parse_xmpxml_for_alttext(xmpxml):
    xmpmeta = BeautifulSoup(xmpxml, 'xml')
    alt_dict = {}
    if xmpmeta:
        node = xmpmeta.find('AltTextAccessibility')
        if node:
            for element in node.find_all('li'):
                lang = element.get('xml:lang', 'x-default')
                alt_dict[lang] = element.text
                lg = baselang(lang)
                if lg:
                    # was element.txt, which is not the text of the element
                    alt_dict[lg] = element.text
    return alt_dict

def get_image_metadata_alttext(imgpath, tgtlang):
    xmpxml = None
    description = ""
    with Image.open(imgpath) as im:
        # Pillow reports the format of WebP files as 'WEBP'
        if im.format == 'WEBP':
            if "xmp" in im.info:
                xmpxml = im.info["xmp"]
        if im.format == 'PNG':
            if "XML:com.adobe.xmp" in im.info:
                xmpxml = im.info["XML:com.adobe.xmp"]
        if im.format == 'TIFF':
            if 700 in im.tag_v2:
                xmpxml = im.tag_v2[700]
        if im.format == 'JPEG':
            for segment, content in im.applist:
                if segment == "APP1":
                    marker, xmp_tags = content.split(b"\x00")[:2]
                    if marker == b"http://ns.adobe.com/xap/1.0/":
                        xmpxml = xmp_tags
                        break
        exif = im.getexif()
        # 270 = ImageDescription
        if exif and 270 in exif:
            description = exif[270]
    if not xmpxml:
        return description
    alt_dict = parse_xmpxml_for_alttext(xmpxml)
    # first try full language code match
    if tgtlang in alt_dict:
        return alt_dict[tgtlang]
    # next try base language code match
    lg = baselang(tgtlang)
    if lg and lg in alt_dict:
        return alt_dict[lg]
    # use default
    if 'x-default' in alt_dict:
        return alt_dict['x-default']
    # otherwise fall back to exif image description
    return description

imgpath = "test.jpg"
lang = 'en-US'
print(get_image_metadata_alttext(imgpath, lang))

Last edited by KevinH; 05-24-2024 at 12:57 PM. |
05-23-2024, 03:37 PM | #26 |
Guru
Posts: 782
Karma: 2298438
Join Date: Jan 2017
Location: Poland
Device: Various
|
@KevinH: It is essential to add try/except from line 482, as it throws an error if there is no metadata in the image.
Spoiler:
|
05-23-2024, 03:44 PM | #27 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
I will check for that key first to prevent the keyerror.
Thanks! |
05-23-2024, 04:03 PM | #28 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
Should now be fixed in v0.9.6 just posted.
Thank you @BeckyEbook! |
05-23-2024, 11:06 PM | #29 |
Bibliophagist
Posts: 40,475
Karma: 156982136
Join Date: Jul 2010
Location: Vancouver
Device: Kobo Sage, Libra Colour, Lenovo M8 FHD, Paperwhite 4, Tolino epos
|
Tested 0.9.6 on 4 ePubs with images. One worked well since it had decent metadata, one worked on 4 out of 10 images, the last two had no useful metadata. Still going to save me time and effort so thanks very much!
|
05-24-2024, 01:03 PM | #30 |
Sigil Developer
Posts: 8,156
Karma: 5450818
Join Date: Nov 2009
Device: many
|
FYI: There is an indentation whitespace issue. So a new version of Access Aide (this time 0.9.7) will be coming later this evening fixing that. It only impacts jpeg images with multiple APP1 segments none of which are xmp metadata.
So the alt_text in your 4 epubs should be correct as is. Update: Version 0.9.7 just posted has this new fix. Hopefully the last one. Last edited by KevinH; 05-24-2024 at 03:22 PM. |
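For readers adapting the sample code posted earlier in the thread, here is one way the APP1 scan could be written so that non-XMP APP1 segments are simply skipped. It is a sketch, not the fix shipped in 0.9.7: `find_xmp_in_applist` is a hypothetical helper, and it takes a plain list of (segment, content) pairs shaped like the JPEG `applist` used in that sample.

```python
XMP_MARKER = b"http://ns.adobe.com/xap/1.0/"


def find_xmp_in_applist(applist):
    """Return the XMP packet bytes from (segment, content) pairs, or None."""
    for segment, content in applist:
        if segment != "APP1":
            continue
        # partition() never raises, even when there is no NUL separator;
        # unlike split()[:2] it keeps everything after the first NUL.
        marker, sep, payload = content.partition(b"\x00")
        if sep and marker == XMP_MARKER:
            return payload
    return None


# Made-up segment list: a JFIF header, an Exif APP1 and an XMP APP1.
segments = [
    ("APP0", b"JFIF\x00\x01\x02"),
    ("APP1", b"Exif\x00\x00MM\x00*"),
    ("APP1", XMP_MARKER + b"\x00<x:xmpmeta/>"),
]
xmp = find_xmp_in_applist(segments)
```

Returning None when no APP1 segment carries the XMP marker lets the caller fall back to the EXIF ImageDescription without any special casing.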
Tags |
access-aide, alt text |
|