MobileRead Forums - View Single Post - Future Release To-Do Items

retiredbiker · 04-29-2024, 04:09 PM

Quote:

Originally Posted by KevinH

Also, I would love to hear people's thoughts on point 2 on the list:

2. Consider adding "Use Unicode" to Find & Replace Regex options (*UCP)

It would involve adding a new item to the pull-down and then properly handling that option during search (it is one of the few regex options that must always come first in the regex code). So it should not be difficult, and we already have to update the Search chapter in the Sigil user's guide to cover the new Search Where categories just added.

So, thoughts anyone? Especially from epub developers working with non-English character sets? Add it, or rely on Saved Searches, where the necessary regex code can easily be added once and recalled?

Thanks.
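To make the "must always be first" point concrete, here is a minimal sketch of how such an option might be applied to a search pattern. It is not Sigil's code: the function name `with_ucp` and the `use_unicode` flag are hypothetical, and the only thing taken from the post is that the `(*UCP)` verb has to sit at the very start of the pattern.

```python
UCP_PREFIX = "(*UCP)"  # PCRE verb named in the post; must lead the pattern


def with_ucp(pattern: str, use_unicode: bool) -> str:
    """Return the pattern with (*UCP) at the front when the option is on.

    Hypothetical helper: it only builds the pattern string that would be
    handed to a PCRE-based search. Python's own `re` module does not
    understand (*UCP), so nothing is compiled here.
    """
    if not use_unicode:
        return pattern
    if pattern.startswith(UCP_PREFIX):
        # User already typed it (for example in a Saved Search); don't double it.
        return pattern
    return UCP_PREFIX + pattern


if __name__ == "__main__":
    print(with_ucp(r"\w+", use_unicode=True))
    print(with_ucp(r"\w+", use_unicode=False))
```

A Saved Search would simply store the first form of the pattern; the pull-down option would amount to doing the same prepending automatically.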

Most of my use of Sigil is in either editing purchased books so I can read them (bad eyesight not compatible with fancy-pants formatting), or doing OCR on old texts and turning the result into a proofed epub (think 1920s or 30s pulp mystery magazines). I do these for my own enjoyment, not for publication or distribution.

For the new books, I suppose the need to find and replace Unicode characters could come up at any time, but I haven't hit one yet. In these cases I'm mostly interested in (for example) finding some set of footnotes sized at 0.2em and making them readable, so I have a chance of reading them without constantly changing text size on my reader. So I'm mostly concerned with CSS and tag names.

Surprisingly, the old magazines and books have a lot of diacritic characters. Think Sax Rohmer and his fake Arabic transliterations! But if Tesseract gives me é or û, it is so far always a single character, not a Unicode grapheme made up of multiple characters. And if I use my compose key to type anything with an accent, it also comes out as a single character. But I suppose all this will change at some point.
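For anyone wondering what the single-character versus multiple-character difference looks like in practice, here is a small Python sketch. It uses only the standard `re` and `unicodedata` modules; the sample word and the helper name `nfc` are just illustrations, not anything from Sigil or Tesseract.

```python
import re
import unicodedata

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # plain e followed by COMBINING ACUTE ACCENT


def nfc(text: str) -> str:
    """Compose base letter + combining mark into one code point where possible."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    word = "caf" + decomposed  # looks like "café" on screen

    # Both forms render the same but differ in length and content.
    print(len(precomposed), len(decomposed))

    # A search for the single-character é misses the two-character form...
    print(re.search(precomposed, word))

    # ...but finds it once the text has been normalized to NFC.
    print(re.search(precomposed, nfc(word)))
```

So as long as the OCR output and the compose key both produce the single-character form, a plain search for é keeps working; the trouble only starts when text arrives in the decomposed form.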

I had a look at this article, and it makes my head ache to see what I might have to start typing to get Unicode matches. But I once thought that about plain old regex. And if I suddenly need it... well, nice to have it there.

So just in terms of future-proofing, and since it seems not too hard, I would be in favour of adding the Unicode support.