Old 08-12-2014, 06:38 AM   #1
Carpatos
Junior Member
Posts: 1
Karma: 10
Join Date: Aug 2014
Device: Kindle Touch
About saved searches and regex

I'm a Sigil user, and I'm testing the new calibre editor to evaluate it as a replacement for Sigil.

About saved searches: I found a bug in the interface. In the saved searches window, if you have a saved search with a very long regex in the "Find" field, selecting it in the saved searches list automatically resizes the main window and makes it too wide. If you try to resize the window again, it sometimes disappears, and you must close and reopen the editor to make it visible again.

The other problem is that the regex engine doesn't support PCRE expressions like Sigil does. PCRE is much more powerful than Python's "re" regexps, and some PCRE features have no equivalent in "re". Is there any possibility of improving regex support, maybe with a library (like regex or python-pcre), to get PCRE support in searches?

Thanks for your program and your support.

Last edited by Carpatos; 08-12-2014 at 06:40 AM.
Old 08-12-2014, 06:42 AM   #2
kovidgoyal
creator of calibre
Posts: 44,572
Karma: 24495948
Join Date: Oct 2006
Location: Mumbai, India
Device: Various
The only useful feature that exists in Sigil that does not exist in the editor's regex engine is the \U operator, and the editor will be getting a function mode for search and replace that will be far more powerful and general than that.

And note that the editor does not use python's re module. It uses https://pypi.python.org/pypi/regex
Old 08-12-2014, 08:13 AM   #3
DiapDealer
Grand Sorcerer
Posts: 28,045
Karma: 199464182
Join Date: Jan 2010
Device: Nexus 7, Kindle Fire HD
Quote:
Originally Posted by kovidgoyal View Post
The only useful feature that exists in Sigil that does not exist in the editor's regex engine is the \U operator, and the editor will be getting a function mode for search and replace that will be far more powerful and general than that.
I've certainly not found the regex module calibre's editor uses to be lacking any real functionality (with the exception of the \U \L \u \l \E replacement operators you mentioned) compared to Sigil's PCRE engine. I do still miss the \K operator at times, but the addition of variable-length lookbehinds makes it possible to replicate \K's versatility anyway.
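
For readers who have neither \K nor variable-length lookbehinds available, here is a minimal sketch of another common way to get the same effect: capture the prefix and put it back in the replacement. It uses only Python's standard re module, not the regex module the editor uses, and the sample HTML and the name strip_leading_space are made up for illustration.

```python
import re

# PCRE version (as in Sigil):  <p[^>]*>\K\s+   replaced with nothing.
# \K drops the already-matched prefix from the match, so only the
# whitespace is removed. Without \K, capture the prefix and re-insert it.
PREFIX_THEN_SPACE = re.compile(r'(<p[^>]*>)\s+')

def strip_leading_space(html):
    """Remove whitespace directly after an opening <p ...> tag."""
    return PREFIX_THEN_SPACE.sub(r'\1', html)

result = strip_leading_space('<p class="x">   Text</p>')
print(result)
```

The trade-off is that the replacement field has to mention the captured group, which a lookbehind (or \K) avoids.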
Old 08-12-2014, 08:49 AM   #4
kovidgoyal
creator of calibre
Incidentally, do the case change operators in Sigil's regex engine do an ascii swap [a-z] <-> [A-Z] or do they support full unicode case folding?
Old 08-12-2014, 09:28 AM   #5
DiapDealer
Grand Sorcerer
They seem to at least do some basic unicode case folding. Don't know if it's full support, though.

In a search for the character 'á', all of the following capture it:
Code:
(á) 
(\x{00E1})
(\p{Ll})
and \u\1 changes it to 'Á'

But I've no idea if it's limited to a certain subset of unicode (or if it handles all the ways á can be represented).
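
To make the "all the ways á can be represented" question concrete, here is a small sketch in plain Python. It says nothing about what Sigil's engine does; it only shows that the precomposed and decomposed spellings of 'á' uppercase to different strings until they are normalized.

```python
import unicodedata

precomposed = '\u00e1'        # 'á' as a single code point
decomposed = 'a\u0301'        # 'a' followed by COMBINING ACUTE ACCENT

# str.upper() uses Unicode case mappings, not an ASCII-only swap.
upper_precomposed = precomposed.upper()
upper_decomposed = decomposed.upper()

# The two spellings differ as strings until they are normalized.
same_before = upper_precomposed == upper_decomposed
same_after = (unicodedata.normalize('NFC', upper_precomposed)
              == unicodedata.normalize('NFC', upper_decomposed))

print(upper_precomposed, same_before, same_after)
```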
Old 08-12-2014, 10:17 AM   #6
kovidgoyal
creator of calibre
That's better than I expected. I always assumed that ICU in Sigil was present only for WebKit, but perhaps the regex engine also uses it.
Old 08-14-2014, 05:34 AM   #7
Papirus
Junior Member
Posts: 3
Karma: 10
Join Date: Aug 2014
Device: Papyre 613
I'm another Sigil user trying to move to the calibre editor.
I'm testing the regex module because I have more than 300 regexes (some of them quite complex) for correcting Spanish texts.

As mentioned in the initial post, there is a problem with the sizes of the saved searches and warning windows when the search string is long.

As I said, I have a lot of regular expressions, so in my opinion it would be useful to be able to group and nest them in the saved searches window to keep them organized.
Another thing I miss is having a saved search expression appear automatically in the find/search area of the main window when I use it. Sometimes it is necessary to make some changes to one of them, and currently the only way is to open the saved searches window, edit the expression, copy and paste it, and then modify it.
Would it be possible to add a "count all" button to this find/search area?

Regarding the regex engine, I've noticed two main differences: \K (which I can work around using variable-length lookbehinds) and the conditional structure (?(condition)Then|Else); the latter is an important limitation compared with PCRE.

Properties (\p) are well supported, but \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter) only work correctly if the "case sensitive" option is checked (I don't know if this is the expected behaviour). I've tried (?f) with no success.

As mentioned in the thread, \U, \L and \E don't work, but they are Sigil commands, not PCRE. Anyway, a similar option would be very useful in the calibre editor, because a lowercase letter after a period is a very frequent mistake in texts, and right now that replacement is not possible.
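
As an illustration of the "lowercase after a period" fix, here is a minimal sketch in plain Python using a replacement function with the standard re module. It is not an editor feature; the pattern, the sample sentence and the name capitalize_after_period are assumptions made for the example.

```python
import re

# A letter (any script) preceded by sentence punctuation and one whitespace
# character. The lookbehind is fixed-width, so the standard re module accepts it.
AFTER_SENTENCE_END = re.compile(r'(?<=[.!?]\s)([^\W\d_])')

def capitalize_after_period(text):
    """Uppercase the first letter that follows '. ', '! ' or '? '."""
    return AFTER_SENTENCE_END.sub(lambda m: m.group(1).upper(), text)

fixed = capitalize_after_period('Hola. esto es una prueba. él vino.')
print(fixed)
```

A real clean-up would need exceptions for abbreviations such as "etc. y", which this sketch would capitalize as well.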

In another context, scanned text sometimes includes &shy; (soft hyphen), a hidden hyphen that you can't see (at least in Sigil); the only way to remove it is a regex search for \xAD. Here the problem is not with the regex engine but with the file preview panel, where it appears as a dot. &#8203; (zero-width space) behaves similarly: it is also shown as a dot, and it is another hidden character, used in very long words to break the line so that the text does not run past the screen edge on readers. Here the regex \x{200B} is not allowed.

Thank you very much for your program and support.

Last edited by Papirus; 08-14-2014 at 07:03 AM.
Old 08-14-2014, 08:38 AM   #8
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by Papirus View Post
Regarding the regex engine, I've noticed two main differences: \K (which I can work around using variable-length lookbehinds) and the conditional structure (?(condition)Then|Else); the latter is an important limitation compared with PCRE.
I don't really use if|then|else regex conditionals myself, but the regex module calibre's editor uses certainly supports them. Probably just a matter of getting the syntax right. For example:
Code:
(a)?b(?(1)c|d)
Matches both "bd" and "abc"
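
The same group-reference conditional can be checked outside the editor. This is a small sketch using Python's standard re module, which also accepts the (?(1)yes|no) syntax; the editor itself uses the regex module, so treat this as a stand-in.

```python
import re

# (a)? optionally captures "a" in group 1; (?(1)c|d) then requires "c"
# if group 1 took part in the match and "d" otherwise.
pattern = re.compile(r'(a)?b(?(1)c|d)')

for candidate in ('abc', 'bd', 'abd', 'bc'):
    matched = pattern.fullmatch(candidate) is not None
    print(candidate, matched)
```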

Quote:
Properties (\p) are well supported, but \p{Lu} (uppercase letter) and \p{Ll} (lowercase letter) only work correctly if the "case sensitive" option is checked (I don't know if this is the expected behaviour). I've tried (?f) with no success.
This sort of hung me up for a bit too, but when you think about it ... searching specifically for lower- or upper-case letters is sort of the very definition of "case sensitivity," is it not?: hence the reason for the box needing to be checked. If you need case insensitivity, uncheck the box and ... use \p{L} in your expression instead.
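
A rough illustration of the same effect in plain Python. The standard re module has no \p{Lu}, so the class [A-Z] stands in for it here; that substitution is an assumption made for the example, not how the editor is implemented.

```python
import re

text = 'aB'

# Case-sensitive: the uppercase class matches only 'B'.
sensitive = re.findall(r'[A-Z]', text)

# Case-insensitive: the same class now also matches 'a'.
insensitive = re.findall(r'[A-Z]', text, flags=re.IGNORECASE)

print(sensitive, insensitive)
```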

Quote:
In another context, scanned text sometimes includes &shy; (soft hyphen), a hidden hyphen that you can't see (at least in Sigil); the only way to remove it is a regex search for \xAD. Here the problem is not with the regex engine but with the file preview panel, where it appears as a dot. &#8203; (zero-width space) behaves similarly: it is also shown as a dot, and it is another hidden character, used in very long words to break the line so that the text does not run past the screen edge on readers. Here the regex \x{200B} is not allowed.
The syntax for matching specific Unicode codepoints is different. Instead of \x{FFFF}, just use \uFFFF. So the search for your &shy; character becomes \u00AD, and the search for the zero-width space becomes \u200B. PCRE was really the odd man out with the \x{FFFF} sequence.
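
A minimal sketch of the \uXXXX form in plain Python (standard re module, which accepts the same escapes in string patterns). The sample text and the name strip_invisible are made up for illustration.

```python
import re

# U+00AD SOFT HYPHEN and U+200B ZERO WIDTH SPACE, written with \uXXXX escapes.
INVISIBLE = re.compile(r'[\u00AD\u200B]')

def strip_invisible(text):
    """Remove soft hyphens and zero-width spaces from text."""
    return INVISIBLE.sub('', text)

sample = 'extra\u00ADordinario y super\u200Bcalifragilístico'
cleaned = strip_invisible(sample)
print(cleaned)
```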

Last edited by DiapDealer; 08-14-2014 at 09:46 AM.
Old 08-14-2014, 10:52 AM   #9
arspr
Dead account. Bye
Posts: 587
Karma: 668244
Join Date: Mar 2011
Device: none
Quote:
Originally Posted by Papirus View Post
I'm testing the regex module because I have more than 300 regexes (some of them quite complex) for correcting Spanish texts.
If you feel like it, I would be really, really glad if you shared the most useful ones.

There's a young (and still short) thread about useful S&R settings in this very same subforum: https://www.mobileread.com/forums/sho...d.php?t=237181
Old 08-14-2014, 12:58 PM   #10
Papirus
Junior Member
Quote:
Originally Posted by DiapDealer View Post
I don't really use if|then|else regex conditionals myself, but the regex module calibre's editor uses certainly supports them. Probably just a matter of getting the syntax right. For example:
Code:
(a)?b(?(1)c|d)
Matches both "bd" and "abc"
That is not the situation I mean.

Imagine a text in which you want to find Roman numerals in order to set them in small caps.

([ivxlcdm]+) could be a possibility, with <small>\1</small> as the replacement.

This expression will return Roman numerals, but also ordinary words formed from the same characters.

E.g. clic, CD, ill, id, lid, livid, isolated letters, etc.

In Spanish some are probably even more frequent: mil (thousand) and, above all, mi (my).

So if I use the above expression I will of course get Roman numerals, but also hundreds of matches that are not Roman numerals. The only way I know to avoid this is the conditional structure (?(condition)then|else). It would be more or less like this (in fact the expression needs refining):

Code:
(?i)(?(?=[clv]?i?d|[cm]m|[dm]i|clic|lcd|(m|ci)?v?il|[mdcl]|ill|livid) |(?<=\PL)([ivxlcdm]+)(?=\PL))
When the condition is satisfied (common words that are not Roman numerals), the yes-pattern is used (we look for nothing: in this example, the white space after the condition parenthesis). Otherwise we look for Roman numerals.
Old 08-14-2014, 01:00 PM   #11
Papirus
Junior Member
Quote:
Originally Posted by arspr View Post
If you feel like it, I would be really, really glad if you shared the most useful ones.

There's a young (and still short) thread about useful S&R settings in this very same subforum: https://www.mobileread.com/forums/sho...d.php?t=237181
They are written for PCRE, so they would have to be adapted to Python.
Old 08-14-2014, 01:10 PM   #12
kovidgoyal
creator of calibre
@Papirus: Thankfully, calibre's editor is written in Python, so you don't have to create that unreadable gunk to do something like change case for Roman numerals. Using the upcoming function mode, all you need is

Find: (?i)\b([ivxlcdm]+)\b
Replace:
Code:
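# Example words that look like Roman numerals but are not; extend as needed.
COMMON_WORDS = {'mi', 'mil', 'clic', 'cd', 'id', 'ill', 'lid', 'livid'}

def replace(match, context):
    word = match.group()
    if word.lower() not in COMMON_WORDS:
        word = word.upper()
    return word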
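
Until function mode is available, the idea can be tried outside the editor. The following is a sketch only: it drives a replace function with re.sub from Python's standard library, the word list is a made-up example, and the context argument is given a default because re.sub passes only the match object. The character class here is the [ivxlcdm] set used earlier in the thread.

```python
import re

# Hypothetical word list; extend it for the language of the book.
COMMON_WORDS = {'mi', 'mil', 'clic', 'cd', 'id', 'ill', 'lid', 'livid'}

ROMAN_CANDIDATE = re.compile(r'(?i)\b([ivxlcdm]+)\b')

def replace(match, context=None):
    """Uppercase a Roman-numeral candidate unless it is a common word."""
    word = match.group()
    if word.lower() not in COMMON_WORDS:
        word = word.upper()
    return word

converted = ROMAN_CANDIDATE.sub(replace, 'el siglo xix y mi libro, capítulo iv')
print(converted)
```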
Old 08-14-2014, 01:39 PM   #13
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by Papirus View Post
That is not the situation I mean.

Imagine a text in which you want to find Roman numerals in order to set them in small caps.

([ivxlcdm]+) could be a possibility, with <small>\1</small> as the replacement.

This expression will return Roman numerals, but also ordinary words formed from the same characters.

E.g. clic, CD, ill, id, lid, livid, isolated letters, etc.

In Spanish some are probably even more frequent: mil (thousand) and, above all, mi (my).

So if I use the above expression I will of course get Roman numerals, but also hundreds of matches that are not Roman numerals. The only way I know to avoid this is the conditional structure (?(condition)then|else). It would be more or less like this (in fact the expression needs refining):

Code:
(?i)(?(?=[clv]?i?d|[cm]m|[dm]i|clic|lcd|(m|ci)?v?il|[mdcl]|ill|livid) |(?<=\PL)([ivxlcdm]+)(?=\PL))
When the condition is satisfied (common words that are not Roman numerals), the yes-pattern is used (we look for nothing: in this example, the white space after the condition parenthesis). Otherwise we look for Roman numerals.
Sorry. Perhaps I wasn't clear. I wasn't trying to give you a conditional expression that would do exactly what you want. I only intended to let you know that the if|then|else conditional construct is definitely supported by the editor's regex engine. It's going to be up to you to adjust your existing conditional expressions to work with the new engine.

I think the disconnect here is terminology. The calibre editor's regex module supports if|then|else conditionals: that part is not up for debate. I think where you may be running into problems is that it doesn't support conditionals using lookarounds. So instead of a conditional like:
Code:
(?(?=regex)then|else)
You might need to employ two opposite lookarounds:
Code:
(?=regex)then|(?!regex)else
to achieve the same end.

So in essence, your:
Code:
(?i)(?(?=[clv]?i?d|[cm]m|[dm]i|clic|lcd|(m|ci)?v?il|[mdcl]|ill|livid) |(?<=\PL)([ivxlcdm]+)(?=\PL))
becomes:
Code:
(?i)(?=[clv]?i?d|[cm]m|[dm]i|clic|lcd|(m|ci)?v?il|[mdcl]|ill|livid) |(?![clv]?i?d|[cm]m|[dm]i|clic|lcd|(m|ci)?v?il|[mdcl]|ill|livid)(?<=\PL)([ivxlcdm]+)(?=\PL)
Ugly ... but doable.
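
When the "then" branch exists only to skip ordinary words, the pair of opposite lookaheads can often be reduced to a single negative lookahead. This is a sketch of that simplification under assumptions: it uses Python's standard re module rather than the editor's engine, and the stop list and sample sentence are invented for the example.

```python
import re

# Hypothetical stop list of ordinary words built from Roman-numeral letters.
STOP_WORDS = ('mil', 'mi', 'clic', 'cd', 'id', 'ill', 'lid', 'livid')

# (?!(?:...)\b) rejects a candidate that is exactly one of the stop words;
# everything else made only of i, v, x, l, c, d, m is accepted.
ROMAN = re.compile(
    r'\b(?!(?:%s)\b)[ivxlcdm]+\b' % '|'.join(STOP_WORDS),
    re.IGNORECASE,
)

found = ROMAN.findall('Luis XIV dijo: mi casa, cd, mil veces en el siglo xx.')
print(found)
```

Keeping the stop list as a separate tuple makes it easier to extend than editing one long alternation by hand.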

Last edited by DiapDealer; 08-14-2014 at 03:16 PM.
Old 08-14-2014, 01:42 PM   #14
DiapDealer
Grand Sorcerer
Quote:
Originally Posted by kovidgoyal View Post
@Papirus: Thankfully, calibre's editor is written in Python, so you don't have to create that unreadable gunk to do something like change case for Roman numerals. Using the upcoming function mode, all you need is

Find: (?i)\b([ivxlcdm]+)\b
Replace:
Code:
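# Example words that look like Roman numerals but are not; extend as needed.
COMMON_WORDS = {'mi', 'mil', 'clic', 'cd', 'id', 'ill', 'lid', 'livid'}

def replace(match, context):
    word = match.group()
    if word.lower() not in COMMON_WORDS:
        word = word.upper()
    return word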
The upcoming function mode sounds like it's going to be terribly useful.
Old 08-14-2014, 01:54 PM   #15
kovidgoyal
creator of calibre
In fact, just for giggles, since the function mode is actually arbitrarily powerful, you could do this:

Code:
from calibre.gui2.tweak_book import dictionaries

def replace(match, context):
    word = match.group()
    # Leave words the spelling dictionary recognizes; uppercase the rest.
    if not dictionaries.recognized(word):
        word = word.upper()
    return word
This even gets rid of the need to define your own list of common words: it will use the dictionary for whatever language is specified in the book's OPF file (provided, of course, that you have installed such a dictionary).
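
To experiment with the dictionary-based variant without the editor, the dictionaries object can be faked. This is only a sketch: FakeDictionaries is a hypothetical stand-in whose single recognized() method mirrors the call used above, backed by a small hand-written word set instead of a real spelling dictionary.

```python
import re

class FakeDictionaries:
    """Hypothetical stand-in for the editor's dictionaries object."""

    def __init__(self, words):
        self._words = {w.lower() for w in words}

    def recognized(self, word):
        return word.lower() in self._words

dictionaries = FakeDictionaries(['mi', 'mil', 'libro', 'siglo'])

def replace(match, context=None):
    word = match.group()
    # Leave words the (fake) dictionary recognizes; uppercase the rest.
    if not dictionaries.recognized(word):
        word = word.upper()
    return word

out = re.sub(r'(?i)\b([ivxlcdm]+)\b', replace, 'mil veces en el siglo xx, tomo iii')
print(out)
```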