RegEx-Function and hyphenation problem

scratch · 01-28-2017, 06:00 AM

Hello everybody.

First, a really big thank-you to the community and to Kovid. You have given me so many useful tips and so much advice about calibre! I have been visiting the forum for years and have learned a lot. In fact, almost every problem I have ever run into had already been asked and solved by some member at some point. But now I have a problem that I simply cannot figure out, because I don't really understand Python (and yes, I have been trying for some weeks).

So here's the problem. Sometimes after a scan and OCR there are many words with false divisions. In German (which makes up most of my paper-book library) this looks like 'Maschi ne' instead of 'Maschine'.
It would be great to have a Python function for this:

If two words separated by a space are both _not_ found in the dictionary, join them, but only if the resulting word _is_ in the dictionary.

Example in German:
'betrach ten' should become 'betrachten', but
'Josef Bankl' shouldn't be converted to 'JosefBankl'.

In this case the editor's built-in Python function is not really helpful, because it joins every pair of words whose combined form is in the dictionary, for example turning 'nach dem' into 'nachdem'. But these have different meanings and should remain untouched.
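
To make the difference between the two rules concrete, here is a minimal sketch in plain Python. The function names and the tiny word set `KNOWN` are made up for illustration; in calibre the lookup would go through the editor's spelling dictionaries instead of a set.

```python
# Hypothetical stand-in for a German dictionary (illustration only).
KNOWN = {"betrachten", "nach", "dem", "nachdem"}


def joins_if_combined_known(w1, w2, known):
    """Loose rule: join whenever the combined word is in the dictionary."""
    return (w1 + w2) in known


def joins_only_if_parts_unknown(w1, w2, known):
    """Strict rule: join only if both parts are unknown and the combination is known."""
    return w1 not in known and w2 not in known and (w1 + w2) in known


pairs = [("betrach", "ten"), ("Josef", "Bankl"), ("nach", "dem")]
loose = [joins_if_combined_known(a, b, KNOWN) for a, b in pairs]
strict = [joins_only_if_parts_unknown(a, b, KNOWN) for a, b in pairs]
```

With this toy word set, both rules join 'betrach ten' and leave 'Josef Bankl' alone, but only the loose rule would also join 'nach dem'.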

So hopefully I'm not the only one with this problem and it's not a big waste of your time to think about it.
And sorry for my miserable English.
Any advice would be great.

Sincerely, Steve

kovidgoyal · 01-28-2017, 07:26 AM

It would basically work like this example, https://manual.calibre-ebook.com/fun...phenated-words except that you have to change it to look for words separated by spaces instead of hyphens.

scratch · 01-28-2017, 08:15 AM

Thank you for your quick answer.
I tried this already. What I changed was

(\w+)\s*-\s*(\w+)
to simply this line
(\w+)\s(\w+)

But unfortunately it does not work. Maybe that's because I'm completely Python-blind.

And there should also be a line that checks whether the neighbouring words are both not in the dictionary before joining them, which I cannot see in the example (or it's there and I don't understand it).
Thanks anyway.

...and now I noticed something else.
In this sentence
<p>Solange Menschen auf die Welt kommen</p>
(\w+)\s(\w+)
finds words 1+2, then 3+4, and then 5+6,
so a mistake between words 2 and 3 is ignored.
Like this:
<p>Solange Men schen auf die Welt kommen</p>
I understand why, but I am not able to find out how to avoid it.
Again, any further advice is welcome.
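
The skipped pair can be reproduced outside calibre with Python's standard `re` module. The sketch below is only an illustration: it first shows how the plain pattern consumes both words of each match, then uses a lookahead (one possible workaround, not something taken from the calibre manual) so that the second word is inspected without being consumed.

```python
import re

sentence = "Solange Men schen auf die Welt kommen"

# Plain pattern: each match consumes both words, so the search resumes
# after the second word and the pair (Men, schen) is never examined.
consumed = re.findall(r"(\w+)\s(\w+)", sentence)

# Lookahead variant: only the first word and the space are consumed;
# the second word is captured inside (?=...) and stays available
# as the first word of the next match.
overlapping = re.findall(r"\b(\w+)\s(?=(\w+))", sentence)
```

Here `consumed` contains the pairs 1+2, 3+4 and 5+6, while `overlapping` contains every neighbouring pair, including ('Men', 'schen'). Overlapping matches are handy for finding candidates; for the actual replacement it is usually simpler to walk over a list of words.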

kovidgoyal · 01-28-2017, 12:22 PM

I don't have the time to write the function for you, but it would go something like this:

Code:

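# 'text' is the run of text to fix; 'dictionaries' is the object calibre passes to the function
words = text.split()
i = 0
while i < len(words) - 1:
    w1, w2 = words[i:i+2]
    # Join only when neither half is a known word but the combination is
    if not dictionaries.recognized(w1) and not dictionaries.recognized(w2) and dictionaries.recognized(w1 + w2):
        # Replace the pair with the joined word (avoids leaving an empty entry behind)
        words[i:i+2] = [w1 + w2]
    i += 1
return ' '.join(words)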
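
For readers who want something closer to a complete function, here is one way the outline above might be wired into the `replace()` signature that calibre's editor uses for function mode. It is a sketch under several assumptions, not a tested calibre recipe. It assumes the search expression is `>[^<>]+<` (text between tags, as in the manual's hyphenation example). The helper name `join_split_words` is made up. HTML entities are left untouched rather than decoded. Unlike `text.split()`, it tokenizes with a regex so that the original spacing and punctuation are preserved.

```python
import re

_is_word = re.compile(r"\w+").fullmatch


def join_split_words(text, recognized):
    """Join 'w1 w2' when w1 and w2 are both unknown but w1 + w2 is known.

    `recognized` is any callable taking a word and returning True/False.
    """
    # Alternating runs of word characters and everything else
    tokens = re.findall(r"\w+|\W+", text)
    out = []
    i = 0
    while i < len(tokens):
        if i + 2 < len(tokens):
            w1, gap, w2 = tokens[i:i + 3]
            if (_is_word(w1) and gap.isspace() and _is_word(w2)
                    and not recognized(w1) and not recognized(w2)
                    and recognized(w1 + w2)):
                out.append(w1 + w2)
                i += 3  # skip both halves and the gap between them
                continue
        out.append(tokens[i])
        i += 1
    return "".join(out)


def replace(match, number, file_name, metadata, dictionaries, data, functions, *args, **kwargs):
    # match.group() looks like '>some text<'; strip and restore the brackets
    text = match.group()[1:-1]
    return ">%s<" % join_split_words(text, dictionaries.recognized)
```

Because a pair is only joined when the gap between the two words is pure whitespace, entity names such as the `amp` in `&amp;` are never merged with a neighbouring word.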

scratch · 01-28-2017, 12:44 PM

Thank you for your kind advice.
This gives me some hints to think about for the next few days.
And BTW
Thank you for calibre which is simply the best!!
