12-27-2020, 11:04 AM | #1 |
Connoisseur
Posts: 77
Karma: 2178856
Join Date: Oct 2013
Device: Kobo Clara HD
|
Abbyy Finereader 15 gothic/Fraktur Altdeutsch/Oldgerman
The OCR for Old German, Old English, and Old French works very well. But my current example is a book set in Gothic type by an author who uses many foreign languages in the text. The main text is in Old German (Fraktur), but over long stretches of the book there are pages with quotations and comparisons in Greek, Latin, French, and English, all on the same page. FR 15 is a bit overwhelmed by this. The foreign languages are recognized either as garbage (Greek) or with many errors. Does anyone have an idea how to deal with these weaknesses of FR 15? |
12-28-2020, 09:36 PM | #2 | |
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
2. Choose "Specify Languages Manually", then check the checkboxes for the languages you want to detect. For example, I use this:
Code:
English; German; French;
Note: Don't go overboard with languages, though. Finereader uses this list to look up dictionary words and to enable certain letters of the alphabet. The more languages you add, the more likely false positives become. For example, "der" is a German word but not an English one, so an English OCR error like "un der" will be considered okay (since Finereader will think it's German). |
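A minimal sketch of why extra languages create false positives, assuming a toy model in which a token is accepted whenever it appears in the word list of any enabled language. The word lists and the `unknown_tokens` helper are hypothetical and only illustrate the idea; they say nothing about how Finereader is actually implemented.

```python
# Toy word lists (hypothetical, tiny on purpose). Real OCR dictionaries are far larger.
WORDS = {
    "English": {"the", "swordfish", "was", "found", "under", "sea"},
    "German": {"der", "die", "das", "und"},
    "French": {"un", "une", "le", "la"},
}


def unknown_tokens(text, enabled):
    """Return the tokens that no enabled language's word list contains."""
    accepted = set().union(*(WORDS[lang] for lang in enabled))
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return [t for t in tokens if t and t not in accepted]


line = "The swordfish was found un der the sea."

# English only: the split word is flagged, so it can be caught and fixed.
print(unknown_tokens(line, ["English"]))  # → ['un', 'der']

# English + German + French: "un" and "der" are each valid somewhere,
# so nothing is flagged and the error slips through.
print(unknown_tokens(line, ["English", "German", "French"]))  # → []
```

In this toy model, every added language can only shrink the list of flagged tokens, which is the trade-off the note describes.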
|
12-29-2020, 03:59 PM | #3 |
the rook, bossing Never.
Posts: 12,250
Karma: 89531599
Join Date: Jun 2017
Location: Ireland
Device: All 4 Kinds: epub eink, Kindle, android eink, NxtPaper11
|
But a book might contain English dialect, in which case "der" means "there".
At the end of the day you need good proofreading skills. |
12-29-2020, 04:59 PM | #4 | |
Connoisseur
Posts: 77
Karma: 2178856
Join Date: Oct 2013
Device: Kobo Clara HD
|
Quote:
I had thought of it like this: if the main text of the book is in Old German, I select Old German for OCR. If other languages and scripts are used in the text, I simply add those languages to the OCR language list, and with that I start the recognition process for the entire book, volume by volume. With 4 volumes you easily reach 2000 pages. That this doesn't work is a weakness of FR 15, isn't it? I don't understand what is so hard about getting FR 15 to perform at this level. It should actually be possible to write a program that reads the text word by word, automatically recognizes the language and script of each word, and assigns the right dictionary to it. The program would then write these detection results to a list, or to one list per page. In the final OCR pass, the program would only need to process the text using the list or lists written at the beginning, since they say which dictionary is responsible for which word. Is it really all that much more complicated than I think? |
|
12-30-2020, 12:36 AM | #5 | ||||
Wizard
Posts: 2,297
Karma: 12126329
Join Date: Jul 2012
Device: Kobo Forma, Nook
|
Quote:
Quote:
For example, I work on books that are "99% English", but they have many:
When you choose an OCR language in the dropdown, this enables two major things:
Alphabets
Choosing English enables these basic characters:
A-Z + a-z
English doesn't commonly use accented characters, so if OCR runs across an 'ö', Finereader will probably think the diaeresis is specks of dust and guess that you meant an 'o'.
Choosing German enables more letters + accented characters:
ßäöü
And if you worked on a Spanish book, you'd get letters like ñ in "mañana":
áéíñóúü
French:
àâæçèéêëÿœ
(The true alphabets Finereader uses are hidden in the spoiler.)
Spoiler:
Dictionaries
Another way OCR becomes more accurate is by using words from the actual language. Let's say you had a sentence:
Code:
The swordfish was found un der the sea.
Hmmm, "un" + "der" aren't English words, but "under" is in the English dictionary. Most likely that little space was a font issue or a scanning artifact. If it's 99.9% sure, it MAY combine those into "under".
When you add in the German dictionary, it will think differently. "un" + "der" are two valid German words, so OCR will now think:
Code:
The swordfish was found <--- English
un der <--- German
the sea. <--- English
The more dictionaries you add, the more errors of this type get introduced, which is why you want to use the MINIMAL AMOUNT OF LANGUAGES POSSIBLE.
Yes. You can read a little about this in:
"Strategies for Reducing and Correcting OCR Errors" by Martin Volk, Lenz Furrer and Rico Sennrich (Language Technology for Cultural Heritage)
https://www.researchgate.net/publica...ing_OCR_Errors
They go through a few other corrections at each stage (like patterns + merging + book-level statistics).
Quote:
If you open the "Document Language" dropdown, you can see an option called "Automatically select document language from the following list". There, you can choose which common languages you run across. For example, mine has:
Code:
English; French; German; Italian; Spanish
Let's say my Finereader runs across a lot of umlauts. It'll go: "Hmmm, there seem to be A LOT of errors on this page, maybe English is the wrong language, let me run this paragraph again through German."
Quote:
Which language is this word?
Code:
canal <--- English
canal <--- Spanish
canal <--- Portuguese
canal <--- Catalan
"canal" in those other 3 languages means channel, as in "change the TV channel".
To guess a document's language, you need more text, like an entire phrase/sentence/paragraph/page. Then you can begin using statistics + dictionaries. For example:
Code:
subscribe to my channel. <--- English
suscribirse a mi canal. <--- Spanish
inscreva-se no meu canal. <--- Portuguese (Brazil)
subscriure's al meu canal. <--- Catalan
iscriviti al mio canale. <--- Italian
If that doesn't work, you start looking at larger collections of words (called n-grams), but there's still a large amount of overlap between languages.
Computers are getting pretty good (see pasting into Google Translate), but when you start getting into minutiae, like Portuguese (Portugal) + Portuguese (Brazil)... things become much harder.
Better for humans to give the computer hints than to leave the computer 100% guessing. If you help OCR along by telling it what languages you're dealing with, the statistics + red squigglies really help.
Last edited by Tex2002ans; 12-30-2020 at 01:09 AM. |
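A rough sketch of the "one word is not enough" point, assuming a toy scorer that counts how many words of a text appear in each candidate language's word list. The word lists and the `score_languages` / `guess_language` functions are made up for illustration; real detectors rely on much larger dictionaries and n-gram statistics.

```python
# Tiny hypothetical word lists, built from the example sentences in the post.
WORDS = {
    "English": {"subscribe", "to", "my", "channel", "canal"},
    "Spanish": {"suscribirse", "a", "mi", "canal"},
    "Portuguese": {"inscreva-se", "no", "meu", "canal"},
    "Catalan": {"subscriure's", "al", "meu", "canal"},
    "Italian": {"iscriviti", "al", "mio", "canale"},
}


def score_languages(text, candidates):
    """Count, per candidate language, how many tokens are in its word list."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    return {lang: sum(t in WORDS[lang] for t in tokens) for lang in candidates}


def guess_language(text, candidates):
    """Return the single best-scoring language, or None when there is a tie."""
    scores = score_languages(text, candidates)
    best = max(scores.values())
    winners = [lang for lang, s in scores.items() if s == best]
    return winners[0] if len(winners) == 1 and best > 0 else None


candidates = ["English", "Spanish", "Portuguese", "Catalan", "Italian"]

# A single word is ambiguous: four languages tie on "canal".
print(guess_language("canal", candidates))  # → None

# A whole sentence gives enough evidence to pick one language.
print(guess_language("suscribirse a mi canal.", candidates))  # → Spanish
```

Passing only the languages that actually occur in the book as `candidates` makes ties and wrong guesses less likely in this toy model, which is in line with the advice to give the OCR program hints.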
||||
02-23-2021, 06:18 PM | #6 |
Connoisseur
Posts: 60
Karma: 201178
Join Date: Mar 2015
Location: Israel
Device: Kobo Aura H20, Kobo Forma
|
You can also specify a language for each text zone, if a block of text is all in the same language.
|