Project Title: Introducing 5 Old European language models to ABBYY FineReader
A collaborative initiative undertaken by a consortium of 14 universities from 7 European countries and the US, co-funded by the European Union.
The Challenge
The Meta-E project focuses on providing a technological basis for the digitisation and web publishing of valuable printed sources spanning several centuries of European history. Among other content management products, the project required an OCR system capable of recognizing historical texts from the period 1800-1938 printed in Frakturschrift (an old-style black-letter typeface used in the majority of printed texts in Central and Northern European countries up to the middle of the 20th century). At that point no omnifont Frakturschrift system was available; every OCR product had to be trained separately on each book to achieve acceptable quality.
The Solution
Meta-E coordinators started looking for a powerful OCR package that could be extended to meet their requirements. ABBYY FineReader was chosen for its unrivalled recognition accuracy, multilingual support (176 modern languages) and accessibility. ABBYY, the Russian-German-American manufacturer of FineReader products, took up the project as a direct contractor and developed the omnifont part (introducing the Fraktur alphabet to FineReader). The linguistic part of the project was subcontracted to ATAPY Software, ABBYY's long-term partner in OCR/ICR and linguistic development.
ATAPY's work for the Meta-E project was to build Language Models for 5 old European languages: Old English, Old French, Old German, Old Italian, and Old Spanish. A Language Model covers all the words of a given language together with their grammar paradigms. It serves as a dictionary against which FineReader checks the spelling of every word, so that it can highlight incorrectly recognized characters and build correct spelling hypotheses for the operator to choose from. Such dictionaries are not just plain sets of words with all their grammatical forms: a database of that kind would be enormous and hardly manageable. A typical OCR dictionary is well optimized: each word is stored as a single database entry, to which an appropriate grammar paradigm is assigned in the form of a linguistic formula. ATAPY was to study material dating from the given time span, form the word stock for the 5 project languages on that basis, and assign an appropriate paradigm formula to each of hundreds of thousands of words, relying on trusted sources such as authentic dictionaries and original old European texts.
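The idea of storing each word once and attaching a paradigm, rather than listing every grammatical form, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the paradigm names, the suffix-list format and the sample entries are invented, and the actual FineReader formula syntax is not described here. Spelling hypotheses are approximated with the standard library's `difflib`, which stands in for whatever matching FineReader really uses.

```python
import difflib

# Hypothetical paradigm table: paradigm name -> suffixes a stem may take.
# The real FineReader "linguistic formula" format is not public in this text.
PARADIGMS = {
    "noun_s": ("", "s"),
    "verb_regular": ("", "s", "ed", "ing"),
}

# One database entry per word: stem -> paradigm name (illustrative data).
ENTRIES = {
    "book": "noun_s",
    "letter": "noun_s",
    "print": "verb_regular",
}


def is_known(word):
    """Accept a word if it splits into a stored stem plus an allowed suffix."""
    for cut in range(1, len(word) + 1):
        stem, suffix = word[:cut], word[cut:]
        paradigm = ENTRIES.get(stem)
        if paradigm is not None and suffix in PARADIGMS[paradigm]:
            return True
    return False


def all_forms():
    """Expand every entry into its full set of forms (only needed for suggestions)."""
    return sorted(
        stem + suffix
        for stem, paradigm in ENTRIES.items()
        for suffix in PARADIGMS[paradigm]
    )


def hypotheses(word, limit=3):
    """Propose close dictionary forms for a word the recognizer got wrong."""
    return difflib.get_close_matches(word, all_forms(), n=limit, cutoff=0.7)
```

With this toy data, `is_known("printed")` succeeds because "print" carries the `verb_regular` paradigm, while a misread such as "prlnted" is rejected and `hypotheses("prlnted")` offers "printed" as a correction. Three entries cover ten word forms here, which is the storage saving the paradigm approach is meant to deliver.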
The 5 Language Models developed by ATAPY were built into ABBYY FineReader, turning it into a powerful tool to assist the Meta-E consortium in its large-scale digitisation work. Moreover, the new OCR functionality is available for purchase as an additional module for ABBYY FineReader 7.0, a new product released by ABBYY Software House. FineReader 7.0 with Old Language support is the industry's first boxed OCR product to recognize Renaissance and Late Medieval sources. It is targeted at European libraries and public organizations engaged in the preservation and publishing of cultural assets, and at the service bureaus that help them fulfill this mission.
Tools and Technologies
- ABBYY FineReader
1. Analysis of dictionaries and original texts of the given period
ATAPY's Linguistics Department carefully selected 10 dictionaries reflecting the state of the project languages, with years of publication ranging from 1808 to 1930. It also thoroughly analyzed 105 authentic books from that period, amounting to more than 50 MB of archived information.
2. Building Language Models
ATAPY's linguistic engineers were authorized by ABBYY to build their Language Models on the basis of contemporary FineReader models, which made the process considerably easier. Even so, this stage proved very laborious, as the engineers had to manually compare the information from the authentic dictionaries and texts (about 500,000 entries in total) with the existing FineReader models. The resulting project language dictionaries comprised 458,767 words, of which 61% were taken as is from the FineReader vocabulary and 36% were added from the analyzed sources. About 3% of the words had their grammar paradigms corrected to match the grammar rules of the 18th to early 20th centuries; to make these corrections, the linguists had to add 159 historic grammar paradigms that were missing from the contemporary models.
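To give a feel for the scale behind these percentages, the short Python sketch below converts them into approximate word counts. The total and the three shares come from the figures above; because the published percentages are rounded, the resulting counts are estimates only, and the category labels are names chosen for this example.

```python
# Figures quoted in the text; labels are illustrative names, not project terms.
TOTAL_WORDS = 458767
SHARES = {
    "taken_as_is": 61,          # reused unchanged from the FineReader vocabulary
    "added_from_sources": 36,   # new words from period dictionaries and texts
    "paradigm_corrected": 3,    # words whose grammar paradigm was revised
}


def approximate_counts(total, shares):
    """Turn rounded percentage shares into estimated absolute counts."""
    return {label: round(total * pct / 100) for label, pct in shares.items()}


counts = approximate_counts(TOTAL_WORDS, SHARES)
for label, count in counts.items():
    print(f"{label}: about {count:,} words")
```

Under these assumptions, roughly 280,000 words were reused, about 165,000 were added from the historical sources, and close to 14,000 had their paradigms revised.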
3. Testing and QA
The resulting Language Models were tested on the original old text material acquired during the Analysis stage. The tests showed vocabulary coverage of 98.91% for Old English, 99.16% for Old French, 96.58% for Old German, 98.58% for Old Italian and 98.79% for Old Spanish.
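The text does not spell out how vocabulary coverage was measured. One plausible reading, assumed here, is the share of running words in a test text that are found in the dictionary. The Python sketch below follows that assumption; the tokenizer, the function names and the sample data are illustrative and are not taken from the project.

```python
import re


def tokenize(text):
    """Split text into lowercase words made of letters only (Unicode-aware)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def vocabulary_coverage(text, lexicon):
    """Return the percentage of word tokens in `text` that occur in `lexicon`."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return 100.0 * known / len(tokens)


# Illustrative data only: a tiny lexicon and a sentence with one unknown spelling.
sample_lexicon = {"the", "old", "book"}
sample_text = "The old book, the olde book."
print(f"{vocabulary_coverage(sample_text, sample_lexicon):.2f}%")
```

Counting distinct word types instead of running tokens would be an equally reasonable definition and would generally give a different figure, so results are only comparable when the same definition is used throughout.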
Related links
http://meta-e.uibk.ac.at
Disclaimer
All information contained in this Section is owned by RUSSOFT.org and its Participants and is protected by Russian and international copyright laws. Any reproduction or republication of all or part of this Section has to remain intact and include a notice on the copyright of RUSSOFT.org or the Participants, as applicable.
While the information of this Section has been presented with all due care, RUSSOFT.org does not warrant the accuracy, completeness, usefulness and truth of Section’s information, links and logos derived from third parties. RUSSOFT.org is not liable for any loss or damage occurring from the use of this Section’s materials.






