Project Title: The Old Danish books conversion project for the Royal Danish Library, Denmark
The Challenge
An ambitious project of the Royal Danish Library, "Arkiv for Dansk Litteratur", aims to convert the whole of the Danish literary canon (the works of 70 carefully selected Danish authors from the 11th to the early 20th century) into XML, specifically TEI, an XML standard for publishing literary and linguistic material. The large number of books, their diverse content (verse, prose, pictures, tables, notes and comments), and the requirement to preserve layout and typesetting made this an unusually demanding task.
To succeed, the contractor chosen for the project had to combine seemingly incompatible qualities. On the one hand, it had to be highly competent in modern Optical Character Recognition (OCR) technologies, proficient in XML coding, and capable of designing specialized software tools to facilitate the conversion process. This required strong IT qualifications and extensive hands-on experience in automated data input. On the other hand, almost all real-life mass data input projects still involve a lot of manual labor. No matter how accurate an OCR system is, it makes mistakes, especially on material as difficult as old books with complex layouts. In addition, in most cases full automation of XML coding was not possible because of the diversity of the attributes. The contractor therefore had to be able to offer many qualified operators at a reasonable cost; otherwise the project's price tag would have exceeded the financial capabilities of any library.
The Solution
The IT staff of the Royal Danish Library approached the problem by searching for a partner outside the EU. Their attention was drawn to Russia, the home of the world-renowned OCR system ABBYY FineReader. After months of trials and pilot projects, the Library settled on ATAPY Software, a leading developer of custom OCR solutions based on FineReader technologies and an experienced media service provider. The pilot projects demonstrated that ATAPY combined high IT professionalism with unmatched access to an extensive pool of qualified multilingual staff. For 2002, the Library set ATAPY a production plan of 230 books (nearly 100,000 pages), which was fulfilled on time.
Because ATAPY is a software company as well as a media service company, it could assign experienced customization engineers to design and improve project-specific utilities for every conversion stage. As the project progressed, this allowed ATAPY to gradually reduce processing time by a further 10 to 20%, passing the savings on to the client.
All finished books are currently available online at http://www.adl.dk.
The book conversion process consisted of three stages:
1. Reading scanned images into text format. The Library provided ATAPY with scanned pages in TIFF format. The quality of the images was remarkably good, which contributed significantly to the efficiency of the remaining stages. ABBYY FineReader automatically analyzed ("segmented") the images to distinguish text from pictures and to detect table structure. Layout operators then reviewed the segmentation results. After that, the pages were recognized using FineReader's omnifont capabilities, augmented with many font-specific patterns that raised the recognition quality for most old books. A group of verification operators then thoroughly proofread the OCR results. Special attention was paid to non-Danish passages, some of which could not be processed by OCR at all (Ancient Greek, Hebrew, etc.).
2. The next step was the preparation of the initial XML document. Verified text was exported to Microsoft Word format. A group of XML operators, equipped with customized tools and macros, used Word as the environment for adding XML tags. This was an intellectually demanding task, since the full list of tags contained over 50 entries and only half of them could be identified automatically. The remaining half had to be spotted and marked
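The split between automatically identified tags and tags left to the operators can be illustrated with a minimal sketch. It is not ATAPY's tooling, which ran as customized tools and macros inside Word. The rules, the function names `classify` and `to_tei`, and the sample text are all assumptions made for illustration. Only the element names `head`, `p`, `lg` and `l` are taken from TEI.

```python
import xml.etree.ElementTree as ET


def classify(block: str) -> str:
    """Guess a TEI element for a block of verified text.

    The rules below are illustrative assumptions, not the project's real rules.
    Returns "unknown" when no rule applies, so an operator can decide.
    """
    lines = block.splitlines()
    if len(lines) == 1 and lines[0].isupper():
        return "head"  # a single all-capitals line is treated as a heading
    if len(lines) > 1 and all(len(line) < 50 for line in lines):
        return "lg"  # several short lines are treated as a verse group
    if len(lines) == 1 and lines[0].rstrip().endswith((".", "!", "?")):
        return "p"  # a single line ending a sentence is treated as prose
    return "unknown"


def to_tei(blocks):
    """Tag the blocks that can be classified; collect the rest for review."""
    body = ET.Element("body")
    pending = []  # (index, text) pairs left for manual tagging
    for index, block in enumerate(blocks):
        kind = classify(block)
        if kind == "lg":
            group = ET.SubElement(body, "lg")
            for line in block.splitlines():
                ET.SubElement(group, "l").text = line
        elif kind in ("head", "p"):
            ET.SubElement(body, kind).text = block
        else:
            pending.append((index, block))
    return body, pending


if __name__ == "__main__":
    sample = [
        "FIRST CHAPTER",
        "Roses are red\nViolets are blue",
        "This is a sentence.",
        "ambiguous fragment",
    ]
    body, pending = to_tei(sample)
    print(ET.tostring(body, encoding="unicode"))
    print("Left for operators:", pending)
```

In this sketch, unclassified blocks are kept out of the XML tree and returned with their position, so that an operator can tag them and reinsert them in document order.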






