From image capture to online access, the University's Pattern Recognition and Image Analysis (PRImA) Research Laboratory is instrumental in improving access to printed works in Europe’s libraries and archives:
- Improving access to personal, collective and community histories through increasing the availability of previously inaccessible, now digitised objects;
- Maximising production efficiency, quality and volume, whilst significantly reducing the costs of digitisation;
- Providing libraries and archives with access to new-generation digitisation technologies, contributing to their sustainability and improving their skills base;
- Bringing benefit to the wider European economy by supporting continued investment in and dissemination of the technologies.
Transforming heritage into digital media
Dr Apostolos Antonacopoulos and Stefan Pletschacher of the School of Computing, Science and Engineering have responded to new challenges in digitisation by developing document analysis and digitisation software tools and evaluation frameworks, leading to new approaches to the mass digitisation of printed text. Libraries and archives around the world had relied on digitisation service providers whose best technologies were designed primarily for modern business documents (the service providers’ largest commercial market) and could not take fully into account the significant challenges posed by ageing books and newspapers.
- Apostolos led the development of Aletheia, a production-quality system for accurate yet cost-effective ground truthing of large numbers of documents. Ground truth is crucial for evaluation at the development stage of tools and for quality assurance within production workflows for digital libraries. Aletheia aids the user with a number of automated and semi-automated tools, developed and improved in association with major libraries across Europe and commercial service providers, which are now using the tool in a production environment.
- Apostolos established the common baseline for evaluating different approaches to mass digitisation by developing a comprehensive, large-scale reference dataset with ground truth at various levels, compiled in partnership with the content-holding partners and representative of their collections. He also defined evaluation metrics and scenarios, and the tools to implement them.
- The dataset, hosted by the IMPACT Centre of Competence in Digitisation, is a unique resource that enables the objective evaluation of printed holdings: researchers use it to create new methods of digitisation, libraries to evaluate their holdings for digitisation, and service providers, who put together workflows of different methods, to identify what works best in given scenarios. Covering texts from as early as 1500, containing material from newspapers, books, pamphlets and typewritten notes, and created, maintained and expanded by University of Salford (PRImA) researchers, the dataset forms a repository of document images reflecting the holdings and priorities of major European libraries, running to over 600,000 document images, of which 50,000 have been ground truthed.
- Large-scale digitisation of historical documents also demands robust methods that cope with the presence of frequent distortions and 'noisy' artefacts. Apostolos led the development of a hybrid text line segmentation method that uses a novel data structure and a rule base to combine the strengths of top-down and bottom-up approaches while minimising their weaknesses. The effectiveness of the approach has been methodically evaluated in the context of large-scale digitisation.
- The British Library, the National Library of the Netherlands and the PRImA research lab explored methods of improving Optical Character Recognition (OCR) for use in the digitisation of less standardised material. By applying extensive research expertise to material that is exceptional in both breadth and volume, such as the collections in the British Library, the collaboration made a significant impact on the digitisation of historical documents. It increased resource discovery success for historic mass digitisation and maximised production efficiency, quality and volume, whilst significantly reducing the costs of digitisation.
- The e-Strategy & Information Systems Programme Manager at the British Library said: “It is absolutely vital institutions like the British Library, the National Library of the Netherlands and technical experts like the University of Salford work together, sharing our experiences and resolving the challenges we face in digitising historic texts to ensure that we deliver digital resources, which are sustainable.”
- The Wellcome Library commissioned Apostolos to evaluate what current OCR methods could achieve in the digitisation of their archives, including the papers of Francis Crick, who discovered (with James Watson) the double helix of DNA and helped to crack its code. The material in question was a combination of typewritten documents, notes and, for example, versions of Francis Crick’s seminal paper demonstrating the structure of DNA. Apostolos evaluated which types of document would yield high-quality text after digitisation and used Aletheia to ground truth the documents to test the required level of accuracy. The evaluation results helped Wellcome assess its digitisation strategy and decide which document types to prioritise for digitisation.
- “Regarding our archival collections, there is a wide range of content that is theoretically OCR'able. To find out we commissioned the University of Salford PRImA (Pattern Recognition and Image Analysis Research) to do a benchmarking exercise from which we could determine whether we could rely on raw OCR outputs, should not OCR this type of material at all, or to test various methods to improve OCR'ability.” Digitisation Programme Manager, Wellcome Library, Wellcome Trust.