The University Digitisation Centre hopes to add value for clients in the University community by converting their analogue-only typed or printed text materials into electronic form. Clients will be able to exploit work which has slipped out of usability due to format creep. We will produce new Word versions of old documents with formatting largely intact, in a form usable on both PC and Macintosh platforms.
This process involves scanning the document (or importing your existing PDF or TIFF image file) and simultaneously outputting both a searchable PDF and an electronic version in Word. These twin files enable the user to edit the text in Word to maximise its accuracy whilst referring to the image. The Word text can then be pasted into Excel or any other program, or used to create another PDF which, being ‘born digital’, is a tiny fraction of the file size of the image file.
The University Digitisation Centre does not undertake to edit Word documents. Clients will receive the image files and Word documents as output from the scanning workflow, on a medium agreed in advance: USB, DVD, CD, or email.
- Target items:
- Research materials
- Lecture notes/materials
- Various published works
- Collection descriptions
- Shelf lists
- Technical manuals
- Benefits:
- The ability to cut and paste large amounts of text when constructing a new document, rather than re-keying previously typed or printed work.
- The ability to move established text passages into contemporary formats rather than merely create searchable PDFs of “old” documents for internet display etc.
- The potential to copy and paste table-formatted materials directly into a format like Excel and then into a database.
- Once edited, documents can be saved as PDF, creating a new “born digital” document ready for web publication, with a tiny file size compared to the image file.
- Conditions:
- The project advocate or the University must own copyright in the document, or have properly documented permission to copy or reformat it.
- The University Digitisation Centre’s present equipment is designed to capture loose documents – so unbound material is preferred (there is an established process for guillotining bound works like theses).
- Maximum size of A3.
- Double-sided documents are catered for.
- To maximise OCR accuracy, all documents will be scanned in black and white.
- The ability to offer this service is dependent upon resources being able to keep up with demand. Priority will always be given to the resourcing of in-house projects.
Some materials have already been rendered into electronic form and samples follow. These examples illustrate the scanning system's capability to produce a Word document which represents the original text and formatting with a good degree of accuracy.
- Test cases
1 – PhD Thesis
This thesis was produced in the UK in 1996 using a word-processing program. The author found himself in Australia with only his bound copy of the thesis, the electronic version having long succumbed to format creep. The University Digitisation Centre thanks the author for his kind permission to reproduce these pages.
Two pages are featured – one from the main body of the text which demonstrates the rendering of several aspects of formatting common in academic writing (various line spacing, indentation, various point sizes, italics and footnotes) and one page which shows the rendering of a bibliography. Both the PDF image file and the Word file of these sample pages are shown so that you can assess the accuracy of the rendering for yourself.
Please note that the Word document is ‘as is’ from the scanner: no editing has been done. There is one misread word on this page (divmitory), which Word's spell-checker will find. In addition, the footnote is not ‘linked’, one whole paragraph has come through in italics rather than just one word, and one full stop has come through as a comma, so the need for some review and editing is obvious.
2 – Archive box list
Detailed lists of the contents of archive boxes are produced as part of the control documentation when materials are transferred for intermediate or permanent storage. If the list exists only in analogue form, that information cannot easily be transferred into databases to facilitate searching. One such box list has been scanned and, because the information is largely in table form, the text was copied and pasted from Word into Excel.
Please note that the Word document is “as is” from the scanner: no editing has been done. There are OCR errors in the headers of these pages due to shading on the original document. Such shading can be reduced further than it has been here, so the OCR could potentially be improved in cases such as this. Nevertheless, little effort would be needed to correct the OCR once in the table header in the Word version, delete the other occurrences of the header, and then transfer the table to Excel. This editing would be the work of moments. Please note also that the last page of the document was not properly formatted when originally typed.
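For clients who prefer to script the header clean-up described above rather than do it by hand, the following is a minimal sketch in Python. It is not part of the Centre's workflow. It assumes the table has already been copied out of Word as a list of rows, and the function names, the sample box list and the 0.6 similarity threshold are all hypothetical choices made for illustration. It keeps one hand-corrected header, drops any row that closely resembles it (a repeated header misread by OCR), and writes the rest as CSV text that Excel can open.

```python
import csv
import io
from difflib import SequenceMatcher


def is_repeated_header(row, header, threshold=0.6):
    """Return True if `row` looks like a (possibly misread) copy of `header`."""
    row_text = " ".join(cell.strip().lower() for cell in row)
    header_text = " ".join(cell.strip().lower() for cell in header)
    return SequenceMatcher(None, row_text, header_text).ratio() >= threshold


def clean_box_list(rows, corrected_header, threshold=0.6):
    """Keep one corrected header, drop blank rows and repeated headers."""
    cleaned = [list(corrected_header)]
    for row in rows:
        if not any(cell.strip() for cell in row):
            continue  # skip empty rows
        if is_repeated_header(row, corrected_header, threshold):
            continue  # skip headers repeated at the top of each page
        cleaned.append([cell.strip() for cell in row])
    return cleaned


def to_csv_text(rows):
    """Serialise rows as CSV text suitable for opening in Excel."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Hypothetical sample: a two-page box list whose shaded headers were misread.
corrected_header = ["Box", "Item", "Description", "Date range"]
scanned_rows = [
    ["Bcx", "ltem", "Descripton", "Date rarge"],   # page 1 header, misread
    ["1", "1", "Correspondence files", "1962-1965"],
    ["1", "2", "Committee minutes", "1966"],
    ["Box", "Itern", "Description", "Date range"],  # page 2 header, misread
    ["2", "1", "Staff newsletters", "1970-1974"],
]

cleaned_rows = clean_box_list(scanned_rows, corrected_header)
csv_text = to_csv_text(cleaned_rows)
```

The threshold is a judgment call: set too low, it may discard genuine data rows that happen to resemble the header; set too high, it may leave badly misread headers in place. The output should therefore still be checked against the image file.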