Business document scanning

Business document scanning is a generic term for scanning large numbers of single sheets of paper. While the documents scanned are often business documents (e.g. invoices, contracts, minutes), the term can cover any type of document where the sheets of paper are all separate. The main aim of this process is to produce a legible digital version of a document in the smallest possible file size.

The minimum resolution for business document scanning is generally accepted as 200 dpi, although higher resolutions can be used without significant increases in file size.
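As a rough illustration of what a resolution setting means in practice, the sketch below works out the pixel dimensions and uncompressed size of a single page. The A4 page size, the function names and the bit depths are assumptions chosen for the example. The figures are for raw, uncompressed images only; the size of the final compressed file depends on the page content and the compression used, which is not modelled here.

```python
# Assumption: an A4 page (210 mm x 297 mm), converted to inches.
A4_WIDTH_IN = 210 / 25.4
A4_HEIGHT_IN = 297 / 25.4


def page_pixels(width_in: float, height_in: float, dpi: int) -> tuple:
    """Return the (width, height) of the scanned image in pixels."""
    return round(width_in * dpi), round(height_in * dpi)


def uncompressed_bytes(width_px: int, height_px: int, bits_per_pixel: int) -> int:
    """Return the raw image size in bytes, rounded up to a whole byte."""
    return (width_px * height_px * bits_per_pixel + 7) // 8


if __name__ == "__main__":
    for dpi in (200, 300):
        width_px, height_px = page_pixels(A4_WIDTH_IN, A4_HEIGHT_IN, dpi)
        # 1 bit per pixel = black and white, 8 = greyscale, 24 = colour.
        for bits in (1, 8, 24):
            size = uncompressed_bytes(width_px, height_px, bits)
            print(f"{dpi} dpi, {bits}-bit: {width_px} x {height_px} px, {size} bytes raw")
```

An A4 page comes out at 1654 x 2339 pixels at 200 dpi and 2480 x 3508 pixels at 300 dpi.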

Scanning of business records should follow the PROS standards as a minimum:

PROS 19/05 Digitisation: Image Requirement

File formats

PDF is the most commonly used file format as it combines highly compressed images with OCR text in a widely accessible format.

While most scanning software can output multi-page TIFF images, this is generally not advisable as this format is not widely supported for viewing.

Scanned documents intended for use as permanent electronic records are required to use a format compliant with PROV's recordkeeping standards.

Business document scanning software can automatically remove "blank" pages when more than a set percentage of the page is white.

Some blank pages may not be detected for a number of reasons:

Ink bleeding through the paper, e.g. from stamps or marker pens
Dirty pages
Thin paper resulting in reversed text appearing on the image of the back of the page

These pages can be removed manually before the final documents are created from the scanned images. As this involves extra work for the people doing the scanning, check whether manual removal is included if you require it and, if not, what additional costs may be involved. Removing blank pages after the final documents have been produced is a manual, time-consuming process.
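The white-percentage test described above can be sketched as follows. This is a minimal illustration, not how any particular scanning product works: the page is assumed to be a grid of greyscale values (0 = black, 255 = white), and the function names and both threshold values are hypothetical.

```python
def white_fraction(pixels, white_level=200):
    """Return the share of pixels at or above white_level (0.0 to 1.0).

    pixels is a list of rows, each row a list of greyscale values 0-255.
    """
    total = 0
    white = 0
    for row in pixels:
        for value in row:
            total += 1
            if value >= white_level:
                white += 1
    if total == 0:
        raise ValueError("page has no pixels")
    return white / total


def is_blank(pixels, min_white_fraction=0.99, white_level=200):
    """Treat the page as blank when enough of it is white.

    Both thresholds are illustrative assumptions, not recommended settings.
    """
    return white_fraction(pixels, white_level) >= min_white_fraction


if __name__ == "__main__":
    clean_page = [[255] * 100 for _ in range(100)]

    # Simulate bleed-through: 3 of 100 rows carry grey marks from the other side.
    bleed_page = [[255] * 100 for _ in range(100)]
    for row_index in range(3):
        bleed_page[row_index] = [120] * 100

    print(is_blank(clean_page))  # detected as blank
    print(is_blank(bleed_page))  # not detected, so it needs manual removal
```

The second page shows why bleed-through, dirt or show-through defeats a fixed threshold: it is only 97% white, so the test keeps it even though it carries no content.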

Optical character recognition (OCR) can be used to provide the actual text that appears in a document. This is typically provided as a hidden text layer in the PDF document or as a separate text file. OCR software can also attempt to replicate the formatting of the original document including font types and styles, and page formatting.

The accuracy of OCR for well-printed documents can be in the range of 99-100% but will decrease as print quality decreases. OCR results are usually not corrected, as correction is a very time-consuming (and thus expensive) process.
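To give a feel for what these percentages mean, the sketch below estimates the number of wrong characters on a page. It assumes that accuracy means the share of characters recognised correctly and that a page holds about 2,000 characters. Both are assumptions made for the example, and the function name is hypothetical.

```python
def expected_ocr_errors(characters: int, accuracy: float) -> float:
    """Estimate how many characters are wrong at a given accuracy (0.0 to 1.0)."""
    if characters < 0:
        raise ValueError("characters must not be negative")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0.0 and 1.0")
    return characters * (1.0 - accuracy)


if __name__ == "__main__":
    chars_per_page = 2000  # assumed figure for a full page of text
    for accuracy in (1.0, 0.99, 0.95):
        errors = expected_ocr_errors(chars_per_page, accuracy)
        print(f"{accuracy:.0%} accuracy: about {errors:.0f} errors per page")
```

Under these assumptions, even 99% accuracy leaves roughly 20 errors on each page, which helps explain why manual correction across a large batch is expensive.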

Forms and tables can also be recognised, with the possibility of creating Excel spreadsheets from the data, or extracting the data from forms for entry into a database. The design of the form can play a major role in the accuracy and reliability of the extracted data and should be treated as a key part of the form design process.

As clever as scanning software can be, it still works on a fixed set of parameters which can, at times, produce undesirable results. Additional manual checking is still required to verify that everything has worked as expected.

Typical quality assurance (QA) steps include checking that:

all pages are legible, especially where text that needs to be read sits on a shaded background (problem pages are rescanned with custom settings)
documents have been correctly separated (the number of documents produced matches the number expected)
all "blank" pages have been removed, manually deleting any pages that have been missed

External service providers may charge separately for some of these additional steps as they require additional labour. If you are using a service provider, check which QA steps are included by default.
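Two of the QA checks listed above (document count and leftover blank pages) lend themselves to a simple automated pass, sketched below. Legibility still needs a person to look at the pages. The data layout (a list of documents, each a list of pages), the names QAReport and run_qa, and the caller-supplied looks_blank test are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class QAReport:
    document_count_ok: bool
    # (document index, page index) of each page that still looks blank
    remaining_blank_pages: List[Tuple[int, int]] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        return self.document_count_ok and not self.remaining_blank_pages


def run_qa(
    documents: Sequence[Sequence[Any]],
    expected_documents: int,
    looks_blank: Callable[[Any], bool],
) -> QAReport:
    """Check document separation and flag pages that still look blank."""
    blanks = [
        (doc_index, page_index)
        for doc_index, pages in enumerate(documents)
        for page_index, page in enumerate(pages)
        if looks_blank(page)
    ]
    return QAReport(len(documents) == expected_documents, blanks)


if __name__ == "__main__":
    # Pages are stood in for by their OCR text; an empty string means no text found.
    batch = [["Invoice 1", ""], ["Contract"]]
    report = run_qa(batch, expected_documents=2, looks_blank=lambda text: not text.strip())
    print(report.document_count_ok)      # the batch split into the expected 2 documents
    print(report.remaining_blank_pages)  # pages to review and delete by hand
```

Flagged pages are reported rather than deleted, so a person still makes the final call on each one.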