Business document scanning

 

Business document scanning is a generic term for scanning large numbers of single sheets of paper. While the documents scanned are often business documents (e.g. invoices, contracts and minutes), the term covers any type of document where the sheets of paper are all separate. The main aim of this process is to produce a legible digital version of a document in the smallest possible file size.

Standards

The minimum resolution generally accepted for business document scanning is 200 dpi, although higher resolutions can be used without significant increases in file size.

Scanning of business records should follow the PROS standards as a minimum:

PROS 19/05 Digitisation: Image Requirement

File formats

PDF is the most commonly used file format as it combines highly compressed images with OCR text in a widely accessible format.

While most scanning software can output multi-page TIFF images, this is generally not advisable as this format is not widely supported for viewing.

Scanned documents intended for use as permanent electronic records are required to use a format compliant with PROV's recordkeeping standards.

Document splitting

Business document scanners typically scan large numbers of pages at a time, usually encompassing several documents. Individual pages are re-grouped into their corresponding documents by the use of separators.

These are usually in the form of:

  • a separator sheet containing a specific barcode pattern around the page inserted between documents
  • a blank piece of paper inserted between documents
  • a barcode sticker attached to the first page of a document
  • or the presence of a specific string of text that only appears on the first page of each document (detected by OCR)
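The regrouping step can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular scanning product works: the function name split_into_documents, the is_separator callback and the sample page labels are all hypothetical. Separator sheets and blank pages are discarded, while a barcode sticker or OCR text marker sits on the first page of a document, so that page is kept.

```python
def split_into_documents(pages, is_separator, keep_separator=False):
    """Group a flat run of scanned pages into documents.

    pages          -- pages in scan order (any objects)
    is_separator   -- hypothetical callback returning True for a separator page
    keep_separator -- False for separator sheets or blank pages (discarded),
                      True when the marker is on the first page of a document
    """
    documents = []
    current = []
    for page in pages:
        if is_separator(page):
            # A separator always closes the document collected so far.
            if current:
                documents.append(current)
            # Start the next document, keeping the page only if it is real content.
            current = [page] if keep_separator else []
        else:
            current.append(page)
    if current:
        documents.append(current)
    return documents


# Separator sheets inserted between documents (sheets are dropped).
batch = ["a1", "a2", "SEP", "b1", "SEP", "c1", "c2"]
print(split_into_documents(batch, lambda p: p == "SEP"))
# [['a1', 'a2'], ['b1'], ['c1', 'c2']]

# Marker on the first page of each document (page is kept).
marked = ["*a1", "a2", "*b1", "*c1", "c2"]
print(split_into_documents(marked, lambda p: p.startswith("*"), keep_separator=True))
# [['*a1', 'a2'], ['*b1'], ['*c1', 'c2']]
```

In this sketch, any separator that is_separator fails to recognise simply leaves two documents joined together.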

A poor-quality separator can result in documents not being separated correctly.

Typical problems include:

  • Poor quality photocopies of separator sheets
  • Printers running low on toner while printing separator sheets or barcode labels
  • The use of recycled photocopies as blank pages

File naming

Files are typically named with a sequential number to keep them in order.  File names can be specified but increasing the complexity of file naming may increase the amount of quality assurance work required (and potentially the cost), especially if the file naming has to be done manually after scanning has been completed.
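As a small illustration of sequential naming, the sketch below generates zero-padded file names. The prefix, padding width and extension are assumptions chosen for the example, not a required naming scheme. Zero-padding matters because it keeps files in the same order when they are sorted alphabetically.

```python
def sequential_names(count, prefix="doc", width=4, extension=".pdf", start=1):
    """Return `count` file names numbered from `start`, zero-padded to `width` digits."""
    return [f"{prefix}_{n:0{width}d}{extension}" for n in range(start, start + count)]


names = sequential_names(3)
print(names)  # ['doc_0001.pdf', 'doc_0002.pdf', 'doc_0003.pdf']

# Without padding, alphabetical order no longer matches numeric order.
unpadded = [f"doc_{n}.pdf" for n in (2, 10)]
print(sorted(unpadded))  # ['doc_10.pdf', 'doc_2.pdf']
```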

Matching specific names to specific documents may require additional programming, but this may be practical for larger jobs. If you require a specific naming structure, contact your service provider for advice before commencing the project.

Document preparation

Preparing records for scanning (corporate, business and research)
Almost any document being readied for scanning will need to be prepared in some way.

Metal objects such as staples and pins can scratch the glass of the scanner, causing permanent streaks in subsequent scans. Fixing this may require the replacement of the glass.

For more information please read the document preparation guide.

Blank page removal

Business document scanning software can automatically remove "blank" pages, typically by treating a page as blank when at least a set percentage of it is white.
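The white-percentage test can be sketched as follows. This is an illustrative simplification, not the algorithm of any specific product: the page is assumed to be a greyscale image held as rows of 0-255 pixel values, and the names white_fraction and is_blank, the white level of 200 and the 99% threshold are all assumptions made for the example.

```python
def white_fraction(pixels, white_level=200):
    """Return the fraction of pixels at or above white_level (0-255 greyscale)."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        raise ValueError("image has no pixels")
    white = sum(1 for row in pixels for value in row if value >= white_level)
    return white / total


def is_blank(pixels, min_white_fraction=0.99, white_level=200):
    """Treat a page as blank when enough of it is white (assumed thresholds)."""
    return white_fraction(pixels, white_level) >= min_white_fraction


# A clean 100 x 100 page, and a copy with a 10 x 30 pixel dark mark (3% of the page).
clean_page = [[255] * 100 for _ in range(100)]
marked_page = [row[:] for row in clean_page]
for y in range(10, 40):
    for x in range(10, 20):
        marked_page[y][x] = 90

print(is_blank(clean_page))   # True
print(is_blank(marked_page))  # False: only 97% of the page is white
```

Lowering the threshold removes more pages with marks on them but increases the risk of discarding pages that carry real content, so the setting is a trade-off.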

Some blank pages may not be detected for a number of reasons:

  • Ink bleeding through the paper, e.g. from stamps or marker pens
  • Dirty pages
  • Thin paper resulting in reversed text appearing on the image of the back of the page

These pages can be removed manually before the final documents are created from the scanned images. As this involves extra work for the people doing the scanning, you should check whether it is included if you require it and, if not, what additional costs may be involved. Removing blank pages after the final documents have been produced is a manual, time-consuming process.

OCR and data extraction

Optical character recognition (OCR) can be used to provide the actual text that appears in a document. This is typically provided as a hidden text layer in the PDF document or as a separate text file.  OCR software can also attempt to replicate the formatting of the original document including font types and styles, and page formatting.

The accuracy of OCR for well-printed documents can be in the range of 99-100% but will decrease as print quality decreases. OCR results are usually not corrected, as this is a very time-consuming (and thus expensive) process.
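To put these accuracy figures in perspective, the arithmetic can be sketched as below. The figure of 3,000 characters per page is an illustrative assumption only, and the function name expected_errors is hypothetical; actual character counts vary widely between documents.

```python
def expected_errors(characters, accuracy):
    """Estimate the number of misread characters at a given character accuracy (0-1)."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return round(characters * (1.0 - accuracy))


chars_per_page = 3000  # assumed for illustration
print(expected_errors(chars_per_page, 0.99))        # 30 errors on one page
print(expected_errors(chars_per_page * 500, 0.99))  # 15000 errors across 500 pages
```

Even at 99% accuracy, the number of characters to find and fix grows quickly with the size of the job, which is why manual correction is rarely practical.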

Forms and tables can also be recognised, making it possible to create Excel spreadsheets from tabular data or to extract data from forms for entry into a database. The design of the form can play a major role in the accuracy and reliability of the extracted data, so data extraction should be considered a key part of the form design process.

Quality control

As clever as scanning software can be, it still works on a fixed set of parameters which can, at times, produce undesirable results. Additional manual checking is still required to verify that everything has worked as expected.

Typical quality assurance (QA) steps include checking that:

  • all pages are legible, especially where text that needs to be read appears on a shaded background (problem pages are rescanned with custom settings)
  • documents have been correctly separated (the number of documents produced matches the number expected)
  • all "blank" pages have been removed, manually deleting any pages that have been missed
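The document count check in the list above lends itself to a simple automated comparison, sketched below. This is an assumption-laden illustration: the function name qa_issues is hypothetical, and each document is represented simply as a list of its pages. Legibility still has to be checked by a person.

```python
def qa_issues(documents, expected_count):
    """Return a list of problems found in a batch; an empty list means none found.

    documents      -- list of documents, each a list of pages
    expected_count -- number of documents that were sent for scanning
    """
    issues = []
    if len(documents) != expected_count:
        issues.append(f"expected {expected_count} documents, found {len(documents)}")
    for index, pages in enumerate(documents, start=1):
        if not pages:
            issues.append(f"document {index} has no pages")
    return issues


# Two documents expected, but a missed separator has merged them into one.
print(qa_issues([["p1", "p2", "p3"]], expected_count=2))
# ['expected 2 documents, found 1']
```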

External service providers may charge separately for some of these additional steps as they require additional labour. If using service providers you should check what QA steps are included by default.

Electronic document and records management systems (EDRMS)

Business document scanning can be integrated with an EDRMS (e.g. TRIM or SharePoint). This integration requires additional development work for both the scanning system and the EDRMS, but it can deliver significant efficiencies in certain applications.