Intelligent Document Processing (IDP), a general term for the end-to-end digitalization of document-centric processes, comprises three main components:
- Optical Character Recognition (OCR), which generates machine-readable text from images
- Data Extraction, which turns unstructured data (OCR-generated or extracted from a PDF) into structured key-value pairs
- Process Automation, which automates or facilitates validation and system entry of the structured data
In this article we will focus on component #2, data extraction.
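To make the goal of data extraction concrete, here is a toy illustration of turning raw OCR text into structured key-value pairs. The field names, label patterns, and sample invoice are hypothetical, and real extraction engines are far more robust than these regular expressions:

```python
import re

# Hypothetical OCR output for a simple invoice
raw_text = """ACME Supplies Ltd.
Invoice No: INV-2041
Date: 2024-03-15
Total Due: 1,250.00 EUR"""

def extract_fields(text: str) -> dict:
    """Pull a few labeled values out of free-form invoice text."""
    patterns = {
        "invoice_number": r"Invoice No:\s*(\S+)",
        "date": r"Date:\s*(\S+)",
        "total": r"Total Due:\s*([\d.,]+)",
    }
    fields = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        if match:
            fields[key] = match.group(1)
    return fields

print(extract_fields(raw_text))
# {'invoice_number': 'INV-2041', 'date': '2024-03-15', 'total': '1,250.00'}
```

The output, a dictionary of key-value pairs, is what downstream process automation consumes.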
There are, generally speaking, two approaches to document data extraction: template-centric and AI-centric.
Template-centric data extraction instructs the machine to isolate certain sections of text based on their position and their proximity to anchor keywords. The operator has to create a template for every group of similarly structured documents (e.g. invoices from the same supplier).
Pros:
- High reliability of data extraction for static documents
- Relatively low computational intensity

Cons:
- Considerable effort is required to set up the template library
- The template library requires active management to remain up to date
- Changes in document layouts lead to false positive results
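The anchor-keyword idea above can be sketched in a few lines. This is a minimal illustration, assuming OCR output in the form of tokens with (x, y) page coordinates; the template, token data, and field names are hypothetical:

```python
# Template-centric sketch: for each field, the template names an anchor
# keyword; the value is the nearest token to the right of that anchor
# on (roughly) the same line.

def find_value_near_anchor(tokens, anchor, max_dx=200, max_dy=10):
    """Return the token text just right of the anchor keyword, same line."""
    for text, x, y in tokens:
        if text == anchor:
            candidates = [
                (tx, t) for t, tx, ty in tokens
                if tx > x and tx - x <= max_dx and abs(ty - y) <= max_dy
            ]
            if candidates:
                return min(candidates)[1]  # nearest candidate to the right
    return None

# Hypothetical OCR tokens: (text, x, y) in page coordinates
tokens = [
    ("Invoice", 50, 100), ("No:", 110, 100), ("INV-2041", 160, 100),
    ("Total:", 50, 400), ("1,250.00", 160, 400),
]

# One template per document layout: field name -> anchor keyword
template = {"invoice_number": "No:", "total": "Total:"}
extracted = {field: find_value_near_anchor(tokens, anchor)
             for field, anchor in template.items()}
print(extracted)  # {'invoice_number': 'INV-2041', 'total': '1,250.00'}
```

If the supplier moves the "Total:" label elsewhere on the page, this template silently matches the wrong token or nothing at all, which is exactly why template libraries need active maintenance.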
Template-centric data extraction was the first practical approach to processing digitized documents at scale. Historically, it enabled intelligent document processing, and it remains widely used in enterprise applications.
AI-centric data extraction is a more recent approach, focused on using machine learning techniques to leverage data relationships within a document. Deep neural networks are most commonly used for this purpose, although other algorithms, such as random forests or support vector machines (SVMs), can also deliver good results.
AI-centric approaches evaluate a range of features for each data token (data type, text size, text color, position, aligned tokens, etc.) to match token values with relevant labels. The algorithm returns the key-value pairs with the highest confidence levels.
Pros:
- No template set-up or maintenance required
- A trained model can be scaled across multiple users, further multiplying its learning potential

Cons:
- Requires a sizable initial training dataset
- Initial model training is computationally intensive
- Requires substantial machine learning competencies
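The feature-and-confidence idea behind the AI-centric approach can be sketched as follows. In practice a trained model (neural network, random forest, SVM) learns the weights from labeled documents; the hand-set weights, features, and tokens below are illustrative only:

```python
# Toy sketch of AI-centric matching: score each candidate token against
# each label using token features, then keep the highest-confidence match.

def features(label, token):
    """Simple binary features relating a candidate token to a label."""
    return {
        "type_matches": (label == "total" and token["is_numeric"])
                        or (label == "date" and token["is_date"]),
        "near_keyword": token["near_keyword"] == label,
        "right_aligned": token["right_aligned"],
    }

# Hand-set weights standing in for what a trained model would learn
WEIGHTS = {"type_matches": 0.5, "near_keyword": 0.35, "right_aligned": 0.15}

def confidence(label, token):
    feats = features(label, token)
    return sum(WEIGHTS[name] for name, on in feats.items() if on)

# Hypothetical tokens with pre-computed features
tokens = [
    {"text": "1,250.00", "is_numeric": True, "is_date": False,
     "near_keyword": "total", "right_aligned": True},
    {"text": "2024-03-15", "is_numeric": False, "is_date": True,
     "near_keyword": "date", "right_aligned": False},
]

for label in ("total", "date"):
    best = max(tokens, key=lambda t: confidence(label, t))
    print(label, best["text"], round(confidence(label, best), 2))
```

Because every candidate pair gets a confidence score, low-confidence extractions can be routed to a human for validation instead of being silently accepted.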
AI-centric data extraction is rapidly gaining popularity due to its versatility and scalability, with multiple providers recently emerging in the market to satisfy the growing demand. Commercial providers are adding value along several dimensions:
Implementing machine learning algorithms
Providing pre-trained models for selected applications
Providing cloud infrastructure for training and operating the models
Using a specialized provider effectively alleviates the drawbacks of AI-centric data extraction, although this of course comes at a price. Even so, the cost of an AI-centric solution today is competitive with that of a template-centric one.
Major cloud providers (Google, Amazon, Microsoft) all offer AI-centric data extraction models, both general-purpose and specialized for use cases such as invoice, receipt, ID or driver's license data extraction. Some providers build on top of these models, while others offer proprietary solutions.
The provider landscape for data extraction solutions is very heterogeneous, comprising independent focused vendors (e.g. Nanonets, Taggun, Mindee), cloud computing players (e.g. Google, Amazon, Microsoft), full-stack IDP providers (e.g. Kofax, Rossum, ABBYY), RPA vendors (e.g. UiPath, Automation Anywhere) and automation solution providers (e.g. F-ONE, Nividous). We believe customers are well-advised to choose the AI-centric approach to data extraction, and to consider their specific solution requirements when selecting a vendor.