Template-centric vs AI-centric Data Extraction

Intelligent Document Processing, which is a general term for end-to-end digitalization of document-centric processes, comprises three main components:

  1. Optical Character Recognition, which generates machine-readable text from images

  2. Data Extraction, which turns unstructured data (OCR-generated or extracted from a PDF) into structured key-value pairs

  3. Process Automation, which automates or facilitates validation and system entry of structured data

In this article we will focus on component #2, data extraction.


There are, generally speaking, two approaches to document data extraction: template-centric and AI-centric.


In template-centric data extraction, an operator instructs the machine to isolate specific sections of text based on their position on the page and their proximity to anchor keywords. The operator has to create a template for every group of similarly structured documents (e.g. invoices from the same supplier).
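To make the idea concrete, here is a minimal sketch of anchor-based extraction in Python. It is an illustration only, not how any particular product works: the `FieldRule` class, the `SUPPLIER_A_TEMPLATE` list, the anchor keywords and the sample text are all hypothetical, and the sketch uses proximity to an anchor keyword alone, ignoring page coordinates.

```python
import re
from dataclasses import dataclass


@dataclass
class FieldRule:
    name: str     # key in the structured output
    anchor: str   # keyword expected directly before the value
    pattern: str  # regular expression the value must match


# Hypothetical template for one supplier's invoice layout.
SUPPLIER_A_TEMPLATE = [
    FieldRule("invoice_number", "Invoice No", r"[A-Z0-9-]+"),
    FieldRule("total", "Total", r"\d+(?:[.,]\d{2})?"),
]


def extract_with_template(text, template):
    """Return a dict of field name -> value found right after each anchor."""
    result = {}
    for rule in template:
        # Anchor keyword, optional colon and whitespace, then the value.
        regex = re.compile(
            re.escape(rule.anchor) + r"\s*:?\s*(" + rule.pattern + ")"
        )
        match = regex.search(text)
        result[rule.name] = match.group(1) if match else None
    return result


# Text as it might come out of OCR for the layout the template was built for.
sample = "ACME Ltd\nInvoice No: INV-2041\nTotal: 1180.50\n"
fields = extract_with_template(sample, SUPPLIER_A_TEMPLATE)

# The same supplier after a hypothetical layout change.
changed = "ACME Ltd\nInvoice No: INV-2041\nTotal 3 items\nAmount payable: 1180.50\n"
fields_after_change = extract_with_template(changed, SUPPLIER_A_TEMPLATE)
```

On the original layout, `fields` contains the invoice number and the total. On the changed layout, the same rule picks up "3" from "Total 3 items" as the total, which is the kind of silent mismatch a template cannot detect by itself.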


Pros:

  • High reliability of data extraction for static documents

  • Relatively low computational intensity

Cons:

  • Considerable effort is required to set up the template library

  • Template library requires active management to remain up-to-date

  • Changes in document layout cause the template to return wrong values (false positives)

Template-centric data extraction was the first practical approach to processing digitized documents at scale. Historically, it enabled intelligent document processing, and it remains widely used in enterprise applications.


AI-centric data extraction is a more recent approach, focused on using machine learning techniques to leverage data relationships within a document. Neural networks and deep learning algorithms are most commonly used for this purpose, although other algorithms such as random forests or SVMs can be used with good results, too.


AI-centric approaches evaluate a range of features for each data token (data type, text size, text color, position, alignment with other tokens, etc.) and use them to match token values with the relevant labels. The algorithm returns the key-value pairs with the highest confidence levels.
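The scoring step can be sketched in a few lines of Python. This is a toy illustration under loud assumptions: the `Token` class, the chosen features, and above all the weights in `TOTAL_WEIGHTS` are hypothetical values written by hand, standing in for parameters that a real model would learn from a training dataset. A simple logistic scorer is used here in place of a neural network.

```python
import math
import re
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float          # horizontal position, 0 (left) to 1 (right)
    y: float          # vertical position, 0 (top) to 1 (bottom)
    font_size: float  # in points


def features(token):
    """Turn one token into a numeric feature vector."""
    looks_like_amount = re.fullmatch(r"\d+[.,]\d{2}", token.text) is not None
    return [
        1.0 if looks_like_amount else 0.0,  # data type
        token.x,                            # position
        token.y,
        token.font_size / 20.0,             # text size, roughly scaled
    ]


# Hypothetical hand-set weights for the label "total"; a trained model
# would learn these from labeled documents.
TOTAL_WEIGHTS = [4.0, 1.0, 2.0, 1.0]
TOTAL_BIAS = -5.0


def confidence(token, weights, bias):
    """Logistic score in (0, 1) for 'this token is the value of the label'."""
    z = bias + sum(w * f for w, f in zip(weights, features(token)))
    return 1.0 / (1.0 + math.exp(-z))


def best_candidate(tokens, weights, bias):
    """Return (confidence, token) for the highest-scoring token."""
    scored = [(confidence(t, weights, bias), t) for t in tokens]
    return max(scored, key=lambda pair: pair[0])


tokens = [
    Token("Invoice", 0.1, 0.05, 18),
    Token("2041", 0.8, 0.10, 10),
    Token("120.00", 0.9, 0.50, 10),
    Token("1180.50", 0.9, 0.90, 12),
]
score, winner = best_candidate(tokens, TOTAL_WEIGHTS, TOTAL_BIAS)
```

With these made-up weights, the amount-like token near the bottom right of the page scores highest for the "total" label. No supplier-specific rule is involved: the same scorer applies to any layout, and its quality depends entirely on how well the weights were learned.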


Pros:

  • No template set-up or maintenance required

  • Trained model can be scaled across multiple users, further multiplying its learning potential

Cons:

  • Requires a sizable initial training dataset

  • Initial model training is computationally intensive

  • Requires substantial machine learning competencies

AI-centric data extraction is rapidly gaining popularity due to its versatility and scalability, with multiple providers recently emerging in the market to satisfy the growing demand. Commercial providers are adding value along several dimensions:

  • Implementing machine learning algorithms

  • Providing pre-trained models for selected applications

  • Providing cloud infrastructure for training and operating the models

Using a specialized provider effectively alleviates the drawbacks of AI-centric document extraction, although of course it comes at a price. However, even today the cost of an AI-centric solution is competitive compared to a template-centric one.


Major cloud providers (Google, Amazon, Microsoft) all offer AI-centric data extraction models, both general-purpose and specialized for use cases such as invoice, receipt, ID or driver's license data extraction. Some providers build on top of these models, while others offer proprietary solutions.

Data values automatically extracted from a services invoice by Google Procurement Document AI.

The provider landscape for data extraction solutions is very heterogeneous, comprising independent focused vendors (e.g. Nanonets, Taggun, Mindee), cloud computing players (e.g. Google, Amazon, Microsoft), full-stack IDP providers (e.g. Kofax, Rossum, ABBYY), RPA vendors (e.g. UiPath, Automation Anywhere) and automation solution providers (e.g. F-ONE, Nividous). We believe that customers are well-advised to choose the AI-centric approach to data extraction, and to consider their specific solution requirements when selecting a vendor.

