21.09.2020

Today we’d like to look into the problem of post-processing text field recognition results based on a priori knowledge about the text field. A significant part of important documents, including identification documents, consists of fields of different kinds: dates, numbers, car VIN numbers, TIN and social security numbers, machine-readable fields with their checksums, etc. Although these cannot be classified as natural-language fields, oftentimes there is still some implicit linguistic model that can be applied to them, which means some correction algorithms will still be applicable here. In this publication, we’ll focus on two methods of recognition result post-processing that can be used for a wide variety of documents and many different text fields.

The language model of a text field can be divided into three components:

- Syntax: rules that regulate the structure of a text field representation. For example, the “expiration date” field is presented as seven digits “DDDDDDD” in a machine-readable zone.
- Field semantics: rules based on the subject matter interpretation of a field and its components. For example, the first two digits of the “expiration date” field in the machine-readable zone are the last two digits of the year, the third and the fourth digits are the month, the fifth and the sixth digits are the date, and the seventh digit is the checksum.
- Relationship semantics: rules based on structural and semantic relationships of the field with other text fields of a document. For example, the “expiration date” field must not encode a date earlier than the date in the “date of birth” field.

With all the information about the semantic and syntactic structure of a document and a recognized text field, we can build a specialized algorithm for recognition result post-processing. However, taking into consideration the complexity of developing recognition systems and the need to maintain and advance them, it makes sense to consider “universal” methods and tools that allow building a decent post-processing algorithm with minimal effort from the engineers, one that works for a wide variety of documents and text fields. The configuration and support method for such an algorithm would then be uniform, and the only adjustable component of its structure would be the language model.

We must note that post-processing of text field recognition results is not always useful; on the contrary, it can even be damaging: if you have a good enough text field recognition module and you work with identification documents in critical systems, you’d want to minimize post-processing and present the results “as is”, or carefully monitor any changes and let the client know about them. But in many cases, when we know that the document is valid and the language model of the text field is reliable, a post-processing algorithm allows for a considerable improvement of the final recognition quality.

In our article, we’ll cover two post-processing methods that claim to be universal. The first method is based on weighted finite-state transducers and requires the language model to be presented as a finite-state machine. It’s not very easy to use, but it’s quite versatile. The second method is easier to use and more efficient: it only requires programming a function that checks the validity of a text field, but it also has a number of flaws.

**The method based on weighted finite-state transducers**

Let’s first describe a beautiful and rather general model that allows us to build a universal post-processing algorithm. This model relies on weighted finite-state transducers (WFSTs).

A WFST is a generalization of a weighted finite-state automaton: if the latter encodes some weighted language (i.e., a weighted set of strings over some alphabet A), then a WFST encodes a weighted mapping from a language over an alphabet A into a language over an alphabet B. When a weighted finite-state automaton takes a string s over the alphabet A, it assigns some weight w to this string, while a WFST takes a string s over the alphabet A and matches it with a set of pairs (t, w) (possibly an infinite set), where t is a string over the alphabet B produced as a result of the conversion of s, and w is the weight of this conversion. Each transition of a transducer is marked by two states (the transition is realized between them), an input character (from the alphabet A), an output character (from the alphabet B), and the transition weight. An empty character ε (an empty string) is considered to be an element of both alphabets. The weight of the conversion of a string s into a string t is the sum, over all matching paths, of the products of the transition weights along the path, where the concatenation of the input characters forms the string s, the concatenation of the output characters forms the string t, and the path takes the transducer from its initial state to one of its terminal states.


For WFSTs, a composition operation is defined, on which the post-processing method is based. Suppose we are given two transducers T1 and T2, where T1 converts a string s over the alphabet A into a string t over the alphabet B with a weight w1, and T2 converts the string t over the alphabet B into a string u over the alphabet C with a weight w2. Then the transducer T1 ∘ T2, referred to as the composition of the transducers, transforms the string s into the string u with the weight w1 · w2. WFST composition is a computationally expensive operation, but there is a “lazy” way to perform it: the states and transitions of the resulting transducer can be built right when they need to be accessed.
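To make the composition operation concrete, here is a minimal sketch (our illustration, not the authors’ implementation), assuming an epsilon-free WFST is represented as a plain triple (transitions, initial state, final states), with each transition a tuple (src, dst, in_char, out_char, weight):

```python
def compose(t1, t2):
    """Naive WFST composition: states of the result are pairs of states,
    t1's output symbol must match t2's input symbol, weights multiply."""
    trans1, init1, finals1 = t1
    trans2, init2, finals2 = t2
    transitions = []
    for (s1, d1, a, b, w1) in trans1:
        for (s2, d2, b2, c, w2) in trans2:
            if b == b2:  # t1 writes what t2 reads
                transitions.append(((s1, s2), (d1, d2), a, c, w1 * w2))
    finals = {(f1, f2) for f1 in finals1 for f2 in finals2}
    return (transitions, (init1, init2), finals)
```

For example, composing a transducer mapping “a” → “b” with weight 0.5 and a transducer mapping “b” → “c” with weight 0.4 yields a transducer mapping “a” → “c” with weight 0.2. This naive version enumerates all pairs of transitions up front; the “lazy” variant mentioned above would generate the pair-states only on demand.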

A WFST-based algorithm for post-processing the recognition results relies on three main sources of data: a hypothesis model H, an error model E, and a language model L. All three models are represented as weighted finite-state transducers.

- The hypothesis model H is a weighted finite-state automaton (a special case of a WFST in which each string is converted to itself and the transition weights encode the string weight) that accepts the hypotheses proposed by the field recognition system. For example, let’s consider that the text field recognition result is a sequence of N cells C_1, …, C_N, where each cell corresponds to some character recognition result and is a set of alternatives with their estimations (weights): C_i = {(a_j, q_{ij})}, where a_j is the j-th character of the alphabet and q_{ij} is the estimation (weight) of this character in the i-th cell. This means that the hypothesis model can be presented as a finite-state transducer with N + 1 states in a chain-like order: the 0-th state is the initial state, the N-th state is the terminal state, and the transitions between the (i−1)-th and the i-th states correspond to the cell C_i. At such a transition, the input and output character is a_j, and the transition weight is the numerical estimation q_{ij} of this alternative. The figure below demonstrates an example of a hypothesis model for a three-character string recognition result.

Figure 1. Example of a hypothesis model presented as a WFST (the image is from this article). Transitions with zero probability are not shown.

- The error model E is a weighted finite-state transducer that encodes hypothesis transformations which can be made in order to align the hypotheses with the language model. In the simplest case, it can be represented as a WFST with a single state (which is both the initial and the terminal state). Each transition encodes a single-character transformation with a respective weight. For example, the weight of a character substitution can be an estimate of the probability of the corresponding error of the character recognition algorithm.
- The language model L is a representation of the recognized field’s syntax and semantics as a weighted finite-state automaton. In other words, the language model is an automaton that accepts all the valid values of the recognized text field (and assigns some weight to each valid value).
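As an illustration, the chain-shaped hypothesis model described above can be built as follows (a sketch, assuming a WFST is represented as a plain triple (transitions, initial state, final states) with transitions (src, dst, in_char, out_char, weight)):

```python
def hypothesis_model(cells):
    """Builds the chain-shaped acceptor for a recognition result: cells is a
    list of dicts mapping each character alternative to its estimate; state
    i-1 connects to state i with one transition per alternative of the i-th
    cell, carrying the character as both input and output symbol."""
    transitions = []
    for i, cell in enumerate(cells, start=1):
        for char, weight in cell.items():
            transitions.append((i - 1, i, char, char, weight))
    return (transitions, 0, {len(cells)})
```

For a two-cell result such as `[{'O': 0.7, '0': 0.3}, {'1': 0.9, 'I': 0.1}]`, this produces a three-state chain with four weighted transitions.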

After we define the hypothesis model H, the error model E, and the language model L, we can state the objective of recognition result post-processing as follows: consider the composition T = H ∘ E ∘ L of all three models (in terms of WFST composition). The transducer T encodes various transformations of a string from the hypothesis model H into a string from the language model L by means of the conversions encoded in the error model E. The weight of such a conversion consists of the initial hypothesis weight, the conversion weight, and the resulting string weight (when we have a weighted language model). The optimal recognition result post-processing in such a model corresponds to the optimal path (in terms of weight) in the transducer T from the initial state to one of the terminal states. The input string along this path is the selected initial hypothesis, and the output string is the corrected recognition result. The optimal path can be found using shortest-path algorithms for directed graphs.
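The path search itself can be sketched with a Dijkstra-style traversal (a simplified illustration, not the production approach: it assumes multiplicative weights in (0, 1], where a larger product is better, and a transducer given as a triple (transitions, initial state, final states) with transitions (src, dst, in_char, out_char, weight)):

```python
import heapq

def best_path(transducer):
    """Finds the highest-weight accepting path.  Negating the weights turns
    the max-product search into a min-heap traversal; multiplying by a
    factor <= 1 never improves a path, so the Dijkstra invariant holds."""
    transitions, init, finals = transducer
    adj = {}
    for (s, d, a, b, w) in transitions:
        adj.setdefault(s, []).append((d, a, b, w))
    heap = [(-1.0, init, '', '')]  # (negated weight, state, input so far, output so far)
    settled = set()
    while heap:
        neg_w, state, sin, sout = heapq.heappop(heap)
        if state in settled:
            continue
        settled.add(state)
        if state in finals:
            return sin, sout, -neg_w
        for (d, a, b, w) in adj.get(state, []):
            if d not in settled:
                heapq.heappush(heap, (neg_w * w, d, sin + a, sout + b))
    return None
```

In practice one would run such a search over the lazily built composition T, expanding its states on demand rather than materializing the whole transducer.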

The advantage of this approach is its generality and versatility. For example, the error model can easily be expanded to take character deletions and insertions into account (to do that, we just need to add transitions with an empty input or output character to the error model). However, this model has some significant flaws as well. First of all, the language model has to be presented here as a weighted finite-state automaton. For complex languages, this automaton can be rather bulky, and it has to be rebuilt whenever the language model is changed or refined in any way. We should also point out that the composition of the three transducers results in an even bulkier transducer, and this composition is computed anew every time we post-process a recognition result. Due to the cumbersome nature of the composition, in practice the optimal path search is performed using heuristic approaches, such as the A* search algorithm.

**Approach based on validation grammars**

Using the validation grammar model, we can build a simpler approach to the post-processing problem that is not as general as the WFST-based one but is easy to expand and implement.

Let’s define a validation grammar as a pair (A, P), where A is an alphabet and P is a predicate over strings in the alphabet A, i.e., P : A* → {true, false}. The validation grammar encodes a language over the alphabet A as follows: a string belongs to the language if and only if the predicate P takes the value true on this string. We should note that a validation grammar is a more general way to represent the language model than a finite-state automaton. In fact, any language represented by a finite-state automaton M can be represented as a validation grammar (with the predicate P(s) = [s ∈ L(M)], where L(M) is the set of strings accepted by the automaton M). But not every language that can be represented by a validation grammar can, in general, be represented by a finite-state automaton (for example, the language of words of unrestricted length over the alphabet {a, b} in which the number of characters a is higher than the number of characters b).

Let’s consider a recognition result (a hypothesis model) as a sequence of N cells C_1, …, C_N (the same as earlier). For convenience, we assume that each cell contains K alternatives and that the estimation values of the alternatives are positive. The estimation (weight) of a string is the product of the estimations of all the characters in this string. If the language model is presented as a validation grammar (A, P), then the post-processing problem can be defined as a discrete optimization (maximization) problem over the set of controls A^N (the set of all strings of length N over the alphabet A) with the predicate P as a constraint and the functional F(s) = q_1(s_1) · q_2(s_2) · … · q_N(s_N), where q_i(s_i) is the estimation of the character s_i in the i-th cell.

Any discrete optimization problem (i.e., a problem with a finite set of controls) can be solved by a brute-force search. The algorithm below goes through the controls (strings) in descending order of the functional value until the predicate takes the value true. Let’s denote the maximum number of iterations of the algorithm as M, i.e., the maximum number of highest-estimate strings on which the predicate will be evaluated.

First, let’s sort the alternatives within each cell in descending order of their estimates, so that the inequality q_{i,j} ≥ q_{i,j+1} holds for any cell C_i and any j < K. A *position* is defined as a sequence of indices (j_1, …, j_N) that corresponds to the string s = (a_{j_1}, …, a_{j_N}). The estimate of a position, i.e., the value of the functional at this position, is the product of the estimates of the alternatives that match the indices of the position: F(s) = q_{1,j_1} · … · q_{N,j_N}. In order to store the positions, we need a data structure *PositionBase* which allows us to add new positions (obtaining their addresses), to get positions by their addresses, and to check whether a given position has already been added to the base.

In the process of enumerating the positions, we first select a yet unprocessed position with the highest estimate, and then add all its derived positions (the positions that can be generated from the current one by adding one to a single index) to the *PositionQueue*. The *PositionQueue* holds triplets (Q, A, J), where Q is the estimate of a not-yet-considered position, A is the address in *PositionBase* of the processed position from which the current one was derived, and J is the index of the position element at the address A that was increased by one in order to obtain the current position. To organize the *PositionQueue*, we need a data structure that allows us to add a new triplet and to extract the triplet with the highest estimate Q.

At the first iteration of the algorithm, we consider the position (1, 1, …, 1) that has the maximum estimate. If the predicate takes the value true on the string that matches this position, the algorithm terminates. Otherwise, this position is added to *PositionBase* (obtaining an address A_1), and the triplets (Q_j, A_1, j) for all j ∈ {1, …, N} are added to *PositionQueue*, where Q_j is the estimate of the position derived from the initial one by increasing its j-th index by one. At each subsequent iteration, the triplet (Q, A, J) with the maximum estimate is extracted from *PositionQueue*, and the position S is restored from the position stored at the address A and the index J. If the position S has already been added to *PositionBase*, it is skipped, and the next triplet with the maximum estimate is extracted from *PositionQueue*. Otherwise, we check the value of the predicate on the string that corresponds to the position S. If the predicate is true for this string, the algorithm terminates. If not, the position S is added to *PositionBase* (obtaining an address A_S), all its derived positions are added to *PositionQueue* as triplets referencing A_S, and the process moves to the next iteration.

Let’s point out that after M iterations the predicate is evaluated at most M times, at most M additions to *PositionBase* are performed, and the additions to *PositionQueue*, the extractions from *PositionQueue*, and the searches in *PositionBase* happen at most M · N times. If *PositionQueue* is implemented using the “heap” data structure and *PositionBase* using the “trie” data structure, the computational complexity of the resulting algorithm amounts to O(M · N · (log(M · N) + N) + M · P(N)), where P(N) is the complexity of evaluating the predicate on a string of length N.
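A compact sketch of this procedure (simplified relative to the description above: positions are pushed into the queue directly, and a set of visited positions plays the role of the address-based *PositionBase*):

```python
import heapq

def postprocess(cells, predicate, max_checks=1000):
    """Best-first enumeration of strings in descending order of estimate.
    cells[i] is a list of (char, weight) pairs sorted by descending weight;
    returns the highest-estimate string accepted by the predicate, or None
    after max_checks predicate evaluations."""
    n = len(cells)

    def estimate(pos):
        p = 1.0
        for i, j in enumerate(pos):
            p *= cells[i][j][1]
        return p

    def string(pos):
        return ''.join(cells[i][j][0] for i, j in enumerate(pos))

    start = (0,) * n
    seen = {start}                        # plays the role of PositionBase
    queue = [(-estimate(start), start)]   # max-heap via negated estimates
    for _ in range(max_checks):
        if not queue:
            return None
        _, pos = heapq.heappop(queue)
        if predicate(string(pos)):
            return string(pos)
        # derive new positions by increasing one index at a time
        for i in range(n):
            if pos[i] + 1 < len(cells[i]):
                child = pos[:i] + (pos[i] + 1,) + pos[i + 1:]
                if child not in seen:
                    seen.add(child)
                    heapq.heappush(queue, (-estimate(child), child))
    return None
```

For example, with `cells = [[('1', 0.9), ('7', 0.1)], [('0', 0.8), ('O', 0.2)]]` and the predicate `lambda s: s == '70'`, the search first considers “10” (estimate 0.72) and “1O” (0.18) before accepting “70” (0.08).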

The biggest flaw of the algorithm based on validation grammars might be the fact that it’s not able to process the hypotheses of different lengths (which can occur due to segmentation errors: loss or appearance of extra characters, stitching and cutting the characters, etc.), while the hypotheses changes like “character deletion”, “character addition”, or even “character replacement with a pair” can be encoded easily in the error model of the algorithm based on WFST.

Speed and simplicity of use (when working with a new type of text field, we just need to provide access to the validation function) make the method based on validation grammars an extremely powerful tool in the arsenal of recognition system developers. We utilize this technique for a wide variety of text fields, such as dates, bank card numbers (the predicate being the Luhn check), fields of machine-readable zones in documents with checksums, and many others.
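For instance, the Luhn check mentioned above is itself a simple validation-grammar predicate (a sketch; the function name is ours):

```python
def luhn_predicate(s: str) -> bool:
    """Validation-grammar predicate for bank card numbers: the Luhn check.
    Every second digit from the right is doubled (subtracting 9 when the
    result exceeds 9); the total must be divisible by 10."""
    if not s.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(s)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d > 4 else d * 2
        total += d
    return total % 10 == 0
```

Plugging such a predicate into the enumeration procedure above yields the final post-processing algorithm for the corresponding field type.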

