Efficient OCR

Modular, customizable, and scalable OCR

Efficient OCR

A highly accurate, easy to install, fast to train, and CPU-optimized OCR solution.

Easily Customizable OCR for the Social Sciences

EffOCR (EfficientOCR) is designed for researchers and archives seeking a sample-efficient, customizable, scalable OCR solution for diverse documents.

Vast document collections remain trapped in hard copy or lack accurately digitized texts. Using optical character recognition (OCR) to digitize public domain collections on a large scale entails several challenges:

  1. Accuracy. Digitized texts need to be sufficiently accurate for end users’ objectives, which are highly diverse. Accuracy can be particularly central for quantitative applications, for which small errors can create major statistical outliers. Models for lower resource languages, if they exist, tend to perform much worse than models for high resource settings like English.
  2. Cost. The OCR solution must be cheap to deploy, given document collections whose size numbers in the millions or even billions of pages. Commercial engines - as well as large open-source OCR models - fall well short of this requirement.

To meet these objectives, we developed EffOCR, an open-source OCR package designed for researchers, libraries, and archives seeking a computationally and sample efficient OCR solution for digitizing diverse document collections. EffOCR has two key ingredients: 1) a novel OCR architecture and 2) a carefully designed interface to facilitate off-the-shelf OCR usage, customization via model training if necessary, and easy sharing of OCR models.

Efficient OCR Architecture

Modern OCR overwhelmingly uses deep neural networks - either a convolutional neural network (CNN) or vision transformer (ViT) - to encode images. The representations created by passing an input image through a neural encoder are then decoded to the associated text. This approach is often referred to as “Sequence-to-Sequence (Seq2Seq) OCR.”

In contrast, EffOCR brings OCR back to its roots: the optical recognition of characters. Deep learning-based object detection methods are used to localize individual characters in the document image. Character recognition is modeled as an image retrieval problem, using a vision encoder contrastively trained on character crops.

This figure compares the two approaches: Efficient OCR Architecture

For a more complete description of EffOCR’s architecture, see our paper

Try a pretrained model

While Efficient OCR is primarily designed to be customized for your dataset, we provide pretrained models for experimentation. You can try out English inference with the following code snippet:

from efficient_ocr import EffOCR

# English
effocr = EffOCR(
  config={
      'Recognizer': {
          'char': {
              'model_backend': 'onnx',
              'model_dir': './models',
              'hf_repo_id': 'dell-research-harvard/effocr_en/char_recognizer',
          },
          'word': {
              'model_backend': 'onnx',
              'model_dir': './models',
              'hf_repo_id': 'dell-research-harvard/effocr_en/word_recognizer',
          },
      },
      'Localizer': {
          'model_dir': './models',
          'hf_repo_id': 'dell-research-harvard/effocr_en',
          'model_backend': 'onnx'
      },
      'Line': {
          'model_dir': './models',
          'hf_repo_id': 'dell-research-harvard/effocr_en',
          'model_backend': 'onnx',
      },
  }
)

results = effocr.infer(<path_to_image>.jpg)

Sample Efficient Learning for Your Dataset

EffOCR is designed to be easily customized for your dataset. Shown only a handful of examples, it quickly adapts to new character sets, languages, and document types. In the chart below, note EffOCR’s high accuracy in both low- and high-resource settings:

Efficient OCR Architecture

EffOCR training can be performed either in an end-to-end fashion, as demonstrated in this notebook or in a modular fashion, as demonstrated in this notebook.

Models in EffOCR are very lightweight, so training can nearly always be accomplished within a Colab notebook.

Getting Started with EffOCR

Efficient OCR-- A Sample-Efficient OCR Pipeline for Digitizing Diverse Document Collections

Efficient OCR: A Sample-Efficient OCR Pipeline for Digitizing Diverse Document Collections

Efficient OCR on GitHub

Efficient OCR on GitHub

Get started!

Start with the Demo Notebook (opens in Colab) for a quick intro to EffOCR