2022–24

DocAI

SOTA Arabic document understanding — pre-trained from scratch when nothing else worked.

NeuralSpace · Arabic document intelligence

FIG. 06

Overview

Customers didn't want text pulled off a page. They wanted the document understood — in Arabic, where almost nothing worked well.

Getting there meant climbing through the entire OCR landscape, hitting its ceiling, and finally pre-training a model from scratch.

01 — The ladder

Every OCR approach, until one stuck

We started with Tesseract, with synthetic training data and a full train-deploy-test pipeline. It gave good results, but it broke on Arabic diacritics and skewed scans. DocTR improved on neither. The transformer-based TrOCR finally fixed diacritics, but by then every inbound use case wanted more than characters off a page. They wanted the document understood.
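To make "broke on diacritics" concrete, here is a minimal sketch of how a diacritic-only OCR error could be told apart from a wrong-letter error. The names `TASHKEEL`, `strip_tashkeel` and `is_diacritic_only_error` are illustrative and not from the project, and the set of marks is an assumption (the common Arabic tashkeel code points only).

```python
# Assumed tashkeel set: fathatan..sukun (U+064B-U+0652) plus superscript alef (U+0670).
TASHKEEL = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_tashkeel(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return "".join(ch for ch in text if ch not in TASHKEEL)


def is_diacritic_only_error(reference: str, hypothesis: str) -> bool:
    """True when the OCR output differs from the reference only in its diacritics."""
    return (
        reference != hypothesis
        and strip_tashkeel(reference) == strip_tashkeel(hypothesis)
    )


# "kataba" with three fathas, versus an output that got the vowel marks wrong.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
wrong_marks = "\u0643\u064F\u062A\u0650\u0628\u064E"
# A different word ("kitab" without marks): a base-letter error, not a diacritic one.
wrong_letters = "\u0643\u062A\u0627\u0628"

print(is_diacritic_only_error(reference, wrong_marks))    # True
print(is_diacritic_only_error(reference, wrong_letters))  # False
```

An explicit set of code points is used rather than Unicode decomposition, because decomposing would also split hamza and madda off their carrier letters, and those are part of the spelling rather than optional vowel marks.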

02 — The benchmark

A gold set of the hardest documents

So we built our own benchmark: a gold set of the weirdest, most complex documents customers actually sent. Then we evaluated the field, including the 2024 wave of vision-language models. None held up on Arabic. Pix2Struct showed the most promise, but even fine-tuned on Arabic it was only okay, and okay wasn't the product.

  • Donut
  • LayoutLMv3
  • UDOP
  • Nougat
  • Qwen2-VL
  • Idefics2
  • Pix2Struct

03 — From scratch

So we pre-trained our own

We built the data engines ourselves. The first, a web scraper with a random outline drawer, collected 2M+ screenshots. On AWS SageMaker we pre-trained an Arabic Pix2Struct from scratch on them. The second engine synthesised a VQA dataset by simulating documents and augmenting them with noise, and we fine-tuned on that downstream task.
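As a rough illustration of the second engine's idea (simulated documents, question-answer pairs, added noise), the sketch below uses only the Python standard library. Everything in it is hypothetical: the `VQASample` type, the English question template, the field names and the salt-and-pepper noise model are stand-ins, and the page is a plain grid of grayscale pixels rather than a rendered Arabic document.

```python
import random
from dataclasses import dataclass


@dataclass
class VQASample:
    pixels: list[list[int]]  # grayscale page, 0 = black, 255 = white
    question: str
    answer: str


def add_salt_and_pepper(pixels, rate, rng):
    """Return a copy with roughly `rate` of the pixels forced to black or white."""
    noisy = [row[:] for row in pixels]
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < rate:
                row[x] = rng.choice((0, 255))
    return noisy


def synthesise(fields, pixels, rate=0.02, seed=0):
    """Build one noisy (page, question, answer) sample per known field."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [
        VQASample(add_salt_and_pepper(pixels, rate, rng), f"What is the {name}?", value)
        for name, value in fields.items()
    ]


# Stand-in for a rendered page: an 8x8 blank white image.
page = [[255] * 8 for _ in range(8)]
samples = synthesise({"invoice number": "INV-001", "issue date": "2024-01-31"}, page)
print(samples[0].question, "->", samples[0].answer)
```

Because the generator already knows every field it placed on the page, the answers come for free. That is what makes synthetic VQA data cheap to produce at scale compared with hand-labelled documents.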

04 — Results

State of the art, and a clear trade

Zero diacritic errors. Real document understanding and Arabic VQA. An OCR character error rate (CER) of 2%, which is state of the art. The one cost was throughput: a bulky model runs slower. But the market told us plainly that for documents like these, customers will pay more for accuracy.
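For readers unfamiliar with the metric, CER is the character-level edit distance between the model output and the ground truth, divided by the length of the ground truth. A CER of 2% therefore means about two character edits per hundred reference characters. The sketch below shows the standard definition. How the project actually computed it is not stated here, so the NFC normalisation and the choice to count each diacritic as its own character are assumptions.

```python
import unicodedata


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of insertions, deletions and substitutions between two strings."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    # Assumption: normalise both sides so equivalent encodings compare equal.
    reference = unicodedata.normalize("NFC", reference)
    hypothesis = unicodedata.normalize("NFC", hypothesis)
    if not reference:
        raise ValueError("reference must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


# Reference has three letters plus three fathas; the output dropped every mark.
reference = "\u0643\u064E\u062A\u064E\u0628\u064E"
hypothesis = "\u0643\u062A\u0628"
print(cer(reference, hypothesis))  # 0.5
```

Counted this way, a model that reads every letter correctly but drops the diacritics is still heavily penalised, which is why diacritic handling matters so much for the headline number.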

In production

  • Ministry of Human Resources — Saudi Arabia
  • Dubai RTA
  • EDC Dubai

All three are active users, for their most complex Arabic document use cases.

OCR was the easy part. Understanding Arabic was the product.

Role

Core developer and customer-facing lead: drove the model research and the from-scratch pre-training, built the data engines (scraper and VQA synthesis), and ran training on SageMaker through to a SOTA Arabic document-understanding model in production.