2022–24

DocAI

SOTA Arabic document understanding — pre-trained from scratch when nothing else worked.

NeuralSpace · Arabic document intelligence

FIG. 06

Overview

Customers didn't want text pulled off a page. They wanted the document understood — in Arabic, where almost nothing worked well.

Getting there meant climbing through the entire OCR landscape, hitting its ceiling, and finally pre-training a model from scratch.

01 — The ladder

Every OCR approach, until one stuck

We started with Tesseract: synthetic training data and a full train-deploy-test pipeline. Results were good, but it broke on Arabic diacritics and skewed scans. DocTR did not move the needle. Transformer-based TrOCR finally fixed diacritics, but by then every inbound use case wanted more than characters off a page. They wanted the document understood.

02 — The benchmark

A gold set of the hardest documents

So we built our own benchmark: a gold set of the weirdest, most complex documents customers actually sent. Then we evaluated the field, including the 2024 wave of vision-language models. None held up on Arabic. Pix2Struct showed the most promise; fine-tuned on Arabic it was okay, and okay wasn't the product.

Donut
LayoutLMv3
UDOP
Nougat
Qwen2-VL
Idefics2
Pix2Struct

03 — From scratch

So we pre-trained our own

We built the engine ourselves. A web scraper with a random outline drawer collected 2M+ screenshots. On AWS SageMaker we pre-trained an Arabic Pix2Struct from scratch. A second engine synthesised a VQA dataset by simulating documents and augmenting them with noise, and we fine-tuned the pre-trained model on the downstream task.
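As a rough illustration of what a VQA synthesis step can look like, here is a minimal sketch using only the Python standard library. It is not the production engine. The field names, the question templates, the `VQAExample` type and the salt-and-pepper noise model are all assumptions made for this example, English placeholders stand in for Arabic text, and rendering the simulated fields to an image is left out.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VQAExample:
    fields: Dict[str, str]  # simulated document content, to be rendered as an image
    question: str
    answer: str


# Hypothetical field specs: name -> (question template, value factory).
FIELD_SPECS = {
    "invoice_number": ("What is the invoice number?",
                       lambda rng: f"INV-{rng.randint(10000, 99999)}"),
    "total": ("What is the total amount?",
              lambda rng: f"{rng.randint(100, 9999)}.00"),
}


def synthesise_examples(rng: random.Random) -> List[VQAExample]:
    """Simulate one document and emit one question-answer pair per field."""
    fields = {name: make(rng) for name, (_, make) in FIELD_SPECS.items()}
    return [VQAExample(fields, question, fields[name])
            for name, (question, _) in FIELD_SPECS.items()]


def salt_and_pepper(pixels: List[List[int]], rate: float,
                    rng: random.Random) -> List[List[int]]:
    """Set roughly `rate` of the grayscale pixels (0-255) to pure black or white."""
    return [[rng.choice((0, 255)) if rng.random() < rate else value
             for value in row]
            for row in pixels]
```

Passing a seeded `random.Random` keeps a generated dataset reproducible, which helps when comparing fine-tuning runs against the same synthetic set.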

04 — Results

State of the art, and a clear trade

Zero diacritic errors. Real document understanding and Arabic VQA. OCR character error rate (CER) at 2%, state of the art. The one cost was throughput: a bulky model runs slower. But the market told us plainly that for documents like these, customers will pay more for accuracy.
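For readers unfamiliar with the metric, CER is conventionally the character-level edit distance between the model output and the reference, divided by the reference length. The sketch below shows that standard definition, plus a helper that strips combining marks so diacritic errors can be separated from base-letter errors. The function names and the choice of NFC normalisation are assumptions for this example, not a description of the benchmark code used here.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insert, delete and substitute."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include Arabic harakat."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Reference with three fatha marks; the hypothesis lost all of them.
reference = "\u0643\u064e\u062a\u064e\u0628\u064e"
hypothesis = "\u0643\u062a\u0628"
print(cer(reference, hypothesis))                                      # 0.5
print(cer(strip_diacritics(reference), strip_diacritics(hypothesis)))  # 0.0
```

Comparing the two numbers shows how much of an error rate comes from diacritics alone: in this toy case every error is a dropped diacritic.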

In production

Ministry of Human Resources — Saudi Arabia
Dubai RTA
EDC Dubai

All are active users for their most complex Arabic document use cases.

OCR was the easy part. Understanding Arabic was the product.

Role

Core developer and customer-facing lead. Drove the model research and the from-scratch pre-training, built the data engines (scraper and VQA synthesis), and ran training on SageMaker through to a SOTA Arabic document-understanding model in production.
