AI · Mar 2026 · 2 min read

Why I Built langchain-arabic

When you pipe LLM-generated Arabic into a TTS engine, it breaks. Words get mispronounced because diacritics — the marks that distinguish "عَلَّمَ" (taught) from "عَلِمَ" (knew) — are missing 60-70% of the time. Numbers like "2030" are read out digit by digit instead of as "ألفان و ثلاثون" (two thousand and thirty). There was no LangChain integration that handled this. So I built one.

The Architecture

langchain-arabic is a BaseTransformOutputParser that slots into any LangChain chain. It runs two passes on LLM output:

Pass 1 — Diacritics restoration. A deterministic dictionary backend applies longest-match-first replacement: the two-word phrase "علم الحاسوب" (computer science) is matched before the single word "علم" (knowledge), so the shorter entry never partially overwrites the longer one. Optionally, a CATT neural backend (an encoder-only or encoder-decoder transformer) handles words not in the dictionary.
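The longest-match-first idea can be sketched in a few lines. This is an illustrative toy, not the package's real API: the dictionary entries and function name here are made up, and the shipped backend has its own data and interface.

```python
import re

# Toy two-entry dictionary: one multi-word phrase, one single word.
DIACRITICS_DICT = {
    "علم الحاسوب": "عِلْمُ الحَاسُوب",  # "computer science" (multi-word entry)
    "علم": "عِلْم",                      # "science, knowledge" (single word)
}

def restore_diacritics(text: str) -> str:
    # Sorting keys longest-first makes the regex alternation try
    # "علم الحاسوب" before "علم", so the short entry never partially
    # overwrites the longer phrase.
    keys = sorted(DIACRITICS_DICT, key=len, reverse=True)
    pattern = re.compile("|".join(map(re.escape, keys)))
    return pattern.sub(lambda m: DIACRITICS_DICT[m.group(0)], text)
```

Because Python's `re` alternation tries branches left to right at each position, ordering the keys by descending length is enough to guarantee the longest entry wins.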

Pass 2 — Number conversion. Regex detects five contexts — percentages, Arabic currency, English currency, phone numbers, and plain integers — and converts each using num2words with Arabic or English output.
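The detection step can be sketched as an ordered list of patterns. The regexes and category names below are simplified assumptions, not the package's actual patterns, and the real word conversion is delegated to num2words (with Arabic or English output) rather than done here.

```python
import re

# Ordered contexts: earlier categories claim overlapping spans first,
# so "25%" goes to percentage, not plain integer.
CONTEXTS = [
    ("percentage", re.compile(r"\d+(?:\.\d+)?\s*%")),
    ("currency_ar", re.compile(r"\d+(?:\.\d+)?\s*(?:ريال|درهم|دينار)")),
    ("currency_en", re.compile(r"\$\s*\d+(?:\.\d+)?")),
    ("phone", re.compile(r"\b\d{9,15}\b")),
    ("integer", re.compile(r"\d+")),
]

def detect_number_contexts(text: str):
    """Return (category, matched_text) pairs; overlapping matches go to
    the first category in CONTEXTS that claims their span."""
    consumed = set()
    found = []
    for name, pattern in CONTEXTS:
        for m in pattern.finditer(text):
            span = set(range(m.start(), m.end()))
            if span & consumed:
                continue  # a higher-priority context already owns this span
            consumed |= span
            found.append((name, m.group(0)))
    return found
```

For example, in "نمت المبيعات 25% لتصل إلى 300 ريال" ("sales grew 25% to reach 300 riyals"), "25%" is claimed by the percentage context and "300 ريال" by Arabic currency, leaving nothing for the plain-integer fallback.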

The Benchmarks

I ran DER (Diacritization Error Rate) and WER (Word Error Rate) on an 8-sentence test corpus:

Mode                           | DER    | WER     | Latency
Dictionary only                | ~70%   | ~75%    | <1ms
CATT encoder-only              | ~5-10% | ~15-25% | ~2-5s
CATT encoder-decoder           | ~3-8%  | ~10-20% | ~5-10s
Hybrid (CATT + dict overrides) | ~4-8%  | ~12-22% | ~2-5s

Dictionary-only DER is high because the test corpus intentionally includes words outside the 4-word test dictionary. In production, your dictionary covers your domain vocabulary — the error rate drops proportionally. The hybrid mode is the sweet spot: CATT handles general text, dictionary overrides guarantee correctness on domain-critical terms like client names or product labels.
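The hybrid strategy can be sketched as follows. The function names are my own for illustration and the neural pass is a stub standing in for the CATT backend; overrides here match on the undiacritized skeleton of each word (single-word entries only), so dictionary entries win even after the model has added its own marks.

```python
import re

# Arabic combining marks (fathatan through sukun).
AR_DIACRITICS = re.compile(r"[\u064B-\u0652]")

def strip_diacritics(text: str) -> str:
    return AR_DIACRITICS.sub("", text)

def hybrid_diacritize(text, neural_backend, overrides):
    out = neural_backend(text)  # general-purpose neural pass first
    # Then pin domain-critical terms to their dictionary form, comparing
    # against the undiacritized skeleton of each output word.
    return " ".join(
        overrides.get(strip_diacritics(word), word) for word in out.split(" ")
    )

OVERRIDES = {"علم": "عِلْم"}  # e.g. a client name or product label, pinned
```

Whatever diacritics the neural pass emits on an overridden word, the dictionary form replaces them wholesale, which is what guarantees correctness on domain-critical terms.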

What This Means

The package is on PyPI (pip install langchain-arabic) and the LangChain docs PR is in review. It supports MSA and Gulf Arabic dialects, streaming, and async. The LangChain ecosystem had nothing for Arabic post-processing before this. Now it has one.