When you pipe LLM-generated Arabic into a TTS engine, it breaks. Words get mispronounced because diacritics — the marks that distinguish "عَلَّمَ" (taught) from "عَلِمَ" (knew) — are missing 60-70% of the time. Numbers like "2030" are read digit by digit instead of as "ألفان و ثلاثون" (two thousand and thirty). There was no LangChain integration that handled this, so I built one.
The Architecture
langchain-arabic is a `BaseTransformOutputParser` subclass that slots into any LangChain chain. It runs two passes over LLM output:
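The two-pass shape can be sketched as a plain class. The names and the toy pass bodies below are hypothetical stand-ins; the real parser subclasses LangChain's `BaseTransformOutputParser` and does far more in each pass:

```python
# Hypothetical sketch of the two-pass design; the real package wires this
# into LangChain's BaseTransformOutputParser for chain and streaming use.
class ArabicTTSPostprocessor:
    def __init__(self, diacritics_table: dict[str, str]):
        self.diacritics_table = diacritics_table

    def parse(self, text: str) -> str:
        text = self._restore_diacritics(text)  # pass 1
        text = self._convert_numbers(text)     # pass 2
        return text

    def _restore_diacritics(self, text: str) -> str:
        # Longest dictionary keys first so multi-word entries win.
        for src in sorted(self.diacritics_table, key=len, reverse=True):
            text = text.replace(src, self.diacritics_table[src])
        return text

    def _convert_numbers(self, text: str) -> str:
        # Toy stand-in: the real pass classifies numeric contexts and
        # calls num2words; here we only spell out one known number.
        return text.replace("2030", "ألفان و ثلاثون")
```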
Pass 1 — Diacritics restoration. A deterministic dictionary backend applies longest-match-first replacement: "علم الحاسوب" ("computer science," 11 chars) is matched before "علم" ("science," 3 chars), so the longer entry is never partially overwritten. Optionally, a CATT neural backend (an encoder-only transformer) handles words not in the dictionary.
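A minimal way to get longest-match-first behavior is a single regex alternation with keys sorted by length. The two-entry table here is an illustrative stand-in, not the package's real dictionary:

```python
import re

# Toy dictionary: undiacritized form -> diacritized form (stand-ins).
TABLE = {
    "علم": "عَلِمَ",
    "علم الحاسوب": "عِلْمُ الْحَاسُوبِ",
}

def restore_diacritics(text: str, table: dict[str, str]) -> str:
    # A regex alternation tries branches left to right, so listing longer
    # keys first lets "علم الحاسوب" win before "علم" can fire inside it.
    # (No word-boundary handling in this sketch.)
    pattern = "|".join(
        re.escape(key) for key in sorted(table, key=len, reverse=True)
    )
    return re.sub(pattern, lambda m: table[m.group(0)], text)
```

A single `re.sub` pass also avoids a subtle hazard of chained `str.replace` calls: later replacements can never rewrite the output of earlier ones.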
Pass 2 — Number conversion. Regex detects five contexts — percentages, Arabic currency, English currency, phone numbers, and plain integers — and converts each using num2words with Arabic or English output.
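The context detection can be sketched as an ordered list of patterns where earlier contexts claim their spans first. These regexes are illustrative, not the package's actual patterns:

```python
import re

# Hypothetical patterns for the five contexts, in priority order.
PATTERNS = [
    ("percentage", re.compile(r"\d+(?:\.\d+)?%")),
    ("arabic_currency", re.compile(r"\d+(?:\.\d+)?\s*(?:ريال|درهم|دينار)")),
    ("english_currency", re.compile(r"\$\d+(?:\.\d+)?")),
    ("phone", re.compile(r"\b\d{7,15}\b")),
    ("integer", re.compile(r"\b\d+\b")),
]

def classify_numbers(text: str) -> list[tuple[str, str]]:
    """Return (context, matched_text) pairs; earlier patterns win overlaps."""
    found, taken = [], []
    for name, pat in PATTERNS:
        for m in pat.finditer(text):
            # Skip spans already claimed by a higher-priority context,
            # e.g. the "25" inside "25%".
            if any(m.start() < end and m.end() > start for start, end in taken):
                continue
            taken.append((m.start(), m.end()))
            found.append((name, m.group(0)))
    return found
```

Each classified span would then be handed to num2words with the appropriate language and wording for its context.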
The Benchmarks
I ran DER (Diacritization Error Rate) and WER (Word Error Rate) on an 8-sentence test corpus:
| Mode | DER | WER | Latency |
|---|---|---|---|
| Dictionary only | ~70% | ~75% | <1ms |
| CATT encoder-only | ~5-10% | ~15-25% | ~2-5s |
| CATT encoder-decoder | ~3-8% | ~10-20% | ~5-10s |
| Hybrid (CATT + dict overrides) | ~4-8% | ~12-22% | ~2-5s |
Dictionary-only DER is high because the test corpus intentionally includes words outside the 4-word test dictionary. In production, the dictionary covers your domain vocabulary, so the error rate falls as your coverage of that vocabulary grows. Hybrid mode is the sweet spot: CATT handles general text, while dictionary overrides pin down domain-critical terms like client names or product labels.
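Hybrid mode can be sketched as: run the neural pass first, then force dictionary forms on top. The override keys must match loosely, since the neural pass has already inserted its own marks between the letters. The stub model below is hypothetical; the real backend is CATT:

```python
import re

HARAKAT = "\u064B-\u0652"  # Arabic diacritic range (tanwin .. sukun)

def _loose_pattern(word: str) -> str:
    # Allow optional diacritics after every character so the pattern
    # still matches text the neural pass has already diacritized.
    # (No word-boundary handling in this sketch.)
    return "".join(re.escape(ch) + f"[{HARAKAT}]*" for ch in word)

def hybrid_diacritize(text, neural_fn, overrides: dict[str, str]) -> str:
    out = neural_fn(text)  # general-purpose neural pass (e.g. CATT)
    # Dictionary overrides win on domain-critical terms, longest first.
    for src in sorted(overrides, key=len, reverse=True):
        out = re.sub(_loose_pattern(src), overrides[src], out)
    return out
```

This ordering is what makes the "guarantee" possible: whatever the model emits for an overridden term, the dictionary form replaces it.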
What This Means
The package is on PyPI (`pip install langchain-arabic`) and the LangChain docs PR is in review. It supports MSA and Gulf Arabic, streaming, and async. Before this, the LangChain ecosystem had zero Arabic post-processing integrations. Now it has one.