What Are the Iranian Languages and Why They Matter for the Future of AI
- Karine Megerdoomian

- Oct 22
The Iranian or Iranic languages stretch across a vast geography and an even vaster history. They appear in the poetry of Hafez, the epics of Ferdowsi, the stories of Kurdish dengbêj singers, the conversations of Tajik bazaars, the melodies of Luri lullabies, and the modern newsfeeds of millions. Yet despite this rich tapestry, the digital world often treats them as footnotes or, worse, as languages too complicated, too “low-resource,” or too culturally specific to matter for AI.
At Zoorna Institute, we argue the opposite: the Iranian (Iranic) languages are essential for building truly global, culturally intelligent AI. And to understand why, we must first understand the family itself.

A Family, Not a Single Language
When people hear “Iranian languages,” they often think only of Persian (Farsi). But Persian is just one member of a family that spans continents and centuries.
This linguistic family includes:
Western Iranian
Persian (Farsi, Dari, Tajik)
Kurdish (Sorani, Kurmanji)
Luri
Bakhtiari
Gilaki
Mazandarani
Eastern Iranian
Pashto
Ossetic
Wakhi
Shughni
Yaghnobi
Baluchi (Western Iranian by classification, though spoken in the east of the region)
These languages are not dialects of one another. They differ in:
phonology and rhythm
morphology and verb systems
pragmatics and politeness norms
vocabulary, idioms, and metaphor
writing systems (Persian-Arabic, Cyrillic, Latin)
narrative traditions and cultural logic
They share a common ancestry, but each has its own grammar, sound system, and literary tradition.
For a list of the Iranian languages, visit the Iranic family chart.
Living History in Linguistic Form
The Iranian languages are carriers of:
one of the oldest continuous written Indo-European traditions
Silk Road literary exchange
epic storytelling and mysticism, as well as writings on medicine, science, and philosophy
layered politeness systems
tribal, regional, and diasporic identities
resilience in the face of colonial suppression, marginalization, and displacement
When AI systems fail to represent these languages, entire histories, archives, voices, and communities disappear from the digital sphere. This is not a technical gap; it is a cultural loss.
Why Iranian Languages Challenge AI
Modern large language models are trained mostly on English and a handful of other high-resource languages. Iranian languages contain features that strain the assumptions built into those models:
1. Rich Morphology
Sorani’s verb chains, Pashto’s inflection, and Luri’s clitics carry essential meaning that LLMs often handle poorly.
2. Pragmatic Subtlety
Politeness, hierarchy, and social distance shape every utterance. TaarofBench showed how dramatically AI misreads this.
3. Sparse Training Data
Even Persian, with tens of millions of speakers, is underrepresented; Kurdish, Luri, and Gilaki barely appear in mainstream datasets.
4. Script Diversity
Iranic languages are written in several distinct scripts, including Persian-Arabic, Cyrillic (e.g., Tajik), and Latin (e.g., Kurmanji Kurdish).
5. Cultural Narrative Forms
Indirectness, metaphor, and narrative softening are woven into daily discourse. LLMs often misinterpret them.
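To make the script diversity in point 4 concrete, here is a minimal Python sketch that guesses the dominant script of a short text from Unicode character names. The function name `dominant_script` and the sample phrases are illustrative assumptions, not part of any Zoorna Institute tool, and the approach only identifies the script, not the language: Persian, Sorani, and Pashto would all come back as Arabic-script.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in `text`.

    Hypothetical helper: the script is approximated by the first word of
    each character's Unicode name (e.g. "ARABIC", "CYRILLIC", "LATIN").
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip spaces, digits, punctuation, combining marks
        name = unicodedata.name(ch, "")
        script = name.split(" ", 1)[0] if name else "UNKNOWN"
        counts[script] += 1
    if not counts:
        return "UNKNOWN"
    return counts.most_common(1)[0][0]


# Illustrative phrases, each meaning roughly "the ... language".
samples = {
    "Persian": "زبان فارسی",
    "Tajik": "забони тоҷикӣ",
    "Kurmanji": "zimanê kurdî",
}

for label, phrase in samples.items():
    print(label, dominant_script(phrase))
```

A pipeline that handles only one of these scripts would silently drop the others, which is one way related languages end up missing from training data.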
The Missing Context: Language Policy and Digital Absence
For many communities across Iran and the broader region, the Iranian languages have lived more in homes, music, and memory than in official schools or publications. Limited institutional support for teaching, documenting, or standardizing these languages has made preservation challenging over generations. That absence carries into the digital world as well: without widespread educational use, formal corpora, or sustained media presence, many Iranian languages remain underrepresented in the datasets that modern AI systems rely on. This structural gap is one of the primary reasons AI struggles with these languages today.
Why This Matters for Scholars and Communities
Many Iranologists, linguists, and heritage speakers approach AI cautiously—and understandably. Technologies built without cultural awareness can flatten nuance, misinterpret meaning, or reinforce stereotypes.
But responsible AI research can support Iranian-language communities in powerful ways:
Preserving endangered languages
Documenting low-resource languages before they fade among younger generations.
Supporting education for heritage learners
Tools like our Persian AI Tutor can help diaspora families maintain language ties.
Improving access to historical and cultural archives
AI-assisted search across manuscripts, oral histories, and textual corpora.
Empowering regional scholarship
Kurdish, Luri, Gilaki, or Tajik scholars deserve tools that work for their languages — not tools that treat them as afterthoughts.
Building a community of practice
Iranian languages have been understudied in NLP not because they lack value, but because they lack representation.
This is a moment of opportunity.
Zoorna Institute is building a research ecosystem dedicated to the languages of Iran and the Caucasus. This includes:
narrative analytics for Persian and Kurdish
Tajik–Farsi transliteration systems
cultural-pragmatic benchmarks like TaarofBench
open calls for datasets, corpora, and linguistic insight
the SilkRoadNLP 2026 workshop — the first of its kind
We believe AI should amplify linguistic knowledge, not replace it. And we know that any meaningful progress requires collaboration with those who have lived, studied, and loved these languages for decades.
Why Iranian Languages Matter for the Future of AI
As AI grows more global, superficial fluency is no longer enough. We need systems that:
understand cultural nuance
respect social norms
interpret context and hierarchy
navigate indirectness
learn from scholars and communities
preserve linguistic diversity rather than erase it
In other words:
To build global AI, we must build multilingual, culturally intelligent AI. And the Iranian languages are a vital part of that future.



