
Benchmarking LLMs for Iranian Languages

Evaluating multilingual AI on linguistic features in low-resource Iranian languages.

Overview

Large Language Models (LLMs) have made impressive progress in multilingual understanding, yet performance across Iranian languages—such as Persian, Kurdish, Gilaki, Mazandarani, Balochi, and Luri—remains poorly understood. These languages, though rich in linguistic diversity and cultural expression, are chronically underrepresented in NLP training data and benchmarks. This initiative aims to systematically evaluate and compare LLM capabilities across a spectrum of linguistic phenomena—morphology, syntax, and semantics—to reveal gaps in model competence and guide equitable AI development for low-resource languages.

Methodology

Our team is developing a suite of diagnostic tasks that assess model understanding of structural and typological features specific to Iranian languages:
- Feature-level benchmarks for morphology (e.g., clitic attachment, verb inflection), syntax (e.g., word order, argument structure), and discourse markers.
- Cross-lingual evaluation using parallel and comparable corpora to measure transfer learning performance across related languages.
- Prompt-based and probing methods to test the implicit linguistic knowledge encoded in LLMs without fine-tuning.
- Human evaluation by native linguists to assess model-generated outputs for fluency, grammaticality, and pragmatic appropriateness.
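The prompt-based probing bullet above can be made concrete with a common minimal-pair setup: each diagnostic item pairs a grammatical sentence with a minimally differing ungrammatical one, and the model "passes" when it scores the grammatical variant higher. The sketch below is illustrative only; the class and function names, language codes, and the stand-in scorer are hypothetical (a real run would query an LLM's log-probability API), and it is not the project's actual harness.

```python
from dataclasses import dataclass

@dataclass
class MinimalPair:
    """One diagnostic item: a grammatical sentence and its minimally
    differing ungrammatical counterpart, tagged by language and feature."""
    language: str        # hypothetical code, e.g. "fa" (Persian), "ckb" (Sorani)
    feature: str         # e.g. "clitic_attachment", "verb_inflection"
    grammatical: str
    ungrammatical: str

def pair_accuracy(pairs, score):
    """Fraction of pairs where the model prefers the grammatical variant.

    `score` maps a sentence to a model score (e.g. total log-probability
    under the LLM being probed); higher means more probable.
    """
    correct = sum(1 for p in pairs
                  if score(p.grammatical) > score(p.ungrammatical))
    return correct / len(pairs)

if __name__ == "__main__":
    # Toy demonstration with made-up sentences and a dict as a stand-in scorer.
    pairs = [
        MinimalPair("fa", "verb_inflection", "sentence A ok", "sentence A bad"),
        MinimalPair("ckb", "clitic_attachment", "sentence B ok", "sentence B bad"),
    ]
    fake_log_probs = {"sentence A ok": -10.0, "sentence A bad": -14.0,
                      "sentence B ok": -12.0, "sentence B bad": -9.0}
    print(pair_accuracy(pairs, fake_log_probs.get))  # 0.5 on this toy data
```

Because no fine-tuning is involved, this style of probing tests only the linguistic knowledge already encoded in the model, which matches the methodology described above.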

Preliminary Results

This project is currently in its initial design and data curation phase. Early exploratory analyses indicate that while multilingual LLMs show surface-level competence in Persian, their performance drops sharply in regional and minority languages such as Sorani Kurdish and Balochi. These findings reinforce the need for fine-grained linguistic evaluation rather than aggregate accuracy scores.
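The point that aggregate accuracy can mask per-language gaps is easy to show in code. The sketch below is a hypothetical illustration (language codes and numbers are invented, not project results): it groups item-level outcomes by language and feature, so a middling overall score decomposes into strong Persian performance and weak performance on a lower-resource language.

```python
from collections import defaultdict

def breakdown(results):
    """Aggregate accuracy plus per-(language, feature) accuracy.

    `results` is an iterable of (language, feature, correct) triples,
    one per benchmark item.
    """
    totals = defaultdict(lambda: [0, 0])  # (lang, feature) -> [correct, n]
    agg_correct = agg_n = 0
    for lang, feature, ok in results:
        totals[(lang, feature)][0] += int(ok)
        totals[(lang, feature)][1] += 1
        agg_correct += int(ok)
        agg_n += 1
    per_cell = {key: c / n for key, (c, n) in totals.items()}
    return agg_correct / agg_n, per_cell

if __name__ == "__main__":
    # Invented outcomes: 90% on Persian ("fa"), 30% on Balochi ("bal").
    results = (
        [("fa", "verb_inflection", True)] * 9 + [("fa", "verb_inflection", False)]
        + [("bal", "verb_inflection", True)] * 3 + [("bal", "verb_inflection", False)] * 7
    )
    aggregate, cells = breakdown(results)
    print(f"aggregate: {aggregate:.2f}")          # 0.60 looks merely mediocre...
    for (lang, feature), acc in sorted(cells.items()):
        print(f"{lang:4} {feature}: {acc:.2f}")   # ...hiding fa=0.90 vs bal=0.30
```

Reporting the per-cell table rather than the single aggregate number is exactly the kind of fine-grained evaluation the paragraph above argues for.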

Use Case

The benchmark will provide:
- A diagnostic tool for researchers and developers to evaluate model performance across underrepresented languages.
- A foundation for fine-tuning and domain adaptation, improving AI accessibility and cultural representation.
- A resource for policymakers and educators advocating for digital inclusion of minority languages.
- A stepping stone toward creating multilingual, linguistically informed AI that respects linguistic diversity across the Iranian plateau and beyond.

Team

Ali Salehi, Karine Megerdoomian

