
Tajik Transliteration

Bridging Persian dialects across scripts and borders

Overview

This initiative focuses on building a robust bidirectional transliteration system between Tajik Persian (written in Cyrillic) and Iranian Persian (written in Perso-Arabic script). Although both are dialects of Persian, their separation by script, geography, and political history has created a major barrier to linguistic accessibility, digital resource sharing, and cross-border communication.

By developing accurate transliteration tools grounded in linguistic principles, Zoorna aims to promote greater connectivity across the Persian-speaking world, from Iran and Afghanistan to Tajikistan and diasporic communities.

Methodology

Our system development includes:

- Data collection and normalization from Tajik news, literature, and social media corpora
- Script mapping algorithms informed by phonological and morphological correspondences
- Comparison of approaches:
  - Traditional rule-based mapping
  - Transformer-based sequence models
  - Generative AI prompt pipelines for low-resource transfer
- Evaluation metrics measuring fidelity, readability, and ambiguity resolution
- Human-in-the-loop feedback from bilingual speakers to fine-tune outputs
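
As a rough illustration of the rule-based approach in the list above, a character-level mapping from Tajik Cyrillic to Perso-Arabic might be sketched as follows. The mapping tables, the function name `to_perso_arabic`, and the vowel rules are illustrative assumptions, not the project's actual rule set; the sketch deliberately ignores morphology and letters such as е, ё, э, ю, я, and ъ.

```python
import unicodedata

# Hypothetical default mapping: each Cyrillic consonant gets ONE Perso-Arabic
# letter. Real text needs more candidates per letter and a way to choose.
CONSONANTS = {
    "б": "\u0628",  # ب
    "п": "\u067e",  # پ
    "т": "\u062a",  # ت
    "ҷ": "\u062c",  # ج
    "ч": "\u0686",  # چ
    "х": "\u062e",  # خ
    "д": "\u062f",  # د
    "р": "\u0631",  # ر
    "з": "\u0632",  # ز
    "ж": "\u0698",  # ژ
    "с": "\u0633",  # س
    "ш": "\u0634",  # ش
    "ғ": "\u063a",  # غ
    "ф": "\u0641",  # ف
    "қ": "\u0642",  # ق
    "к": "\u06a9",  # ک
    "г": "\u06af",  # گ
    "л": "\u0644",  # ل
    "м": "\u0645",  # م
    "н": "\u0646",  # ن
    "в": "\u0648",  # و
    "ҳ": "\u0647",  # ه
    "й": "\u06cc",  # ی
}
# Vowels that are normally written in Perso-Arabic script.
LONG_VOWELS = {"о": "\u0627", "ӯ": "\u0648", "ӣ": "\u06cc"}  # ا و ی
# Vowels assumed to be unwritten inside a word (a simplification).
SHORT_VOWELS = {"а", "и", "у"}


def _nfc(text):
    """Compose characters so that 'у' + combining macron equals 'ӯ'."""
    return unicodedata.normalize("NFC", text)


CONSONANTS = {_nfc(k): v for k, v in CONSONANTS.items()}
LONG_VOWELS = {_nfc(k): v for k, v in LONG_VOWELS.items()}


def to_perso_arabic(word):
    """Transliterate one Tajik Cyrillic word with naive per-letter rules."""
    word = _nfc(word.lower())
    out = []
    for i, ch in enumerate(word):
        if i == 0 and ch == "о":
            out.append("\u0622")  # word-initial long vowel: آ
        elif i == 0 and ch in SHORT_VOWELS:
            out.append("\u0627")  # word-initial short vowel carried by ا
        elif ch in SHORT_VOWELS:
            continue  # medial short vowels are left unwritten
        elif ch in LONG_VOWELS:
            out.append(LONG_VOWELS[ch])
        elif ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        else:
            out.append(ch)  # unmapped characters pass through unchanged
    return "".join(out)


for tajik in ["китоб", "дӯст", "об", "нон"]:
    print(tajik, "->", to_perso_arabic(tajik))
```

A table this simple only covers regular spellings. For example, it drops every medial и, so a word such as Тоҷикистон loses the ی that standard Persian spelling writes (تاجیکستان), which is the kind of case where lexicon lookups or learned models would have to take over.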

We also document dialectal differences and orthographic conventions to support linguistic research and resource development.
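
The fidelity metric named in the methodology is not specified here. One common choice for transliteration output, assumed purely for illustration, is character error rate (CER): the edit distance between system output and a reference spelling, divided by the reference length. The function names below are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(hypothesis, reference):
    """Character error rate of a hypothesis against a reference string."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


# A hypothesis missing the ا of کتاب: one edit over four reference characters.
print(cer("\u06a9\u062a\u0628", "\u06a9\u062a\u0627\u0628"))
```

CER captures spelling fidelity only. Readability and ambiguity resolution, the other two criteria in the list, would still need separate measures or human judgments.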

Preliminary Results

- Rule-based systems perform well on common orthographic forms but falter with morphophonemic alternations
- LLMs show promise in low-data settings, especially when fine-tuned on aligned sentence pairs
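- Disambiguation remains a challenge in both directions: Perso-Arabic spelling usually omits short vowels, while Tajik Cyrillic merges letters that Perso-Arabic keeps distinct (e.g., з can correspond to ز, ذ, ض, or ظ)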
- Users prefer outputs with optional glossing or translational context, suggesting room for hybrid solutions
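
To make the disambiguation problem concrete, the sketch below enumerates every Perso-Arabic spelling a Cyrillic word could map to and keeps only those found in a word list. The candidate table, the tiny lexicon, and the function names are assumptions made for illustration; a real hybrid system would presumably rank candidates with a model or corpus frequencies rather than an exact lookup.

```python
from itertools import product

# Hypothetical one-to-many table: one Tajik letter, several Perso-Arabic letters.
AMBIGUOUS = {
    "з": ["\u0632", "\u0630", "\u0636", "\u0638"],  # ز ذ ض ظ
    "с": ["\u0633", "\u0635", "\u062b"],            # س ص ث
    "т": ["\u062a", "\u0637"],                      # ت ط
    "ҳ": ["\u0647", "\u062d"],                      # ه ح
}
# Only the unambiguous letters needed for the demo word.
FIXED = {"о": "\u0627", "р": "\u0631"}  # ا ر
# Short vowels assumed to be unwritten in Perso-Arabic script.
DROPPED = {"а", "и", "у"}


def candidates(word):
    """Return every spelling produced by combining per-letter options."""
    slots = []
    for ch in word:
        if ch in DROPPED:
            continue
        slots.append(AMBIGUOUS.get(ch) or [FIXED.get(ch, ch)])
    return ["".join(combo) for combo in product(*slots)]


def disambiguate(word, lexicon):
    """Keep only the candidate spellings attested in the lexicon."""
    return [c for c in candidates(word) if c in lexicon]


lexicon = {"\u062d\u0627\u0636\u0631"}  # حاضر
print(len(candidates("ҳозир")))         # 2 options for ҳ times 4 for з
print(disambiguate("ҳозир", lexicon))
```

The number of candidates grows multiplicatively with each ambiguous letter, which is one reason a pure lookup may not scale to longer words or unseen vocabulary.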

Use Cases

This transliteration engine supports:

- Scholars and linguists working across Persian varieties
- Journalists and media translators reporting in multi-dialectal environments
- Educators and students navigating materials in different scripts
- Digital libraries and archives seeking cross-script metadata and tagging
- Developers and platform builders integrating cross-dialectal search or input capabilities

The tool can also be incorporated into language-learning applications, search engines, or digital heritage projects.

Team

Rayyan Merchant
