
When AI Learns to Hesitate: Decoding Persian Taarof for the Age of Chatbots

Presented at Empirical Methods in Natural Language Processing (EMNLP) 2026


If you’ve ever been offered a cup of tea in Iran, you may have unknowingly stepped into a centuries-old ritual. The host insists you take it, you politely refuse, they insist again, and only after a few rounds of gracious back-and-forth do you finally accept. This is taarof: a finely tuned social choreography of generosity, humility, and respect that structures everyday Persian life.


In another scenario, you take a taxi, and as you step out at the end of the ride the driver waves off your payment with "It is not worthy of you," roughly equivalent to "Please, be my guest." Now imagine an AI in that scene. Should the system accept the gesture or insist on paying? For large language models (LLMs) trained to be polite, factual, and direct, this moment is a cultural trapdoor. What looks like sincerity in one context can signal insincerity in another. For systems optimized for blunt candor and task efficiency, these ritual refusals form a kind of cultural riddle.


That’s the puzzle behind TaarofBench, a groundbreaking new benchmark by researchers Nikta Gohari Sadr, Ali Emami, Karine Megerdoomian, Laleh Seyyed-Kalantari, and Sahar Heidariasl, introduced in their paper We Politely Insist: Your LLM Must Learn the Persian Art of Taarof. The team wanted to know: can a machine learn a social rule that even humans find hard to explain, and harder still to break gracefully?


A Ritual That Defies Logic

“Taarof resists definition,” says Emami. “It’s really a ritual of hesitation.” At its heart, taarof is a game of offering and refusal, insistence and restraint — a performance of mutual respect where meaning hides between the lines.


It’s so natural to Iranians that it’s rarely taught. Yet it’s so complex that even heritage speakers, like Emami himself, sometimes stumble. “When I met my co-author Karine at a conference,” he recalls, “we instinctively slipped into a light form of taarof, and that tiny awkward pause became our lightbulb moment. AI has the same problem: it can sound fluent but miss the social logic entirely.”


Turning Politeness Into Code

To teach machines this unspoken code, Gohari Sadr led the creation of TaarofBench, a dataset of 450 role-play scenarios across daily Iranian life — from dining and gift-giving to paying a bill or deflecting a compliment. Each scenario captures not just language, but context, social roles, and hierarchy: whether the speaker is a guest, host, friend, or superior.

The team formalized each instance as a six-part tuple, encoding the environment, speaker roles, conversational context, and the expected culturally correct response. “We wanted to operationalize taarof as a computational task,” says Gohari Sadr. “It’s not just words; it’s a map of human relationship dynamics.”
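
To make this concrete, here is a minimal sketch of how one such scenario might be represented in code. The field names (environment, llm_role, user_role, context, user_utterance, expected_response) and the taxi example are illustrative assumptions, not the dataset's exact schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TaarofScenario:
    """One role-play instance, loosely mirroring the six-part tuple
    described above. Field names are illustrative, not the paper's schema."""
    environment: str        # e.g., "taxi ride", "dinner at a host's home"
    llm_role: str           # the role the model plays (guest, passenger, host, ...)
    user_role: str          # the counterpart's role and relative social standing
    context: str            # what has happened in the conversation so far
    user_utterance: str     # the line the model must respond to
    expected_response: str  # the culturally appropriate (taarof-compliant) reply

# A hypothetical instance based on the taxi example from this article
scenario = TaarofScenario(
    environment="taxi ride, end of the trip",
    llm_role="passenger",
    user_role="driver (stranger, ritually refusing payment)",
    context="The passenger asks how much the fare is.",
    user_utterance="It is not worthy of you.",
    expected_response="Politely insist on paying rather than taking the offer at face value.",
)
```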


But much of taarof lives beyond words: tone, gesture, timing. “Sometimes,” Gohari Sadr notes, “you express politeness by not speaking — by hesitating or deferring. That’s nearly impossible to capture in text.”


When AI Gets the Words Right but the Culture Wrong

When tested across five leading LLMs, the models failed spectacularly. Even advanced systems like GPT-4o and Claude 3.5 scored 40–42% accuracy on taarof-expected scenarios — roughly the same as non-Iranian human participants. Yet these same models performed above 90% when taarof was not appropriate. In other words, they were polite but culturally wrong.


Standard politeness metrics rated 84% of the model responses as “polite,” but only 42% aligned with Persian norms. Accepting an offer too soon, praising oneself after a compliment, or declining help without the proper ritual of refusal all counted as failures.

“This shows that our AI systems are fluent but monocultural,” says Megerdoomian. “They can produce socially acceptable text, but not socially appropriate text.”


Language as a Cultural Key

One surprising finding: performance improved dramatically when the same scenarios were prompted in Persian rather than English — sometimes by more than 30 percentage points. Models seemed to unlock a different layer of behavior just by switching languages.


“That tells us something profound,” Gohari Sadr says. “Language is cultural context. If a model only learns meaning but not mindset, it will always default to the logic of its dominant training data — which, today, is largely Western.”


Yet the study also revealed an uncomfortable truth: when gender roles were introduced into the scenarios, models began reproducing stereotypes, with men paying the bill and women being protected. "They got the right answer for the wrong reason," Emami notes. "That's not cultural fluency — that's bias dressed up as etiquette."


Can You Teach AI to Be Culturally Fluent?

The team didn’t stop at diagnosis. They fine-tuned open models like Llama-3 using supervised fine-tuning and Direct Preference Optimization (DPO). These methods let models learn not just from labeled examples but from preference feedback. The improvement was striking: DPO raised model performance on taarof-expected scenarios from 37% to nearly 80%, approaching native speaker accuracy.
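
For readers curious about the mechanics, the sketch below shows the standard DPO objective on a batch of preference pairs, where each "chosen" response would be a taarof-appropriate reply and each "rejected" one a fluent but culturally off reply. This is a generic illustration of DPO, not the authors' training code; the pairing strategy and the beta value are assumptions.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss on a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities of a full response
    under either the policy being trained or a frozen reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the preferred (culturally appropriate) response above the dispreferred one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```

In practice, the log-probabilities come from scoring each full response with the fine-tuned policy and with a frozen copy of the base model; the reference term is what keeps the policy close to its starting point while it learns which responses are preferred.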


“The model began to internalize the logic of modesty, deference, and reciprocity,” Emami explains. “It learned that saying no can sometimes mean yes, and that social harmony can matter more than efficiency.”


Why This Matters for Global AI

For organizations like Zoorna Institute, which work at the intersection of linguistics, culture, and AI, TaarofBench provides a rare glimpse into how large models fail when stripped of cultural grounding.


Beyond its Persian roots, TaarofBench points to a broader challenge in AI: cultural intelligence. As chatbots and virtual agents mediate more of our cross-cultural communication (from customer service to diplomacy) the cost of misunderstanding grows. A phrase like “I insist” can sound generous in one language and aggressive in another.

“If AI misses that,” Megerdoomian warns, “it risks what linguists call pragmatic failure — getting the words right but the intent wrong.” As global AI systems mediate millions of daily interactions, misunderstandings aren’t just awkward — they can erode trust, reinforce stereotypes, or misrepresent entire cultures.


The researchers have already been contacted by teams working on Japanese keigo, Korean nunchi, and Turkish israr, all exploring similar norms of deference. The next step may be a global framework for culturally adaptive AI — one that learns not just to talk, but to listen in context.


The Politeness Paradox

At its core, TaarofBench challenges one of AI’s blind spots: the assumption that “politeness” is universal. It’s not. It’s a mirror of values such as respect, modesty, and human connection, and those values differ across societies.


“Taarof isn’t about deception or formality,” says Emami. “It’s about preserving dignity through reciprocity. And if AI is to operate in the human world, it needs to learn to hesitate — not because it’s uncertain, but because it understands that words are never just words.”


If the future of AI is multilingual, then the future of AI must also be culturally multilingual.

 

