aharrasse🌸ethz.ch

Abir Harrasse عبير الهراس

/ʕæˈbiːr hɑˈrɑːs/

Hello! I'm a PhD student at ETH Zurich (D-INFK), working on mechanistic interpretability, LLM reasoning, and reliable deep learning. I'm drawn to understanding intelligence through deep neural networks: how they work, when they generalize, and how to make them reliable.

I'm advised by Prof. Mrinmaya Sachan and Prof. Bjoern Menze, and generously supported by the ETH AI Center Doctoral Fellowship. Before ETH, I was a Research Scientist at Martian Learning Inc., where I worked on mechanistic interpretability and LLM routing.

Before that, I was a research intern at MPI-IS Tübingen under Zhijing Jin and Bernhard Schölkopf, where I worked on understanding multilingual representations in LLMs using cross-layer transcoders. I also completed my Diplôme d'Ingénieur at EMINES (selected as best thesis, first-class honors) and interned at NUS, where I developed novel evaluation frameworks for LLMs inspired by social choice and economic theory.

I'm proudly Moroccan. I grew up across different parts of Morocco (Marrakech, then Elhajeb, Benguerir during my studies at EMINES, and now Tangier), each place leaving its mark on how I see the world. I love reading and am trying to rekindle the habit. I also write (mostly in French), sing in my spare time, and recently started learning photography, inspired by my father's wonderful tradition of making photo albums for us. I enjoy philosophizing about the world, whether alone or with my sister.

Research Interests

Problem solving and reasoning in AI
AI for scientific discovery and mathematics
Generalization and robustness in neural networks
Interpretability of deep neural networks

News

2026-07 Our paper Reasoning Fine-Tuning Induces Persistent Latent Policy States was accepted to the Conference on Language Modeling (COLM) 2026.
2026-06 Running the Theory of Generalization Reading Group this summer.
2026-04 Extremely honoured to have received the ETH AI Center Doctoral Fellowship!
2025-11 Started as Research Scientist at Martian Learning Inc.
2025-09 Defended Master's thesis (selected as best thesis) and graduated with first-class honors from EMINES.
2025-08 Extremely thrilled that our paper TinySQL was accepted to the Main Conference at EMNLP 2025!
2025-08 Gave a talk on Training Data Attribution methods at Amazon Tübingen.
2025-06 Multilingual representations paper accepted to the Actionable Interpretability Workshop @ ICML 2025.
2025-04 Started research internship at MPI-IS Tübingen under Zhijing Jin and Bernhard Schölkopf.
2024-09 Joined Martian Learning Inc. to work on mechanistic interpretability and LLM routing.
2024-04 Joined NUS as a research intern on LLM evaluation inspired by social choice and economic theory.

Selected Publications

Reasoning Fine-Tuning Induces Persistent Latent Policy States

Abir Harrasse, Michael Lan, Hunar Batra, Fateme Hashemi Chaleshtori, Chaithanya Bandi

COLM, 2026

Paper

Reasoning-specialized language models show large performance gains over base models, yet the internal changes responsible for improved multi-step reasoning remain poorly understood. We model chain-of-thought reasoning as a switching dynamical system (SDS) in which internal representations evolve under discrete latent policy states. Using time-aware contrastive representation learning and discrete regime discovery, we recover latent policies from activation trajectories. Across benchmarks and model scales from 1.5B to 32B, reasoning-fine-tuned models exhibit richer latent-policy organization than base models, with more differentiated transition structure and changes in state utilization, persistence, and mixing. The recovered regimes show functional specialization aligned with reasoning stages; causal interventions demonstrate their functional relevance, and SDS-guided pruning improves robustness. Together, these results suggest reasoning fine-tuning globally reorganizes latent dynamics, offering a new lens for mechanistic analysis and process-level control.

@article{harrasse2026reasoning,
  title={Reasoning Fine-Tuning Induces Persistent Latent Policy States},
  author={Harrasse, Abir and Lan, Michael and Batra, Hunar and Chaleshtori, Fateme Hashemi and Bandi, Chaithanya},
  journal={arXiv preprint arXiv:2607.18532},
  year={2026}
}

STRIDE: Training Data Attribution Can Be Estimated in Activation Space

Abir Harrasse, Rishit Dagli, Amirali Abdullah, Zhijing Jin

(Under review), 3rd Data-FM Workshop @ ICLR 2026

Paper Project page

Understanding which training examples drive specific model behaviors is central to debugging failures, investigating safety issues, and auditing deployed systems. However, existing attribution methods operate in parameter space, where costs grow rapidly with model size. Approximations enable scaling, but introduce overhead that limits low-latency and scalable deployment. STRIDE is a scalable framework that estimates influence directly in activation space, bypassing explicit parameter interactions. STRIDE learns low-rank steering operators that approximate the effect of retraining on data subsets by shifting internal representations. We then recover per-example influence scores by solving a regularized regression problem that decomposes these subset-level shifts. Experiments show that STRIDE accurately identifies influential examples and detects data leakage, outperforming prior methods while being orders of magnitude faster and more scalable.

@article{harrasse2026stride,
  title={STRIDE: Training Data Attribution Can Be Estimated in Activation Space},
  author={Harrasse, Abir and Dagli, Rishit and Abdullah, Amirali and Jin, Zhijing}
}

CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs

Florent Draye, Abir Harrasse, Vedant Palit, Tung-Yu Wu, Jiarui Liu, Punya Syon Pandey, Roderick Wu, Terry Jingchen Zhang, Zhijing Jin, Bernhard Schölkopf

arXiv:2603.21014

Paper Code

Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability.

@article{draye2026clt,
  title={CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs},
  author={Draye, Florent and Harrasse, Abir and Palit, Vedant and Wu, Tung-Yu and Liu, Jiarui and Pandey, Punya Syon and Wu, Roderick and Zhang, Terry Jingchen and Jin, Zhijing and Sch{\"o}lkopf, Bernhard},
  journal={arXiv preprint arXiv:2603.21014},
  year={2026}
}

Curveball Steering: The Right Direction To Steer Isn't Always Linear

Shivam Raval, Hae Jin Song, Linlin Wu, Abir Harrasse, Jeff M. Phillips, Fazl Barez, Amirali Abdullah

arXiv:2603.09313

Paper

Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose "Curveball steering", a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.

@article{raval2026curveball,
  title={Curveball Steering: The Right Direction To Steer Isn't Always Linear},
  author={Raval, Shivam and Song, Hae Jin and Wu, Linlin and Harrasse, Abir and Phillips, Jeff M. and Barez, Fazl and Abdullah, Amirali},
  journal={arXiv preprint arXiv:2603.09313},
  year={2026}
}

Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders

Abir Harrasse, Florent Draye, Punya Syon Pandey, Zhijing Jin, Bernhard Schölkopf

arXiv:2511.10840

Paper Code CLTs

Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance favor the dominant training language? To address this, we train models on different multilingual mixtures and analyze their internal mechanisms using Cross-Layer Transcoders (CLT) and Attribution Graphs. Our results reveal multilingual shared representations: the model employs highly similar features across languages, while language-specific decoding emerges in later layers. Training models without English shows identical multilingual shared space structures. Decoding relies partly on a small set of high-frequency features in the final layers, which linearly encode language identity from early layers. Intervening on these features allows one language to be suppressed and another substituted. Finally, to explain non-English failures, we perform a Model-Diffing experiment: underperformance arises from dim late-layer features, weak middle-layer clusters, and tokenizer bias toward English that forces early layers to specialize in word reassembly. Finetuning strengthens these features and their links, improving token assembly and language-specific decoding, providing a mechanistic explanation for multilingual gaps.

@article{harrasse2025tracing,
  title={Tracing Multilingual Representations in LLMs with Cross-Layer Transcoders},
  author={Harrasse, Abir and Draye, Florent and Pandey, Punya Syon and Jin, Zhijing and Sch{\"o}lkopf, Bernhard},
  journal={arXiv preprint arXiv:2511.10840},
  year={2025}
}

Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation

Abir Harrasse, Chaithanya Bandi, Hari Bandi

EACL (Oral), 2026

Paper Code

The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round-r gap concentrates around the true difference and the probability of misranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT-BENCH, ALIGNBENCH, and AUTO-J, showing state-of-the-art agreement with human judgments (accuracy and Cohen's κ), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost–accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost-aware LLM evaluation.

@inproceedings{harrasse2026debate,
  title={Debate, Deliberate, Decide (D3): A Cost-Aware Adversarial Framework for Reliable and Interpretable LLM Evaluation},
  author={Harrasse, Abir and Bandi, Chaithanya and Bandi, Hari},
  booktitle={Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={8376--8392},
  year={2026}
}

TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research

Abir Harrasse, Philip Quirke, Clement Neo, Dhruv Nathawani, Luke Marks, Amir Abdullah

EMNLP, 2025

Paper Website Code

Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset, progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.

@inproceedings{harrasse2025tinysql,
  title={TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research},
  author={Harrasse, Abir and Quirke, Philip and Neo, Clement and Nathawani, Dhruv and Marks, Luke and Abdullah, Amir},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  pages={29244--29272},
  year={2025}
}

Activation Space Interventions Can Be Transferred Between Large Language Models

Narmeen Fatimah Oozeer, Dhruv Nathawani, Nirmalendu Prakash, Michael Lan, Abir Harrasse, Amir Abdullah

ICML, 2025

Paper Code

The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.

@inproceedings{oozeer2025activation,
  title={Activation Space Interventions Can Be Transferred Between Large Language Models},
  author={Oozeer, Narmeen Fatimah and Nathawani, Dhruv and Prakash, Nirmalendu and Lan, Michael and Harrasse, Abir and Abdullah, Amir},
  booktitle={Forty-second International Conference on Machine Learning},
  year={2025},
  url={https://openreview.net/forum?id=HXOicJsmMQ}
}
