/ʕæˈbiːr hɑˈrɑːs/
Hello! I'm a Research Scientist at Martian Learning Inc. working on mechanistic interpretability and LLM reasoning. I'm drawn to understanding intelligence through deep neural networks - how they work, when they generalize, and how to make them reliable.
Starting Fall 2026, I'll pursue a PhD at ETH Zürich (D-INFK) with Prof. Mrinmaya Sachan and Prof. Bjoern Menze, generously supported by the ETH AI Center Doctoral Fellowship. This is an incredible opportunity to dive deeper into understanding and improving deep neural networks.
Before that, I was a research intern at MPI-IS Tübingen under Zhijing Jin and Bernhard Schölkopf, where I worked on understanding multilingual representations in LLMs using cross-layer transcoders. I also completed my Diplôme d'Ingénieur at EMINES (selected as best thesis, first-class honors), and interned at NUS, where I developed novel evaluation frameworks for LLMs inspired by social choice and economic theory. At Martian Learning Inc., I co-authored research on LLM routing and mechanistic interpretability through the TinySQL benchmark.
I'm proudly Moroccan. I've grown up across different parts of Morocco - Marrakech, then Elhajeb, Benguerir during my studies at EMINES, and now Tangier - each place leaving its mark on how I see the world. I love reading and am trying to rekindle the habit. I also write (mostly in French), sing in my spare time, and recently started learning photography, inspired by my father's wonderful tradition of making photo albums for us. I enjoy philosophizing about the world, whether alone or with my sister.
News
For a complete and up-to-date list of publications, check my Google Scholar.
Understanding which training examples drive specific model behaviors is central to debugging failures, investigating safety issues, and auditing deployed systems. However, existing attribution methods operate in parameter space, where costs grow rapidly with model size. Approximations enable scaling, but introduce overhead that limits low-latency and scalable deployment. STRIDE is a scalable framework that estimates influence directly in activation space, bypassing explicit parameter interactions. STRIDE learns low-rank steering operators that approximate the effect of retraining on data subsets by shifting internal representations. We then recover per-example influence scores by solving a regularized regression problem that decomposes these subset-level shifts. Experiments show that STRIDE accurately identifies influential examples and detects data leakage, outperforming prior methods while being orders of magnitude faster and more scalable.
@article{harrasse2026stride,
title={STRIDE: Training data attribution can be estimated in activation space},
author={Harrasse, Abir and Dagli, Rishit and Abdullah, Amirali and Jin, Zhijing}
}
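The regression step mentioned in the STRIDE abstract above can be illustrated with a toy sketch. This is not the paper's implementation: the scalar per-subset "shift", the binary membership matrix, the ridge penalty `lam`, and all function names are assumptions chosen for illustration. It only shows how subset-level measurements could be decomposed into per-example scores with a regularized least-squares fit.

```python
# Illustrative sketch only: names, shapes, and the scalar "shift" are
# assumptions, not the STRIDE implementation.

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides together.
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_influence(membership, shifts, lam=1e-3):
    """Per-example scores w minimizing ||M w - y||^2 + lam * ||w||^2."""
    n_examples = len(membership[0])
    # Normal equations: (M^T M + lam I) w = M^T y.
    gram = [
        [
            sum(row[i] * row[j] for row in membership) + (lam if i == j else 0.0)
            for j in range(n_examples)
        ]
        for i in range(n_examples)
    ]
    rhs = [
        sum(row[i] * y for row, y in zip(membership, shifts))
        for i in range(n_examples)
    ]
    return solve_linear(gram, rhs)


# Toy data: 4 training examples, 6 subsets (every pair of examples).
# Example 2 is constructed to be the most influential one.
true_influence = [0.1, 0.0, 2.0, -0.3]
membership = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
# Assumed additive model: a subset's shift is the sum of its members' influence.
shifts = [sum(m * t for m, t in zip(row, true_influence)) for row in membership]

scores = ridge_influence(membership, shifts)
top_example = max(range(len(scores)), key=lambda i: scores[i])
```

On this noiseless toy problem the recovered scores stay close to `true_influence`, and `top_example` picks out the example constructed to matter most. Real subset shifts would be vectors in activation space and far noisier, which is where the regularizer would matter.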
Mechanistic interpretability seeks to understand how Large Language Models (LLMs) represent and process information. Recent approaches based on dictionary learning and transcoders enable representing model computation in terms of sparse, interpretable features and their interactions, giving rise to feature attribution graphs. However, these graphs are often large and redundant, limiting their interpretability in practice. Cross-Layer Transcoders (CLTs) address this issue by sharing features across layers while preserving layer-specific decoding, yielding more compact representations, but remain difficult to train and analyze at scale. We introduce an open-source library for end-to-end training and interpretability of CLTs. Our framework integrates scalable distributed training with model sharding and compressed activation caching, a unified automated interpretability pipeline for feature analysis and explanation, attribution graph computation using Circuit-Tracer, and a flexible visualization interface. This provides a practical and unified solution for scaling CLT-based mechanistic interpretability.
@article{draye2026clt,
title={CLT-Forge: A Scalable Library for Cross-Layer Transcoders and Attribution Graphs},
author={Draye, Florent and Harrasse, Abir and Palit, Vedant and Wu, Tung-Yu and Liu, Jiarui and Pandey, Punya Syon and Wu, Roderick and Zhang, Terry Jingchen and Jin, Zhijing and Sch{\"o}lkopf, Bernhard},
journal={arXiv preprint arXiv:2603.21014},
year={2026}
}
Activation steering is a widely used approach for controlling large language model (LLM) behavior by intervening on internal representations. Existing methods largely rely on the Linear Representation Hypothesis, assuming behavioral attributes can be manipulated using global linear directions. In practice, however, such linear interventions often behave inconsistently. We question this assumption by analyzing the intrinsic geometry of LLM activation spaces. Measuring geometric distortion via the ratio of geodesic to Euclidean distances, we observe substantial and concept-dependent distortions, indicating that activation spaces are not well-approximated by a globally linear geometry. Motivated by this, we propose "Curveball steering", a nonlinear steering method based on polynomial kernel PCA that performs interventions in a feature space, better respecting the learned activation geometry. Curveball steering consistently outperforms linear PCA-based steering, particularly in regimes exhibiting strong geometric distortion, suggesting that geometry-aware, nonlinear steering provides a principled alternative to global, linear interventions.
@article{raval2026curveball,
title={Curveball Steering: The Right Direction To Steer Isn't Always Linear},
author={Raval, Shivam and Song, Hae Jin and Wu, Linlin and Harrasse, Abir and Phillips, Jeff and Abdullah, Amirali},
journal={arXiv preprint arXiv:2603.09313},
year={2026}
}
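The distortion measure named in the Curveball abstract above, the ratio of geodesic to Euclidean distance, can be sketched on toy data. The abstract does not say how geodesics are estimated, so the k-nearest-neighbour graph with shortest paths used here (in the style of Isomap) is an assumption, as are the function names and the synthetic points standing in for activations.

```python
import heapq
import math


def knn_graph(points, k):
    """Connect each point to its k nearest neighbours (symmetrised)."""
    graph = {i: {} for i in range(len(points))}
    for i, p in enumerate(points):
        nearest = sorted(
            (math.dist(p, q), j) for j, q in enumerate(points) if j != i
        )[:k]
        for d, j in nearest:
            graph[i][j] = d
            graph[j][i] = d
    return graph


def geodesic(graph, source, target):
    """Shortest-path length between two nodes (Dijkstra's algorithm)."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, weight in graph[node].items():
            cand = dist + weight
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return math.inf  # target not reachable


def distortion(points, i, j, k=2):
    """Ratio of graph geodesic to straight-line distance between i and j."""
    return geodesic(knn_graph(points, k), i, j) / math.dist(points[i], points[j])


# Toy "activations" on a unit semicircle: the path along the arc has
# length close to pi, while the chord between the endpoints has length 2.
n = 200
points = [
    (math.cos(math.pi * t / (n - 1)), math.sin(math.pi * t / (n - 1)))
    for t in range(n)
]
ratio = distortion(points, 0, n - 1)
```

For a flat, linear arrangement of points the ratio would be 1. On the semicircle it comes out close to π/2, roughly 1.57, which is the kind of departure from 1 the abstract describes as geometric distortion.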
Multilingual Large Language Models (LLMs) can process many languages, yet how they internally represent this diversity remains unclear. Do they form shared multilingual representations with language-specific decoding, and if so, why does performance favor the dominant training language? To address this, we train models on different multilingual mixtures and analyze their internal mechanisms using Cross-Layer Transcoders (CLT) and Attribution Graphs. Our results reveal multilingual shared representations: the model employs highly similar features across languages, while language-specific decoding emerges in later layers. Training models without English shows identical multilingual shared space structures. Decoding relies partly on a small set of high-frequency features in the final layers, which linearly encode language identity from early layers. Intervening on these features allows one language to be suppressed and another substituted. Finally, to explain non-English failures, we perform a Model-Diffing experiment: underperformance arises from dim late-layer features, weak middle-layer clusters, and tokenizer bias toward English that forces early layers to specialize in word reassembly. Finetuning strengthens these features and their links, improving token assembly and language-specific decoding, providing a mechanistic explanation for multilingual gaps.
@article{harrasse2025tracing,
title={Tracing Multilingual Representations in {LLMs} with Cross-Layer Transcoders},
author={Harrasse, Abir and Draye, Florent and Pandey, Punya Syon and Jin, Zhijing and Sch{\"o}lkopf, Bernhard},
journal={arXiv preprint arXiv:2511.10840},
year={2025}
}
The evaluation of Large Language Models (LLMs) remains challenging due to inconsistency, bias, and the absence of transparent decision criteria in automated judging. We present Debate, Deliberate, Decide (D3), a cost-aware, adversarial multi-agent framework that orchestrates structured debate among role-specialized agents (advocates, a judge, and an optional jury) to produce reliable and interpretable evaluations. D3 instantiates two complementary protocols: (1) Multi-Advocate One-Round Evaluation (MORE), which elicits k parallel defenses per answer to amplify signal via diverse advocacy, and (2) Single-Advocate Multi-Round Evaluation (SAMRE) with budgeted stopping, which iteratively refines arguments under an explicit token budget and convergence checks. We develop a probabilistic model of score gaps that (i) characterizes reliability and convergence under iterative debate and (ii) explains the separation gains from parallel advocacy. Under mild assumptions, the posterior distribution of the round-r gap concentrates around the true difference and the probability of misranking vanishes; moreover, aggregating across k advocates provably increases expected score separation. We complement theory with a rigorous experimental suite across MT-BENCH, ALIGNBENCH, and AUTO-J, showing state-of-the-art agreement with human judgments (accuracy and Cohen's κ), reduced positional and verbosity biases via anonymization and role diversification, and a favorable cost–accuracy frontier enabled by budgeted stopping. Ablations and qualitative analyses isolate the contributions of debate, aggregation, and anonymity. Together, these results establish D3 as a principled, practical recipe for reliable, interpretable, and cost-aware LLM evaluation.
@inproceedings{harrasse2026debate,
title={Debate, Deliberate, Decide ({D3}): A Cost-Aware Adversarial Framework for Reliable and Interpretable {LLM} Evaluation},
author={Harrasse, Abir and Bandi, Chaithanya and Bandi, Hari},
booktitle={Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)},
pages={8376--8392},
year={2026}
}
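The D3 abstract above describes SAMRE as iterating under an explicit token budget with convergence checks. A minimal sketch of such a loop follows. The concrete stopping rule (a tolerance on the change in score gap, held for `patience` consecutive rounds), the fixed per-round token cost, and the stub judge are all assumptions made for illustration and are not taken from the paper.

```python
def budgeted_debate(score_round, token_budget, tokens_per_round, tol=0.05, patience=2):
    """Run debate rounds until the score gap stabilises or the budget runs out.

    score_round(r) is a hypothetical callable returning (score_a, score_b),
    the judge's scores for the two answers after round r.
    """
    spent = 0
    stable = 0
    gaps = []
    while spent + tokens_per_round <= token_budget:
        spent += tokens_per_round
        score_a, score_b = score_round(len(gaps) + 1)
        gap = score_a - score_b
        # Convergence check: has the gap stopped moving between rounds?
        if gaps and abs(gap - gaps[-1]) < tol:
            stable += 1
        else:
            stable = 0
        gaps.append(gap)
        if stable >= patience:
            break
    if not gaps or gaps[-1] == 0:
        winner = "tie"
    else:
        winner = "A" if gaps[-1] > 0 else "B"
    return winner, len(gaps), spent, gaps


def stub_judge(r):
    """Stand-in for an LLM judge: the gap approaches 1.0 geometrically."""
    return 7.0 + (1.0 - 0.5 ** r), 7.0


winner, rounds, spent, gaps = budgeted_debate(
    stub_judge, token_budget=10_000, tokens_per_round=800
)
```

With this stub, the loop stops after six rounds and 4,800 tokens, well under the 10,000-token budget, because the gap changes by less than `tol` in two consecutive rounds. With a tighter budget, the loop would instead stop when another round no longer fits.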
Mechanistic interpretability research faces a gap between analyzing simple circuits in toy tasks and discovering features in large models. To bridge this gap, we propose text-to-SQL generation as an ideal task to study, as it combines the formal structure of toy tasks with real-world complexity. We introduce TinySQL, a synthetic dataset, progressing from basic to advanced SQL operations, and train models ranging from 33M to 1B parameters to establish a comprehensive testbed for interpretability. We apply multiple complementary interpretability techniques, including Edge Attribution Patching and Sparse Autoencoders, to identify minimal circuits and components supporting SQL generation. We compare circuits for different SQL subskills, evaluating their minimality, reliability, and identifiability. Finally, we conduct a layerwise logit lens analysis to reveal how models compose SQL queries across layers: from intent recognition to schema resolution to structured generation. Our work provides a robust framework for probing and comparing interpretability methods in a structured, progressively complex setting.
@inproceedings{harrasse2025tinysql,
title={TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research},
author={Harrasse, Abir and Quirke, Philip and Neo, Clement and Nathawani, Dhruv and Marks, Luke and Abdullah, Amir},
booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
pages={29244--29272},
year={2025}
}
The study of representation universality in AI models reveals growing convergence across domains, modalities, and architectures. However, the practical applications of representation universality remain largely unexplored. We bridge this gap by demonstrating that safety interventions can be transferred between models through learned mappings of their shared activation spaces. We demonstrate this approach on two well-established AI safety tasks: backdoor removal and refusal of harmful prompts, showing successful transfer of steering vectors that alter the models' outputs in a predictable way. Additionally, we propose a new task, corrupted capabilities, where models are fine-tuned to embed knowledge tied to a backdoor. This tests their ability to separate useful skills from backdoors, reflecting real-world challenges. Extensive experiments across Llama, Qwen and Gemma model families show that our method enables using smaller models to efficiently align larger ones. Furthermore, we demonstrate that autoencoder mappings between base and fine-tuned models can serve as reliable "lightweight safety switches", allowing dynamic toggling between model behaviors.
@inproceedings{oozeer2025activation,
title={Activation Space Interventions Can Be Transferred Between Large Language Models},
author={Oozeer, Narmeen Fatimah and Nathawani, Dhruv and Prakash, Nirmalendu and Lan, Michael and Harrasse, Abir and Abdullah, Amir},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=HXOicJsmMQ}
}