I am a researcher in the Cloud Reliability group at Microsoft Research in Redmond, WA.

Previously, I was tenure-track faculty at the Max Planck Institute for Software Systems (2018-2022), where I led the Cloud Software Systems group and held a dual appointment at Saarland University.

I received my PhD from Brown University (2011-2018) under the supervision of Rodrigo Fonseca, supported in part by a Facebook PhD Fellowship.

Research


My research focuses on designing and building reliable, observable, self-managing cloud systems. A central goal is to make large, complicated software systems easier to operate and their runtime behavior easier to understand. Currently, I am working at the intersection of observability, semantic modeling, and agentic AI.

Select Projects


Telemeta extracts and indexes semantic models from large-scale observability data, enabling accurate and reliable AI agents for cloud operations. This is an ongoing project I lead at Microsoft Research, so get in touch if you're interested in internships or collaborations!

Blueprint is an extensible compiler and benchmark suite for microservice applications. It simplifies prototyping by making it easy to reconfigure infrastructure choices without rewriting application code. Check out the project on GitHub.

Hindsight is a distributed tracing framework for edge-case tracing, i.e., capturing detailed traces for rare and outlier requests without the data loss of sampling-based systems. It combines per-node telemetry history, programmatic symptom detection, and rapid distributed retrieval. Hindsight appeared at NSDI 2023; code is on GitLab.

Clockwork is a DNN serving system designed for predictable performance. By eliminating sources of variability and centralizing scheduling and admission control, Clockwork achieves extremely tight tail latency. It received the Distinguished Artifact Award at OSDI 2020; code is on GitLab.

Pivot Tracing is a cross-component monitoring framework for distributed systems. Troubleshooting cross-component problems often requires information that is inaccessible due to a lack of end-to-end visibility. Pivot Tracing addresses this by combining causal metadata propagation with dynamic instrumentation, enabling operators to define, measure, and aggregate metrics across component boundaries using a simple SQL-like interface. It received the Best Paper Award at SOSP 2015; code is on GitHub.

Software


Research projects and code are scattered across a few locations, listed with the contact links at the end of this page.

Publications

2025
AIOpsLab: A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud
Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan
MLSys, 2025.
Fair, Practical, and Efficient Carbon Accounting for LLM Serving
Yueying Lisa Li, Leo Han, G Edward Suh, Christina Delimitrou, Fiodar Kazhamiaka, Esha Choukse, Rodrigo Fonseca, Liangcheng Yu, Jonathan Mace, Udit Gupta
ACM SIGMETRICS Performance Evaluation Review, 2025.
Generating Representative Macrobenchmark Microservice Systems from Distributed Traces with Palette
Vaastav Anand, Matheus Stolet, Jonathan Mace, Antoine Kaufmann
APSys, 2025.
Intent-based System Design and Operation
Vaastav Anand, Yichen Li, Alok Gautam Kumbhare, Celine Irvene, Chetan Bansal, Gagan Somashekar, Jonathan Mace, Pedro Las-Casas, Ricardo Bianchini, Rodrigo Fonseca
PACMI (co-located with SOSP), 2025.
Automated Service Design with Cerulean
Vaastav Anand, Alok Gautam Kumbhare, Celine Irvene, Chetan Bansal, Gagan Somashekar, Jonathan Mace, Pedro Las-Casas, Rodrigo Fonseca
AIOps Workshop, 2025.
Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models
Yile Gu, Yuxuan Xiong, Jonathan Mace, Yong Jiang, Yulong Hu, Baris Kasikci, Peng Cheng
arXiv preprint, 2025. [PDF]
2024
Building AI Agents for Autonomous Clouds: Challenges and Design Principles
Manish Shetty, Yinfang Chen, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Pedro Las-Casas, Shachee Gupta, Suman Nath, Chetan Bansal, Saravan Rajmohan
SoCC, 2024.
If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems
Bogdan Stoica, Utsav Sethi, Yiming Su, Cyrus Zhou, Shan Lu, Jonathan Mace, Madan Musuvathi, Suman Nath
SOSP, 2024.
Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, Jonathan Mace
arXiv preprint, 2024. [PDF]
2023
Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications
Vaastav Anand, Deepak Garg, Antoine Kaufmann, Jonathan Mace
SOSP, 2023.
Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications
João Loff, Daniel Porto, João Garcia, Jonathan Mace, Rodrigo Rodrigues
SOSP, 2023.
Detection Is Better Than Cure: A Cloud Incidents Perspective
Varsha Ganatra, Aditya Parayil, Saurabh Ghosh, Yingnong Kang, Minghua Ma, Chetan Bansal, Suman Nath, Jonathan Mace
ESEC/FSE, 2023.
The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems
Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, Jonathan Mace
NSDI, 2023. [PDF]
Groundhog: Reconciling Efficiency and Request Isolation in FaaS
Mohamed Alzayat, Jonathan Mace, Peter Druschel, Deepak Garg
EuroSys, 2023. [PDF]
A Qualitative Interview Study of Distributed Tracing Visualisation: A Characterisation of Challenges and Opportunities
Thomas Davidson, Emily Wall, Jonathan Mace
IEEE Transactions on Visualization and Computer Graphics, February 2023. [PDF]
The Odd One Out: Energy is Not Like Other Metrics
Vaastav Anand, Zhiqiang Xie, Matheus Stolet, Roberta De Viti, Thomas Davidson, Reyhaneh Karimipour, Safya Alzayat, Jonathan Mace
ACM SIGENERGY Energy Informatics Review, October 2023.
2022
See it to Believe it? The Role of Visualisation in Systems Research
Thomas Davidson, Jonathan Mace
SoCC, 2022. [PDF]
The Odd One Out: Energy is not like Other Metrics
Vaastav Anand, Zhiqiang Xie, Matheus Stolet, Roberta De Viti, Thomas Davidson, Reyhaneh Karimipour, Safya Alzayat, Jonathan Mace
HotCarbon, 2022. [PDF]
ACT now: Aggregate Comparison of Traces for Incident Localization
Kamala Ramasubramanian, Ashutosh Raina, Jonathan Mace, Peter Alvaro
arXiv preprint, 2022. [PDF]
2020
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
OSDI, 2020. [PDF]
Distinguished Artifact Award
Distributed Tracing in Practice
Austin Parker, Daniel Spoonhower, Jonathan Mace, Rebecca Isaacs
Book, Published July 2020.
2019
Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
Pedro Las-Casas, Giorgi Papakerashvili, Vaastav Anand, Jonathan Mace
SoCC, 2019. [PDF]
No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy
Amit Samanta, Suhas Shrinivasan, Antoine Kaufmann, Jonathan Mace
arXiv preprint, 2019. [PDF]
2018
Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay
Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, Rodrigo Fonseca
SoCC, 2018. [PDF]
A Universal Architecture for Cross-Cutting Tools in Distributed Systems
Jonathan Mace
Ph.D. Thesis, Brown University, 2018. [PDF]
Dennis M. Ritchie Doctoral Dissertation Award, Honorable Mention
Universal Context Propagation for Distributed System Instrumentation
Jonathan Mace, Rodrigo Fonseca
EuroSys, 2018. [PDF]
2017
Canopy: An End-to-End Performance Tracing And Analysis System
Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song
SOSP, 2017. [PDF]
2016
Principled Workflow-Centric Tracing of Distributed Systems
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
SoCC, 2016. [PDF]
2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services
Jonathan Mace, Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan
SIGCOMM, 2016. [PDF]
2015
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Jonathan Mace, Ryan Roelke, Rodrigo Fonseca
SOSP, 2015. [PDF]
Best Paper Award
We are Losing Track: a Case for Causal Metadata in Distributed Systems
Rodrigo Fonseca, Jonathan Mace
HPTS, 2015. [PDF]
Retro: Targeted Resource Management in Multi-Tenant Distributed Systems
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
NSDI, 2015. [PDF]
2014
Towards General-Purpose Resource Management in Shared Cloud Services
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
HotDep, 2014. [PDF]
2013
Revisiting End-to-End Trace Comparison with Graph Kernels
Jonathan Mace, Rodrigo Fonseca
MSc Project, Brown University, 2014. [PDF]

Supervised Theses

I have been fortunate to work with many talented students over the years. In particular, I advised or co-advised the following theses:

Powering Accurate Aggregate Analysis with Representative Distributed Tracing
Reyhaneh Karimipour
Masters Thesis, Saarland University, 2022. [PDF]
Using Reinforcement Learning for Low-Latency High-Throughput Request Scheduling
Safya Alzayat
Masters Thesis, Saarland University, 2022. [PDF]
Efficient DNN Serving: Evaluating the feasibility of FPGAs for multi-tenant model serving
Erasmus Masters Thesis, Pazmany Peter Catholic University, 2021. [PDF]
Pathfinder: Exploiting Inter-Thread Communication for Request Flow Instrumentation
Nicolas Schäfer
Masters Thesis, Saarland University, 2020. [PDF]
General Baggage Model for End-to-End Tracing and Its Application on Critical Path Analysis
Hongkai Sun
Masters Thesis, Brown University, 2016. [PDF]
End-to-End Tracing Models: Analysis and Unification
Jonathan Leavitt
Undergraduate Thesis, Brown University, 2014. [PDF]

jonathan.c.mace at gmail or jonathanmace at microsoft

github.com/JonathanMace

github.com/brownsys (projects from my time at Brown)

gitlab.mpi-sws.org/cld (projects from my time at MPI-SWS)

Google Scholar

DBLP