I am a researcher in the Cloud Reliability group at Microsoft Research in Redmond, WA.
Previously, I was tenure-track faculty at the Max Planck Institute for Software Systems (2018-2022), where I led the Cloud Software Systems group and held a dual appointment at Saarland University.
I received my PhD from Brown University (2011-2018) under the supervision of Rodrigo Fonseca, supported in part by a Facebook PhD Fellowship.
Research
My research focuses on designing and building reliable, observable, self-managing cloud systems. A central goal for me is to make it easier to operate large, complicated software systems, and to understand their behavior at runtime. Currently I am working at the intersection of observability, semantic modeling, and agentic AI.
Select Projects
Telemeta. Large-scale telemetry lacks semantic context: metric names are opaque; table relationships are implicit; tables are overloaded; and data is unnormalized. Making effective use of telemetry requires extensive discovery: understanding where data lives, what form it takes, and what rules apply to extract it. Telemeta extracts and indexes semantic models from metric data - what metrics measure, what columns represent, how tables relate, and hidden assumptions about data formats - to give agents a structured foundation for interpreting telemetry.
ℹ️ This is an ongoing project I lead at Microsoft.
Hindsight is a distributed tracing framework for retroactive sampling, i.e. capturing detailed traces for rare and outlier requests without the data loss of tail sampling-based systems. Hindsight revisits the design assumptions behind tail sampling: we found that the overhead of trace collection is not inherent to tracing itself, but comes primarily from transmitting, indexing, and storing traces. Local capture is cheap. Hindsight exploits this by recording all traces into lightweight local ring buffers and persisting them only when a downstream trigger fires (e.g., an error or SLO violation). The key insight is that you can defer the sampling decision until after you know what matters, as long as local storage is cheap and triggers are fast. We built Hindsight from the ground up; its code is open-sourced (NSDI 2023; GitLab).
Blueprint is an extensible compiler and benchmark suite for microservice applications. Experimenting with end-to-end ideas in microservices requires changing scaffolding across many services - a tedious, time-consuming task. Blueprint separates workflow logic from infrastructure choices, letting researchers swap RPC frameworks, enable tracing, or add replication with a few lines of config rather than hundreds of lines of code (SOSP 2023; GitHub).
Clockwork is a DNN serving system designed for predictable performance. By eliminating sources of variability and centralizing scheduling and admission control, Clockwork achieves extremely tight tail latency. It received the Distinguished Artifact Award at OSDI 2020, and its code is open-sourced (OSDI 2020; GitLab).
Pivot Tracing is a cross-component monitoring framework for distributed systems. Filtering or aggregating metrics across component boundaries requires associating events in one location with shared keys from another. Pivot Tracing enables such queries without per-query instrumentation by propagating tuples alongside requests, using a novel baggage abstraction for general-purpose context propagation. Baggage acts as a narrow waist for cross-cutting tools: many tools share how they propagate context but differ in what they carry and how they interpret it. Pivot Tracing received the Best Paper Award at SOSP 2015; code is on GitHub. Today, baggage is part of the OpenTelemetry standard.
Software
Research projects and code are scattered across a few locations:
- github.com/JonathanMace (personal projects)
- gitlab.mpi-sws.org/cld (MPI-SWS projects)
- github.com/brownsys (Brown University projects)
- github.com/tracingplane (Brown University projects)
My work is first and foremost motivated by the scale and complexity of large, modern cloud and internet applications, and the difficulty of achieving end-to-end reliability.
The perspective I've always been drawn to is that of end-to-end behavior: requests in distributed systems traverse machines, processes, queues, and backends, and a failure at any point can cascade into end-to-end impact. How do we observe, and ultimately improve, end-to-end behavior in such systems? Much of my work has orbited distributed tracing, touching topics both direct (e.g., tracing mechanisms) and indirect (e.g., downstream tasks like root-cause analysis).
Beyond tracing, my research asks more broadly how to achieve end-to-end reliability, robustness, and performance by design. Over the years I've built systems exploring abstractions and design principles for, e.g., multi-tenant isolation, predictable performance, and causal consistency. And of course, as often happens in systems research, resource management and scheduling keep resurfacing!
Distributed Tracing
A few highlights from my work on distributed tracing:
- Standards. From around 2015 I worked with the Distributed Tracing Working Group to help initiate the OpenTracing standard; I later served on its Industrial Advisory Board. OpenTracing has since merged into OpenTelemetry. That experience fed into Distributed Tracing in Practice (O'Reilly, 2020), co-authored with colleagues from Lightstep.
- Baggage. In Pivot Tracing I proposed baggage, a general-purpose context propagation mechanism (SOSP 2015, Best Paper Award). Baggage acts as a narrow waist for cross-cutting tools: many tools share how they propagate context but differ in what they carry and how they interpret it. After Pivot Tracing I refined the design and explored additional use cases in a follow-up paper (EuroSys 2018) and in my PhD thesis, which received an Honorable Mention for the 2018 Dennis M. Ritchie Doctoral Dissertation Award. Several subsequent projects have built on this substrate (see below). Today, baggage is part of the OpenTelemetry standard.
- Tracing at Facebook. During my PhD I spent time working at Facebook on Canopy, their end-to-end performance tracing and analysis system (SOSP 2017). Canopy contrasts with other distributed tracing work in interesting ways: it collected on the order of a billion traces per day, and at that scale, semantic heterogeneity became a first-order design concern. Engineers instrument in different ways, and the pipeline still has to unify those choices end-to-end, including trace-derived aggregates.
- Consuming Traces. Most tracing research (including my own) focuses on capture mechanisms; what happens after traces land is often an afterthought. To better understand practitioners' actual workflows, we interviewed engineers about challenges and use-cases (IEEE TVCG 2023). This followed from work where we found that systems researchers frequently omit or elide user-facing concerns when designing systems (SoCC 2022).
Cross-Cutting Tools
A recurring theme in my work is cross-cutting tools: tools that need to span components and layers because the property they measure or enforce is end-to-end. Tracing is one instance, but the same pattern shows up elsewhere, such as:
- Resource Management. Multi-tenant isolation requires coordinating resource policies across distributed components, but scheduling decisions are local. Retro propagated tenant IDs in-band so that each component could enforce per-tenant limits consistently (NSDI 2015).
- Cross-Component Metrics. Filtering or aggregating metrics across component boundaries requires associating events with shared keys. Pivot Tracing propagated tuples/tags to enable such queries without per-query instrumentation (SOSP 2015).
- Cross-Service Causal Consistency. Independent backends can violate causal ordering even when each is internally consistent. Antipode propagated lineage-style context with requests to enforce a causal consistency model across services (SOSP 2023).
- Critical-Path Analysis. Identifying the critical path of a distributed request requires propagating slack and timing measurements online. CPath does this; I co-advised the project, which formed the basis of George Sun's MSc thesis (PDF).
Across these projects, Baggage serves as shared propagation infrastructure. It does not eliminate deployment effort, but it concentrates that effort into a reusable mechanism rather than repeating it for each tool.
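To make the narrow-waist idea concrete, here is a minimal sketch in Python. It is illustrative only: real baggage implementations serialize key-value pairs into RPC headers, whereas here a `ContextVar` stands in for propagation along a simulated request path, and the two "tools" (a Retro-style tenant attribution and a Pivot-Tracing-style metric grouping) are hypothetical stand-ins that share the mechanism but differ in what they carry.

```python
from contextvars import ContextVar

# Hypothetical baggage: one propagation mechanism, tool-specific contents.
BAGGAGE: ContextVar[dict] = ContextVar("baggage", default={})

def set_baggage(key, value):
    bag = dict(BAGGAGE.get())
    bag[key] = value
    BAGGAGE.set(bag)

def get_baggage(key):
    return BAGGAGE.get().get(key)

# Backend-side tool: attribute disk writes to whichever tenant the
# propagated baggage says originated the request.
tenant_bytes = {}
def record_disk_write(nbytes):
    tenant = get_baggage("tenant_id") or "unknown"
    tenant_bytes[tenant] = tenant_bytes.get(tenant, 0) + nbytes

# Frontend-side: set the key once; downstream components consume it.
def handle_request(tenant_id, payload):
    set_baggage("tenant_id", tenant_id)   # set at the frontend...
    record_disk_write(len(payload))       # ...consumed at the backend

handle_request("tenantA", b"x" * 100)
handle_request("tenantB", b"y" * 50)
handle_request("tenantA", b"z" * 25)
print(tenant_bytes)  # {'tenantA': 125, 'tenantB': 50}
```

The point of the sketch is the division of labor: `set_baggage`/`get_baggage` are the shared substrate, while each tool decides what keys to carry and how to interpret them.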
Tracing Instrumentation
Instrumentation is the unglamorous bottleneck of tracing. I have instrumented many systems for context propagation - Hadoop, HDFS, Spark, HBase, microbenchmarks, and more - and that hands-on work shaped much of my understanding of what is realistic to expect from engineers.
- During Retro and Pivot Tracing, I developed aspect-oriented techniques for automatic instrumentation of Java codebases. The approach was not exhaustive but covered common patterns effectively and provided a starting point for manual refinement (code).
- Ironically, the very first iteration of Baggage was built to enable Retro, so that I could iterate on context-propagation ideas without re-instrumenting or rebuilding from scratch each time.
- We also studied inter-thread communication patterns to identify candidate instrumentation points; that work is captured in Nicolas Schäfer's MSc thesis (PDF).
- Getting instrumentation into production systems is as much an organizational challenge as a technical one. I discuss these issues in Chapter 7 of my PhD thesis (PDF). With the recent progress in LLMs and coding agents, automated instrumentation and trace-correctness validation may soon become practical. It is only a matter of time...
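The aspect-oriented approach mentioned above used pointcut-style matching over Java codebases; the following is a loose Python analog, not the actual tooling. A decorator wraps methods whose names match a pattern (the "pointcut"), emitting entry/exit events without touching method bodies. The class and method names are invented for illustration.

```python
import functools

TRACE = []  # collected events; stand-in for a tracing backend

def traced(fn):
    """Wrap a function to emit entry/exit events without touching its body."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        TRACE.append(("enter", fn.__qualname__))
        try:
            return fn(*args, **kwargs)
        finally:
            TRACE.append(("exit", fn.__qualname__))
    return wrapper

def instrument_class(cls, pattern="handle_"):
    """Auto-instrument all methods matching a naming pattern - the 'pointcut'."""
    for name, attr in list(vars(cls).items()):
        if callable(attr) and name.startswith(pattern):
            setattr(cls, name, traced(attr))
    return cls

@instrument_class
class DataNode:
    def handle_read(self, block):
        return f"data:{block}"
    def checksum(self, data):   # does not match the pattern; left alone
        return sum(data.encode()) % 256

node = DataNode()
node.handle_read("blk_1")
print(TRACE)  # [('enter', 'DataNode.handle_read'), ('exit', 'DataNode.handle_read')]
```

As in the Java version, pattern-based matching is not exhaustive, but it covers common request-handling conventions and leaves a clear seam for manual refinement.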
Trace Sampling
Tracing faces a fundamental trade-off between data volume and cost: compute, storage, network, and operational complexity all scale with the data retained. There is no one-size-fits-all approach to sampling, as different data suits different use cases. I've looked at sampling and retention through a few lenses:
- Tail-based outlier sampling. My early work explored unsupervised techniques to bias sampling toward unusual traces; the goal was to detect outliers without hand-engineering features, despite highly heterogeneous and imbalanced traces. The techniques explored include graph kernels (2014 MSc project), clustering (SoCC 2018), and autoencoders (SoCC 2019).
- Retroactive sampling. Tail sampling has always struck me as a bitter compromise: by the time you decide a request is interesting, you have already lost the trace data you need. Hindsight (described above) defers the sampling decision until after you know what matters: it records all traces into lightweight local ring buffers and persists them only when a downstream trigger fires, such as an error or SLO violation (NSDI 2023; GitLab).
- Sampling for derived metrics. Trace-derived aggregate metrics can become biased under non-uniform head-based sampling. We explored this topic without publishing a paper; our notes and proposed directions appear in Reyhaneh Karimipour's MSc thesis (PDF).
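The retroactive-sampling idea can be sketched in a few lines. This is a toy model, not Hindsight's implementation: the capacity, trigger predicate, and span format are all invented, and real Hindsight manages shared-memory buffers across processes. The core loop is the same, though: record everything locally, evict old data automatically, and persist only on a trigger.

```python
from collections import deque

class RetroactiveTracer:
    """Toy model: record all spans into a bounded local ring buffer;
    persist to the backend only when a trigger fires for a trace."""

    def __init__(self, capacity=4):
        self.ring = deque(maxlen=capacity)  # old spans evicted automatically
        self.persisted = []                 # stand-in for the trace backend

    def record(self, trace_id, span):
        self.ring.append((trace_id, span))  # always on, always cheap

    def trigger(self, trace_id):
        """e.g. an error or SLO violation: persist whatever survives locally."""
        self.persisted.extend(s for tid, s in self.ring if tid == trace_id)

t = RetroactiveTracer(capacity=4)
for i in range(6):                 # six spans; the ring keeps only the last four
    t.record("req-%d" % (i % 2), "span-%d" % i)
t.trigger("req-1")                 # req-1 turned out to be interesting
print(t.persisted)  # ['span-3', 'span-5']
```

The sketch also shows the failure mode the design must manage: spans evicted before the trigger fires (here, `span-1`) are gone, so buffer sizing and trigger latency are the real engineering constraints.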
Telemetry for Models and Agents
Models and agents are increasingly consumers of telemetry, not just humans. This raises concrete systems questions: how should we design and integrate those agents? What representations and interfaces enable automated tools to use telemetry effectively? Several recent projects explore this direction:
- Telemeta. As described above, Telemeta extracts and indexes semantic models from metric data - what metrics measure, what columns represent, how tables relate, and hidden assumptions about data formats - to give agents a structured foundation for interpreting telemetry. This is an ongoing project I lead at Microsoft.
- Causal Discovery. Causal reasoning is a powerful technique for root-cause analysis when metrics have known causal relationships. The challenge is constructing the causal graph: data-driven discovery struggles with cloud telemetry because incidents are rare and heterogeneous. In Atlas, we sidestepped this by using LLMs to extract causal graphs from system documentation rather than from observational data, thus treating graph construction as knowledge extraction. Atlas decomposes a system into LLM agents that interpret component descriptions and identify pairwise causal relationships (arXiv 2024).
- AIOps Benchmarking. Evaluating AI agents on cloud-operations tasks requires realistic, controlled environments. AIOpsLab is a benchmark framework that deploys live microservice applications, injects realistic faults, and provides agents with a controlled interface to observe telemetry and take actions - supporting tasks like detection, localization, root-cause analysis, and mitigation (MLSys 2025).
- Retry Bugs. Retry logic is notoriously hard to test and frequently breaks in production. We combined LLMs with static analysis and fault injection to automatically detect retry bugs, finding over 100 previously unknown issues across major distributed systems (SOSP 2024).
- Time-Series Anomaly Detection. Black-box anomaly-detection algorithms are hard to inspect and validate. We used LLMs to generate anomaly-detection programs for time-series metrics; programs are human-readable and can be reviewed before deployment (arXiv 2025).
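To make the Telemeta entry above concrete, here is a toy sketch of the kind of semantic index an agent could query before touching raw telemetry. Everything here is hypothetical - the class names, fields, and example metric are invented for illustration and do not describe Telemeta's actual design.

```python
from dataclasses import dataclass, field

@dataclass
class MetricModel:
    name: str
    measures: str                                  # what the metric measures
    unit: str
    columns: dict = field(default_factory=dict)    # column -> meaning
    caveats: list = field(default_factory=list)    # hidden format assumptions

class SemanticIndex:
    """A toy index mapping metric names to their extracted semantics."""
    def __init__(self):
        self.models = {}
    def add(self, m: MetricModel):
        self.models[m.name] = m
    def describe(self, name):
        m = self.models[name]
        return f"{m.name} measures {m.measures} in {m.unit}"

idx = SemanticIndex()
idx.add(MetricModel(
    name="NodeHeartbeatLatency",
    measures="controller-to-node heartbeat round trip",
    unit="ms",
    columns={"NodeId": "cluster-unique node identifier"},
    caveats=["rows with NodeId = -1 are aggregates, not nodes"],
))
print(idx.describe("NodeHeartbeatLatency"))
```

The `caveats` field is the interesting part: it is exactly the "hidden assumptions about data formats" that otherwise live only in engineers' heads and derail automated consumers.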
Systems Building
Many of my projects culminate in working systems - for me, a research idea is only worthwhile if it survives practical constraints. Three themes recur across these systems:
- Resource management. Sharing resources fairly without sacrificing utilization is hard, especially when request costs vary widely and are unknown at schedule time. Retro and 2DFQ both tackled multi-tenant isolation: Retro at the level of distributed components, 2DFQ at the level of thread pools within a process (NSDI 2015; SIGCOMM 2016).
- Performance Predictability. Clockwork is a multi-tenant DNN serving system; in this project we asked whether tail latency could be eliminated rather than merely tolerated. The key insight was that DNN inference is deterministic - if you consolidate all scheduling choices centrally and treat any timing deviation as an error, you can achieve six-nines SLO compliance. We built Clockwork from the ground up and open-sourced it; at OSDI 2020 the work received the Distinguished Artifact Award (OSDI 2020; GitLab).
- Microservices for Research. Blueprint (described above) is an extensible compiler and benchmark suite for microservice applications. It separates workflow logic from infrastructure choices, letting researchers swap RPC frameworks, enable tracing, or add replication with a few lines of config rather than hundreds of lines of code (SOSP 2023; GitHub).
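The Clockwork insight above - predictable execution enables up-front admission control - can be sketched as follows. This is a simplified single-GPU model with made-up latencies and SLOs, not Clockwork's scheduler: because per-model execution time is (near-)deterministic, a central scheduler can predict every request's completion time exactly and reject any request that would miss its deadline, rather than letting it queue and fail late.

```python
# Hypothetical per-model execution times (ms); deterministic by assumption.
EXEC_MS = {"resnet50": 10, "bert": 40}

class CentralScheduler:
    """Toy consolidated scheduler: one GPU, no local queues, no surprises."""

    def __init__(self):
        self.gpu_free_at = 0.0   # time (ms) when the GPU next idles

    def admit(self, model, arrival_ms, slo_ms):
        start = max(arrival_ms, self.gpu_free_at)
        finish = start + EXEC_MS[model]
        if finish > arrival_ms + slo_ms:
            return False          # would violate its SLO: reject up front
        self.gpu_free_at = finish # commit the schedule deterministically
        return True

s = CentralScheduler()
print(s.admit("resnet50", arrival_ms=0, slo_ms=25))   # True: finishes at 10
print(s.admit("bert",     arrival_ms=0, slo_ms=25))   # False: would finish at 50
print(s.admit("resnet50", arrival_ms=5, slo_ms=25))   # True: finishes at 20
```

Rejected requests never consume GPU time, so admitted requests keep their predicted schedule - the toy analog of treating any timing deviation as an error.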

Supervised Theses
I have been fortunate to work with many talented students over the years. In particular I advised or co-advised the following theses:
Other Resources
- Instrumented forks of various systems (GitLab)
- Datasets collected (GitLab)
- Microbricks benchmark system (GitLab)
- TPCDS Spark implementation (GitLab)
- Visualization demos: Demos | Source code