I am a researcher in the Cloud Reliability group at Microsoft Research in Redmond, WA.

Previously, I was tenure-track faculty at the Max Planck Institute for Software Systems (2018-2022), where I led the Cloud Software Systems group and held a dual appointment at Saarland University.

I received my PhD from Brown University (2011-2018) under the supervision of Rodrigo Fonseca, supported in part by a Facebook PhD Fellowship.

Research


My research focuses on designing and building reliable, observable, self-managing cloud systems. A central goal for me is to make it easier to operate large, complicated software systems, and to understand their behavior at runtime. Currently I am working at the intersection of observability, semantic modeling, and agentic AI.

Select Projects


Telemeta. Large-scale telemetry lacks semantic context: metric names are opaque; table relationships are implicit; tables are overloaded; and data is unnormalized. Making effective use of telemetry requires extensive discovery: understanding where data lives, what form it takes, and what rules apply to extract it. Telemeta extracts and indexes semantic models from metric data - what metrics measure, what columns represent, how tables relate, and hidden assumptions about data formats - to give agents a structured foundation for interpreting telemetry.
ℹ️ This is an ongoing project I lead at Microsoft.

Hindsight is a distributed tracing framework for retroactive sampling, i.e. capturing detailed traces for rare and outlier requests without the data loss of tail sampling-based systems. Hindsight revisits the design assumptions behind tail sampling: we found that the overhead of trace collection is not inherent to tracing itself, but comes primarily from transmitting, indexing, and storing traces. Local capture is cheap. Hindsight exploits this by recording all traces into lightweight local ring buffers and persisting them only when a downstream trigger fires (e.g., an error or SLO violation). The key insight is that you can defer the sampling decision until after you know what matters, as long as local storage is cheap and triggers are fast. We built Hindsight from the ground up; its code is open-sourced (NSDI 2023; GitLab).

Blueprint is an extensible compiler and benchmark suite for microservice applications. Experimenting with end-to-end ideas in microservices requires changing scaffolding across many services - a tedious, time-consuming task. Blueprint separates workflow logic from infrastructure choices, letting researchers swap RPC frameworks, enable tracing, or add replication with a few lines of config rather than hundreds of lines of code (SOSP 2023; GitHub).

Clockwork is a DNN serving system designed for predictable performance. By eliminating sources of variability and centralizing scheduling and admission control, Clockwork achieves extremely tight tail latency. It received the Distinguished Artifact Award at OSDI 2020, and its code is open-sourced (OSDI 2020; GitLab).

Pivot Tracing is a cross-component monitoring framework for distributed systems. Filtering or aggregating metrics across component boundaries requires associating events in one location with shared keys from another. To address this, Pivot Tracing propagates tuples/tags to enable such queries without per-query instrumentation. Pivot Tracing does this with a novel baggage abstraction for general-purpose context propagation. Baggage acts as a narrow waist for cross-cutting tools: many tools share how they propagate context but differ in what they carry and how they interpret it. Pivot Tracing received the Best Paper Award at SOSP 2015; code is on GitHub. Today, baggage is part of the OpenTelemetry standard.

Software


Research projects and code are scattered across a few locations:

My work is first and foremost motivated by the scale and complexity of large, modern cloud and internet applications, and the difficulty of achieving end-to-end reliability.

The perspective I've always been drawn to is that of end-to-end behavior: requests in distributed systems traverse machines, processes, queues, and backends, and a failure at any point can cascade into end-to-end impact. How do we observe, and ultimately improve, end-to-end behavior in such systems? Much of my work has orbited distributed tracing, touching topics both direct (e.g., tracing mechanisms) and indirect (e.g., downstream tasks like root-cause analysis).

Beyond tracing, my research more broadly asks how to achieve end-to-end reliability, robustness, and performance by design. Over the years I've built systems that explore abstractions and design principles for multi-tenant isolation, predictable performance, and causal consistency, among others. And of course, as often happens in systems research, resource management and scheduling keep resurfacing!

Distributed Tracing


A few highlights from my work on distributed tracing:

  • Standards. From around 2015 I worked with the Distributed Tracing Working Group to help initiate the OpenTracing standard; I later served on its Industrial Advisory Board. OpenTracing has since become the tracing arm of OpenTelemetry. That experience fed into Distributed Tracing in Practice (O'Reilly, 2020), co-authored with colleagues from Lightstep.
  • Baggage. In Pivot Tracing I proposed baggage, a general-purpose context propagation mechanism (SOSP 2015, Best Paper Award). Baggage acts as a narrow waist for cross-cutting tools: many tools share how they propagate context but differ in what they carry and how they interpret it. After Pivot Tracing I refined the design and explored additional use cases in a follow-up paper (EuroSys 2018) and in my PhD thesis, which received an Honorable Mention for the 2018 Dennis M. Ritchie Doctoral Dissertation Award. Several subsequent projects have built on this substrate (see below). Today, baggage is part of OpenTelemetry's trace context.
  • Tracing at Facebook. During my PhD I spent time working at Facebook on Canopy, their end-to-end performance tracing and analysis system (SOSP 2017). Canopy contrasts with other distributed tracing work in interesting ways. It collected on the order of a billion traces per day, and at that scale semantic heterogeneity became a first-order design concern: engineers instrument in different ways, and the pipeline still has to unify those choices end-to-end, including in trace-derived aggregates.
  • Consuming Traces. Most tracing research (including my own) focuses on capture mechanisms; what happens after traces land is often an afterthought. To better understand practitioners' actual workflows, we interviewed engineers about challenges and use-cases (IEEE TVCG 2023). This followed from earlier work in which we found that systems researchers frequently omit user-facing concerns when designing systems (SoCC 2022).

Cross-Cutting Tools


A recurring theme in my work is cross-cutting tools: tools that need to span components and layers because the property they measure or enforce is end-to-end. Tracing is one instance, but the same pattern shows up elsewhere, such as:

  • Resource Management. Multi-tenant isolation requires coordinating resource policies across distributed components, but scheduling decisions are local. Retro propagated tenant IDs in-band so that each component could enforce per-tenant limits consistently (NSDI 2015).
  • Cross-Component Metrics. Filtering or aggregating metrics across component boundaries requires associating events with shared keys. Pivot Tracing propagated tuples/tags to enable such queries without per-query instrumentation (SOSP 2015).
  • Cross-Service Causal Consistency. Independent backends can violate causal ordering even when each is internally consistent. Antipode propagated lineage-style context with requests to enforce a causal consistency model across services (SOSP 2023).
  • Critical-Path Analysis. Identifying the critical path of a distributed request requires propagating slack and timing measurements online. CPath does this; I co-advised the project, which formed the basis of George Sun's MSc thesis (PDF).

Across these projects, Baggage serves as shared propagation infrastructure. It does not eliminate deployment effort, but it concentrates that effort into a reusable mechanism rather than repeating it for each tool.
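
To make the narrow-waist idea concrete, here is a minimal sketch of a request context that several tools share. It is an illustration only: the `Baggage` class, its method names, the JSON encoding, and the `retro` / `pivot` namespace strings are all assumptions made for this example, and do not reflect the actual baggage implementation or the OpenTelemetry API.

```python
import json


class Baggage:
    """Toy request context shared by several tools (hypothetical API)."""

    def __init__(self, entries=None):
        self._entries = dict(entries or {})

    def set(self, namespace, key, value):
        # Each tool writes under its own namespace; a new object is returned
        # so that a context already handed to another thread is not mutated.
        updated = dict(self._entries)
        updated[f"{namespace}.{key}"] = value
        return Baggage(updated)

    def get(self, namespace, key, default=None):
        return self._entries.get(f"{namespace}.{key}", default)

    def serialize(self):
        # Stand-in for whatever travels in an RPC header.
        return json.dumps(self._entries, sort_keys=True)

    @classmethod
    def deserialize(cls, header):
        return cls(json.loads(header))


def frontend():
    # Two unrelated tools attach their own data to the same request context.
    ctx = Baggage().set("retro", "tenant", "tenant-42")
    ctx = ctx.set("pivot", "client", "mobile")
    return ctx.serialize()


def backend(header):
    # The propagation path is shared; each tool reads back only its own keys.
    ctx = Baggage.deserialize(header)
    return ctx.get("retro", "tenant"), ctx.get("pivot", "client")


tenant, client = backend(frontend())
```

The propagation code (`serialize`, `deserialize`, and the plumbing between `frontend` and `backend`) is written once; adding another tool in this sketch only means choosing another namespace.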

Tracing Instrumentation


Instrumentation is the unglamorous bottleneck of tracing. I have instrumented many systems for context propagation - Hadoop, HDFS, Spark, HBase, microbenchmarks, and more - and that hands-on work shaped much of my understanding of what is realistic to expect from engineers.

  • During Retro and Pivot Tracing, I developed aspect-oriented techniques for automatic instrumentation of Java codebases. The approach was not exhaustive but covered common patterns effectively and provided a starting point for manual refinement (code).
  • Ironically, the very first iteration of Baggage was built to support Retro, so that I could iterate on context-propagation ideas without having to re-instrument or re-build from scratch each time.
  • We also studied inter-thread communication patterns to identify candidate instrumentation points; that work is captured in Nicolas Schäfer's MSc thesis (PDF).
  • Getting instrumentation into production systems is as much an organizational challenge as a technical one. I discuss these issues in Chapter 7 of my PhD thesis (PDF). With the recent progress in LLMs and coding agents, automated instrumentation and trace-correctness validation may soon become practical. It is only a matter of time...

Trace Sampling


Tracing faces a clear trade-off between the volume of data captured and the cost of capturing it (compute, storage, network, and operational complexity). There is no one-size-fits-all solution to sampling, as different data suits different use cases. I've looked at sampling and retention through a few lenses:

  • Tail-based outlier sampling. My early work explored unsupervised techniques to bias sampling toward unusual traces; the goal was to detect outliers without hand-engineering features, despite highly heterogeneous and imbalanced traces. The techniques explored include graph kernels (2014 MSc project), clustering (SoCC 2018), and autoencoders (SoCC 2019).
  • Retroactive sampling. Tail sampling has always struck me as a bitter compromise: by the time you decide a request is interesting, you have already lost the trace data you need. Hindsight revisits the design assumptions behind tail sampling. We found that the overhead of trace collection is not inherent to tracing itself - it comes from transmitting, indexing, and storing traces. Local capture is cheap. Hindsight exploits this by recording all traces into lightweight local ring buffers and persisting them only when a downstream trigger fires (e.g., an error or SLO violation). The key insight is that you can defer the sampling decision until after you know what matters, as long as local storage is cheap and triggers are fast. We built Hindsight from the ground up; its code is open-sourced (NSDI 2023; GitLab).
  • Sampling for derived metrics. Trace-derived aggregate metrics can become biased under non-uniform head-based sampling. We explored this topic without publishing a paper; our notes and proposed directions appear in Reyhaneh Karimipour's MSc thesis (PDF).
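
The retroactive sampling idea described above can be illustrated with a toy, single-process sketch: record everything into a bounded local buffer, and pay the expensive persistence cost only when a trigger fires. The `RetroactiveTracer` class and its method names are hypothetical and chosen for this example; this is not Hindsight's API, and it ignores the distributed coordination a real system needs.

```python
from collections import deque


class RetroactiveTracer:
    """Toy model of retroactive sampling (hypothetical names throughout)."""

    def __init__(self, capacity):
        # Bounded local buffer: once full, the oldest events are evicted.
        self.buffer = deque(maxlen=capacity)
        # Stand-in for the expensive path (transmit, index, store).
        self.persisted = {}

    def record(self, trace_id, event):
        # Cheap local capture happens for every request.
        self.buffer.append((trace_id, event))

    def trigger(self, trace_id):
        # A symptom was observed (e.g. an error or SLO violation): persist
        # whatever is still buffered for this trace.
        events = [event for tid, event in self.buffer if tid == trace_id]
        self.persisted[trace_id] = events
        return events


tracer = RetroactiveTracer(capacity=4)
tracer.record("req-1", "start")
tracer.record("req-2", "start")
tracer.record("req-1", "db timeout")
captured = tracer.trigger("req-1")  # e.g. called from an error handler
```

In this sketch the buffer capacity bounds how long a trigger can wait before the relevant events are evicted, which is why the approach depends on triggers being fast.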

Telemetry for Models and Agents


Telemetry is increasingly consumed by models and agents, not just humans. This raises concrete systems questions: how should we design and integrate those agents? What representations and interfaces enable automated tools to use telemetry effectively? Several recent projects explore this direction:

  • Telemeta. Large-scale telemetry lacks semantic context: metric names are opaque; table relationships are implicit; tables are overloaded; and data is unnormalized. Making effective use of telemetry requires extensive discovery: understanding where data lives, what form it takes, and what rules apply to extract it. Telemeta extracts and indexes semantic models from metric data - what metrics measure, what columns represent, how tables relate, and hidden assumptions about data formats - to give agents a structured foundation for interpreting telemetry. This is an ongoing project I lead at Microsoft.
  • Causal Discovery. Causal reasoning is a powerful technique for root-cause analysis when metrics have known causal relationships. The challenge is constructing the causal graph: data-driven discovery struggles with cloud telemetry because incidents are rare and heterogeneous. In Atlas, we sidestepped this by using LLMs to extract causal graphs from system documentation rather than from observational data, thus treating graph construction as knowledge extraction. Atlas decomposes a system into LLM agents that interpret component descriptions and identify pairwise causal relationships (arXiv 2024).
  • AIOps Benchmarking. Evaluating AI agents on cloud-operations tasks requires realistic, controlled environments. AIOpsLab is a benchmark framework that deploys live microservice applications, injects realistic faults, and provides agents with a controlled interface to observe telemetry and take actions - supporting tasks like detection, localization, root-cause analysis, and mitigation (MLSys 2025).
  • Retry Bugs. Retry logic is notoriously hard to test and frequently breaks in production. We combined LLMs with static analysis and fault injection to automatically detect retry bugs, finding over 100 previously unknown issues across major distributed systems (SOSP 2024).
  • Time-Series Anomaly Detection. Black-box anomaly-detection algorithms are hard to inspect and validate. We used LLMs to generate anomaly-detection programs for time-series metrics; programs are human-readable and can be reviewed before deployment (arXiv 2025).

Systems Building


Many of my projects culminate in working systems - for me, a research idea is only worthwhile if it survives practical constraints. Three themes recur across these systems:

  • Resource Management. Sharing resources fairly without sacrificing utilization is hard, especially when request costs vary widely and are unknown at schedule time. Retro and 2DFQ both tackled multi-tenant isolation: Retro at the level of distributed components, 2DFQ at the level of thread pools within a process (NSDI 2015; SIGCOMM 2016).
  • Performance Predictability. Clockwork is a multi-tenant DNN serving system; in this project we asked whether tail latency could be eliminated rather than merely tolerated. The key insight was that DNN inference is deterministic - if you consolidate all scheduling choices centrally and treat any timing deviation as an error, you can achieve six-nines SLO compliance. We built Clockwork from the ground up and open-sourced it; at OSDI 2020 the work received the Distinguished Artifact Award (OSDI 2020; GitLab).
  • Microservices for Research. Experimenting with end-to-end ideas in microservices requires changing scaffolding across many services - a tedious, time-consuming task. Blueprint separates workflow logic from infrastructure choices, letting researchers swap RPC frameworks, enable tracing, or add replication with a few lines of config rather than hundreds of lines of code (SOSP 2023; GitHub).
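
As a rough illustration of the predictability argument above: if inference latency is deterministic, a central scheduler can decide at admission time whether a request will meet its deadline. The sketch below assumes a single worker and a known per-model latency; the `CentralScheduler` class, the model name, and the latency figure are invented for this example and are not Clockwork's actual scheduler or measurements.

```python
class CentralScheduler:
    """Toy admission control for one worker (hypothetical, simplified)."""

    def __init__(self, predicted_latency_ms):
        # Assumed to be known in advance because execution is deterministic.
        self.predicted_latency_ms = dict(predicted_latency_ms)
        self.busy_until_ms = 0.0

    def admit(self, model, now_ms, deadline_ms):
        # Predict the completion time from the worker's current schedule.
        start = max(now_ms, self.busy_until_ms)
        finish = start + self.predicted_latency_ms[model]
        if finish > deadline_ms:
            # Reject up front rather than run a request that would be late.
            return None
        self.busy_until_ms = finish
        return finish


scheduler = CentralScheduler({"example-model": 3.0})
first = scheduler.admit("example-model", now_ms=0.0, deadline_ms=10.0)
second = scheduler.admit("example-model", now_ms=0.0, deadline_ms=5.0)
third = scheduler.admit("example-model", now_ms=0.0, deadline_ms=10.0)
```

Here the second request is rejected because the worker is already committed until 3.0 ms, so it could not finish before its 5.0 ms deadline; the third, with a looser deadline, is queued behind the first.
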

Publications


2025
AIOpsLab: A Holistic Framework for Evaluating AI Agents for Enabling Autonomous Cloud
Yinfang Chen, Manish Shetty, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Jonathan Mace, Chetan Bansal, Rujia Wang, Saravan Rajmohan
MLSys, 2025. [PDF]
Fair, Practical, and Efficient Carbon Accounting for LLM Serving
Yueying Lisa Li, Leo Han, G Edward Suh, Christina Delimitrou, Fiodar Kazhamiaka, Esha Choukse, Rodrigo Fonseca, Liangcheng Yu, Jonathan Mace, Udit Gupta
ACM SIGMETRICS Performance Evaluation Review, 2025. [PDF]
Generating Representative Macrobenchmark Microservice Systems from Distributed Traces with Palette
Vaastav Anand, Matheus Stolet, Jonathan Mace, Antoine Kaufmann
APSys, 2025. [PDF]
Intent-based System Design and Operation
Vaastav Anand, Yichen Li, Alok Gautam Kumbhare, Celine Irvene, Chetan Bansal, Gagan Somashekar, Jonathan Mace, Pedro Las-Casas, Ricardo Bianchini, Rodrigo Fonseca
PACMI, 2025. [PDF]
Automated Service Design with Cerulean
Vaastav Anand, Alok Gautam Kumbhare, Celine Irvene, Chetan Bansal, Gagan Somashekar, Jonathan Mace, Pedro Las-Casas, Rodrigo Fonseca
AIOps, 2025. [PDF]
Argos: Agentic Time-Series Anomaly Detection with Autonomous Rule Generation via Large Language Models
Yile Gu, Yuxuan Xiong, Jonathan Mace, Yong Jiang, Yulong Hu, Baris Kasikci, Peng Cheng
arXiv preprint, 2025. [PDF]
2024
Building AI Agents for Autonomous Clouds: Challenges and Design Principles
Manish Shetty, Yinfang Chen, Gagan Somashekar, Minghua Ma, Yogesh Simmhan, Xuchao Zhang, Jonathan Mace, Pedro Las-Casas, Shachee Gupta, Suman Nath, Chetan Bansal, Saravan Rajmohan
SoCC, 2024. [PDF]
If At First You Don't Succeed, Try, Try, Again...? Insights and LLM-informed Tooling for Detecting Retry Bugs in Software Systems
Bogdan Stoica, Utsav Sethi, Yiming Su, Cyrus Zhou, Shan Lu, Jonathan Mace, Madan Musuvathi, Suman Nath
SOSP, 2024. [PDF]
Cloud Atlas: Efficient Fault Localization for Cloud Systems using Language Models and Causal Insight
Zhiqiang Xie, Yujia Zheng, Lizi Ottens, Kun Zhang, Christos Kozyrakis, Jonathan Mace
arXiv preprint, 2024. [PDF]
2023
Blueprint: A Toolchain for Highly-Reconfigurable Microservice Applications
Vaastav Anand, Deepak Garg, Antoine Kaufmann, Jonathan Mace
SOSP, 2023. [PDF] [GitHub]
Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications
João Loff, Daniel Porto, João Garcia, Jonathan Mace, Rodrigo Rodrigues
SOSP, 2023. [PDF]
Detection Is Better Than Cure: A Cloud Incidents Perspective
Varsha Ganatra, Aditya Parayil, Saurabh Ghosh, Yingnong Kang, Minghua Ma, Chetan Bansal, Suman Nath, Jonathan Mace
ESEC/FSE, 2023. [PDF]
The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems
Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, Jonathan Mace
NSDI, 2023. [PDF] [GitLab]
Groundhog: Reconciling Efficiency and Request Isolation in FaaS
Mohamed Alzayat, Jonathan Mace, Peter Druschel, Deepak Garg
EuroSys, 2023. [PDF]
A Qualitative Interview Study of Distributed Tracing Visualisation: A Characterisation of Challenges and Opportunities
Thomas Davidson, Emily Wall, Jonathan Mace
IEEE Transactions on Visualization and Computer Graphics, February 2023. [PDF]
The Odd One Out: Energy is Not Like Other Metrics
Vaastav Anand, Zhiqiang Xie, Matheus Stolet, Roberta De Viti, Thomas Davidson, Reyhaneh Karimipour, Safya Alzayat, Jonathan Mace
ACM SIGENERGY Energy Informatics Review, October 2023. [PDF]
2022
See it to Believe it? The Role of Visualisation in Systems Research
Thomas Davidson, Jonathan Mace
SoCC, 2022. [PDF]
The Odd One Out: Energy is not like Other Metrics
Vaastav Anand, Zhiqiang Xie, Matheus Stolet, Roberta De Viti, Thomas Davidson, Reyhaneh Karimipour, Safya Alzayat, Jonathan Mace
HotCarbon, 2022. [PDF]
ACT now: Aggregate Comparison of Traces for Incident Localization
Kamala Ramasubramanian, Ashutosh Raina, Jonathan Mace, Peter Alvaro
arXiv preprint, 2022. [PDF]
2020
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
Arpan Gujarati, Reza Karimi, Safya Alzayat, Wei Hao, Antoine Kaufmann, Ymir Vigfusson, Jonathan Mace
OSDI, 2020. [PDF] [GitLab]
Distinguished Artifact Award
Distributed Tracing in Practice
Austin Parker, Daniel Spoonhower, Jonathan Mace, Rebecca Isaacs
Book, Published July 2020.
2019
Sifter: Scalable Sampling for Distributed Traces, without Feature Engineering
Pedro Las-Casas, Giorgi Papakerashvili, Vaastav Anand, Jonathan Mace
SoCC, 2019. [PDF]
No DNN Left Behind: Improving Inference in the Cloud with Multi-Tenancy
Amit Samanta, Suhas Shrinivasan, Antoine Kaufmann, Jonathan Mace
arXiv preprint, 2019. [PDF]
2018
Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay
Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, Rodrigo Fonseca
SoCC, 2018. [PDF]
A Universal Architecture for Cross-Cutting Tools in Distributed Systems
Jonathan Mace
Ph.D. Thesis, Brown University, 2018. [PDF]
Dennis M. Ritchie Doctoral Dissertation Award, Honorable Mention
Universal Context Propagation for Distributed System Instrumentation
Jonathan Mace, Rodrigo Fonseca
EuroSys, 2018. [PDF] [GitHub]
2017
Canopy: An End-to-End Performance Tracing And Analysis System
Jonathan Kaldor, Jonathan Mace, Michał Bejda, Edison Gao, Wiktor Kuropatwa, Joe O'Neill, Kian Win Ong, Bill Schaller, Pingjia Shan, Brendan Viscomi, Vinod Venkataraman, Kaushik Veeraraghavan, Yee Jiun Song
SOSP, 2017. [PDF]
2016
Principled Workflow-Centric Tracing of Distributed Systems
Raja R. Sambasivan, Ilari Shafer, Jonathan Mace, Benjamin H. Sigelman, Rodrigo Fonseca, Gregory R. Ganger
SoCC, 2016. [PDF]
2DFQ: Two-Dimensional Fair Queuing for Multi-Tenant Cloud Services
Jonathan Mace, Peter Bodik, Madanlal Musuvathi, Rodrigo Fonseca, Krishnan Varadarajan
SIGCOMM, 2016. [PDF]
2015
Pivot Tracing: Dynamic Causal Monitoring for Distributed Systems
Jonathan Mace, Ryan Roelke, Rodrigo Fonseca
SOSP, 2015. [PDF] [GitHub]
Best Paper Award
We are Losing Track: a Case for Causal Metadata in Distributed Systems
Rodrigo Fonseca, Jonathan Mace
HPTS, 2015. [PDF]
Retro: Targeted Resource Management in Multi-Tenant Distributed Systems
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
NSDI, 2015. [PDF] [GitHub]
2014
Towards General-Purpose Resource Management in Shared Cloud Services
Jonathan Mace, Peter Bodik, Rodrigo Fonseca, Madanlal Musuvathi
HotDep, 2014. [PDF]
2013
Revisiting End-to-End Trace Comparison with Graph Kernels
Jonathan Mace, Rodrigo Fonseca
MSc Project, Brown University, 2014. [PDF]

Supervised Theses

I have been fortunate to work with many talented students over the years. In particular I advised or co-advised the following theses:

Powering Accurate Aggregate Analysis with Representative Distributed Tracing
Reyhaneh Karimipour
Masters Thesis, Saarland University, 2022. [PDF]
Using Reinforcement Learning for Low-Latency High-Throughput Request Scheduling
Safya Alzayat
Masters Thesis, Saarland University, 2022. [PDF]
Efficient DNN Serving: Evaluating the feasibility of FPGAs for multi-tenant model serving
Erasmus Masters Thesis, Pazmany Peter Catholic University, 2021. [PDF]
Pathfinder: Exploiting Inter-Thread Communication for Request Flow Instrumentation
Nicolas Schäfer
Masters Thesis, Saarland University, 2020. [PDF]
General Baggage Model for End-to-End Tracing and Its Application on Critical Path Analysis
Hongkai Sun
Masters Thesis, Brown University, 2016. [PDF]
End-to-End Tracing Models: Analysis and Unification
Jonathan Leavitt
Undergraduate Thesis, Brown University, 2014. [PDF]

Other Documents

  • 2021 Research Statement, Jonathan Mace [PDF]
  • 2017 Research Statement, Jonathan Mace [PDF]
  • End-to-End Tracing: Adoption and Use Cases, Jonathan Mace, 2017. [PDF]

Other Resources

jonathanmace at microsoft (preferred for work-related email)

jonathan.c.mace at gmail