LLM Observability: Lessons From MLOps

For nine years, Maria Vechtomova was shouting about monitoring. Nobody cared, until LLMs arrived. As co-founder of Cauchy, Databricks MVP, and one of the most followed voices in MLOps, Maria has watched the field evolve from hand-built experiment trackers to today's flood of observability tools, and her central claim might surprise you: globally, nothing has changed.

The fundamentals are the same: track your code, data, and models so you can roll back when something breaks. What did change is the surface area. Tools, prompts, embeddings, agents: every component can shift behavior unpredictably, and business metrics often become the only signal left. Maria gets into why most teams still can't roll back cleanly, why around 40% of companies still have no monitoring in place for their ML projects, and why she believes the next era of MLOps will be its biggest yet.

Let's start with you and Cauchy. How did you land in this space?

I've been in data and AI for 12 years, and doing MLOps for nine — that's the topic I'm most passionate about. After 12 years in the corporate world, we started a consulting company called Cauchy. It's named after the French mathematician — most people have heard of Cauchy's theorem, but he's also the brain behind gradient descent. Without gradient descent, there's no modern AI. So we thought it was a fitting name. Our logo is a keyhole shape, which is used to prove another of Cauchy's theorems — nothing to do with gradient descent, but a nice fun fact.

The idea behind the company is to actually enable others to do MLOps, data platforms, and AI platforms the right way, not to parachute in, build something, and leave everyone where they were. We want to teach. We want to help others have these best practices in-house. And for many teams it's really not obvious how to do it right.

You were doing MLOps before it was called MLOps. How has the practice evolved, especially with LLMs?

I started doing this before it was a thing — we were building our own tools for experiment tracking, registering models, deploying them. Then we got MLflow and similar tools. MLflow is the leader in the space and the oldest, and they made a huge shift in the LLM era. If you're doing LLMs, you should be using MLflow for tracing.

But globally, nothing has changed. You still need to track your code, your data, your models. You need proper tracing of what was used when, so you can roll back when necessary. Surprisingly, very few teams actually do this well. If you ask them, "Can you roll back easily to any version of code, data, or models?" — usually they can't. And if something goes wrong, they often can't pinpoint why, because they didn't track things.
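A minimal sketch of the kind of tracking described above, assuming a deployment can be identified by three things: a code commit, a content hash of the data, and a model version. The names `RunRecord`, `DeploymentLog`, and `fingerprint_data` are hypothetical and invented for this illustration; they are not part of MLflow or any other tool mentioned in the interview.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunRecord:
    """What was used for one deployment: code, data, and model versions."""
    code_commit: str    # e.g. a git commit SHA (placeholder values below)
    data_sha256: str    # content hash of the data snapshot
    model_version: str  # version label from whatever registry is in use


def fingerprint_data(raw: bytes) -> str:
    """Content hash, so identical data always maps to the same identifier."""
    return hashlib.sha256(raw).hexdigest()


class DeploymentLog:
    """Append-only deployment history; a rollback re-appends an old record."""

    def __init__(self) -> None:
        self._history: list[RunRecord] = []

    def deploy(self, record: RunRecord) -> None:
        self._history.append(record)

    def current(self) -> RunRecord:
        return self._history[-1]

    def rollback(self, steps: int = 1) -> RunRecord:
        """Redeploy the record from `steps` deployments ago."""
        if steps < 1 or steps >= len(self._history):
            raise ValueError("no earlier deployment to roll back to")
        target = self._history[-1 - steps]
        self._history.append(target)
        return target

    def to_json(self) -> str:
        """Serialize the full history so it can be stored or audited."""
        return json.dumps([asdict(r) for r in self._history], indent=2)


# Hypothetical usage: two deployments, then a rollback to the first one.
log = DeploymentLog()
log.deploy(RunRecord("a1b2c3d", fingerprint_data(b"rows-v1"), "1"))
log.deploy(RunRecord("e4f5a6b", fingerprint_data(b"rows-v2"), "2"))
restored = log.rollback()
```

Because the history is append-only, the rollback itself is recorded, which is what makes it possible to answer "what was used when" after the fact.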

Monitoring is the same story. People kind of knew we needed it, but no one did anything about it. Alejandro — now Head of AI at Zalando — maintains a survey called the State of Production ML. According to that survey, more than 50% of companies had no monitoring in place for ML projects. This year it's around 40%, and that's because there's more attention now, specifically on LLM application monitoring. The tools are better. I'm so excited about LLM Observe. Finally we care about serving. Finally we care about monitoring. Things I was shouting about for years — now suddenly everyone cares.

Why is LLM observability so much harder than classical ML monitoring?

For standard ML models, you know what you're monitoring. You're monitoring some metrics — which may have nothing to do with reality, but they're straightforward.

For LLM applications, what you're monitoring isn't always clear. You need to first define your quality metrics, and they can be something completely arbitrary. You also need to gather evaluation data so you can actually say whether a changed agent still performs correctly. And there are so many moving parts — if you add a tool, the agent may behave differently. If you change the embedding model or the LLM, things shift. The system prompt is another component. There are way more moving pieces than there were for traditional ML models, and observability is just hard. And now everyone cares about it. So it's fun.
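To make the moving parts concrete, here is a rough sketch, assuming an agent's behavior is determined by a small configuration (LLM, embedding model, system prompt, tools) and that quality is a simple pass rate on a fixed evaluation set. The config keys, the stand-in agents, and the substring-match metric are all assumptions for illustration; a real quality metric would be defined per use case.

```python
import hashlib
import json
from typing import Callable


def config_fingerprint(config: dict) -> str:
    """Stable short hash over every component that can change behavior."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def pass_rate(agent: Callable[[str], str],
              eval_set: list[tuple[str, str]]) -> float:
    """Fraction of examples where the expected text appears in the answer."""
    passed = sum(1 for question, expected in eval_set
                 if expected in agent(question))
    return passed / len(eval_set)


# Hypothetical configurations: the candidate only adds one tool.
baseline_config = {
    "llm": "model-a",
    "embedding_model": "embed-a",
    "system_prompt": "Answer briefly.",
    "tools": ["search"],
}
candidate_config = {**baseline_config, "tools": ["search", "calculator"]}


# Stand-in agents (plain lookups, no real LLM calls) to keep this runnable.
def baseline_agent(question: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(question, "")


def candidate_agent(question: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Lyon"}.get(question, "")


eval_set = [("2+2?", "4"), ("Capital of France?", "Paris")]

baseline_score = pass_rate(baseline_agent, eval_set)
candidate_score = pass_rate(candidate_agent, eval_set)
regressed = candidate_score < baseline_score
```

Logging the fingerprint next to each score ties a quality change back to the exact combination of components that produced it.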

Why do business metrics become the only reliable signal for LLM apps?

Take a classification model — say, fraud detection. It's pretty deterministic. You can define fraud with business rules, you have historical observations, you know your false positives and false negatives, you know their cost. You can calculate total revenue from how the model predicts. Straightforward to monitor.
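The arithmetic behind that can be sketched as follows, assuming a fixed cost per false positive and per false negative. The labels and cost figures are made up for illustration, and the sketch totals the cost of errors rather than revenue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    true_positive: int
    false_positive: int
    true_negative: int
    false_negative: int


def confusion_counts(actual: list[bool],
                     predicted: list[bool]) -> ConfusionCounts:
    """Count each outcome type; True means 'fraud'."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = list(zip(actual, predicted))
    return ConfusionCounts(
        true_positive=sum(1 for a, p in pairs if a and p),
        false_positive=sum(1 for a, p in pairs if not a and p),
        true_negative=sum(1 for a, p in pairs if not a and not p),
        false_negative=sum(1 for a, p in pairs if a and not p),
    )


def total_error_cost(counts: ConfusionCounts,
                     cost_false_positive: float,
                     cost_false_negative: float) -> float:
    """Business cost of the model's mistakes under fixed per-error costs."""
    return (counts.false_positive * cost_false_positive
            + counts.false_negative * cost_false_negative)


# Illustrative data: five transactions, two of them actually fraudulent.
actual = [True, True, False, False, False]
predicted = [True, False, True, False, False]

counts = confusion_counts(actual, predicted)
# Assumed costs: 5.0 to review a wrongly blocked payment,
# 200.0 for each fraud case that slips through.
cost = total_error_cost(counts, cost_false_positive=5.0,
                        cost_false_negative=200.0)
print(cost)  # 1 false positive * 5.0 + 1 false negative * 200.0 = 205.0
```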

LLM applications have completely different goals. Take summarization or generating an audit document. You might check: does it follow the required style? You can have human evaluators score historical examples, then train LLM evaluators aligned with the humans. That's one aspect — but maybe you also care whether the model identified the right numbers, and placed them in the right spots in the final document. That's another data point you need to track.
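One way to check whether an LLM evaluator is aligned with human evaluators is to compare their labels on the same historical examples. The sketch below uses raw agreement and Cohen's kappa, which corrects for agreement expected by chance. Kappa is a choice made for this illustration, not a method named in the interview, and the pass/fail labels are invented.

```python
from collections import Counter


def observed_agreement(human: list[int], judge: list[int]) -> float:
    """Share of examples where the human and the LLM judge agree."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(1 for h, j in zip(human, judge) if h == j) / len(human)


def cohens_kappa(human: list[int], judge: list[int]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    p_observed = observed_agreement(human, judge)
    n = len(human)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label
    # if each labeled independently at their own observed rates.
    p_expected = sum((human_counts[label] / n) * (judge_counts[label] / n)
                     for label in human_counts)
    if p_expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Invented labels: 1 = "follows the required style", 0 = "does not".
human_labels = [1, 1, 1, 0, 0, 0, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 0]

agreement = observed_agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
print(agreement, kappa)  # 6 of 8 agree -> 0.75; kappa = (0.75 - 0.5) / 0.5 = 0.5
```

A low kappa on held-out human-scored examples suggests the LLM evaluator's prompt or rubric needs more work before its scores are trusted in monitoring.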

There are so many more data points to evaluate, constantly, so your data-gathering problem is much bigger than it was for classical ML models. For NLP applications that's normal, and it always was. Just no one was shouting about it.

Databricks has become a central part of the ML stack. What changed?

I first really started using Databricks four years ago, and back then it wasn't ready. It had MLflow and some basic orchestration, but we had to deploy endpoints on Kubernetes and use Airflow for orchestration — Databricks Jobs (now Lakeflow Jobs) just weren't there yet. Deploying was painful. You had to package your code, upload it to the right place, manage all your config files yourself. We had our own internal package that did exactly that. Then Databricks released Databricks Asset Bundles — same idea, but standardized and supported by them.

Over the years they've added so many of these components — and not just for developer experience. Actual ML components, like model serving, have improved dramatically. The speed of releasing features is just crazy. I can't remember in my career ever seeing anything like it. And it's only accelerating. They're moving in a direction the general AI community agrees with. It's very impressive.

When you train SREs and platform engineers moving into MLOps, what's the gap?

A lot of the skills are already there. If you understand how to deploy things in Kubernetes, you'll absolutely be able to deploy on Databricks: it's much easier, because it's all abstracted away. MLOps and ML engineering involve heavy multitasking, and SREs are the best multitaskers I've ever met. The skill base is there. What's missing is exposure to the right way of doing it, and that's very hard to find. That's why we decided to teach.

Are companies actually using LLMs for the right things?

There's a lot of pressure on leadership to use LLMs for everything — they feel like if they don't, they're left behind. And there's pressure from the developer community too, because everyone wants to play with it. So pressure from both sides.

Does it make sense to use AI for everything? Probably not. What most companies are lacking is the foundations: they don't have their data right, and they don't have their processes right. And if you don't have that, there is no magic AI dust that solves the problems. You still have to do the hard work to get the foundations in place, and somehow no one wants to do that work.

I do see companies starting to understand this. And that's why I believe MLOps will be great again. Well — it's not that it wasn't. But it'll be an even bigger topic in the upcoming years.

Any tools you're excited about for 2026?

There are so many tools that if you try everything, you'll just get overwhelmed. I'm already overwhelmed by life — I think everyone in the field is. We need to chill a bit. So I'm not chasing new tools. I'm also not sure they'll all survive — we're going through a crazy acquisition phase. Databricks, Salesforce, Oracle — everyone is acquiring smaller players. So I could tell you "this tool looks cool" and it might disappear.

That said: if you're not using Databricks, you should try it. They're heading in the right direction and have what you need for deploying AI applications properly. And from a developer experience angle, I really admire Outerbounds — the inventors of Metaflow. Not sponsored, just naming names.

For someone who doesn't want to become an MLOps engineer but needs to upskill — where do they start?

I'm writing a book about MLOps with a large LLMOps chapter, releasing around May or June. We also have a seven-week LLMOps course coming out in March, where we build a use case step by step and apply best practices — on Databricks, but the focus is on principles, not tools. You just have to pick a tool to explain things.

If I'm not on your list, follow Paul Iusztin and Maxime Labonne — they wrote a book together, run courses, and put out great content. There are so many amazing people out there. You just need to find the right ones to follow.

Where can people find you?

LinkedIn is the easiest place. We also have a free MLOps course on our YouTube channel (Marvelous MLOps) — we may rebrand it, but it's still a great place to start. Globally, the principles are the same.