Building trust into agentic AI - Margot van Laar (Anthropic)

Authors: Elena Vrabie and Adriana Spulber
The past year has made one thing clear: AI systems are becoming increasingly capable of acting independently. We've moved beyond the era of rigid rules and predefined workflows. Today's systems can reason, plan, and act - navigating tools, adapting to feedback, and operating across changing environments. The question is no longer how to constrain them, but where we're willing to let them operate on their own.
This tension defines the current moment in agentic AI. We’re shifting from systems that follow predefined steps to ones that decide those steps themselves. The challenge is that evaluation no longer stops at the answer. It extends to the behavior: the decisions made, the actions taken, and the outcomes produced over time.
To understand what this shift looks like in practice, we spoke with Margot van Laar, an Applied AI engineer at Anthropic based in London. Working at the intersection of research, product, and go-to-market - and with a background in computational physics - she helps organizations move from experimental workflows to production-ready agentic systems.
In this conversation, Margot explains how evaluation has moved from accuracy to behavior, what “agentic” systems actually look like under the hood, and why the biggest challenge ahead is no longer capability, but confidence.
UV: What changed most in how you evaluate AI systems when you moved from consulting into research and engineering, and given your background in physics?
Margot van Laar: Back when I was working in physics and building more traditional ML models, the evaluation was relatively clean. You had a ground truth and your test set. You could measure accuracy, precision, and recall, all of which told you whether the model was right or wrong, and by how much. You ran your experiment, got your distribution, and compared it to the theory.
But LLMs fundamentally broke that evaluation process. The output is language, and language doesn't have a single correct answer the way a label would. The model can produce something that's factually accurate but in the wrong tone, or reason correctly but structure the answer unhelpfully, or be technically right but subtly misleading. So evaluation became a lot messier. In the early days of LLMs, we introduced techniques like LLM-as-a-judge, where one LLM assesses another, so that we could grade outputs more subjectively, as a human would.
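The LLM-as-a-judge pattern Margot mentions can be sketched in a few lines. This is a toy illustration, not Anthropic's implementation: the `judge` argument stands in for any callable that sends a prompt to a model and returns its text reply, and the rubric wording is invented.

```python
import re
from typing import Callable

# Hypothetical grading rubric; real rubrics are task-specific and far longer.
RUBRIC = """You are grading another model's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent), weighing factual
accuracy, tone, and structure. Reply with a line "SCORE: <n>"."""

def grade(question: str, answer: str, judge: Callable[[str], str]) -> int:
    """Ask a judge model to grade an answer, then parse out its 1-5 score."""
    reply = judge(RUBRIC.format(question=question, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    if match is None:
        raise ValueError(f"unparsable judge reply: {reply!r}")
    return int(match.group(1))
```

Injecting the judge as a callable keeps the grading logic testable without an API key, which is also how such graders are typically unit-tested.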
But now we're entering this era of agentic systems, and it has shifted again. The unit of evaluation isn't really the output anymore; it's really the trajectory of how it got there. What actions did it take? Did it call the right tools in the right order? Did it recover from something that failed? So overall, we've gone from measuring accuracy to measuring quality, to now measuring behavior.
I think the agency these AI systems are going to have is only going to continue to increase. And what we need to start thinking about very clearly is: as we give these things more agency, where do we still need to take control as humans versus where do we feel confident that the AI systems can act autonomously? It's a handoff we're going to have to think more and more about, where the human in the loop sits as we start giving it more complex problems.
UV: When people talk about building “agentic” systems, what does a real one actually look like under the hood today?
MvL: Agentic AI has become a buzzword in the industry over the last year, and as a result, people have different definitions of what it means. At Anthropic, agentic systems are those that have agency. To put that into perspective, it's helpful to talk about how LLMs have evolved.
When the first LLMs came out, we used them for input-output systems: you'd give it a chunk of text and say, “Can you summarize this for me?” Then we moved to chaining LLMs together in what we called workflows, where you'd do a classification step, pass it to another LLM for the next step, then another, and so on. The reason we had to compartmentalize those steps is that the models weren't intelligent enough to handle the full end-to-end themselves.
Whereas now, with these highly intelligent models, we're able to give them more agency. They can take the tools they have, which allow them to interact with their environment, and decide for themselves how to approach a problem. They can reason over the problem, make a plan, execute that plan, and reroute themselves if they're going down the wrong path. So that's what we mean by agentic: giving an LLM a set of tools and rules, then letting it run autonomously in a loop, making decisions about its own trajectory based on what it discovers along the way.
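The loop Margot describes, tools plus rules, run autonomously until the model decides it's done, can be sketched schematically. Everything here is a stand-in rather than Anthropic's API: `model` is any callable that, given the transcript so far, returns either a tool call or a final answer, and `tools` is a dict of plain functions the agent may use to act on its environment.

```python
def run_agent(task, model, tools, max_steps=10):
    """Run a toy agent loop: the model picks tool calls until it answers."""
    transcript = [("task", task)]
    for _ in range(max_steps):
        decision = model(transcript)
        if decision[0] == "final":          # the model decided it is done
            return decision[1], transcript
        _, name, args = decision            # the model chose a tool call
        observation = tools[name](**args)   # act on the environment
        transcript.append(("observation", name, observation))
    raise RuntimeError("agent did not finish within max_steps")
```

The key property is that the model, not the developer, decides the trajectory: each observation is fed back in, and the loop only ends when the model emits a final answer or hits the step budget.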
UV: At the frontier of agentic systems, where do things actually break in practice today?
MvL: Two places. The first is the harness: the scaffolding that lets an agent run for long stretches independently. We're constantly pushing on this. Until recently, 200k context windows meant you had to be very deliberate about what you passed to the model, which capped how complex a task could get. Since then, we've released sub-agents, 1M context windows, and automatic compaction, where the system compresses its own context to keep going past the limit. The harness is a moving target, and long-horizon reliability is still where a lot of breakage happens.
The second is evaluation, which is where most customers actually get stuck. Building a prototype has become easier (more autonomy means less code), but getting something into production is hard precisely because of the properties that make agentic systems powerful: autonomy, tool use, and multi-step reasoning. Those are exactly what make them difficult to assess. The question I spend most of my time on with customers is: how do you build an evaluation suite that gives you confidence the system will work not just for you, but for every end user it's going to reach?
UV: Are there common challenges or themes you're hearing from Anthropic customers, and where do things stand with the market and product development right now?
MvL: There are two key shifts. The first is the transition from workflows to fully agentic systems. Until about a year ago, workflows were the default way to build with LLMs - chaining models together with each step tightly scoped. We released Claude Code in February 2025, the first fully agentic tool on the market, and it reset expectations for what an AI system could look like. There's a natural lag between frontier capability and enterprise adoption, so what I'm seeing now is a lot of customers with mature workflow-based systems asking how to re-architect them with more agency.
The other thing is that a lot of customers have built great internal tooling: small teams have developed small products for their own productivity that they would love to roll out more broadly within their organization, or ideally to customers. They just can't quite get the evaluation suite in place to allow that. And that's where I spend a lot of my time as an applied AI engineer at Anthropic: how do we change the architecture from workflow to agentic, and how do we build an evaluation suite that gives you the confidence to go live into production?
UV: As we move into agentic systems, how should people be thinking about reskilling?
MvL: The prompt engineering side of things was really important before. Now our skills have to shift towards being able to orchestrate these agentic systems. What that means in practice is: how do we give our model, which is now really intelligent in and of itself, the tools that it needs to have access to? Not only to have the right information to reason over, which might be specific to a customer, but also to allow it to execute within its environment.
For example, connecting it to certain data systems or internal monitoring and logging systems. For any application you use, you need to be able to connect it to your AI system so the model can interact with it. And the key thing there is building on MCP (the Model Context Protocol, an open protocol we developed that lets models connect to the systems people actually use): the connectors from your model to its outside environment.
UV: Building on this idea of AI moving through its adolescence - from the toddler stage where we needed to give it strict rules, to now where we need to loosen those constraints and let it grow - could we take agentic AI into the real world at scale? Could we imagine a world where there are no longer mobile apps, and where phones work with agents that plan things for a person, learn on the go, and even book things proactively, like a dinner or a flight?
MvL: I'd draw a distinction the metaphor glosses over. The technology is young, but the models combined with the harnesses we've built around those models are already mature enough to do genuinely useful work in the real world. So the question isn't really "is the technology ready?", it's "are we using what we already have?". From my perspective, we're not far off from the world you're describing. The capability is largely there; adoption is the lag.
Two things have driven that capability. The first is MCPs, which means models can now plug into Gmail, booking systems, internal tools, and things that were painful to integrate before.
The second is browser use, which has moved quickly in the last six months. Claude can open a Chrome tab and act the way a person would by clicking, filtering, and filling forms. A month ago, I used it to shortlist flats in London: I gave it my constraints, including price, rooms, commute time, and it spun up a tab, worked through the filters, opened each listing, and built me a spreadsheet. That kind of task was barely feasible earlier this year, and it's the clear signal of where real-world agentic use is heading.
UV: When an agent makes a mistake that isn't a hallucination but an action with real consequences, like a transaction sent, a file deleted, a message fired off, the stakes are fundamentally different. How do you design and test for that?
MvL: There are three pillars for evaluating the behavior of an agentic system. The first is similar to how we evaluated traditionally: answer accuracy. What text does the LLM output, and is it correct? That tests what the agent said, but it doesn't test what the agent did.
So the other two pillars assess what it did. The first is tool-use accuracy. An agentic system has access to tools in its environment, and to execute a given task it may need to call, say, three of five available tools with certain parameters. It can call more tools than that; we don't much care what it did under the hood, as long as that required subset was right. Did it pick the right tool with the right parameters? Did it recover when something went wrong?
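A minimal version of that tool-use check might look like the following sketch (the trajectory format is invented for illustration): the expected calls, name plus parameters, must appear in order in the recorded trajectory, but extra calls in between are allowed, matching the "required subset" criterion.

```python
def tool_calls_match(trajectory, expected):
    """Both arguments are lists of (tool_name, params_dict) pairs.

    Returns True iff every expected call appears in the trajectory, in
    order, with exactly the expected parameters; extra calls are ignored.
    """
    remaining = iter(trajectory)
    # Sharing one iterator across the checks enforces the ordering:
    # each expected call must be found after the previous match.
    return all(any(call == want for call in remaining) for want in expected)
```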
And then finally, there's what we call, from a technical perspective, the Tau-bench style, which is probably the most rigorous of the three. Imagine you ask an agent to book a restaurant. The final test isn't what it told you - it's whether the reservation actually exists in the system, with the right name, party size, and time. Tau-bench ignores the conversation the model had entirely and just checks the final state of the environment.
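A Tau-bench-style grader for the restaurant example could be sketched like this. The reservations list is a hypothetical in-memory stand-in for the real booking system's database; note that the conversation transcript never appears anywhere in the check.

```python
def check_final_state(env_records, expected_records):
    """Pass iff every expected record exists in the environment.

    Only the fields named in each expected record are compared, so the
    environment may carry extra fields (e.g. an internal table number).
    """
    return all(
        any(all(rec.get(k) == v for k, v in want.items()) for rec in env_records)
        for want in expected_records
    )
```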
UV: As agentic AI becomes accessible beyond just technical users, how do you think about the right level of autonomy for an agent? And given that the technology is already there but adoption is still slow, what role do education and mindset play in closing that gap?
MvL: The level of autonomy we should be giving these agentic systems is changing all the time. We see major leaps in model intelligence every few months, and therefore, what we can confidently rely on a model to do well now is very different from even three or four months ago. So we need to update our primitives to align with that.
One area we've done a lot of research into internally at Anthropic, and which is very important for helping people understand what these agents can do, is alignment: essentially, the model being honest about what it does and doesn't know.
We did some research into how people are using Claude and found that Claude interrupted itself twice as much as a human interrupted Claude. What that shows is that Claude is very good at understanding the limits of its knowledge and capability. It asks for the clarification it needs, rather than giving the user the impression it can do something when actually it's not able to do it reliably. That's a behavior we try to really bake into the models through the training process.
The other side is that Claude asks humans for approval to do certain tasks. If we think about Claude Code by default, it asks a user to approve before running certain commands or modifying certain files, to keep users safe. But a lot of clicking "approve" leads to something called “approval fatigue,” where, because you get so many messages to approve, you end up just defaulting to approve every time. We found that Claude Code users approve permission prompts 93% of the time, which is very high.
Up until very recently, there were a few options: you could use something called dangerously-skip-permissions, where you'd always permit Claude to take any action; you could manually approve; or you could do sandboxing, where you let Claude run in a very isolated environment so that if it does something wrong, it's not the end of the world. But those are quite separate trade-offs.
We wanted something in the middle; we wanted Claude to be better at understanding what is actually a dangerous command versus what's not. We released something at the end of March called “auto mode” for Claude Code, where we've trained classifiers to have a much deeper understanding of what is dangerous versus what is not, so the only time the human gets asked for permission is when Claude is really not sure. There are a lot of different things that go into what autonomy agents have and how much we should give them, and that's something we're constantly thinking about and trying to be more transparent about at Anthropic.
UV: What is the question about agentic AI that you think nobody is asking right now, something beneath the surface that deserves more attention?
MvL: There is a lot of buzz around AI and agentic AI, and a lot of it is quite speculative. What are agents going to do? How will jobs change? What is the ceiling? But I think the interesting thing is really trying to understand in more detail how this technology is being adopted right now across that entire breadth of people that it's touching.
We're making efforts at Anthropic to do this with our societal impact team. I think we're the only frontier lab actually dedicating teams to studying how AI is being used and how it's affecting people on an economic scale. This is important because without adoption data, every conversation about agentic AI collapses into speculation: ‘is it overhyped?’, ‘is regulation premature?’, when the more useful question is what's actually happening on the ground. Having more data on the adoption is really informative in terms of where we should be going next with this technology and where we should be focusing our efforts as an AI lab.
We've published some really interesting work from our societal impact team very recently, around which subject areas and careers AI is being adopted in and how much, and also how it's being used in education.
UV: What's your read on where European founders are finding genuine advantages for AI startups, and where they're still hitting a ceiling?
MvL: I'm genuinely bullish on the European startup scene. A lot of foundational AI innovation actually happened here (DeepMind was founded in London before the current wave), and the energy around LLMs has reignited that. Stockholm, London, and Paris are the obvious hubs, with newer ones emerging.
The ceiling is fragmentation. Unlike the US, Europe isn't a single market, so a startup that takes off in Paris often struggles to reach the same density of customers in Germany or the Nordics. The geographic hubs are vibrant, but their radius of impact is narrower than you'd expect for a continent this size.
What's genuinely new is that AI has become a lever for founders who are, on average, less well-capitalised than their US counterparts. It compresses the team size needed to build something serious, which means European startups can now compete globally in ways that weren't realistic a few years ago. Companies like Lovable and Legora have scaled remarkably fast. Not because the capital environment changed, but because the technology lets a small team punch well above its weight. That's the real shift: AI isn't just a product category for European founders, it's the thing letting them compete on the global stage at all.
TAGS:
AI, agentic, agentic systems, AI engineer, technical staff, anthropic, margot van laar, agents
