Agentic LLMs
Table of Contents
- Understanding Agentic LLMs
- Chess Games vs Chess Puzzles: The Perception Action Cycle
- Qwen3 Technical Report
- Kimi K2
- Hybrid Models - GLM-4.5
- Agentic Continual Pretraining: Scaling Agents via Continual Pretraining
- Post-training: WebSailor-V2
- Context Management: WebResearcher
- Information Synthesis: WebWeaver
- Context Compacting: ReSum
- Tools: Towards General Agentic Intelligence via Environment Scaling
- Where Do We Go From Here
- Appendix A: Agent Training - Kimi-Dev
 
Understanding Agentic LLMs
Over the last several months, progress in LLMs has largely diverged into two work streams: thinking or reasoning models, which output reasoning tokens before responding, and hybrid or agentic models, which are capable of both fast and slow responses. These models are intended primarily for use in agentic loops: multi-turn-focused interactions with an environment where a model will make a plan, take some action, observe what happens, and repeat until solving the task.
This wave was primarily kicked off by Anthropic, who have largely stuck it out exclusively with the hybrid model approach. One very popular use case for this sort of model is Claude Code, a CLI tool which gives Claude access to a local filesystem and a variety of simple tools it can use.
Lots of other labs have similar setups:
- OpenAI's Codex
- Google's Gemini Code Assist
- Tongyi's Qwen Code
…as well as other frontier model providers like DeepSeek (DeepSeek-V3.1), Moonshot (Kimi K2), and Zhipu AI (GLM) making explicit efforts to be easily dropped-in with Claude Code.
Relatedly, another similar line of research is research agents, whose primary purpose is to answer a question by looking around the internet first, then synthesizing a detailed report outlining what they found. Like CLI coding tools, there are a lot of different versions:
- Gemini Deep Research
- OpenAI Deep Research
- Tongyi DeepResearch
- Kimi-Researcher
…and others. These are multi-turn focused systems that are capable of processing extremely large volumes of information before responding, making them vaguely similar to these coding tools built for navigating large codebases.
So, what the heck is going on here? A lot of questions pop up. Why would you want to use a thinking model vs an agentic model? How would you build a research model?
Chess Games vs Chess Puzzles: The Perception Action Cycle
I have spoken extensively in the past about the importance of the multi-turn settings for getting value out of LLMs. An analogy I really like to use is: does your problem feel more like a Chess Puzzle, or does it feel like a Chess Game?
LLM Evaluation is broadly split into two very large "categories": can you solve this really difficult task right now, in one turn (a puzzle) or can you do this long-term interaction with an environment and arrive at a goal state (a game). An example of the former is the AIME math problems that OpenAI's models do extremely well at. An example of the latter is Claude playing Pokemon Red.
Broadly, LLMs with thinking seem like the right approach for solving very difficult tasks like AIME math problems. But they don't seem like the right paradigm for something like Pokemon Red: no matter how much you think, you need to see what is on the other side of a door to make progress. Herein lies the major distinction between "reasoning" and "agentic" LLMs. A model specialized for thinking a lot before acting will be better at certain problems and worse at others, compared to a model which will prod the environment, observe the outcome, and repeat.
 
The core capability we're interested in is the perception-action cycle. How do we make a model which will meaningfully observe the results of actions it takes in the environment? How do we make a model which understands that it can take actions in the first place, in order to progress in solving some sort of goal? What kind of actions could those be? How do we handle cases where we have to do a lot of loops of this to solve the problem? These questions are broadly a bit different compared to standard "how do you solve this very hard task" benchmarks that were once the de facto gold standard for measuring how useful LLMs were.
Why are we interested in such a capability? First, for practical reasons: coding CLI tools like Claude Code and OpenAI Codex are super popular tools for coding now, and making models better in those settings has huge potential productive value1. Second, it seems like actively learning through interactions and learning through experience might be a way we can unlock yet-unseen performance from models which are currently primarily driven by massive pretraining. Human beings get better at things by learning them through experience, so it stands to reason that a model which learns through interaction with an environment could do better than what we've seen so far.
Qwen3 Technical Report
The Claude models are a little interesting in comparison to something like OpenAI o1: they seem to have a special setting that turns them into reasoning models, rather than being reasoning models by default. DeepSeek, quiet for most of 2025, has also made clear steps in this direction with the introduction of their V3.1 model, their hybridification of DeepSeek-V3. These models don't seem to default to this mode the way o3 / gpt-5 / R1 etc. do, and they seem broadly like "extra strong no-thinking, below average with-thinking" type models. How does this work? What's the advantage here?
The Qwen 3 Technical Report from May 2025 is one of the earliest papers which outlined how a model could be both thinking and non-thinking. Qwen is notable for an extremely high volume of model releases, so they already had both chat-optimized models (e.g. Qwen2.5) and dedicated reasoning models (e.g. QwQ). This is primarily framed as a convenience thing: sometimes you want a fast response (and would pick non-thinking mode) and other times you want complex multi-step reasoning (and would pick thinking mode), so combining these into a single model avoids a potentially costly switch between models.
Notable about the Qwen models are their heinously large training datasets - 36 trillion tokens in all. A lot of this is synthetic: text extracted from pdfs using Qwen2.5-VL, code generated by Qwen2.5-coder, etc. Likewise, following DeepSeek-R1's results, smaller models are trained by distillation from the stronger models2, rather than post-training them in earnest. For this reason we will mostly be covering the flagship: Qwen3-235B-A22B, a mixture-of-experts model which is both thinking and non-thinking.
Architecture
Qwen3 is a mostly unremarkable MoE model, architecturally speaking. It uses Grouped Query Attention, SwiGLU, RoPE, RMSNorm, etc. From DeepSeekMoE, it adopts fine-grained expert segmentation, which has become a slightly more standard architectural choice in recent months. But otherwise, this is mostly standard fare covered in my previous DeepSeek writeups.
 
Training
Qwen3 is pretrained similarly to other similarly sized large models, so for simplicity we will primarily cover what imparts the optional reasoning behaviors:
Reasoning Stage in Pretraining
After the first phase of pretraining (30T tokens at 4096 sequence length), Qwen3 enters a second phase where it is pretrained on 5T of collected reasoning tokens. They are very light on details here: it sits between normal pretraining and long-context pretraining, it's partially synthetic, and the learning rate is decayed at an accelerated rate. One could imagine this could be done by collecting 5T tokens from something like QwQ, filtered for quality or correctness3.
Post-training
This is where the meat of the contributions is. Recall from DeepSeek-R1 that R1 was trained with long-CoT cold start from R1-Zero, followed by reasoning RL with verifiable rewards, followed by general purpose RL post-training as normal. Qwen introduces a new phase: Thinking Mode Fusion, which takes place after the reasoning RL step.
 
Training this model for the most part follows the formula of DeepSeek-R1. A cold start dataset is assembled using QwQ outputs filtered by Qwen2.5-72B, composed of verifiable problems. Then reasoning-specific RLVR is performed using Group Relative Policy Optimization (GRPO). This will naturally cause the model to output more tokens over time in order to boost the likelihood of a correct response, which improves performance on difficult reasoning problems (e.g. AIME).
Thinking Mode Fusion is where we break from the standard fare. The way Qwen accomplishes this is extremely simple: they just do SFT with a chat template where thinking tokens are removed4.
 
In a sense, this is the simplest possible thing you could do. Once we pass stage 2, we make the model generate a lot of responses to the queries from Stage 1 to use as the reasoning split of the SFT phase (i.e. "make the same exact response, when the /think flag is there"). On top of this, they include some standard instruction tuning data with a corresponding /nothink tag in the prompt. These have no thinking tokens, and it's functionally similar to regular instruction tuning. This way, when the user provides /think, it will think. When the user provides /nothink, it will output nothing in the think block.
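As a concrete illustration, the fused SFT pairs might look something like the sketch below. The /think and /nothink flags come from the paper; the exact chat template, the tag placement, and the empty think block are my assumptions for illustration.

```python
# Rough sketch of fused SFT data for Thinking Mode Fusion (format details are
# assumptions, not Qwen's exact chat template).
fusion_sft_examples = [
    {   # reasoning split: response generated by the stage-2 model, thinking kept
        "prompt": "Solve x^2 - 5x + 6 = 0 /think",
        "response": "<think>\nFactor: (x - 2)(x - 3) = 0, so x = 2 or x = 3.\n</think>\n\nx = 2 or x = 3.",
    },
    {   # instruction-tuning split: no reasoning, think block left empty
        "prompt": "Solve x^2 - 5x + 6 = 0 /nothink",
        "response": "<think>\n\n</think>\n\nx = 2 or x = 3.",
    },
]
```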
Qwen3 claims that this type of fusion allows for thinking budgets, where the model becomes more capable of generating responses from incomplete thought traces. If the model crosses a set number of thinking tokens, the output stream is halted, the string "Considering the limited time by the user, I have to give the solution based on the thinking directly now.\n</think>.\n\n" is appended to the end, and the model begins responding. This is not anything they have trained into the model; it's just a claim they make that the model is capable of handling fewer thinking tokens by virtue of undergoing this SFT phase6.
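Enforcing the budget at inference time could look roughly like this. The early-exit string is quoted from the paper; the generation API here is a placeholder, not Qwen's actual interface.

```python
# Minimal sketch of enforcing a thinking budget at inference time.
EARLY_EXIT = ("Considering the limited time by the user, I have to give the "
              "solution based on the thinking directly now.\n</think>.\n\n")

def generate_with_thinking_budget(model, prompt, think_budget=1024, answer_budget=2048):
    # Let the model think until it closes the think block or hits the budget.
    thought = model.generate(prompt + "<think>\n", stop="</think>", max_new_tokens=think_budget)
    if "</think>" not in thought:
        # Budget hit: halt the thought stream, splice in the early-exit string,
        # and let the model answer from the truncated trace.
        thought += EARLY_EXIT
    return model.generate(prompt + "<think>\n" + thought, max_new_tokens=answer_budget)
```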
Following this fusion phase, they do a regular general-purpose RL post-training phase where they target a lot of tasks and specifically reward them. This is largely similar to other works, and slightly outside the scope of this writeup, but I will briefly point out that an explicitly called-out task here is "agent ability", i.e. training a model to invoke tools in multi-turn interaction cycles. Specific details on this component are light, it's one of many capabilities instilled during the general RL phase.
Results
Their resulting model is pretty good relative to thinking models.
 
It's also pretty good relative to non thinking models.
 
This paper highlights something curious about these hybrid models – by allowing the model to be good in both modes, it often sacrifices performance it could have reasonably had if it had specialized in one or the other. The Qwen team seems pretty happy with this result: the model doesn't clearly edge out the frontier, but it holds its own against models of both classes7.
Discussion
Qwen3 is a really simple paper: train the model via SFT so that you can ask it to not output thinking tokens. A cynical take would be comparing it to a non-thinking model that you just prompt to output lots of tokens before responding:
 
One thing which will immediately become apparent is the difficulty of comparing a thinking model to a non-thinking model prompted via this sort of chain-of-thought prompting. It is a bit like instilling a very useful prompting template into every single response the model outputs, and here with this Qwen paper we've just reverse-engineered not including that prompting template via straightforward SFT.
But we are seeing the beginnings, vaguely, of a model which is a bit more directly intended to operate in a multi-turn setting. This is actually kind of a big deal! Many labs had strongly tunneled in on scaling test-time compute, and the idea of zeroing it back out just seemed like going backwards for no reason. Is it better to have your model think even longer so it can solve even harder single-turn problems, or is this sort of "Blitz Chess" model still useful in places?
Kimi K2
So now we vaguely know it's possible to make a model which can be both a "thinking model" and a "non-thinking model". So let's blur the line between them even more! Moonshot's Kimi K2 model, released on July 28th 2025, is a non-thinking model intended to operate well inside of multi-turn agentic harnesses, with 1 Trillion total (!!!!) / 32B active parameters trained on 15.5T tokens. Kimi is a "non-thinking" model, but it outputs 3x the tokens of other non-thinking models. It's sort of like a halfway point between a thinking model and a non-thinking model.
Moonshot is pretty open that a stated goal of theirs is agentic models that learn through experience. Another notable work by the Kimi team is checkpoint-engine, middleware intended to quickly update the parameters of a very large model in-place. The Kimi K2 model is also cool as hell, scoring clearly lowest on sycophancy benchmarks and having no reservations about telling the user that they are completely wrong about something8.
Kimi's three main contributions are:
- The MuonClip Optimizer, which greatly stabilizes training
- An agentic data synthesis pipeline which generates tool use demonstrations in agentic harnesses
- A general purpose RLVR framework which includes self-critique mechanisms
Aside A: Muon
Most very large models are trained with the tried-and-tested AdamW Optimizer, so the choice of a novel optimizer is very unusual. To understand Kimi's MuonClip, we need some background on Muon first.
Muon is a relatively recent optimizer, developed by Keller Jordan in 2024 for use in speed-training NanoGPT and CIFAR-10. Muon stands for MomentUm Orthogonalized by Newton-Schulz, which applies an orthogonalization post-processing step on top of SGD with Momentum.
 
 
What does that mean, orthogonalize? We can take a look at the operation that makes this different from regular SGD with momentum:
\[ O = \arg\min_{O} \| O - G \|_F \quad \text{s.t.} \quad O^\top O = I \ \text{or} \ O O^\top = I \]
\(||O - G||_F\) is the Frobenius Norm distance, aka the square root of the sum of the squared absolute values of all the elements. We want to find the matrix \(O\) which is as close as possible to \(G\) under this distance metric. However, we are limiting ourselves to only solutions where \(O^TO = I\) or \(OO^T = I\), i.e. matrices where \(O\) is orthogonal. Recall from linear algebra that orthogonal matrices are matrices where all the columns are perpendicular to each other and have unit length: this means when you apply an orthogonal matrix to any vector, you get a vector the same exact length, just transformed (rotated / reflected / etc).
So what Muon does is replace the update matrix with something roughly orthogonal to it. They use an iterative algorithm called Newton-Schulz to approximately orthogonalize it. Why would this make it better than AdamW or SGD-Momentum? From Jordan's blog:
We would first like to observe that one valid answer would be: It just is OK? (Shazeer 2020)
…
for an empirically-flavored motivation, we observe that based on manual inspection, the updates produced by both SGD-momentum and Adam for the 2D parameters in transformer-based neural networks typically have very high condition number. That is, they are almost low-rank matrices, with the updates for all neurons being dominated by just a few directions. We speculate that orthogonalization effectively increases the scale of other “rare directions” which have small magnitude in the update but are nevertheless important for learning.
The biggest advantage Muon has is that it's super fast compared to AdamW, improving training speed on nanoGPT by 35%. However, optimizers are a ruthless business: it seems like every week there's a new optimizer which supposedly beats AdamW, and then everybody keeps using AdamW anyway. It's unclear if the gains from Muon would scale to something truly huge, or if new challenges would emerge at that scale.
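To make the orthogonalization step concrete, here is a small numpy sketch of the Newton-Schulz iteration in the spirit of Jordan's reference implementation. The quintic coefficients are taken from that implementation; treat this as an illustration, not the production optimizer.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize an update matrix G, as in Muon (sketch)."""
    a, b, c = 3.4445, -4.7750, 2.0315          # quintic coefficients from the reference implementation
    transposed = G.shape[0] > G.shape[1]
    X = G.T if transposed else G               # work with the "wide" orientation
    X = X / (np.linalg.norm(X) + eps)          # scale so singular values are <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X                      # pushes all singular values toward 1
    return X.T if transposed else X

# Muon's update is then roughly: W -= lr * newton_schulz_orthogonalize(momentum_buffer)
```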
MuonClip
Switching gears back to the Kimi paper, there does seem to be a catch when scaling Muon up to something huge. Namely: training instability due to exploding attention logits. Exploding and vanishing gradients are a classic source of training instability in ML models, but lots of little things make them a less common problem in modern training: stuff like batch norm, residual connections, gradient clipping, and so on.
Kimi's solution to the exploding attention logits in Muon is QK-Clip. This is pretty straightforward: define a big threshold \(\tau\), and if the max attention logit exceeds that threshold, it will rescale the projection weights \(W_q\) and \(W_k\) down only for the head that explodes. Basically: when a logit is going to explode, make it not explode, otherwise follow Muon as normal.
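A minimal sketch of what that per-head rescaling might look like, ignoring the MLA-specific details in the paper; the tensor layout and the threshold value here are assumptions for illustration.

```python
import torch

@torch.no_grad()
def qk_clip_(W_q, W_k, max_logits, tau=100.0):
    """Sketch of QK-Clip: rescale query/key projections for heads whose max
    attention logit exceeded tau on the last step.

    W_q, W_k: [num_heads, ...] per-head projection weights (layout assumed).
    max_logits: [num_heads] maximum attention logit observed per head.
    """
    for h, s_max in enumerate(max_logits.tolist()):
        if s_max > tau:
            gamma = tau / s_max
            # split the correction evenly between the query and key projections
            W_q[h] *= gamma ** 0.5
            W_k[h] *= gamma ** 0.5
```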
 
As a result, they stabilize, at large-transformer scale, an algorithm that had previously only been shown to beat AdamW on smaller transformer models.
Token Efficiency via Rephrasing
How do we fit the most possible value into a corpus of tokens? We already know from the scaling laws work in DeepSeek-LLM that token quality influences the subsequent results a lot – a token isn't just a token. Training a single epoch is insufficient (especially for rare facts), but training more than one epoch damages generalization. How do we ensure a high volume of high quality tokens without overfitting?
Kimi's contribution here is: rather than showing the same text to a model twice, get a language model to rephrase the text in a different way, and show them two versions of the same text. They find that extending the dataset this way is generally way more effective than including more epochs: the same text 10 times is way less valuable than the same text 10 ways9.
 
Architecture and Pretraining
Kimi K2 is more or less an extremely large version of DeepSeek-V3, leveraging MoE / MLA / fine-grained expert segmentation / etc.
 
Kimi K2 was notable for being the first publicly available open-weights model which had over 1 trillion parameters, but interestingly it seems like most of this scaling compared to DeepSeek-V3 was predominantly horizontal: it's the same depth as V3 with roughly the same number of active params, just with way more experts.
It's important to remember that the stated purpose of Kimi K2 is to exist in agentic applications: multi-turn, with long sequence lengths, and so on. Kimi calls out that they use only 64 attention heads compared to V3's 128: doubling the head count costs 83% more inference FLOPs at long context for roughly a 1% improvement in validation loss. That's a fine tradeoff when hill-climbing single-turn problems, but not such a good one for something specifically tailored to agentic applications, so Kimi opts for half the attention heads.
Pretraining follows largely the same standard formula, but with MuonClip added. They train on 15.5T tokens with a constant learning rate followed by a cosine decay. Pretraining concludes with a long-context phase using YaRN to extend the context size to 128k.
Post-Training
Agentic Data Synthesis for Tools (SFT)
Kimi's SFT phase involves a substantial data synthesis pipeline intended to generate a large volume of data which will help the model learn to use tools. This pipeline has three stages:
- Construct a large repository of tool specs from the internet + LLM-synthetic tools
- Create an agent and corresponding tasks for each toolset in the tool repository
- For each agent and task, generate trajectories where the agent does the task using the tools
 
For step 1, they first collect 3000+ MCP tools from public github repositories. After that, they evolve synthetic tools using a hierarchical domain generating process, yielding 20,000 synthetic tools covering a large swathe of possible applications.
For agent generation, they just generate a large number of system prompts, and then equip those system prompts with different sets of tools from the repository. These in turn are used to generate tasks which could conceivably be solved by using those tools.
Multi-turn trajectories are then generated: they simulate a lot of possible users via LLM to go back and forth with the agents, and use a "sophisticated tool simulator (functionally equivalent to a world model)"10 in order to simulate what would happen if the agent called the tool. This is used to generate a large volume of multi-turn dialogues where a user submits a query and the agent calls tools over multiple turns. These are subsequently filtered using LLM-as-judge to keep only the ones which produce trajectories that solve the tasks.
They also forego the world model for some proportion of the trajectories, in favor of actually running code, in order to produce maximally accurate ground-truth interaction data. They assemble a very large multi-turn SFT dataset this way, which they claim significantly improves the tool-use capabilities of their model.
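Putting the pieces together, the trajectory-generation stage might look roughly like the sketch below; every helper here is a hypothetical stand-in for an LLM call or a real executor, not Kimi's actual pipeline.

```python
import random

def synthesize_trajectories(agent_task_pairs, simulate_user, tool_simulator,
                            execute_tool, judge, real_execution_frac=0.1, max_turns=16):
    """Sketch: roll out simulated user/agent/tool dialogues, keep only ones
    that an LLM judge says solved the task."""
    kept = []
    for agent, task in agent_task_pairs:
        dialogue = [{"role": "user", "content": simulate_user(task, history=[])}]
        for _ in range(max_turns):
            step = agent.step(dialogue)
            dialogue.append({"role": "assistant", "content": step.content})
            if step.is_done:
                break
            if step.tool_call is not None:
                # Most tool calls are answered by an LLM "world model"; a fraction
                # are actually executed to anchor the data in ground truth.
                run = execute_tool if random.random() < real_execution_frac else tool_simulator
                dialogue.append({"role": "tool", "content": run(step.tool_call)})
            else:
                dialogue.append({"role": "user", "content": simulate_user(task, history=dialogue)})
        if judge(task, dialogue):          # LLM-as-judge filter
            kept.append(dialogue)
    return kept
```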
Reinforcement Learning
K2 goes through an interesting RL pipeline which introduces some attempt at extending beyond verifiable rewards.
For the standard math, STEM, and logic tasks, they collect a large number of tasks as expected. A key thing done at this step is filtering out prompts which don't provide much signal: if the SFT model always gets the answer correct, it's not useful to include in the RL prompt set; if the SFT model never gets the answer correct, it's also not useful to include. A lot of detail is also provided on how they verify these problems: there's a lot more LLM-as-judge in this stage than you might expect.
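The filtering step itself is simple enough to sketch; `sample_and_grade` is a hypothetical helper that samples one completion from the SFT model and grades it.

```python
def filter_rl_prompts(prompts, sample_and_grade, n_samples=8):
    """Keep only prompts the SFT model solves sometimes, but not always."""
    kept = []
    for prompt in prompts:
        n_correct = sum(sample_and_grade(prompt) for _ in range(n_samples))
        if 0 < n_correct < n_samples:   # always-right and always-wrong prompts carry no learning signal
            kept.append(prompt)
    return kept
```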
The really interesting part of Kimi K2's RL phase is the Self-Critique Rubric Reward. This is intended to be a "verifiability mechanism" for typically non-verifiable problems, subject to some very straightforward rubrics. These are intended to instill fundamental values in the LLM, as well as eliminate annoying or reward-hacking-like behaviors:
 
As mentioned earlier, Kimi K2 scores super low on the flattery benchmarks compared to models like 4o, so it's clear that this phase is imparting some interesting behavior to the model.
The Kimi K2 RL algorithm follows Kimi 1.5, which uses online policy mirror descent. This means that the policy is updated iteratively by maximizing expected reward while being regularized by a KL divergence term that acts as a Bregman divergence12, constraining the new policy to remain close to the current policy and ensuring geometrically natural updates in the space of probability distributions. Basically, it's like GRPO except it uses mirror descent optimization instead of the PPO-like framing.
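In symbols, each iteration solves a KL-regularized objective of roughly the following form; this is the generic mirror-descent formulation the sentence above describes, not the exact surrogate loss from the Kimi papers:

\[
\pi_{k+1} = \arg\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\left[ r(x, y) \right] \;-\; \tau\, \mathbb{E}_{x \sim \mathcal{D}}\left[ \mathrm{KL}\!\left( \pi(\cdot \mid x) \,\|\, \pi_k(\cdot \mid x) \right) \right]
\]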
 
The changes from K1.5 are using Muon as the optimizer, an enforced maximum token budget13, an auxiliary PTX loss, and a high sampling temperature with a decay schedule.
There additionally is a lot of detail about the RL infrastructure, specifically the training / checkpoint / inference engines necessary to make it work at scale. These are slightly outside the scope of what I want to cover here, but are worth a read.
Discussion
Kimi K2 gets some really interesting results. As mentioned, it's sort of complicated to compare it to other models. It's a bit like a reasoner model squeezed into a non-reasoner box. It's good at using tools, and it operates well in agentic harnesses. Getting Muon to work for a model this large is also no small contribution – it will be super interesting to see if it catches on or not.
One thing that stands out to me about the Kimi K2 work is the prevalence of LLM-as-judge, an often-proposed verification mechanism which feels like it never pans out in practice. Kimi used it all over the place, and it actually did seem to work at making the model less sycophantic. Anthropic's Constitutional AI paper was the first to outline this sort of RLAIF approach way back in 2022, and DeepSeek-V3 made some vague allusion to using it late last year. Here we have a clear, concrete example of it working to do a specific thing: an undesirable behavior (sycophancy) addressed primarily through self-critique as an RL signal.
I am admittedly surprised by this. I wonder if it's the case that there are thresholds of capability where better and better LLMs can self-critique their way to improved performance on gradually harder tasks, and similar to "reasoner" behavior emerging through simple RLVR, I wonder if "verifying unverifiable problems" is another such wall that can be crossed eventually.
Hybrid Models - GLM-4.5
GLM-4.5 by Zhipu AI further iterates on the hybrid model literature. They get a great result here, putting together a 355B total / 32B active parameter model trained on 23T tokens which seems roughly at parity with the Claude and OpenAI models. These models are natively capable of both extended thinking and immediate responses much like Qwen3, Claude Sonnet, and so on.
Architecture
GLM largely follows DeepSeek-V3, differing by making the model much deeper, using fewer experts, and using Grouped-Query Attention instead of MLA. But otherwise, it's an MoE with fine-grained experts, using loss-free balancing, with multi-token prediction layers, etc.
 
It's sort of the opposite of Kimi K2 – rather than making V3-but-wider, it made it deeper.
Pre-Training and Mid-Training
 
Pretraining is similar to Qwen3, a general-purpose pretraining phase with a reasoning pretrain corpus after that. Rather than immediately jump into long context pretraining a la Kimi K2, GLM-4.5 uses three phases that they refer to as "Mid-Training"14: Repo-level code training at 32k context length, Synthetic Reasoning data15 at 32k context length, and "Long-Context & Agent Training" which repurposes the typical long-context pretraining phase to include the large-scale synthetic agent trajectories similar to Kimi K2.
GLM-4.5 uses Muon as the optimizer much like Kimi K2, but does not report the same exploding attention logits problem that Kimi did; it's not clear why.
Post-Training
GLM-4.5 divides post-training into two stages. First, they construct Expert Models: essentially one finetuned model for each of Reasoning, Agent, and General chat16. Second, they use self-distillation to combine these three models back into a single model good at all three things.
SFT
As with Qwen3, the thinking-vs-non-thinking boundary is primarily determined in the SFT phase here: the experts are provided with data suggesting which types of responses should generally yield no long CoT, allowing the model to operate with or without producing thinking tokens for any given response. Likewise, GLM-4.5 follows Kimi K2 for agentic SFT data, by accumulating tools, generating trajectories, etc.
Reasoning RL
GLM-4.5 uses GRPO (sans the KL loss term) for the reasoning RL phase, which of course introduces the difficulty problem: if every completion for a prompt is correct, there is no signal, and if every completion is wrong, there is also no signal. This introduces a problem where once the model improves during training, a bunch of examples stop providing any value whatsoever (i.e. wasting time for no reason).
This paper goes into detail about addressing this problem17: they use a two-stage approach. First, they filter down to problems which the SFT model can get some proportion of the time, but not always. Then, after training for a number of steps, they switch datasets to examples which are pass@8=0 and pass@512>0, i.e. problems which are gettable by the SFT model but only barely. This improves performance substantially compared to saturating only moderately difficult problems.
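A sketch of that two-stage curriculum, with `pass_at_k` as a hypothetical helper that estimates the SFT model's pass rate on a problem from k sampled completions:

```python
def build_rl_curriculum(problems, pass_at_k):
    """Split problems into a moderate-difficulty stage and a barely-solvable stage."""
    stage1, stage2 = [], []
    for problem in problems:
        p8 = pass_at_k(problem, k=8)
        if 0.0 < p8 < 1.0:
            stage1.append(problem)                     # solvable sometimes, but not always
        elif p8 == 0.0 and pass_at_k(problem, k=512) > 0.0:
            stage2.append(problem)                     # only barely solvable: used after stage 1 saturates
    return stage1, stage2
```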
 
They use 64k output length throughout the Reasoning RL phase18, with dynamic temperature that will rise when average rewards level off, in order to increase exploration19. They do RL upon verifiable code and science problems and see increased performance as a result.
Agentic RL
On top of the now-standard-fare reasoning RL, GLM does specifically targeted RL training for agentic uses, specifically to improve the model at verifiable web-search and code-generation tasks.
The authors assemble a dataset of web-search problems by synthesizing questions based on multiple internet documents, and then obfuscating content from those documents in order to require a successful web search to complete properly20.
For software engineering problems, this is done via assembling a dataset of github issues and subsequent pull requests addressing them. This provides a context (the repo), a query (the issue), and a solution / unit tests (the pull requests) that form realistic environments for the model to learn to solve over multiple turns.
Iterative Distillation
A really interesting component of GLM's RL pipeline is Iterative Self-Distillation. This can be viewed as something similar to what was done to produce DeepSeek-R1, but repeated several times.
In DeepSeek-R1, they created DeepSeek-R1-Zero via RLVR, created a cold-start dataset of filtered responses from R1-Zero, and performed SFT upon the base model with these before doing RLVR on that model to create DeepSeek-R1. GLM takes this a step further. Once they train a model using RL, they generate even more cold start data from it in order to create a superior SFT model, and then perform RL upon it once again, looping this process21. This can gradually push the difficulty of problems it's capable of solving, making it a pretty efficient way to increase the performance during this phase.
It's not 100 percent clear in the paper what order everything is performed in. But this is likely how they are combining the models after stage 1 of the post-training phase: three fine-tuned models are produced, one very large SFT dataset is created, and that in turn produces a superior post-SFT model.
Scaling via More Turns
One thing I love about this paper is their result showing that you can reap the performance benefits associated with scaling test-time compute just by allowing the model to take several turns, interacting with the environment, observing, and acting over and over. This is much like how a reasoner model scales test-time compute with lots of tokens before responding, except in this case the model gets to continuously receive an updated status of the world every turn, which seems obviously strong.
 
General RL
Like most models, GLM finishes up with general-purpose RL for improving various things about the model. They do RLHF and RLAIF to improve instruction following, function calling, mono-lingual responses, formatting errors, and so on22.
Function-calling RL here has a special section dedicated to end-to-end multi-turn RL. The formulation here is simpler than expected: an LLM acts as the user, and you are given some task \(I\). You are given a reward of 1 if all of your tool calls are correctly formatted and the task \(I\) is completed by time \(T\) (per the user's feedback). Otherwise, the reward is 0. This can include asking the user for more information, if necessary, as long as the formatting is always correct and the task is eventually completed.
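The reward itself is simple enough to sketch; the format checker and the user-simulator judgement below are hypothetical stand-ins for how the paper's components might plug together.

```python
def function_calling_reward(trajectory, is_well_formatted, user_confirms_completion):
    """Binary reward: 1 only if every tool call is well-formatted AND the
    simulated user confirms the task was completed."""
    all_calls_valid = all(is_well_formatted(call) for call in trajectory.tool_calls)
    return 1.0 if (all_calls_valid and user_confirms_completion(trajectory)) else 0.0
```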
Results
GLM-4.5 and GLM-4.5-Air benchmark very well. One very interesting thing GLM-4.5 provides beyond just benchmarks is human blind testing with specific participants, which is an extreme rarity compared to crowdsourced comparisons (e.g. LMArena, which I am not fond of these days).
 
These results are not so easy to interpret directly, but they're really interesting. They show off some interesting personality of these three models.
Likewise, there's substantial comparison for GLM specifically operating inside Claude Code, determining which completions are preferred for a benchmark of specific tasks.
 
There are some zones where GLM edges out Sonnet 4, but in general it trails just behind, clearly ahead of other open source models. Important to mention here is price: Zhipu's GLM coding subscription plan is just $3 per month, relative to the equivalent Anthropic one. GLM-4.5 is substantially cheaper than Claude Sonnet, which makes these results pretty noteworthy.
 
 
Discussion
We have some interesting tools in our toolkit now:
- SFT for thinking / no-thinking
- Data Synthesis for tool use
- Iterative Distillation
- Multi-turn RL with simulated user / environment interactions
It seems like we have a pretty good handle now on how to create these models which are capable of thinking vs responding quickly, and even ones capable of determining that on their own. It's easy to force a model like GLM-4.5 to not think just by prefilling a zero-token reasoning trace before it continues responding, and otherwise it will decide on its own based on similarity to the SFT data it was presented.
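For example, the prefill trick might look like this; the chat-template helper and generate call are placeholders, and the <think> tag convention is an assumption about the model's template.

```python
# Sketch of forcing a hybrid model to skip thinking by prefilling a zero-token
# reasoning trace before it continues the response.
prompt = apply_chat_template(messages, add_generation_prompt=True) + "<think>\n\n</think>\n\n"
response = model.generate(prompt)   # the model now answers directly, without a CoT
```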
We have primarily been focused on these agent models relative to their ability to operate in Claude-Code-like CLI tools. But the agentic rabbit hole actually goes a bit deeper: deep research tools are another use case of agentic, multi-turn workflows, which have to synthesize a lot of information together to work well. This produces yet more challenges we haven't addressed much yet. What do those look like?
Agentic Continual Pretraining: Scaling Agents via Continual Pretraining
On September 16th 2025, Alibaba dropped Tongyi DeepResearch, a fully open source deep research web agent, AgentFounder, which is extremely small: just 30B total / 3B active parameters, matching or outperforming OpenAI Deep Research on a variety of benchmarks23.
Alongside this model drop, they also dropped six (!) separate technical reports outlining the steps they took to get it working. The first of these was Scaling Agents via Continual Pretraining, which explains the pretraining steps that help when creating an agent model specifically for research. In this paper, they outline how they created the model itself, which outperforms OpenAI deep research / Gemini deep research / DeepSeek / GLM / etc. at research tasks like Humanity's Last Exam24.
The major thesis is that open source Deep Research agents (e.g. GLM / DeepSeek in web-browsing modes) lag behind closed source Deep Research tools (e.g. Gemini / OpenAI) because they are dependent on general-purpose foundation models, which are A: not trained with this setting in mind, and B: lack the inductive biases for agentic tasks.
To address this, Tongyi introduces Agentic Continual Pretraining (Agentic CPT). This is a two-stage training strategy where the model first pretrains for First Action Synthesis (FAS) and then pretrains for Higher-Order Action Synthesis (HAS).
 
Agentic CPT
Tongyi's Agentic CPT is a relatively modest additional 300B tokens which are pretrained on top of Qwen-30B-A3B-Base. As mentioned, it follows two phases:
First, in stage 1, the model is trained on 200B tokens consisting primarily of shorter-context agent workflow examples. This is intended to teach the model how to do preliminary agent tasks: invoking tools, reasoning over multiple steps, and so on.
Second, in stage 2, it is trained on 100B tokens with much longer 128k context windows, consisting primarily of longer and more involved agent trajectories. This allows the LLM to get better at more complicated tasks involving a lot of complex decisions, as well as to develop an eye for longer-horizon planning.
First Action Synthesis
It is a bit of a chicken and egg problem to create agent trajectory data without an already-trained agent. With no supervisory signal to work with, we need to figure out some way to generate example trajectories directly from the data sources.
First Action Synthesis teaches the model how to do two types of action: planning and reasoning. This is done by transforming static knowledge sources into multi-style questions.
Phase 1 of this is building an "open-world memory" by rephrasing text in web pages to tuples containing keys and facts about those keys:
For instance, web data containing “The number of tourist arrivals in France increased from 3,793 thousand in May 2025 to 4,222 thousand in June” can be reformulated as: (“France”, “Tourist arrivals in France reached 4,222 thousand in June 2025”), rather than limiting to conventional wiki-style knowledge such as “Paris is the capital of France.”
After building a really large dataset of facts like this, in phase 2 they sample from this knowledge base to generate a lot of questions whose answers they already know, since the underlying information was collected via web search in the first place.
The notable thing about this approach is that they go from "web-search -> multi-source knowledge -> question -> answer" in generation, which allows for the creation of a dataset of the form "question -> web-search -> multi-source knowledge -> answer".
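A rough sketch of that reverse flow, with the LLM calls (`extract_entity_facts`, `compose_question`) as hypothetical stand-ins:

```python
import random
from collections import defaultdict

def build_fas_questions(pages, extract_entity_facts, compose_question, n_questions=10_000):
    """Restructure crawled text into an entity-keyed memory, then write
    questions whose answers are already known from the collected facts."""
    memory = defaultdict(list)
    for page in pages:
        for entity, fact in extract_entity_facts(page):   # e.g. ("France", "Tourist arrivals ... June 2025")
            memory[entity].append(fact)

    samples = []
    for entity in random.sample(list(memory), k=min(n_questions, len(memory))):
        facts = random.sample(memory[entity], k=min(3, len(memory[entity])))
        question, answer = compose_question(entity, facts)
        # stored in the "question -> web-search -> multi-source knowledge -> answer" direction
        samples.append({"question": question, "supporting_facts": facts, "answer": answer})
    return samples
```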
 
In addition, a large quantity of synthetic data is generated for the LLM's initial planning phase before kicking off the agentic loop. This part seems largely possible with existing LLMs ("plan how you would look up X, predict which tool is the first one you should use, format an API call to that tool"), which allows the full FAS pretraining set to take the form "question -> planning -> web-search -> multi-source knowledge -> answer"25.
However, as the accumulated datasets grow very large, it becomes more necessary to include the ability to synthesize the accumulated information before attempting to answer the question26. After they've accumulated data as described above, they use LLMs to decompose the questions into sub-questions, generate detailed chains of thought for each sub-question, and then combine them into a longer final answer27.
 
This strategy is used to generate a relatively high volume of agent data which can be used for the first stage of pretraining.
Higher-order Action Synthesis
From here, the question turns to how we can generate long trajectories that become more critical as task complexity increases. Now that we have a mechanism to instill reasoning-then-act behavior into a model (through FAS), we need to figure out how to use that to create long-horizon data. The mechanism for this, Higher-order Action Synthesis or HAS, involves two steps: step-level scaling, and contrastive decision-action synthesis.
For step-level scaling, we consider \(N\) possible actions we could take at a given spot. The LLM then generates a reasoning-then-act sequence \((S_k, R_k)\) for each of the \(N\) candidate actions, where \(S_k\) is the initial plan and \(R_k\) is the subsequent tool call / action. All of the candidate plans are then concatenated together, \(\{S_1, S_2, \ldots, S_N\}\), in order to create a long plan with many possible candidate options.
For contrastive decision-action synthesis, this is then transformed into several possible examples where each candidate move is chosen, its corresponding \(R_k\) is added, and a final judgement of correctness is appended at the end. These are cycled through until one of the candidate options solves the task, creating a single long-horizon training example:
 
Results
A lot of the details describing how this model works are buried in the papers which are to follow: this paper describes the Agentic CPT and also reports their final results28. A lot more went into this model than just Agentic CPT so it feels premature to talk about results here:
 
but regardless: they report really good results on research-type tasks. It's really difficult to know what comparisons are fair or not, since we aren't sure how specialized many of the models on this list are for research tasks (or how big they are), but as far as open source goes it scores the highest we've seen so far.
 
They also make some claims about the amenability of their Agentic CPT to further agentic-focused post-training. Compared to directly post-training the base model, CPT seems to provide a noticeable "warm up" for future post-training, despite including only a modest 300B extra tokens.
 
Discussion
Moving agentic capability data into pretraining (where models usually "learn facts" rather than just "learn to follow instructions better") makes a lot of sense, and the AgentFounder work can be viewed as being similar in spirit to GLM-4.5's mid-training strategy29, but adapted for web research more than for code.
The really striking thing about this model isn't the performance so much as the size: 30B/3B active is insanely small, and it really recontextualizes for me how weak these general-purpose foundation models are at this sort of task. It's really hard to imagine that performance here saturates at such a small size. More likely is that the big foundation models have tons of ground to gain on this type of research task, and that scaling up this formula to e.g. 300B/30B active would show yet-more gains.
But a lot of detail here is punted to other papers. What else happened here?
Post-training: WebSailor-V2
The WebSailor-V2 paper explains the post-training pipeline necessary to equip a pretrained language model with web-agent capabilities. There's additional SFT+RL training which can improve performance at this part of the pipeline substantially.
The two main contributions in this work are SailorFog-QA-V2, a QA SFT dataset for research agents, and an RL environment based on an offline snapshot of Wikipedia. Together these form a complete post-training pipeline which can be applied atop pretrained models.
SailorFog-QA-V2
SailorFog-QA-V2 is similar in concept to the FAS phase of Agentic CPT, but with more focus on generating a dense knowledge graph. To construct the dataset, they begin with seed entities (things to learn about), and use web searches to associate a large quantity of information about each entity. This graph is gradually expanded over time by targeting leaf nodes in order to connect them to existing nodes in the knowledge graph. Importantly, this will create cycles, which is a major improvement over other approaches (e.g. the original SailorFog-QA) which just generate a very large tree structure.
With a dense knowledge graph, they extract subgraphs of this graph via random walk and then use this extracted subgraph to generate question-answer pairs. These subgraphs are modified by introducing uncertainty via obfuscation or rephrasing, in order to force the model to answer questions even when the information is imperfectly clear but provides sufficient context.
This data is used in the SFT Cold Start phase of WebSailor-V2's Agentic Post-training, built atop the Qwen3-30B-A3B-Thinking-2507 model30.
Agentic RL
To avoid the heavy costs of RL post-training a model with real web searches, real APIs, and real data, WebSailor-V2 instead opts to create a simulation environment using a small set of tools and an offline snapshot of Wikipedia. This allows test data with ground truth to be assembled, as well as training the model to search and navigate the kinds of web pages it will see in real scenarios. A tightly engineered toolset is also used in "real environment" scenarios in this phase, such that the model gets some exposure to non-Wikipedia sources as well.
Their RL pipeline uses GRPO slightly adapted for the setting (i.e. with the KL term removed). Likewise, negative samples are selectively excluded from the loss calculation (e.g. if they are too long) in order to avoid degraded performance31.
There is a full paragraph in this section which is bolded, stating that the RL algorithm isn't as important as the synthetic data mix for this part of the process. It also mentions that training on the test set for BrowseComp actually did worse than using the generated synthetic data32.
However, we consider that the algorithm is important but not the only decisive factor in the success of Agentic RL. We have experimented with many different algorithms and tricks, and find that data and stability of the training environment are likely the more critical components in determining whether the RL works. Interestingly, we have tested to train the model directly on the BrowseComp testing set, but the results are substantially poorer than when using our synthetic data. We hypothesize that this disparity arises because the synthetic data offers a more consistent distribution, which allows the model to be more effectively tailored. Conversely, the human-annotated data (such as BrowseComp) is inherently noisier. Given its limited scale, it is difficult to approximate a learnable underlying distribution, which consequently hinders the model to learn and generalize from it.
Discussion
Web search and agent-powered information retrieval are, in some sense, another mechanism for scaling test-time compute. As we've seen earlier, we can scale test-time compute by outputting more reasoning tokens, doing more operations over more turns, or getting better and better at looking up the answers to questions. From the text:
This result strongly validates our core hypothesis: equipping a model with exceptionally strong information retrieval and synthesis capabilities can profoundly enhance its logical reasoning abilities, allowing it to effectively "reason over" externally acquired knowledge and overcome the limitations of its intrinsic scale. We believe agentic paradigm is a good way to close the gap between strong and weak models.
Context Management: WebResearcher
WebResearcher is yet another component of the Tongyi DeepResearch training pipeline. This time, they're outlining context management, and data synthesis for web-based tool use.
The core problems being addressed in this paper are primarily engineering ones. Typical open-source approaches put a lot of documents into the LLM's context window and ask the model to draw conclusions based on that. As the number of retrieved documents grows, this floods the context with text that is mostly noise33. This introduces an undesirable trade-off: do we deliberately retrieve fewer documents, or do we risk adding irrelevant ones to the context?
IterResearch is the paradigm introduced by this paper, intended to formulate the deep research problem as a Markov Decision Process (MDP). To get around the mono-context problem, they instead include an intermediate step in which the model consolidates everything it has seen into a report, clears out its workspace, and repeatedly updates this report as it goes, rather than keeping everything stuck in the context. The claim here is that this formulation allows their model to pursue "arbitrarily complex investigations" by preventing additional documents from filling up the context34.
Likewise, this formulation enables parallelization of the research process. All you have to do is share the report between multiple agents, and have them explore separate sets of documents before updating. They call this approach the Research-Synthesis Framework and distinguish results leveraging this parallelization with the "heavy" label35:
 
IterResearch
The IterResearch paradigm is super simple to understand. Rather than linearly flooding the context with new documents, you just iterate over several rounds. There are three steps in the IterResearch workflow: think, report, action.
 
This is super easy to grasp and doesn't require much elaboration. This lets the deep research formulation scale to many loops of this paradigm with vastly diminished risk of flooding the context, and also enables parallelization as mentioned before.
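A minimal sketch of the loop, with the agent and environment calls as hypothetical stand-ins:

```python
def iter_research(question, agent, env, max_rounds=32):
    """Each round sees only the question, the evolving report, and the latest
    observation, never the full history."""
    report, observation = "", None
    for _ in range(max_rounds):
        workspace = {"question": question, "report": report, "observation": observation}
        thought, report, action = agent.step(workspace)   # think -> updated report -> action
        if action.is_final_answer:
            return action.answer, report
        observation = env.execute(action)                 # next round's context is rebuilt from scratch
    return None, report
```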
Multi-Agent Data Synthesis
Similar to previous Tongyi works, this one outlines how to construct difficult question-answer data for training agents, powered by dense knowledge graphs. The initial stage here is similar to the data synthesis outlined in WebSailor-V2. Where WebResearcher differs is in leveraging the iterative framing to gradually scale the difficulty of the data produced this way, producing harder and harder QA pairs which leverage more and more iterations.
 
Training
IterResearch uses an MDP framing: at every round, the model must produce a response (action) given the current state (report). This has a nice Markov property to it: since no history is retained in-context, each round is only dependent on the report which was generated last round.
This model is trained via Rejection Sampling Fine-Tuning, where trajectories are gathered using the IterResearch framing, and kept only if they arrive at the right answer eventually. This forms a dataset which allows the model to be fine-tuned to produce working trajectories more frequently.
Group Sequence Policy Optimization
The model is then further trained using reinforcement learning, leveraging Group Sequence Policy Optimization, or GSPO, with matching reference answers as the reward signal.
GSPO is very interesting: it's a variation of GRPO which the Qwen team claims makes training with it much more stable. Recall GRPO's formulation from DeepSeekMath:
 
The useful thing about GRPO is that it sidesteps the need to train an explicit value model, instead leaning on the relative advantage of the responses. This is really nice, since value models are difficult to train, so it's generally worth the tradeoff even if the stability is difficult to dial in properly. Since DeepSeekMath, GRPO has grown significantly in popularity, and you see it in all sorts of LLM work nowadays.
The central claim of the GSPO paper is that the importance sampling term used in GRPO is ill-posed. Since it uses the probability ratio of outputs at each token, it's drawn from a single sample rather than a distribution averaged over several samples, which introduces lots of noise to the training process since the single-token probability ratios can vary wildly token to token.
To address this, GSPO reformulates the objective to work at the sequence level instead. This is a nice, natural framing, since the reward is provided at the sequence level anyways. GSPO makes just one change: replacing the importance ratio with one based on sequence likelihood instead. Instead of using a per-token likelihood ratio, it uses the geometric mean of all the per-token likelihood ratios36.
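Written out, the sequence-level ratio is the length-normalized sequence likelihood ratio, which is exactly the geometric mean of the per-token ratios:

\[
s_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\theta_{\mathrm{old}}}(y_i \mid x)} \right)^{1/|y_i|}
= \exp\!\left( \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \log \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(y_{i,t} \mid x, y_{i,<t})} \right)
\]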
 
They show through experiment that this results in better and more stable training than vanilla GRPO37.
 
Discussion
The WebResearcher paper is one of the simpler ones in the Tongyi DeepResearch suite. Using a text-based hidden state, you enable both parallelization and implicit context management, which seems like a clear advantage over mono-context approaches.
 
The big question is whether or not this framework suffers from information loss as it tries to compress more and more information into the developing report it updates on every step. It's possible this is not much of a concern for now (i.e. the damage from flooding context with noise might simply be greater than the damage from trying to compress too much information into a report38), but as the task complexity grows it will likely become an increasingly central element to the problem.
Information Synthesis: WebWeaver
WebResearcher and WebSailor-V2 are not to be confused with WebWeaver, which extends the formulation presented in WebResearcher to something a bit more pointed at a presentable final product. WebWeaver is a pure framework paper - what they produce in this paper is directly compatible with other LLMs used for research.
WebWeaver
While WebResearcher will iteratively write a report as it accumulates more documents, WebWeaver further decomposes this into a two-agent framework using a planner and a writer. The Planner will iteratively acquire evidence, update a "research outline", and repeat. The Writer will iteratively take the outline and write a report for each section, complete with citations, in order to produce the final artifact.
 
The central claim of this paper is:
- An outline boosts the quality of the writing
- Drafting the outline before searching relies on what the LLM already knows, defeating the purpose
- Therefore, drafting the outline as you search and writing as you accumulate information should improve the output
Putting this another way, this workflow is intended to make the model "write a report" closer to how a person would write a report. For example, the way I am writing this post: An initial lit review, an outline, reading papers, writing the sections as I read the papers, etc39.
 
A major positive component of this is that the specific citations are now constrained to key sections in the outline. Because the planner is responsible for the outline (and the outline's citations), the writer can write a section of the final report while much more easily retrieving the right data for whatever it happens to be writing. This is referred to as Hierarchical Retrieval, and it is a much more natural formulation compared to long-context or embedding-based retrieval-augmented generation40.
Results
Because this framework can be used with any model trained to call research tools, their main results are evaluated by using this framework with other foundation models, and comparing to existing deep research platforms:
 
 
You can see here that WebWeaver with Sonnet 4 does solidly better than Claude's own research tool, showing off the strength of this approach. They show that more refinement loops yield better results41, cleaner context windows improve performance, and the section-based writing strategy outperforms other approaches.
Critical for understanding how this translates to their Tongyi DeepResearch result: they create a 3k-example SFT dataset consisting of a strong teacher model using the WebWeaver framework42. With these 3k curated multi-turn trajectory examples, they fine-tune a 30B model and see hugely improved results on things like citation accuracy43 and DeepResearchGym44.
 
Takeaways
We've talked a lot up to now about how to make LLMs able to take actions over multiple turns, how to proactively go get documents to read, and then process them somehow before responding. But these have so far been mostly about simple question / answer problems. If the model doesn't know something, it can look up the answer, and then reply like it would normally.
The WebWeaver paper is our glimpse into how tool-equipped LLMs can actually produce something that looks like the output of a deep research tool, rather than just an extremely informed LLM response. The platonic ideal of a deep research tool is something like an essay, maybe a paper or a blog post, complete with links to other interesting and useful things to read45.
Context Compacting: ReSum
Even with so many clever tricks to keep the context manageable, it remains sort of inevitable that eventually context windows will just be filled anyways. ReSum is Tongyi's solution to the context-compacting problem: when you fill up your context, how do you compress what you already have so that you can keep working, but without losing too much context from what has already been seen?46
Aside: ReAct
We've glossed over the multiple mentions of the ReAct framework because of this paper, but we do have to touch upon it here. ReAct was a 2023 paper which was one of the earliest influential papers exploring reasoning and acting in large language models (hence: ReAct).
 
Back in 2023, we were still getting huge gains from prompting models with stuff like "Let's think step by step." This was a way to make language models output a bunch of tokens walking through the answer before responding, which, as we all know, makes the model more likely to arrive at a correct answer.
This paper was super influential, in that it was one of the earliest papers that interleaved thinking steps and acting steps, equipping some extremely simple tools to the model. The idea behind this framework is super simple: first you ask the model to think, then you let the model pick from a number of actions. Once it takes the action, the environment will produce some sort of observation, and the loop repeats until it submits a final answer.
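A bare-bones version of the loop looks something like this; the prompt format, the step-parsing helper, and the tool set here are placeholders, not the paper's exact setup.

```python
def react(question, llm_next_step, tools, max_steps=10):
    """Alternate thought, action, and observation until a final answer is emitted."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought, action, argument = llm_next_step(transcript)   # e.g. ("...", "search", "France tourism 2025")
        transcript += f"Thought: {thought}\nAction: {action}[{argument}]\n"
        if action == "finish":
            return argument                                      # final answer
        observation = tools[action](argument)                    # e.g. a Wikipedia lookup
        transcript += f"Observation: {observation}\n"
    return None
```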
One of the primary results from this paper was that finetuning models with a small amount of successful ReAct trajectories improved performance a lot compared to similar approaches with other methods. This is a pretty relevant finding for our overall agentic LLM overview – we've known that models can improve at these multi-turn agent settings via rejection sampling for many years now.
 
ReAct is one of the very earliest examples of an agent framework, and it sees continued use to the present day, since it's pretty difficult to design anything simpler than this to build an agent47.
ReSum
Tongyi has been leveraging ReAct for almost all previous works up to now48. However, for a truly multi-turn focused model, it seems likely that the model could just take so many turns thinking that the context is completely filled with now-irrelevant reasoning steps for prior sub-tasks. This is obviously not optimal: once I'm on turn 11, it's no longer important to me why I wanted to use the search tool on turn 2, only that I used it and what I found.
ReSum just introduces a summary tool to the ReAct framework. All this does is summarize everything that the model did before filling the context, and then kick off a new chat with the original query plus the summary so far.
 
This paper has three primary contributions:
- Introducing the summary tool to the ReAct framework.
- Fine-tuning a small MoE model (30B total / 3B active) for better ReSum-friendly summarization, using summaries generated by off-the-shelf big foundation models (Qwen3-235B, DeepSeek-R1, gpt-oss-120b).
- Formulating ReSum-GRPO, RL post-training to make models better at using ReSum through experience.
The tool is really simple to understand: it triggers whenever the agent decides to call the summary tool, or automatically if the number of tokens passes some pre-defined threshold. This works fine off the shelf for really big models, which are already good at summarizing long texts. The fine-tuned model is really basic also: they collect a number of (Conversation, Summary) pairs generated with big models and fine-tune a small MoE model on them, which makes it better at the task.
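Roughly, the control flow looks something like the sketch below. `agent_step`, `summarize`, and `count_tokens` are hypothetical helpers of my own naming, and the token budget is an arbitrary placeholder.

```python
# Sketch of ReSum-style context compaction: when the conversation grows past
# a token budget (or the agent calls the summary tool), compress everything
# seen so far and restart the chat from (original query + summary).

TOKEN_BUDGET = 28_000  # arbitrary threshold for illustration

def resum_rollout(agent_step, summarize, count_tokens, query, max_turns=128):
    context = [{"role": "user", "content": query}]
    for _ in range(max_turns):
        msg = agent_step(context)          # think / tool call / final answer
        context.append(msg)
        if msg.get("final_answer"):
            return msg["final_answer"]

        over_budget = count_tokens(context) > TOKEN_BUDGET
        asked_for_summary = msg.get("tool") == "summary"
        if over_budget or asked_for_summary:
            summary = summarize(context)   # compress everything seen so far
            # restart: original query plus the running summary
            context = [{
                "role": "user",
                "content": f"{query}\n\nSummary of progress so far:\n{summary}",
            }]
    return None
```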
ReSum-GRPO
ReSum-GRPO is a bit more involved. Since ReSum creates a "new type" of query (i.e. a query with previous context concatenated to it), the authors argue RL will bring this type of query more in-distribution.
The core modification to vanilla GRPO is to split a single end-to-end trajectory into \(K+1\) segments, where \(K\) is the number of summarization events. The full trajectory gets a single reward signal based on correctness (here determined via LLM-as-judge), which is then broadcast to all of its segments.
 
Basically: ReSum-GRPO is the same as regular GRPO for short trajectories, but for long trajectories where at least one summary event occurs, the reward for the full trajectory gets broadcast to every intermediate segment (from the start of the rollout to the first summary, between consecutive summaries, and from the last summary to the end). This improves performance by 4-5%.
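A sketch of that reward plumbing, under my own naming; the paper's actual batching and loss code will differ:

```python
# Sketch of ReSum-GRPO reward broadcasting: one correctness reward per full
# trajectory, group-normalized GRPO-style, then copied to every segment that
# the summarization events created.
import statistics

def segment_advantages(trajectories, judge):
    """trajectories: list of trajectories, each a list of K+1 segments."""
    rewards = [judge(traj) for traj in trajectories]     # 0/1 per trajectory

    # Group normalization over trajectory-level rewards (as in vanilla GRPO).
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0
    advantages = [(r - mean) / std for r in rewards]

    # Broadcast each trajectory's advantage to all of its segments.
    per_segment = []
    for traj, adv in zip(trajectories, advantages):
        per_segment.append([(segment, adv) for segment in traj])
    return per_segment
```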
 
Takeaways
ReSum is perhaps the most naive possible solution to this sort of problem. It is likely possible to do some sort of more complicated latent compression prompt tuning or something, if you had a larger training budget and believed this to be a more effective compacting strategy.
But the text-to-text solution here seems to work more or less fine. You do obviously lose resolution here compared to not needing to auto-summarize your text, so it's probably not appropriate to call this something silly like "infinite context length". But overall this is a naive approach that represents a solid, easy-to-implement initial try at the problem at hand.
Tools: Towards General Agentic Intelligence via Environment Scaling
Rounding out the Qwen paper drop is Towards General Agentic Intelligence via Environment Scaling. So far we've mostly been dealing with very simple web search / code execution / calculator type tools from the original ReAct formulation. How do we go from here to a model which can use all sorts of APIs, even new ones it has never seen before? This paper gives some insight on how they created a large volume of synthetic data surrounding API calls and scaled them up, making the model specifically strong at using tools in general.
Task Construction
Much of the data produced in the other mentioned works has taken a reverse framing, where a bunch of tools are called, and then trajectories are synthesized based on questions that could have been solved by calling that tool. This works okay for short trajectories, but doesn't scale well to longer trajectories unless you sacrifice realism (e.g. making up some really strange query which would justify a strange tool call).
The flipside to this approach is simulating agent-human interactions. This creates more reasonable trajectories, but is hard to scale without progressively more challenging environments for the agent to operate within. This is the primary contribution of this paper: a way to automatically build simulated environments that gradually expand the range of tasks over time.
 
The way this pipeline works is by:
- Scraping a huge 30k+ database of APIs from the internet / other repos
- Constructing a tool graph
  - Nodes are linked if their vector similarity is above some predefined threshold
  - Clusters are identified using Louvain community detection
- Each community gets a generated domain-specific database structure, which gets formalized in Python
Once this is all assembled, synthetic agentic tasks are generated by initializing some environment state, performing a random walk over connected nodes in the tool graph, and then performing the tool calls in sequence to observe what happens.
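Here is a rough sketch of the graph-building and random-walk steps, assuming a generic embedding function and networkx's Louvain implementation; the similarity threshold and walk length are placeholders, and the real pipeline is surely more involved.

```python
# Sketch of the tool-graph construction and random-walk task seeding.
# `embed` is a hypothetical embedding function; requires networkx >= 3.0
# for nx.community.louvain_communities.
import random
import networkx as nx
import numpy as np

def build_tool_graph(tools, embed, threshold=0.8):
    vecs = {t["name"]: np.asarray(embed(t["description"])) for t in tools}
    graph = nx.Graph()
    graph.add_nodes_from(vecs)
    names = list(vecs)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = vecs[a] @ vecs[b] / (np.linalg.norm(vecs[a]) * np.linalg.norm(vecs[b]))
            if sim > threshold:            # link tools that look related
                graph.add_edge(a, b, weight=sim)
    return graph

def sample_tool_chain(graph, length=5):
    """Random walk over connected tools to seed one synthetic task."""
    node = random.choice(list(graph.nodes))
    chain = [node]
    for _ in range(length - 1):
        neighbors = list(graph.neighbors(node))
        if not neighbors:
            break
        node = random.choice(neighbors)
        chain.append(node)
    return chain

# Each Louvain community becomes one simulated domain:
# communities = nx.community.louvain_communities(graph)
```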
Agent Experience Learning
With these chained tool calls, an LLM generates simulated user queries and agent responses that can be inserted such that the tool calls solve some sort of stated problem brought forward by the user. This forms a full end-to-end simulated user-agent interaction, which can then be further filtered to remove invalid trajectories, corrupted final states, and malformed function calls.
 
These sequences are then fine-tuned upon (with the user messages and the tool responses excluded) such that the model trains upon lots of trajectories of a large variety of different types of tools. This fine-tuning is done in two stages, the first with a lot of available APIs for very broad, versatile problems, and the second with more domain-constrained specialized settings.
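A minimal sketch of what that loss masking might look like, assuming a generic tokenizer and the usual -100 ignore index; this is my illustration rather than the paper's code.

```python
# Sketch of loss masking for agent-trajectory SFT: only the assistant's tokens
# contribute to the loss; user messages and tool outputs are context only.
IGNORE_INDEX = -100  # conventional ignore index for cross-entropy losses

def build_training_example(messages, tokenize):
    input_ids, labels = [], []
    for msg in messages:
        ids = tokenize(msg["content"])
        input_ids.extend(ids)
        if msg["role"] == "assistant":
            labels.extend(ids)                        # learn these tokens
        else:  # "user" or "tool"
            labels.extend([IGNORE_INDEX] * len(ids))  # excluded from the loss
    return {"input_ids": input_ids, "labels": labels}
```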
Results / Discussion
This model does fairly well at using all sorts of different types of tools, and usually improves the base models it is applied to on general tool-calling benchmarks by a handful of percentage points. Nothing outrageous; it's a modest improvement that generalizes solidly to e.g. ACEBench-zh, presumably due to the sheer scale of the synthetic data.
 
But while this finetuning approach helps a lot with formulating tool calls, it weakens substantially when many calls are made over more turns. Operating well with long context and strong multi-turn ability are definitely still separate problems altogether from function calling, even if all of them are necessary for a well constructed agent.
 
Where Do We Go From Here
There are lots of viable directions that agentic models can go from here. Many big labs are focusing on this problem, so it's likely that the field will move fast.
There have been a lot of architectural developments in recent months that seem explicitly exciting for the agentic framing, whether that's for research or code agents. Qwen3-Next introduces a lot of interesting developments on this front: an ultra-sparse model with most layers using Gated DeltaNet, DeepSeek-V3-style MTP for speculative decoding, etc. DeepSeek-V3.2-Exp recently introduced DeepSeek Sparse Attention, an MLA-compatible sparse attention mechanism that vastly reduces long-context cost with seemingly minimal performance hit. Since I started writing this post, Zhipu released GLM-4.6, which extends the context window and improves agent performance.
Agent work is still largely nascent, as far as LLM work goes. Many of the techniques outlined above feel like level 1 methods for achieving the desired outcome (use LLM to summarize! Use LLM to verify! Use LLM to make agentic data!) and I could easily see lots of them being surpassed by more powerful variants in the near future.
When GPT-5 was released, METR put out an interesting report showing that the length of task a model can complete has been steadily growing: roughly every 213 days, the length of task completable by a frontier model doubles.
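Taking the 213-day figure at face value (the arithmetic here is mine, not METR's), the implied compounding looks like:

\[H(t) = H_0 \cdot 2^{\,t/213} \quad\Rightarrow\quad H(365\ \text{days}) \approx 2^{365/213}\,H_0 \approx 3.3\,H_0\]

where \(H_0\) is today's task horizon, i.e. a bit more than a tripling per year if the trend holds.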
 
Agentic models seem focused on this sort of time horizon problem. It may be that soon models will be capable of working autonomously for very long periods of time, perhaps days or weeks, without exploding due to saturated context windows. It's unclear what sorts of advances are occurring inside frontier labs, but I'd expect developments pointed at pushing this number up will begin to take a first-class-citizen-type role moving forwards, especially as existing benchmarks continue to saturate49.
Time will tell.
Appendix A: Agent Training - Kimi-Dev
I think this paper probably doesn't justify inclusion in the main body of the text, since it's a little too light to be as useful as the other papers. There is some interesting stuff in it, though, so I wanted to write briefly on it.
How might you turn one of these agentic LLMs into something which natively plays extra nice in the SWE-agent-type loops you see in popular CLI tools? A very recent paper sheds some initial light on this: Kimi-Dev from the Moonshot team. In this paper they train up a 72B parameter model whose skill priors for this setting are induced through agentless training. This is an attempt to bridge the single-turn workflow / multi-turn agent distinction for SWE settings, and they introduce some interesting tricks to get this model performant.
Cold Start / Mid-Training
The priors they are attempting to instill into the model are:
- Its ability to fix bugs which have been identified
- Its ability to write tests which reproduce the bug and pass when the bug is fixed
 
The first step they take on this front is a cold start from a pretrained base model (here Qwen 2.5-72B-Base) using ~90B tokens consisting of collected real-world resolved GitHub issues, curated PR commit packs, and synthetic reasoning data generated by DeepSeek R1. SFT on this dataset yields a base model which is extra-performant in these sorts of software-focused scenarios, and which is used as the starting point for future steps.
Reinforcement Learning
Like Kimi K2, Kimi-Dev is based on the online policy mirror descent outlined in the Kimi k1.5 paper, which is a GRPO-like method with a regression framing instead of a PPO-like framing50. They use 0/1 verification as the reward, discard anything with pass@16=0, and regularly feed it back positive examples it got right already, in order to reinforce successful patterns51.
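A sketch of what that rollout filtering and positive-replay step might look like; the names and structure here are mine, not Moonshot's.

```python
# Sketch of the RL data curation described above: drop prompts the model never
# solves (pass@16 = 0, so no learning signal) and keep a pool of verified
# successes to mix back into later batches, reinforcing successful patterns.
import random

def filter_and_replay(prompts, rollout, verify, positive_pool,
                      n_samples=16, replay_frac=0.1):
    batch = []
    for prompt in prompts:
        samples = [rollout(prompt) for _ in range(n_samples)]
        rewards = [1 if verify(prompt, s) else 0 for s in samples]
        if sum(rewards) == 0:
            continue                       # discard: pass@16 == 0
        batch.extend(zip([prompt] * n_samples, samples, rewards))
        positive_pool.extend(
            (prompt, s, 1) for s, r in zip(samples, rewards) if r == 1
        )
    # mix some previously-solved examples back in
    k = int(replay_frac * len(batch))
    batch.extend(random.sample(positive_pool, min(k, len(positive_pool))))
    return batch
```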
Test-Time Self-Play
Another interesting idea presented in this paper is Test-Time Self-Play. After RL the model can reliably perform both roles: fixing bugs, and writing tests for the patches it generates. They can boost performance at test-time by writing lots of candidate solutions at runtime (here 40 examples52), and selecting the best one.
How to select the best one is pretty interesting. For each candidate patch, 40 tests are generated, yielding 1600 tests in total. Then for each patch, they run each test on the code before and after applying it. A good patch will make a lot of tests that used to fail now pass (\(FP\), fail-to-pass) and will ideally not break any tests which already passed before the patch was applied (\(PP\), pass-to-pass). This is quantified by a simple formula:
\[S_i = \frac{\sum_j FP(i,j)}{\sum_j F(j)} + \frac{\sum_j PP(i,j)}{\sum_j P(j)}\]
and then the patch with the highest score is selected as the final answer53.
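In code, the selection rule is roughly the following; `run_test` and `apply_patch` are hypothetical harness functions standing in for whatever the actual evaluation infrastructure does.

```python
# Sketch of the test-time self-play selection rule: score each candidate patch
# by how many generated tests it flips from fail to pass (FP) and how many
# already-passing tests it keeps passing (PP), then take the argmax.

def select_patch(original_code, patches, tests, apply_patch, run_test):
    before = [run_test(original_code, t) for t in tests]    # P(j) / F(j)
    n_fail = sum(1 for b in before if not b) or 1
    n_pass = sum(1 for b in before if b) or 1

    def score(patch):
        patched = apply_patch(original_code, patch)
        after = [run_test(patched, t) for t in tests]
        fp = sum(1 for b, a in zip(before, after) if not b and a)  # fail -> pass
        pp = sum(1 for b, a in zip(before, after) if b and a)      # pass -> pass
        return fp / n_fail + pp / n_pass

    return max(patches, key=score)
```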
Results and Discussion
With these tools in play they show some simple results indicating that this improves performance on SWE-bench (generally considered a more agentic benchmark) despite the training process being entirely agentless54. But perhaps more interesting is the experiment on how these phases impact multi-turn performance. A core thesis of the paper is that this sort of agentless prior can help transfer to multi-turn settings, and that the hybrid-for-multi-turn / thinking-for-single-turn split may be somewhat of a temporary phenomenon.
In their experiment they show how well the model performs in multi-turn settings at each of these training phases. They show a thought-provoking result: models trained with long CoT on these priors do better in long multi-turn settings than the ones without that ability. This test is admittedly conflated with the model's growing ability to write its own tests, but it's interesting regardless.
 
It is possible that these workstreams will re-converge at some point, and that these hybrid models will just be what the thinking models turn into: models equipped with long CoT which are simply more enabled in multi-turn settings. I'd have to see more solid experimental design before I arrive at that conclusion myself, but it's certainly something worth thinking about.
Footnotes:
I'm not big on the term "vibe coding" but you could argue that making the model do most of the work is a goal of a lot of these labs. I prefer to call these tools "solo pair programming tools" but I think that might only be me using them that way.
To be more precise, Qwen uses the output logits from teacher models rather than just fine-tuning on samples. I think calling this "distillation" is much more appropriate to the historical usage of the term, but recently fine-tuning on samples has been called "distillation" a lot more often also.
They claim as much in the Long-CoT Cold Start section, but it's nebulous where exactly this data comes from specifically.
It's simple to the point of feeling kinda silly. Just SFT the behavior in, bro, what could go wrong.
They use the in-training model here, since they worried that if they just used QwQ outputs that the SFT phase would harm the reasoning component. It's essentially just performing SFT on its own outputs, but with the chat template added.
I don't find this too compelling. I don't think this claim really feels any different from just saying "cutting the thinking degrades performance, forcing the model to spin longer before answering improves performances" and they don't have any ablations where they show that the hybrid model is more robust under this regime compared to a regular reasoning model with the same restrictions.
 
It's important to keep in mind that Qwen3 was trained on more than twice as many tokens as DeepSeek-V3, with a training budget slightly larger than even Llama 4's 30T. The non-thinking result is definitively "just okay", the interesting thing is that it's a bolted-on mode of a model capable of that reasoning mode.
Or, relatedly, that they need to seek some sort of professional help.
This is one of those insights I'm pretty confident lots of other labs must have independently arrived at. A lot of these LLM papers are pretty light on details about synthetic data generation despite lots of allusions to it, so I was excited to see some actual experiments outlined here.
I have no idea what this could be, details provided are zero.
Kimi 1.5 is a 20B parameter thinking model, unlike K2 which is much larger and explicitly not a thinking model. This paper is pretty interesting too, there's some interesting reward modeling stuff in it, but outside the scope of this writeup.
This is the "mirror" part.
This was probably still set to be quite high. This penalty was specifically intended to prevent the emergent reasoning behavior that RLVR models will sometimes learn, but it seems like given the high output token count of this model that the learned behavior was to come up close against that threshold without incurring the penalty.
I don't like this term. This is pretty much still just pretraining even if it's medium-size with instruction data. Instruction tuning data in pretraining has been around forever, I don't think the term is that distinguishing from other works.
I think functionally this is the same as cold-start long CoT data
See, this is more different from phase 2 of post-training than the supposed "mid-training" is from pretraining. I think the only reason this step isn't called "mid-training" is because it contains SFT (typically considered post-training), and also "mid-post-training" and "late-post-training" would sound too ridiculous.
Compare to DeepSeek making allusions about curriculum learning in their GRPO works, but not providing much detail.
This is confusingly labeled as "single stage" right after the diagram saying they use two stages for curriculum learning, but it's single stage for context length and two-stage for difficulty.
I'm a little surprised to have not seen so much reward-based exploration scheduling like this up until now. I know there are a bunch of issues with this in standard RL e.g. exploration bonus not being stationary enough for this to always work, but I have seen some stuff about this sort of thing being good for environments with sparse rewards.
This part seems mostly done by hand.
This is why they can afford to get away with removing the KL loss term from GRPO. There's not much risk of the model deviating too far from the original SFT policy, which is going to get discarded anyways once they cook up a new one.
The paper also goes into some detail about the infra, which again I will be omitting for brevity / recognizing that I'm out of my depth in understanding it well enough to explain it.
Qwen models always seem to benchmark a little higher than their practical utility, but the point here is that it's very small and works pretty well.
Frankly I never liked this benchmark being used for deep research benchmarking. I always thought it was supposed to be more of a native capability dataset, but it quickly became a browsing + tool use benchmark once OpenAI Deep Research used it to illustrate how looking up relevant information could greatly improve performance. Everybody followed suit I guess, to level the playing field back out.
These are further rejection sampled with LLM-as-Judge to get rid of examples that seem obviously unlikely to result in success.
Also for aesthetic reasons: nobody wants this model to dig around for 20 minutes searching documents and then come back with a one word response that might be wrong.
Again further rejection sampled via LLM-as-Judge.
I think it would have been more sensible to release one very large paper for this single project, but Tongyi in general seems to have engagement-based or volume-based KPIs for arxiv papers / model downloads / releases / etc. Splitting this up in this fashion definitely makes the project harder to follow and understand what order everything happens in, but I suppose it's still much better than a lab just not publishing anything at all.
You could probably call Agentic CPT "Agentic CMT" given that it's performed on top of a pretrained base model. But it's fine.
This entire section is frustratingly vague.
Again, very light on detail here, but it could be the case that heavily penalizing length limit exceeded will provide signal that the model should use fewer turns whenever possible, which is undesirable.
Which is, uh, an unusual experiment to be performing.
A funny framing here is that this approach is an extremely difficult needle in a haystack test.
Maybe a pedantic point, but I don't think it's true that this scales "arbitrarily": if you fix the size of the report there's an upper bound to information density, and if you don't fix the size of the report it obviously can also grow beyond the context size.
I believe xAI does this also, with Grok Heavy.
The geometric mean is also the exponential of the arithmetic mean of the logs.
They also provide a token-level objective variant. Some of the advantage of GRPO in the original DeepSeekMath paper is in that it was token-level rather than only being possible to formulate at the end of the sequence. For these settings (and, notably, pointed at multi-turn RL scenarios), they have a version which cuts the gradient at each token, making them all different.
 
Indeed, they show more or less as much in ablations.
 
*sweats nervously* The robots could never take my job, right?
I've never been a huge RAG guy, especially as long context grows cheaper over time. The DeepSeek-V3.2-Exp technical report is one such example of what I mean: a sparse attention mechanism that reaches performance parity with their regular latent attention but much cheaper, leading to API costs slashed by 50%. Very long context windows, sparse attention making that context cheaper to use, and more natural formulations like hierarchical retrieval instead of embeddings seem much more likely to be the future of managing external sources of knowledge.
Another instance of turn count as scaling test-time compute.
Specific model kept suspiciously unmentioned.
This is maybe something similar in spirit to DeepSeek's reasoner distillation result, but for agentic / research setting: even with just 3k examples they can teach a much weaker model to emulate this behavior much easier than it would have been to train the model from scratch to do just as well.
90.89 is weaker than using a very large model but not by much: using qwen3 235b gets 93, openai deepresearch gets 91.
In practice for coding models I think it's usually better to kick it off from scratch before this happens, but for research models (i.e. models which are multi-turn from the model's perspective but mostly more or less single-turn from the user's perspective) it's not so doable.
Many takes during this time period were: "agents are just an llm wrapped inside a while loop" which was certainly more literally true at the time compared to now.
More complicated frameworks are generally mostly bloat in my opinion.
or get trained on directly, for that matter.
I'm a bit out of my depth on RL stuff, but they mention in appendix C1 that this is "a simpler policy gradient approach", which I think is wrong? Rather than being a policy gradient method at all, it's a regression objective. I guess it does still optimize a policy.
This is funny. Like me watching back a combo video of my own gameplay in SSBM to tell myself that I'm so good and I should do these patterns more often.
Further detail: one test is temperature 0 (greedy decoding), the rest are temperature 1 (high variance)
This is a big fancy name for a technique but it's really just best-of-N with code execution tool calls. They compare it later to majority voting and also the oracle which selects the ground-truth test patch.
 
I have a suspicion that the Qwen-72B base model has probably seen the ground truth for most of SWE-bench anyways, since most of it is just the django repository. A cynical take here is that it's just pulling this performance out of the base model better, but that's a bit outside the scope here.