

Junyang Lin's first long post since leaving: from "thinking longer" to "thinking in order to act"


NOTE

After posting "me stepping down. bye my beloved qwen" in the early hours of March 4, Junyang Lin went silent on social media for three weeks.

Today he published his first long-form post since his departure, on X (Twitter).


https://x.com/JustinLin610/status/2037116325210829168

In the post he does not discuss why he left, and he does not respond to rumors about where he is headed. The whole piece does one thing: it writes down his judgment on the next phase of AI.

From "making models think longer" to "making models think while they act".

The full text of the post follows.

Opening

The last two years reshaped how we evaluate models and what we expect from them. OpenAI's o1 showed that "thinking" could be a first-class capability, something you train for and expose to users. DeepSeek-R1 proved that reasoning-style post-training could be reproduced and scaled outside the original labs. OpenAI described o1 as a model trained with reinforcement learning to "think before it answers." DeepSeek positioned R1 as an open reasoning model competitive with o1.


That phase mattered. But the first half of 2025 was mostly about reasoning thinking: how to make models spend more inference-time compute, how to train them with stronger rewards, how to expose or control that extra reasoning effort. The question now is what comes next. I believe the answer is agentic thinking: thinking in order to act, while interacting with an environment, and continuously updating plans based on feedback from the world.


1. What o1 and R1 actually taught us

The first wave of reasoning models taught us that if we want to scale reinforcement learning in language models, we need feedback signals that are deterministic, stable, and scalable. Math, code, logic, and other verifiable domains became central because rewards in these settings are much stronger than generic preference supervision. They let RL optimize for correctness rather than plausibility. Infrastructure became critical.
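The contrast between verifiable rewards and preference supervision can be made concrete. Below is a minimal sketch assuming a `\boxed{}` answer convention; the extractor and exact-match rule are illustrative placeholders, not any lab's actual grader:

```python
# Minimal sketch: a verifiable reward compares the model's committed final
# answer against a deterministic checker, instead of asking a preference
# model which of two responses "looks better".
import re

def extract_answer(completion: str):
    """Pull the content of the last \\boxed{...} span, a common convention
    for math completions. Returns None if the model never committed."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, gold: str) -> float:
    """1.0 for an exactly matching final answer, 0.0 otherwise.
    Deterministic, stable, and cheap enough for millions of rollouts."""
    answer = extract_answer(completion)
    return 1.0 if answer is not None and answer == gold.strip() else 0.0

rewards = [
    verifiable_reward("... so the result is \\boxed{42}", "42"),  # correct
    verifiable_reward("... therefore \\boxed{41}", "42"),         # wrong
    verifiable_reward("I think it's probably 42", "42"),          # no commitment
]
```

The point of the sketch is that the signal optimizes for correctness rather than plausibility: a fluent completion that never commits to an answer earns nothing.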


Once a model is trained to reason through longer trajectories, RL stops being a lightweight add-on to supervised fine-tuning. It becomes a systems problem. You need rollouts at scale, high-throughput verification, stable policy updates, efficient sampling. The emergence of reasoning models was as much an infra story as a modeling story. OpenAI described o1 as a reasoning line trained with RL, and DeepSeek R1 later reinforced that direction by showing how much dedicated algorithmic and infrastructure work reasoning-based RL demands. The first big transition: from scaling pretraining to scaling post-training for reasoning.


2. The real difficulty was never "merging thinking and instruct" itself

At the beginning of 2025, many of us in Qwen team had an ambitious picture in mind. The ideal system would unify thinking and instruct modes. It would support adjustable reasoning effort, similar in spirit to low / medium / high reasoning settings. Better still, it would automatically infer the appropriate amount of reasoning from the prompt and context, so the model could decide when to answer immediately, when to think longer, and when to spend much more computation on a truly difficult problem.


Conceptually, this was the right direction. Qwen3 was one of the clearest public attempts. It introduced "hybrid thinking modes," supported both thinking and non-thinking behavior in one family, emphasized controllable thinking budgets, and described a four-stage post-training pipeline that explicitly included "thinking mode fusion" after long-CoT cold start and reasoning RL.
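A controllable thinking budget can be sketched at the serving layer. The `<think>` delimiters and word-level truncation below are illustrative assumptions, not Qwen3's real chat template or tokenizer:

```python
# Illustrative sketch of a serving-side "thinking budget": cap the tokens a
# model may spend inside its thinking span, then close the span so decoding
# proceeds to the final answer. Delimiters and the whitespace "tokenizer"
# are stand-ins for a real template.

THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def enforce_budget(thinking: str, budget_tokens: int) -> str:
    """Truncate a thinking trace to `budget_tokens` (whitespace tokens here)
    and close the span, forcing the model to start answering."""
    kept = thinking.split()[:budget_tokens]
    return f"{THINK_OPEN} {' '.join(kept)} {THINK_CLOSE}"

trace = "first try substitution then check the boundary cases carefully"
low = enforce_budget(trace, 3)      # a low setting clips deliberation early
high = enforce_budget(trace, 100)   # a high setting keeps the full trace
```

A low/medium/high control is then just a choice of `budget_tokens`; the harder, unsolved part is having the model pick the budget itself from the prompt.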


But merging is much easier to describe than to execute well. The hard part is data. When people talk about merging thinking and instruct, they often think first about model-side compatibility: can one checkpoint support both modes, can one chat template switch between them, can one serving stack expose the right toggles. The deeper issue is that the data distributions and behavioral objectives of the two modes are substantially different.


We did not get everything right when trying to balance model merging with improving the quality and diversity of post-training data. During that revision process, we also paid close attention to how users were actually engaging with thinking and instruct modes. A strong instruct model is typically rewarded for directness, brevity, formatting compliance, low latency on repetitive, high-volume enterprise tasks such as rewriting, labeling, templated support, structured extraction, and operational QA. A strong thinking model is rewarded for spending more tokens on difficult problems, maintaining coherent intermediate structure, exploring alternative paths, and preserving enough internal computation to meaningfully improve final correctness.


These two behavior profiles pull against each other. If the merged data is not carefully curated, the result is usually mediocre in both directions: the "thinking" behavior becomes noisy, bloated, or insufficiently decisive, while the "instruct" behavior becomes less crisp, less reliable, and more expensive than what commercial users actually want.


Separation remained attractive in practice. Later in 2025, after the initial hybrid framing of Qwen3, the 2507 line shipped distinct Instruct and Thinking updates, including separate 30B and 235B variants. In commercial deployment, a large number of customers still wanted high-throughput, low-cost, highly steerable instruct behavior for batch operations. For those scenarios, merging wasn't obviously a benefit. Separating the lines allowed teams to focus on solving the data and training problems of each mode more cleanly.


Other labs chose the opposite route. Anthropic publicly argued for an integrated model philosophy: Claude 3.7 Sonnet was introduced as a hybrid reasoning model where users could choose ordinary responses or extended thinking, and API users could set a thinking budget. Anthropic explicitly said they believed reasoning should be an integrated capability rather than a separate model. GLM-4.5 also publicly positioned itself as a hybrid reasoning model with both thinking and non-thinking modes, unifying reasoning, coding, and agent capabilities; DeepSeek later moved in a similar direction with V3.1's "Think & Non-Think" hybrid inference.


The key question is whether the merge is organic. If thinking and instruct are merely co-located inside one checkpoint but still behave like two awkwardly stitched personalities, the product experience remains unnatural. A truly successful merge requires a smooth spectrum of reasoning effort. The model should be able to express multiple levels of effort, and ideally choose among them adaptively. GPT-style effort control points toward this: a policy over compute, rather than a binary switch.
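One way to picture "a policy over compute" is a router that maps estimated difficulty onto a spectrum of effort levels rather than a binary switch. The features and thresholds below are invented stand-ins for what would really be a learned predictor:

```python
# Sketch of a policy over compute: score each prompt and map the score onto
# effort levels the serving stack can act on. Heuristics are placeholders.

EFFORT_LEVELS = ("none", "low", "medium", "high")

def estimate_difficulty(prompt: str) -> float:
    """Crude stand-in for a learned difficulty model."""
    score = 0.0
    if any(k in prompt.lower() for k in ("prove", "optimize", "debug")):
        score += 0.5                       # task verbs that usually need search
    score += min(len(prompt) / 400, 0.5)   # longer prompts tend to be harder
    return score

def choose_effort(prompt: str) -> str:
    d = estimate_difficulty(prompt)
    if d < 0.15:
        return "none"      # answer immediately
    if d < 0.40:
        return "low"
    if d < 0.70:
        return "medium"
    return "high"          # spend much more inference-time compute

easy = choose_effort("hi")
mid = choose_effort("prove that the sum of two even numbers is even")
hard = choose_effort("prove " + "x" * 300)
```

An organic merge amounts to the model internalizing something like this router, adaptively, instead of exposing it as a user toggle.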


3. Anthropic's direction was a useful correction

Anthropic's public framing around Claude 3.7 and Claude 4 was restrained. They emphasized integrated reasoning, user-controlled thinking budgets, real-world tasks, coding quality, and later the ability to use tools during extended thinking. Claude 3.7 was presented as a hybrid reasoning model with controllable budgets; Claude 4 extended that by allowing reasoning to interleave with tool use, while Anthropic simultaneously emphasized coding, long-running tasks, and agent workflows as primary goals.


Producing a longer reasoning trace doesn't automatically make a model more intelligent. In many cases, excessive visible reasoning signals weak allocation. If the model is trying to reason about everything in the same verbose way, it may be failing to prioritize, failing to compress, or failing to act. Anthropic's trajectory suggested a more disciplined view: thinking should be shaped by the target workload. If the target is coding, then thinking should help with codebase navigation, planning, decomposition, error recovery, and tool orchestration. If the target is agent workflows, then thinking should improve execution quality over long horizons rather than producing impressive intermediate prose.


This emphasis on targeted utility points toward something larger: we are moving from the era of training models to the era of training agents. We made this explicit in the Qwen3 blog, writing that "we are transitioning from an era focused on training models to one centered on training agents," and linking future RL advances to environmental feedback for long-horizon reasoning. An agent is a system that can formulate plans, decide when to act, use tools, perceive environment feedback, revise strategy, and continue over long horizons. It is defined by closed-loop interaction with the world.


4. What "agentic thinking" actually means

Agentic thinking is a different optimization target. Reasoning thinking is usually judged by the quality of internal deliberation before a final answer: can the model solve the theorem, write the proof, produce the correct code, or pass the benchmark. Agentic thinking is about whether the model can keep making progress while interacting with an environment.


The central question shifts from "Can the model think long enough?" to "Can the model think in a way that sustains effective action?" Agentic thinking has to handle several things that pure reasoning models can mostly avoid:


+ When to stop thinking and start acting

+ Which tool to call, and in what order

+ Feedback from the environment may be partial and noisy, and still has to be usable

+ Plans must be revised after failures

+ The thread of reasoning must hold across many turns and many tool calls

Agentic thinking means a model that reasons through action.
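The loop that distinguishes agentic thinking from a static monologue can be sketched in a few lines. The tools, tasks, and failure mode below are stubs invented for illustration:

```python
# Minimal sketch of a think-act loop: act, read the environment's feedback,
# revise the plan, and stop when the goal is confirmed.

def run_agent(goal: str, tools: dict, max_steps: int = 8):
    plan = [goal]                  # trivial one-item plan
    log = []
    for _ in range(max_steps):
        task = plan[0]
        ok, observation = tools["execute"](task)          # act in the world
        log.append((task, ok, observation))
        if ok:
            plan.pop(0)                                   # progress made
            if not plan:
                return "done", log
        else:
            plan[0] = tools["revise"](task, observation)  # replan from feedback
    return "gave_up", log

# Stub environment: the first attempt fails with a hint; the revised one passes.
def execute(task: str):
    if task.startswith("fixed:"):
        return True, "ok"
    return False, "hint: missing import"

def revise(task: str, observation: str) -> str:
    return "fixed:" + task          # fold the feedback back into the plan

status, log = run_agent("run tests", {"execute": execute, "revise": revise})
```

Every bullet above appears in miniature: the loop decides when to act, consumes noisy feedback, revises after failure, and carries state across steps.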


5. Why the infrastructure for agentic RL is harder

Once the objective shifts from solving benchmark problems to solving interactive tasks, the RL stack changes. The infrastructure used for classical reasoning RL isn't enough. In reasoning RL, you can often treat rollouts as mostly self-contained trajectories with relatively clean evaluators. In agentic RL, the policy is embedded inside a larger harness: tool servers, browsers, terminals, search engines, simulators, execution sandboxes, API layers, memory systems, and orchestration frameworks. The environment is no longer a static verifier; it's part of the training system.


This creates a new systems requirement: training and inference must be more cleanly decoupled. Without that decoupling, rollout throughput collapses. Consider a coding agent that must execute generated code against a live test harness: the inference side stalls waiting for execution feedback, the training side starves for completed trajectories, and the whole pipeline operates far below the GPU utilization you would expect from classical reasoning RL. Adding tool latency, partial observability, and stateful environments amplifies these inefficiencies. The result is that experimentation slows and becomes painful long before you reach the capability levels you are targeting.
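The decoupling argument can be sketched with a queue between rollout workers and the learner, so a slow environment step never blocks consumption of finished trajectories. The latencies, rewards, and names here are toy stand-ins:

```python
# Sketch of decoupled rollout generation: actor threads block on a slow
# environment while the learner drains completed trajectories from a queue,
# so tool latency does not stall the update loop.
import queue
import threading
import time

traj_queue: "queue.Queue" = queue.Queue()

def rollout_worker(worker_id: int, episodes: int):
    for ep in range(episodes):
        time.sleep(0.01)                 # stand-in for tool / sandbox latency
        reward = float(worker_id + ep)   # stand-in for a scored trajectory
        traj_queue.put((worker_id, reward))

def learner(total: int):
    batch = []
    while len(batch) < total:
        batch.append(traj_queue.get())   # never waits on any single env
    return batch

workers = [threading.Thread(target=rollout_worker, args=(i, 3)) for i in range(4)]
for w in workers:
    w.start()
batch = learner(total=12)
for w in workers:
    w.join()
```

With one synchronous loop instead, total wall time would be the sum of every environment delay; here the slow calls overlap, which is the throughput argument in miniature.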


The environment itself also becomes a first-class research artifact. In the SFT era, we obsessed over data diversity. In the agent era, we should obsess over environment quality: stability, realism, coverage, difficulty, diversity of states, richness of feedback, exploit resistance, and scalability of rollout generation. Environment-building has started to become a real startup category rather than a side project. If the agent is being trained to operate in production-like settings, then the environment is part of the core capability stack.
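Treating the environment as an artifact suggests giving it an explicit interface and cheap quality gates. A minimal sketch with a toy environment and a determinism check follows; all names are illustrative:

```python
# Sketch: an environment with a reset/step interface plus a quality gate
# (same seed must reproduce the same episode, per the stability requirement).
import random

class ToyEnv:
    """A tiny guess-the-digit environment."""
    def __init__(self, seed: int):
        self.seed = seed
    def reset(self) -> dict:
        rng = random.Random(self.seed)   # stability: same seed, same episode
        self.target = rng.randint(0, 9)
        return {"hint": "guess a digit"}
    def step(self, action: int):
        done = action == self.target
        return {"correct": done}, done

def check_determinism(make_env) -> bool:
    """Environment-quality gate: two instances with one seed must agree."""
    a, b = make_env(), make_env()
    a.reset()
    b.reset()
    return a.target == b.target

deterministic = check_determinism(lambda: ToyEnv(seed=7))
env = ToyEnv(seed=7)
env.reset()
obs, done = env.step(env.target)
```

Gates like this are the environment-side analogue of data validation in the SFT era: run them continuously, or the policy will quietly train against a moving or leaky target.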


6. The next frontier is more useful thinking

My expectation is that agentic thinking will become the dominant form of thinking. I think it may eventually replace much of the old static-monologue version of reasoning thinking: excessively long, isolated internal traces that try to compensate for lack of interaction by emitting more and more text. Even on very difficult math or coding tasks, a genuinely advanced system should have the right to search, simulate, execute, inspect, verify, and revise. The objective is to solve problems robustly and productively.


The hardest challenge in training such systems is reward hacking. As soon as the model gets meaningful tool access, reward hacking becomes much more dangerous. A model with search might learn to look up answers directly during RL. A coding agent might exploit future information in a repository, misuse logs, or discover shortcuts that invalidate the task. An environment with hidden leaks can make the policy look superhuman while actually training it to cheat. This is where the agent era becomes much more delicate than the reasoning era. Better tools make the model more useful, but they also enlarge the attack surface for spurious optimization. We should expect the next serious research bottlenecks to come from environment design, evaluator robustness, anti-cheating protocols, and more principled interfaces between policy and world. Still, the direction is clear. Tool-enabled thinking is simply more useful than isolated thinking, and has a far better chance of improving real productivity.
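One of the anti-cheating protocols gestured at above can be sketched crudely: audit a trajectory's tool observations for the gold answer before crediting it. The log format and the substring check are illustrative assumptions; a real auditor would need far more care, since the answer may appear incidentally:

```python
# Sketch of a leak audit for a search-enabled agent: a trajectory that
# "solved" the task by retrieving the gold answer verbatim is scored zero.

def audited_reward(final_answer: str, gold: str, tool_observations) -> float:
    solved = final_answer.strip() == gold.strip()
    leaked = any(gold.strip() in obs for obs in tool_observations)
    if solved and leaked:
        return 0.0          # looks superhuman, actually learned to cheat
    return 1.0 if solved else 0.0

honest = audited_reward("17", "17", ["doc: primes are divisible only by..."])
cheat = audited_reward("17", "17", ["search result: 'the answer key says 17'"])
```

The asymmetry matters: the honest and cheating trajectories produce identical final answers, so only inspecting the interaction, not the output, separates them.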


Agentic thinking will also mean harness engineering. The core intelligence will increasingly come from how multiple agents are organized: an orchestrator that plans and routes work, specialized agents that act like domain experts, and sub-agents that execute narrower tasks while helping control context, avoid pollution, and preserve separation between different levels of reasoning. The future is a shift from training models to training agents, and from training agents to training systems.
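The orchestrator-plus-specialists harness can be sketched with stub agents; the routing rule and agent names are invented for illustration:

```python
# Sketch of a harness: an orchestrator routes work to specialist sub-agents,
# and each sub-agent sees only its own slice of context (the separation that
# prevents pollution across reasoning levels). All agents are stubs.

SPECIALISTS = {
    "code": lambda task: f"[coder] patched: {task}",
    "search": lambda task: f"[searcher] found refs for: {task}",
}

def orchestrate(tasks):
    results = []
    for task in tasks:
        kind = "code" if "fix" in task else "search"   # trivial routing policy
        # each sub-agent receives only the task string, not the full transcript
        results.append(SPECIALISTS[kind](task))
    return results

out = orchestrate(["fix flaky test", "background on GRPO"])
```

In a real harness the routing policy is itself a model and the transcripts are long, but the design choice is the same: intelligence lives in the organization as much as in any single checkpoint.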


Closing

The first phase of the reasoning wave established something important: RL on top of language models can produce qualitatively stronger cognition when the feedback signal is reliable and the infrastructure can support it.


The deeper transition is from reasoning thinking to agentic thinking: from thinking longer to thinking in order to act. The core object of training has shifted. It is the model-plus-environment system, or more concretely, the agent and the harness around it. That changes what research artifacts matter most: model architecture and training data, yes, but also environment design, rollout infrastructure, evaluator robustness, and the interfaces through which multiple agents coordinate. It changes what "good thinking" means: the most useful trace for sustaining action under real-world constraints, rather than the longest or most visible one.


It also changes where the competitive edge will come from. In the reasoning era, the edge came from better RL algorithms, stronger feedback signals, and more scalable training pipelines. In the agentic era, the edge will come from better environments, tighter train-serve integration, stronger harness engineering, and the ability to close the loop between a model's decisions and the consequences those decisions produce.


Originally published on X (Twitter); author: Junyang Lin (林俊旸)

Compiled and translated by 賽博禪心

