AI Agents, Reasoning, and Qwen: Today’s AI News

Today’s news centers on agents, reasoning, and efficiency: exactly the area where it becomes clear whether AI merely sounds impressive or actually does useful work. On top of that, there are new research approaches for retrieval, uncertainty estimation, and optimization, plus a small geopolitical side note involving Anthropic and the NSA.

🧠 AutoTTS: When an AI agent makes thinking more efficient

Researchers from UMD, Google, Meta, and other institutions show with AutoTTS that a coding agent can itself find a better control algorithm for AI reasoning. Specifically, the goal is to decide when a language model should keep thinking and when it can stop without losing quality. The result sounds almost too good to be true: around 70 percent less compute than the well-known Self-Consistency method at comparable accuracy. Searching for the algorithm cost only $40 and 160 minutes. That’s remarkable because the approach does not just improve a single model; it optimizes the reasoning workflow itself. For anyone deploying LLMs in production, this matters: fewer tokens, lower latency, lower electricity costs. And yes, sometimes the smartest AI is apparently the one that knows when to stop talking. Source: The Decoder
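To make the "when to stop thinking" idea concrete, here is a minimal sketch of Self-Consistency (majority voting over sampled answers) with a simple hand-written stopping rule. This is not the algorithm AutoTTS discovered; the function name, the `lead_to_stop` rule, and the default values are assumptions chosen purely for illustration.

```python
from collections import Counter
from typing import Callable, Tuple


def self_consistency_early_stop(
    sample_answer: Callable[[], str],
    max_samples: int = 16,
    min_samples: int = 3,
    lead_to_stop: int = 3,
) -> Tuple[str, int]:
    """Majority-vote over sampled answers, stopping once one answer leads clearly.

    sample_answer is a hypothetical callable that runs the model once and
    returns its final answer as a string. Returns (winning answer, samples used).
    """
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")

    votes: Counter = Counter()
    used = 0
    for used in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        if used < min_samples:
            continue  # always draw a minimum number of samples first
        ranked = votes.most_common(2)
        top_count = ranked[0][1]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        # Illustrative rule: stop when the leader is far enough ahead.
        if top_count - runner_up >= lead_to_stop:
            break
    return votes.most_common(1)[0][0], used


# Usage with a scripted stand-in for the model.
scripted = iter(["42", "42", "41", "42", "42", "42", "42", "42"])
answer, samples_used = self_consistency_early_stop(lambda: next(scripted))
print(answer, samples_used)
```

Plain Self-Consistency would always spend the full `max_samples` budget; any savings come from stopping early on easy questions, which is the kind of control decision the article describes.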

🔎 SeedER: Retrieval from Knowledge Graphs becomes more agentic

With SeedER, researchers propose an approach intended to make retrieval from knowledge graphs more efficient. The problems are familiar: ego-graph expansion quickly gets out of hand, dense embeddings struggle with multi-hop, compositional questions, and classic agentic graph search is expensive. SeedER uses a seed-and-expand principle: first find suitable starting nodes, then expand in a targeted way. This is exciting because knowledge graphs are extremely powerful in theory but often fail in practice due to cost, scale, and hard-to-manage structure. For LLM retrieval, RAG, and agentic knowledge systems, such an approach could become an important building block, especially for complex multi-hop questions. The real message: intelligent search is often more important than brute-force traversal. Source: arXiv:2605.23753
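The seed-and-expand principle can be sketched in a few lines. The code below is a toy illustration, not SeedER itself: seeds are picked by naive keyword overlap and expansion is a bounded breadth-first search. The function names, the scoring rule, and the `max_hops` and `max_nodes` limits are all assumptions.

```python
from collections import deque
from typing import Dict, List

Graph = Dict[str, List[str]]


def pick_seeds(question: str, node_labels: Dict[str, str], top_k: int = 2) -> List[str]:
    """Pick starting nodes whose labels share the most words with the question."""
    q_terms = set(question.lower().split())
    scored = []
    for node, label in node_labels.items():
        overlap = len(q_terms & set(label.lower().split()))
        if overlap > 0:
            scored.append((overlap, node))
    # Highest overlap first; node name breaks ties deterministically.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [node for _, node in scored[:top_k]]


def expand(graph: Graph, seeds: List[str], max_hops: int = 2, max_nodes: int = 10) -> List[str]:
    """Breadth-first expansion from the seeds, bounded by hop count and node budget."""
    visited: List[str] = []
    seen = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(visited) < max_nodes:
        node, depth = queue.popleft()
        visited.append(node)
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited


# Toy knowledge graph (made-up example data).
graph = {
    "marie_curie": ["radium", "sorbonne"],
    "radium": ["element"],
    "sorbonne": ["paris"],
    "paris": ["france"],
}
labels = {"marie_curie": "Marie Curie", "paris": "Paris", "radium": "radium"}

seeds = pick_seeds("Where did Marie Curie teach", labels)
print(seeds, expand(graph, seeds, max_hops=1))
```

The point of the two bounds is the one the article makes: a small, well-chosen starting set plus a hard budget keeps the retrieved subgraph from ballooning the way unrestricted ego-graph expansion does.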

🤖 Qwen3.7-Max: Alibaba is playing at agent level

Alibaba presents Qwen3.7-Max, a proprietary model explicitly designed for long-term autonomous work as an AI agent. According to the benchmarks, it reaches a level comparable to Claude Opus 4.6 and leaves other Chinese models such as DeepSeek V4 Pro and Kimi K2.6 behind. Particularly interesting is not just its benchmark position, but the demonstration of 1158 autonomous steps on a complex development task. That’s another sign that competition among models is increasingly shifting from pure text quality toward agent capability, endurance, and tool use. Almost as an aside, the team also shows the model controlling a quadruped robot — the classic “by the way, we attached some hardware to it” move. For the market, this means agentic LLMs are increasingly becoming a product promise. Source: The Decoder

🧬 WeCon: Multi-objective optimization with weight control

WeCon is a new neural solver for multi-objective combinatorial optimization problems. Sounds unwieldy, but it matters for anyone trying to solve optimization problems with multiple goals — for example cost, time, and quality at the same time. Many existing methods break such problems into subtasks with weights, but treat those weights too statically or only once during decoding. WeCon tries to embed the weights more deeply into the model, making it more flexible in responding to different trade-offs. That’s interesting because real-world optimization is rarely one-dimensional. Once you have multiple goals at once, “the best solution” quickly becomes a question of: best for whom, and according to which criterion? That is exactly where WeCon comes in. For research in optimization, operations research, and AI-assisted planning, this is a useful step forward. Source: arXiv:2605.22876
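The classic weighted decomposition that such solvers build on can be shown with a tiny example. This sketch is plain weighted-sum scalarization over a hand-made candidate list, not WeCon's neural architecture; the candidate names and objective values are invented for illustration, and both objectives are assumed to be minimized.

```python
from typing import Dict, Sequence


def scalarize(objectives: Sequence[float], weights: Sequence[float]) -> float:
    """Collapse several objective values into one score via a weighted sum."""
    if len(objectives) != len(weights):
        raise ValueError("objectives and weights must have the same length")
    return sum(w * o for w, o in zip(weights, objectives))


def best_for_weights(candidates: Dict[str, Sequence[float]], weights: Sequence[float]) -> str:
    """Return the candidate with the lowest weighted score (all objectives minimized)."""
    return min(candidates, key=lambda name: scalarize(candidates[name], weights))


# Hypothetical solutions described by (cost, time); lower is better for both.
candidates = {
    "cheap_slow": (1.0, 9.0),
    "fast_pricey": (8.0, 2.0),
    "balanced": (4.0, 4.0),
}

for weights in [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]:
    print(weights, "->", best_for_weights(candidates, weights))
```

Each weight vector picks a different winner, which is the "best for whom?" question in miniature. A fixed candidate list makes this trivial; the hard part a neural solver faces is constructing good solutions for whatever weights it is handed.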

🛡️ Anthropic may continue supplying Claude to the NSA

A politically and economically sensitive point: Anthropic apparently may continue to deliver Claude models to the NSA despite being classified as a “supply chain risk.” According to the report, this is partly because the intelligence agencies lack Nvidia’s latest Grace Blackwell chips, and because Anthropic’s model “Mythos” is said to run on older hardware as well. What makes this particularly sensitive is the contract context around the clause “any lawful use”, which had previously caused disputes. For the AI market, this is interesting because it shows how closely government, security, hardware availability, and model access are intertwined. It is no longer just about who has the best model, but also who is allowed to run it, and under what conditions. The debate about AI in government agencies therefore remains a mix of technology, procurement, and politics — exactly the kind of topic where all parties end up phrasing things very carefully. Source: The Decoder

📊 Reading uncertainty from trajectories rather than a single snapshot

The paper Reading Calibrated Uncertainty from Language Model Trajectories addresses a familiar problem: many methods for uncertainty quantification in LLMs rely on Maximum Softmax Probability (MSP) — cheap, but often poorly calibrated. Other approaches look at internal activations, but treat them as static snapshots. The new direction instead focuses on trajectories, meaning the evolution of the model’s states over the course of generation. That is plausible: in language models, uncertainty is often not a moment, but a process. If you look only at a single value at the end, you may miss the actual dynamics. For applications with structured output, safety, and reliable decision support, this matters, because well-calibrated uncertainty is often the difference between useful assistance and overconfident nonsense. Source: arXiv:2605.22864
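For readers who want to see what the baseline looks like, here is a minimal sketch of Maximum Softmax Probability plus a standard calibration check, the expected calibration error (ECE). It illustrates only the single-snapshot baseline the paper argues against, not the trajectory-based method; the function names and the equal-width binning with 10 bins are assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp(logits: Sequence[float]) -> float:
    """Maximum Softmax Probability: the confidence assigned to the top class."""
    return max(softmax(logits))


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Weighted average gap between confidence and accuracy over equal-width bins."""
    if not confidences:
        raise ValueError("need at least one prediction")
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(conf for conf, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += len(bucket) / total * abs(avg_conf - accuracy)
    return ece


# A model that says "90% sure" but is right only half the time is overconfident.
print(msp([2.0, 0.5, 0.1]))
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, False, False]))
```

A low ECE means stated confidence matches observed accuracy. Any uncertainty method, snapshot or trajectory-based, can be scored this way.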

🏭 Federated Recommender Systems: personalization without centralizing data

The paper Building a privacy-preserving Federated Recommender system for mobile devices presents an approach to personalized recommendations on mobile devices without centrally collecting sensitive user data. Instead, the system uses a two-stage federated pipeline and cleanly separates the relevant components. This is especially important because recommendation systems have traditionally depended heavily on centralized user data, which increasingly conflicts with modern privacy requirements and regional regulations. Here, federated learning is not a buzzword, but a practical compromise: models learn from local signals without phoning everything home. For mobile AI, privacy-by-design, and on-device personalization, this is a relevant building block. Not a cure-all, but at least significantly less data-hungry. Source: arXiv:2605.22924
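The core mechanic of federated learning, learning from local signals without uploading raw data, is easiest to see in the standard federated averaging step. The sketch below shows that generic technique only. It is not the paper's two-stage pipeline, whose details are not given here, and the function name and data layout are assumptions.

```python
from typing import List, Tuple


def federated_average(client_updates: List[Tuple[List[float], int]]) -> List[float]:
    """Combine client model weights into one global model.

    Each entry is (locally trained weights, number of local examples).
    Only these weights leave the device; the raw interaction data does not.
    Clients with more local data get proportionally more influence.
    """
    if not client_updates:
        raise ValueError("need at least one client update")
    total_examples = sum(count for _, count in client_updates)
    dim = len(client_updates[0][0])
    averaged = [0.0] * dim
    for weights, count in client_updates:
        if len(weights) != dim:
            raise ValueError("all clients must send weights of the same size")
        for i, value in enumerate(weights):
            averaged[i] += value * count / total_examples
    return averaged


# Two hypothetical phones: one with 1 local example, one with 3.
print(federated_average([([1.0, 0.0], 1), ([3.0, 4.0], 3)]))
```

Weight sharing alone is not a complete privacy guarantee, since updates can still leak information; production systems typically layer further protections on top.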

🛠️ Tool tip of the day

If you work on agents, benchmarking, and LLM workflows, it’s worth taking a look at tools for structured experiment and run management today. When autonomous agents take many steps, it quickly becomes unclear where time, tokens, and quality are being lost. A tool stack with clean logging, evaluation pipelines, and cost control will save you more headaches later than any spontaneous “let’s just take a quick look.” For teams that want to test such workflows professionally, this is a good starting point.
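As a rough idea of what "clean logging" can mean at the smallest scale, here is a sketch of a per-step run log that tracks tokens, time, and cost. It does not represent any particular tool; the class names, fields, and the flat price per 1,000 tokens are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StepRecord:
    name: str       # what the agent did in this step
    tokens: int     # tokens consumed by the step
    seconds: float  # wall-clock duration of the step


@dataclass
class RunLog:
    usd_per_1k_tokens: float  # assumed flat price; real pricing is usually more complex
    steps: List[StepRecord] = field(default_factory=list)

    def record(self, name: str, tokens: int, seconds: float) -> None:
        self.steps.append(StepRecord(name, tokens, seconds))

    def total_tokens(self) -> int:
        return sum(step.tokens for step in self.steps)

    def total_cost_usd(self) -> float:
        return self.total_tokens() / 1000 * self.usd_per_1k_tokens

    def slowest_step(self) -> StepRecord:
        return max(self.steps, key=lambda step: step.seconds)


# Made-up numbers for a three-step agent run.
log = RunLog(usd_per_1k_tokens=2.0)
log.record("plan", tokens=1500, seconds=4.2)
log.record("call_tool", tokens=300, seconds=9.8)
log.record("summarize", tokens=200, seconds=2.1)
print(log.total_tokens(), log.total_cost_usd(), log.slowest_step().name)
```

Even this much is enough to answer the two questions that come up first in long agent runs: which step burned the budget, and which step burned the time.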
