AI deals, benchmarks, and agents: what’s inside today

Today is less about the next big “AI moment” and more about who really holds the levers in the AI market right now: model makers, tool builders, or the companies trying to make agents productive in everyday work. On top of that, several research papers make one thing pretty clear: the difference between a demo and real reliability often lies in the details. And yes, the big deals are still on the table today too, naturally with the usual mix of strategy, money, and a bit of Silicon Valley operetta.

🚀 OpenAI buys Ona: more cloud power for Codex agents

OpenAI is acquiring the German startup Ona, formerly known as Gitpod, which was founded in Kiel in 2020. The deal fits neatly into OpenAI’s agent strategy: Codex should not only suggest code, but also be able to work autonomously for longer periods in secure cloud development environments — even when your laptop has long since switched to power-saving mode. That’s more than just a nice feature. It shows where AI development is heading: away from the pure chat window and toward persistently running work environments with permissions, context, and infrastructure.

This matters for the market because such cloud setups are the real bottleneck for productive coding agents. The deciding factor is not the model alone, but how well it is embedded in real dev workflows. So OpenAI is not just buying technology here, but also experience with developer environments and security questions. Source: The Decoder

🧠 LLM-as-an-Investigator: evidence first, conclusions later

The arXiv paper “LLM-as-an-Investigator” addresses a problem you’ve probably already seen in many AI assistants: they jump too quickly to the first plausible explanation. The researchers instead propose an evidence-first approach. Rather than simply accepting a user’s assumption, the model should first actively search for verifiable facts and diagnose problems interactively. Sounds simple, but it makes a big difference for reliability.

Why does this matter? Because LLMs can sound persuasive while committing to a wrong explanation very early, especially in technical support or debugging tasks. The model then “knows” too much too soon. An investigative approach can reduce such false assumptions and is therefore especially interesting for IT support, incident analysis, and complex troubleshooting. In short: less gut feeling, more detective work. Source: arXiv
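To make the evidence-first idea a bit more tangible, here is a minimal sketch. It is not the paper's method: the `Hypothesis` class, the `investigate` function, and the example evidence keys are all hypothetical, and the "evidence" is a plain dictionary rather than anything a real agent collected. The point is only that the user's claim is treated as one hypothesis among several and has to be confirmed by checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Hypothesis:
    name: str
    # Each check returns True if the observed evidence supports the hypothesis.
    checks: list[Callable[[dict], bool]]


def investigate(user_claim: str, hypotheses: list[Hypothesis], evidence: dict) -> dict:
    """Rank hypotheses by how many checks the evidence confirms."""
    scores = {h.name: sum(1 for check in h.checks if check(evidence)) for h in hypotheses}
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        # Nothing is supported yet: keep collecting evidence instead of guessing.
        return {"diagnosis": None, "claim_confirmed": False}
    return {"diagnosis": best, "claim_confirmed": best == user_claim}


# Hypothetical support case: the user is sure "the network is down".
evidence = {"ping_ok": True, "dns_resolves": False, "disk_free_gb": 40}
hypotheses = [
    Hypothesis("network down", [lambda e: not e["ping_ok"]]),
    Hypothesis("dns misconfigured", [lambda e: e["ping_ok"], lambda e: not e["dns_resolves"]]),
    Hypothesis("disk full", [lambda e: e["disk_free_gb"] < 1]),
]
report = investigate("network down", hypotheses, evidence)
```

In this toy case the user's assumption is not confirmed, and the checks point to DNS instead. A real system would gather the evidence interactively, which this sketch leaves out.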

🛠️ Tool tip of the day: Codex/agent workflows with a clean cloud environment

If you’re experimenting with coding agents, you don’t just need a good model — you also need a clean, reproducible environment. This is exactly where cloud dev workspaces help, allowing agents to edit files, run tests, and work through tasks over longer periods. That’s especially interesting for teams that want to automate Git-based workflows without tangling up local machines.
Tip: take a look at suitable agent and cloud dev setups, ideally ones with a security and permissions model.
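If you want to try the isolation idea before committing to a cloud workspace, a rough local stand-in could look like the sketch below. It assumes nothing about any particular product: `run_in_scratch_copy` is a hypothetical helper that copies a checkout into a throwaway directory and runs one command there, so the original files stay untouched. It offers no real sandboxing or permissions model, just a clean working copy.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch_copy(repo: Path, command: list[str], timeout: int = 300) -> subprocess.CompletedProcess:
    """Copy `repo` into a throwaway directory and run `command` there."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "workspace"
        shutil.copytree(repo, workdir)
        # The command only ever sees the copy, never the original checkout.
        return subprocess.run(
            command, cwd=workdir, capture_output=True, text=True, timeout=timeout
        )


# Demo with a hypothetical one-file "repository".
with tempfile.TemporaryDirectory() as demo:
    repo = Path(demo) / "repo"
    repo.mkdir()
    (repo / "notes.txt").write_text("original")
    # Stand-in for an agent step: overwrite a file inside the scratch copy.
    step = [sys.executable, "-c", "open('notes.txt', 'w').write('changed by agent')"]
    result = run_in_scratch_copy(repo, step)
    untouched = (repo / "notes.txt").read_text()
```

After the run, `untouched` still holds the original file content, because the agent step only modified the scratch copy.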

⚙️ SkillOpt: training agents without touching the model

With SkillOpt, Microsoft and three Chinese universities present a pretty clever approach: instead of changing a model’s weights, they optimize the instructions for AI agents. It’s like training for workflows, not for the brain itself. According to the report, a simple Markdown file can significantly improve GPT-5.5 on procedural tasks — and even across models and environments, such as between Codex and Claude Code.

This matters because many agent problems aren’t model problems at all, but prompt, workflow, or context problems. If a method like SkillOpt can generate robust instructions, it could reduce costs and make agents more reliable. For companies, that means it is worth optimizing the process layer before starting the next fine-tuning round. The classic “we need more parameters” reflex is at least getting a little nervous. Source: The Decoder
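What "optimizing the instructions instead of the weights" can mean in the simplest possible form is sketched below. This is explicitly not SkillOpt's algorithm, which the report does not spell out here. It is a toy greedy search under stated assumptions: `run_agent` is a hypothetical stand-in for a real agent run, and a task counts as solved only if the instruction file contains the hints it needs.

```python
def run_agent(instructions: list[str], task: dict) -> bool:
    """Toy stand-in for an agent: succeeds only if every required hint is present."""
    return all(req in instructions for req in task["requires"])


def score(instructions: list[str], tasks: list[dict]) -> float:
    """Fraction of tasks the toy agent solves with these instructions."""
    return sum(run_agent(instructions, t) for t in tasks) / len(tasks)


def optimize_instructions(candidates: list[str], tasks: list[dict]) -> tuple[list[str], float]:
    """Greedy search: keep any candidate line that raises the score."""
    chosen: list[str] = []
    best = score(chosen, tasks)
    improved = True
    while improved:
        improved = False
        for line in candidates:
            if line in chosen:
                continue
            trial = score(chosen + [line], tasks)
            if trial > best:
                chosen, best, improved = chosen + [line], trial, True
    return chosen, best


# Hypothetical tasks and candidate instruction lines.
tasks = [
    {"requires": ["run tests before committing"]},
    {"requires": ["read the README first"]},
    {"requires": []},
]
candidates = ["use tabs", "run tests before committing", "read the README first"]
chosen, best = optimize_instructions(candidates, tasks)
skill_file = "\n".join(f"- {line}" for line in chosen)  # Markdown-style instruction file
```

The model itself never changes in this loop. Only the text file handed to the agent does, which is why such a file could in principle be reused across models and environments.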

📐 Anthropic makes major math gains with Fable 5

Anthropic’s Claude Fable 5 sets new standards on the difficult FrontierMath benchmark and reaches 88% accuracy on the hardest level. For comparison: its predecessor Opus 4.5 was still below 10% at the beginning of 2026, while OpenAI’s GPT-5.5 reaches about 75% according to the report. That’s a pretty significant leap — and a sign of how quickly specialized reasoning capabilities are developing.

That doesn’t automatically mean the problem is solved, though. Benchmarks matter, but they only ever measure a slice of real-world performance. Still, the trend clearly shows that mathematical reasoning and structured problem-solving are improving rapidly in frontier models. That’s relevant for research, engineering, finance, and anywhere else where sound conclusions matter. The race remains exciting, and mildly uncomfortable for anyone who thought yesterday that mathematics was the last safe stronghold. Source: The Decoder

🎥 Rethinking RAG in Long Videos: retrieval becomes multimodal

The arXiv paper “Rethinking RAG in Long Videos” extends retrieval-augmented generation in a difficult direction: long egocentric videos with multiple modalities and timescales. The core problem is familiar, but even trickier here: a system must not only find the right material, but also decide how to use it. Many existing benchmarks are too easy, because the answer can sometimes be given even without the video — a classic benchmark bug in attractive packaging.

Why does this matter? Because video assistants, robotics, and visual analysis need exactly such systems. Good VideoRAG approaches must perform real retrieval rather than just guessing elegantly. In practice, this means: if you rely on multimodal agents, you should take retrieval quality just as seriously as the language model behind it. Otherwise, you end up with a very eloquent spectator, but not a useful analyst. Source: arXiv
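The "answerable without the video" problem can be checked with a blind baseline. The sketch below is an assumption about how such a check might look, not the paper's procedure: `blind_answer` is a deliberately crude text-only heuristic, and the questions are invented. Any question the blind baseline gets right says little about retrieval quality.

```python
def blind_answer(question: dict) -> str:
    """Toy text-only baseline: picks the longest option and never sees the video."""
    return max(question["options"], key=len)


def split_by_leakage(questions: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate questions a blind baseline solves from those that need the video."""
    leaky, needs_video = [], []
    for q in questions:
        if blind_answer(q) == q["answer"]:
            leaky.append(q)
        else:
            needs_video.append(q)
    return leaky, needs_video


# Hypothetical multiple-choice questions about an egocentric video.
questions = [
    {
        "id": "q1",
        "options": ["a cup", "the red mug on the left shelf", "a pan"],
        "answer": "the red mug on the left shelf",
    },
    {
        "id": "q2",
        "options": ["kitchen", "garage", "office"],
        "answer": "garage",
    },
]
leaky, needs_video = split_by_leakage(questions)
```

Here `q1` leaks its answer through the wording alone, while `q2` still requires looking at the footage. In practice the blind baseline would be a language model without video access rather than a length heuristic.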

📊 TerraBench: can agents really connect Earth system data?

Another research paper, built around TerraBench, tests whether AI agents can make meaningful inferences across heterogeneous Earth system data. That includes climate, environmental, and geoscience data sources that can’t simply be turned into a neat text block. That’s exactly what makes it interesting: the systems have to combine information from different formats, time spans, and contexts.

For the AI market, this is a sign of where “practical AI” is heading: away from isolated Q&A setups and toward analytical systems for complex data landscapes. Anyone working in climate tech, research, or monitoring needs models that don’t just sound nice, but integrate data and justify their conclusions reliably. Benchmarks like TerraBench matter because they show where multimodal reasoning systems still stumble today. And unfortunately, that often happens exactly where the world is most complicated. Source: arXiv

🧩 Meta and Manus: when geopolitical reality meets AI deals

TechCrunch reports that Meta apparently wants to unwind its $2 billion deal with Manus after Beijing requested a reversal. This is a good example of how AI deals are no longer decided solely by technology or valuation, but also by geopolitical conditions, regulation, and questions of power. In the lovely AI world, it’s not just the models that are big; the political shadow behind them is too.

For the industry, this is a reminder that acquisitions in the AI sector do not happen in a vacuum. Anyone with international teams, data, infrastructure, or stakes increasingly has to expect headwinds from multiple directions. That can delay strategies, kill deals, or force companies into detours. In short: not every merger survives contact with reality — and some fail before the ink on the term sheet is even dry. Source: TechCrunch
