Four blog posts from OpenAI this week teach managers, finance teams, and individual users how to prompt better, build reusable workflows, and personalize their ChatGPT experience. Then there's Hyatt, rolling out ChatGPT Enterprise across its entire global workforce on GPT-5.4 and Codex. That's not experimentation anymore. That's infrastructure.
Meanwhile the research side is asking sharper questions about whether the tools actually hold up. The Amazing Agent Race benchmark caught something worth sitting with: most existing agent evaluations are simple linear chains, with 55 to 100 percent of test instances involving just two to five steps. Models that look capable in tests may be navigating nothing more complex than a hallway. GTA-2 makes a similar point about tool-use benchmarks being misaligned with real-world workflow complexity.
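The hallway metaphor is easy to make concrete. Here's a toy sketch (hypothetical task names, not drawn from either benchmark): a strict chain of steps admits exactly one valid plan, so there's nothing for an agent to decide, while even slight branching multiplies the orderings it has to get right.

```python
from itertools import permutations

def valid_orderings(deps):
    """Brute-force count of valid execution orders for a task graph
    given as {task: set(prerequisite tasks)} -- a rough proxy for how
    much planning an agent must do to schedule the steps correctly."""
    count = 0
    for order in permutations(deps):
        done = set()
        for task in order:
            if not deps[task] <= done:  # a prerequisite hasn't run yet
                break
            done.add(task)
        else:  # every task's prerequisites were satisfied in this order
            count += 1
    return count

# The shape most benchmark instances take: a strict chain of 2-5 steps.
chain = {
    "search": set(),
    "read": {"search"},
    "summarize": {"read"},
    "send": {"summarize"},
}

# A mildly realistic workflow: two independent lookups feeding a
# comparison, then a report. Still tiny, but no longer a hallway.
branched = {
    "lookup_a": set(),
    "lookup_b": set(),
    "compare": {"lookup_a", "lookup_b"},
    "report": {"compare"},
}

print(valid_orderings(chain))     # 1: only one path through the hallway
print(valid_orderings(branched))  # 2: choices appear, and they compound
```

Real workflows branch far wider than this, which is exactly the misalignment the benchmarks are pointing at.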
There's a parallel concern inside the models themselves. The diversity collapse paper shows that post-training narrows output variation, which quietly undermines inference-time scaling methods that depend on getting different answers from the same model.
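The failure mode is concrete. A minimal simulation (hypothetical numbers, and assuming majority-vote self-consistency as the inference-time method, which is one common example rather than the paper's specific setup) shows what collapse costs:

```python
import random
from collections import Counter

def draw(p_correct, n_wrong_modes, rng):
    """One sampled answer: correct with probability p_correct,
    otherwise one of n_wrong_modes distinct wrong answers."""
    if rng.random() < p_correct:
        return "correct"
    return f"wrong_{rng.randrange(n_wrong_modes)}"

def vote_accuracy(p_correct, n_samples, collapsed,
                  n_wrong_modes=20, trials=20_000, seed=0):
    """Accuracy of majority-vote self-consistency over n_samples.

    collapsed=True models diversity collapse in the extreme: every
    sample repeats a single draw, so the vote is unanimous and adds
    nothing beyond one sample."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        if collapsed:
            answers = [draw(p_correct, n_wrong_modes, rng)] * n_samples
        else:
            answers = [draw(p_correct, n_wrong_modes, rng)
                       for _ in range(n_samples)]
        if Counter(answers).most_common(1)[0][0] == "correct":
            wins += 1
    return wins / trials

# Same model either way: right 60% of the time per sample, errors
# scattered across many distinct wrong answers. Diverse sampling
# lets the correct answer win the vote almost every time:
print(vote_accuracy(0.6, n_samples=16, collapsed=False))  # ~0.99
# With collapsed outputs, the 16 "samples" are one sample in
# disguise, and the voting gain evaporates:
print(vote_accuracy(0.6, n_samples=16, collapsed=True))   # ~0.60
```

The compute spent on extra samples buys nothing once the samples stop disagreeing.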
So the enterprise rollouts assume robustness. The benchmarks keep finding brittleness. That gap doesn't resolve itself just because the contracts are signed.
The week's most revealing detail isn't a model launch. It's the list of things being measured: whether AI sabotages its own research, whether it understands animal biology, whether it can be trusted to reason faithfully, whether it generates fake music convincingly enough to need forensic detection.
The labs are shipping faster than anyone can audit. So the field is quietly building the audit infrastructure in parallel.
Anthropic signed safety MOUs, published RSP Version 3.0, and expanded its Long-Term Benefit Trust board. Google introduced a "cognitive framework for measuring progress toward AGI." Researchers published ASMR-Bench specifically to catch AI sabotaging ML research. AtManRL and "Beyond Surface Statistics" both chase the same ghost: an AI that says it's reasoning but isn't.
That's not a research trend. That's a trust deficit being papered over with benchmarks.
The product announcements kept coming: Claude Opus 4.7, Gemini 3.1 in four flavors, Qwen3.5-Omni at hundreds of billions of parameters, GPT-Rosalind for life sciences. Bigger, faster, more vertical. The capability curve isn't slowing.
Which means the gap between what these systems can do and what anyone can verify about them just got wider again.