Beyond the 5-Minute Task: Why Current Agent Benchmarks Are Failing Us
Abstract
The rise of LLM-based autonomous agents has outpaced our ability to evaluate them. Most existing benchmarks—WebArena, SWE-bench, AgentBench—measure performance on tasks completable in seconds or minutes. Real-world deployment, however, demands agents that operate reliably over hours, days, or weeks. This post surveys the landscape of long-horizon task evaluation, identifies structural gaps in current methodologies, and proposes a framework for what Long-Time-Task (LTT) benchmarks should look like.
1. The Horizon Problem
When we benchmark LLMs on reasoning, we ask: can the model produce a correct answer? When we benchmark agents, we should ask: can the model complete a meaningful workflow? These are fundamentally different questions.
Current benchmarks implicitly assume a bounded horizon: a task has a clear start state, a clear terminal condition, and fits within a single context window. This assumption breaks down the moment we try to model anything a human knowledge worker actually does.
Consider the difference:
| Benchmark Task | Real-World Analog |
|---|---|
| Fix a failing unit test | Refactor a legacy codebase over 3 sprints |
| Book a flight given a URL | Plan and manage a multi-destination business trip |
| Answer a support ticket | Manage a customer relationship over 6 months |
| Summarize a document | Synthesize a month of Slack threads + docs into a strategy memo |
The gap isn't just quantitative (more steps). It's qualitative—long-horizon tasks require state management, error recovery, context compression, and goal drift detection that short-horizon benchmarks simply don't exercise.
2. State of the Art: What Benchmarks Exist Today
2.1 Short-Horizon Leaders
SWE-bench (Princeton, 2024) remains the gold standard for software engineering tasks. Agents are given a GitHub issue and a codebase; success means a passing test suite. Median successful resolution time: ~3–8 minutes of wall-clock execution. SWE-bench Verified (the human-filtered subset) has driven the field's most meaningful progress—Claude 3.7 Sonnet reached 70.3% on this subset as of early 2025.
WebArena evaluates web navigation across e-commerce, forums, and productivity apps. Tasks average 4–12 steps. Strong agents score ~35–50%. The benchmark is valuable but task complexity plateaus quickly—there's no mechanism for multi-session continuity.
AgentBench (THUDM, 2023) covers 8 environments including OS, database, and web. It introduced the idea of agent capability profiles rather than a single score, which is genuinely useful. Still, the longest tasks cap out at ~20 steps.
τ-bench (Yao et al., 2024) is the most interesting recent addition. It focuses on tool-augmented tasks in retail and airline domains, explicitly modeling multi-turn interaction with simulated user policies. It's the closest existing benchmark to testing robustness under uncertainty—but tasks still resolve within a single session.
2.2 Emerging Long-Horizon Work
GAIA (Meta AI, 2023) tests "general AI assistants" on tasks requiring multi-step reasoning, web access, and tool use. Level 3 tasks (the hardest tier) average ~50 intermediate steps and require chaining multiple capabilities. Human baseline: 92%. State-of-the-art agents: ~15–30% on Level 3. This gap is diagnostic.
OSWorld (2024) evaluates agents on real desktop environments. Tasks like "set up a development environment" or "migrate a project to a new framework" are genuinely hard and multi-step. It's one of the few benchmarks where time to completion is reported alongside success rate—an underused metric.
Cybench and InterCode target security and coding tasks respectively, with longer chains. Useful as capability probes, less useful as deployment proxies.
3. What's Missing: A Taxonomy of Long-Time-Task Requirements
If we want to build benchmarks that actually predict real-world agent performance on long-horizon tasks, we need to define what those tasks demand. Here's a working taxonomy:
3.1 Temporal Persistence
The agent must maintain coherent goals and state across multiple sessions. This is non-trivial: context windows reset, external state changes between sessions, and the agent must reconstruct its working memory from external storage.
Current gap: No mainstream benchmark tests multi-session continuity. Everything is single-session.
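To make this concrete, here's a minimal sketch of what checkpoint/resume could look like in an evaluation harness. Everything here—the `agent_state.json` store, the field names—is illustrative, not a prescribed format:

```python
import json
from pathlib import Path

# Hypothetical external store for cross-session state; filename and schema are illustrative.
STATE_FILE = Path("agent_state.json")

def save_checkpoint(goal: str, completed_steps: list[str], constraints: list[str]) -> None:
    """Persist the agent's working state so a later session can resume it."""
    STATE_FILE.write_text(json.dumps({
        "goal": goal,
        "completed_steps": completed_steps,
        "constraints": constraints,
    }, indent=2))

def resume_session() -> dict:
    """Reconstruct working memory from external storage at the start of a new session."""
    if not STATE_FILE.exists():
        raise RuntimeError("No checkpoint found; cannot resume a prior session.")
    return json.loads(STATE_FILE.read_text())
```

A benchmark that tested continuity would wipe the model's context between `save_checkpoint` and `resume_session` and measure how much usable state survives the round trip.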
3.2 Graceful Degradation Under Interruption
Real tasks get interrupted. A code migration might be paused while the team debates architecture. An agent managing a procurement workflow might need to wait 3 days for a vendor response. Does the agent preserve its state cleanly? Does it detect that external state has changed while it was paused?
Current gap: Benchmarks assume uninterrupted execution. No evaluation of checkpoint/resume behavior.
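One plausible way a harness could probe this: record a digest of the external state the agent depends on at pause time, then compare it at resume. A rough sketch, with the hashing scheme purely illustrative:

```python
import hashlib
import json

def snapshot_digest(observable_state: dict) -> str:
    """Hash a canonical serialization of the external state the agent depends on."""
    canonical = json.dumps(observable_state, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def world_changed_while_paused(digest_at_pause: str, current_state: dict) -> bool:
    """True if the external state drifted during the interruption."""
    return snapshot_digest(current_state) != digest_at_pause

# Usage: record snapshot_digest(state) when the agent is paused; check on resume
# before acting on any assumption made before the interruption.
```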
3.3 Error Recovery and Replanning
Short-horizon benchmarks penalize errors in a binary way: either the final state is correct or it isn't. Long-horizon tasks require partial credit semantics—an agent that recovers from 3 errors to complete a 50-step task is more valuable than one that fails completely on step 4.
Additionally, error compounding is a key failure mode: small mistakes at step 5 create large divergences by step 30. Benchmarks should measure error propagation, not just terminal state.
Current gap: Most benchmarks report only success/failure. Error recovery trajectories are rarely analyzed.
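Scoring recovery rather than just terminal state requires step-level trajectory logs. A minimal sketch of what such a record and metric could look like—the `StepRecord` fields are an assumption, not a standard:

```python
from dataclasses import dataclass

@dataclass
class StepRecord:
    index: int
    action: str
    error: bool       # did this step fail?
    recovered: bool   # if it failed, did a later step restore a valid state?

def error_recovery_rate(trajectory: list[StepRecord]) -> float:
    """Fraction of erroneous steps the agent later recovered from (1.0 if no errors)."""
    errors = [s for s in trajectory if s.error]
    if not errors:
        return 1.0
    return sum(s.recovered for s in errors) / len(errors)
```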
3.4 Context Compression and Memory Management
At some horizon length, agents must compress their working context to fit within model limits. How they do this matters: lossy compression can drop critical constraints established early in the task. Lossless compression (e.g., external storage + retrieval) introduces latency and retrieval errors.
Current gap: Benchmarks don't stress-test memory management. Tasks are short enough that compression is rarely needed.
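A common pattern, sketched below, is to keep recent steps verbatim and fold older history into a summary. The `summarize` callable stands in for whatever the agent uses (e.g., an LLM summarization call) and is purely a placeholder:

```python
from typing import Callable

def compress_context(history: list[str], keep_recent: int,
                     summarize: Callable[[list[str]], str]) -> list[str]:
    """Keep the most recent steps verbatim; fold older ones into one summary entry."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent
```

A benchmark could then plant a critical constraint early in the task and check whether it survives repeated compression—exactly the lossy-compression failure described above.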
3.5 Goal Drift Detection
Over long horizons, the original goal can drift—either because the agent misinterprets accumulated instructions, or because the environment changes and the original goal becomes invalid. Agents need metacognitive awareness: "Is what I'm doing still aligned with what I was asked to do?"
Current gap: No benchmark explicitly tests goal drift detection or correction.
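One candidate implementation of such a check is periodic comparison between the agent's current operative goal and the original specification, e.g., via embedding similarity. A sketch, with the `embed` function left abstract since any embedding model could back it:

```python
import math
from typing import Callable

def goal_alignment(original_goal: str, operative_goal: str,
                   embed: Callable[[str], list[float]]) -> float:
    """Cosine similarity between the original spec and the agent's current operative goal."""
    a, b = embed(original_goal), embed(operative_goal)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# A harness might flag drift whenever alignment drops below a threshold
# (say 0.8, purely illustrative) and prompt the agent to re-derive its plan.
```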
4. Proposed Framework: LTT-Bench
Based on the gap analysis above, here's a sketch of what a Long-Time-Task benchmark should include:
4.1 Task Design Principles
Minimum 50-step tasks, targeting 200+. Tasks should require at least one context reset to complete. The most interesting regime is 5–20 "sessions" of work.
Injected interruptions. Randomly pause agent execution and mutate external state. Measure how often the agent detects the change versus proceeding with stale assumptions.
Adversarial underspecification. Real tasks are underspecified. LTT-Bench tasks should have ambiguous requirements that require the agent to ask clarifying questions—and penalize agents that hallucinate specifications instead.
Partial credit scoring. Define intermediate milestones explicitly. Score = weighted sum of milestone completions, not binary terminal state.
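Concretely, the scoring could be as simple as the sketch below; the milestone names and weights are illustrative:

```python
def partial_credit(milestones: dict[str, float], completed: set[str]) -> float:
    """Weighted sum of completed milestones, normalized to [0, 1]."""
    total = sum(milestones.values())
    earned = sum(w for name, w in milestones.items() if name in completed)
    return earned / total if total else 0.0

# Example: a three-milestone task where the agent finished the first two.
score = partial_credit({"plan": 0.2, "implement": 0.5, "tests_pass": 0.3},
                       {"plan", "implement"})  # ≈ 0.7
```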
4.2 Domains
Candidate domains with natural long-horizon structure:
- Software project management: Plan, implement, review, and ship a feature across a simulated sprint. Involves code, tickets, PR reviews, stakeholder communication.
- Research synthesis: Given a research question, gather sources over multiple sessions, synthesize findings, and produce a structured report. External "new papers" can appear mid-task.
- Business operations: Handle an end-to-end procurement or onboarding flow with real third-party integrations, approvals, and wait states.
- Incident response: Diagnose, mitigate, and document a production incident across multiple tools (logs, traces, runbooks, on-call systems).
4.3 Metrics
Beyond pass@k, LTT-Bench should report:
| Metric | Definition |
|---|---|
| Milestone Completion Rate (MCR) | % of defined subtasks completed |
| Error Recovery Rate (ERR) | % of errors from which the agent successfully recovers |
| Context Fidelity Score (CFS) | How accurately the agent reconstructs task state after a context reset |
| Goal Alignment Index (GAI) | Semantic similarity of agent's current operative goal to original specification |
| Session Efficiency | Steps to completion normalized by task complexity |
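As one illustration, the Context Fidelity Score could be computed by diffing the agent's reconstructed state against a ground-truth reference; the field-level exact match below is a deliberately simple proxy for whatever comparison a real harness would use:

```python
def context_fidelity(reference: dict, reconstructed: dict) -> float:
    """Fraction of ground-truth state fields the agent reproduced after a context reset."""
    if not reference:
        return 1.0
    matches = sum(1 for key, value in reference.items() if reconstructed.get(key) == value)
    return matches / len(reference)
```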
5. Why This Matters for Production Deployments
This isn't academic. Teams deploying agents in production are discovering these failure modes the hard way.
The most common failure pattern we've observed: an agent that scores well on short benchmarks fails catastrophically on the 40th step of a real workflow—not because it lacks capability, but because it loses the thread. It forgets a constraint established early, it retries a failed action without understanding why it failed, or it completes a subtask that's no longer relevant because the goal drifted.
These failures are nearly invisible in current benchmark scores. An agent that gets 70% on SWE-bench might still be unreliable for any task that takes more than 20 minutes of wall-clock time.
The industry needs a "stress endurance" benchmark the same way the automotive industry needs crash tests alongside fuel efficiency ratings. Both matter. We're only measuring one.
6. Open Questions
- How do we handle human-in-the-loop tasks? Many long-horizon tasks include human approval steps. Should benchmarks simulate human feedback, and if so, with what fidelity?
- What's the right evaluation cadence? Should LTT benchmarks run continuously (like a production monitor) rather than as point-in-time evaluations?
- How do we prevent benchmark overfitting? If agents are trained on simulated long-horizon tasks, they may overfit to the benchmark's specific interruption patterns or error injection strategies.
- Multi-agent settings. The hardest real-world tasks involve multiple agents coordinating. LTT-Bench v1 should probably focus on single-agent, but the framework needs to extend cleanly.
7. Conclusion
The agent benchmark ecosystem has done impressive work defining what "capable" means on bounded tasks. The next frontier is defining what "reliable" means over time. We need benchmarks that test error recovery, multi-session continuity, goal alignment, and graceful degradation—not just terminal state correctness.
Until we can measure long-horizon reliability, we're flying blind when deploying agents on anything that matters.
References
- Jimenez et al. (2024). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? Princeton NLP.
- Zhou et al. (2023). WebArena: A Realistic Web Environment for Building Autonomous Agents. CMU.
- Liu et al. (2023). AgentBench: Evaluating LLMs as Agents. THUDM.
- Yao et al. (2024). τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains.
- Mialon et al. (2023). GAIA: A Benchmark for General AI Assistants. Meta AI.
- Xie et al. (2024). OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments.
Feedback, corrections, or want to collaborate on LTT-Bench? → abelsantillanrdz@gmail.com