Generative artificial intelligence progressively degrades the accuracy of corporate documents when used for extended editing tasks, according to a new study from Microsoft Research. The benchmark, named DELEGATE-52, evaluated large language models (LLMs) across successive reading, interpretation, and modification tasks. Results show that while these tools perform well on short assignments, they can remove relevant data, alter correct information, and generate progressive distortions when operating without constant human supervision.
The DELEGATE-52 Benchmark: Simulating Real-World Workflows
Unlike traditional evaluations that focus on isolated questions, the DELEGATE-52 benchmark was designed to simulate actual professional activities across dozens of knowledge domains. It measures what happens when an AI system is given autonomy to carry out long workflows, such as drafting reports, creating presentations, and summarizing content, across multiple sequential steps. Researchers observed that problems intensify as the number of interactions the AI performs within a single document increases. Small errors, even if imperceptible at each step, accumulate over time and lead to significant quality degradation.
Document Degradation: The Accumulation of Small Errors
One of the central phenomena identified is what researchers call "document degradation": the gradual loss of precision as a document undergoes multiple AI-driven revisions. A detail altered slightly in one revision can be treated as correct in later stages, so the distortion grows with each pass. The effect resembles the game of telephone, in which a message relayed from person to person accumulates small changes until it differs sharply from the original. According to the study, this pattern was observed across several advanced models currently available on the market.
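The compounding effect can be illustrated with a back-of-the-envelope sketch. It assumes, purely for illustration, that each revision independently preserves a fixed fraction of the document's details. The function name `expected_fidelity` and the 99% figure are hypothetical and are not taken from the study.

```python
def expected_fidelity(per_step_fidelity: float, steps: int) -> float:
    """Fraction of details expected to survive `steps` revisions.

    Assumes every revision independently keeps the same fraction of
    details intact (an illustrative simplification, not a study result).
    """
    if not 0.0 <= per_step_fidelity <= 1.0:
        raise ValueError("per_step_fidelity must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_fidelity ** steps


if __name__ == "__main__":
    # Hypothetical 99% per-revision fidelity over increasingly long workflows.
    for steps in (1, 10, 20, 50):
        print(steps, round(expected_fidelity(0.99, steps), 3))
```

Under these assumptions, a revision that looks nearly perfect in isolation still leaves only about 82% of the details intact after 20 passes and about 61% after 50.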
Why AI Loses Precision Over Time
Large language models work by predicting which words are most likely to come next in a given context. Although this approach produces fluent, sophisticated text, it does not guarantee that the model understands what the information means. When a document is edited repeatedly, the model must decide what to keep, remove, or modify. In many cases important information is over-condensed, misinterpreted, or replaced with content that sounds plausible but is incorrect. Long documents pose an additional challenge because they require the system to keep a large volume of context in view at once.
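A toy example can make the idea of next-word prediction concrete. The sketch below counts which word follows which in a tiny made-up corpus and returns the most frequent follower. Real language models are neural networks operating on tokens, not word-pair counts, so this is only an analogy. The function names and the corpus are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(text: str) -> dict:
    """Count, for every word, which words follow it in the text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def most_likely_next(counts: dict, word: str):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


if __name__ == "__main__":
    # Made-up corpus: revenue rose twice and fell once.
    corpus = "revenue rose in q1 . revenue rose in q2 . revenue fell in q3 ."
    model = train_bigrams(corpus)
    print(most_likely_next(model, "revenue"))  # rose
```

The predictor answers "rose" because that continuation is the most frequent one, not because it checked what happened in any particular quarter. A likely continuation and a correct one are not the same thing.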
Python Programming Shows Relative Resilience
Among the domains evaluated, Python programming held up comparatively well. Researchers noted that code generation and modification lend themselves to automatic evaluation: tests, compilers, and validators can flag errors, a safety net that ordinary prose lacks. This helps explain why AI automation has been notably successful in software development. Still, experts warn that AI-generated code must undergo technical review before it is deployed to production.
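To illustrate why code is easier to check automatically, the sketch below compares a reference function with a hypothetical revised version that contains a subtle slip. A handful of test cases is enough to flag the regression. All names and values are invented for illustration and do not come from the benchmark.

```python
def apply_discount(price: float, percent: float) -> float:
    """Reference behaviour: reduce the price by `percent` percent."""
    return round(price * (1 - percent / 100), 2)


def apply_discount_edited(price: float, percent: float) -> float:
    """Hypothetical revision with a subtle slip: the division by 100 was dropped."""
    return round(price * (1 - percent), 2)


def passes_checks(func) -> bool:
    """Run a few known input/output pairs against an implementation."""
    cases = [
        ((100.0, 10), 90.0),
        ((80.0, 25), 60.0),
        ((50.0, 0), 50.0),
    ]
    return all(func(*args) == expected for args, expected in cases)


if __name__ == "__main__":
    print(passes_checks(apply_discount))         # True
    print(passes_checks(apply_discount_edited))  # False
```

A comparable slip in a paragraph of prose, such as a percentage quietly changed during rewording, has no test suite to trip over.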
The Role of Human Supervision Remains Indispensable
The main conclusion of the DELEGATE-52 study is that human oversight remains essential. Current models, no matter how advanced, lack real understanding of context, intentions, or the consequences associated with the information they manipulate. Experienced professionals play a fundamental role in fact-checking, critical analysis, identifying inconsistencies, and validating results. In practice, the combination of artificial intelligence and human supervision tends to deliver better outcomes than either approach alone. The research reinforces that for critical activities such as financial reports, legal contracts, and scientific research, AI should serve as a support tool — not a substitute.
Despite current limitations, experts believe AI agents will continue to evolve rapidly. New architectures, larger context windows, integration with external databases, and advanced verification mechanisms could significantly reduce the problems observed today. Many argue that the future of automation will depend on creating systems capable of continuously verifying their own responses — perhaps with multiple agents working together and independent validations. According to the research, the most promising path is collaboration between humans and machines, combining computational speed with human judgment.
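As a minimal sketch of what an independent validation step might look like, the code below compares the numeric figures in an original passage with those in a revised one and reports any that disappeared. This is an illustrative assumption, not a mechanism described in the study. The helper names and sample sentences are hypothetical.

```python
import re

# Matches integers, decimals, and an optional trailing percent sign.
# Simplification: figures with thousands separators (e.g. "4,200") are split.
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?%?")


def extract_figures(text: str) -> set:
    """Collect every numeric figure that appears in the text."""
    return set(NUMBER_PATTERN.findall(text))


def missing_figures(original: str, revised: str) -> list:
    """Figures present in the original but absent from the revision."""
    return sorted(extract_figures(original) - extract_figures(revised))


if __name__ == "__main__":
    original = "Revenue grew 12.5% to 4.2 million in 2023."
    revised = "Revenue grew 15% to 4.2 million in 2023."
    print(missing_figures(original, revised))  # ['12.5%']
```

A check like this only catches figures that were dropped or changed. It says nothing about reworded claims, which is one reason human review remains part of the loop.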
