Microsoft Research Finds AI Tools Introduce Progressive Errors in Corporate Document Editing

Generative artificial intelligence progressively degrades the accuracy of corporate documents when used for extended editing tasks, according to a new study from Microsoft Research. The benchmark, named DELEGATE-52, evaluated large language models (LLMs) across successive reading, interpretation, and modification tasks. Results show that while these tools perform well on short assignments, they can remove relevant data, alter correct information, and generate progressive distortions when operating without constant human supervision.

The DELEGATE-52 Benchmark: Simulating Real-World Workflows

Unlike traditional evaluations that focus on isolated questions, the DELEGATE-52 benchmark was designed to simulate actual professional activities across dozens of knowledge domains. It measures what happens when an AI system is given autonomy to execute extended workflows, such as drafting reports, creating presentations, and summarizing content, across multiple sequential steps. Researchers observed that problems intensify as the number of AI interactions with a single document grows. Small errors, even if imperceptible at any one step, accumulate over time and lead to significant quality degradation.
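The arithmetic behind this accumulation can be illustrated with a deliberately simple model. The sketch below is not taken from the study: it assumes that each editing step independently preserves any given fact with a fixed probability, and the function name and the 0.99 figure are hypothetical choices made for illustration only.

```python
def surviving_fraction(per_step_fidelity: float, steps: int) -> float:
    """Expected share of facts still intact after `steps` edits.

    Assumption (not from the study): every edit independently keeps a
    given fact correct with probability `per_step_fidelity`.
    """
    if not 0.0 <= per_step_fidelity <= 1.0:
        raise ValueError("per_step_fidelity must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    # Independent steps multiply, so fidelity decays geometrically.
    return per_step_fidelity ** steps


if __name__ == "__main__":
    # 0.99 is a hypothetical per-step fidelity, chosen only to show the trend.
    for steps in (1, 5, 20, 50):
        print(steps, round(surviving_fraction(0.99, steps), 3))
```

Under this toy assumption, a step that looks almost perfect in isolation still leaves a noticeably degraded document after a few dozen passes, which is the qualitative pattern the researchers describe.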

Document Degradation: The Accumulation of Small Errors

One of the central phenomena identified is what researchers call "document degradation": the gradual loss of precision as a document undergoes multiple AI-driven revisions. A piece of information slightly modified in one revision can be treated as correct in later stages, and the distortion compounds. The behavior resembles the game of telephone, in which a message relayed from person to person picks up small changes until the result is very different from the original. According to the study, this pattern was observed across several advanced models currently available on the market.

Why AI Loses Precision Over Time

Large language models work by predicting which words are most likely to come next in a given context. Although this approach produces sophisticated text, it does not guarantee that the model understands what the information means. When a document is edited repeatedly, the model must decide what to keep, remove, or modify, and in many cases important information is over-summarized, misinterpreted, or replaced with content that seems plausible but is incorrect. Long documents present an additional challenge, as they require the system to keep a large volume of context in view at once.

Python Programming Shows Relative Resilience

Among the areas evaluated, Python programming showed comparatively better performance. Researchers noted that code generation and modification tasks have characteristics that favor automatic evaluation: errors can be caught by tests, compilers, and validators, a safety net that ordinary prose lacks. This helps explain the considerable success of AI automation in software development. Still, experts warn that code produced by artificial intelligence must undergo technical review before being deployed in production environments.
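A small example shows why code is easier to police than prose. The sketch below is hypothetical and does not come from the benchmark: it pairs an original helper function with an imagined AI-edited version containing a subtle mistake, and a single regression check separates the two. All function names and the 20% tax rate are assumptions made for illustration.

```python
def net_price(gross: float, tax_rate: float = 0.20) -> float:
    """Original helper: strip tax from a tax-inclusive price."""
    return gross / (1 + tax_rate)


def net_price_after_edit(gross: float, tax_rate: float = 0.20) -> float:
    """Hypothetical edited version: looks plausible, but subtracts the
    tax rate from the gross price instead of dividing it out."""
    return gross * (1 - tax_rate)


def passes_regression_check(fn) -> bool:
    """A 120.00 gross price at 20% tax must yield a 100.00 net price."""
    return abs(fn(120.0) - 100.0) < 1e-9


if __name__ == "__main__":
    print("original passes:", passes_regression_check(net_price))
    print("edited passes:", passes_regression_check(net_price_after_edit))
```

The same kind of silent change in a written report, for instance a figure altered during a rewrite, would trigger no such alarm unless a person compared it against the source.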

The Role of Human Supervision Remains Indispensable

The main conclusion of the DELEGATE-52 study is that human oversight remains essential. Current models, no matter how advanced, lack real understanding of context, intentions, or the consequences associated with the information they manipulate. Experienced professionals play a fundamental role in fact-checking, critical analysis, identifying inconsistencies, and validating results. In practice, the combination of artificial intelligence and human supervision tends to deliver better outcomes than either approach alone. The research reinforces that for critical activities such as financial reports, legal contracts, and scientific research, AI should serve as a support tool — not a substitute.

Despite current limitations, experts believe AI agents will continue to evolve rapidly. New architectures, larger context windows, integration with external databases, and advanced verification mechanisms could significantly reduce the problems observed today. Many argue that the future of automation will depend on creating systems capable of continuously verifying their own responses — perhaps with multiple agents working together and independent validations. According to the research, the most promising path is collaboration between humans and machines, combining computational speed with human judgment.
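One simple form such independent validation could take is a mechanical check that the figures in a document survive each revision. The sketch below is an illustrative assumption, not a method described in the study: the regular expression, function names, and sample sentences are all hypothetical, and the check only covers numbers, not wording or meaning.

```python
import re

# Matches integers, decimals, thousands-separated numbers, and percentages.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")


def numeric_facts(text: str) -> set:
    """Collect every number-like token that appears in the text."""
    return set(NUMBER.findall(text))


def missing_after_revision(original: str, revised: str) -> set:
    """Numbers present in the original that no longer appear in the revision."""
    return numeric_facts(original) - numeric_facts(revised)


if __name__ == "__main__":
    # Hypothetical sentences, invented for this example.
    before = "Revenue rose 4.5% to 1,200 units in 2023."
    after = "Revenue rose 5.4% to 1,200 units in 2023."
    print(missing_after_revision(before, after))
```

A check like this flags a dropped or altered figure for human review. It cannot judge whether a change was intentional, which is where the human judgment described above comes in.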

The Premise News Editorial View: The Microsoft Research study arrives at a pivotal moment when companies worldwide are investing billions in AI seeking productivity gains. The discovery of document degradation shows that blind trust in autonomous systems can be dangerous, especially in sectors where precision is non-negotiable. At stake is not just the quality of reports, but decisions based on potentially distorted information — with financial, regulatory, and even legal consequences. The key tension revealed is between the promise of total automation and the reality that AI still does not understand the meaning of what it manipulates. In the coming months, readers should closely watch how technology companies respond to these limitations: by investing in new validation methods or adjusting their market promises. For now, the most important lesson is that artificial intelligence does not replace the human critical eye — it only complements it.
