LF AI & Data Foundation Launches DocLang Working Group to Standardize AI-Optimized Document Format

The LF AI & Data Foundation, operating under the Linux Foundation, has established a working group to drive the development of DocLang, a document format engineered specifically for artificial intelligence consumption. The initiative aims to help enterprises efficiently feed proprietary files into AI systems.

The DocLang consortium counts IBM, NVIDIA, Red Hat, ABBYY, HumanSignal, and Forgis among its founding members. The group contends that prevailing formats—including PDF, Markdown, HTML, and LaTeX—are fundamentally mismatched for AI document parsing.

In late 2024, IBM released Docling, an open-source toolkit designed to convert diverse file formats into structured, AI-ready data, similar in purpose to Microsoft’s MarkItDown or the Marker project. DocLang builds upon this foundation by defining a standard for exchanging structured output across disparate systems.

“DocLang addresses a foundational challenge in enterprise AI: documents were constructed for human readers, not machines,” said Maxime Vermeir, VP of AI Strategy at ABBYY. “By introducing a minimal, standardized, and AI-native representation of document structure, layout, semantics, and governance, DocLang establishes a far more deterministic foundation for modern AI systems.”

The specification authors argue that existing formats sacrifice semantic information, structural relationships, or geometric context when transformed into tokens for AI models. They note that Markdown is too limited in scope, HTML is too verbose, and LaTeX is too ambiguous.

DocLang is designed around large language model tokenizers. The specification uses a constrained XML vocabulary whose elements map one-to-one to LLM tokens, with the aim of producing more compact prompts. The format is lossless, preserving all original information, and supports complex graphical elements such as tables, formulas, charts, and multimodal content. It is published as an open standard.

The format also promises significant cost control. According to AI Cost Check, a baseline OCR scan of a PDF by an AI model consumes approximately 1,200 input tokens and 150 output tokens. While negligible for single documents, these costs compound rapidly at scale. Variable token pricing across models means enterprises may spend far more than anticipated ingesting lengthy, complex PDFs, particularly when using premium frontier models.
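To see how per-document token counts compound at scale, here is a rough Python sketch. It uses the 1,200 input and 150 output tokens cited above; the document volume and the per-million-token prices are hypothetical placeholders, not figures from AI Cost Check or any vendor.

```python
def ingestion_cost_usd(
    docs: int,
    input_price_per_mtok: float,
    output_price_per_mtok: float,
    input_tokens_per_doc: int = 1200,   # figure cited in the article
    output_tokens_per_doc: int = 150,   # figure cited in the article
) -> float:
    """Estimate the total cost of ingesting `docs` documents.

    Prices are expressed in US dollars per million tokens.
    """
    input_mtok = docs * input_tokens_per_doc / 1_000_000
    output_mtok = docs * output_tokens_per_doc / 1_000_000
    return input_mtok * input_price_per_mtok + output_mtok * output_price_per_mtok


# Hypothetical scenario: one million documents at assumed prices of
# $3 per million input tokens and $15 per million output tokens.
cost = ingestion_cost_usd(
    docs=1_000_000,
    input_price_per_mtok=3.0,
    output_price_per_mtok=15.0,
)
print(f"Estimated ingestion cost: ${cost:,.2f}")
```

Under these assumed prices, a million baseline scans would cost several thousand dollars, and the total scales linearly with both document count and tokens per document, which is why a more compact input format matters at enterprise volume.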

“PDFs were designed for rendering, not understanding,” said Jon Knisley, AI Value and Enablement Lead at ABBYY. “Every time a PDF enters an AI pipeline, structure, meaning, and layout are lost, so the model’s accuracy becomes bottlenecked by document quality rather than model capability. Teams compensate by building custom parsers at every integration point, resulting in brittle, one-off solutions and a new engineering sprint for every new document type.”

Knisley emphasized that ambiguous structure forces models into guesswork, increasing hallucination risk and consuming tokens on layout deciphering rather than meaning extraction. He stated that DocLang delivers improved accuracy, reduced costs, lower token consumption, faster performance, and more consistent outputs. Initial benchmarks indicate cost reductions ranging from 4x to over 30x, depending on the model evaluated.

Governance advantages also feature prominently. Knisley noted that document provenance data and metadata are frequently stripped during file transfers; DocLang preserves this information natively.

ABBYY has published the DocLang Interactive Benchmark to demonstrate potential token savings. In one test, a PDF of IBM’s 2025 annual report required 8,421 input tokens and 512 output tokens, whereas the DocLang version required only 5,310 input tokens and 498 output tokens. The DocLang version also achieved lower latency (2.7 seconds versus 4.2 seconds) and higher quality: when processing the PDF, the AI missed a subsection and mangled a merged table.
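The relative savings in that single test can be worked out directly from the reported figures. The short Python sketch below does the arithmetic; the helper name `pct_reduction` and the dictionary layout are illustrative choices, and the numbers are the ones quoted above.

```python
def pct_reduction(before: float, after: float) -> float:
    """Return the percentage drop from `before` to `after`."""
    return (before - after) / before * 100


# Figures reported for the IBM annual report test.
pdf = {"input_tokens": 8421, "output_tokens": 512, "latency_s": 4.2}
doclang = {"input_tokens": 5310, "output_tokens": 498, "latency_s": 2.7}

for key in pdf:
    print(f"{key}: {pct_reduction(pdf[key], doclang[key]):.1f}% lower")

# Combined input and output tokens for each version.
total_pdf = pdf["input_tokens"] + pdf["output_tokens"]
total_doclang = doclang["input_tokens"] + doclang["output_tokens"]
print(f"total_tokens: {pct_reduction(total_pdf, total_doclang):.1f}% lower")
```

For this one document, the DocLang version uses roughly 37% fewer input tokens and about 35% fewer tokens overall, with latency down by about 36%. This is a single data point and is separate from the cost multiples cited from the initial benchmarks.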

“It is still early, and we will not overstate adoption,” Knisley added. “The standard is open and free to build upon, and the group is actively recruiting additional technology providers and enterprises. The early response has been encouraging, and we are optimistic about the trajectory.”
