llms.txt for documentation

This is a brief overview of an investigation into the usefulness of the llms.txt standard proposal: what it is, the current state of adoption, the results of some benchmark tests, and an overview of implementation options for Canonical documentation (with our recommended implementation to round things off).

What’s llms.txt?

llms.txt is a proposed standard for simplifying the way LLMs parse and consume web content, with documentation being an obvious example where it could be useful.

How does it actually work? The short version is that it allows LLMs to parse content in the form of Markdown files (which they already understand well, Markdown being their lingua franca) instead of having them plough through the HTML source code of web pages. An llms.txt implementation on a website usually consists of these three types of artifacts:

  • llms.txt file in site root: a Markdown-formatted link directory to all pages on the site (or just the relevant ones)
  • page.md (or page.html.md or page/index.md) files with the content of individual pages in Markdown
  • (optionally) llms-full.txt in site root: a concatenation of all page.md files
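
To make this concrete, here is what a minimal llms.txt might look like; the project name, summary, and links below are made up for illustration and simply follow the proposed format, not any real docs set:

# Example Docs

> One-paragraph summary of what this documentation set covers.

## Guides

- [Installation](https://example.com/docs/installation.html.md): How to install the product
- [Configuration](https://example.com/docs/configuration.html.md): Available settings and defaults

## Reference

- [CLI reference](https://example.com/docs/cli.html.md): Commands and options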

Parsing HTML efficiently is a notorious nightmare, but the bigger problem is that web pages usually carry all kinds of non-content fluff (navigation elements, inline styling, headers, footers, …), all of which the model has to parse to get at the useful content. llms.txt sidesteps this problem by offering a clean, easily readable view of the content that really matters, without any fluff.

Mintlify, which adopted llms.txt for all docs published through its platform, has a detailed article with examples that tries to persuade you that it does work: Real llms.txt examples from leading tech companies (and what they got right).

Do websites actually use it?

While it’s still only a proposal, it is being widely adopted: a web search for “llms.txt” yields a plethora of articles that either sing its praises or dismiss it as a dead end (the former heavily outweigh the latter, but given the hectic pace of development in the LLM space, there’s no guarantee it will develop into a real standard).

Many major players in the AI space have adopted it (for example, here’s Anthropic’s llms.txt for its Claude platform), and there are a bunch of community-maintained directories that list all the places that offer llms.txt (e.g. llms-txt-hub, llmstxt.site, llmstxt.directory).

Significantly, though, no major purveyor of LLMs admits to using it for their models. What’s more, John Mueller of Google seemingly threw a wet blanket over llms.txt last year by stating “FWIW no AI system currently uses llms.txt.” (Bluesky).

We can speculate why that might be, but the most sensible reason is probably that LLM “teachers” don’t want their models to consume content that website publishers prepare specifically for them – instead, they want the stuff that people read when they visit the site. In other words, LLMs don’t want to be gamed.

Well, why bother then?

There is still a compelling use case for llms.txt: not for models to learn from, but for (coding) clients and agents to consult on demand. When you want to, for example, instruct your client/agent to base its design, planning, and implementation decisions on the available docs about a particular concept, pointing it to an llms.txt is far more efficient (cheaper in tokens, as well as surer in interpretation) than having it blunder about a raw website.
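
As a rough illustration of what “on demand” means here, this is a minimal Python sketch of how a client or agent could consume an llms.txt-enabled docs set: fetch the link directory, pick out the Markdown pages it needs, and pull only those. The base URL and the “install” filter are arbitrary placeholders, not part of any real setup:

import re
import urllib.request

# Hypothetical docs root; any llms.txt-enabled site works the same way.
BASE = "https://example.com/docs"

def fetch(url: str) -> str:
    """Download a URL and return its body as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")

# 1. Grab the link directory.
index = fetch(f"{BASE}/llms.txt")

# 2. Pull out the Markdown links ([title](url) entries).
links = re.findall(r"\[([^\]]+)\]\((\S+?\.md)\)", index)

# 3. Fetch only the pages relevant to the task at hand,
#    e.g. the first page whose title mentions "install".
for title, url in links:
    if "install" in title.lower():
        print(fetch(url)[:500])  # hand this to the model as context
        break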

Benchmarks

How much more efficient? I ran some simple benchmarks on real Ubuntu docs. They make it clear that the llms.txt-formatted Markdown content is much easier for LLMs to chew on. I tested the following:

  • Response speed
  • Token consumption/efficiency
  • Comprehension accuracy (against a predefined set of questions and ideal answers)

And I took four different types of input content to compare:

  • html_raw: Raw HTML (including all the meta, navigation, and page furniture ballast)
  • html_article: HTML without the ballast (i.e. only the article content but with HTML tags intact)
  • html_stripped: Stripped HTML (article content only with no tags, i.e. text with no markup)
  • markdown: Markdown generated according to the llms.txt spec

Note: I included the html_article and html_stripped versions only because I found the comparison interesting. In real life, the comparison that matters is between Markdown and raw, unmodified HTML.

The input content was from the Ubuntu for developers docs set. For llms.txt Markdown, I took the llms-full.txt file; for raw HTML, I used a full dump of all HTML pages (in the same order as in llms-full.txt). The stripped versions were derived directly from the raw HTML dump.
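
For reference, the html_article and html_stripped variants can be derived from a raw page along these lines; this is only a sketch using BeautifulSoup, and the assumption that the content lives in an <article> element depends on the site’s theme:

from bs4 import BeautifulSoup  # pip install beautifulsoup4

def split_variants(raw_html: str) -> tuple[str, str]:
    """Return (html_article, html_stripped) for one raw HTML page."""
    soup = BeautifulSoup(raw_html, "html.parser")

    # Assumption: the docs theme wraps page content in an <article> element;
    # fall back to <body> if it does not.
    article = soup.find("article") or soup.body

    html_article = str(article)                          # tags kept, page furniture gone
    html_stripped = article.get_text(" ", strip=True)    # plain text, no markup
    return html_article, html_stripped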

I only tested with freely available models, and I still ran into all kinds of rate limiting with Gemini and Mixtral, so, in the end, I settled on Llama and Mistral.

The results are pretty convincing.

TL;DR

At the same character limit, Markdown covers significantly more documentation content than raw HTML (because HTML wastes tokens on markup, navigation, scripts, and CSS rather than on information). For LLM pipelines that retrieve or embed documentation, the llms.txt (Markdown) format is substantially more token-efficient, and it produces higher comprehension scores at equal cost.

Dimension      | Markdown | HTML (stripped) | HTML (article) | HTML (raw)
Token cost     | 68,855   | 65,023          | 157,840        | 482,860
Best Q&A score | 24.0%    | 20.0%           | 13.1%          | 5.8%

Response speed

Model                | Char limit | Format        | Latency | Input tokens (est.)
llama-3.1-8b-instant | 8,000      | markdown      | 20.6 s  | 1,680
llama-3.1-8b-instant | 8,000      | html_stripped | 20.7 s  | 1,842
llama-3.1-8b-instant | 8,000      | html_article  | 20.8 s  | 1,969
llama-3.1-8b-instant | 8,000      | html_raw      | 20.7 s  | 2,912
mistral-small-latest | 16,000     | markdown      | 1.9 s   | 3,540
mistral-small-latest | 16,000     | html_stripped | 2.5 s   | 3,692
mistral-small-latest | 16,000     | html_article  | 3.0 s   | 3,948
mistral-small-latest | 16,000     | html_raw      | 1.9 s   | 5,547
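
For context, the latency figures boil down to timing a question-answering call per format, with the input truncated to the given character limit. A simplified version of that measurement looks roughly like this; ask_model stands in for whatever chat-completion call the provider’s SDK exposes and is not a real API:

import time

def time_response(ask_model, question: str, context: str, char_limit: int) -> float:
    """Time one Q&A call with the context cut to a fixed character budget."""
    prompt = (
        f"Answer using only this documentation:\n\n{context[:char_limit]}\n\n"
        f"Q: {question}"
    )
    start = time.perf_counter()
    ask_model(prompt)  # placeholder for the actual SDK call
    return time.perf_counter() - start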

Token efficiency

(For this I used the cl100k_base encoding from the tiktoken tokenizer library.)

Format        | Chars     | Tokens  | Chars / token
markdown      | 285,681   | 68,855  | 4.15
html_stripped | 258,661   | 65,023  | 3.98
html_article  | 559,860   | 157,840 | 3.55
html_raw      | 1,587,650 | 482,860 | 3.29
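
Counting the tokens for each dump is only a few lines with tiktoken; the file names here are illustrative:

import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for name in ("markdown.txt", "html_stripped.txt", "html_article.txt", "html_raw.txt"):
    with open(name, encoding="utf-8") as f:
        text = f.read()
    tokens = len(enc.encode(text))
    print(f"{name}: {len(text):,} chars, {tokens:,} tokens, {len(text) / tokens:.2f} chars/token")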

Q&A accuracy

Format        | Score | Bar
markdown      | 21.9% | ██████████
html_stripped | 17.6% | ████████
html_article  | 8.6%  | ████
html_raw      | 4.9%  | ██

This is a rather rudimentary test based on keyword matching (hits / total keywords → ratio of 0.0 to 1.0), but the larger the tested dataset, the better it captures how easy it is for a model to get to relevant info.
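
In code, the scoring amounts to little more than the keyword-hit ratio described above; the question/keyword layout is an assumption about how such a test set might be structured:

def score_answer(answer: str, keywords: list[str]) -> float:
    """Return the fraction of expected keywords found in the model's answer."""
    if not keywords:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for kw in keywords if kw.lower() in answer_lower)
    return hits / len(keywords)

# Example: 3 of the 4 expected keywords appear in the answer -> 0.75
score_answer("Ubuntu ships GCC, Rust and Python toolchains", ["gcc", "rust", "python", "java"])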

Benchmark summary

  • markdown beats html_raw by 4–6× on Q&A accuracy at equal character budgets
  • html_raw costs 7× more tokens than markdown for identical content
  • (Rather obvious) Stripping tags beats keeping them
  • (Maybe not so obvious) Markdown still performs slightly better than plain, no-tags text

What’s available for Sphinx/Read the Docs?

Read the Docs documentation talks about llms.txt support but only to outline how and where the llms.txt artifacts should be added. It leaves the creation/generation of those artifacts to the user. It also mentions the possibility of using the sphinx-llm extension to automate the generation.

There are other solutions that can do this. Here’s a quick overview of the ones I found to be reasonable candidates (in other words, this is not meant to be an exhaustive list of available solutions):

sphinx-llm

sphinx-llm: A simple and straightforward implementation that makes use of the sphinx-markdown-builder extension to generate the Markdown files for individual docs pages. I find this extension to be the best choice for our needs because it’s lightweight and yet still reasonably configurable.

During the past month or so, as we’ve been looking into what it would take for Canonical docs to start supporting llms.txt, I have submitted a number of fixes and improvements to the extension to address the feedback I’ve been receiving from @sally-makin, @akonev, and @tmihoc. Thanks to the responsive maintainer of the extension, these have all been merged now.

sphinx-llms-txt

sphinx-llms-txt: Similar in scope to sphinx-llm but much more complex in implementation and configuration. It supports more configuration options than sphinx-llm, but it is also (dis)proportionately harder to set up and pulls in some heavy-duty dependencies, such as CMake, to do its job. For the given purpose, I think it’s overkill. (There’s also a fork of this extension with a few tweaks: sphinx-llms-txt-rw.)

llms-txt-action

llms-txt-action: This is not a Sphinx extension. Rather, it’s a GitHub Action that generates the llms.txt artifacts for inclusion with your docs build. It supports more than just Sphinx, which is useful in general but not to us. And because it’s not a Sphinx extension, it would require adding more logic to readthedocs.yaml and to our Makefile, making the whole setup more complex.

sphinx-llms-txt-link

sphinx-llms-txt-link: This is just an add-on utility: it injects a link to the Markdown-rendered version of each page directly into your docs-page HTML, thus making the llms.txt artifacts easily indexable. I’m not sure whether that is useful or desirable for us, but it’s worth mentioning.

Serving the docs

As a proof-of-concept, I put together (with Copilot CLI and Claude Sonnet) a simple MCP server that takes a list of llms.txt-enabled docs sets as configuration parameters and makes these available to AI (coding) clients. I tested it with Ubuntu for developers and Ubuntu project docs in Copilot CLI and Claude Code, and I liked the results. You can check it out at github.com/rkratky/docshub.

Once we start publishing the llms.txt artifacts, we should consider adding them to the aforementioned directories (for some of which dedicated MCP servers exist, too). This would also most likely call for a central, all-Canonical llms.txt file that would function as a link directory for all the docs-set-specific llms.txt (and llms-full.txt) files we would have.

What next?

The Docs team will go ahead and enable llms.txt serving on Canonical docs. It’s a non-intrusive thing, the generation is relatively cheap* in terms of processing power and time, and there seem to be no obvious disadvantages to having it alongside our HTML-rendered docs.

We will want to incorporate it into the Docs Starter Pack sometime soon, but in the meantime, individual docs sets can enable it on their own by adding the following to their configuration:

conf.py

extensions = [
    # whatever else you have here
    "sphinx_llm.txt",
]

# sphinx-llm config
llms_txt_suffix_mode = "url-suffix"
# Short description of your docs set:
llms_txt_description = (
    "This documentation provides guidance for using the Ubuntu Desktop "
    "Linux distribution as a development platform. The guides focus on "
    "setting up and using the Ubuntu system as a workstation for developers, "
    "with an emphasis on the following toolchains: Python, Golang, Rust, "
    "GCC, Clang, .NET, and Java."
)

# sphinx-markdown-builder config; URL to the root of docs with no trailing /
# e.g.:
markdown_http_base = "https://documentation.ubuntu.com/ubuntu-for-developers"

requirements.txt

sphinx-llm

* One of the improvements I implemented for the sphinx-llm extension is the option to disable the generation of the llms.txt artifacts on a per-build basis. There’s no need to have Sphinx spend cycles (re)generating all the files while, for example, using sphinx-autobuild to check small edits in the docs. This can be added to our Makefile build targets, as I’ve done for the Ubuntu project docs Makefile (note the SPHINXOPTS_NOLLM variable).

If it becomes clear that llms.txt is indeed a dead-end street, it would be simple to ditch the support and/or implement whatever nascent standard comes next.
