14 votes

Supporting Markdown search for LLMs

21 comments

  1. [5]
    smores
    Link
    This seems.. fraught. It's true that HTML is more verbose than markdown (and that much of that verbosity is unnecessary information for an LLM unconcerned with style or functionality of a website). But it's also true that HTML has much more semantic information than markdown — something that I would expect to be beneficial to LLMs. Markdown has extremely limited semantics (basically headers and not-headers), compared to the wide array of useful semantic elements available in HTML.

    It seems that perhaps a better strategy would be for LLM agents to have HTML pre-processing steps that clean up HTML before actually tokenizing? Strip out styles and scripts, remove class names and non-aria/semantic attributes, and maybe even only provide the LLM with the contents of the header and main elements, if they exist.
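
    A minimal sketch of the pre-processing step described above, using only Python's standard library. The names `HtmlCleaner` and `clean_html` and the attribute allow-list `KEEP_ATTRS` are hypothetical choices for illustration, not part of any existing agent or tool:

```python
import html
from html.parser import HTMLParser

# Assumption: these allow/deny lists are illustrative, not a standard.
DROP_ELEMENTS = {"script", "style", "noscript", "template"}
KEEP_ATTRS = {"href", "src", "alt", "title", "lang", "role", "datetime"}
LANDMARKS = {"header", "main"}


class HtmlCleaner(HTMLParser):
    """Re-emits HTML without scripts, styles, comments, or styling attributes."""

    def __init__(self, landmarks_only: bool):
        super().__init__(convert_charrefs=True)
        self.landmarks_only = landmarks_only
        self.out = []
        self.skip_depth = 0      # > 0 while inside a dropped element
        self.landmark_depth = 0  # > 0 while inside <header> or <main>

    def _emitting(self) -> bool:
        return not self.landmarks_only or self.landmark_depth > 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_ELEMENTS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in LANDMARKS:
            self.landmark_depth += 1
        if self._emitting():
            # Keep only semantic and aria-* attributes; class, id, style go away.
            kept = [(k, v) for k, v in attrs
                    if k in KEEP_ATTRS or k.startswith("aria-")]
            rendered = "".join(
                ' {}="{}"'.format(k, html.escape(v or "", quote=True))
                for k, v in kept
            )
            self.out.append(f"<{tag}{rendered}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/>: emit the start tag only.
        if tag not in DROP_ELEMENTS:
            self.handle_starttag(tag, attrs)
            if tag in LANDMARKS:
                self.landmark_depth -= 1

    def handle_endtag(self, tag):
        if tag in DROP_ELEMENTS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if self._emitting():
            self.out.append(f"</{tag}>")
        if tag in LANDMARKS and self.landmark_depth > 0:
            self.landmark_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and self._emitting():
            self.out.append(html.escape(data, quote=False))


def clean_html(source: str) -> str:
    """Prefer <header>/<main> content; fall back to the whole cleaned page."""
    for landmarks_only in (True, False):
        cleaner = HtmlCleaner(landmarks_only)
        cleaner.feed(source)
        cleaner.close()
        result = "".join(cleaner.out).strip()
        if result:
            return result
    return ""
```

    This keeps the element structure (headings, lists, links, landmarks) while dropping what an LLM is unlikely to need. A real pipeline would probably want a more forgiving parser and a longer attribute allow-list.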

    Markdown is neat, and useful in many contexts (like the Tildes comment box!), but it is not a good semantic document format. I think it would be a shame to run toward its use for representing documents on the web, especially if that means running away from HTML.

    6 votes
    1. archevel
      Link Parent
      I think most HTML is not good for providing meaningful semantic content anyway. XHTML tried to do this, but failed in my view. I think most text-rich websites, e.g. Wikipedia, could meaningfully be converted to markdown without losing anything of importance. I might be wrong, but I can't think of a counterexample.

      2 votes
    2. [3]
      skybrian
      Link Parent
      Markdown documents can use HTML tags. But I’m not sure it’s worth it for the sort of technical documentation that coding agents read.

      1. [2]
        smores
        Link Parent
        Well.. yes, sure (not all markdown parsers support this and it's almost always limited). But I'm responding to the premise of this post, which talks about providing a Markdown version of an already-HTML version of your website. If you want HTML, you can just serve your normal website haha

        2 votes
        1. skybrian
          Link Parent
          I've recently been working on what's effectively screen-scraping software. It's a bookmarklet that I use to select the quotes that I often include in the Tildes links I share. The main thing I find semantic HTML tags useful for is finding the main article (hopefully they use the <article> tag) and getting rid of all the navigation cruft surrounding the actual words in the article. And then I'm converting to Markdown.
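
          A rough sketch of that kind of pipeline, in Python rather than as a bookmarklet: find the `<article>` element, skip navigation cruft inside it, and emit Markdown. This is not the bookmarklet's actual code. The class name `ArticleToMarkdown`, the function `article_to_markdown`, and the small set of handled tags are assumptions made for illustration:

```python
import re
from html.parser import HTMLParser


class ArticleToMarkdown(HTMLParser):
    """Converts the contents of <article> elements to simple Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    SKIP = {"script", "style", "nav", "aside", "footer"}  # cruft inside the article
    INLINE = {"em": "*", "i": "*", "strong": "**", "b": "**"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.links = []          # stack of hrefs for open <a> tags
        self.article_depth = 0   # > 0 while inside <article>
        self.skip_depth = 0      # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
            return
        if self.article_depth == 0:
            return
        if tag in self.SKIP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        if tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag in ("p", "ul", "ol"):
            self.parts.append("\n\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.links.append(dict(attrs).get("href") or "")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag == "article":
            self.article_depth = max(0, self.article_depth - 1)
            return
        if self.article_depth == 0:
            return
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a" and self.links:
            self.parts.append("](" + self.links.pop() + ")")

    def handle_data(self, data):
        if self.article_depth > 0 and self.skip_depth == 0:
            # Collapse runs of whitespace, as a browser would when rendering.
            self.parts.append(re.sub(r"\s+", " ", data))


def article_to_markdown(source: str) -> str:
    """Returns Markdown for the page's <article>, or "" if there is none."""
    parser = ArticleToMarkdown()
    parser.feed(source)
    parser.close()
    text = "".join(parser.parts).strip()
    return re.sub(r"\n{3,}", "\n\n", text)
```

          Pages without an `<article>` tag produce an empty string here; a real tool would need a fallback heuristic for those.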

          Another example of stripping most HTML is the Firefox Readability extension.

          So, I don't see the original HTML as being all that useful. For technical documentation, it might have started out as Markdown anyway. Providing the source for the article is removing some extra translation steps.

          Maybe there are other articles that really take full advantage of HTML, though?

  2. [4]
    xk3
    Link
    It's great that people are doing this!

    Maybe someday there can be a browser that just loads these markdown files (and redirects README.md to AGENTS.md so that people can read without marketing copy and other BS)

    5 votes
    1. devalexwhite
      Link Parent
      It's like Gemini Protocol all over again (which I love)!

      5 votes
    2. mergesort
      Link Parent
      Heya, post-author here! Funny enough you're not the first person to say this after seeing my post. I hadn't considered it until now but I'm building a Reader Mode for the link-saving app I work on, so I may actually do something with this idea. 🤔

      1 vote
  3. skybrian
    Link
    From the article:

    Claude Code now automatically searches for markdown versions of websites, and other tools like Codex have followed suit. Yet despite its importance, few websites have implemented this technique. Here’s how to be one of them.

    4 votes
  4. TurtleCracker
    Link
    I would rather read the websites in markdown anyways and just let whatever viewer style it. This could be better for humans too. Cuts out all the visual noise.

    2 votes
  5. [7]
    Narry
    Link
    I think I have some questions about this:

    Is this more likely to make my website surface as part of AI search summaries? Or for those auto-browsing AI agents? Will it make my page rank higher in human-generated search results? Will it help with accessibility for people using reader software?

    In short: other than bandwidth (which is not nothing!) what is the benefit? I read the whole article, but admittedly, I did start to glaze over a little towards the end when he was talking about how to implement it. If he mentioned any of this there, I didn’t spot it.

    1 vote
    1. [4]
      kacey
      Link Parent
      I think the implied benefit is that you're doing your part to allow tech companies to extract training data from your content more efficiently, which is ... framed as a good thing?

      As you can see adding support for the text/markdown Accept header is straightforward and easy! And it’s beneficial: LLMs get more accurate information, they use fewer tokens, and your website saves on bandwidth costs.
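
      For anyone who skipped the implementation part, the server-side decision can be sketched roughly like this. It is not the article's code. The function names `parse_accept`, `wants_markdown`, and `response_headers` are hypothetical, and the rule "serve Markdown only when `text/markdown` is listed explicitly and ranked at least as high as `text/html`" is an assumption of this sketch, not something the article specifies:

```python
def parse_accept(header: str) -> dict[str, float]:
    """Parses an Accept header into {media_type: q_value}."""
    prefs = {}
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_type = parts[0].lower()
        if not media_type:
            continue
        q = 1.0  # per HTTP semantics, a missing q parameter means 1.0
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        prefs[media_type] = q
    return prefs


def wants_markdown(accept_header: str) -> bool:
    """True when the client explicitly asks for Markdown at least as much as HTML."""
    prefs = parse_accept(accept_header)
    markdown_q = prefs.get("text/markdown", 0.0)
    html_q = prefs.get("text/html", 0.0)
    # Simplification: wildcards like */* are ignored, so browsers keep getting HTML.
    return markdown_q > 0 and markdown_q >= html_q


def response_headers(accept_header: str) -> dict[str, str]:
    """Picks a Content-Type and marks the response as varying by Accept."""
    content_type = "text/markdown" if wants_markdown(accept_header) else "text/html"
    return {
        "Content-Type": f"{content_type}; charset=utf-8",
        # Tells caches that the same URL has more than one representation.
        "Vary": "Accept",
    }
```

      A typical browser header such as `text/html,application/xhtml+xml,*/*;q=0.8` still gets HTML under this rule, while a client sending `text/markdown, text/html;q=0.9` gets Markdown.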

      Just thinking out loud, though, this would be a spectacular way to subtly poison LLM datasets.

      4 votes
      1. [3]
        skybrian
        (edited)
        Link Parent
        Not training data, or at least not as its primary use. Coding agents do web searches to look up a library's current documentation. There have likely been some software releases since the LLM was trained.

        I suppose providing deliberately bad documentation to make coding agents write bad code is an option for malicious software providers, but if they're going to install your library anyway, that's a potential remote code execution attack.

        Supply chain attacks are a problem that package providers like npm have to deal with and naive coding agents are likely to make the problem worse.

        2 votes
        1. [2]
          kacey
          Link Parent
          I suppose providing deliberately bad documentation to make coding agents write bad code is an option for malicious software providers, but if they're going to install your library anyway, that's a potential remote code execution attack.

          I was thinking of this as a way to inject malicious code into uses of someone else's library, taking the example of a coding agent searching for library usage. One could imagine SEO'ing one's way to the top of the search results for a popular library, then heuristically detecting coding agents in order to serve them snippets with known flaws (e.g. intentionally flawed RNG seeds, known bad crypto configs, etc.). Again, though, just thinking out loud -- this is a brave new world, so I'm not sure where best practices are going to land.

          Not training data ...

          Hmm. I'm curious what the terms of service are for these platforms, especially around search results vs. generated content vs. user input. It seems like a waste of bandwidth and processing time to throw away useful information, such as those scraped pages. Perhaps they're already drowning in training data and don't want to take on the legal liability of sourcing more?

          1. skybrian
            Link Parent
            Yeah, good point that it might be someone other than the author of the library. I don't know how much these specialized web searches are locked down.

            It might be used for training as well, depending on the terms of service. I'm just saying that "training data" isn't the primary reason coding agents do web searches.

            2 votes
    2. skybrian
      Link Parent
      The article talks about a specialized search engine used by coding agents when they are writing code. Providing Markdown documentation for APIs and libraries seems like the most obvious application. This would be for commercial vendors or open source libraries (there is overlap) who want to be more popular with software developers who use coding agents.

      2 votes
    3. mergesort
      Link Parent
      Heya, post-author here! I wrote this post with the underlying assumption that you want to share your content more easily with LLMs. If not, then of course this won't be particularly useful or beneficial.

      Personally, as you can see from my other writing, I feel positively about the impact AI systems will have on individual learning and growth for people that exercise creativity and critical thinking — but very torn and lean negative on the greater societal implications. That said, my writing is licensed CC (BY) because I write to have people read my work, and to reuse my ideas as they see fit to improve their own lives.

      I spent a lot of time debating whether this improvement loop should apply to agentic systems like ChatGPT, and ultimately came to the conclusion that if this medium is the way that people will be consuming my writing then I want my writing to be there. There are definitely downstream effects of this decision, but ultimately if I'm going to be doing that — to bring it back to the post — I want my writing to be as token-efficient as possible. :)

      To answer your other question about AI search summaries/page ranking, I wrote a complementary post last week about how to optimize your website for AI search — which may provide more context for this general arc of thinking I was going through as I wrote this post about optimizing your website for token-efficiency. All of the other benefits you mentioned also apply! But they weren't my primary goal.

      2 votes
  6. [3]
    unkz
    Link
    Just like JSON is better in almost every way than XML for representing data, so too is markdown better in almost every way for representing text.

    I don’t know how necessary this is though. When I scrape data for LLM ingestion, I always put it through an HTML to markdown conversion step anyway. Is trying to add yet another “standard” going to help? Well-annotated HTML is trivial to markdownify, so continuing to advocate for well-annotated HTML documents would probably be just as useful, with less friction.

    1 vote
    1. [2]
      skybrian
      Link Parent
      I wouldn't go quite that far. Markdown isn't a good choice for scientific papers (due to the math), interactive illustrations, web UI, or HTML mockups. For documentation and news articles it works fine.

      1 vote
      1. unkz
        Link Parent
        I mean interactive illustrations, web UI, and HTML aren’t text. LaTeX in markdown is super convenient and fairly well supported for math.

        1 vote