11 votes

Supporting Markdown Search For LLMs

14 comments

  1. skybrian
    From the article:

    Claude Code now automatically searches for markdown versions of websites, and other tools like Codex have followed suit. Yet despite its importance, few websites have implemented this technique. Here’s how to be one of them.

    4 votes
  2. [2]
    xk3
    It's great that people are doing this!

    Maybe someday there can be a browser that just loads these markdown files (and redirects README.md to AGENTS.md so that people can read without marketing copy and other BS)

    3 votes
    1. devalexwhite
      It's like Gemini Protocol all over again (which I love)!

      3 votes
  3. [5]
    smores
    This seems.. fraught. It's true that HTML is more verbose than markdown (and that much of that verbosity is unnecessary information for an LLM unconcerned with style or functionality of a website). But it's also true that HTML has much more semantic information than markdown — something that I would expect to be beneficial to LLMs. Markdown has extremely limited semantics (basically headers and not-headers), compared to the wide array of useful semantic elements available in HTML.

    It seems that perhaps a better strategy would be for LLM agents to have HTML pre-processing steps that clean up HTML before actually tokenizing? Strip out styles and scripts, remove class names and non-aria/semantic attributes, and maybe even only provide the LLM with the contents of the header and main elements, if they exist.
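
A minimal sketch of the pre-processing step described above, using only Python's standard-library `html.parser`. The attribute allowlist (`KEPT_ATTRS`) and the function name `clean_html` are assumptions made for illustration, not something any agent is known to use, and the step of keeping only the header and main elements is left out to keep the sketch short.

```python
from html import escape
from html.parser import HTMLParser

# Elements whose whole contents are dropped.
SKIPPED = {"script", "style"}
# Void elements never get a closing tag.
VOID = {"br", "hr", "img", "input", "meta", "link"}
# Assumption: a small allowlist of attributes that carry meaning.
KEPT_ATTRS = {"href", "src", "alt", "role", "lang"}


class HtmlCleaner(HTMLParser):
    """Re-emits HTML without scripts, styles, comments, or presentational attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Keep allowlisted and aria-* attributes; class, style, id, data-* are dropped.
        kept = [(k, v) for k, v in attrs if k in KEPT_ATTRS or k.startswith("aria-")]
        rendered = "".join(f' {k}="{escape(v or "", quote=True)}"' for k, v in kept)
        self.out.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # Character references were decoded by the parser, so re-escape the text.
            self.out.append(escape(data, quote=False))


def clean_html(html: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(html)
    cleaner.close()
    return "".join(cleaner.out)
```

Comments and the doctype disappear as a side effect, because the parser's default handlers for them do nothing.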

    Markdown is neat, and useful in many contexts (like the Tildes comment box!), but it is not a good semantic document format. I think it would be a shame to run toward its use for representing documents on the web, especially if that means running away from HTML.

    3 votes
    1. archevel
      I think most HTML is not good for providing meaningful semantic content anyway. XHTML tried to do this, but failed in my view. I think most text-rich websites, e.g. Wikipedia, could meaningfully be converted to markdown without losing anything of importance. Might be wrong, but I can't think of a counterexample.

      2 votes
    2. [3]
      skybrian
      Markdown documents can use HTML tags. But I’m not sure it’s worth it for the sort of technical documentation that coding agents read.

      1. [2]
        smores
        Well.. yes, sure (not all markdown parsers support this and it's almost always limited). But I'm responding to the premise of this post, which talks about providing a Markdown version of an already-HTML version of your website. If you want HTML, you can just serve your normal website haha

        1. skybrian
          I've recently been working on what's effectively screen-scraping software. It's a bookmarklet that I use to select the quotes that I often include in the Tildes links I share. The main thing I find semantic HTML tags useful for is finding the main article (hopefully they use the <article> tag) and getting rid of all the navigation cruft surrounding the actual words in the article. And then I'm converting to Markdown.
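
As a rough illustration of the approach described above (find the `<article>` element, drop the navigation cruft, convert what remains to Markdown), here is a Python sketch built on the standard-library `html.parser`. It is not the bookmarklet itself, which would be JavaScript running in the browser. The set of ignored elements and the handful of supported block tags are assumptions chosen to keep the example short.

```python
from html.parser import HTMLParser

HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCKS = set(HEADINGS) | {"p", "li"}
# Assumption: elements treated as cruft even when they appear inside <article>.
IGNORED = {"nav", "aside", "script", "style"}


class ArticleToMarkdown(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.article_depth = 0  # > 0 while inside <article>
        self.ignored_depth = 0  # > 0 while inside an ignored element
        self.blocks = []        # finished Markdown blocks
        self.prefix = ""        # Markdown prefix of the block being collected
        self.text = []          # text fragments of the block being collected

    def _active(self):
        return self.article_depth > 0 and self.ignored_depth == 0

    def _flush(self):
        # Collapse runs of whitespace and store the block if it has any text.
        content = " ".join("".join(self.text).split())
        if content:
            self.blocks.append(self.prefix + content)
        self.text = []
        self.prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.article_depth += 1
        elif tag in IGNORED:
            self.ignored_depth += 1
        elif tag in BLOCKS and self._active():
            self._flush()
            self.prefix = HEADINGS.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag == "article":
            self._flush()
            self.article_depth = max(0, self.article_depth - 1)
        elif tag in IGNORED:
            self.ignored_depth = max(0, self.ignored_depth - 1)
        elif tag in BLOCKS:
            self._flush()

    def handle_data(self, data):
        if self._active():
            self.text.append(data)


def article_to_markdown(html: str) -> str:
    parser = ArticleToMarkdown()
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.blocks)
```

Inline markup such as `<em>` or `<a>` is flattened to plain text here; a real converter would map those to Markdown emphasis and links.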

          Another example of stripping most HTML is the Firefox Readability extension.

          So, I don't see the original HTML as being all that useful. For technical documentation, it might have started out as Markdown anyway. Providing the source for the article is removing some extra translation steps.

          Maybe there are other articles that really take full advantage of HTML, though?

  4. [6]
    Narry
    I think I have some questions about this:

    Is this more likely to make my website surface as part of AI search summaries? Or for those auto-browsing AI agents? Will it make my page rank higher in human-generated search results? Will it help with accessibility for people using reader software?

    In short: other than bandwidth (which is not nothing!) what is the benefit? I read the whole article, but admittedly, I did start to get a little glossy towards the end when he was talking about how to implement it. If he mentioned any of this there, I didn’t spot it.

    1 vote
    1. [4]
      kacey
      I think the implied benefit is that you're doing your part to allow tech companies to extract training data from your content more efficiently, which is ... framed as a good thing?

      As you can see adding support for the text/markdown Accept header is straightforward and easy! And it’s beneficial: LLMs get more accurate information, they use fewer tokens, and your website saves on bandwidth costs.
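
For readers who skimmed the implementation part: the quoted passage refers to HTTP content negotiation on the `Accept` header. The sketch below is not the article's code. It is a framework-free Python illustration, and the function names and the policy of serving Markdown only when `text/markdown` is requested explicitly are assumptions.

```python
def parse_accept(header: str) -> dict[str, float]:
    """Map each media range in an Accept header to its q-value (default 1.0)."""
    prefs = {}
    for part in header.split(","):
        pieces = [p.strip() for p in part.split(";")]
        media_type = pieces[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in pieces[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value.strip())
                except ValueError:
                    q = 0.0  # treat a malformed q-value as "not acceptable"
        prefs[media_type] = q
    return prefs


def quality(prefs: dict[str, float], media_type: str) -> float:
    """q-value for media_type, where the most specific matching range wins."""
    main = media_type.split("/")[0]
    for candidate in (media_type, f"{main}/*", "*/*"):
        if candidate in prefs:
            return prefs[candidate]
    return 0.0


def choose_representation(accept_header: str) -> str:
    prefs = parse_accept(accept_header)
    # Assumption: Markdown is served only when asked for by name, so browsers
    # that send */* keep getting HTML.
    markdown_q = prefs.get("text/markdown", 0.0)
    html_q = quality(prefs, "text/html")
    if markdown_q > 0 and markdown_q >= html_q:
        return "text/markdown"
    return "text/html"
```

A request with `Accept: text/markdown, text/html;q=0.9` would get the Markdown version, while a typical browser header would still get HTML. A server doing this should also send `Vary: Accept` so that caches keep the two representations apart.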

      Just thinking out loud, though, this would be a spectacular way to subtly poison LLM datasets.

      2 votes
      1. [3]
        skybrian
        Not training data, or at least not as its primary use. Coding agents do web searches to look up a library's current documentation. There have likely been some software releases since the LLM was trained.

        I suppose providing deliberately bad documentation to make coding agents write bad code is an option for malicious software providers, but if they're going to install your library anyway, that's a potential remote code execution attack.

        Supply chain attacks are a problem that package providers like npm have to deal with and naive coding agents are likely to make the problem worse.

        2 votes
        1. [2]
          kacey
          I suppose providing deliberately bad documentation to make coding agents write bad code is an option for malicious software providers, but if they're going to install your library anyway, that's a potential remote code execution attack.

          I was thinking that this would be to inject malicious code into usage of someone "else's" library, to use the example of a coding agent searching for library usages. One could imagine SEO'ing one's way to the top of the search results for a popular library, then heuristically detecting coding agents in order to serve them snippets with known flaws (e.g. intentionally flawed RNG seeds, known bad crypto configs, etc.). Again, though, just thinking out loud -- this is a brave new world, so I'm not sure what the best practices are going to coalesce around.

          Not training data ...

          Hmm. I'm curious what the terms of service are for these platforms, especially around search results vs. generated content vs. user input. It seems like a waste of bandwidth and processing time to throw away useful information, such as those scraped pages. Perhaps they're already drowning in training data and don't want to take on the legal liability of sourcing more?

          1. skybrian
            Yeah, good point that it might be someone other than the author of the library. I don't know how much these specialized web searches are locked down.

            It might be used for training as well, depending on the terms of service. I'm just saying that "training data" isn't the primary reason coding agents do web searches.

    2. skybrian
      The article talks about a specialized search engine used by coding agents when they are writing code. Providing Markdown documentation for APIs and libraries seems like the most obvious application. This would be for commercial vendors or open source libraries (there is overlap) who want to be more popular with software developers who use coding agents.

      1 vote