8 votes

Extract clean(er), readable text from web pages via the Mercury Web Parser.

14 comments

  1. [8]
    DonQuixote
    Link
    Any chance we'll see a Firefox add-on?

    3 votes
    1. [6]
      masochist
      Link Parent
      Firefox has a built-in reader view already.

      3 votes
      1. [5]
        onyxleopard
        Link Parent
        Do you happen to know if Mozilla built the content extractors themselves or if they’re using a library? Automatic content extraction is really interesting, and I’m curious to learn of additional open-source implementations. I know of Apache Tika, and boilerpipe (both Java). There are dozens of other small GitHub projects that attempt to do this with heuristics, but most are not very good. I think it would be neat to train an automatic content extractor with supervised machine learning, but I’m too lazy to tag a training set.

        2 votes
        1. [2]
          cfabbro
          Link Parent
          https://github.com/mozilla/readability
          4 votes
          1. onyxleopard
            Link Parent
            Thanks. This implementation is based on heuristics (like several other amateur implementations I’ve seen). I believe a robust reader would require a statistical model (like some of boilerpipe’s extractors), and potentially several different models for different natural languages, in addition to a natural language classifier to decide which model to apply.
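For a concrete picture of what "based on heuristics" means here, below is a toy sketch of one common rule, link density: keep paragraphs that are long enough and are mostly plain text rather than link text. This is an illustration only, not the algorithm used by mozilla/readability or boilerpipe; the names `ParagraphScorer` and `extract_main_text` and both thresholds are made up for the example.

```python
from html.parser import HTMLParser


class ParagraphScorer(HTMLParser):
    """Record (text, link_chars) for every <p> element in a page."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.paragraphs = []
        self._in_p = False
        self._link_depth = 0
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # an unclosed <p> ends where the next one starts
            self._in_p = True
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
        elif tag == "a" and self._link_depth:
            self._link_depth -= 1

    def handle_data(self, data):
        if self._in_p:
            self._text.append(data)
            if self._link_depth:
                self._link_chars += len(data)

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if self._in_p and text:
            self.paragraphs.append((text, self._link_chars))
        self._in_p, self._text, self._link_chars = False, [], 0

    def finish(self):
        self.close()   # deliver any buffered text
        self._flush()  # record a trailing, unclosed paragraph
        return self.paragraphs


def extract_main_text(html, min_chars=40, max_link_density=0.3):
    """Keep paragraphs that are long and not dominated by link text."""
    scorer = ParagraphScorer()
    scorer.feed(html)
    return [
        text
        for text, link_chars in scorer.finish()
        if len(text) >= min_chars and link_chars / len(text) <= max_link_density
    ]
```

Hand-tuned thresholds like these are exactly what tends to break on unusual layouts or other languages, which is the motivation for a learned model.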

            2 votes
        2. [2]
          masochist
          Link Parent
          I don't, sorry. You could probably find out with some web searches. I'd suggest digging through the code but it's, uh, a mess. And there's so much of it.

          1 vote
    2. onyxleopard
      Link Parent
      Postlight had already made a Chrome extension. I imagine it wouldn’t take someone very long to make a Firefox extension if they already were familiar with building Firefox extensions.

  2. [6]
    Comment deleted by author
    Link
    1. [4]
      onyxleopard
      Link Parent
      I’m not sure it’s equivalent to Embedly (from a quick look at Embedly’s site, I can’t really tell what Embedly does, technically). The Mercury Web Parser is mainly about extracting content from web pages and representing them in a standardized structure. It preserves links to external media/sites, but I’ve mainly used it for extracting textual content. I don’t have a good sense of how it handles embedded images/videos etc. as I usually consider such things noise.

      2 votes
      1. [2]
        Comment deleted by author
        Link Parent
        1. onyxleopard
          Link Parent
          OK, yes, extracting metadata such as those you mentioned is something that the parser does. So, Tildes could use it (though someone would have to evaluate the accuracy/coverage of metadata for Mercury, Embedly, or any other such services to see how they stack up). Obviously now that Mercury is open source under friendly licenses, I think that is an advantage in a different dimension, but it’s not the only thing to consider.

          2 votes
      2. [2]
        Deimos
        Link Parent
        Embedly does a few different things overall, but Bauke is referring to their "Extract" API that I'm using, which pulls content and metadata from a lot of sites. For example, here's what it pulls out of another article that was recently submitted: https://embed.ly/docs/explore/extract?url=https%3A%2F%2Fwww.bbc.com%2Fnews%2Ftechnology-47408969

        2 votes
        1. onyxleopard
          Link Parent
          Yeah, Embedly extracts quite a few more fields than Mercury does. That said, I imagine that Mercury could be extended to do those things, too.

          1 vote
    2. balooga
      Link Parent
      Timely: I was just reading the thread about the Apple layoffs article that was published over a month ago, but the Embedly scrape didn't capture that, so people assumed it was breaking news. If there's a more reliable way available to extract link metadata, it's worth a closer look I'd say.

      2 votes
  3. onyxleopard
    Link
    When I first found out about Postlight’s Mercury Web Parser, I was thrilled. It’s by no means 100% accurate, and I’ve encountered some issues with non-English language site content, but overall, the Mercury Web Parser does what many 'reader' modes do: make text on the web more friendly to read. It also extracts some metadata that is nice to have. What’s really nice about this is that you don’t even need a browser! One inconvenience of the Mercury Web Parser, originally, was that it was closed source but made available as a free-to-use RESTful API. That API is going to be shut down in April, but Postlight has released the parser under fairly friendly licenses. If you wanted to, you could run the parser as a web service, just like Postlight had been doing.

    While the parser’s results are useful on their own, HTML is not ideal in all situations for me. I’m not a whiz at JavaScript, so I wrote a Python script to extend the results of the parser to suit my needs (this script originally made requests to the API, but now just uses the JSON results from the command-line driver).
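A minimal sketch of that kind of post-processing, using only the Python standard library. It is not the script mentioned above. It assumes the command-line driver's JSON has a `title` field and a `content` field holding the extracted HTML; check those field names against your own output. The names `_TextExtractor` and `parser_json_to_text` are made up for the example.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, breaking paragraphs at block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "blockquote",
                  "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Normalize whitespace within each line and drop empty lines.
        lines = (" ".join(line.split())
                 for line in "".join(self._chunks).splitlines())
        return "\n\n".join(line for line in lines if line)


def parser_json_to_text(raw_json):
    """Turn the parser's JSON (assumed fields: title, content) into plain text."""
    result = json.loads(raw_json)
    extractor = _TextExtractor()
    extractor.feed(result.get("content") or "")
    extractor.close()
    return {"title": result.get("title"), "text": extractor.text()}
```

To use it, pass in the JSON that the command-line driver prints for a page, for example by piping the driver's output to a script that calls `parser_json_to_text(sys.stdin.read())`.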

    2 votes