8 votes

Extract clean(er), readable text from web pages via the Mercury Web Parser.

Posted February 28, 2019 by onyxleopard

Tags: open source, mercury parser, web scraping, readability, reader, semantic web

https://github.com/postlight/mercury-parser

Link information

Title: postlight/mercury-parser

14 comments

[8]
DonQuixote
February 28, 2019
Any chance we'll see a Firefox add-on?

3 votes
1. [6]
  masochist
  February 28, 2019
  Firefox has a built-in reader view already.
  
  3 votes
  1. [5]
    onyxleopard (OP)
    February 28, 2019
    Do you happen to know if Mozilla built the content extractors themselves or if they’re using a library? Automatic content extraction is really interesting, and I’m curious to learn of additional open-source implementations. I know of Apache Tika, and boilerpipe (both Java). There are dozens of other small GitHub projects that attempt to do this with heuristics, but most are not very good. I think it would be neat to train an automatic content extractor with supervised machine learning, but I’m too lazy to tag a training set.
    
    2 votes
    
    [2]
    cfabbro
    February 28, 2019
    https://github.com/mozilla/readability
    
    4 votes
    
    onyxleopard (OP)
    February 28, 2019
    Thanks. This implementation is based on heuristics (like several other amateur implementations I’ve seen). I believe a robust reader would require a statistical model (like some of boilerpipe’s extractors), and potentially several different models for different natural languages, in addition to a natural language classifier to decide which model to apply.
    
    2 votes
    
    [2]
    masochist
    February 28, 2019
    I don't, sorry. You could probably find out with some web searches. I'd suggest digging through the code but it's, uh, a mess. And there's so much of it.
    
    1 vote
    
    onyxleopard (OP)
    February 28, 2019
    @cfabbro hooked me up already.
    
    2 votes
2. onyxleopard (OP)
  February 28, 2019
  Postlight has already made a Chrome extension. I imagine it wouldn’t take someone very long to make a Firefox extension if they were already familiar with building Firefox extensions.
[6]

Comment deleted by author
1. [4]
  onyxleopard (OP)
  February 28, 2019
  I’m not sure it’s equivalent to Embedly (from a quick look at Embedly’s site, I can’t really tell what Embedly does, technically). The Mercury Web Parser is mainly about extracting content from web pages and representing them in a standardized structure. It preserves links to external media/sites, but I’ve mainly used it for extracting textual content. I don’t have a good sense of how it handles embedded images/videos etc. as I usually consider such things noise.
  
  2 votes
  1. [2]
    
    Comment deleted by author
    
    onyxleopard (OP)
    February 28, 2019
    OK, yes, extracting metadata such as those you mentioned is something that the parser does. So, Tildes could use it (though someone would have to evaluate the accuracy/coverage of metadata for Mercury, Embedly, or any other such service to see how they stack up). Now that Mercury is open source under friendly licenses, I think that is an advantage along a different dimension, but it’s not the only thing to consider.
    
    2 votes
  2. [2]
    Deimos
    February 28, 2019
    Embedly does a few different things overall, but Bauke is referring to their "Extract" API that I'm using, which pulls content and metadata from a lot of sites. For example, here's what it pulls out of another article that was recently submitted: https://embed.ly/docs/explore/extract?url=https%3A%2F%2Fwww.bbc.com%2Fnews%2Ftechnology-47408969
    
    2 votes
    
    onyxleopard (OP)
    February 28, 2019
    Yeah, Embedly extracts quite a few more fields than Mercury does. That said, I imagine that Mercury could be extended to do those things, too.
    
    1 vote
2. balooga
  February 28, 2019
  Timely: I was just reading the thread about the Apple layoffs article that was published over a month ago, but the Embedly scrape didn't capture that, so people assumed it was breaking news. If there's a more reliable way available to extract link metadata, it's worth a closer look I'd say.
  
  2 votes
onyxleopard (OP)
February 28, 2019
When I first found out about Postlight’s Mercury Web Parser, I was thrilled. It’s by no means 100% accurate, and I’ve encountered some issues with non-English language site content, but overall, the Mercury Web Parser does what many 'reader' modes do: make text on the web more friendly to read. It also extracts some metadata that is nice to have. What’s really nice about this is that you don’t even need a browser! One inconvenience of the Mercury Web Parser, originally, was that it was closed source but made available as a free-to-use RESTful API. That API is going to be shut down in April, but Postlight has released the parser under fairly friendly licenses. If you wanted to, you could run the parser as a web service, just like Postlight had been doing.

While the parser’s results are useful on their own, HTML is not ideal in all situations for me. I’m not a whiz at JavaScript, so I wrote a Python script to extend the results of the parser to suit my needs (this script originally made requests to the API, but now just uses the JSON results from the command-line driver).
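
To give a rough idea of what that kind of post-processing can look like, here is a minimal sketch (not the script mentioned above) that turns one JSON result into plain text using only the standard library. The field names `title`, `author`, `date_published`, and `content` are assumptions about the parser's output, and `html_to_text` and `render_result` are hypothetical helper names made up for this example.

```python
import json
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes and add a line break after block-level elements."""

    # Tags treated as block-level here; an illustrative list, not exhaustive.
    BLOCK_TAGS = {"p", "div", "li", "blockquote",
                  "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Strip tags from an HTML fragment and normalize whitespace per line."""
    parser = _TextExtractor()
    parser.feed(html or "")
    parser.close()
    lines = (" ".join(line.split())
             for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)


def render_result(raw_json):
    """Render one JSON result (assumed field names) as a plain-text document."""
    result = json.loads(raw_json)
    header = [
        f"Title: {result.get('title') or '(unknown)'}",
        f"Author: {result.get('author') or '(unknown)'}",
        f"Published: {result.get('date_published') or '(unknown)'}",
    ]
    return "\n".join(header) + "\n\n" + html_to_text(result.get("content"))
```

Assuming the command-line driver's JSON has been saved or piped in, something like `print(render_result(sys.stdin.read()))` would print the readable version. Missing or null fields fall back to `(unknown)` rather than raising, since the metadata isn't always extracted.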

2 votes