57 votes

Starting to experiment a little with using data scraped from the destination of link topics

Posted September 12, 2018 by Deimos (edited September 12, 2018)

This is very minor so far, but I think it's good to have a topic devoted to it so that people have somewhere to discuss it, instead of having it come up randomly in topics that it applies to.

I've recently started scraping some data about the destination of link topics using Embedly's "Extract" API (Embedly was kind enough to give me a reasonable amount of free usage since Tildes is a non-profit). You can put in the url of an article/video/etc. on that page to get an idea of what sort of data I can get from it, if you'd like to see for yourself.

I've only just started tinkering with it, and so far the data is only being used in two small ways:

Tweets now display the entire text of the tweet on the topic listing page, similar to the "excerpt" from text topics. You can see an example here.
On topic listings, the date that an article was published will be shown (after the domain name) if the publication date was at least 3 days before it was submitted. There are a few examples in the recent posts in ~misc.

I'll probably adjust this threshold, but I'd like it to be an amount of time where the age of the content might feel "significant". It would also be possible to just show this info all the time, but I think the topic listings are already fairly cluttered so it's probably best to hide it when it's not interesting/significant.
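A rough sketch of how a check like this could look, for anyone curious. This is not the actual Tildes code: the function name, the constant, and the sample dates are all made up for illustration, and only the 3-day cutoff comes from the post above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed default, taken from the "at least 3 days" rule described above.
# The name is hypothetical and the value may well change.
PUBLISHED_DATE_THRESHOLD = timedelta(days=3)


def should_show_published_date(
    published: Optional[datetime],
    submitted: datetime,
    threshold: timedelta = PUBLISHED_DATE_THRESHOLD,
) -> bool:
    """Return True if the content was published at least `threshold` before submission."""
    if published is None:
        # The scraper could not determine a date, so there is nothing to show.
        return False
    return submitted - published >= threshold


# Made-up sample dates for illustration.
submitted = datetime(2018, 9, 12, 18, 0, tzinfo=timezone.utc)
old_article = datetime(2018, 8, 22, 21, 47, tzinfo=timezone.utc)
fresh_article = datetime(2018, 9, 11, 9, 0, tzinfo=timezone.utc)

print(should_show_published_date(old_article, submitted))    # True
print(should_show_published_date(fresh_article, submitted))  # False
```

Keeping the threshold as a parameter makes it cheap to experiment with what counts as "significant" age.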

As I said, these are very tiny changes so far, but there are lots of other possibilities that I hope to start using before long. I've mentioned this before, but something I'd really like to do overall is bring in more data about links wherever possible, so that we can show things like the lengths of videos.

Let me know if you have any thoughts about it or notice any issues, thanks.

32 comments

[2]
Celeo
September 12, 2018

Seeing the tweet content in the listing is nice; I wonder if it could be shown in the actual topic view as well (saves a click). The print date is also nice. I don't think it clutters up the topic listing on desktop at all; on mobile, there's a slightly higher risk of a long domain name + print date overflowing the area allotted to it and pushing the vote button off to the right. https://i.imgur.com/63DlvLw.png

Thanks!

14 votes
1. Deimos (OP)
  September 12, 2018
  
  Seeing the tweet content in the listing is nice; I wonder if it could be shown in the actual topic view as well (saves a click).
  
  Yeah, definitely could (and should) be shown on the topic page as well, that should be easy to do.
  
  on mobile, there's a slightly higher risk of a long domain name + print date overflowing the area allotted to it and pushing the vote button off to the right.
  
  Ah, thanks. That's an easy fix.
  
  9 votes
lordpipe
September 12, 2018

I've recently started scraping some data about the destination of link topics using Embedly's "Extract" API

This sounds like the proper way to do it. No third-party, tracker-infested 2 MiB embed for an 840-byte tweet.

12 votes
[2]
ourari
September 12, 2018

Happy with what I'm seeing. Clearly adds value.

From what I gather this is all happening server-side, right? So no need for us to start combing through Embed.ly's privacy policy?

11 votes
1. Deimos (OP)
  September 12, 2018
  
  Yes, all server-side. From Embedly's end, I don't think there's anything more than "the Tildes server requested Extract data for these urls".
  
  9 votes
[3]
frickindeal
September 12, 2018

Length of video is something RES does for Youtube posts, and I find it very handy, especially when I'm at my shop and don't want to dig into a 20-minute video on some esoteric subject.

A bit off-topic, but are you solid on the text size of the excerpts? As someone with aging eyes on a 15" 1080x1920 laptop screen, the excerpts are pretty tiny. I think it's nice that they don't clutter up the listings, though, and they look fantastic in Bauke's Dracula theme.

10 votes
1. Deimos (OP)
  September 12, 2018
  
  Definitely not stuck on it (or many of the aspects of the design at all, I know that I'm bad at design), but it's difficult to balance de-emphasizing something and making it easily readable.
  
  7 votes
2. talklittle
  September 12, 2018
  
  Agreed, the light text color and italics are difficult to read. I wonder if there's a solution that would let us "expand" short excerpts so they are readable like long expanded ones, but without misleading users into thinking short excerpts are longer than they really are?
  
  4 votes
[7]
unknown user
September 12, 2018

#1: the tweet isn't visible on mobile as far as I can tell - I would second the proposal to put it in the topic page as well.

#2: this is awesome; other websites often work around the "old article" problem by putting the year in the title, but I don't think I've seen an implementation like this before. I'm assuming Embedly won't be able to determine a date for all links - would you consider altering the date manually as a potential feature for the poster and/or trusted users?

7 votes
1. [3]
  Deimos (OP)
  September 12, 2018
  
  would you consider altering the date manually as a potential feature for the poster and/or trusted users?
  
  Yes, but not only the date—ideally I think having all "metadata" editable by trusted users would be best. It's not really much different than letting people change tags or anything else, it's just a different set of metadata. But it's definitely best to do as much of it automatically as possible, I don't want people needing to enter a bunch of stuff on every post to try to keep the metadata filled out.
  
  7 votes
  1. [2]
    Soptik
    September 12, 2018
    
    Would it be possible to use a search engine to determine an article's publish date? That way, setting the publish date could be automated even without Embedly. For example, when I used Google, there was a date next to each article, probably the date it was indexed. Could we use the search engine's index date when Embedly doesn't know the date? The downside is that Startpage doesn't provide an API, and the DuckDuckGo API is very limited because of how DuckDuckGo gets its results.
    
    1 vote
    
    Deimos (OP)
    September 12, 2018
    
    I don't know of any search engine APIs to get that data, but if there's anything reasonable (and not expensive) it might be possible to add. Embedly should be pretty good at it in general though, I think (and their support has been responsive in the past if I find sites/articles that don't scrape properly).
    
    1 vote
2. Tenar
  September 12, 2018
  
  I second the manual change for some fringe cases, like when articles are republished (e.g. "this article appeared first on my old blog at abc.xyz, and has been republished because of its renewed relevance").
  
  5 votes
3. [3]
  
  Comment deleted by author
  1. [2]
    Deimos (OP)
    September 12, 2018
    
    The dates should be visible on mobile, they're just not on every post. Currently, they're only on the listing page, and only show up if the link is reasonably old. If you look at this page, it should show on two link topics near the top: https://tildes.net/search?q=punctuation
    
    I realized that when using "Request Desktop Page" on tildes.net the site doesn't honor the request.
    
    I guess it depends exactly how that feature works (and I don't really know much about it). Tildes doesn't have "desktop" and "mobile" versions, it's just one site that changes its appearance at different screen sizes. If you're on desktop, you can shrink your browser window down to a small phone-like size and the window will change to showing "the mobile version" once it gets small enough.
    
    So a phone probably won't really be able to show the desktop version unless there's some way for it to fake its screen size and act like it's a much larger screen. This would just need to happen in the browser itself though, it's all done on the viewer's end.
    
    7 votes
    
    ruspaceni
    September 12, 2018
    
    Firefox on a Samsung Galaxy S5 seems to honour the "desktop" view. Perhaps in the cases where it doesn't work, enabling the mode does change the screen size, just not enough to meet the desktop threshold.
    
    1 vote
[7]
cain
September 12, 2018

Very tiny changes, but very good ones in both cases. Good update, and as always, thanks for all the work. Definitely a QoL increase from the tweets appearing, and a quality-of-content increase from the article dates showing.

What other ideas are you thinking about for this?

6 votes
1. [3]
  Deimos (OP)
  September 12, 2018 (edited September 12, 2018)
  
  Oh, realized that I completely forgot to mention a couple of other things that I'm hoping to use the Embedly data for as well in the near future:
  
  Link canonicalization (including removing tracking from urls). This should help with a few things, including replacing links to mobile/AMP versions of sites with normal ones, getting rid of utm_ and other tracking garbage in urls, and eventually help with detecting duplicate submissions.
  
  Fetching favicons automatically and getting rid of those dashed blue squares on a lot of link topics. My previous method of doing this wasn't very good (obviously, which is why it's been disabled for so long), but using the Embedly data should make it easier.
  
  23 votes
  1. [2]
    cain
    September 12, 2018
    
    I don't have the knowledge to fully understand what the link canonicalization means but fetching favicons automatically and getting rid of those dashed blue squares on a lot of link topics will be a verrrry welcome change.
    
    I was just thinking about it not 30 minutes ago while looking at the topics page.
    
    5 votes
    
    Neverland
    September 12, 2018 (edited September 12, 2018)
    
    Link canonicalization is much simpler than it sounds. It just means finding the simplest, lowest common denominator of a URL. For example, the non-mobile url, or the url minus the tracking parameters.
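To make the "url minus the tracking parameters" part concrete, here is a minimal sketch using only the Python standard library. It is an illustration, not how Tildes or Embedly actually does it: the function name and the prefix list are assumptions, and it does not attempt the mobile/AMP replacement at all.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed list of tracking-parameter prefixes; a real list would be longer.
TRACKING_PREFIXES = ("utm_",)


def strip_tracking_params(url: str) -> str:
    """Return the URL with tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the URL with only the parameters that were kept.
    return urlunsplit(parts._replace(query=urlencode(kept)))


print(strip_tracking_params(
    "https://example.com/article?id=42&utm_source=twitter&utm_medium=social"
))
# https://example.com/article?id=42
```

One caveat: re-encoding the query string with `urlencode` can change how some characters are escaped, so a careful implementation would want to check that the cleaned URL still points at the same page.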
    
    Here is the Tildes GitLab issue that addresses the concept with a few examples of why it’s a good thing.
    
    Edit: clarity
    
    14 votes
2. [3]
  Deimos (OP)
  September 12, 2018
  
  For general articles, I think we could start showing a few pieces of data pretty easily. From what Embedly gives me, I can usually get the publication date of the article, a word count, information about the author(s), and sometimes a reasonable short summary. Some or all of those might be interesting to show from the topic page, and we might be able to use things like the word count to mark longer/shorter articles somehow (similar to the "long read" tags that some of us have been manually adding on longer ones).
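As a rough idea of what marking articles by word count might look like, here is a small sketch. Everything in it is hypothetical: the cutoff for a "long read", the reading speed, and the function name are picked purely for illustration and are not from Tildes or Embedly.

```python
# Hypothetical cutoff and reading speed, chosen only for illustration.
LONG_READ_WORDS = 3000
WORDS_PER_MINUTE = 250


def describe_length(word_count: int) -> str:
    """Build a short label such as '500 words, ~2 min' from a word count."""
    # Never show "0 min", even for very short articles.
    minutes = max(1, round(word_count / WORDS_PER_MINUTE))
    label = f"{word_count} words, ~{minutes} min"
    if word_count >= LONG_READ_WORDS:
        label += " (long read)"
    return label


print(describe_length(500))   # 500 words, ~2 min
print(describe_length(6000))  # 6000 words, ~24 min (long read)
```

An estimated reading time is often easier to judge at a glance than a raw word count, which is why the sketch shows both.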
  
  11 votes
  1. cain
    September 12, 2018
    
    I like it. A summary that could be looked at to see if I have any interest in diving into an article as well as word count would be very nice.
    
    Another QoL increase that might be possible with it is something like length of song/artist/etc. on the YouTube, SoundCloud and song.link posts in ~music, or just video length in general.
    
    6 votes
  2. frickindeal
    September 12, 2018
    
    If you don't want to display word count, you could use something similar to the "line of dots" that Kindle e-readers use to indicate the relative length of books. In fact, they disabled it for a time, and the outcry was such that they brought it back—knowing the (approximate) length of a book is something inherently lost in an electronic device, where one would know from a physical book just by its size how long it is. I'd find something like that very helpful (or even just word count), because I'm often balancing whatever else I'm doing with time spent reading articles/links/etc., and would easily know if it's too long to be digging into right now.
    
    5 votes
Amarok
September 12, 2018

Ah, that delicious metadata. This is just a little gold dust to get things started. :)

4 votes
MacDolanFarms
September 13, 2018

My only concern with this is that the features rely on a third-party proprietary service.

3 votes
Zeph
September 13, 2018

This is really nice, I just hope it doesn't become too cluttered in the future with all the possibilities.

I like that the date is hidden if it's recent for that reason, but also because it makes it more noticeable when there is a date there (and thus isn't something that gets glossed over).

3 votes
[3]
kaushalmodi
September 12, 2018 (edited September 12, 2018)

Awesome! Will users see a preview of the parsed metadata (like the date) before hitting submit, after they have pasted a link?

I use Microformats 2 metadata, and I assume it will be scraped successfully (example).

Update: Sorry, I reread and realized you use Embed.ly. It looks like it parsed the published date as 1534974420000. What date/time unit is that?

2 votes
1. Deimos (OP)
  September 12, 2018 (edited September 12, 2018)
  
  It's a possibility for the future (and would be useful for features like giving topics a title automatically), but right now the scraping is done asynchronously after the post is submitted, and there's not any way to trigger it beforehand.
  
  You can see what will be scraped by Embedly for that link (or any other) here: https://embed.ly/docs/explore/extract?url=https%3A%2F%2Fscripter.co%2Fsplitting-an-org-block-into-two%2F
  
  The published date is in milliseconds since the Unix epoch; 1534974420000 converts to Wednesday, August 22, 2018 21:47:00 UTC.
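For reference, the same conversion in Python with just the standard library. The helper name is made up; the key point is dividing by 1000, since `datetime.fromtimestamp` expects seconds rather than milliseconds.

```python
from datetime import datetime, timezone


def from_epoch_millis(millis: int) -> datetime:
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)


published = from_epoch_millis(1534974420000)
print(published.isoformat())  # 2018-08-22T21:47:00+00:00
```

Passing `tz=timezone.utc` explicitly avoids getting a result in the server's local time zone.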
  
  8 votes
2. Celeo
  September 12, 2018
  
  Milliseconds since the epoch.
  
  new Date(1534974420000) // 2018-08-22T21:47:00.000Z
  
  3 votes
[4]
balooga
September 12, 2018

It took a minute to get used to it, but I really like the publish dateline next to the link. I wish that data was available for more content.

I also like the way the domain is indicated, but I wonder if the source's name could be put there instead, for certain known domains. In other words, show YouTube instead of youtube.com, The New York Times instead of nytimes.com, etc. For brevity, maybe a favicon would be a good choice. Again, only for recognized sources, I wouldn't want to see dozens of who-knows.biz favicons filling up the page. That might be too much visual clutter even for known sites, I'm not sure.

2 votes
1. [3]
  Deimos (OP)
  September 12, 2018 (edited September 12, 2018)
  
  Are the favicons in front of the titles not showing up for you? Not all domains have them, but a lot of the more common ones do (and as I mentioned, I'm hoping to use this data to start automatically fetching them all soon).
  
  4 votes
  1. [2]
    balooga
    September 12, 2018
    
    Huh. There they are.
    
    Guess I just glossed over them. Apparently they're not as cluttered looking in practice as I imagined. Do we still need the domain name to the right of the link if we have a favicon to the left of it? And what's the determining factor for whether a site's favicon makes the cut for inclusion here?
    
    2 votes
    
    Deimos (OP)
    September 12, 2018 (edited September 12, 2018)
    
    I think it's good to have both for now, mostly because I don't recognize the favicons for that many sites, so all the unrecognized ones wouldn't actually be giving me any information about what site the link is from. The presence of a favicon (or the blue dashed square) in front of a title helps to show that it's a link topic, but it doesn't tell you where it goes unless you recognize it. Once we start having more data available maybe it would be better to use the domain name space for something else though, I guess we'll see.
    
    There are no particular criteria for having a favicon shown. It's supposed to have one for all links, but my scraper is disabled for now, so it's not happening automatically. The Embedly data is going to help with fixing it.
    
    5 votes