Starting to experiment a little with using data scraped from the destination of link topics
This is very minor so far, but I think it's good to have a topic devoted to it so that people have somewhere to discuss it, instead of having it come up randomly in topics that it applies to.
I've recently started scraping some data about the destination of link topics using Embedly's "Extract" API (Embedly was kind enough to give me a reasonable amount of free usage since Tildes is a non-profit). You can put in the url of an article/video/etc. on that page to get an idea of what sort of data I can get from it, if you'd like to see for yourself.
I've only just started tinkering with it, and so far the data is only being used in two small ways:
- Tweets now display the entire text of the tweet on the topic listing page, similar to the "excerpt" from text topics. You can see an example here.
- On topic listings, the date that an article was published will be shown (after the domain name) if the publication date was at least 3 days before it was submitted. There are a few examples in the recent posts in ~misc.
I'll probably adjust this threshold, but I'd like it to be an amount of time where the age of the content might feel "significant". It would also be possible to just show this info all the time, but I think the topic listings are already fairly cluttered so it's probably best to hide it when it's not interesting/significant.
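For anyone curious, the date rule described above could be sketched roughly like this. This is only an illustration under assumptions, not the actual Tildes code: the function name, the constant, and the idea of returning None to mean "hide the date" are all made up for the example.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Hypothetical constant: the "at least 3 days" threshold mentioned above.
PUBLISHED_DATE_THRESHOLD = timedelta(days=3)


def published_date_to_show(
    published: Optional[datetime],
    submitted: datetime,
    threshold: timedelta = PUBLISHED_DATE_THRESHOLD,
) -> Optional[datetime]:
    """Return the publication date if it is old enough to display, else None."""
    if published is None:
        # The scraper could not determine a date, so there is nothing to show.
        return None
    if submitted - published >= threshold:
        return published
    return None


submitted = datetime(2018, 8, 27, 12, 0, tzinfo=timezone.utc)
old_article = datetime(2018, 8, 22, 21, 47, tzinfo=timezone.utc)
fresh_article = datetime(2018, 8, 26, 12, 0, tzinfo=timezone.utc)

print(published_date_to_show(old_article, submitted))    # date is shown
print(published_date_to_show(fresh_article, submitted))  # None, date is hidden
```

Keeping the threshold as a single parameter would make it easy to adjust later without touching the display logic.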
As I said, these are very tiny changes so far, but there are lots of other possibilities that I hope to start using before long. I've mentioned this before, but something I'd really like to do overall is try to bring in more data about the links where it's possible to be able to show things like the lengths of videos and so on.
Let me know if you have any thoughts about it or notice any issues, thanks.
Seeing the tweet content in the listing is nice; I wonder if it could be shown in the actual topic view as well (saves a click). The print date is also nice. I don't think it clutters up the topic listing on desktop at all; on mobile, there's a slightly higher risk of a long domain name + print date overflowing the area allotted to it and pushing the vote button off to the right. https://i.imgur.com/63DlvLw.png
Thanks!
Yeah, definitely could (and should) be shown on the topic page as well, that should be easy to do.
Ah, thanks. That's an easy fix.
This sounds like the proper way to do it. No third-party, tracker-infested 2 MiB embed for an 840-byte tweet.
Happy with what I'm seeing. Clearly adds value.
From what I gather this is all happening server-side, right? So no need for us to start combing through Embed.ly's privacy policy?
Yes, all server-side. From Embedly's end, I don't think there's anything more than "the Tildes server requested Extract data for these urls".
Length of video is something RES does for Youtube posts, and I find it very handy, especially when I'm at my shop and don't want to dig into a 20-minute video on some esoteric subject.
A bit off-topic, but are you solid on the text size of the excerpts? As someone with aging eyes on a 15" 1080x1920 laptop screen, the excerpts are pretty tiny. I think it's nice that they don't clutter up the listings, though, and they look fantastic in Bauke's Dracula theme.
Definitely not stuck on it (or many of the aspects of the design at all, I know that I'm bad at design), but it's difficult to balance de-emphasizing something and making it easily readable.
Agreed, the light text color and italics are difficult to read. I wonder if there's a solution that would let us "expand" short excerpts so they are readable like long expanded ones, but without misleading users into thinking short excerpts are longer than they really are?
#1: the tweet isn't visible on mobile as far as I can tell - I would second the proposal to put it in the topic page as well.
#2: this is awesome; other websites often work around the "old article" problem by putting the year in the title, but I don't think I've seen an implementation like this before. I'm assuming Embedly won't be able to determine a date for all links - would you consider altering the date manually as a potential feature for the poster and/or trusted users?
Yes, but not only the date—ideally I think having all "metadata" editable by trusted users would be best. It's not really much different than letting people change tags or anything else, it's just a different set of metadata. But it's definitely best to do as much of it automatically as possible, I don't want people needing to enter a bunch of stuff on every post to try to keep the metadata filled out.
Would it be possible to use a search engine to determine an article's publish date? That way, setting the publish date could be automated even without Embedly. For example, when I used Google, there was a date next to articles, probably the date it was indexed. Could we use the search engine's index date when Embedly doesn't know the date? The downside is that Startpage doesn't provide an API, and the DDG API is very limited because of how DuckDuckGo gets its results.
I don't know of any search engine APIs to get that data, but if there's anything reasonable (and not expensive) it might be possible to add. Embedly should be pretty good at it in general though, I think (and their support has been responsive in the past if I find sites/articles that don't scrape properly).
I second the manual change for some fringe cases: like when articles are republished (e.g. "this article appeared first on my old blog at abc.xyz, and has been republished because its renewed relevancy")
The dates should be visible on mobile, they're just not on every post. Currently, they're only on the listing page, and only show up if the link is reasonably old. If you look at this page, it should show on two link topics near the top: https://tildes.net/search?q=punctuation
I guess it depends exactly how that feature works (and I don't really know much about it). Tildes doesn't have "desktop" and "mobile" versions, it's just one site that changes its appearance at different screen sizes. If you're on desktop, you can shrink your browser window down to a small phone-like size and the window will change to showing "the mobile version" once it gets small enough.
So a phone probably won't really be able to show the desktop version unless there's some way for it to fake its screen size and act like it's a much larger screen. This would just need to happen in the browser itself though, it's all done on the viewer's end.
Firefox on a samsung galaxy s5's screen seems to honour the "desktop" view. Perhaps in the cases where it doesn't work, enabling the mode does change the screen size but just not enough to meet the desktop threshold.
Very tiny changes but very good changes on both examples, good update and as always thanks for all the work. Definitely a QoL increase on the tweets appearing and a quality of content increase on the article dates showing.
What other ideas are you thinking about for this?
Oh, realized that I completely forgot to mention a couple of other things that I'm hoping to use the Embedly data for as well in the near future:
- Fetching favicons automatically
- Link canonicalization, to remove utm_ and other tracking garbage in urls, and eventually help with detecting duplicate submissions.
I don't have the knowledge to fully understand what the link canonicalization means but fetching favicons automatically and getting rid of those dashed blue squares on a lot of link topics will be a verrrry welcome change.
I was just thinking about it not 30 minutes ago while looking at the topics page.
Link canonicalization is much simpler than it sounds. It just means finding the simplest, lowest common denominator of a URL. For example, the non-mobile url, or the url minus the tracking parameters.
Here is the Tildes GitLab issue that addresses the concept with a few examples of why it’s a good thing.
Edit: clarity
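To make the "url minus the tracking parameters" part concrete, here is a minimal sketch using only Python's standard library. It is an assumption about how one might do it, not how Tildes or Embedly actually does: the function name and the prefix list are hypothetical, and real canonicalization would cover more cases (mobile vs. non-mobile urls, for example).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical list of query-parameter prefixes treated as tracking garbage.
TRACKING_PREFIXES = ("utm_",)


def strip_tracking_params(url: str) -> str:
    """Return the url with any query parameters matching TRACKING_PREFIXES removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
    ]
    # Rebuild the url with only the remaining parameters.
    return urlunsplit(parts._replace(query=urlencode(kept)))


a = strip_tracking_params("https://example.com/article?id=5&utm_source=twitter&utm_medium=social")
b = strip_tracking_params("https://example.com/article?utm_campaign=newsletter&id=5")
print(a)
print(a == b)
```

Once two submissions reduce to the same string like this, comparing them is one possible way to help detect duplicates.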
For general articles, I think we could start showing a few pieces of data pretty easily. From what Embedly gives me, I can usually get the publication date of the article, a word count, information about the author(s), and sometimes a reasonable short summary. Some or all of those might be interesting to show from the topic page, and might be able to use things like the word count to mark longer/shorter articles somehow (similar to the "long read" tags that some of us have been manually adding on longer ones).
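As a rough illustration of the word-count idea, something like the following could mark longer articles. The threshold, the label text, and the function name are all hypothetical, chosen only for the example. Nothing here reflects what Tildes actually uses.

```python
from typing import Optional

# Hypothetical cutoff: articles at or above this many words count as a "long read".
LONG_READ_MIN_WORDS = 3000


def length_label(word_count: Optional[int]) -> Optional[str]:
    """Return a label for notably long articles, or None if no label applies."""
    if word_count is None:
        # No word count was scraped for this link.
        return None
    if word_count >= LONG_READ_MIN_WORDS:
        return "long read"
    return None


for count in (450, 3000, None):
    print(count, length_label(count))
```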
I like it. A summary that could be looked at to see if I have any interest in diving into an article as well as word count would be very nice.
Another QoL increase that might be possible with it is something like length of song/artist/etc on the youtube, soundcloud and song.link posts in ~music or just video length in general
If you don't want to display word count, you could use something similar to the "line of dots" that Kindle e-readers use to indicate the relative length of books. In fact, they disabled it for a time, and the outcry was such that they brought it back—knowing the (approximate) length of a book is something inherently lost in an electronic device, where one would know from a physical book just by its size how long it is. I'd find something like that very helpful (or even just word count), because I'm often balancing whatever else I'm doing with time spent reading articles/links/etc., and would easily know if it's too long to be digging into right now.
Ah, that delicious metadata. This is just a little gold dust to get things started. :)
My only concern with this is that the features rely on a third-party proprietary service.
This is really nice, I just hope it doesn't become too cluttered in the future with all the possibilities.
I like that the date is hidden if it's recent for that reason but also because it makes it more noticeable when there is a date there (and thus isn't something that gets glossed over)
Awesome! Will the users see a preview of the parsed meta data (like date) before hitting submit, after they have pasted a link?
I use the Microformats 2 meta data. I assume they will be successfully scraped.. example.
Update: Sorry, I reread and realized you use Embed.ly.. looks like it parsed the published date as something.. 1534974420000 .. What date/time unit is that?
It's a possibility for the future (and would be useful for features like giving topics a title automatically), but right now the scraping is done asynchronously after the post is submitted, and there's not any way to trigger it beforehand.
You can see what will be scraped by Embedly for that link (or any other) here: https://embed.ly/docs/explore/extract?url=https%3A%2F%2Fscripter.co%2Fsplitting-an-org-block-into-two%2F
The published date is milliseconds since the unix epoch, 1534974420000 converts to Wednesday, August 22, 2018 21:47:00 UTC.
Milliseconds since the epoch.
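If you want to check the conversion yourself, a small sketch in Python (the helper name is made up for the example) would be:

```python
from datetime import datetime, timezone


def from_epoch_millis(ms: int) -> datetime:
    """Convert milliseconds since the unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


print(from_epoch_millis(1534974420000).isoformat())  # 2018-08-22T21:47:00+00:00
```

Dividing by 1000 is the only step that differs from the usual seconds-based unix timestamp.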
It took a minute to get used to it, but I really like the publish dateline next to the link. I wish that data was available for more content.
I also like the way the domain is indicated, but I wonder if the source's name could be put there instead, for certain known domains. In other words, show YouTube instead of youtube.com, The New York Times instead of nytimes.com, etc. For brevity, maybe a favicon would be a good choice. Again, only for recognized sources, I wouldn't want to see dozens of who-knows.biz favicons filling up the page. That might be too much visual clutter even for known sites, I'm not sure.
Are the favicons in front of the titles not showing up for you? Not all domains have them, but a lot of the more common ones do (and as I mentioned, I'm hoping to use this data to start automatically fetching them all soon).
Huh. There they are.
Guess I just glossed over them. Apparently they're not as cluttered looking in practice as I imagined. Do we still need the domain name to the right of the link if we have a favicon to the left of it? And what's the determining factor for whether a site's favicon makes the cut for inclusion here?
I think it's good to have both for now, mostly because I don't recognize the favicons for that many sites, so all the unrecognized ones wouldn't actually be giving me any information about what site the link is from. The presence of a favicon (or the blue dashed square) in front of a title helps to show that it's a link topic, but it doesn't tell you where it goes unless you recognize it. Once we start having more data available maybe it would be better to use the domain name space for something else though, I guess we'll see.
There are no particular criteria for having a favicon shown. It's supposed to have one for all links, but my scraper is disabled for now so it's not happening automatically. The Embedly data is going to help with fixing it.