15 votes

Let's build a spec-compliant HTML parser

10 comments

  1. Bauke
    (edited)
    Link
    There will probably be a bunch of videos where Andreas builds out the parser, so instead of posting a topic for each video, I'll just make a list here:

    Part 1: Let's build a spec-compliant HTML parser

    If you want to find this comment quickly, bookmark it. :) And if you like Andreas' videos, subscribe to him! He makes a lot of videos just like these that are really interesting to watch, and he explains everything in an understandable way. Highly recommended even if you don't know C++ at all.

    3 votes
  2. [9]
    Apos
    Link
    I thought a project like that would be almost impossible nowadays.

    3 votes
    1. [8]
      Akir
      Link Parent
      Parsing proper HTML isn't difficult. It's the everything else browsers do that is the problem.

      10 votes
      1. imperialismus
        Link Parent
        Parsing improper HTML, on the other hand, is quite difficult and something that modern browsers do remarkably well.
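
        The distinction is easy to demonstrate at the tokenizer level. Python's standard-library html.parser, for instance, happily emits tokens for mis-nested tags without complaint; it's the spec's tree-construction stage (the "adoption agency algorithm") that must repair such input into a sensible tree. A minimal sketch (the EventLogger class is invented for the example):

        ```python
        from html.parser import HTMLParser

        class EventLogger(HTMLParser):
            """Record the raw token stream the tokenizer produces."""
            def __init__(self):
                super().__init__()
                self.events = []
            def handle_starttag(self, tag, attrs):
                self.events.append(("start", tag))
            def handle_endtag(self, tag):
                self.events.append(("end", tag))
            def handle_data(self, data):
                if data.strip():
                    self.events.append(("data", data.strip()))

        # Mis-nested tags: </b> arrives while <i> is still open.
        p = EventLogger()
        p.feed("<p><b>bold <i>both</b> italic</i></p>")
        print(p.events)
        ```

        Note that the tokenizer reports the mismatched end tags verbatim; deciding what tree `"italic"` actually belongs to is the hard, spec-mandated part.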

        5 votes
      2. [6]
        cptcobalt
        Link Parent
        I really wish some really rich benefactor comes along and says “let’s make a browser engine, but do it the right way”, kinda like Apple did with WebKit before that got too big. Diversity in browser/js/etc engines is still necessary.

        And, preferably, just obsesses over performance.

        2 votes
        1. [5]
          Moonchild
          Link Parent
          Why?

          The problem with the web isn't that it isn't democratized, it's that it's not democratizable.

          The solution is a content vehicle which is simple. Simple enough that, say, a skilled engineer could implement it entirely in under a couple of months. (Some people even think it should be simple enough to be a weekend project; see e.g. gopher, gemini.) A single new browser engine doesn't do anything for the pathological problem.
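
          For a sense of how simple such a vehicle can be: gopher's entire request/response cycle is one selector line from the client, then bytes from the server until the connection closes. A loopback sketch in Python (the menu contents and helper names are made up for the example):

          ```python
          import socket
          import threading

          def serve_once(sock, menu):
              """Answer a single gopher request: read the selector line,
              send the matching document, close the connection."""
              conn, _ = sock.accept()
              selector = conn.recv(1024).decode().strip()  # "" selects the root menu
              conn.sendall(menu.get(selector, b"3not found\t\terror.host\t70\r\n"))
              conn.close()

          menu = {"": b"iHello from a weekend-project protocol\t\tnull.host\t70\r\n.\r\n"}

          server = socket.socket()
          server.bind(("127.0.0.1", 0))
          server.listen(1)
          port = server.getsockname()[1]
          threading.Thread(target=serve_once, args=(server, menu), daemon=True).start()

          # Client side: one request line, then read to EOF. That's the protocol.
          client = socket.create_connection(("127.0.0.1", port))
          client.sendall(b"\r\n")
          chunks = []
          while (data := client.recv(4096)):
              chunks.append(data)
          reply = b"".join(chunks)
          print(reply.decode())
          ```

          Both halves together are a few dozen lines, which is roughly the scale of implementation effort being argued for here.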

          3 votes
          1. [4]
            ThatFanficGuy
            Link Parent
            The problem with the web isn't that it isn't democratized, it's that it's not democratizable.

            That's a bit of a stretch, as far as I'm concerned. Would you change my mind by providing reasoning behind this statement?

            1 vote
            1. [3]
              Moonchild
              Link Parent
              Back in the 90s, and even the early 2000s, there were quite a few independent browsers. Some of the text-based ones (w3m, links, elinks, lynx) survive to this day (though they're only marginally useful), but none of the graphical ones do. The reason? Microsoft decided to try to Embrace, Extend, and Extinguish the web [1]. They added many features to IE that were exclusive to that browser and didn't work anywhere else; websites then started targeting IE exclusively, so you had to use IE if you wanted to visit them.

              We were saved by mozilla firefox, a bastion of standards compliance, usability, and extensibility. (Firefox had somewhat of a golden age of addons—now defunct, sadly.) IE was thwarted, but the web had grown in complexity, and the cost of building a web browser had grown.

              Then came WHATWG.

              WHATWG and W3C were both committees that had been publishing versions of the html specification for many years, and those specs were largely identical. But in 2012, they diverged. Why? Simple. Google had published its 'chrome' browser a few years earlier; it was based on apple's webkit (still in use today powering safari, as well as various embedded browsers), which was in turn based on KHTML (one of the early now-defunct graphical browsers). And WHATWG was run by browser vendors (who had, completely separately from membership of the committee, significant influence over the direction of the web).

              As a vendor of ads on the web, google had a significant vested interest in ensuring that the web was a platform where ad-tech and ad-tracking could be perpetuated. (And that it was not a platform where third-party cookies were blocked by default, as they are by firefox and safari.)

              Google's core business model is threatened by the existence of browsers other than chrome, and it has a vested interest in ensuring that people can't effectively use other browsers.

              This is the reason the web is so complicated today. Google has used its influence to make it so, to discourage other browsers from cropping up so it can sell more ads. To succeed where microsoft failed. A single new browser won't fix that; it can't change that. No 'mysterious benefactor' has pockets deeper than google's: deep enough to start a browser from scratch, bring it up to par, and keep up with the shifting quagmire that is the web.

              What we need is a revolution. A completely new vision of the web that doesn't rely on ad-tech. I don't know if it's possible to build such a thing. But I would certainly like to think so~


              [1]: To be fair, it's not possible to prove intent. Charitably, they may have wanted to provide more tools for web developers to make better websites. But whatever the intent, the effects are indisputable.

              4 votes
              1. [2]
                ThatFanficGuy
                Link Parent
                That... certainly paints a bleaker picture than I'd like to see.

                What do you think of Beaker and P2P browsing? Mastodon? Matrix? MIT Solid?

                Is there any other technology that you see around that gives you hope for a more democratic Web?

                1 vote
                1. Moonchild
                  Link Parent
                  Haven't looked very in-depth at any of those, but none looks particularly interesting.

                  Beaker is basically a newer and less cool version of freenet. It also still relies on the monolith that is html5/css3/js, as does solid. It's not really clear to me what solid does, or what it's supposed to do; its website has a list of stated goals, but not what it is or how it achieves them.

                  Matrix and Mastodon are nice enough (personally I prefer irc over matrix, and am not a huge fan of Mastodon), but neither is trying to replace the web as a whole, just specific pieces of it. The reason the web is so successful is that it's a platform for every internet-bound app you might want to build.

                  Ultimately, though, I don't have much hope, no. The injustices affecting the web are a microcosm of the rapidly-worsening classism that affects the world as a whole, and there's no end in sight to either.


                  I made some notes a couple of years ago on what a web replacement should look like. A lot of them were more social than technical [1] (though there's not really a clear division), but here are some of the more salient technical points:

                  Replace the PKI with something: Trust on First Use (TOFU), or a Web of Trust (WoT), or even a TXT record. Honestly, the PKI is probably the worst thing about the web currently, technologically. Its current state is a mess, and let's encrypt isn't a solution; it's a band-aid. I don't think any of those are actually the Right Thing (that's something that needs to be discussed further), but they're starting points.

                  Explanation: PKI is a centralising force that focuses reliance on a small number of root certificate authorities (as well as the browsers that choose which root certauths to trust). Trust in the root certauths is just a proxy for trust in TLD brokers and ICANN [2], so a simple solution is to just cut out the middleman (hence the TXT record suggestion). But maybe in an ideal world we can cut dependence on ICANN entirely, maybe use something like opennic for dns, or a peer-to-peer solution (hence the TOFU and WoT suggestions). (I suspect that, if such a project were to go through, it would compromise and rely on ICANN; that's just not important enough to be a sticking point.)
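
                  The trust-on-first-use idea is simple enough to sketch in a few lines: pin the first key fingerprint seen for a host, and flag any later change. A toy illustration (the class and return values are invented for the example; a real implementation would persist the pins and involve the user in mismatch decisions):

                  ```python
                  import hashlib

                  class TofuStore:
                      """Toy trust-on-first-use store: remember the first key
                      fingerprint seen for each host, flag any later change."""
                      def __init__(self):
                          self.known = {}

                      def check(self, host: str, public_key: bytes) -> str:
                          fingerprint = hashlib.sha256(public_key).hexdigest()
                          if host not in self.known:
                              self.known[host] = fingerprint  # first contact: trust and pin
                              return "pinned"
                          if self.known[host] == fingerprint:
                              return "ok"
                          return "mismatch"  # key changed: possible MITM, or a rotated key

                  store = TofuStore()
                  print(store.check("example.org", b"key-A"))  # pinned
                  print(store.check("example.org", b"key-A"))  # ok
                  print(store.check("example.org", b"key-B"))  # mismatch
                  ```

                  The hard part isn't the mechanism, it's the policy: distinguishing a legitimately rotated key from an attack is exactly where this scheme needs the further discussion mentioned above.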

                  Layout/markup language that's easier for non-technical people to write, and has less 'overhead' for technical ones. Perhaps something based on LaTeX?

                  Explanation: one of the reasons that people congregate around centralised websites (twitter, reddit, facebook) is that setting up their own is difficult. Html is just hard, and plain html especially doesn't look good, so you have to learn css, too, if you want to make something usable. (This also encourages gathering around platforms like squarespace. Not only is it becoming impossible to make your own browser, it's getting harder to make your own website.) Having a simple markup language that can be made to look good with minimal effort helps with that.
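
                  As an illustration of how little code a line-oriented markup needs, here is a toy renderer for a small subset of Gemini's gemtext (headings, links, and plain lines only; the real grammar has a few more line types, like lists and preformatted blocks):

                  ```python
                  import html

                  def gemtext_to_html(source: str) -> str:
                      """Render a toy gemtext subset: each line is classified
                      by its prefix alone, with no nesting and no state."""
                      out = []
                      for line in source.splitlines():
                          if line.startswith("# "):
                              out.append(f"<h1>{html.escape(line[2:])}</h1>")
                          elif line.startswith("=> "):
                              target, _, label = line[3:].partition(" ")
                              out.append(f'<a href="{html.escape(target)}">'
                                         f"{html.escape(label or target)}</a>")
                          else:
                              out.append(f"<p>{html.escape(line)}</p>")
                      return "\n".join(out)

                  print(gemtext_to_html("# Hi\nplain text\n=> /about About me"))
                  ```

                  Because every line stands alone, both writing the markup and implementing it stay trivial, which is the property being argued for here.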

                  Require layout/markup language to be well-formed

                  Explanation: HTML implementations have to parse ill-formed html. This is actually in the spec. (They even have to be able to parse multiple variants of ill-formed html; look up 'quirks mode'.) This is a small thing, but it would help with making the web easier to implement.
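
                  The difference is visible with any strict parser: an XML parser is required to reject ill-formed input outright, exactly where an HTML parser must silently recover. A quick sketch using Python's standard library:

                  ```python
                  import xml.etree.ElementTree as ET

                  ill_formed = "<p><b>bold text</p>"  # <b> is never closed

                  # A well-formedness requirement means the parser may simply
                  # refuse this input instead of guessing a repaired tree.
                  try:
                      ET.fromstring(ill_formed)
                      result = "parsed"
                  except ET.ParseError:
                      result = "rejected"
                  print(result)
                  ```

                  Every implementation that rejects this input agrees with every other one; it's the mandated error recovery that multiplies the implementation burden.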

                  Any potential replacement to the web will be destroyed by google, facebook, and their ilk if it does not allow for advertising, high-quality analytics, recaptcha, and bitcoin mining. (We may be able to avoid the latter, but it may be inadvisable.) This means that, unfortunately, it needs to be turing-complete.

                  Explanation: This is the biggest one. Essentially, to have a shot at success, a web replacement has to be pushed forward by a major os/browser vendor (that is, google, apple, or microsoft). Which means it has to have a value add to one, and not be an existential threat to any others. So ultimately, it'll be a compromise. But implementing latex + webassembly is doable for a small team, in a way that implementing the whole web isn't.


                  [1]: That is, how do we encourage quality discussion and avoid people being assholes to each other. There will always be places where there is high-quality discussion (e.g. certain smaller or well-moderated subreddits, smaller fora like tildes or lobsters with a well-defined ethos, some IRC channels), but by and large, on a platform like facebook, the majority of interactions will be negative.

                  [2]: Back when EV certificates meant something, there was marginally more reason for certauths to exist, but at this point they're pure overhead.

                  2 votes