10 votes

Programmer's critique of missing structure of operating systems

6 comments

  1. [3]
    onyxleopard

    I am talking about direct serialization of "objects", about unified system supported by all languages, even if they don't have objects, but just some kind of a structure.
    Would it really be impossible? Why?

    Because standards are hard-won and never truly universal, and they don’t last as long as your data does.

    The fact that I can open a plain-text file, today, written in the 80s on my machine built 5 years ago, is miraculous. That miracle is due to *nix OSes agreeing that text is a fundamental denominator.

    How can you possibly get everyone unified on a singular object standard? JSON is probably as close as we’ll get, and even that is severely lacking in several respects, including the lack of standardized schema validators (or their use) and lack of support for comments. And guess what? JSON is a serialization format that happens to be expressible as text.

    There are also plenty of non-textual data formats that have evolved, and there is a whole system of magic numbers and file-header conventions that developed to deal with them. There are also things like MIME and Mailcap. The fact that these remain esoteric, I think, is an indication that what the author is after isn’t feasible. But the problem with all of this 'structured data' is that you need specific programs to deal with it. OSes treat this data as bytes in files that they are happy to shuffle around, but not look inside (unless, of course, you want to see the raw bytes, such as with a hex editor, which, again, is still "stringly oriented").
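    To make the magic-number convention concrete, here is a minimal sketch in Python. The table holds a handful of well-known signatures. The function name sniff_format and the choice of formats are mine, purely for illustration, and real tools check far more than a prefix.

```python
# A few well-known leading byte signatures ("magic numbers").
# The selection is illustrative, not exhaustive.
MAGIC_NUMBERS = {
    b"\x89PNG\r\n\x1a\n": "png",
    b"%PDF-": "pdf",
    b"\x1f\x8b": "gzip",
    b"PK\x03\x04": "zip",
}


def sniff_format(data: bytes) -> str:
    """Guess a format from the leading bytes; return 'unknown' if nothing matches."""
    for signature, name in MAGIC_NUMBERS.items():
        if data.startswith(signature):
            return name
    return "unknown"
```

    Note that the OS itself never does this. It is always some program that decides to look at the first few bytes.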

    Treating text as a common denominator was not a mistake, IMO. It is brilliant, and the fact that nothing has replaced it, IMO, is a testament to its robustness. I’ve seen several efforts to try to 'modernize' the tty (and the *nix shell), and while they are all well-meaning, I have yet to see any succeed.

    If you ignore that you can find it and click on it using mouse and then fill some kind of form manually, then the mechanism of command line parameters is one of the most common ways how to tell the program what you want from it. And almost every program uses different syntax for parsing them.

    Theoretically, there are standard and best practices, but in reality, you just don't know the syntax until you'll read the man page. Some programs use --param. Others -param. And then there are programs that use just param. Sometimes, you separate lists by spaces, other times it is using commas, or even colons or semicolons. I've seen programs that required parameters encoded in JSON mixed with "normal" parameters.

    This is all because the de facto API for command line programs is text, for the exact reason that it is the common denominator. If the program author poorly implements their command line API, that is not the fault of the standard. That is the fault of the program author (or the language the program was written in for not providing a good argument parsing library).
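    As a rough illustration of what a decent argument parsing library buys you, here is a sketch using Python's standard argparse module. The tool name "resize" and all of its options are hypothetical, made up for this example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build the parser for a hypothetical 'resize' tool."""
    parser = argparse.ArgumentParser(prog="resize")
    # One or more positional arguments, collected into a list.
    parser.add_argument("files", nargs="+", help="input files")
    # A typed option with a default value.
    parser.add_argument("--width", type=int, default=800)
    # A repeatable option: each occurrence appends to the list.
    parser.add_argument("--tag", action="append", default=[])
    # A boolean flag with short and long spellings.
    parser.add_argument("-v", "--verbose", action="store_true")
    return parser


# The input is still plain text; the library turns it into typed structure.
args = build_parser().parse_args(
    ["--width", "640", "--tag", "a", "--tag", "b", "x.png", "y.png"]
)
```

    After parsing, args.width is an int, and args.tag and args.files are lists. The text interface stays, but the program no longer has to hand-roll its own syntax.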

    And, how do you actually transfer structured data to the program and from it?

    With files?

    I just don't get this mess. The command line parameters are usually just a list or a dictionary with nested structures. We could really use a unified and simple way to write language of specification of such structures. Something easier to write than JSON, and also more expressive.

    Well, I’ve seen a lot of people gravitate toward YAML for these kinds of things. Personally, I think JSON or .conf are fine.

    I completely omit that the necessity of command-line parameters is completely suppressed when you can send structured messages to a program just like calling a function in a programming language.

    You can send structured data to command-line programs just fine with files!

    My soul is screaming, terrorized by statements like "You need to store structured data to an env variable? Just use JSON, or a link to the file!" Why, for God's sake, can't we manage a uniform way of passing and storing data, so that we have to mix the syntax of env variables in bash with JSON?

    We do have a uniform way! The uniform way is with files! If you want your program to read/write some persistent, structured data that it needs to operate correctly, have it create the appropriate files and shove the data in there. Tons of programs do this. My ~/.conf directory is chock full of evidence that this is a solved problem.
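    A minimal sketch of that pattern in Python, assuming JSON as the on-disk format. The function names and the idea of merging stored values over defaults are my own choices for illustration, not any particular program's behavior.

```python
import json
from pathlib import Path


def save_config(path: Path, settings: dict) -> None:
    """Write settings as human-readable JSON, creating parent directories."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(settings, indent=2, sort_keys=True), encoding="utf-8")


def load_config(path: Path, defaults: dict) -> dict:
    """Return defaults overlaid with whatever is stored at path, if anything."""
    if not path.exists():
        return dict(defaults)
    stored = json.loads(path.read_text(encoding="utf-8"))
    return {**defaults, **stored}
```

    Because the result is an ordinary text file, it can be copied to another machine, diffed, or edited by hand without the program that wrote it.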

    And, what about the format? I think you can guess it. It can be literally anything anyone ever thought of.

    There’s nothing wrong with this! This means I can update my OS, or even switch OSes, and bring my configurations of all my programs with me. I can’t imagine how horrible it would be if my OS changed and it broke all my configs.

    But, why it can't be standardized and the same across the system? Ideally, the same data format that is also a programming language used everywhere. Why it can't be just the objects themselves, stored in a proper location in the configuration namespace?

    Because, that would require getting everyone to agree to use the same language/system. And again, getting diverse groups of humans to agree on anything is an intractable problem.

    Something where you can just throw an arbitrary structure in and it will take care of saving it without needless serialisation and deserialisation. I realise that it ends up being just a stream of bytes at the end but it's exactly because of this that I can't see many reasons to make any significant difference between what is in the memory and what is on the disk.

    The reason is that you don’t know who/what/where/when/why/how the data is going to be read in the future. When you have an object in memory that is being manipulated by a specific program, that program and its language constructs can deal with it. But if you want to live in a world where someone else’s program, possibly on a different machine, written in a different language, in a different decade, is able to read the data, you have to concede that you can’t store some serialized object and hope for it to be interoperable.

    I guess that’s my fundamental takeaway. If we could get the world to standardize on one programming language, maybe we could standardize everything else: command line argument parsing, data file formats, logging systems, configuration DSLs, etc. But we can’t. So we compromise and let people make up their own stuff, as long as they can write it down, in a file, as text.

    6 votes
    1. [2]
      skybrian

      Files (text or binary), by which I mean variable-length arrays of bytes, often interpreted based on standard file formats as they existed when the file was created, are fundamental for preserving history. I'm not so sure about filesystems or the way typical OS directories work, though. An OS directory is considerably more complex than a git directory and a set of key-value pairs is simpler yet. It seems like in a lot of cases, a simple map would do? Or perhaps a map of maps?

      I'm doing embedded programming using a microcontroller, and there is no filesystem. Perhaps support for collections of files could be added in a different way?

      3 votes
      1. onyxleopard

        Files (text or binary), by which I mean variable-length arrays of bytes, often interpreted based on standard file formats as they existed when the file was created, are fundamental for preserving history.

        Yes, hence the various archiving and compression formats that have come about: tar, gzip, bzip2, …

        I'm not so sure about filesystems or the way typical OS directories work, though.

        I encounter .tar.gz or .tar.bz2 files very regularly because the tar format bundles many files into one, and then one can use single-file compressors like gzip or bzip2.
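        Here is a small sketch of that bundle-then-compress idea using Python's standard tarfile module, done entirely in memory so no filesystem is involved. The helper names bundle and unbundle are made up for this example.

```python
import io
import tarfile


def bundle(files: dict[str, bytes]) -> bytes:
    """Pack named byte strings into a single gzip-compressed tar archive."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)  # tar headers record each member's size
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


def unbundle(blob: bytes) -> dict[str, bytes]:
    """Recover the name-to-bytes mapping from an archive made by bundle()."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as archive:
        return {
            member.name: archive.extractfile(member).read()
            for member in archive.getmembers()
            if member.isfile()
        }
```

        The result is a map of names to bytes flattened into one array of bytes, which is roughly the "simple map" you describe.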

        Perhaps support for collections of files could be added in a different way?

        I’m particularly keen on the de facto JSON Lines data format. I arrived at something very much like this on my own before finding out that others had discovered it independently. You can treat a JSON Lines file like a database. I often do this along with jq for querying.
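        For anyone who hasn't seen it, the format is just one JSON value per line. A minimal sketch in Python follows. The helper names and the sample field "lang" are hypothetical, chosen only for the example.

```python
import json
from typing import Iterable, Iterator


def to_jsonl(records: Iterable[dict]) -> str:
    """Serialize records as JSON Lines: one compact JSON object per line."""
    return "".join(json.dumps(record) + "\n" for record in records)


def read_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Parse JSON Lines lazily, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def where(records: Iterable[dict], key: str, value) -> Iterator[dict]:
    """Yield records whose key equals value; records lacking the key are skipped."""
    return (record for record in records if record.get(key) == value)
```

        Calling where(read_jsonl(open_file), "lang", "en") streams through the file one record at a time, which is about what a jq filter like select(.lang == "en") does. Appending a record is just appending a line.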

        3 votes
  2. skybrian

    In short, the author is criticizing the need to serialize things to text files or binary blobs. Why not keep every key and value separate, all the way down?

    But it's often useful to hide structure from the parts of the system that don't need to know about it. Consider encryption, compression, checksums, signatures, error-correction, and backups. These all work on uninterpreted bytes.
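    A minimal sketch of what I mean, in Python: compression plus a checksum, neither of which knows or cares what the payload is. The framing (a 32-byte SHA-256 digest followed by the compressed data) is made up for this example and is not any standard container format.

```python
import hashlib
import zlib

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes


def pack(payload: bytes) -> bytes:
    """Compress arbitrary bytes and prefix a SHA-256 checksum of the result."""
    compressed = zlib.compress(payload)
    return hashlib.sha256(compressed).digest() + compressed


def unpack(blob: bytes) -> bytes:
    """Verify the checksum, then decompress; raise ValueError on corruption."""
    digest, compressed = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    if hashlib.sha256(compressed).digest() != digest:
        raise ValueError("checksum mismatch")
    return zlib.decompress(compressed)
```

    The same two functions work unchanged on JSON text, an image, or a serialized object, precisely because they never interpret the bytes.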

    5 votes
  3. [2]
    spit-evil-olive-tips

    There's a famous quote that "object-relational mapping is the Vietnam of computer science".

    It represents a quagmire which starts well, gets more complicated as time passes, and before long entraps its users in a commitment that has no clear demarcation point, no clear win conditions, and no clear exit strategy.

    A big part of what this author is getting at:

    I thought about it, and, in my opinion, throwing away the filesystem and replacing it with a database is an unavoidable step.

    Is, in my opinion, just as big a Vietnam-style quagmire, if not bigger. It seems simple at first, but the deeper you get into it, the more you realize it's an intractable problem with no solution that'll make everyone happy, and not even a good way to cut your losses and back out gracefully.

    Microsoft famously tried it with WinFS and couldn't make it work.

    The other big thrust of his article seems to boil down to the How Standards Proliferate XKCD. He bemoans that there are too many standards (XML, JSON, YAML, etc.) and says there should be only one universal standard that works for everyone. That will simply never happen. No matter how good his hypothetical standard data format is, there will always be use cases it can't address. If he chased the dream of making it support every possible use case, it would eventually get so complicated that someone would offer a simplified standard that was easier to use.

    4 votes
    1. skybrian

      It's certainly very difficult, but I don't think that proves it will never change. There are some hints of this in how the Internet and cloud computing work.

      In cloud computing, it's quite common never to expose a filesystem as writable persistent storage. There are a variety of key-value stores and databases competing, and they certainly do store files, but a filesystem abstraction like NFS doesn't seem to be quite the right thing for network storage.

      Also, Google Drive or Dropbox might look like they expose a filesystem, but behind the scenes they don't. Or look at the web: there are certainly files, and the naming conventions for URLs suggest directories, but often it's not implemented that way.

      Mobile phone OSes also hide a lot of the details of how filesystems work.

      The Windows effort failed, but maybe we will see a slow, gradual move away from the details of filesystem semantics at the OS level? Probably anything that can handle a git repo will be good enough for development, for example.

      1 vote