Why store code as text files?
Code is usually version controlled nowadays, in git or some other VCS. These systems typically operate on text files and record the changes applied to the files over their history. One drawback of this is that formatting can introduce changes to the files that make no semantic difference, e.g. newlines are added or removed, indentation is altered, etc.
Consistent formatting makes the code easier to read, but the style used is an aesthetic preference. There might be objective reasons for readability in at least the extreme cases, but in many cases the formatting is purely a preferred style.
If we instead version controlled code in the form of an abstract syntax tree (AST) (possibly even as just a series of transformations on that tree), we could have any formatting we'd like! When editing the code we would just be changing a projection of the AST and when we've made our changes the transformations could be made to the stored AST. If two languages shared the same AST the choice of language even becomes a choice for the programmer. Sadly this has some limitations since ASTs are usually language specific... But we could possibly take this a step further.
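As a small illustration of the projection idea (a Python sketch using the standard `ast` module, not a proposal for any particular language): differently formatted sources collapse into one tree, and rendering the tree back out is just picking one projection.

```python
import ast

# Two formattings of the same program...
a = "x=(1+2)\nif x:print(x)"
b = "x = 1 + 2\nif x:\n    print(x)"

# ...parse to the identical tree: formatting never reaches the AST.
same = ast.dump(ast.parse(a)) == ast.dump(ast.parse(b))

# Rendering the tree back to text is one possible "projection";
# ast.unparse (Python 3.9+) simply picks a fixed house style.
projection = ast.unparse(ast.parse(a))
```

A projectional editor would let each reader choose their own `unparse`, while the stored tree stays the same.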
Could we take a compiled binary and use that as the basis for generating an AST? This is essentially what decompilers do. For heavily optimized code this is severely limited, but debug builds retain a lot of extra information in the binary that can be used to construct a sensible representation. With this way of storing code, the language used becomes a style preference! Code compiled from one language might look alien when viewed in another (think of lazy Haskell code viewed as C), but maybe that is a corner case?
There are issues when considering binaries for different platforms. A binary for the JVM isn't the same as one for ARM64 or one compiled to run on an x86. So there are some limitations there...
One (very) good thing about storing code as text files is the ubiquity of software capable of viewing and editing text. It would however be cool if we could make programming language a stylistic preference that is compatible with other languages! At least the AST part should be perfectly achievable.
I think there are reasonable arguments for not using text as source code, but in creating a language that uses something other than text (really, something other than ASCII or UTF-8), you put yourself on the hook to build:
and a lot more. For thirty years, we have "[written] programs to handle text streams, because that is a universal interface," to the extent that it really is a near-universal interface. I could see a world where everything accepts JSON - indeed, the Language Server Protocol is moving in that direction - but JSON is, well, text, and you can use `grep`, `wc`, etc. on it. It is not nearly so alien to the UNIX environment as a SQLite database is, and ultimately, the vast majority of programming these days - even on Windows - operates in a pseudo-UNIX environment.
It's also worth mentioning that your language design might obviate one or two of these needs, but you will still need to deal with the rest. Darklang, for instance, makes search and organization tools pretty unnecessary through inherent language and tooling design - but it's basically unusable in any serious setting (in my opinion) because its debugging, profiling, analysis, and testing story is pretty poor.
My thinking wasn't to create a new language, nor even a new format necessarily. One way would be to base this on existing decompilers. As long as the binaries aren't optimized, reasonable projections could be made to different languages.
But alas, I'm not sure much would be gained in the end (and a lot would likely be lost or need to be reconstructed, e.g. code comments), and as @jzimbel mentioned, the formatting issue is essentially solved by having a standard formatter. A team working on a project likely gains a lot more from using the same language to model a problem than they'd benefit from each member working in their own preferred language.
The problem is that not all languages are compatible like that. Take languages that are compiled but have a runtime, for example, or different implementations of conceptually identical/similar constructs: Go's `go` vs Crystal's `spawn`. Open one in the other and, instead of a keyword and some operators, you have dozens or hundreds or thousands of lines of scheduling and communication code, which might not return to their tidy forms in their native language if the formatter can't recognize them, for example if changes are made to the scheduler to improve efficiency or just to tidy things up. Or take Rust, where there's some behavior that just isn't valid the way it would be in other languages.
I'm not sure that an intermediary form could be made that's compatible across languages and across vastly different types of abstractions without all those abstractions noisily and messily immediately unraveling.
I'm cursorily following the Unison programming language, which is stored in content-addressable AST form. It is a neat idea.
A given hash uniquely identifies a given piece of code, and the encoding makes it very simple to, e.g. rename variables, because those names are a non-essential detail that the language itself doesn't need. Different people could even call the same code by different names, and because the language is content-addressed, it is trivial to distribute code in a cloud environment (which is one of the stated goals of Unison).
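A toy version of the content-addressing idea can be sketched in Python (this is not how Unison itself works internally; the names and scheme below are made up for illustration): hash the tree after replacing identifiers with positional placeholders, so code differing only in names gets the same address.

```python
import ast
import hashlib

class _Canon(ast.NodeTransformer):
    """Replace identifiers with positional placeholders so that
    code differing only in variable names hashes identically."""
    def __init__(self):
        self.mapping = {}

    def visit_Name(self, node):
        node.id = self.mapping.setdefault(node.id, f"v{len(self.mapping)}")
        return node

def content_hash(src: str) -> str:
    tree = _Canon().visit(ast.parse(src))
    return hashlib.sha256(ast.dump(tree).encode()).hexdigest()

# Two people calling "the same code" by different names:
h1 = content_hash("total = price + tax")
h2 = content_hash("t = p + x")  # same address as h1
```

A real change to the code (say, `-` instead of `+`) produces a different hash, so the address really does identify the semantics-bearing part of the tree.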
Instead of source code files, they use a database of code (SQLite, I think) that can be edited by making `.u` scratch files that get incorporated into the local database once saved. Interestingly, I think they did have some problems with effectively sharing code though, because GitHub & co are so text-file centric.
The reality is that good dev tooling is hard to make and there oftentimes isn't that much will to do it. Working directly on ASTs would mean each tool would have to be much more specific to a language. You'd have to recreate grep, fzf, git, and so forth, even before you get to text editors, before you'd be able to use the higher expressiveness of ASTs to give benefits.
Plain text is just a more universal format and much easier for a human to inspect manually. It's a format as old as computers, and that means it has decades of well-trodden and effective tooling.
Now, I think it’s fine to have extensions to existing tools to be a bit smarter. I’m sure everyone has had git try to merge two files in the dumbest possible way. But a whole tooling stack is not practical.
To its credit, git is surprisingly good at merging things most of the time. It would be nice if I could hook it into a compiler so that it could weigh its merge strategies by whether they result in compilable code.
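Something in this spirit can be approximated today: git supports custom merge drivers (configured via `.gitattributes`), so a driver could run the ordinary textual merge and then reject any result that no longer parses. A minimal sketch of the validation half in Python (the git wiring is left out):

```python
def compiles(src: str) -> bool:
    """Return True if src is at least syntactically valid Python."""
    try:
        compile(src, "<merged>", "exec")
        return True
    except SyntaxError:
        return False

# A merge driver could accept a textual merge only when the result
# still compiles, surfacing everything else as a conflict:
clean = compiles("def f():\n    return 1\n")
conflicted = compiles("<<<<<<< ours\nx = 1\n=======\nx = 2\n>>>>>>> theirs\n")
```

This only checks syntax, not semantics, so it can't rank merge strategies by correctness, but it would already catch the common case where a "successful" merge leaves conflict markers or broken code behind.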
Using a binary as a first-class vessel for editable code is a terrible idea. See image-based development, however.
One reason to prefer structural editors over textual ones is that they trivialise the work of making very high quality live and interactive analysers. (Including for things like syntax highlighting, autocomplete, search, search-and-replace, etc., and they enable ad-hoc user-defined analysers as well.) Textual methods are generally rather limited, and require quite a lot of work even to do as well as they do.
Language-specific tooling is always going to be necessary. I see no reason to try to avoid this. You will need to write a compiler for your language anyway; why not also an editor?
Could you give some reasons for this? There are some obvious downsides if the editing software can not generate a sensible representation from the binary, but what are some other reasons?
There are two issues (the second of which I have been wrestling with in my OS design).
The binary format interposes a fairly inconsequential layer of concreteness in front of the protocol the user or UI layer uses to interact with the state of the code. It also makes no allowance for live modification of a running image. This point may be difficult to appreciate if you have not used Common Lisp or Smalltalk systems.
The tight coupling of all code into a single artefact inhibits modularity and in particular librarization. How am I supposed to distribute one portion of my code to you—and how are you supposed to use it—if it's all bound into one monolithic binary?
A separate but related issue is that binaries (in the sense of elf/coff/pe x amd64/arm/..) are not really what you want even for end-user-type distribution. An unfortunate reality is that architectures are not uniform, and you have to deal with that somehow. The elephant is CPU extensions; how do you account for sse2/sse4.2/avx2/avx512; softfp/hardfp/neon/sve; G/C/...; ...? (The more important (but less prominent—stockholm syndrome) issue is that of distinct architectures; you should be able to write an app targeting x86 and I should be able to run it on arm.)
You should see the hoops glibc jumps through to ensure the appropriate string functions are called for the target architecture's extensions when dynamic linking. And that applies only to a small number of performance-critical routines hand-written in assembly. Some math libraries branch at runtime to the correct target, but that carries runtime overhead which is hard to quantify (it doesn't show up in microbenchmarks). What about compiler-generated code (e.g. autovectorization)? Do you ship a separate binary for every extension level?
Additionally, cpu feature levels aren't the only form of runtime divergence you might want to specialise on. You also want at least to be able to specialise on runtime profiling information. E.G. java is much faster than c++ at virtual dispatch. (Incidentally, if you do have runtime branches to hand-optimised routines, in JIT context, the compiler will be able to get rid of them completely at runtime, with no linker magic or other special cases required.)
So, ideally the actual distribution format would be somewhat portable, and would be abstract enough to be optimisable at runtime for the target in question. The obvious example is java jars. Which are good at what they do. The main problem is that the jvm does not specify a complete set of OS interfaces, so if you want to speak e.g. gl or termios, you have to go through second-class c API wrappers. Another approach is that used (prospectively?) by the mill: it specialises binaries at runtime for the target CPU (makes a lot of sense, given it's a VLIW), but leaves the details of linking etc. to the OS. I don't think a binary format needs to permit full access to host APIs—I think the java approach of abstracting everything works well enough too—but trying to burn your cake and eat it from both ends too, as WASI tries to do, is the wrong route to take.
NB. someone I know says the solution to this (and other problems) is declarative. I think he's probably right, but I'm not convinced. At the very least, I think a lot of the problems with image-based systems are down to UI. I plan to take the image-based approach as far as I can, and if I can't work out the problems, I'll settle for declarative.
Maybe look at how databases handle things? Sometimes schema updates are annoying but they're definitely focused on being able to change things in place without starting over (or backing up and restoring).
I don't quite see the significance. Schema changes are about changing how data is represented (and what data is represented). Here that's not really the problem; the data format is more or less fixed. The problem is managing changes to and relationships between different data.
By "image-based" do you mean like Smalltalk? I think there are similarities because the program is getting updated while it's running. Like, if you change the definition of a class, there are existing instances of the class that need to be updated.
There are similar issues with other programming tools that you do live updates to code being debugged.
Technically, git doesn't store differences between versions. It stores the old version and the new version. You can plug in a smarter diff algorithm that's language-aware and ignores formatting changes if you like. An advantage of this is that it will work for older history without changing the data.
(There are version control systems that do store diffs, and they have some advantages, but they haven't been as popular.)
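The "smarter, language-aware diff" idea above can be sketched by comparing token streams rather than raw lines. A Python-specific toy (real tools operate on full parse trees and handle far more cases):

```python
import io
import tokenize

def tokens(src: str):
    """Token stream with comments and blank-line tokens dropped,
    so formatting-only edits compare as equal."""
    skip = {tokenize.COMMENT, tokenize.NL}
    return [(t.type, t.string)
            for t in tokenize.generate_tokens(io.StringIO(src).readline)
            if t.type not in skip]

# Reformatting (spacing, comments) produces no diff:
unchanged = tokens("x   =  1\n# note\n") == tokens("x = 1\n")
# A real edit still shows up:
changed = tokens("x = 2\n") != tokens("x = 1\n")
```

Because this runs at diff time, it works on all the existing history without re-encoding anything, which is exactly the advantage mentioned above.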
One problem with storing ASTs is that languages change. When you upgrade to a new version of a language, do you want to have to upgrade the version control system too? A version control system that understands not only the current version of each language but also every previous version would be quite hard to maintain.
An alternative is to just store binary files without understanding the format, but that means you sometimes don't get diffs at all.
IMO, snapshots are more coherent than patches.
You can always layer arbitrarily smart patchers on top. And patches fundamentally don't commute in all cases, so there's little point in trying to pretend they do.
I find the sales pitch for Pijul to be somewhat interesting. I think doing a better job on three-way merges and rebases would be nice, and I can definitely remember times when that would have come in handy.
I'm not sure it's enough for me to learn it, though, since as a hobbyist, my source control needs are very simple.
Looks like they are at 1.0 beta so it's a bit early for anyone else, too.
There are projectional editors that have been developed to essentially work directly with programs as objects rather than do the intermediate manipulation of text. None of them are very popular. Though, one could argue that WYSIWYG document editors are a rather popular form of working with various markup languages. Anyone who works with something like MS Word for a while, though, will realize that there are tradeoffs.
Once you get used to working with text and text-based tools you begin to develop superpowers for which there are no equivalents when working with other kinds of data. And, programming language-aware software is totally feasible while still manipulating text. Look at any modern IDE or editors like VS Code.
The amazing thing about computer science is that just about every idea, no matter how smart or how stupid it may be, has probably already been done.
For instance, storing code in a binary format is something that has been done many times with BASICs designed to run on very tiny systems. Off the top of my head, Sharp's pocket computers, TI's 99/4A, and most computers by Sinclair did this, all of them to save memory.
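For flavor, the trick those machines used ("crunching" keywords to single bytes) can be sketched in a few lines of Python. The byte values below are made up for illustration; real token tables (Commodore's, TI's) differ:

```python
# Hypothetical one-byte codes standing in for a real BASIC token table.
TOKENS = {"PRINT": 0x99, "GOTO": 0x89, "FOR": 0x81, "NEXT": 0x82}

def crunch(line: str) -> bytes:
    """Store a BASIC line with keywords 'crunched' to single bytes."""
    parts = []
    for word in line.split(" "):
        if word in TOKENS:
            parts.append(bytes([TOKENS[word]]))  # keyword -> one byte
        else:
            parts.append(word.encode("ascii"))   # everything else as-is
    return b" ".join(parts)

line = "FOR I = 1 TO 10 : PRINT I : NEXT I"
saved = len(line) - len(crunch(line))  # bytes saved on this line
```

The interpreter's LIST command then did the reverse mapping on the fly, which makes these machines a tiny, pragmatic ancestor of "store the structure, project the text".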
It's also somewhat similar in concept to languages which target a virtual machine. Both CPython and Java take your text and convert it into binary tokens which get interpreted one way or another. Though the practical implementation can be quite different from what you're talking about.
But like everyone else has already mentioned, the biggest reason why we don't do this today is tooling. You'd need a specialized tool to write code in a binary format, and that means it will not likely be anywhere near as full-featured as modern code editors are. It's not without its positives though. You'd be avoiding both the vim vs. emacs and the tabs vs. spaces debates forever! Or at least until someone makes an emacs extension to handle your format.
There is also Forth. In some ways it was better than the old BASIC systems. For example, using short, named, subroutines instead of line numbers.
Not so good: unnamed arguments, unstructured access to the stack, 1k blocks for disk storage.
Regarding formatting, most projects with multiple contributors these days just use an automatic formatter and enforce its usage with a check in their continuous integration pipelines. Of the languages I have used, I can think of 2 with formatters that ship directly with the language (`gofmt`, `mix format`) and 2 others that have third-party formatters the community has generally settled on (`prettier`, `black`).
In my opinionated opinion, standardizing on a formatter is by far the simpler approach, and saves tons of time in code review. There have been very few cases where the formatter made my or my reviewees' code less readable, and I'll gladly accept that in exchange for its complete removal of person-by-person, line-by-line subjectivity from that aspect of software development.
Tl;dr: Save everyone’s time and let the authors of an official auto-formatter decide what’s best. Standardization is extremely valuable.
I think this would add a lot of complexity for the comparatively small gain of ending the tabs vs spaces war. All of our existing tools already support working with text files.
When I was a fresh young programmer, I would have thought this was a cool idea. Today, I don't actually care about tabs vs spaces, or newlines, or anything else like that. All I care is that a given codebase is consistent. Any good IDE will have tools for enforcing style rules. I have my own preferences, that I enforce in my codebases, but I have no problem adapting to the existing style of whatever project I am contributing to.
One issue is that pre-parsed code -> AST is a destructive process (e.g. in languages with zero-cost abstractions, all the abstraction is stripped away), so the inverse process will be very ill-defined (there's tons of different code that can produce the same AST).
Try `@code_llvm sin.(zeros(100))` in Julia (it shows you the LLVM IR, which all languages that target LLVM share) if you want an idea of how impossible going back to the nice abstract syntax would be.
This generally does not occur during parsing.
That would seem to be an issue of a mis-specified AST.
Wouldn't this be a case where the projection of the AST into regular code is a style preference? So if there are multiple ways of arriving at the same AST then they are all equivalent and a tool could just pick one that makes the code nice (e.g the style preference of the tool author).
I think they are talking about the "lowering" of code constructs, essentially replacing them with equivalent but lower-level constructs during parsing. Some parsers do this immediately but it's not that common. A language with an AST designed for storage wouldn't do it.
Also, the AST would need to preserve comments. Normally, only formatters need to do this, but if persistence is done using a serialized AST, comments need to be represented in the AST.
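Python's own `ast` module makes this concrete: comments never make it into the tree, so any AST-storage format would need dedicated nodes (or attachments) for them.

```python
import ast

src = "x = 1  # number of retries"
# Round-tripping through the AST silently discards the comment
# (ast.unparse requires Python 3.9+):
roundtrip = ast.unparse(ast.parse(src))
# roundtrip is now just "x = 1"
```

This is exactly why formatters (and tools like `black`) work from a concrete syntax tree that keeps trivia, rather than the compiler's abstract one.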
Languages don't necessarily have a single AST. There could be different ones used for formatting and compiling.
I like the good intentions with this idea, though it's not novel. However, I think there are two issues. One, you already mentioned yourself:
An issue with a not-just-text-files system is that the supporting tooling (IDE plugins, AST compilers) needs to be rolled out to everyone in the team, and also to any relevant parts of the workflow (version control, continuous integration). Or, with a public, open-source project, it would be an extra hurdle (or even barrier) to entry for new contributors.
The other issue I'm thinking about is this: The thing about style arguments and linting is: it's actually not so much about Jane Developer wanting to see code in their personal style. It's actually about Joe Developer wanting to force OTHER people to see the code in their personal style (or some chosen popular standard style). A lot of people that care about style aren't going to be content with seeing their favoured style on their own screen. They care about what actually gets into the version control, and what other people see. While it's not every aspect, some aspects about code style are preferred by some people because they believe it helps [other] readers, or it helps other developers work with the code (e.g. trailing commas).
I was listening to "Knowing where to start. A conversation with Casey Muratori". At 52:24, Casey (of Handmade Hero) talks about how he thinks the future of programming isn't gonna be with text editors. He says it's sad that we haven't moved past text files yet.
There's a language like you're describing called Enso; it has a YouTube channel and a GitHub repo.