What's in a git repo?
Okay, I know the obvious answer is the history of the files. But how can I, from the command line, really understand what is hiding inside that .git directory?
Today I was doing one of my periodic disk space audits, trying to figure out where my usage goes. This comes from having a 64GB drive mounted as /home on my Linux laptop. I found some 15G of old video files to delete today, so I'm no longer as pressed for space. But my interest was piqued by one thing I have downloaded from Github that is ~120 megs for a very simple program. Poking around further I find that most of that usage is a single file:
$ ls -lh withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
-r--r--r-- 1 elijah elijah 102M Mar 14 23:28 withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
$ file withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack: Git pack, version 2, 299 objects
$
Is there an unzip or tar xzf equivalent for Git pack files? Naive usage of git unpack-file is only generating errors for me.
GitLab has an amazing (31 min) video on their YouTube channel that goes over Git Internals and everything you've ever wanted to know about Git. https://www.youtube.com/watch?v=P6jD966jzlk
tl;dr: git has a lot of command line options abstracted away but available for everyone (albeit hidden away)
31 minute videos strike me as a terrible way to learn command line programs.
That doesn't mean I won't watch it, but it does mean I'm unlikely to right now.
Almost none of what is gone over in the video is necessary to learn git. You could probably learn git in about 5 mins at https://try.github.io/ but you asked about the internals of .git
I use git regularly. I know all the learn in five minutes stuff. I've been using revision control software since Linux was a twinkle in Linus' eye. I can do sophisticated things with Perforce, but git uses a quite different model that I am not that comfortable with. And I have not learned much of the internals of git.
I mean I'm fine with every repo having local history, which is so much different than Perforce, and harks back to pre-distributed revision control systems like RCS in my mind. But it irks me that git checkout swaps files out from under me. Ugh. I would much rather have parallel trees for branches, although I know that's not in vogue any more.
In its simplest form, git has a similar structure to the blockchain, but rather than being a linked list, it is a DAG.
This is a pretty short, 1000-mile-high overview of the system, but I find it helps in visualising how everything fits together.
Edit: I see that others have posted pretty decent resources on it. I'll leave my comment anyway
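For anyone who wants to see the DAG rather than take it on faith, here is a minimal sketch. The throwaway repository, branch name, and file names are made up for illustration; the point is only that a merge commit records two parents, which a plain linked list cannot do.

```shell
# Throwaway repo (hypothetical contents) with one branch merged back in.
repo=$(mktemp -d) && cd "$repo" && git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
echo a > a && git add a && git commit -qm "root"
git checkout -q -b side && echo b > b && git add b && git commit -qm "side work"
git checkout -q -       && echo c > c && git add c && git commit -qm "main work"
git merge -q --no-edit side

# Each line: a commit ID followed by the IDs of its parents.
# The root has none, ordinary commits have one, the merge has two.
git rev-list --parents HEAD

# Word count for the newest commit: its own ID plus its parents.
merge_parents=$(git rev-list --parents -n 1 HEAD | wc -w | tr -d ' ')
echo "fields on the merge line: $merge_parents"
```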
A blockchain is also a DAG.
I'll be honest, I only read the original whitepaper and some accompanying implementations of blockchain in its infancy. From what I understood of the concept, I saw it as a linked list. Granted, this is technically a DAG, but it has the limitation that each node may have a max of two edges.
I have heard through the grapevine that Blockchain is changing its technological underpinnings to something more sustainable and scalable (and hopefully more energy-efficient) but I find it hard to read anything on the topic. Mostly because as soon as I start looking for blockchain related discussions or content it very quickly turns into a flamewar of speculation and namecalling.
Despite this, I am actually interested in the technology and if you have any good (tech-focused) resources on the new developments in blockchain technologies.
Disclaimer: I don't really follow this field, and when I refer to blockchain, I should really clarify it to mean Bitcoin v1.0
Yeah, blockchain and bitcoin aren't synonymous. I learned a lot by implementing my own (non-bitcoin-related) blockchain.
Yeah, an article or blog post may seem more appropriate, but I worked as a teacher and the most important thing I learned is that everyone responds differently; there's no single method that works for everyone. Some people will learn better from watching a video than reading. I think the idea of a preferred learning style has been debunked, but people do respond differently to different approaches.
Having said that by way of apology :) - the video that made git really click for me was Steve Smith's presentation Knowledge is Power: Getting out of trouble by understanding Git
Set replay speed to 2x; 15 minutes isn't a bad investment.
There's some more information about the internals of Git on the official Git site: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain
Teach a man to fish huh?
Okay. With Git Internals Packfiles I was able to learn how to get a directory listing of a pack file.
The third column is file size, so you can see it starts to get interesting at the line with "11674692".
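A minimal sketch of that listing step, assuming the command in question is the one the Packfiles chapter uses (git verify-pack -v). The throwaway repository and its file names are made up so the commands have something to list; on a real clone you would point at the existing .idx file next to the .pack.

```shell
# Throwaway repo (hypothetical contents) so there is a pack to inspect.
repo=$(mktemp -d) && cd "$repo" && git init -q
head -c 20000 /dev/urandom > big.bin   # incompressible stand-in for a large binary
echo "hello" > small.txt
git add . && git -c user.name=demo -c user.email=demo@example.com commit -qm "demo"
git gc -q                              # pack the loose objects into .git/objects/pack/

# Columns: object ID, type, size, size-in-pack, offset-in-pack.
# Keep only the object lines and sort on the size column, largest first.
listing=$(git verify-pack -v .git/objects/pack/pack-*.idx \
  | grep -E ' (blob|tree|commit|tag) ' | sort -k3,3nr)
echo "$listing" | head -n 5
```

Sorting on the third column puts the suspicious objects at the top, which is a quicker route to the big entries than reading the listing in pack order.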
Then using Git Internals Git References, I was able to extract a few of those objects, specifically:
They are both zip files.
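The exact commands used here did not survive in the thread, so the following is only a sketch of one way to pull a single object out by its ID, using git cat-file. The repository and the pretend zip file are hypothetical. As an aside, git unpack-file, despite the name, takes a blob ID rather than a pack file, which would explain errors from naive usage.

```shell
# Throwaway repo (hypothetical contents) holding one binary-ish file.
repo=$(mktemp -d) && cd "$repo" && git init -q
printf 'PK\003\004 pretend zip payload\n' > artifact.zip
git add . && git -c user.name=demo -c user.email=demo@example.com commit -qm "add artifact"

# In practice the ID would be copied from the verify-pack listing.
sha=$(git rev-parse HEAD:artifact.zip)
git cat-file -t "$sha"             # object type
git cat-file -s "$sha"             # uncompressed size in bytes

# Write the raw contents somewhere outside the repo for inspection.
out=$(mktemp)
git cat-file blob "$sha" > "$out"
echo "extracted to $out"
```

From there, ordinary tools such as file or unzip -l can be run on the extracted copy.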
Inspecting one of those "index" files, I find it is a rather old executable:
For reference, RHEL 5 shipped with a 2.6.18 kernel when it came out in March 2007.
Near as I can tell, it's a C++ program. Look at those compiled-in strings that have the characteristic gcc mangling of C++ function names:
(The function names are in there because this is "not stripped".)
And this is super curious because the package is a node.js project and has no C or C++ source, nor does the README in the zip mention this unsuffixed index file or C or C++.
This huge pack file apparently has multiple different versions of a zip'ed artifact, and has somehow found ways to diff them and save the differences. This sort of thing is a big strike against keeping all history with a project forever: people accidentally checking in huge binaries.
Yes, or rather, considering that I'm not well-versed in the internals of Git: Teach a man to fish to give me a fish.
Appreciate the detailed explanations of what you're doing and what you found out.
Next day update.
Git Internals Maintenance and Data Recovery shows me how to get a file name for that zip file. But the suggested way to find the commit isn't working.
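For reference, a sketch of the two steps from that chapter as I understand them: map the blob ID to a path with git rev-list --objects, then ask git log which commits touched that path. The repository and file name are hypothetical. This sketch uses --all where the book uses --branches, so that tags and remote-tracking branches are searched too; whether that difference explains the failure described above is only a guess.

```shell
# Throwaway repo (hypothetical contents): a file is added, then deleted.
repo=$(mktemp -d) && cd "$repo" && git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -qm "$1"; }
echo "payload v1" > bundle.zip && git add . && commit "add bundle"
sha=$(git rev-parse HEAD:bundle.zip)
git rm -q bundle.zip && commit "remove bundle"

# 1. Map the blob ID to the path it was stored under.
path=$(git rev-list --objects --all | grep "^$sha" | cut -d' ' -f2-)
echo "$path"

# 2. List every commit on any ref that touched that path, deletion included.
commits=$(git log --all --oneline -- "$path")
echo "$commits"
```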
Git has garbage collection. If a blob isn't referenced by any tree, you can run the GC and it'll clean up that blob and all other de-referenced artifacts.
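A small sketch of what that means in practice, using a made-up repository. Note the limit: gc only removes objects nothing refers to, so a file that was committed and later deleted stays reachable through the older commits and survives.

```shell
# Throwaway repo (hypothetical contents) with one committed file.
repo=$(mktemp -d) && cd "$repo" && git init -q
echo kept > kept.txt && git add . \
  && git -c user.name=demo -c user.email=demo@example.com commit -qm "kept"
kept=$(git rev-parse HEAD:kept.txt)

# Store a blob that no tree, commit, or index entry points at.
orphan=$(echo "never committed" | git hash-object -w --stdin)

# --prune=now skips the usual grace period for unreachable objects.
git gc -q --prune=now
git cat-file -e "$kept" && echo "reachable blob survives"
git cat-file -e "$orphan" 2>/dev/null || echo "unreferenced blob pruned"
```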
I'm clearly not a git expert, but I thought garbage collection happened automatically before sending to a remote depot. Since this is something I've git cloned off of GitHub, it clearly has seen some "remote" motions.
I think this is a zip file that was accidentally -- or misguidedly -- committed some years ago and then updated several times, then finally deleted. But because the whole history is there in the .git directory, it has every version of those binaries committed.
I suspect the proper thing to do here is some sort of soft or hard fork to a new version that allows history to be discarded. The specifics of how this is best done in git, I do not know.
Modifying history in this way is usually done via git filter-branch, or using the BFG. Be aware that, since each commit's hash depends on the hash of its parent(s), this changes the hash of every commit since the one you modify, which can be a bit of a pain to recover from if you have a local copy of the original history.
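A minimal sketch of the filter-branch route, on a made-up repository where bundle.zip stands in for the unwanted binary. It shows the effect described above: the file disappears from every rewritten commit and the commit IDs change.

```shell
# Throwaway repo (hypothetical contents): source file plus an unwanted binary.
repo=$(mktemp -d) && cd "$repo" && git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -qm "$1"; }
echo "source" > app.js && echo "blob" > bundle.zip
git add . && commit "initial import"
echo "more" >> app.js && git add . && commit "edit app"
before=$(git rev-parse HEAD)

# Rewrite every commit on every ref, dropping bundle.zip from each tree.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  'git rm -q --cached --ignore-unmatch bundle.zip' -- --all >/dev/null
after=$(git rev-parse HEAD)

echo "HEAD before: $before"
echo "HEAD after:  $after"
git log --oneline -- bundle.zip    # prints nothing: no rewritten commit has it
```

The old commits are still on disk afterwards, because filter-branch keeps backups under refs/original and the reflog still points at them. The space only comes back once those are gone and gc has pruned the objects.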
Perhaps not exactly what you are looking for, but a very good article on some metadata is also Making Sense of Git in a Legal Context.
Okay, I've read it now. This can be boiled down to "git blame is misleading" and "Linux kernel commit messages can't be trusted to credit authorship as defined by copyright law." There is nothing specific about git as a technical tool; these faults would exist in any other SCM that I have used.
Pretty much, yeah. But you’d be surprised how often this pops up: “Git is essentially an immutable ledger like a blockchain and automatically traces copyright etc., you don’t need to do that by hand any more,” just because it’s a bit more fancy than older VCSes.
The authors are gathering feedback and intend to roll out an improved version as well.
I'm actually curious how much rebase can screw up attributions, and thought the paper might offer at least case studies in that. Because I don't think rebase-type rewriting of history exists in most SCM tools, not that one can't achieve the functional equivalent through manual diffs and patches.
It does, and royally so. That's why you cannot trust a Git repo unless you have had a clone of it since forever (and perhaps even backups).
If it is git blame that interests you, you should check out cregit - it actually does what git blame should (still garbage in = garbage out applies).
The title alone has my attention, thanks.
I found this explanation quite accessible and concise, if a little light on deep detail: https://medium.freecodecamp.org/understanding-git-for-real-by-exploring-the-git-directory-1e079c15b807
There's also http://think-like-a-git.net/ which is more in-depth.
The first is not at all in-depth enough. The second goes over the graph theory view of git in detail, a view I understand already, even if I'm not always certain of the command line to get where I want to go. That's the stated goal, how to think so git actions make sense. I'm coming from a different angle, which is more archeology than new construction.
If you've seen my post where I figured out the way to find the binary files in the pack file, you can see where I am now in my understanding. Git has this stuff, but git log doesn't show me any commits that look likely to be relevant. (History begins in January 2017, but I suspect this stuff is older.)
Internally git seems to use SHA1 hashes for both commits (collection of file states, the nodes in the graph view) and for file references (every change to every file is represented by a different string). This is a little confusing because the two look exactly alike.
So where I am now, given a file SHA1, how do I find the corresponding commit? The usual access method is the reverse, given a commit, what are the corresponding files?
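One possible answer, sketched on a made-up repository: git cat-file -t says which kind of object an ID names, and git log --find-object lists the commits where a given blob appears or disappears. The latter assumes Git 2.16 or newer; on older versions you would have to walk git rev-list and check each tree with git ls-tree instead.

```shell
# Throwaway repo (hypothetical contents): one file, two versions.
repo=$(mktemp -d) && cd "$repo" && git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -qm "$1"; }
echo "v1" > index && git add . && commit "add index"
blob=$(git rev-parse HEAD:index)
echo "v2" > index && git add . && commit "update index"

# Same 40-hex format, different object types.
git cat-file -t "$blob"   # blob
git cat-file -t HEAD      # commit

# Commits on any ref in which this exact blob is introduced or removed.
hits=$(git log --all --format='%h %s' --find-object="$blob")
echo "$hits"
```

The first hit in history order is the commit that introduced that version of the file, and the later one is where it was replaced or deleted.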