17 votes

What's in a git repo?

Posted June 24, 2018 by Elijah

Tags: git, unix, ask.help, version control

Okay, I know the obvious answer is the history of the files. But how can I, from the command line, really understand what is hiding inside that .git directory?

Today I was doing one of my periodic disk space audits, trying to figure out where my usage goes. This comes from having a 64GB drive mounted as /home on my Linux laptop. I found some 15G of old video files to delete today, so I'm no longer as pressed for space. But my interest was piqued by one thing I have downloaded from GitHub that is ~120 megs for a very simple program. Poking around further I find that most of that usage is a single file:

$ ls -lh withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
-r--r--r-- 1 elijah elijah 102M Mar 14 23:28 withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
$ file withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack: Git pack, version 2, 299 objects
$

Is there an unzip or tar xzf equivalent for Git pack files? Naive usage of git unpack-file is only generating errors for me.
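
For what it's worth, the numbers `file` prints there come straight from the pack's 12-byte header: the signature `PACK`, then a 4-byte version and a 4-byte object count, both big-endian. A minimal Python sketch of reading that header follows; the `read_pack_header` name and the synthetic sample bytes are assumptions for illustration, not anything Git ships.

```python
import struct
from typing import Tuple


def read_pack_header(data: bytes) -> Tuple[int, int]:
    """Return (version, object_count) from the first 12 bytes of a pack file."""
    if len(data) < 12:
        raise ValueError("a pack header is 12 bytes long")
    # 4-byte signature, then two unsigned 32-bit integers in network byte order.
    signature, version, count = struct.unpack(">4sII", data[:12])
    if signature != b"PACK":
        raise ValueError("not a Git pack file")
    return version, count


# Synthetic header (assumption) mirroring the `file` output: version 2, 299 objects.
sample = b"PACK" + struct.pack(">II", 2, 299)
print(read_pack_header(sample))
```

On a real pack you would pass the first 12 bytes of the file, e.g. `open(path, "rb").read(12)`. The closest thing to `tar xzf` is probably `git unpack-objects`, which reads a pack on standard input and writes loose objects, though it skips objects the repository already has.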

27 comments

[11]
meghan
June 24, 2018

GitLab has an amazing (31 min) video on their YouTube channel that goes over Git Internals and everything you've ever wanted to know about Git. https://www.youtube.com/watch?v=P6jD966jzlk
tl;dr: git has a lot of command line options abstracted away but available for everyone (albeit hidden away)

24 votes
1. [10]
  Elijah (OP)
  June 24, 2018
  
  31 minute videos strike me as a terrible way to learn command line programs.
  
  5 votes
  1. [8]
    Elijah (OP)
    June 24, 2018
    
    That doesn't mean I won't watch it, but it does mean I'm unlikely to right now.
    
    6 votes
    
    [6]
    meghan
    June 24, 2018
    
    Almost none of what is gone over in the video is necessary to learn git. You could probably learn git in about 5 mins at https://try.github.io/ but you asked about the internals of .git
    
    7 votes
    
    [5]
    Elijah (OP)
    June 24, 2018
    
    I use git regularly. I know all the learn in five minutes stuff. I've been using revision control software since Linux was a twinkle in Linus' eye. I can do sophisticated things with Perforce, but git uses a quite different model that I am not that comfortable with. And I have not learned much of the internals of git.
    
    I mean I'm fine with every repo having local history, which is so much different than Perforce, and harks back to pre-distributed revision control systems like RCS in my mind. But it irks me that git checkout swaps files out from under me. Ugh. I would much rather have parallel trees for branches, although I know that's not in vogue any more.
    
    5 votes
    
    [4]
    moredhel
    June 24, 2018
    
    In its simplest form, git has a structure similar to a blockchain, but rather than being a linked list, it is a DAG (directed acyclic graph).
    
    This is a pretty short, 1000-mile-high overview of the system, but I find it helps in visualising how everything fits together.
    
    Edit: I see that others have posted pretty decent resources on it. I'll leave my comment anyway
    
    3 votes
    
    [3]
    sid
    June 26, 2018
    
    A blockchain is also a DAG.
    
    [2]
    moredhel
    June 26, 2018
    
    I'll be honest, I only read the original whitepaper and some accompanying implementations of blockchain in its infancy, and from what I understood of the concept I saw it as a linked list. Granted, this is technically a DAG, but it has the limitation that each node may have a max of two edges.
    
    I have heard through the grapevine that Blockchain is changing its technological underpinnings to something more sustainable and scalable (and hopefully more energy-efficient) but I find it hard to read anything on the topic. Mostly because as soon as I start looking for blockchain related discussions or content it very quickly turns into a flamewar of speculation and namecalling.
    
    Despite this, I am actually interested in the technology, so if you have any good (tech-focused) resources on the new developments in blockchain technologies, I'd like to see them.
    
    Disclaimer: I don't really follow this field, and when I refer to blockchain, I should really clarify it to mean Bitcoin v1.0
    
    sid
    June 26, 2018
    
    Yeah, blockchain and bitcoin aren't synonymous. I learned a lot by implementing my own (non-bitcoin-related) blockchain.
    
    1 vote
    
    bhrgunatha
    June 24, 2018
    
    Yeah, an article or blog post may seem more appropriate, but I worked as a teacher and the most important thing I learned is that everyone responds differently; there's no single method that works for everyone. Some people will learn better from watching a video than reading. I think the idea of a preferred learning style has been debunked, but people do respond differently to different approaches.
    
    Having said that by way of apology :) - the video that made git really click for me was Steve Smith's presentation Knowledge is Power: Getting out of trouble by understanding Git
    
    2 votes
  2. bme
    June 24, 2018
    
    set replay speed to 2x, 15 minutes isn't a bad investment.
[7]
hans
June 24, 2018

There's some more information about the internals of Git on the official Git site: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

5 votes
1. [6]
  Elijah (OP)
  June 24, 2018
  
  Teach a man to fish huh?
  
  Okay. With Git Internals Packfiles I was able to learn how to get a directory listing of a pack file.
  
  $ git verify-pack -v .git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
  ...
  c2286dca4c913602e6e8c4e7a21cb5615f7aa0ac blob 104 110 23237 3 4f569aff3acb6175e94827f69868de6a7a8654c2
  4df9ccab88ff6c36422d87f5e119e9b04da1332d tree 78 83 23347
  b3b5f502870687f4a8b18b3ca61096b1f5745a87 blob 11674692 11572461 23430
  81ecae4bf177ed325c19826d77160dddcfc7d479 blob 3947317 3876239 11595891 1 b3b5f502870687f4a8b18b3ca61096b1f5745a87
  a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4 blob 3592695 3528486 15472130 2 81ecae4bf177ed325c19826d77160dddcfc7d479
  8ad29edd5cc8e09f5719b3a5ee2e299e2f28a88a blob 3781788 3712794 19000616 3 a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
  ...
  
  The third column is file size, so you can see it starts to get interesting at the line with "11674692".
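
Eyeballing a few hundred lines for the big numbers gets old quickly, so here is a rough Python sketch that parses `git verify-pack -v` output and picks out the largest entry. The `PackEntry` and `parse_verify_pack` names are made up for this example, and the sample text is just the lines quoted above. As I understand it, the size column holds the size of the delta for entries that list a base object, so treat the ranking as a hint rather than exact file sizes.

```python
from typing import List, NamedTuple, Optional

OBJECT_TYPES = {"commit", "tree", "blob", "tag"}


class PackEntry(NamedTuple):
    sha: str
    kind: str
    size: int            # third column as printed by verify-pack
    packed_size: int     # bytes the entry occupies inside the pack
    offset: int          # byte offset of the entry within the pack
    depth: Optional[int]  # delta chain depth, only present for deltified entries
    base: Optional[str]   # base object id, only present for deltified entries


def parse_verify_pack(text: str) -> List[PackEntry]:
    """Parse object lines, skipping summary lines and anything unrecognised."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        # Object lines have 5 fields, or 7 when a depth and base id follow.
        if len(fields) not in (5, 7) or fields[1] not in OBJECT_TYPES:
            continue
        depth = int(fields[5]) if len(fields) == 7 else None
        base = fields[6] if len(fields) == 7 else None
        entries.append(PackEntry(fields[0], fields[1], int(fields[2]),
                                 int(fields[3]), int(fields[4]), depth, base))
    return entries


SAMPLE = """\
...
c2286dca4c913602e6e8c4e7a21cb5615f7aa0ac blob 104 110 23237 3 4f569aff3acb6175e94827f69868de6a7a8654c2
4df9ccab88ff6c36422d87f5e119e9b04da1332d tree 78 83 23347
b3b5f502870687f4a8b18b3ca61096b1f5745a87 blob 11674692 11572461 23430
81ecae4bf177ed325c19826d77160dddcfc7d479 blob 3947317 3876239 11595891 1 b3b5f502870687f4a8b18b3ca61096b1f5745a87
a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4 blob 3592695 3528486 15472130 2 81ecae4bf177ed325c19826d77160dddcfc7d479
8ad29edd5cc8e09f5719b3a5ee2e299e2f28a88a blob 3781788 3712794 19000616 3 a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
...
"""

entries = parse_verify_pack(SAMPLE)
largest = max(entries, key=lambda e: e.size)
print(largest.sha, largest.size)
```

Sorting the whole list with `sorted(entries, key=lambda e: e.size, reverse=True)` would give a top-N view instead of a single winner.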
  
  Then using Git Internals Git References, I was able to extract a few of those objects, specifically:
  
  $ git cat-file -p b3b5f502870687f4a8b18b3ca61096b1f5745a87 > /tmp/b3b5f502870687f4a8b18b3ca61096b1f5745a87
  $ git cat-file -p a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4 > /tmp/a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
  
  They are both zip files.
  
  $ unzip -l /tmp/b3b5f502870687f4a8b18b3ca61096b1f5745a87
  Archive:  /tmp/b3b5f502870687f4a8b18b3ca61096b1f5745a87
    Length      Date    Time    Name
  ---------  ---------- -----   ----
       1076  2017-02-04 11:07   LICENSE
       8420  2018-02-19 07:42   README.md
      10831  2018-02-19 07:41   README.ja.md
   33076979  2018-02-26 09:22   index
  ---------                     -------
   33097306                     4 files
  $ unzip -l /tmp/a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
  Archive:  /tmp/a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
    Length      Date    Time    Name
  ---------  ---------- -----   ----
       8420  2018-02-19 07:42   README.md
      10831  2018-02-19 07:41   README.ja.md
   33079580  2018-02-27 23:57   index
       1076  2017-02-04 11:07   LICENSE
  ---------                     -------
   33099907                     4 files
  
  Inspecting one of those "index" files, I find it is a rather old executable:
  
  $ file /tmp/index
  /tmp/index: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.18, BuildID[sha1]=5147ab7517c353e744eb5e335ce9927504ee1a9c, not stripped
  
  For reference, RHEL 5 shipped with a 2.6.18 kernel when it came out in March 2007.
  
  Near as I can tell, it's a C++ program. Look at those compiled in strings that have characteristic gcc mangling of C++ function names:
  
  $ strings /tmp/index | head -40000 | tail -1
  _ZN2v88internal8compiler21JSOperatorGlobalCache26GreaterThanOrEqualOperatorILNS0_20CompareOperationHintE8EED2Ev
  $ strings /tmp/index | grep -c ^_
  86178
  
  (The function names are in there because this is "not stripped".)
  
  And this is super curious because the package is a node.js project with no C or C++ source, and the README in the zip mentions neither this unsuffixed index file nor C or C++.
  
  This huge pack file apparently has multiple different versions of a zip'ed artifact, and has somehow found ways to diff them and save the differences. This sort of thing is a big strike against keeping all history with a project forever: people accidentally checking in huge binaries.
  
  8 votes
  1. hans
    June 25, 2018
    
    Teach a man to fish huh?
    
    Yes, or rather, considering that I'm not well-versed in the internals of Git: Teach a man to fish to give me a fish.
    
    Appreciate the detailed explanations of what you're doing and what you found out.
    
    2 votes
  2. Elijah (OP)
    June 25, 2018
    
    Next day update.
    
    Git Internals Maintenance and Data Recovery shows me how to get a file name for that zip file. But the suggested way to find the commit isn't working.
    
    $ git rev-list --objects --all | grep $id
    b3b5f502870687f4a8b18b3ca61096b1f5745a87 dest/linux-x86.zip
    $ git log --oneline --branches -- dest/linux-x86.zip
    $
    
    1 vote
  3. [3]
    FrozenInferno
    June 24, 2018
    
    This sort of thing is a big strike against keeping all history with a project forever: people accidentally checking in huge binaries.
    
    Git has garbage collection. If a blob isn't referenced by any tree, you can run the GC and it'll clean up that blob and all other de-referenced artifacts.
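
To make "referenced" concrete: the collector keeps whatever can be reached by walking from the refs through commits and trees, so a blob only becomes garbage once no surviving commit leads to it. Below is a toy Python sketch of that reachability walk. The object names and the graph are entirely hypothetical, and real `git gc` also takes things like reflogs and a grace period into account, which this ignores.

```python
from typing import Dict, List, Set


def reachable(objects: Dict[str, List[str]], refs: Dict[str, str]) -> Set[str]:
    """Return the ids of every object reachable from the given refs."""
    seen = set()
    stack = list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        # Follow the outgoing edges: commit -> tree and parents, tree -> blobs.
        stack.extend(objects.get(oid, []))
    return seen


# Hypothetical object graph: each id maps to the ids it points at.
objects = {
    "commit2": ["tree2", "commit1"],   # newer commit, zip already deleted
    "commit1": ["tree1"],              # older commit that still contains the zip
    "tree2": ["blob_readme"],
    "tree1": ["blob_readme", "blob_big_zip"],
    "blob_readme": [],
    "blob_big_zip": [],
    "blob_orphan": [],                 # never referenced by any tree
}
refs = {"refs/heads/master": "commit2"}

live = reachable(objects, refs)
unreachable = set(objects) - live
print(sorted(unreachable))
```

In this toy graph only `blob_orphan` would be collectable; `blob_big_zip` stays alive because the old commit is still part of the branch's history, even though the newest tree no longer mentions it.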
    
    [2]
    Elijah (OP)
    June 25, 2018
    
    I'm clearly not a git expert, but I thought garbage collection happened automatically before sending to a remote depot. Since this is something I've git cloned off of GitHub, it clearly has seen some "remote" motions.
    
    I think this is a zip file that was accidentally -- or misguidedly -- committed some years ago and then updated several times, then finally deleted. But because the whole history is there in the .git directory, it has every version of those binaries committed.
    
    I suspect the proper thing to do here is some sort of soft or hard fork to a new version that can allow history to be discarded. The specifics of how this is best done in git, I do not know.
    
    1 vote
    
    unknown user
    June 25, 2018
    
    Modifying history in this way is usually done via git filter-branch, or using the BFG. Be aware that, since each commit's hash depends on the hash of its parent(s), this changes the hash of every commit since the one you modify, which can be a bit of a pain to recover from if you have a local copy of the original history.
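
A small Python sketch of why the hashes cascade, in case it helps. It hashes commit-like objects the way Git frames them (a `commit <length>` header, a NUL byte, then the body), but the body here is a simplification that leaves out the author and committer lines, so the ids will not match anything a real repository produces. The function names and the placeholder tree ids are assumptions for illustration only.

```python
import hashlib
from typing import List


def commit_id(tree: str, parents: List[str], message: str) -> str:
    """Hash a simplified commit body: tree line, parent lines, then the message."""
    lines = ["tree " + tree] + ["parent " + p for p in parents]
    body = ("\n".join(lines) + "\n\n" + message + "\n").encode()
    header = "commit {}".format(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_history(trees: List[str]) -> List[str]:
    """Build a linear history, one commit per tree, and return the commit ids."""
    ids = []
    parents = []  # the root commit has no parent
    for number, tree in enumerate(trees):
        cid = commit_id(tree, parents, "commit {}".format(number))
        ids.append(cid)
        parents = [cid]
    return ids


# Placeholder tree ids (assumption): only the middle commit's tree is changed.
original = build_history(["a" * 40, "b" * 40, "c" * 40])
rewritten = build_history(["a" * 40, "d" * 40, "c" * 40])

for old, new in zip(original, rewritten):
    print(old[:10], new[:10], "same" if old == new else "changed")
```

The first commit keeps its id, the edited one changes, and the third changes too even though its own tree is untouched, because its parent line now names a different id.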
    
    1 vote
[7]
hook
June 24, 2018

Perhaps not exactly what you are looking for, but a very good article on some metadata is also Making Sense of Git in a Legal Context.

4 votes
1. [5]
  Elijah (OP)
  June 25, 2018
  
  Okay, I've read it now. This can be boiled down to "git blame is misleading" and "Linux kernel commit messages can't be trusted to credit authorship as defined by copyright law." There is nothing specific about git as a technical tool; these faults would exist in any other SCM that I have used.
  
  2 votes
  1. [4]
    hook
    June 25, 2018
    
    Pretty much, yeah. But you’d be surprised how often the claim pops up that “Git is essentially an immutable ledger like a blockchain and automatically traces copyright etc., you don’t need to do that by hand any more.”, just because it’s a bit more fancy than older VCSes.
    
    The authors are gathering feedback and intend to roll out an improved version as well.
    
    [2]
    Elijah (OP)
    June 26, 2018
    
    I'm actually curious how much rebase can screw up attributions, and thought the paper might offer at least case studies in that. Because I don't think rebase type rewriting of history exists in most SCM tools, not that one can't achieve the functional equivalent through manual diffs and patches.
    
    hook
    June 26, 2018
    
    It does, and royally so. That's why you cannot trust a Git repo unless you have had a clone of it since forever (and perhaps backups even).
    
    hook
    June 26, 2018
    
    If it is git blame that interests you, you should check out cregit - it actually does what git blame should (still garbage in = garbage out applies).
2. Elijah (OP)
  June 25, 2018
  
  The title alone has my attention, thanks.
[2]
tan
June 24, 2018

I found this explanation quite accessible and concise, if a little light on deep detail: https://medium.freecodecamp.org/understanding-git-for-real-by-exploring-the-git-directory-1e079c15b807

There's also http://think-like-a-git.net/ which is more in-depth.

1 vote
1. Elijah (OP)
  June 25, 2018
  
  The first is not at all in-depth enough. The second goes over the graph theory view of git in detail, a view I understand already, even if I'm not always certain of the command line to get where I want to go. That's the stated goal, how to think so git actions make sense. I'm coming from a different angle, which is more archeology than new construction.
  
  If you've seen my post where I figured out the way to find the binary files in the pack file, you can see where I am now in my understanding. Git has this stuff, but git log doesn't show me any commits that look likely to be relevant. (History begins in January 2017, but I suspect this stuff is older.)
  
  Internally git seems to use SHA1 hashes for both commits (collection of file states, the nodes in the graph view) and for file references (every change to every file is represented by a different string). This is a little confusing because the two look exactly alike.
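
The reason the two look alike is that every object, whatever its type, is named by the SHA-1 of a short header plus its content; only the type word in the header differs. A minimal Python sketch of that naming scheme follows (the `git_object_id` helper is a name made up for this example, not part of Git):

```python
import hashlib


def git_object_id(kind: str, content: bytes) -> str:
    """Return the id Git assigns to an object of the given type and content."""
    # Header is "<type> <size in bytes>" followed by a NUL byte.
    header = "{} {}".format(kind, len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Same bytes, different object type, different id.
print(git_object_id("blob", b"test content\n"))
print(git_object_id("commit", b"test content\n"))
```

Given only an id, `git cat-file -t <id>` reports which type of object it names.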
  
  So where I am now, given a file SHA1, how do I find the corresponding commit? The usual access method is the reverse, given a commit, what are the corresponding files?
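
One possible route for the blob-to-commit direction, assuming Git 2.16 or newer: `git log` accepts `--find-object=<id>`, which limits the log to commits that add or remove that object. Pairing it with `--all` matters here, since `--branches` only walks local branches while `--all` also covers remote-tracking branches and tags. The Python wrapper below is only a sketch; the function names are invented for this example, and it assumes `git` is on the PATH when `commits_touching_object` is actually called.

```python
import string
import subprocess
from typing import List


def find_object_command(object_id: str) -> List[str]:
    """Build the git command that lists commits adding or removing an object."""
    if not object_id or any(c not in string.hexdigits for c in object_id):
        raise ValueError("expected a hexadecimal object id")
    return ["git", "log", "--all", "--oneline", "--find-object=" + object_id]


def commits_touching_object(repo_dir: str, object_id: str) -> List[str]:
    """Run the command inside repo_dir and return one line per matching commit."""
    result = subprocess.run(find_object_command(object_id), cwd=repo_dir,
                            stdout=subprocess.PIPE, universal_newlines=True,
                            check=True)
    return result.stdout.splitlines()


# Only prints the command; nothing is executed against a repository here.
print(" ".join(find_object_command("b3b5f502870687f4a8b18b3ca61096b1f5745a87")))
```

Calling `commits_touching_object("withExEditorHost", "b3b5f502870687f4a8b18b3ca61096b1f5745a87")` would then run it against the clone from the original post; whether it turns anything up depends on which refs the clone actually carries.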
  
  1 vote