17 votes

What's in a git repo?

Okay, I know the obvious answer is the history of the files. But how can I, from the command line, really understand what is hiding inside that .git directory?

Today I was doing one of my periodic disk space audits, trying to figure out where my usage goes. This comes from having a 64GB drive mounted as /home on my Linux laptop. I found some 15G of old video files to delete today, so I'm no longer as pressed for space. But my interest was piqued by one thing I downloaded from GitHub that takes up ~120 megs for a very simple program. Poking around further, I found that most of that usage is a single file:

$ ls -lh withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
-r--r--r-- 1 elijah elijah 102M Mar 14 23:28 withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
$ file withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
withExEditorHost/.git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack: Git pack, version 2, 299 objects
$

Is there an unzip or tar xzf equivalent for Git pack files? Naive usage of git unpack-file is only generating errors for me.

27 comments

  1. [11]
    meghan

    GitLab has an amazing (31 min) video on their YouTube channel that goes over Git Internals and everything you've ever wanted to know about Git. https://www.youtube.com/watch?v=P6jD966jzlk
    tl;dr: git has a lot of command line options abstracted away but available for everyone (albeit hidden away)

    24 votes
    1. [10]
      Elijah

      31 minute videos strike me as a terrible way to learn command line programs.

      5 votes
      1. [8]
        Elijah

        That doesn't mean I won't watch it, but it does mean I'm unlikely to right now.

        6 votes
        1. [6]
          meghan

          Almost none of what is gone over in the video is necessary to learn git. You could probably learn git in about 5 mins at https://try.github.io/ but you asked about the internals of .git

          7 votes
          1. [5]
            Elijah

            I use git regularly. I know all the learn in five minutes stuff. I've been using revision control software since Linux was a twinkle in Linus' eye. I can do sophisticated things with Perforce, but git uses a quite different model that I am not that comfortable with. And I have not learned much of the internals of git.

            I mean I'm fine with every repo having local history, which is so much different than Perforce, and harks back to pre-distributed revision control systems like RCS in my mind. But it irks me that git checkout swaps files out from under me. Ugh. I would much rather have parallel trees for branches, although I know that's not in vogue any more.

            5 votes
            1. [4]
              moredhel

              In its simplest form, git has a similar structure to the blockchain, but rather than being a linked-list, it is a DAG.

              This is a pretty short, 1000-mile-high overview of the system, but I find it helps in visualising how everything fits together.
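To make the DAG point concrete, here is a minimal sketch in a throwaway repository (the branch and file names are made up for illustration). It shows that each commit object records its parents by hash, and that a merge commit has two of them, which is what makes the history a graph and not a simple list.

```shell
# Throwaway repository; branch and file names are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "a" > a.txt && git add a.txt && git commit -q -m "A"
trunk=$(git symbolic-ref --short HEAD)   # master or main, depending on config

git checkout -q -b feature
echo "b" > b.txt && git add b.txt && git commit -q -m "B"

git checkout -q "$trunk"
echo "c" > c.txt && git add c.txt && git commit -q -m "C"

# A real merge commit records both lines of history as parents.
git merge -q --no-ff -m "merge feature" feature

# Each commit object names its parent(s) by hash: the edges of the DAG.
git cat-file -p HEAD
parent_count=$(git cat-file -p HEAD | grep -c '^parent ')
root=$(git rev-list --max-parents=0 HEAD)
root_parent_count=$(git cat-file -p "$root" | grep -c '^parent ' || true)
echo "merge commit parents: $parent_count, root commit parents: $root_parent_count"
```

Running `git log --graph --oneline --all` in the same directory draws the same structure as ASCII art.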

              Edit: I see that others have posted pretty decent resources on it. I'll leave my comment anyway

              3 votes
              1. [3]
                sid

                A blockchain is also a DAG.

                1. [2]
                  moredhel

                  I'll be honest, I only read the original whitepaper and some accompanying implementations of blockchain in its infancy. And from what I understood of the concept I saw it as a linked-list. Granted, this is technically a DAG, but it has the limitation that each node may have a max of two edges.

                  I have heard through the grapevine that Blockchain is changing its technological underpinnings to something more sustainable and scalable (and hopefully more energy-efficient) but I find it hard to read anything on the topic. Mostly because as soon as I start looking for blockchain related discussions or content it very quickly turns into a flamewar of speculation and name-calling.

                  Despite this, I am actually interested in the technology, and would welcome any good (tech-focused) resources you have on the new developments in blockchain technologies.

                  Disclaimer: I don't really follow this field, and when I refer to blockchain, I should really clarify it to mean Bitcoin v1.0

                  1. sid

                    Yeah, blockchain and bitcoin aren't synonymous. I learned a lot by implementing my own (non-bitcoin-related) blockchain.

                    1 vote
        2. bhrgunatha

          Yeah, an article or blog post may seem more appropriate, but I worked as a teacher and the most important thing I learned is that everyone responds differently; there's no single method that works for everyone. Some people will learn better from watching a video than reading. I think the idea of a preferred learning style has been debunked, but people do respond differently to different approaches.

          Having said that by way of apology :) - the video that made git really click for me was Steve Smith's presentation Knowledge is Power: Getting out of trouble by understanding Git

          2 votes
      2. bme

        Set the playback speed to 2x; 15 minutes isn't a bad investment.

  2. [7]
    hans

    There's some more information about the internals of Git on the official Git site: https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain

    5 votes
    1. [6]
      Elijah

      Teach a man to fish huh?

      Okay. With Git Internals Packfiles I was able to learn how to get a directory listing of a pack file.

      $ git verify-pack -v .git/objects/pack/pack-df07816cd15fb091439112029c28ebc366501652.pack
      ...
      c2286dca4c913602e6e8c4e7a21cb5615f7aa0ac blob   104 110 23237 3 4f569aff3acb6175e94827f69868de6a7a8654c2
      4df9ccab88ff6c36422d87f5e119e9b04da1332d tree   78 83 23347
      b3b5f502870687f4a8b18b3ca61096b1f5745a87 blob   11674692 11572461 23430
      81ecae4bf177ed325c19826d77160dddcfc7d479 blob   3947317 3876239 11595891 1 b3b5f502870687f4a8b18b3ca61096b1f5745a87
      a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4 blob   3592695 3528486 15472130 2 81ecae4bf177ed325c19826d77160dddcfc7d479
      8ad29edd5cc8e09f5719b3a5ee2e299e2f28a88a blob   3781788 3712794 19000616 3 a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
      ...
      

      The third column is file size, so you can see it starts to get interesting at the line with "11674692".
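For anyone repeating this, a rough sketch of letting sort find the big entries instead of eyeballing the listing. It builds a throwaway repository with made-up file names so it can run anywhere. As far as I can tell, for deltified entries (the lines that end with a depth and a base hash) the size column is the size of the delta, not of the full object.

```shell
# Throwaway repository so the commands have something to inspect.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

head -c 200000 /dev/urandom > big.bin    # random data, so it stays large
echo "hello" > small.txt
git add . && git commit -q -m "add files"
git gc -q                                 # packs the loose objects

# verify-pack -v prints: hash, type, size, size-in-pack, offset [, depth, base]
# Keep only the object lines, sort numerically on column 3, take the largest.
largest=$(git verify-pack -v .git/objects/pack/pack-*.idx \
  | grep -E '^[0-9a-f]{40} ' | sort -k3,3nr | head -1)
echo "$largest"
```

Against a real clone, only the `git verify-pack` pipeline is needed; replacing `head -1` with `head -10` gives a top-ten list.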

      Then using Git Internals Git References, I was able to extract a few of those objects, specifically:

      $ git cat-file -p b3b5f502870687f4a8b18b3ca61096b1f5745a87 > /tmp/b3b5f502870687f4a8b18b3ca61096b1f5745a87
      $ git cat-file -p a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4 > /tmp/a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
      

      They are both zip files.

      $ unzip -l /tmp/b3b5f502870687f4a8b18b3ca61096b1f5745a87
      Archive:  /tmp/b3b5f502870687f4a8b18b3ca61096b1f5745a87
        Length      Date    Time    Name
      ---------  ---------- -----   ----
           1076  2017-02-04 11:07   LICENSE
           8420  2018-02-19 07:42   README.md
          10831  2018-02-19 07:41   README.ja.md
       33076979  2018-02-26 09:22   index
      ---------                     -------
       33097306                     4 files
      $ unzip -l /tmp/a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
      Archive:  /tmp/a50b2ae4e6a7f53c3fa89f8a03b4c4436f440ca4
        Length      Date    Time    Name
      ---------  ---------- -----   ----
           8420  2018-02-19 07:42   README.md
          10831  2018-02-19 07:41   README.ja.md
       33079580  2018-02-27 23:57   index
           1076  2017-02-04 11:07   LICENSE
      ---------                     -------
       33099907                     4 files
      
      

      Inspecting one of those "index" files, I find it is a rather old executable:

      $ file /tmp/index
      /tmp/index: ELF 32-bit LSB executable, Intel 80386, version 1 (GNU/Linux), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.18, BuildID[sha1]=5147ab7517c353e744eb5e335ce9927504ee1a9c, not stripped
      

      For reference, RHEL 5 shipped with a 2.6.18 kernel when it came out in March 2007.

      Near as I can tell, it's a C++ program. Look at those compiled-in strings, which have the characteristic gcc mangling of C++ function names:

      $ strings /tmp/index | head -40000 | tail -1
      _ZN2v88internal8compiler21JSOperatorGlobalCache26GreaterThanOrEqualOperatorILNS0_20CompareOperationHintE8EED2Ev
      $ strings /tmp/index | grep -c ^_
      86178
      

      (The function names are in there because this is "not stripped".)

      And this is super curious because the package is a node.js project, and has no C or C++ source, nor does the README in the zip mention this unsuffixed index file or C or C++.

      This huge pack file apparently has multiple different versions of a zip'ed artifact, and has somehow found ways to diff them and save the differences. This sort of thing is a big strike against keeping all history with a project forever: people accidentally checking in huge binaries.

      8 votes
      1. hans

        Teach a man to fish huh?

        Yes, or rather, considering that I'm not well-versed in the internals of Git: Teach a man to fish to give me a fish.

        Appreciate the detailed explanations of what you're doing and what you found out.

        2 votes
      2. Elijah

        Next day update.

        Git Internals Maintenance and Data Recovery shows me how to get a file name for that zip file. But the suggested way to find the commit isn't working.

        $ git rev-list --objects --all | grep $id
        b3b5f502870687f4a8b18b3ca61096b1f5745a87 dest/linux-x86.zip
        $ git log --oneline --branches -- dest/linux-x86.zip
        $
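One possible explanation, offered as a guess and not a diagnosis: `git rev-list --all` walks every ref, while `git log --branches` only walks local branches (refs/heads). If the file is reachable only through a tag or a remote-tracking branch, the first command finds it and the second prints nothing. A small sketch in a throwaway repository, with made-up names, that reproduces that pattern:

```shell
# Throwaway repository; file, branch, and tag names are illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "readme" > README.md
git add . && git commit -q -m "initial"
trunk=$(git symbolic-ref --short HEAD)

git checkout -q -b release
echo "binary" > app.zip
git add . && git commit -q -m "add app.zip"
git tag v1
git checkout -q "$trunk"
git branch -q -D release    # the commit is now reachable only through the tag

only_branches=$(git log --oneline --branches -- app.zip)
all_refs=$(git log --oneline --all -- app.zip)
echo "with --branches: '${only_branches}'"
echo "with --all:      '${all_refs}'"
```

If that is what is going on, `git log --oneline --all -- dest/linux-x86.zip` would be the thing to try in the real clone.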
        
        1 vote
      3. [3]
        FrozenInferno

        This sort of thing is a big strike against keeping all history with a project forever: people accidentally checking in huge binaries.

        Git has garbage collection. If a blob isn't referenced by any tree, you can run the GC and it'll clean up that blob and all other de-referenced artifacts.
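A minimal sketch of what that looks like, in a throwaway repository with made-up contents. It writes one blob that nothing references and one that is committed, then runs the GC. Note the limit: only the unreferenced blob goes away, so this does not help when old commits still point at the binaries.

```shell
# Throwaway repository; contents are illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "kept" > kept.txt && git add kept.txt && git commit -q -m "keep me"
kept_blob=$(git rev-parse HEAD:kept.txt)

# Write a blob straight into the object store without adding it to any tree.
orphan_blob=$(echo "nobody references me" | git hash-object -w --stdin)

git gc -q --prune=now    # --prune=now skips the default two-week grace period

if git cat-file -e "$orphan_blob" 2>/dev/null; then orphan=present; else orphan=gone; fi
if git cat-file -e "$kept_blob" 2>/dev/null; then kept=present; else kept=gone; fi
echo "orphan blob: $orphan, committed blob: $kept"
```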

        1. [2]
          Elijah

          I'm clearly not a git expert, but I thought garbage collection happened automatically before sending to a remote depot. Since this is something I've git cloned off of GitHub, it clearly has seen some "remote" motions.

          I think this is a zip file that was accidentally -- or misguidedly -- committed some years ago and then updated several times, then finally deleted. But because the whole history is there in the .git directory, it has every version of those binaries committed.

          I suspect the proper thing to do here is some sort of soft or hard fork to a new version that can allow history to be discarded. The specifics of how this is best done in git, I do not know.

          1 vote
          1. unknown user

            Modifying history in this way is usually done via git filter-branch, or using the BFG. Be aware that, since each commit's hash depends on the hash of its parent(s), this changes the hash of every commit since the one you modify, which can be a bit of a pain to recover from if you have a local copy of the original history.
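A rough sketch of the filter-branch route, run in a throwaway repository where `big.zip` is a made-up stand-in for the stray binary. It follows the pattern from the git filter-branch documentation and shows the hash change mentioned above. Try it on a scratch clone first, not on the only copy of a repository.

```shell
# Throwaway repository; big.zip is a hypothetical stand-in for the binary.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "docs" > README.md && git add README.md && git commit -q -m "initial"
head -c 50000 /dev/urandom > big.zip
git add big.zip && git commit -q -m "add big.zip"
git rm -q big.zip && git commit -q -m "remove big.zip"
old_head=$(git rev-parse HEAD)

# Rewrite every commit on every ref, dropping big.zip from each index.
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip filter-branch's warning and delay
git filter-branch -f --index-filter \
  'git rm -q --cached --ignore-unmatch big.zip' -- --all >/dev/null 2>&1

new_head=$(git rev-parse HEAD)
remaining=$(git log --oneline HEAD -- big.zip)
echo "old HEAD: $old_head"
echo "new HEAD: $new_head"
echo "commits still touching big.zip: '${remaining}'"
```

The old objects stay in .git until the refs/original backups and reflog entries are gone and git gc has run, so the disk space does not come back right after the rewrite.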

            1 vote
  3. [7]
    hook

    Perhaps not exactly what you are looking for, but a very good article on some metadata is also Making Sense of Git in a Legal Context.

    4 votes
    1. [5]
      Elijah

      Okay, I've read it now. This can be boiled down to "git blame is misleading" and "Linux kernel commit messages can't be trusted to credit authorship as defined by copyright law." There is nothing specific about git as a technical tool; these faults would exist in any other SCM that I have used.

      2 votes
      1. [4]
        hook

        Pretty much, yeah. But you’d be surprised how often this claim pops up: “Git is essentially an immutable ledger like a blockchain and automatically traces copyright etc., you don’t need to do that by hand any more.”, just because it’s a bit more fancy than older VCSes.

        The authors are gathering feedback and intend to roll out an improved version as well.

        1. [2]
          Elijah

          I'm actually curious how much rebase can screw up attributions, and thought the paper might offer at least case studies in that. Because I don't think rebase-type rewriting of history exists in most SCM tools, not that one can't achieve the functional equivalent through manual diffs and patches.

          1. hook

            It does, and royally so. That's why you cannot trust a Git repo unless you have had a clone of it since forever (and perhaps backups even).

        2. hook

          If it is git blame that interests you, you should check out cregit - it actually does what git blame should (still garbage in = garbage out applies).

    2. Elijah

      The title alone has my attention, thanks.

  4. [2]
    tan

    I found this explanation quite accessible and concise, if a little light on deep detail: https://medium.freecodecamp.org/understanding-git-for-real-by-exploring-the-git-directory-1e079c15b807

    There's also http://think-like-a-git.net/ which is more in-depth.

    1 vote
    1. Elijah

      The first is not at all in-depth enough. The second goes over the graph theory view of git in detail, a view I understand already, even if I'm not always certain of the command line to get where I want to go. That's the stated goal, how to think so git actions make sense. I'm coming from a different angle, which is more archeology than new construction.

      If you've seen my post where I figured out the way to find the binary files in the pack file, you can see where I am now in my understanding. Git has this stuff, but git log doesn't show me any commits that look likely to be relevant. (History begins in January 2017, but I suspect this stuff is older.)

      Internally git seems to use SHA1 hashes for both commits (collection of file states, the nodes in the graph view) and for file references (every change to every file is represented by a different string). This is a little confusing because the two look exactly alike.
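The hashes do look alike, but git can report what kind of object each one names. A minimal sketch using `git cat-file -t` in a throwaway repository (the file name is made up):

```shell
# Throwaway repository; the file name is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "hello" > file.txt && git add file.txt && git commit -q -m "first"

commit_id=$(git rev-parse HEAD)
tree_id=$(git rev-parse 'HEAD^{tree}')
blob_id=$(git rev-parse HEAD:file.txt)

# cat-file -t prints the object type for any hash: commit, tree, blob, or tag.
commit_type=$(git cat-file -t "$commit_id")
tree_type=$(git cat-file -t "$tree_id")
blob_type=$(git cat-file -t "$blob_id")
echo "$commit_id $commit_type"
echo "$tree_id $tree_type"
echo "$blob_id $blob_type"
```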

      So here is where I am now: given a file SHA1, how do I find the corresponding commit? The usual access method is the reverse: given a commit, what are the corresponding files?
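One way that might answer this, assuming Git 2.16 or later: `git log --find-object=<hash>` lists the commits where that object appears or disappears. A minimal sketch in a throwaway repository, with made-up file names:

```shell
# Throwaway repository; file names are illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "docs" > README.md && git add README.md && git commit -q -m "initial"
echo "version 1" > data.txt && git add data.txt && git commit -q -m "add data.txt"
added_in=$(git rev-parse HEAD)
blob=$(git rev-parse HEAD:data.txt)      # the hash of version 1's contents
echo "version 2" > data.txt && git add data.txt && git commit -q -m "change data.txt"
replaced_in=$(git rev-parse HEAD)

# --find-object reports commits where the number of occurrences of the
# object changes, so both the adding and the replacing commit show up.
found=$(git log --all --format=%H --find-object="$blob")
echo "$found"
```

In the real clone the equivalent would be `git log --all --find-object=b3b5f502870687f4a8b18b3ca61096b1f5745a87`, using the blob hash found earlier.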

      1 vote