Chapter 9. Git Internals
1. Git is fundamentally a content-addressable file system with a VCS user interface written on top of it.
$ ls
HEAD
branches/
config
description
hooks/
index
info/
objects/
refs/
The branches directory isn't used by newer Git versions, and the description file is only used by the GitWeb program, so don't worry about those. The config file contains your project-specific configuration options, and the info directory keeps a global exclude file for ignored patterns that you don't want to track in a .gitignore file. The hooks directory contains your client- or server-side hook scripts. The objects directory stores all the content for your database, the refs directory stores pointers into commit objects in that data (branches), the HEAD file points to the branch you currently have checked out, and the index file is where Git stores your staging area information.
$ echo 'test content' | git hash-object -w --stdin
d670460b4b4aece5915caf5c68d12f560a9fe3e4
The -w tells hash-object to store the object; otherwise, the command simply tells you what the key would be. --stdin tells the command to read the content from stdin; if you don't specify this, hash-object expects the path to a file. The output from the command is a 40-character checksum hash. You can see how Git has stored your data:
$ find .git/objects -type f
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
Git initially stores the content as a single file per piece of content, named with the SHA-1 checksum of the content and its header. The subdirectory is named with the first 2 characters of the SHA, and the filename is the remaining 38 characters.
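The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function names git_blob_sha1 and loose_object_path are made up for this note, and the header layout (the word blob, a space, the content length, and a null byte) is the one used in the Ruby session later in this chapter.

```python
import hashlib

def git_blob_sha1(content: bytes) -> str:
    """Return the SHA-1 key for a blob: hash of header + content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

def loose_object_path(sha1_hex: str) -> str:
    """First 2 hex characters name the directory, the other 38 the file."""
    return ".git/objects/" + sha1_hex[:2] + "/" + sha1_hex[2:]

# echo appends a newline, so the stored content is b"test content\n".
sha = git_blob_sha1(b"test content\n")
print(sha)
print(loose_object_path(sha))
```

The printed path should match the file that find lists under .git/objects for the same content.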
$ git cat-file -p d670460b4b4aece5915caf5c68d12f560a9fe3e4
test content
$ git cat-file -t 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a
blob
$ git cat-file -p master^{tree}
100644 blob a906cb2a4a904a152e80877d4088654daad0c859    README
100644 blob 8f94139338f9404f26296befa88755fc2598c289    Rakefile
040000 tree 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0    lib
$ git cat-file -p 99f1a6d12cb4b6f19c8655fca46c3ecf317074e0
100644 blob 47c6340d6459e05787f644c2447d2595f5d3a54b    simplegit.rb
Conceptually, the data that Git is storing is like:
$ echo 'version 1' | git hash-object -w --stdin
83baae61804e65cc73a7201a7252750c76066a30
$ git update-index --add --cacheinfo 100644 \
  83baae61804e65cc73a7201a7252750c76066a30 test.txt
You're specifying a mode of 100644, which means it's a normal file. Other options are 100755, which means it's an executable file; and 120000, which specifies a symbolic link. These three modes are the only ones that are valid for files in Git (although other modes are used for directories and submodules).
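These modes are ordinary UNIX mode values written in octal, so Python's stat module can classify them. The sketch below is an illustration only; describe_mode is a hypothetical helper, and 040000 is the tree (directory) mode that appears in the cat-file listings in this chapter.

```python
import stat

def describe_mode(mode_str: str) -> str:
    """Classify a Git tree-entry mode given as an octal string."""
    mode = int(mode_str, 8)
    if stat.S_ISDIR(mode):
        return "directory (tree)"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISREG(mode):
        # Any execute bit set means Git records the file as executable.
        return "executable file" if mode & 0o111 else "normal file"
    return "other"

for m in ("100644", "100755", "120000", "040000"):
    print(m, describe_mode(m))
```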
write-tree automatically creates a tree object from the state of the index if that tree doesn't yet exist:
$ git write-tree
d8329fc1cc938780ffdd9f94e0d364e0ea74f579
$ git cat-file -p d8329fc1cc938780ffdd9f94e0d364e0ea74f579
100644 blob 83baae61804e65cc73a7201a7252750c76066a30    test.txt
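A tree object is hashed the same way as a blob, only with a tree header and a binary body. The sketch below assumes the usual on-disk layout for each entry (mode, space, name, null byte, then the 20 raw bytes of the SHA-1); git_tree_sha1 is a hypothetical name, and the sort is a simplification that ignores Git's special ordering for directory names.

```python
import hashlib

def git_tree_sha1(entries):
    """entries: list of (mode, name, sha1_hex) tuples for one directory level."""
    body = b""
    # Simplification: plain byte-wise sort by name. Git itself compares
    # directory names as if they ended with "/".
    for mode, name, sha1_hex in sorted(entries, key=lambda e: e[1].encode()):
        # Modes are stored without leading zeros ("040000" becomes "40000").
        body += mode.lstrip("0").encode() + b" " + name.encode() + b"\0"
        body += bytes.fromhex(sha1_hex)  # 20 raw bytes, not 40 hex characters
    header = b"tree " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

tree_sha = git_tree_sha1(
    [("100644", "test.txt", "83baae61804e65cc73a7201a7252750c76066a30")]
)
print(tree_sha)
```

If the assumed layout matches what Git writes, the printed value should equal the d8329f tree that write-tree reported for the single test.txt entry.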
You can also call update-index with a file path instead of --cacheinfo:
$ echo 'new file' > new.txt
$ echo 'version 2' > test.txt
$ git update-index test.txt
$ git update-index --add new.txt
$ git write-tree
0155eb4229851634a0f03eb265b69f5a2d56f341
$ git cat-file -p 0155eb4229851634a0f03eb265b69f5a2d56f341
100644 blob fa49b077972391ad58037050f2a75f74e3671e92    new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a    test.txt
Your staging area now has the new version of test.txt as well as the new file new.txt.
You can read an existing tree into your staging area as a subtree by using the --prefix option to read-tree:
$ git read-tree --prefix=bak d8329fc1cc938780ffdd9f94e0d364e0ea74f579
$ git write-tree
3c4e9cd789d88d8d89c1073707c3585e41b0e614
$ git cat-file -p 3c4e9cd789d88d8d89c1073707c3585e41b0e614
040000 tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579    bak
100644 blob fa49b077972391ad58037050f2a75f74e3671e92    new.txt
100644 blob 1f7a7a472abf3dd9643fd615f6da379c4acb3e3a    test.txt
$ echo 'first commit' | git commit-tree d8329f
fdf4fc3344e67ab068f836878b6c4951e3b15f3d
You can look at your new commit object with cat-file:
$ git cat-file -p fdf4fc3
tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579
author Scott Chacon <schacon@gmail.com> 1243040974 -0700
committer Scott Chacon <schacon@gmail.com> 1243040974 -0700

first commit
The format for a commit object is simple: it specifies the top-level tree for the snapshot of the project at that point; the author/committer information pulled from your user.name and user.email configuration settings, with the current timestamp; a blank line, and then the commit message.
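Because the format is just header lines, a blank line, and a message, it is easy to take apart. The following is a minimal sketch, not Git's own parser; parse_commit is a hypothetical helper, and values are kept in lists because a header such as parent can appear more than once.

```python
def parse_commit(text: str):
    """Split raw commit text into a dict of header fields and the message."""
    head, _, message = text.partition("\n\n")
    fields = {}
    for line in head.split("\n"):
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)
    return fields, message

raw = (
    "tree d8329fc1cc938780ffdd9f94e0d364e0ea74f579\n"
    "author Scott Chacon <schacon@gmail.com> 1243040974 -0700\n"
    "committer Scott Chacon <schacon@gmail.com> 1243040974 -0700\n"
    "\n"
    "first commit\n"
)
fields, message = parse_commit(raw)
print(fields["tree"][0])
print(message)
```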
Then you can write the other two commit objects, each referencing the commit that came directly before it:
$ echo 'second commit' | git commit-tree 0155eb -p fdf4fc3
cac0cab538b970a37ea1e769cbbde608743bc96d
$ echo 'third commit' | git commit-tree 3c4e9c -p cac0cab
1a410efbd13591db07496601ebc7a059dd55cfe9
This is essentially what Git does when you run the git add and git commit commands: it stores blobs for the files that have changed, updates the index, writes out trees, and writes commit objects that reference the top-level trees and the commits that came immediately before them.
$ irb
>> content = "what is up, doc?"
=> "what is up, doc?"
>> header = "blob #{content.length}\0"
=> "blob 16\000"
>> store = header + content
=> "blob 16\000what is up, doc?"
>> require 'digest/sha1'
=> true
>> sha1 = Digest::SHA1.hexdigest(store)
=> "bd9dbf5aae1a3862dd1526723246b20206e5fc37"
>> require 'zlib'
=> true
>> zlib_content = Zlib::Deflate.deflate(store)
=> "x\234K\312\3110R04c(\317H,Q\310,V(-\320QH\3110\266\a\000_\034\a\235"
>> path = '.git/objects/' + sha1[0,2] + '/' + sha1[2,38]
=> ".git/objects/bd/9dbf5aae1a3862dd1526723246b20206e5fc37"
>> require 'fileutils'
=> true
>> FileUtils.mkdir_p(File.dirname(path))
=> ".git/objects/bd"
>> File.open(path, 'w') { |f| f.write zlib_content }
=> 32
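The reverse direction is just as short: read the file, inflate it with zlib, and split the header from the content at the first null byte. The Python sketch below writes an object into a scratch directory (standing in for .git) and reads it back; write_loose_object and read_loose_object are hypothetical names used only for this illustration.

```python
import hashlib
import os
import tempfile
import zlib

def write_loose_object(git_dir, obj_type, content):
    """Store header + content, zlib-compressed, under objects/xx/yyyy..."""
    store = obj_type.encode() + b" " + str(len(content)).encode() + b"\0" + content
    sha1 = hashlib.sha1(store).hexdigest()
    path = os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return sha1

def read_loose_object(git_dir, sha1):
    """Inflate a loose object and return (type, content)."""
    path = os.path.join(git_dir, "objects", sha1[:2], sha1[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, content = store.partition(b"\0")
    obj_type, _, size = header.partition(b" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type.decode(), content

demo_dir = tempfile.mkdtemp()  # scratch directory, not a real repository
demo_sha = write_loose_object(demo_dir, "blob", b"what is up, doc?")
print(demo_sha, read_loose_object(demo_dir, demo_sha))
```

Stripping the header and printing the content is essentially what git cat-file -p does for a blob.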
$ echo "1a410efbd13591db07496601ebc7a059dd55cfe9" > .git/refs/heads/master
You aren't encouraged to directly edit the reference files. If you want to update a reference, Git provides a safer command called update-ref:
$ git update-ref refs/heads/master 1a410efbd13591db07496601ebc7a059dd55cfe9
That's basically what a branch in Git is: a simple pointer or reference to the head of a line of work. To create a branch back at the second commit, you can do this:
$ git update-ref refs/heads/test cac0ca
Now, your Git database conceptually looks something like this: master points to the third commit (1a410e) and test points to the second commit (cac0ca).
When you run commands like git branch (branchname), Git basically runs that update-ref command to add the SHA-1 of the last commit of the branch you're on into whatever new reference you want to create.
$ cat .git/HEAD
ref: refs/heads/master
You can also set the value of HEAD:
$ git symbolic-ref HEAD refs/heads/test
$ cat .git/HEAD
ref: refs/heads/test
You can't set a symbolic reference outside of the refs style:
$ git symbolic-ref HEAD test
fatal: Refusing to point HEAD outside of refs/
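Resolving HEAD is therefore a matter of following "ref: ..." lines until a file contains a SHA-1. Here is a minimal Python sketch of that lookup, assuming loose reference files only; resolve_ref and the scratch directory are hypothetical and stand in for a real .git directory.

```python
import os
import tempfile

def resolve_ref(git_dir, name="HEAD", max_depth=5):
    """Follow 'ref: ...' indirections until a SHA-1 is found (loose refs only)."""
    for _ in range(max_depth):
        with open(os.path.join(git_dir, name)) as f:
            value = f.read().strip()
        if not value.startswith("ref: "):
            return value
        name = value[len("ref: "):].strip()
    raise ValueError("too many levels of symbolic refs")

# Build a tiny stand-in for .git with HEAD pointing at the test branch.
demo = tempfile.mkdtemp()
os.makedirs(os.path.join(demo, "refs", "heads"))
with open(os.path.join(demo, "refs", "heads", "test"), "w") as f:
    f.write("cac0cab538b970a37ea1e769cbbde608743bc96d\n")
with open(os.path.join(demo, "HEAD"), "w") as f:
    f.write("ref: refs/heads/test\n")
print(resolve_ref(demo))
```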
$ git update-ref refs/tags/v1.0 cac0cab538b970a37ea1e769cbbde608743bc96d
That is all a lightweight tag is: a branch that never moves. If you create an annotated tag, Git creates a tag object and then writes a reference to point to it rather than directly to the commit:
$ git tag -a v1.1 1a410efbd13591db07496601ebc7a059dd55cfe9 -m 'test tag'
$ cat .git/refs/tags/v1.1
9585191f37f7b0fb9444f35a9bf50de191beadc2
$ git cat-file -p 9585191f37f7b0fb9444f35a9bf50de191beadc2
object 1a410efbd13591db07496601ebc7a059dd55cfe9
type commit
tag v1.1
tagger Scott Chacon <schacon@gmail.com> Sat May 23 16:48:58 2009 -0700

test tag
It doesn't need to point to a commit; you can tag any Git object. In the Git source code, for example, the maintainer has added their GPG public key as a blob object and then tagged it. You can view the public key by running:
$ git cat-file blob junio-gpg-pub
$ cat .git/refs/remotes/origin/master
ca82a6dff817ec66f44342007202690a93763949
Remote references differ from branches (refs/heads references) mainly in that they can't be checked out. Git moves them around as bookmarks to the last known state of where those branches were on those servers.
17. Git compresses the contents of the files under the objects folder with zlib. You can use git cat-file -s to see how big an object is:
$ git cat-file -s 9bc1dc421dcd51b4ac296e3e5b6e2a99cf44391e
12898
$ git gc
$ find .git/objects -type f
.git/objects/71/08f7ecb345ee9d0084193f147cdad4d2998293
.git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
.git/objects/info/packs
.git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx
.git/objects/pack/pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.pack
The objects that remain are the blobs that aren't pointed to by any commit. Because you never added them to any commits, they're considered dangling and aren't packed up in your new packfile. The packfile is a single file containing the contents of all the objects that were removed from your file system. The index is a file that contains offsets into that packfile so you can quickly seek to a specific object.
$ git verify-pack -v pack-7a16e4488ae40c7d2bc56ea2bd43e25212a66c45.idx
It shows the SHA-1 of each object packed in the packfile, along with the object type, the object size, the object offset, and so on. If two objects are very similar, the most recent version is stored intact and the original version is stored as a delta, because you're most likely to need faster access to the most recent version of the file. Git will occasionally repack your database automatically, always trying to save more space. You can also manually repack at any time by running git gc by hand.
$ git remote add origin git@github.com:schacon/simplegit-progit.git
It adds a section to your .git/config file, specifying the name of the remote (origin), the URL of the remote repository, and the refspec for fetching:
[remote "origin"]
      url = git@github.com:schacon/simplegit-progit.git
      fetch = +refs/heads/*:refs/remotes/origin/*
The format of the refspec is an optional +, followed by <src>:<dst>, where <src> is the pattern for references on the remote side and <dst> is where those references will be written locally. The + tells Git to update the reference even if it isn't a fast-forward.
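To make the format concrete, here is a small Python sketch that splits a refspec into its three parts and maps a remote reference name to its local name. It handles only the trailing-* form described in these notes; parse_refspec and map_ref are hypothetical helpers, not part of Git.

```python
def parse_refspec(refspec):
    """Return (force, src, dst) for a refspec such as '+src:dst'."""
    force = refspec.startswith("+")
    if force:
        refspec = refspec[1:]
    src, _, dst = refspec.partition(":")
    return force, src, dst

def map_ref(refspec, remote_ref):
    """Return the local name for remote_ref, or None if the refspec doesn't match."""
    _, src, dst = parse_refspec(refspec)
    if src.endswith("*"):
        prefix = src[:-1]
        if remote_ref.startswith(prefix) and dst.endswith("*"):
            # Whatever the * matched on the remote side is reused locally.
            return dst[:-1] + remote_ref[len(prefix):]
        return None
    return dst if remote_ref == src else None

spec = "+refs/heads/*:refs/remotes/origin/*"
print(parse_refspec(spec))
print(map_ref(spec, "refs/heads/master"))
```

A refspec with an empty source, such as :topic, parses to an empty src, which is the form used for deleting a remote branch on push.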
In the default case that is automatically written by a git remote add command, Git fetches all the references under refs/heads/ on the server and writes them to refs/remotes/origin/ locally. If you want Git to pull down only the master branch each time, and not every other branch on the remote server, you can change the fetch line to
fetch = +refs/heads/master:refs/remotes/origin/master
$ git log origin/master
$ git log remotes/origin/master
$ git log refs/remotes/origin/master
You can also specify multiple refspecs:
$ git fetch origin master:refs/remotes/origin/mymaster topic:refs/remotes/origin/topic
You can also specify multiple refspecs for fetching in your configuration file:
[remote "origin"]
       url = git@github.com:schacon/simplegit-progit.git
       fetch = +refs/heads/master:refs/remotes/origin/master
       fetch = +refs/heads/experiment:refs/remotes/origin/experiment
You can't use partial globs in the pattern, so this would be invalid:
fetch = +refs/heads/qa*:refs/remotes/origin/qa*
$ git push origin master:refs/heads/qa/master
If they want Git to do that automatically each time they run git push origin, they can add a push value to their config file:
[remote "origin"]
       url = git@github.com:schacon/simplegit-progit.git
       fetch = +refs/heads/*:refs/remotes/origin/*
       push = refs/heads/master:refs/heads/qa/master
$ git push origin :topic
Because the refspec is <src>:<dst>, by leaving off the <src> part, this basically says to make the topic branch on the remote nothing, which deletes it.
$ git clone http://github.com/schacon/simplegit-progit.git
The first thing this command does is pull down the info/refs file. This file is written by the update-server-info command, which is why you need to enable that as a post-receive hook in order for the HTTP transport to work properly:
=> GET info/refs
ca82a6dff817ec66f44342007202690a93763949     refs/heads/master
Now you have a list of the remote references and SHAs. Next, you look for what the HEAD reference is so you know what to check out when you're finished:
=> GET HEAD
ref: refs/heads/master
Now that you know you need to check out the master branch, you start by fetching the ca82a6 commit object you saw in the info/refs file:
=> GET objects/ca/82a6dff817ec66f44342007202690a93763949
(179 bytes of binary data)
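The client-side bookkeeping for these first three requests can be sketched as follows. This is an illustration of the steps described here, not Git's HTTP client; parse_info_refs and next_request are hypothetical names, and no network access is involved.

```python
def parse_info_refs(text):
    """Map each ref name to its SHA-1 from an info/refs listing."""
    refs = {}
    for line in text.splitlines():
        if line.strip():
            sha, name = line.split(None, 1)
            refs[name.strip()] = sha
    return refs

def next_request(info_refs_text, head_text):
    """Given the info/refs and HEAD responses, return the object path to GET next."""
    refs = parse_info_refs(info_refs_text)
    prefix = "ref: "
    if not head_text.startswith(prefix):
        raise ValueError("HEAD is not a symbolic ref")
    branch = head_text[len(prefix):].strip()
    sha = refs[branch]
    # Loose objects live at objects/<first 2 chars>/<remaining 38 chars>.
    return "objects/" + sha[:2] + "/" + sha[2:]

request_path = next_request(
    "ca82a6dff817ec66f44342007202690a93763949\trefs/heads/master\n",
    "ref: refs/heads/master\n",
)
print(request_path)
```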
That object is in loose format on the server. You can zlib-uncompress it, strip off the header, and look at the commit content:
$ git cat-file -p ca82a6dff817ec66f44342007202690a93763949
tree cfda3bf379e4f8dba8717dee55aab78aef7f4daf
parent 085bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
author Scott Chacon <schacon@gmail.com> 1205815931 -0700
committer Scott Chacon <schacon@gmail.com> 1240030591 -0700

changed the version number
Next, you have two more objects to retrieve: cfda3b, which is the tree of content that the commit you just retrieved points to, and 085bb3, which is the parent commit:
=> GET objects/08/5bb3bcb608e1e8451d4b2432f8ecbe6306e7e7
(179 bytes of data)
=> GET objects/cf/da3bf379e4f8dba8717dee55aab78aef7f4daf
(404 - Not Found)
It looks like that tree object isn't in loose format on the server, so you get a 404 response back. There are a couple of reasons for this: the object could be in an alternate repository, or it could be in a packfile in this repository. Git checks for any listed alternates first:
=> GET objects/info/http-alternates
(empty file)
If this comes back with a list of alternate URLs, Git checks for loose files and packfiles there; this is a nice mechanism for projects that are forks of one another to share objects on disk. To see what packfiles are available on this server, you need to get the objects/info/packs file, which contains a listing of them (also generated by update-server-info):
=> GET objects/info/packs
P pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack
You'll check the index file to see which packfile contains the object you need:
=> GET objects/pack/pack-816a9b2334da9953e530f27bcac22082a9f5b835.idx
(4k of binary data)
You can see if your object is in it, because the index lists the SHAs of the objects contained in the packfile and the offsets to those objects. Your object is there, so go ahead and get the whole packfile:
=> GET objects/pack/pack-816a9b2334da9953e530f27bcac22082a9f5b835.pack
(13k of binary data)
…
$ git gc --auto
You must have around 7,000 loose objects or more than 50 packfiles for Git to fire up a real gc command. You can modify these limits with the gc.auto and gc.autopacklimit config settings, respectively.
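The decision itself is a simple threshold check. The sketch below is a simplification: needs_auto_gc is a hypothetical function, the counts are passed in rather than measured, and the default of 6700 for gc.auto is the documented default that the "around 7,000" figure refers to.

```python
def needs_auto_gc(loose_objects, packfiles, gc_auto=6700, gc_autopacklimit=50):
    """Return True when either limit is exceeded, as git gc --auto would check."""
    return loose_objects > gc_auto or packfiles > gc_autopacklimit

print(needs_auto_gc(loose_objects=120, packfiles=3))
print(needs_auto_gc(loose_objects=7000, packfiles=3))
print(needs_auto_gc(loose_objects=120, packfiles=51))
```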
$ cat .git/packed-refs
# pack-refs with: peeled
cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0
9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
^1a410efbd13591db07496601ebc7a059dd55cfe9
The last line of the file, which begins with a ^, means the tag directly above is an annotated tag and that line is the commit that the annotated tag points to. If you update a reference, Git doesn't edit this file but instead writes a new file to refs/heads. To get the appropriate SHA for a given reference, Git checks for that reference in the refs directory and then checks the packed-refs file as a fallback.
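The file format is simple enough to parse by hand. The following Python sketch reads the listing shown above into two dictionaries, one for the references and one for the peeled (^) lines; parse_packed_refs is a hypothetical helper, not a Git API.

```python
def parse_packed_refs(text):
    """Return (refs, peeled): ref name -> SHA-1, and tag ref -> peeled commit."""
    refs = {}
    peeled = {}
    last = None
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header comment
        if line.startswith("^"):
            # A ^ line belongs to the annotated tag on the line above it.
            if last is not None:
                peeled[last] = line[1:]
            continue
        sha, name = line.split(" ", 1)
        refs[name] = sha
        last = name
    return refs, peeled

packed = """# pack-refs with: peeled
cac0cab538b970a37ea1e769cbbde608743bc96d refs/heads/experiment
ab1afef80fac8e34258ff41fc1b867c702daa24b refs/heads/master
cac0cab538b970a37ea1e769cbbde608743bc96d refs/tags/v1.0
9585191f37f7b0fb9444f35a9bf50de191beadc2 refs/tags/v1.1
^1a410efbd13591db07496601ebc7a059dd55cfe9
"""
refs, peeled = parse_packed_refs(packed)
print(refs["refs/heads/master"])
print(peeled)
```

A full lookup would first try the loose file under refs and consult this table only when that file is missing.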
$ git count-objects -v
count: 4
size: 16
in-pack: 21
packs: 1
size-pack: 2016
prune-packable: 0
garbage: 0
$ git verify-pack -v .git/objects/pack/pack-3f8c0...bb.idx | sort -k 3 -n | tail -3
e3f094f522629ae358806b17daf78246c27c007b blob   1486 734 4667
05408d195263d853f09dca71d55116663690c27c blob   12908 3478 1189
7a9eb2fba2b1811321254ac360970fc169ba2330 blob   2056716 2056872 5401
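The sort and tail steps of that pipeline can also be done in Python on captured verify-pack output. This is a sketch under the assumption that the third column is the object size, as in the listing above; largest_objects is a hypothetical helper.

```python
def largest_objects(verify_pack_output, count=3):
    """Return the `count` largest objects as (size, sha, type), smallest first."""
    rows = []
    for line in verify_pack_output.splitlines():
        parts = line.split()
        # Object lines start with a 40-character SHA-1, then type and size.
        if len(parts) >= 5 and len(parts[0]) == 40 and parts[2].isdigit():
            rows.append((int(parts[2]), parts[0], parts[1]))
    rows.sort()
    return rows[-count:]

sample = """05408d195263d853f09dca71d55116663690c27c blob   12908 3478 1189
7a9eb2fba2b1811321254ac360970fc169ba2330 blob   2056716 2056872 5401
e3f094f522629ae358806b17daf78246c27c007b blob   1486 734 4667
"""
for size, sha, obj_type in largest_objects(sample):
    print(sha, obj_type, size)
```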
To find out what file it is, you'll use the rev-list command. If you pass --objects to rev-list, it lists all the commit SHAs and also the blob SHAs with the file paths associated with them. You can use this to find your blob's name:
$ git rev-list --objects --all | grep 7a9eb2fb
7a9eb2fba2b1811321254ac360970fc169ba2330 git.tbz2
Now, you need to remove this file from all trees in yourpast. You can easily see what commits modified this file:
$ git log --pretty=oneline -- git.tbz2
da3f30d019005479c99eb4c3406225613985a1db oops - removed large tarball
6df764092f3e7c8f5f94cbe08ee5cf42e92a0289 added git tarball
You must rewrite all the commits downstream from 6df76 to fully remove this file from your Git history:
$ git filter-branch --index-filter \
   'git rm --cached --ignore-unmatch git.tbz2' -- 6df7640^..
The --index-filter option is similar to the --tree-filter option except that instead of passing a command that modifies files checked out on disk, you're modifying your staging area or index each time. The reason to do it this way is speed: because Git doesn't have to check out each revision to disk before running your filter, the process can be much, much faster. The --ignore-unmatch option to git rm tells it not to error out if the pattern you're trying to remove isn't there. Finally, you ask filter-branch to rewrite your history only from the 6df7640 commit up.
Now, your history no longer contains a reference to that file. However, your reflog and a new set of refs that Git added under .git/refs/original when you ran filter-branch still do, so you have to remove them and then repack the database. You need to get rid of anything that has a pointer to those old commits before you repack:
$ rm -Rf .git/refs/original
$ rm -Rf .git/logs/
$ git gc
Counting objects: 19, done.
Delta compression using 2 threads.
Compressing objects: 100% (14/14), done.
Writing objects: 100% (19/19), done.
Total 19 (delta 3), reused 16 (delta 1)
The big object is still in your loose objects, so it's not gone; but it won't be transferred on a push or subsequent clone, which is what's important. If you really wanted to, you could remove the object completely by running git prune --expire.