Now that we have repositories, putting things inside them is in order.
Also, repositories are boring, and writing a Git implementation
shouldn’t be just a matter of writing a bunch of
mkdir. Let’s talk
about objects, and let’s implement
git hash-object and git cat-file.
Maybe you don’t know these two commands: they’re not exactly part
of an everyday git toolbox, and they’re quite low-level
(“plumbing”, in git parlance). What they do is very simple:
hash-object converts an existing file into a git object, and
cat-file prints an existing git object to the standard output.
Now, what actually is a Git object? At its core, Git is a “content-addressed filesystem”. That means that unlike regular filesystems, where the name of a file is arbitrary and unrelated to that file’s contents, the names of files as stored by Git are mathematically derived from their contents. This has a very important implication: if a single byte of, say, a text file, changes, its internal name will change, too. To put it simply: you don’t modify a file, you create a new file in a different location. Objects are just that: files in the git repository, whose path is determined by their contents.
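To make the idea concrete, here is a minimal sketch of a content-addressed store. It is a toy: the `store` dictionary and the `put` function are hypothetical names invented for this illustration, it lives in memory only, and it hashes the raw bytes with SHA-1 rather than reproducing Git’s actual on-disk format.

```python
import hashlib

# Toy in-memory content-addressed store (illustration only, not Git's format).
store = {}

def put(data: bytes) -> str:
    """Store data under a name derived from its contents; return that name."""
    name = hashlib.sha1(data).hexdigest()
    store[name] = data
    return name

first = put(b"hello world\n")
second = put(b"hello World\n")  # a single byte differs
again = put(b"hello world\n")   # identical contents

print(first != second)  # True: changed contents get a new name
print(first == again)   # True: same contents, same name
print(len(store))       # 2: the original entry was never modified
```

Note that “editing” the text did not touch the first entry: it simply produced a second entry under a different name.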
Git is not (really) a key-value store
Some documentation, including the excellent Pro Git, calls Git a “key-value store”. This is not incorrect, but may be misleading. Regular filesystems are actually closer to a key-value store than Git is. Because it computes keys from data, Git should rather be called a value-value store.
Git uses objects to store quite a lot of things: first and foremost, the actual files it keeps in version control (source code, for example). Commits are objects, too, as are tags. With a few notable exceptions (which we’ll see later!), almost everything in Git is stored as an object.
An object’s path is computed from the SHA-1 hash of its contents. More precisely, Git renders the hash as a lowercase hexadecimal string and splits it in two parts: the first two characters, and the rest. It uses the first part as a directory name and the rest as the file name. (This is because most filesystems hate having too many files in a single directory and would slow down to a crawl. Git’s method creates 256 possible intermediate directories, hence dividing the average number of files per directory by 256.)
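The splitting rule can be sketched in a few lines of Python. The function name `object_path` is hypothetical, and the sketch assumes that objects live under an `objects` directory inside the git directory; it takes an already computed 40-character hexadecimal hash as input.

```python
import os

def object_path(gitdir: str, sha: str) -> str:
    """Build the path of an object from its hexadecimal SHA-1 hash."""
    sha = sha.lower()
    # First two characters: directory name. Remaining 38: file name.
    return os.path.join(gitdir, "objects", sha[:2], sha[2:])

sha = "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
print(object_path(".git", sha))
```

Two hexadecimal characters can take 16 × 16 = 256 distinct values, which is where the 256 intermediate directories come from.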
What is a hash function?
Simply put, a hash function is a kind of unidirectional mathematical
function: it is easy to compute the hash of a value, but there’s no
way to compute which value produced a hash. A very simple example of
a hash function is the
strlen function. It’s really easy to compute
the length of a string, and the length of a given string will never
change (unless the string itself changes, of course!) but it’s
impossible to retrieve the original string, given only its length.
Cryptographic hash functions are just a much more complex version of
the same, with the added property that computing an input meant to
produce a given hash is hard enough to be practically impossible.
(With strlen, producing an input i with strlen(i) == 12 is trivial:
you just have to type twelve random characters. With algorithms
such as SHA-1, it would take much, much longer: long enough to be
practically impossible[1].)
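The contrast between the two kinds of function can be illustrated with a short sketch, using Python’s `len` as a stand-in for strlen and `hashlib.sha1` for SHA-1. The variable names are made up for this example.

```python
import hashlib

original = "hello world!"

# With len(), forging an input that "hashes" to 12 is trivial:
# any twelve characters will do.
forged = "x" * 12
print(len(forged) == len(original))  # True

# With SHA-1, the digest is always 160 bits (40 hexadecimal characters),
# and the forged string does not produce the same one.
digest = hashlib.sha1(original.encode()).hexdigest()
print(len(digest))                                        # 40
print(hashlib.sha1(forged.encode()).hexdigest() == digest)  # False
```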
Before we start implementing the object storage system, we must
understand its exact storage format. An object starts with a header
that specifies its type: blob, commit, tag or tree. This header is
followed by an ASCII space (0x20), then the size of the object in bytes as an
ASCII number, then a null byte (0x00), then the contents of the
object. The first 48 bytes of a commit object in Wyag’s repo look like this:
00000000  63 6f 6d 6d 69 74 20 31  30 38 36 00 74 72 65 65  |commit 1086.tree|
00000010  20 32 39 66 66 31 36 63  39 63 31 34 65 32 36 35  | 29ff16c9c14e265|
00000020  32 62 32 32 66 38 62 37  38 62 62 30 38 61 35 61  |2b22f8b78bb08a5a|
In the first line, we see the type header, a space (0x20), the size in
ASCII (1086) and the null separator 0x00. The last four bytes on the
first line are the beginning of that object’s contents, the word
“tree”; we’ll discuss that further when we talk about commits.
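As a rough sketch of this layout, the two helpers below build and take apart the uncompressed form of an object: type, space, size in ASCII, null byte, contents. The names `serialize` and `parse` are hypothetical and chosen for this illustration only; they do not handle compression or hashing.

```python
def serialize(fmt: bytes, data: bytes) -> bytes:
    """Prepend the header: type, space, ASCII size, null byte."""
    return fmt + b" " + str(len(data)).encode("ascii") + b"\x00" + data

def parse(raw: bytes):
    """Split a raw object back into (type, contents), checking the size."""
    space = raw.find(b" ")           # end of the type
    null = raw.find(b"\x00", space)  # end of the size
    fmt = raw[:space]
    size = int(raw[space + 1:null].decode("ascii"))
    data = raw[null + 1:]
    if size != len(data):
        raise ValueError("malformed object: bad length")
    return fmt, data

raw = serialize(b"blob", b"hello\n")
print(raw)         # b'blob 6\x00hello\n'
print(parse(raw))  # (b'blob', b'hello\n')
```

Checking the declared size against the actual length of the contents is a cheap way to catch a truncated or corrupted object early.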
The objects (headers and contents) are stored compressed with