If you are a software developer, you very likely use Git as your version control system. There is also a good chance that you have messed up a commit history at least once. What do you normally do in such situations? I guess you find some commands on Stack Overflow and, depending on your luck that day, you might resolve the problem quickly.
Have you ever tried to understand how Git works internally? If the answer is yes, you already know the power and peace of mind that understanding brings. If the answer is no, you may never have considered it necessary.
To get there, you need a working knowledge of how Git operates under the hood. Let’s start by creating a local Git repository. Create an empty directory called “my_repo” and navigate into it using the terminal. Run the command git status. It will report that this is not a Git repository (fair enough). So, how can we make a directory a Git repository? It’s simple! Run the command git init. You will notice that a new hidden directory called “.git” has been created, which stores all the magic that Git performs. Now, let’s see what is inside this directory. (I used the tree /f .git command on Windows to visualize the tree.)
Content of the initial .git folder:
Let’s add a file called “file1.txt” to the “my_repo” directory. Add some simple text content (if you want to get the same result as shown, add the single line “first line of content” to the file), and run the command git status. You will see the following output:
Git does not track anything automatically. You can verify this by looking at the “.git” directory. The output above also explains how to start tracking this file. Now let’s run git add file1.txt. Then, check the content of the “.git” directory.
Two new files and one new directory are now present in the “.git” folder. Let’s understand what these are. Git is a version control system, so it has to store every file we track, version it, and provide a way to access each version. Git uses the “objects” directory for this purpose. Git has four types of objects: blobs, trees, commits, and annotated tags.
Blobs: The content of every file (not directory) is stored as a blob. Blobs are created when you stage changes with the git add command. Hence, the file named “faed594a40f2d0aa7e45c5f26138cc24316565” is a blob that contains the content of “file1.txt”.
But why is it in a directory called “d4”? The file name is also strange. This name is a SHA-1 hash that Git computes from the content of the file (prefixed with a short header). Git provides a command, git hash-object, with which you can generate this hash yourself.
Now you can see that “d4” is the first two characters of the hash, and “faed594a40f2d0aa7e45c5f26138cc24316565” is the remaining part. This directory works like an index. All objects whose hash starts with “d4” end up in this directory, which keeps any single directory from growing too large and speeds up lookups.
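To make the naming scheme concrete, here is a minimal Python sketch of how a blob's name can be derived. It assumes a repository that uses Git's default SHA-1 object format; the function names git_blob_hash and loose_object_path are made up for this illustration and are not part of Git.

```python
import hashlib


def git_blob_hash(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # Git does not hash the raw file alone: it prepends the header
    # "blob <size in bytes>\0" and hashes header + content together.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Split an object ID into the directory/file layout used under .git."""
    return f"objects/{object_id[:2]}/{object_id[2:]}"


# Assumed content: the exact bytes (including any trailing line ending)
# decide the hash, so your own file may produce a different ID.
object_id = git_blob_hash(b"first line of content")
print(object_id)
print(loose_object_path(object_id))
```

Whether the result matches the name you see in your own “.git/objects” directory depends on the exact bytes your editor wrote, in particular on whether it added a trailing newline.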
These files are not plain text: Git compresses them with zlib. But Git provides commands to inspect them: git cat-file -t shows the type of an object, and git cat-file -p prints its content.
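As a rough illustration of what such a file holds, the sketch below builds a loose object in memory and reads it back. It assumes the standard loose-object layout (a zlib-compressed header followed by the content); encode_loose_object and decode_loose_object are hypothetical helper names, not Git commands.

```python
import zlib
from typing import Tuple


def encode_loose_object(obj_type: str, content: bytes) -> bytes:
    """Build the bytes Git would write for a loose object (sketch)."""
    header = f"{obj_type} {len(content)}".encode("ascii") + b"\0"
    return zlib.compress(header + content)


def decode_loose_object(raw: bytes) -> Tuple[str, bytes]:
    """Return (type, content) from the raw bytes of a loose object file."""
    data = zlib.decompress(raw)
    # The header ends at the first NUL byte; everything after is content.
    header, _, content = data.partition(b"\0")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(content):
        raise ValueError("size in header does not match content length")
    return obj_type, content


raw = encode_loose_object("blob", b"first line of content")
obj_type, content = decode_loose_object(raw)
print(obj_type)          # the object type, as git cat-file -t would report
print(content.decode())  # the stored content, as git cat-file -p would print
```

To try it on a real repository, you could pass the bytes of a file under “.git/objects” to decode_loose_object instead of the in-memory value.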
Before understanding what this index file is, let’s add a directory called “sub” and copy “file1.txt” to this directory, renaming it to “file2.txt”. Then use the command git add . to track these changes. Now check the “.git/objects” directory.
Did you notice that nothing new was added? That is because “file2.txt” has the same content as “file1.txt”, so it produces the same hash, and Git reuses the existing blob instead of storing a duplicate.
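This behavior falls out naturally from content addressing. The following sketch models the object store as a dictionary keyed by hash; the ObjectStore class is a simplification invented for this example, not Git's actual implementation.

```python
import hashlib


class ObjectStore:
    """A toy content-addressed store: the key is derived from the content."""

    def __init__(self) -> None:
        self.objects = {}

    def add_blob(self, content: bytes) -> str:
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        object_id = hashlib.sha1(header + content).hexdigest()
        # Identical content maps to the same key, so nothing new is stored.
        self.objects.setdefault(object_id, content)
        return object_id


store = ObjectStore()
id1 = store.add_blob(b"first line of content")  # file1.txt
id2 = store.add_blob(b"first line of content")  # sub/file2.txt
print(id1 == id2)          # True: both files resolve to the same blob
print(len(store.objects))  # 1: only one object is stored
```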
One blob, but two files. You are probably guessing what the index file contains now. Since it is also a binary file, we need Git’s help to view it (for example, with git ls-files --stage).
This is what Git calls the index, or the staging area. Every time you change a file and stage it, Git creates a new blob, and the index entry for that file is updated to point to it.
Even a single-character change to a file produces a new blob when it is staged again, because the hash changes, and the index entry for the file then refers to the new blob hash. However, this is not enough to version-control the files. That’s where commits come into play. Let’s commit the changes.
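A heavily simplified model of this relationship is sketched below. The real index file also records file modes, timestamps, and other metadata; here it is reduced to a mapping from path to blob hash, and the stage function is a hypothetical stand-in for git add.

```python
import hashlib


def blob_id(content: bytes) -> str:
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def stage(index: dict, objects: dict, path: str, content: bytes) -> str:
    """Store the blob and point the index entry for `path` at it."""
    object_id = blob_id(content)
    objects[object_id] = content
    index[path] = object_id
    return object_id


objects = {}  # stands in for .git/objects
index = {}    # stands in for .git/index

stage(index, objects, "file1.txt", b"first line of content")
stage(index, objects, "sub/file2.txt", b"first line of content")
blobs_before = len(objects)  # two paths, but a single shared blob

# Changing one character and staging again creates a second blob and
# repoints only the entry for the changed path.
stage(index, objects, "file1.txt", b"first line of content!")
print(blobs_before, len(objects))
print(index["file1.txt"] != index["sub/file2.txt"])
```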
Let’s see what has been added to the “.git/objects” directory.
Before examining them, let’s think about what is still missing for proper version control. Remember that blobs do not contain any information about their file names or the directories in which they live. During the staging phase, this data is held in the index file, so a commit needs its own permanent record of the same data. Since every file belongs to a directory, an obvious solution is to introduce an object that represents a directory and holds references to its files and subdirectories. That’s what Git does by introducing an object type called a tree.
In our case, we have two directories: the root of the repository and the subdirectory called “sub”. That means we have two tree objects.
We can find and examine the two tree objects using the “git cat-file” command.
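For readers who want to see how a tree refers to its children, the sketch below builds the two trees by hand. It assumes the usual tree entry layout (mode, name, a NUL byte, then the 20 raw bytes of the child's SHA-1) and the default file mode 100644; the helper names are invented for this example.

```python
import hashlib


def object_id(obj_type: str, body: bytes) -> str:
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex_id) entries the way a tree object stores them."""

    def sort_key(entry):
        mode, name, _ = entry
        # Git sorts entries by name, treating directories as if they
        # ended with a slash.
        return (name + "/" if mode == "40000" else name).encode("utf-8")

    body = b""
    for mode, name, hex_id in sorted(entries, key=sort_key):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)  # raw 20-byte hash, not hex text
    return body


# Assumed file content; a different byte sequence changes every hash below.
blob = object_id("blob", b"first line of content")

# "sub" contains file2.txt; the root contains file1.txt and the "sub" tree.
sub_tree = object_id("tree", tree_body([("100644", "file2.txt", blob)]))
root_tree = object_id(
    "tree",
    tree_body([("100644", "file1.txt", blob), ("40000", "sub", sub_tree)]),
)
print(sub_tree)
print(root_tree)
```

Note that both trees reference the same blob hash: the file names live in the trees, while the content lives once in the blob.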
Now, think about what is still missing by asking the question “What version are we controlling?” The answer is the repository directory, which in our case is the “my_repo” directory. We already have a tree object for this directory. The only missing piece is something that refers to it, and that is the commit object. The commit object also records the author, the committer, the timestamp, and the commit message.
This commit is special because it is the first one, so it has no parent. From the second commit onwards, each commit also points to its parent commit or commits.
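The difference between a first commit and a later one can be sketched as follows. The layout follows the text form that git cat-file -p prints for a commit; the author identity, timestamps, and tree hashes are placeholders invented for this example, so the resulting IDs will not match any real repository.

```python
import hashlib


def commit_body(tree_id, parent_ids, author, timestamp, message) -> bytes:
    """Serialize a commit: a tree line, parent lines, identities, message."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {parent_id}" for parent_id in parent_ids]
    lines.append(f"author {author} {timestamp}")
    lines.append(f"committer {author} {timestamp}")
    return ("\n".join(lines) + "\n\n" + message + "\n").encode("utf-8")


def commit_id(body: bytes) -> str:
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Placeholder values, purely for illustration.
author = "Example Author <author@example.com>"
tree_v1 = "1" * 40
tree_v2 = "2" * 40

# The first commit has no parent line at all.
first = commit_body(tree_v1, [], author, "1700000000 +0000", "first commit")
first_id = commit_id(first)

# A later commit names its parent by hash, which chains the history.
second = commit_body(
    tree_v2, [first_id], author, "1700000100 +0000", "second commit"
)
print(first.decode())
print(second.decode())
```

Because the parent's hash is part of the hashed content, changing an old commit changes the ID of every commit that descends from it.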
Now we know the fundamental components of the Git storage model. However, branching, merging, garbage collection, and remote repositories are still missing from the picture. These are what make Git a complete version control system. Let’s examine these ideas in more detail in Part 2 of this blog post.
Author: Vinod Madubhashana