From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.
Whereas if I had used git, and created 2 additional commits, each making a change to a small text file, my total storage size would be barely larger than 100MB. Describing the commits as a diff, as opposed to a snapshot, leads to a better intuitive understanding of why this would be the case.
Not to mention other features the article discussed, such as cherry-picking. What does it even mean to "cherry-pick a snapshot"? In comparison, cherry-picking a diff and applying it to your current state, is far more intuitive.
And let's not forget commit messages. If a commit is a snapshot, I would expect the commit-message to be descriptive of the entire snapshot. Whereas if a commit is a diff, I would expect the commit message to be descriptive of the diff. Which is exactly how most people use commit messages.
Obviously both "diffs" and "snapshots" are leaky abstractions. If you insist on using the "snapshot" abstraction, you will need to resolve all of the above points of confusion by adding more complexity to your abstraction. And if you prefer to use the "diff" abstraction, you will eventually need to explain that a commit is actually a combination of diffs, along with some other metadata like a pointer to a parent commit. As a teaching tool, you can make either abstraction work. But I find it far more intuitive and useful to think of commits as "diffs + some metadata".
How to represent those snapshots, and fix the storage bloat a naive implementation would cause, is a completely different problem.
One of the things that makes Git smart is that it doesn't try to optimize things prematurely. SVN and co. would store actual diff data, but this made some operations really hard to implement (and, in many cases, slow).
Git has commits conceptually as snapshots. It's up to the storage code to figure out how to deal with this.
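To make the split between "conceptually a snapshot" and "the storage code deals with it" concrete, here is a minimal sketch of a content-addressed store. It is a toy, not Git's object format: `ToyStore` and its methods are hypothetical names, and it hashes raw file content with plain SHA-1, where Git adds a header and compresses. Every commit records the complete file list, yet unchanged content is stored only once.

```python
import hashlib


class ToyStore:
    """Hypothetical content-addressed store: one copy per distinct file content."""

    def __init__(self):
        self.blobs = {}      # content hash -> bytes
        self.snapshots = []  # each snapshot maps path -> content hash

    def commit(self, files):
        """Record a full snapshot of `files` (path -> bytes)."""
        snapshot = {}
        for path, data in files.items():
            key = hashlib.sha1(data).hexdigest()
            self.blobs.setdefault(key, data)  # stored only if not seen before
            snapshot[path] = key
        self.snapshots.append(snapshot)
        return len(self.snapshots) - 1

    def checkout(self, index):
        """Rebuild the complete file set for one snapshot."""
        return {path: self.blobs[key] for path, key in self.snapshots[index].items()}

    def stored_bytes(self):
        return sum(len(data) for data in self.blobs.values())


store = ToyStore()
files = {"big.bin": b"x" * 1_000_000, "notes.txt": b"first draft\n"}
store.commit(files)
files["notes.txt"] = b"second draft\n"
store.commit(files)
files["notes.txt"] = b"third draft\n"
store.commit(files)
```

Three full snapshots of a roughly 1MB directory end up holding four blobs: the big file once, plus three small versions of the text file.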
> But I find it far more intuitive and useful to think of commits as "diffs + some metadata".
Except that this is not what's happening. I wouldn't even call it an abstraction, it's how things actually work. What you call abstractions are actually operations. If we run a diff we are interested in the changes, but if you ask git to show you the commit it will show you just that.
If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.
> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.
I find that thinking of commits as snapshots is not so useful. I prefer to think of them as a pair of parent commit and diff.
With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.
It's not clear to me how thinking of commits as snapshots helps me to explain operations such as rebase.
I do concede, however, that "git cat" (I think that's the command) seems more closely related to a snapshot: you identify a commit and a file, and it will give you the content of that file at that commit. Clearly in this case the concept of a snapshot works well. But I need this very rarely.
> With that in mind, things like rebase become obvious: Take the same diff and attempt to apply it to a different parent.
You can think of it that way if you want. But it's not what Git actually does.
Personally I much prefer to have my mental model match the actual reality of things.
You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.
What it does in actuality is find the snapshot of that SHA and change the working tree to match it.
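A rough sketch of the two checkout strategies being contrasted here, assuming a toy history where trees are plain dicts. Both function names are made up for illustration, and neither reflects how Git is implemented internally; the point is only that one walks the history and the other does a single lookup.

```python
def checkout_from_diffs(diffs, target):
    """Replay every diff from the first commit up to `target`.

    Each diff maps path -> new content, or None for a deletion.
    Returns the rebuilt tree and the number of diffs applied.
    """
    tree = {}
    for diff in diffs[: target + 1]:
        for path, content in diff.items():
            if content is None:
                tree.pop(path, None)
            else:
                tree[path] = content
    return tree, target + 1


def checkout_from_snapshots(snapshots, target):
    """Look the tree up directly; no history is consulted."""
    return dict(snapshots[target])


# The same four-commit history, expressed both ways.
diffs = [{"a.txt": "v0"}, {"b.txt": "v0"}, {"a.txt": "v1"}, {"b.txt": None}]
snapshots = [
    {"a.txt": "v0"},
    {"a.txt": "v0", "b.txt": "v0"},
    {"a.txt": "v1", "b.txt": "v0"},
    {"a.txt": "v1"},
]
```

Both produce the same tree for every commit; the difference is that the diff version does work proportional to the length of the history.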
If git did rebuild the graph, right from the very first commit, the end result of the operation would look identical to the user as it does now.
It seems to me the two mental models are interchangeable when it comes to the use of git from the user's point of view. What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?
Now that I think about it, it's probably that users have a bad understanding of the commit-as-diff model; I expect they could just as easily have a bad understanding of the commit-as-snapshot model. I don't know that thinking in snapshots helps to understand git from a user's point of view better than thinking (properly) in diffs.
The article for example explains that any two commits can be differenced because the underlying snapshot trees can be compared, but the commit-as-diff model can as easily explain why comparing two commits works by tracing each commit back to the common base commit - so the commit-as-diff mental model just needs to remember that commits are fundamentally tied to the path they have back to the root commit.
It seems to me if you take the diagrams from the article and remove the under-the-covers stuff leaving just the circles, the commits-as-diffs and commits-as-snapshots models look exactly the same.
Merge commits are a bit hard to understand from the perspective of "a commit is basically just a parent commit plus diff".
On the flip side, cherry-picking is hard to understand from the perspective of "a commit is basically just a snapshot, nothing more" (it's _also_ weird from the parent-commit-plus-diff perspective -- cherry-pick is kind of a weird operation, but useful enough that we keep it anyway despite it not fitting quite as cleanly into the git model as other operations).
Outside those edge cases, though, people with "snapshot" and "parent + diff" mental models will make basically identical predictions about what the results of various operations with git will be.
> What is missing, from the user's point of view, when they model commits as diffs+parents vs as snapshots?
With the wrong mental model it's harder to predict what operations are expensive. If "git checkout <SHA>" truly did have to replay all diffs from the beginning of time, it would be a very expensive operation that is best avoided unless you absolutely need it. But in practice it is a very fast operation (one of the fastest) that there is no need to shy away from.
A fair point possibly, but given checkout/switching branches is probably just about the most common action when working with git repos, I'd hope people would notice that it's fast pretty quickly.
> You may not use "git cat" very often, but what about "git checkout <SHA>"? If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff up to the SHA you asked for.
Yes, this is true. I don't know why it never bothers me. Maybe it's because you could also store the diffs in the opposite direction (i.e. store the tip of each branch in the clear, then store diffs from each commit to its parent). Computing the inverse of a diff should be a quick operation. Usually, when you check out something, it's the tip of a branch or near the tip of a branch.
Anyway.
Of course I know that storing trees makes it easy to compute diffs. Computing diffs will become slower with larger trees. On the other hand, storing diffs makes it slow to compute trees, and the more commits we've got, the slower the tree computation goes.
> Computing diffs will become slower with larger trees
Not usually. Computing a diff is roughly O(n) with the size of a diff. This is because unchanged leaves of the tree can be seen as identical (because they are content addressed) and are skipped. So to compute the diff you only need to recurse into changed directories.
So having a million files in the root directory when only one has changed is very fast to diff, as you just diff that one file. The worst case is the diff happening in a very deeply nested directory with lots of files in each of the subdirectories, but even that is quite cheap, as diffing a sorted directory listing is O(n) with the size of the listing.
(The actual worst case is diffing large files as most text diff algorithms are worse than O(n))
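Here is a small sketch of that subtree-skipping comparison, assuming a toy tree where every node carries a hash of its content. `Node` and `changed_paths` are hypothetical names and the hashing scheme is made up for illustration; it is not Git's tree format. The `visited` list is only there to show which nodes the comparison touches.

```python
import hashlib


class Node:
    """A file (data is bytes) or a directory (children maps name -> Node)."""

    def __init__(self, data=None, children=None):
        self.data = data
        self.children = children
        if children is None:
            payload = b"file\0" + data
        else:
            listing = "".join(
                f"{name} {child.id}\n" for name, child in sorted(children.items())
            )
            payload = b"dir\0" + listing.encode()
        self.id = hashlib.sha1(payload).hexdigest()  # computed once, at creation


def changed_paths(old, new, path="", visited=None):
    """Return paths that differ, never descending into subtrees whose ids match."""
    if visited is not None:
        visited.append(path)
    if old is not None and new is not None and old.id == new.id:
        return []  # identical content: skip the whole subtree
    old_dir = old is not None and old.children is not None
    new_dir = new is not None and new.children is not None
    if not (old_dir and new_dir):
        return [path]  # added, removed, or modified entry (reported as one path)
    result = []
    for name in sorted(set(old.children) | set(new.children)):
        result += changed_paths(
            old.children.get(name), new.children.get(name), f"{path}/{name}", visited
        )
    return result


def make_tree(readme):
    return Node(children={
        "README": Node(data=readme),
        "src": Node(children={"main.c": Node(data=b"int main(void) { return 0; }\n")}),
        "docs": Node(children={
            "guide.txt": Node(data=b"guide\n"),
            "faq.txt": Node(data=b"faq\n"),
        }),
    })


old_tree = make_tree(b"v1\n")
new_tree = make_tree(b"v2\n")
visited = []
result = changed_paths(old_tree, new_tree, visited=visited)
```

Only the root and its three direct entries are visited; nothing under `docs` or `src` is opened, because their ids match.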
> If commits were stored as diffs, then Git would have to rebuild a tree of the very first commit, then replay every single diff
Well, it would usually be more efficient to figure out where the currently checked-out branch differs from the branch being checked out, and then unapply and apply diffs as needed.
Rebase doesn't work that way, though [0]. It first extracts the 3 versions (2 leaves and their common ancestor) and then does a diff & patch.
This allows git to store the deltas between versions in the most efficient way on disk, while also letting it use contextual diffs to minimize the chance of spurious merge conflicts. Patching algorithms have various heuristics that make sense for programming languages, like special treatment for lines with only changes in whitespace.
(Edited to add:) also, minimal diff algorithms have to do a lot of work to detect large blocks of text being moved around. This is part of what made Subversion, which used the same diff algorithm for storage compression and merging, painfully slow.
Here is the paragraph that describes what rebase does:
> This operation works by going to the common ancestor of the two branches (the one you’re on and the one you’re rebasing onto), getting the diff introduced by each commit of the branch you’re on, saving those diffs to temporary files, resetting the current branch to the same commit as the branch you are rebasing onto, and finally applying each change in turn.
Is "applying the diff to a different parent" not a good way to describe this?
You're using the word 'diff' for 2 different things:
- an efficient way to store 2 very similar files
- the minimal set of changes made by a programmer to a file.
Subversion uses the same diff algorithm for these 2 functions, which is why people conflate them. But git uses different algorithms. The first kind (which it calls deltas) is optimized for speed and compression ratio. The second set of algorithms (you can choose from a few, some of which are better at identifying rearrangements of large blocks of text) is optimized for merging 2 programmers' changes without conflicts.
The way you try to apply a diff to a different parent is by doing a three-way merge... the vast majority of tools do this by taking three files as arguments and producing a fourth as output. The three-way merge is the underlying process which makes merge, rebase, cherry-pick, and revert work. They are all just "three-way merge, shuffle the arguments around, and adjust metadata".
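The "shuffle the arguments around" idea can be sketched like this, assuming snapshots are dicts of path to content and merging happens at whole-file granularity only. Real tools also merge line by line inside a file and adjust commit metadata; the function names here are illustrative, not any tool's API.

```python
def three_way_merge(base, ours, theirs):
    """File-level three-way merge of snapshots (dicts of path -> content).

    A missing key means the file is absent. Returns (merged, conflicts).
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o  # both sides agree
        elif o == b:
            result = t  # only theirs changed it
        elif t == b:
            result = o  # only ours changed it
        else:
            conflicts.append(path)
            result = o  # keep ours and report the conflict
        if result is not None:
            merged[path] = result
    return merged, conflicts


def cherry_pick(head, commit, commit_parent):
    """Apply what `commit` changed relative to its parent on top of `head`."""
    return three_way_merge(commit_parent, head, commit)


def revert(head, commit, commit_parent):
    """Undo what `commit` changed: the same merge with base and theirs swapped."""
    return three_way_merge(commit, head, commit_parent)


parent = {"a.txt": "1", "b.txt": "1"}
commit = {"a.txt": "2", "b.txt": "1"}
head = {"a.txt": "1", "b.txt": "1", "c.txt": "new"}

picked, pick_conflicts = cherry_pick(head, commit, parent)
undone, undo_conflicts = revert(picked, commit, parent)
```

Note that no diff is stored anywhere: each operation takes three snapshots in and produces one snapshot out.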
The parent + diff storage is not isomorphic to snapshot storage. Snapshot storage reflects the actual usage of VCS tools... people make changes, and record the final state. Parent + diff does not do this, it records the changes, which requires creating a diff, and there are multiple ways to create a diff between two snapshots.
Git postpones the "which diff is correct" question until you actually care about the answer.
> If you think a commit is a diff, you have a mismatch between the mental model and what's actually happening behind the scenes. This will make it difficult to understand concepts later on.
I don't think those concepts are as distinct as you're painting them. At a user visible level commits will almost always be visualized as diffs, which puts us at a place where - at the highest level and lowest level they're defined as pretty close to diffs, while at an intermediary level they're defined closer to snapshots.
I honestly think they're neither; each expression method (diff vs. snapshot) can be translated into the other pretty easily, and both are trying to represent the same end goal. It can be helpful to know that commits are representative of the full state of the codebase at a point in time, but that view can be at odds with merging and rebasing, which calculate with actual change sets. When a commit is being manipulated it's helpful to view it as a diff (and git does this), whereas when a commit is being read, we're using it as a snapshot.
One way I like to think about this is that when you rebase a branch, the diffs are the same (barring any conflicts) but the commits are different. Just another reason commits aren't the same as diffs.
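A toy illustration of that point, assuming a commit id is a hash over the snapshot, the parent id, and the message. Real Git hashes the commit object text, which also carries author and committer lines, so `commit_id` below is a simplified stand-in and the trees are made-up examples in which the change applies cleanly.

```python
import hashlib


def commit_id(tree, parent, message):
    """Toy commit id: a hash over the snapshot, the parent id, and the message."""
    listing = "".join(f"{path}={content}\n" for path, content in sorted(tree.items()))
    payload = f"tree\n{listing}parent {parent}\n\n{message}\n"
    return hashlib.sha1(payload.encode()).hexdigest()


def diff(old, new):
    """Paths whose content differs, mapped to (old, new) content."""
    return {
        path: (old.get(path), new.get(path))
        for path in sorted(set(old) | set(new))
        if old.get(path) != new.get(path)
    }


base = {"app.py": "v1"}
feature = {"app.py": "v1", "feature.py": "new"}  # feature commit on top of base
upstream = {"app.py": "v2"}                      # main moved on meanwhile
rebased = {"app.py": "v2", "feature.py": "new"}  # feature replayed onto upstream

base_id = commit_id(base, None, "initial")
upstream_id = commit_id(upstream, base_id, "update app")
original = commit_id(feature, base_id, "add feature")
replayed = commit_id(rebased, upstream_id, "add feature")
```

In this clean case the diff against each parent is identical, but the two commits get different ids because both the snapshot and the parent differ.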
The diffs are often different, even without conflicts. Try comparing them some time, and look closely at the diff... look at the lines starting with @. People usually ignore those lines but "patch" does NOT.
This is not an irrelevant detail; it's the result of a three-way merge. The three-way merge can update those @ lines if it has a complete set of inputs (all three inputs). If you make a patch from one branch and then apply it to a different branch without using the three-way merge algorithm (stripping the diff of all its context), the patch may fail to apply even if the three-way merge succeeded without conflicts.
I think this is more a sign that git (porcelain) is not aligned with the underlying model.
It is actually a pity that so little effort went into git UI. I find the OP explanation of git model awesome and the presented concepts beautiful, but the cli utility has countless naming and consistency problems which make me sad that hg didn't win over git. Life would be much simpler for many developers if it did, imho.
> From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.
That's not what one would expect. Suppose I have a directory that is 100MB in size. If I take a snapshot of it ("btrfs subvolume snapshot"), my snapshot would be 100MB in size, but the storage needed for the original and the snapshot together would still be 100MB (plus a few kilobytes of overhead). If I take a second snapshot of it tomorrow ("btrfs subvolume snapshot" again), my second snapshot would also be 100MB in size, and my total storage needs would still be 100MB (plus a few kilobytes of overhead).
If I made a change to a small text file before each snapshot, my total storage size would still be barely larger than 100MB.
That is, when creating a snapshot, one would expect it to be copy-on-write. While not exactly what git does (it's a content-addressable storage instead of a copy-on-write storage), the end effect is similar enough for most purposes (the main difference being that undoing a change in git would not need extra storage, while a copy-on-write storage would store a new copy of the contents).
Clearly people are using two diametrically opposed definitions of snapshot.
If a snapshot is defined as opposed to a diff, then it's clear snapshot means "full copy". If I snapshot the state of my cloud server, it creates a full copy of its disk in block storage somewhere, and takes several minutes to complete.
You are describing snapshots that exist as part of a diff system or copy-on-write system, where they use virtually no storage at all, because further changes are assumed to be applied as diffs rather than overwriting previous data. Where the snapshot is a "marked" diff that can specifically be rewound to, as opposed to a general ongoing stream of diffs.
But that's a more advanced and system-specific definition of snapshot.
As a general mental model, when you say "think of it as a snapshot not a diff", I think it's clear that the former definition is being used, and that the expectation is a full copy that takes up disk space. Because otherwise, in the second case, all the snapshots are just the most recent diff (on top of the entire prior history), so the sentence "think of it as a snapshot not a diff" doesn't really mean anything. The snapshot and the diff are the same.
> If I snapshot the state of my cloud server, it creates a full copy of its disk in block storage somewhere, and takes several minutes to complete.
Which cloud provider are you using? Neither Amazon nor Google take snapshots this way. Amazon EBS and Google Persistent Disk both use copy-on-write semantics for snapshots. If you take a hundred snapshots of a 100 GB disk, your total usage is 100 GB plus metadata. When you run a VM instance from that disk, the storage usage will increase as blocks change, to a maximum of 200 GB total storage (for live disk + out of date snapshot).
When I use QEMU or VirtualBox at home, I also get copy-on-write snapshots of disks, although it's certainly possible to get a full copy if you want. I think the feature is pretty standard.
That’s an incorrect notion of “definition”. The concept of a snapshot is that you make a copy of something at a moment in time. That’s one concept, one definition, one meaning. You may fight over the details of the definition or the implications, but at most it means that you need to revise the definition a little bit, not that you need to add a new sense to the word.
Nope, pretty sure different concepts means different definitions. Well -- or different "senses" if you want to be technical, but of course nearly everyone outside of dictionary editors uses "definition" to mean "sense".
> The concept of a snapshot is that you make a copy of something at a moment in time. That’s one concept, one definition, one meaning.
Except one of the two definitions isn't making a copy of anything. It's creating a new pointer to something that already exists, that's all. Zero copying. That's the entire point here.
Which is why it's two concepts, two definitions, two meanings.
Copy-on-write is an implementation detail that allows for lower storage. The snapshot is still the full copy. One could try to argue that the same is true for git in that diffs (or content addressable storage) are just an internal implementation detail, but as the parent pointed out that's not quite true--our commits document the diff, not the materialized snapshot.
> our commits document the diff, not the materialized snapshot
That's not actually true, though. This is what a raw commit object looks like:
    $ git cat-file commit bfc766d38e1fae5767d43845c15c79ac8fa6d6af
    tree 99768f8965d5382d1c1695c371a854d061f2548b
    parent 860a3b34854d8abe9af9f1eb584691de926ce897
    author Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100
    committer Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100

    Update version for v2.6.0 release

    Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
(This commit is from the QEMU project repository.) Note that there is no reference to a diff. There is a reference to a tree, which is a binary object representing a directory structure—a full snapshot of the state of the working tree as of that commit:
For storage purposes there is deduplication, delta encoding, and compression going on behind the scenes, so committing a 100M working tree with a few small changes doesn't take up 100M of additional storage space, but these are invisible to the upper layer. When Git needs a diff, for example to perform a three-way merge, or in response to a `git show` command, it generates one on the fly from the snapshots.
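If it helps, the raw commit text quoted above can be taken apart with a few lines of code. This is a simplified sketch: `parse_commit` is a hypothetical helper, it assumes the usual blank line between the headers and the message, and it ignores multi-line headers such as signatures.

```python
def parse_commit(raw):
    """Split raw commit text into header fields and the message."""
    head, _, message = raw.partition("\n\n")
    fields = {}
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        fields.setdefault(key, []).append(value)  # a merge has several parents
    return fields, message


RAW = """tree 99768f8965d5382d1c1695c371a854d061f2548b
parent 860a3b34854d8abe9af9f1eb584691de926ce897
author Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100
committer Peter Maydell <peter.maydell@linaro.org> 1462981466 +0100

Update version for v2.6.0 release

Signed-off-by: Peter Maydell <peter.maydell@linaro.org>
"""

fields, message = parse_commit(RAW)
```

The only header keys that come out are `tree`, `parent`, `author` and `committer`; there is no field that holds or points to a diff.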
That's not really true. The copy-on-write filesystems just allow multiple files to reference the same blocks, and only allow modifications to blocks if the refcount is 1. At least, at its simplest, that's how copy-on-write works. To copy a file, you copy the block references and increment the reference counts. You won't end up with a diff or deltas stored anywhere.
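The simplest form of that scheme might look like the sketch below. `BlockStore` is a made-up in-memory model, not any real filesystem's API: a file is a list of block references, copying bumps reference counts, and a write only allocates a new block when the old one is shared.

```python
class BlockStore:
    """Toy copy-on-write store: files are lists of references to shared blocks."""

    def __init__(self):
        self.blocks = {}    # block id -> bytes
        self.refcount = {}  # block id -> number of references
        self.next_id = 0

    def _alloc(self, data):
        block_id = self.next_id
        self.next_id += 1
        self.blocks[block_id] = data
        self.refcount[block_id] = 1
        return block_id

    def create(self, chunks):
        return [self._alloc(chunk) for chunk in chunks]

    def copy(self, file):
        """Copy = duplicate the references and bump the refcounts; no data moves."""
        for block_id in file:
            self.refcount[block_id] += 1
        return list(file)

    def write(self, file, index, data):
        block_id = file[index]
        if self.refcount[block_id] == 1:
            self.blocks[block_id] = data     # sole owner: modify in place
        else:
            self.refcount[block_id] -= 1     # shared: leave the old block alone
            file[index] = self._alloc(data)  # and point this file at a new one

    def read(self, file):
        return b"".join(self.blocks[block_id] for block_id in file)


store = BlockStore()
live = store.create([b"aaaa", b"bbbb", b"cccc"])
snap = store.copy(live)
store.write(live, 1, b"BBBB")
```

After the write, the snapshot and the live file still share two of their three blocks, and nothing resembling a delta has been recorded.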
Sure it is. You just need to look at it a little differently.
Even in classical COW of memory pages in a Unix forked process, the set of pages mapped into the process with refcount 1 are a diff to all those with refcount > 1.
Virtual machine snapshots are more explicitly diff-oriented. Deltas to the base disk or snapshot are stored separately (that's your diff), and "deleting a snapshot" actually means remapping all the separately stored blocks and collecting the newly released blocks. There's two strategies snapshotting can follow: copy-then-write-in-place, or redirect-on-write. Either way, the set of copied or redirected blocks are a diff to the in-place blocks, just the polarity of the difference is switched.
Things get more interesting with e.g. ZFS snapshots, where the whole filesystem, including metadata, is copy-on-write, and tree-structured to maximize sharing and permit atomic writes (how ZFS solves RAID5 write hole). There, snapshots hold on to one of the old roots in the tree. The diff is implicit in the difference in tree structure; shared blocks are common, different blocks are different. It's super-easy to do a recursive comparison between such trees, extracting a diff is a sublinear time operation because it can trivially skip over identical subtrees. It's a matter of perspective, when you're in the middle of a recursive tree compare, whether you think you're actually diffing data, or whether the data in one leg of compare is simply telling you which subtrees are shared and which subtrees are different, and thus the data is a delta, or diff. You certainly don't need a complete traversal, which tells me that the data is doing most of the work.
> The diff is implicit in the difference in tree structure;
I can only interpret this as, "the data is not described as diffs." There's a meaningful difference here and I'm not being picky about it. To some extent, you can convert between a diff structure and shared structure, but that doesn't mean that the differences aren't meaningful.
Two structures may be isomorphic but they represent data in different ways and the operations have different algorithmic complexity.
I've learnt something new today, thanks for sharing. Looks like I had a naive understanding of how snapshotting actually works.
I still think that it's more intuitive to describe commits as diffs, in the context of things like cherry-picking a commit or rebasing/reordering a series of commits.
But given that you can also "check out" a commit, in order to get a specific snapshot of the repo, I can see the parallels between commits and snapshots. Maybe both analogies are equally useful in describing the different features that git provides.
Internal implementations and external interfaces are not necessarily the same thing. When reading a single-threaded application's code, it is helpful to read it as a series of instructions, executing serially. In reality, both the compiler and your CPU are constantly reordering instructions, and executing them in parallel/out-of-order. However, this is all done while still preserving the illusion of serial-execution. Taking a beginner programmer down this rabbit-hole of implementation details, is going to be more harmful than helpful.
Thank you for the suggestion, but I already find git easy to use. And thinking of commits as diffs that can be cherry-picked, rebased and reordered, is something that helps me greatly in understanding it.
The correct way to think about snapshots and diffs when it comes to cherry-picking and rebasing is to realise that diffs are always derived from snapshots. I.e. the fundamental data-structure is the snapshot and from those we can build diffs. Those diffs are necessary to implement cherry-picking and rebasing but it's also possible to imagine an implementation of git that has those features missing. It would still fundamentally work in the same way - it would just be slightly less useful.
Edit: If you think this is just splitting hairs, I encourage you to look at the differences between git and pijul which is a VCS where the fundamental building block is diffs: https://pijul.com/
> The correct way to think about snapshots and diffs when it comes to cherry-picking and rebasing is to realise that diffs are always derived from snapshots.
Ironically, git snapshots are themselves derived from diffs. Creating a snapshot without diffs, would require making a full copy, which git most definitely does not do.
So would you rather think of cherry-picking as diffs derived from snapshots which are derived from diffs? Or as simply diffs? I find the latter better as a mental model.
It's helpful to understand git in terms of the "porcelain" and the "plumbing".
The git commands you know and love are largely the porcelain, nice fixtures over other things. When you "git cherry-pick", under the hood what it's actually doing is querying that commit's parent(s), finding the diff the commit introduced relative to its parent(s), and then applying those same changes to the index and your working tree.
Cherry-pick is porcelain on top of the plumbing.
There are a few "write git yourself" tutorials out there, of which "Write yourself a Git!" is I think the most popular. In it, you'll learn how git really stores data, and you'll write a (fairly basic) git client that can do several things to locally manage a repository.
>I still think that it's more intuitive to describe commits as diffs, in the context of things like cherry-picking a commit or rebasing/reordering a series of commits.
If I understood the article correctly, those things actually are implemented via diffs. It's just that the diffs are calculated on-the-fly, used to create a new snapshot, and then discarded.
You can still think of them as snapshots. Git just does compression on the entire folder of snapshots, including de-duplication of data that doesn't change between snapshots.
In fact, when I teach git to students, I don't even bother with the trees/blobs, which in my view are just an implementation detail. I just tell them to think of git zipping up their working directory together with some metadata (commit message, reference to parents), and putting that zip file into its own "compressed" storage inside the .git directory. That seems to be sufficient for a good mental model of how to work with git (independently of git's somewhat baroque command line interface, which just takes getting used to).
This is the thing though. You're talking about snapshots which actually have duplication removed... in my mind this really fits more with the 'diff' model. I've already done the exploratory diving-into-git-internals thing years ago, so I could develop a better understanding of how things actually work.
But for newcomers who want to understand how git is working, it really makes more sense to tell them it's 'like a diff. Not exactly under the hood, but think of it like a diff for now'. This is what I've been telling people as I've mentored a number of people in getting acquainted with git over the years, and if they're curious enough to look under the hood, they'll get a better understanding of the internals.
As a programmer, what you're working with is essentially the diff. This is the easiest way to think about things initially. The fact that git is storing blobs under the hood, shallowly deduplicating blobs but still storing large chunks separately that may contain duplicate data, until it generates packfiles which do a deeper deduplication/compression, is really not that helpful. Telling people it's more like zipping is a bit disingenuous because it doesn't really explain how things are compressed more efficiently over the course of many changes.
If I have a 1MB code file and make 1000 commits of one-line changes then sure, git is initially storing large blobs representing those, but then will compress over the change set when it generates the packfile.
Compared to making a zip of the file for every change (say these are 100KB compressed) and now you have people thinking the 1000 one-line changes generate 100MB in the .git directory.
You may think that a 1MB file with many smaller changes is a fabricated example, but consider that dependency lockfiles (package-lock.json I'm looking at you) can easily grow to this size, and contain this many changes.
It may depend on the background of who you're talking to. Programmers may be very comfortable with diffs, but non-programmers (in my case, physics graduate students) usually aren't. On the other hand, everybody is familiar with snapshots: even high school students will end up with "report_v1.docx", "report_v2.docx", etc, which are snapshots at the file level (and work reasonably well as long as you have a consistent scheme and don't need branches). I've also routinely seen less-technical people organize their research / paper writing by making a weekly snapshot of their work folder ("project-2020-04-1"). Telling these people that git basically does the same thing for them automatically with a tree-like "labeling scheme" that allows for branches tends to go over quite well, in my experience. For actual programmers, I'd be inclined to give them a more technical introduction to git's internals. I'd still point out that git stores compressed snapshots, not diffs (especially if they're older and may have previous SVN experience).
Those non-programmers are likely going to have a worse understanding of what is happening when you zip/compress something anyway, but I concede this is probably the most straightforward path if they have some understanding of what a zip is, and can't understand what a diff is. But even then I question if they should be using git, since `git diff`, `git show`, basically everything git exposes, is going to show them diffs.
A pure-diff store would be impossible to recover if any single commit were corrupted.
It would also be much slower to examine the data, and newer version control systems do not use pure diffs.
The version control system Mercurial had a description of these problems on its homepage, "behind the scenes", which was good reading.
I am not sure if Git is the best solution, but at least a "pure snapshot" store is workable on its own, whereas diff storage must in practice include some snapshot logic as well.
The "snapshots which are stored as deltas, if that works" part is unrelated to the diffs the git porcelain generates for you when you do a git-diff or git-show. The former is purely an implementation detail of the storage (albeit an important one), while the latter is entirely virtual, calculated from the snapshots every single time you view the data. That's why operations like git-diff and git-blame can take some time on large trees or histories (and why e.g. git-blame has various options to tweak how it tracks files across revisions, because that is not something git does), while git-log is fast.
Also (for less-technical audiences), I don't exactly dwell on the de-duplication. It's just "Git makes snapshots and puts them into .git in some efficient way. Don't worry about it. Or, if you want the details, read the Git SCM book."
I actually haven't had a problem with this, though perhaps it's because I understand what's happening at a deeper level. You're generally referencing commits which exist somewhere in this family of commits you can view with `git log --graph`. You can easily think of checkout as the path of diffs to get there. Files at commits are still whole objects, mentally, but the thing we care about as programmers working with multiple versions are the diffs.
I have had it break down a bit more when working with stash though, because now the object you're referencing can exist outside of that graph-like commit family.
No. If it chooses to compress the commits, which, if I remember correctly, it does not do automatically for each commit but rather occasionally as a larger step, it uses the difference to whatever it deems to be a good candidate, if it finds one. E.g. if you have a file in commit A, change it massively in later commit B, and then on a different branch create commit C that also changes the file to one very similar to the one in B, git might very well compress C by storing the difference from B to C, despite those having no direct relationship in the commit graph. It can also choose to not use a delta to a different version entirely, and this is 100% an internal implementation detail of the storage system in git (afaik one of those implementation details is that it prefers candidates that are in the same commit chain, but it doesn't have to - and it can easily jump multiple commits if that works better). If you ask git to show you a diff to the previous commit, it does not pull a diff from storage, but pulls two file versions from its storage backend (which resolves any deltas that were used to store them) and diffs them.
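A rough sketch of "delta against whatever candidate works best", using Python's `difflib` as a stand-in delta encoder. The copy/insert encoding, the cost function and the selection rule are all assumptions made for illustration; Git's packing code uses its own delta format and its own heuristics for picking candidates.

```python
import difflib


def make_delta(base, target):
    """Encode `target` as copy/insert instructions against `base`."""
    ops = []
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: that range is simply never copied
    return ops


def apply_delta(base, ops):
    """Rebuild the full object from a base and its delta."""
    return "".join(base[op[1]:op[2]] if op[0] == "copy" else op[1] for op in ops)


def delta_size(ops):
    """Rough cost: inserted characters plus a small fixed cost per instruction."""
    return sum(len(op[1]) for op in ops if op[0] == "insert") + 2 * len(ops)


def best_base(candidates, target):
    """Pick whichever stored object gives the smallest delta, ignoring history."""
    return min(candidates, key=lambda name: delta_size(make_delta(candidates[name], target)))


version_a = "print('hello')\n"
version_b = (
    "import sys\n\ndef main(argv):\n    for arg in argv[1:]:\n"
    "        print(arg.upper())\n\nmain(sys.argv)\n"
)
version_c = version_b.replace("upper", "lower")  # lives on another branch entirely

chosen = best_base({"A": version_a, "B": version_b}, version_c)
rebuilt = apply_delta(version_b, make_delta(version_b, version_c))
```

The selection never looks at the commit graph, and reading `version_c` back always goes through reconstructing the full object first, which is the storage layer's business and not a diff a user ever sees.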
No, it stores an entirely new set of references to objects, as well as some of those objects themselves (any that are not identical to previously stored objects).
You cannot look at a commit on its own and know exactly how it's different from the previous commit, but you do have the complete new state. You have to look at the parent commit's references and do an object-by-object comparison to identify exact changes. On the other hand, when you look at a diff, you can see exactly what has changed, but you cannot produce the version that came before without also having a complete copy of the current version.
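A small Python sketch of that object-by-object comparison, under simplifying assumptions: a snapshot is modelled as a flat mapping from path to blob id (real git trees are nested), and `take_snapshot` and `compare` are names made up for this example. The blob id does follow git's scheme of hashing a `blob <size>\0` header plus the content.

```python
import hashlib

def blob_id(content: bytes) -> str:
    # Git hashes a blob as SHA-1 over "blob <size>\0" followed by the content.
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def take_snapshot(files):
    """Flat stand-in for a tree: every path maps to the id of its content."""
    return {path: blob_id(data) for path, data in files.items()}

def compare(parent, child):
    """Derive the changes by comparing the two sets of references."""
    return {
        "added": sorted(child.keys() - parent.keys()),
        "removed": sorted(parent.keys() - child.keys()),
        "modified": sorted(
            p for p in parent.keys() & child.keys() if parent[p] != child[p]
        ),
    }

parent = take_snapshot({"README": b"hello\n", "main.c": b"int main(){}\n"})
child = take_snapshot({
    "README": b"hello, world\n",
    "main.c": b"int main(){}\n",
    "Makefile": b"all:\n",
})
print(compare(parent, child))
# {'added': ['Makefile'], 'removed': [], 'modified': ['README']}
```

The child snapshot alone says nothing about what changed; the change set only appears once both sets of references are on the table.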
To some extent, this is true. I don't feel the need to totally understand Git's packing logic or the specific mechanics of the various diff/merge algorithms.
But some knowledge of how/why your tools work the way they do can be very helpful.
Some knowledge of a tool's internal workings can be fundamental to efficient use of that tool. At the very least it can allow you to understand or derive your useful interactions with that tool rather than simply memorize how it is used.
> Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size.
Not with `btrfs subvolume snapshot`, it won't. If that's not a snapshot, I don't know what is.
From a storage perspective, no dammit, Git commits are snapshots, look at the bits on disk if you don't believe it. This isn't something that people who like to write blog posts about Git made up for pedagogical purposes, it's how Git actually works.
As you point out, it's wonky for pedagogical purposes; what does it mean to "cherry-pick" a snapshot? When thinking about cherry-picking, yeah, a diff makes more sense than a snapshot. But saying a diff is better pedagogically doesn't change the fact that a commit actually is a snapshot (and when cherry-picking, it diffs two snapshots to create a patch, then applies that patch).
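Here is a toy Python version of "diff two snapshots, then apply the patch". It is heavily simplified: snapshots are plain dicts of path to content, the patch works on whole files, and `diff`, `apply_patch` and `cherry_pick` are hypothetical names. Real git works at line level with a three-way merge, so treat this as an illustration of the idea only.

```python
def diff(parent, commit):
    """Patch from parent to commit: path -> (old content, new content)."""
    patch = {}
    for path in parent.keys() | commit.keys():
        old, new = parent.get(path), commit.get(path)
        if old != new:
            patch[path] = (old, new)
    return patch

def apply_patch(target, patch):
    """Apply a whole-file patch; fail if the target does not match 'old'."""
    result = dict(target)
    for path, (old, new) in patch.items():
        if result.get(path) != old:
            raise ValueError(f"conflict in {path}")
        if new is None:
            del result[path]
        else:
            result[path] = new
    return result

def cherry_pick(target, commit, parent):
    # The "diff" is never stored; it is computed from two snapshots on demand.
    return apply_patch(target, diff(parent, commit))

parent = {"a.txt": "1", "b.txt": "x"}
commit = {"a.txt": "2", "b.txt": "x"}               # changes only a.txt
target = {"a.txt": "1", "b.txt": "y", "c.txt": "z"}  # some other branch

picked = cherry_pick(target, commit, parent)
print(picked == {"a.txt": "2", "b.txt": "y", "c.txt": "z"})  # True
```

The result is again a complete snapshot; the patch exists only for the duration of the operation.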
> From a storage perspective, no dammit, Git commits are snapshots, look at the bits on disk if you don't believe it
Except they're not. They're (often) packfiles, which are a delta encoding i.e. a diff. It's not necessarily the same as a specific commit, but appealing to "the bits on disk" is wrong.
It is certainly true that in the git object model each commit object refers to a tree that represents the complete state of the repository at that commit.
It is also true that many git commands implicitly treat a commit as being the diff between the state of the tree in that commit and the state in the parent. For example git show, git rebase and git cherry-pick.
It is simultaneously true that the on-disk storage system is optimised for performance and so doesn't map onto the object model in a trivial way.
> They're (often) packfiles, which are a delta encoding i.e. a diff. …appealing to "the bits on disk" is wrong.
That's fair. The diffs in a packfile have no relation to the "diff" that a commit would be if the commit were a diff; so it's wrong to use "but packfiles" when arguing that commits are diffs and not snapshots; but you're right, packfiles make my "bits on disk" argument not quite right.
The way I look at it is that packfiles are a compression mechanism; and they don't alter the fact that fundamentally it's snapshots that are being compressed. But that's not the only way of looking at it.
> It is also true that many git commands implicitly treat a commit as being the diff between the state of the tree in that commit and the state in the parent. For example git show, git rebase and git cherry-pick.
A commit is a snapshot, and you can compute the diff between a commit and any of its parents. If a commit has multiple parents, git cherry-pick bails out unless you pick a parent (usually -m 1), and git rebase, I think implicitly assumes the first parent.
> If a commit has multiple parents, … git rebase, I think implicitly assumes the first parent.
`git rebase`'s behavior regarding merge commits is shockingly complicated, but much of the time, because it linearizes the history by default, it actually just skips merge commits: it assumes that the merge has already happened implicitly by applying one of the merge's parents on top of the other parent.
> Obviously both "diffs" and "snapshots" are leaky abstractions.....
Joel Spolsky wrote many great things, but "all abstractions leak" was not one of them (edit: it is his, but it is not good). I am very tired of programmers excusing their poor imagination with appeals to this nonsense.
------
Commits store snapshots. Full stop.
The "bad mental model" is not commits being snapshots, but things being stored individually, i.e.
> Sum |things| = |Product things|
This comes up in many other contexts, especially when storage quotas are involved and it's unclear what to do when storage is deduped across quotas.
-----
git packfiles do use a delta encoding, but it's important to understand that there isn't necessarily any correspondence between the history and the delta encoding. In fact, commands like `git repack` exist precisely to avoid path dependency issues from the repacks matching the history too much.
Saying commits are diffs to explain the delta-encoding storage characteristics is wrong and confuses, not clarifies.
------
> And let's not forget commit messages. If a commit is a snapshot, I would expect the commit-message to be descriptive of the entire snapshot. Whereas if a commit is a diff, I would expect the commit message to be descriptive of the diff. Which is exactly how most people use commit messages.
It's git tree objects that are snapshots; commit objects have a tree child and a previous-commit child, so it is natural for them to describe the relationship between two states without appealing to hypothetical alternatives.
> Not to mention other features the article discussed, such as cherry-picking. What does it even mean to "cherry-pick a snapshot"? In comparison, cherry-picking a diff and applying it to your current state, is far more intuitive.
I might `git checkout somethingelse .` mid-rebase. What does that mean if commits are diffs? Nothing very clear. The better thing to teach people is about darcs and patch theory and those other modules. I think the git model and the patch theory model both have uses, and the fact that git makes people always work in the git model is a fundamental issue that cannot be fixed with analogies.
- Patch theory is good for the things you are still working on
- A Merkle DAG of states is good for the things you've already done / agreed upon.
If your filesystem was copy on write and implemented snapshot semantics internally (like WAFL for example, over 20 years old now), then the second snapshot would not take 100MB, it would just cost the metadata.
A commit is a snapshot of a tree with a reference to its prior ancestors. It's important to know that because it becomes extremely relevant when trying to do things like merges properly.
If you commit a 100MB file, change a few bytes in it and commit it again, your .git/objects will almost certainly contain two 100MB objects. The fact that it is somewhat likely that running "git gc" or something similar will convert one of them into a reference to the other one plus some compact representation of the difference is an implementation detail.
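A scaled-down Python sketch of why that happens, assuming a 1MB file of random (incompressible) bytes instead of 100MB. `write_loose` is a hypothetical helper that mimics a loose object (zlib-compressed `blob <size>\0` header plus content) in an in-memory dict; the later repacking step is not modelled.

```python
import hashlib
import os
import zlib

def write_loose(store, content: bytes) -> str:
    """Store content the way a loose blob object is laid out:
    zlib-compressed 'blob <size>\\0' header plus the full content."""
    raw = b"blob %d\x00" % len(content) + content
    oid = hashlib.sha1(raw).hexdigest()
    store[oid] = zlib.compress(raw)
    return oid

store = {}
big = bytearray(os.urandom(1_000_000))  # stand-in for the 100MB file
first = write_loose(store, bytes(big))

big[500_000] ^= 0xFF                    # flip one byte in the middle
second = write_loose(store, bytes(big))

total = sum(len(data) for data in store.values())
print(first != second, len(store), total > 1_900_000)  # True 2 True
```

One changed byte gives a new object id, so both versions sit in the store in full until something like a repack decides to express one in terms of the other.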
While the commit object does represent the snapshot, it also references the previous state, thus the commit message usually describes what was changed between the referenced snapshot and the parent(s) that are also referenced from the commit object.
As for the overall model and the leakage between implementation details and how people use it, an interesting approach is used by SCCS/BitKeeper with its internal "weave" format, which essentially is both snapshot and diff at the same time.
After going through the "Git Internals"[0] docs, I found that the snapshot mental model has been much more helpful in understanding what my Git commands are doing, how someone's history got into a confusing state, etc. The primary model is that of the Merkle tree, and subsequently hashing, which are very simple and powerful concepts.
I prefer to think of a repo as a whole as a tree, where the nodes are snapshots and the edges between nodes are diffs. This sort of lands us in both places.
> From a storage perspective, describing commits as snapshots seems like a bad mental model. Suppose I have a directory that is 100MB in size. If I take a snapshot of it, my snapshot would be 100MB in size. If I take a 2nd snapshot of it tomorrow, my 2nd snapshot would also be 100MB in size. My total storage needs would now be 300MB.
That's not the way storage snapshots work under most (all?) storage-targeted file systems, filers, etc. What you're talking about there is a backup.
Snapshots are not backups. Snapshots work on "copy on write" basis.
Roughly speaking, when you take a snapshot you draw a line in the sand. "These were the files at this time". Snapshot operations as a result are super cheap and super fast. Future changes to those files result in the filer/file system writing the modified blocks to new locations, not overwriting the original data.
So take a 100MB directory. I create a snapshot. That results in almost no new storage usage, just a small amount of metadata. I write/modify 10MB of data, now the total storage cost is 110MB. If I take another snapshot after writing that 10MB, it's still only 110MB of storage usage.
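The arithmetic above can be checked with a toy copy-on-write model in Python. Everything here is an assumption made for illustration: `CowVolume` is a made-up class, one block stands for 1MB, a snapshot is just a copied list of block ids, and metadata cost is ignored. No real filer or file system is being described.

```python
class CowVolume:
    """Toy copy-on-write volume: one block == 1MB, blocks identified by id."""

    def __init__(self, n_blocks: int):
        self.live = list(range(n_blocks))  # block ids the live volume points at
        self.next_id = n_blocks
        self.snapshots = []

    def snapshot(self):
        # Metadata only: remember which blocks were live at this moment.
        self.snapshots.append(list(self.live))

    def write(self, index: int):
        shared = any(snap[index] == self.live[index] for snap in self.snapshots)
        if shared:
            # A snapshot still needs the old block, so write to a new location.
            self.live[index] = self.next_id
            self.next_id += 1
        # Otherwise the block is overwritten in place and no space is added.

    def used_mb(self) -> int:
        blocks = set(self.live)
        for snap in self.snapshots:
            blocks.update(snap)
        return len(blocks)

vol = CowVolume(100)        # the 100MB directory
vol.snapshot()
print(vol.used_mb())        # 100
for i in range(10):         # modify 10MB
    vol.write(i)
print(vol.used_mb())        # 110
vol.snapshot()
print(vol.used_mb())        # 110
```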
If "diffs" and "snapshots" are leaky abstractions, that often enough lead you badly astray, then why insist on these abstractions in the first place?
Why not just teach people the mental model behind Git up-front? Objects form an immutable directed acyclic graph, human-readable names point at objects, there are some rules by which the graph is being extended and pruned, and by which names (references) are being updated to point at different objects.
This isn't a hard mental model, not for programmers (for whom the tool is intended in the first place). If you know how the most basic pointer-based data structures - a linked list, a tree, a directed graph - work, then learning the actual model isn't hard, and it immediately clarifies why Git does what it does. It should be taught to people up front.
A commit isn't a diff, and it isn't a snapshot. It's a bunch of objects Git creates for you, where the "commit" object points at previous commits and at a tree, built of "tree" and "blob" objects. When Git wants to know how to recreate your file structure, it starts at the "commit" object and walks the graph to discover what files and folders should exist. When you make a change and perform the "commit" action, Git creates a new "commit" object and a new "tree" object for it, and adds more objects to the graph to encode what changed, while reusing previously existing objects for things that did not change. The end state is, if you start at your new "commit" object and walk the graph, the resulting description of your file structure should be equal to what's on your hard drive when you made your commit.
Trying to paper over that with "friendly abstractions" is what makes Git difficult to understand.
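For anyone who wants to poke at that object graph, here is a minimal Python sketch of it. The assumptions are loud ones: objects live in an in-memory dict, they are serialized as JSON rather than in git's real formats, and `put`, `write_tree`, `commit` and `checkout` are names invented for this example. What it does show is commit, tree and blob objects, reuse of unchanged objects, and rebuilding the file structure by walking the graph.

```python
import hashlib
import json

store = {}  # object id -> (kind, payload); a stand-in for .git/objects

def put(kind, payload):
    """Content-addressed write: the id is a hash of what is stored."""
    data = json.dumps([kind, payload], sort_keys=True).encode()
    oid = hashlib.sha1(data).hexdigest()
    store[oid] = (kind, payload)
    return oid

def write_tree(directory):
    """directory maps names to file contents (str) or subdirectories (dict)."""
    entries = {}
    for name, value in directory.items():
        if isinstance(value, dict):
            entries[name] = write_tree(value)
        else:
            entries[name] = put("blob", value)
    return put("tree", entries)

def commit(directory, parents, message):
    payload = {"tree": write_tree(directory), "parents": parents, "message": message}
    return put("commit", payload)

def checkout(commit_id):
    """Start at the commit and walk the graph to rebuild the file structure."""
    def read(oid):
        kind, payload = store[oid]
        if kind == "blob":
            return payload
        return {name: read(child) for name, child in payload.items()}
    return read(store[commit_id][1]["tree"])

v1 = {"README": "hi\n", "src": {"main.py": "print(1)\n", "util.py": "pass\n"}}
c1 = commit(v1, [], "initial")
before = len(store)  # 3 blobs + 2 trees + 1 commit

v2 = {"README": "hello\n", "src": v1["src"]}
c2 = commit(v2, [c1], "update README")

# Only a new README blob, a new root tree and a new commit were added;
# the "src" tree and its blobs are reused.
print(before, len(store))                          # 6 9
print(checkout(c1) == v1 and checkout(c2) == v2)   # True
```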
Depends on the diff. If the diff is not bit-aligned, a single-bit offset might double the size, i.e. the diff becomes a full-file delete plus a full-file add.
>If you insist on using the "snapshot" abstraction
But it's not insisted on. Both abstractions are used as needed.
>… you will eventually need to explain that a commit is actually a combination of diffs, along with some other metadata like a pointer to a parent commit.
Except that this is completely wrong.
A commit is a snapshot of the tree. There are no diffs.
There is also no "metadata attached" — the commit is the actual data (!) describing the tree snapshot.
Git is a kind of simple content-addressable object store holding a kind of Merkle tree of objects. That would be a proper (abstract) description.