Monday, December 19, 2011

Git advantages in the corporate environment

I've come across a couple of blogs and message board posts lately (some of them admittedly dated) that insist git is not usable in a corporate environment because:


• git is not centralized.

Corporate environments MUST HAVE a centralized solution for backup purposes.

• git does not have canonical revision numbers.

Corporate environments must have canonical version numbers.

 

The Backup Issue


The first point is, of course, invalid.  Git can be used in a centralized manner, making all the backup monkeys happy.  Used in this manner, it's no different from subversion in that the repository exists on a server and it can be backed up.


However, that ignores a far bigger point.  What if your IT department isn't quite as awesome as they think they are?  What if you were told about the crack sysadmin team, only to learn the system admin is a low-rate overseas hourly worker?  What if he didn't quite get that backup job done before the bus came?  What if he hasn't checked if the script he wrote 4 months ago is still working?


One night your server stops responding, and after opening a case with IT, you learn to your horror that they're unable to restore the backup.


Not to worry, right?  Every developer on the team has a copy of your subversion repository, right?  Wrong.  They have a single checkout of the repository.  One version.  Need to revert to a previous version?  Can't do it.  Need to look at the changes involved in a particular bug fix?  Not happening.  Need to review the commit messages for a particular commit?  Sorry, the entire log is gone.


This differs markedly from the git world.  In the git world, should your centralized repository be lost, every single developer has a copy of the entire repository.  It is possible that if a developer hasn't updated between the time another developer pushed some commits and the central repository was lost, he might be a couple commits behind.  This is true in the subversion world too.  If developers hadn't updated before the repository was lost, only the last person who checked in will have the last few commits.


It can't be stressed enough how much better your situation is with git.  Think about it.  With subversion, you'd figure out who had the up-to-date copy of the repository.  Then what?  You can't find out what revision your teammate had and send him a patch to bring his copy of the project up to date, since subversion can't generate a diff without access to the lost central repository.  I suppose the most up-to-date developer could put his copy on a file share somewhere and developers could manually diff the files and apply any changes.  Yeah, that's it, what fun.


The other choice of course is just to take the most up-to-date developer's project and use it to create a new subversion repository.  That will work and the team will be able to check out an up-to-date copy of the project.  An up-to-date copy with no history, no previous versions, nothing. Certainly not what one would consider "enterprise-level source code management."

Oh, but it also gets better.  What if your team's discipline is really lacking, and nobody has a version checked out that actually works?  You're hosed - you can't revert to a known working version - you have no central repository anymore.


Now let's look at the git situation.  As with subversion, the team is likely to figure out who has the most recent copy of the repository.  With one command he can generate patch files that his colleagues can apply to bring their repositories up-to-date before any kind of centralized server is restored.  Depending on the size of the team, team members could be completely up-to-date, sharing changes again and productive again literally within a few minutes.


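Oh, and those patch files?  Not only do they update the code to the correct state, but they also include the log messages, authors, dates, etc.  One caveat: commits re-created from patches get a new committer stamp and therefore new hashes.  If he hands over a bundle instead (git bundle), or his colleagues simply pull straight from his repository, the hashes come along too, and the resulting repository looks EXACTLY like the repository from which it came.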
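To make the recovery concrete, here is a minimal Python sketch that drives the git command line.  It assumes git is installed, and it uses git bundle rather than patch files because a bundle carries the commits over with their hashes intact.  The directory names (survivor, restored), the file rescue.bundle, and the throwaway identity are made up for illustration.

```python
import subprocess
from pathlib import Path

# Identity passed on the command line so the sketch does not depend on
# whatever global git config happens to exist (values are placeholders).
GIT = ["git", "-c", "user.name=Example Dev", "-c", "user.email=dev@example.com"]


def git(repo, *args):
    """Run a git command inside `repo` and return its stdout as text."""
    result = subprocess.run(
        GIT + list(args), cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout.strip()


def demo_recovery(workdir):
    """Rebuild a repository from the most up-to-date developer's copy."""
    workdir = Path(workdir)

    # The developer whose clone survived, with three commits of history.
    survivor = workdir / "survivor"
    survivor.mkdir()
    git(survivor, "init", "-q")
    for n in (1, 2, 3):
        (survivor / "file.txt").write_text(f"version {n}\n")
        git(survivor, "add", "file.txt")
        git(survivor, "commit", "-q", "-m", f"commit {n}")

    # One command packs every branch and its full history into a single file.
    bundle = workdir / "rescue.bundle"
    git(survivor, "bundle", "create", str(bundle), "--all")

    # A colleague (or the rebuilt server) clones straight from that file.
    restored = workdir / "restored"
    git(workdir, "clone", "-q", str(bundle), str(restored))

    return (
        git(survivor, "rev-parse", "HEAD"),
        git(restored, "rev-parse", "HEAD"),
        git(restored, "rev-list", "--count", "HEAD"),
    )
```

Calling demo_recovery on an empty scratch directory returns the tip hash of both repositories plus the number of commits in the restored one.  The two hashes should match, which is the whole point: the restored copy is the same history, not a fresh start.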


The Revision Number Issue


I don't know who came up with this gem, but talk about a ridiculous argument. Someone tell me what a canonical version number is?  Is it sequential?  Is it unique? Is it alphanumeric?  Is it octal?


Although subversion has simple version numbers, they're not particularly useful.  Sure, I can tell my team member - "Hey, create that branch off revision 2142 of trunk," but that doesn't do him any good other than pointing to a version.


And while you would think you could figure out relationships between revisions using the revision number, it's actually difficult to do so, since revision numbers are global and increment when other developers check in, or even when someone checks in on another branch.  So, think that revision 10 on trunk is the result of 5 commits on trunk since you checked in revision 5?  Could be, but it also could be that someone checked in 4 revisions to a branch, so that revision 10 actually follows revision 5 directly in trunk.  Not much value there.
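The revision 5 and revision 10 scenario is easy to reproduce with a toy model.  This is only a sketch of a single global counter shared by every path, not subversion's actual implementation, and the branch name is made up.

```python
from collections import defaultdict


class Repo:
    """Toy repository with one global, ever-increasing revision counter."""

    def __init__(self):
        self.revision = 0
        self.history = defaultdict(list)  # path -> revisions that touched it

    def commit(self, path):
        # Every commit bumps the same counter, whichever path it touches.
        self.revision += 1
        self.history[path].append(self.revision)
        return self.revision


repo = Repo()
for _ in range(5):
    repo.commit("trunk")             # revisions 1 through 5
for _ in range(4):
    repo.commit("branches/feature")  # revisions 6 through 9 (hypothetical branch)
repo.commit("trunk")                 # revision 10

trunk = repo.history["trunk"]
print(trunk)  # [1, 2, 3, 4, 5, 10]
```

On trunk, revision 10 is the very next change after revision 5.  The gap between the two numbers says nothing about how much trunk actually changed.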

Now, git has those super-complicated hex-looking numbers.  I mean, really.  We as developers are supposed to be able to deal with these hex strings?  How ridiculous is that?

First of all, let's get the easy part out of the way.  Those big hex numbers are not difficult to deal with because we can shorten them dramatically.

In small repositories, four characters is often sufficient.  In large repositories, seven or eight characters (a dozen or so in truly enormous ones) is going to be sufficient, and git will tell you if the prefix you typed is ambiguous.  No more difficult than typing version "21042."  And of course we can use all sorts of symbolic names and expressions such as HEAD, HEAD^ (the parent of HEAD), time-based specifiers and others.
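How short can you go?  Roughly, as short as still picks out exactly one commit.  Here is a sketch of that idea in Python.  The function name is mine, the sample IDs are just SHA-1 hashes of made-up strings rather than real commits, and the floor of four characters mirrors what I understand to be git's own minimum.

```python
import hashlib


def shortest_unique_prefix_length(ids, minimum=4):
    """Return the smallest prefix length (>= minimum) at which all ids differ."""
    ids = list(ids)
    if not ids:
        return minimum
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be distinct")
    longest = max(len(i) for i in ids)
    for length in range(minimum, longest + 1):
        # A length works when truncating produces no collisions at all.
        if len({i[:length] for i in ids}) == len(ids):
            return length
    return longest


# Stand-ins for the commit IDs of a 1,000-commit repository.
sample_ids = [hashlib.sha1(f"commit {n}".encode()).hexdigest() for n in range(1000)]
print(shortest_unique_prefix_length(sample_ids))
```

The answer grows slowly with the number of objects in the repository, which is why a handful of characters goes a long way.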

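More importantly, however, the commit ID is a hash of the commit itself: the complete state of the files at that point, the log message and author, and the ID of the parent commit.  What this means is that you can GUARANTEE that code with the same hash is exactly the same, history included.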
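There's no magic in those IDs.  For a file's contents (a "blob" in git terms), the ID is just the SHA-1 of a small header followed by the bytes, which a few lines of Python can reproduce.  This sketch assumes the default SHA-1 object format; the function name is mine.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Compute the ID git assigns to a file with exactly these contents."""
    # git hashes a header ("blob", a space, the size, a NUL byte) plus the data.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty file)
print(git_blob_id(b"hello world\n") == git_blob_id(b"hello world\n"))  # True
print(git_blob_id(b"hello world\n") == git_blob_id(b"hello world!\n"))  # False
```

Same bytes, same ID, on any machine.  Change one character and the ID changes completely.  Commits are hashed along the same lines, with the file tree, parent ID, author and message as the input.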

Here's how cool this is:

Let's say for a minute that the repository that went down was a public project that my company exposes to the outside world and allows others to clone.  And now let's say that somehow, some way, I lost the last 5 commits out of that repository.  I have a copy of the log, so I know what the commits were, but I already checked with everyone in my company and nobody has a copy of those commits.  Oh man, I'm in big trouble.  I don't have a trustworthy source.

Can I really just reach out to some random 12-year-old on the Internet who has a copy of the repository and ask for a patch containing those commits?  Maybe I should find someone who works for another large company.  Surely that's a good way to make sure they're trustworthy.  Or maybe I can get my company to pay for a background check on someone.

The fact of the matter is, if you have the hash, you need not worry.  The code could come from the most notorious hacker group on the planet and as long as the hashes match, I know they did not change one single bit in my code.
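Here is a toy version of that check in Python.  It is deliberately simplified and is not git's real commit format: each made-up commit is hashed together with its parent's ID, so the final ID vouches for everything before it.  All names and data are invented for illustration.

```python
import hashlib


def commit_id(parent_id, message, content):
    """Hash a toy commit; the parent's ID is part of the input."""
    payload = f"parent {parent_id}\n{message}\n".encode() + content
    return hashlib.sha1(payload).hexdigest()


def tip_id(commits, base_id):
    """Replay (message, content) pairs on top of base_id and return the last ID."""
    current = base_id
    for message, content in commits:
        current = commit_id(current, message, content)
    return current


def verify(commits, base_id, trusted_tip):
    """True only if the untrusted commits reproduce the hash from my own log."""
    return tip_id(commits, base_id) == trusted_tip


base = "0" * 40  # stand-in for the last commit I still have
lost = [(f"fix {n}", f"code version {n}".encode()) for n in range(5)]
trusted_tip = tip_id(lost, base)  # the hash I kept in my copy of the log

# A stranger hands back the five commits, with one byte altered in the first.
tampered = [("fix 0", b"code version 0!")] + lost[1:]

print(verify(lost, base, trusted_tip))      # True
print(verify(tampered, base, trusted_tip))  # False
```

The change was made in the first of the five commits, yet it is the final hash that fails to match.  I only need to trust the one hash I already have, not the person who sent me the data.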

Conclusion

Wow.  That's a lot more than I intended to write.  But hopefully you can see now why such arguments are not only ridiculous, but just plain wrong.

For the Mercurial fans out there: I'm willing to bet Mercurial has the same advantages as git in these areas.  Both are excellent distributed version control systems and I wouldn't hesitate to use either.

Finally, I do recognize that there are situations for which a DVCS may not be appropriate.  If you have terabytes of data in your repository, git or hg is probably not for you without a lot of repository re-organization.  In those cases, by all means stick with your centralized VCS.  Just realize that because you have something preventing you from using a DVCS doesn't mean the DVCS advantages don't still exist.

1 comment:

  1. Oh yeah. Git changed how our company writes code. Dramatically.

    Branching is done so well that it's actually the normal way to do business in git (for us, at least). Very powerful medicine that lets us do things we would never dream of doing in svn. Right now we've got no less than three branches outstanding on our web portal, each in active development, and much of the same code being touched in each branch, and we're not afraid--git's ability to deal with conflicting merges is outstanding. In svn, that's just not practical.

    And how about your repository being garbage collected? Accidentally blow away a branch before pushing it to the master repository? So, undelete it. It's easy.

    Oh, and being able to rewrite any non-shared history. Got four or five commits you'd rather make into two before you push them upstream? Easy.

    Oh, and it's fast.
