jsburbidge | Why git blame is Useless for the General Case

A few days ago, one of my co-workers contacted me about a possible bug in some code he had been going over. Part of the code went back to the original check-in, made when it was migrated from a system called Harvest about four years ago (losing all of its history in the process), and a number of lines had my name on them in git blame.

A little bit of checking showed that the algorithm was essentially as it had been four years ago. Several lines were marked as mine because I had converted the macro TRUE to the boolean value true on a couple of lines, and one because I had taken a freestanding C function and turned it into a member of the class in which it operated. For practical purposes, the code was the same as it had always been - but my name was all over it. In addition, the problem would be expected to take the form of a line having dropped out, and there is no blame tracking attached to deletions.

In actual fact, the conclusion to be drawn was that the code was legacy code. Minor tweaks obscured that fact.

Git blame operates on a per-line basis. But any change to the line - tweaking parentheses, for example, or converting a C-style cast to a C++ cast - makes you the owner of the line.
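The effect is easy to reproduce with a toy model. The sketch below is not git's actual algorithm, only a minimal line-based attribution built on Python's difflib; the three revisions and the author names ("migration", "me", "coworker") are made up for illustration. A one-token tweak transfers ownership of the whole line, while a deleted line leaves no owner behind at all.

```python
import difflib


def toy_blame(revisions):
    """Attribute each line of the newest revision to an author.

    `revisions` is a list of (author, lines) pairs, oldest first.
    This is a simplified illustration, not git's real blame algorithm.
    """
    owners = []    # owners[i] is the author credited with previous[i]
    previous = []  # lines of the revision processed so far
    for author, lines in revisions:
        matcher = difflib.SequenceMatcher(a=previous, b=lines, autojunk=False)
        updated = []
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "equal":
                # Untouched lines keep whoever owned them before.
                updated.extend(owners[i1:i2])
            elif tag in ("replace", "insert"):
                # Any textual change, however small, reassigns the line.
                updated.extend([author] * (j2 - j1))
            # "delete" adds nothing: a removed line has no owner to show.
        owners, previous = updated, lines
    return list(zip(owners, previous))


# Hypothetical history: the original check-in, a TRUE -> true tweak,
# and a later revision in which one line drops out.
ORIGINAL = [
    "bool done = false;",
    "while (!done) {",
    "    step();",
    "    if (finished()) done = TRUE;",
    "}",
]
TWEAKED = [line.replace("TRUE", "true") for line in ORIGINAL]
DROPPED = [line for line in TWEAKED if line.strip() != "step();"]

HISTORY = [("migration", ORIGINAL), ("me", TWEAKED), ("coworker", DROPPED)]

for author, line in toy_blame(HISTORY):
    print(f"{author:<10} {line}")
```

In the printed result the tweaked line belongs to "me" even though the algorithm never changed, and "coworker", who removed a line, does not appear anywhere.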

On a greenfield project where responsibility is doled out in blocks it might be useful, but on a legacy project it's worse than useless.

By coincidence, I had been looking at the blame record for a makefile the day before. The makefile had an if-else block where both branches had the same statements. (In other words, there should have been no if-else block, but just the list of statements.) Blame shows five different names associated with that block of code, but not in any coherent manner. All of them except, I think, the oldest bear some passive responsibility for the poor structure.

When I look at a block of code and want to see its history, I want to see how the algorithm evolved, not how different lines were tweaked while retaining the same algorithm.

You can't avoid this problem as long as your algorithms are line-based. It's a whole different level of difficult to provide a program which divides a program into logical chunks and applies that analysis to the raw record; or (worse) to determine when apparently minor changes create new logic but skip over better implementations using the same logic. (A for loop and a find_if statement may be exactly formally equivalent, but don't expect any automated help to know that.)
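To make the for loop and find_if point concrete, here is a small sketch, again using difflib as a stand-in for a line-based tool. The two C++ snippets are hypothetical and are held only as strings; nothing compiles them. A line comparison sees the whole body as rewritten, although both versions express the same search.

```python
import difflib

# Two hypothetical C++ implementations of the same search.
LOOP_VERSION = """\
bool contains(const std::vector<int>& v, int key) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (v[i] == key) {
            return true;
        }
    }
    return false;
}
""".splitlines()

FIND_IF_VERSION = """\
bool contains(const std::vector<int>& v, int key) {
    return std::find_if(v.begin(), v.end(),
                        [key](int x) { return x == key; }) != v.end();
}
""".splitlines()


def rewritten_lines(old, new):
    """Return the lines of `new` that a line-based comparison treats as changed."""
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    touched = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert"):
            touched.extend(new[j1:j2])
    return touched


for line in rewritten_lines(LOOP_VERSION, FIND_IF_VERSION):
    print(line)
```

Only the signature and the closing brace survive as unchanged, so a per-line history would credit the entire body to whoever made the swap, with no hint that the logic is the same.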

So I will continue to avoid git blame. If I really need to look at the history of a piece of code, I'll look up the diffs from the history of the codebase and read them as integral wholes.
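One way to pull up those whole diffs is sketched below, on the assumption that git is on the PATH and the file lives in a git working tree. The helper names history_command and file_history are hypothetical; the options used (--patch, --follow) are standard git log options.

```python
import subprocess


def history_command(path):
    """Build a git command listing every commit that touched `path`, with full diffs."""
    # --patch prints each commit's complete diff for the file;
    # --follow keeps tracking the file across renames.
    return ["git", "log", "--patch", "--follow", "--", path]


def file_history(path, repo_dir="."):
    """Run the command in `repo_dir` and return its output as text.

    Raises subprocess.CalledProcessError if git exits with an error,
    for example when `repo_dir` is not inside a repository.
    """
    result = subprocess.run(
        history_command(path),
        cwd=repo_dir,
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout
```

Each commit then appears as a single unit, with its message and its whole diff for the file, rather than being scattered across individual lines.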