cat /dev/brain |

git du: historic object size

published on Tuesday, May 30, 2017

Ever wanted a git du that accumulates the size of files or folders over the entire history? Or just wanted to know the maximum or average size of a file or folder?

Download the git-du-helper.pl script into your current working directory and then execute the following command:

git rev-list master | xargs -L1 -- git ls-tree -lr | ./git-du-helper.pl > raw_sizes.txt

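You get a text file with the columns SUM_SIZE MAX_SIZE NUM_REVS PATH, one line per file and per folder (folder paths end in a slash). NUM_REVS counts the distinct object versions seen under that path rather than commits, and SUM_SIZE adds up each distinct version once.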

What's going on here?
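Roughly: git rev-list master prints every commit reachable from master, one hash per line. xargs then runs git ls-tree -lr once per commit, which lists every blob in that commit's tree with its mode, type, hash and size. The helper keys the sizes by path and object hash, so a file version that survives unchanged across many commits is counted only once. The sketch below mimics the files-only part of that aggregation in awk, just to make it visible. It is not the helper itself: the function names are made up, and the sample lines (hashes, sizes, paths) are invented.

```shell
#!/bin/sh
# Invented sample: `git ls-tree -lr` output for two revisions, concatenated.
# Real lines have the same shape: <mode> <type> <hash> <size>TAB<path>
sample_ls_tree() {
    printf '100644 blob aaaa111     120\tREADME.md\n'
    printf '100644 blob bbbb222    2048\tsrc/main.c\n'
    printf '100644 blob aaaa111     120\tREADME.md\n'    # unchanged in the older revision
    printf '100644 blob cccc333    1024\tsrc/main.c\n'   # older version of the same file
}

# Files-only aggregation: count each distinct (path, hash) pair once,
# then print sum, max and number of versions per path.
aggregate_files() {
    awk '
    BEGIN { FS = "\t" }                      # metadata and path are tab-separated
    {
        split($1, meta, " ")                 # meta: mode, type, hash, size
        path = $2
        hash = meta[3]
        size = (meta[4] == "-") ? 0 : meta[4] + 0
        if ((path, hash) in seen) next       # this version was already counted
        seen[path, hash] = 1
        if (!(path in count) || size > max[path]) max[path] = size
        sum[path] += size
        count[path]++
    }
    END { for (p in sum) print sum[p], max[p], count[p], p }
    '
}

sample_ls_tree | aggregate_files | sort -k4
# prints:
# 120 120 1 README.md
# 3072 2048 2 src/main.c
```

README.md shows up in both revisions with the same hash, so it contributes a single version. src/main.c changed, so both versions are summed.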

You can now run further queries on this file to get pretty-printed output, e.g. sort by MAX_SIZE, show human-readable sizes (KiB/MiB/…) and align the columns:

< raw_sizes.txt sort -n -k2,2  |
    numfmt --to=iec-i --field=1,2 --format='%.1f ' --suffix=B |
    column -t > file_sizes.txt
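The intro asked about average sizes too. raw_sizes.txt has no column for that, but SUM_SIZE divided by NUM_REVS gives the mean size per distinct version. A minimal sketch, assuming the space-separated four-column layout of raw_sizes.txt; avg_sizes is a name made up here:

```shell
#!/bin/sh
# Read "SUM_SIZE MAX_SIZE NUM_REVS PATH" lines on stdin and print
# "AVG_SIZE PATH", largest average first.
avg_sizes() {
    awk '$3 > 0 {
        path = $0
        sub(/^[^ ]+ [^ ]+ [^ ]+ /, "", path)   # drop the three numeric columns, keep the path
        printf "%.1f %s\n", $1 / $3, path
    }' | sort -rn
}

# Usage:
#   avg_sizes < raw_sizes.txt | head
```

Stripping the first three columns with sub keeps paths that contain spaces intact, which a plain $4 would truncate.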

Note that these numbers don't reflect the actual storage size on disk: git compresses objects and can store similar objects as deltas in packfiles (see also this excellent answer on SO and maybe this post). If you want to detect large files on disk, take a look at Steve Lorek's article How to Shrink a Git Repository, which shows how to get actual file sizes using git verify-pack.
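For a quick look at the on-disk side, git verify-pack -v prints one line per packed object with both its uncompressed size and its size inside the packfile. The sketch below filters that output for the largest blobs. It is one way to read the output rather than a reproduction of the article's steps, and top_packed_blobs is a made-up name.

```shell
#!/bin/sh
# Read `git verify-pack -v` output on stdin. Object lines look like:
#   <hash> <type> <size> <size-in-pack> <offset> [<depth> <base-hash>]
# Print "<size-in-pack> <size> <hash>" for the N largest blobs (default 10).
top_packed_blobs() {
    n=${1:-10}
    awk '$2 == "blob" { print $4, $3, $1 }' | sort -rn | head -n "$n"
}

# Usage inside a repository (run `git gc` first so that objects are packed):
#   git verify-pack -v .git/objects/pack/pack-*.idx | top_packed_blobs 5
# To map a hash back to a path:
#   git rev-list --objects --all | grep <hash>
```

The summary lines that verify-pack appends (chain lengths, the final "ok") never have "blob" in the second column, so the filter skips them.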

For completeness, my git-du-helper.pl looks as follows:

git-du-helper.pl

#! /usr/bin/env perl

use strict;
use warnings;

use File::Basename qw(dirname);
use List::Util qw(sum max);

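# Optional -f flag: report files only and skip the per-folder roll-up.
# Shift it off @ARGV, otherwise <> below tries to open a file named "-f"
# instead of reading STDIN.
my $just_files = scalar(@ARGV) > 0 && $ARGV[0] eq '-f';
shift @ARGV if $just_files;

# path => { object hash => size }; keying by hash counts each version once.
my %folder_contents;

while (<>) {
    chomp;
    # git ls-tree -l prints: <mode> <type> <hash> <size>\t<path>
    my ($mode, $kind, $hash, $size, $filename) = split(/\s+/, $_, 5);
    $size = 0 if ($size eq '-');    # non-blob entries (e.g. submodules) have no size
    # Record the object for the file itself, then for every parent folder.
    do {
        $folder_contents{$filename}{$hash} = $size
    } until ($just_files || (($filename = dirname($filename) . "/") eq './'));
}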

# One line per path: SUM_SIZE MAX_SIZE NUM_REVS PATH, in hash order (unsorted).
while (my ($path, $objects) = each %folder_contents) {
    print(sum(values %$objects), " ",
          max(values %$objects), " ",
          scalar(keys %$objects), " ",
          $path, "\n");
}

This entry was tagged git, history and plumbing