git-smart-backup Code

Brought to you by: notacle

File	Date	Author	Commit
README.md	2021-10-21	notacle	[af01dd] Still more corrections
git-backup-all	2021-10-21	notacle	[935228] Initially published
git-backup-lastcommit	2021-10-21	notacle	[935228] Initially published
git-backup-one	2021-10-21	notacle	[935228] Initially published

Read Me

A set of bash scripts for backing up git repositories in a smart way

In this note I will describe how I have set up a backup process for all my personal git repositories on GitLab and elsewhere. A repo is only backed up when there are new commits (in any remote branch) since the last backup. The backups themselves are timestamped tar.gz files that live on the local file system and can be copied manually or automatically to external locations from there.

The big picture

Whenever I start up my work computer, cron runs a script that does a git fetch --all for the local git repo in question, then calculates the timestamp of the most recent commit regardless of the branch and checks to see if a backup already exists for that timestamp. If not, a backup is created.

Of course, this implies that I have git cloned the repo to my local file system. In fact, I have more than one side project I want to keep backups of, so I collect them all together in one directory and then run the backup script for every work/subdirectory in that root. These work directories only serve as backups. Whenever I actually work on side projects I use a regular work directory in my home directory project structure. Only after I commit changes to the remote repository from there will they eventually show up in my backup storage.

Note: in the source code for the scripts shown below I frequently reference my own linux user home directory on my development machine, being /home/mbp. When adopting these scripts you should of course use your own home directory.

How to use

For those of you who are impatient, here is the too long; didn't read version:
1. Set up a base directory for maintaining backups in this way, for example /data/git-backups.
2. In that base directory, create two subdirectories, work and storage.
3. All these directories should be writable by your user account (so don't be root when you set them up).
4. Create three script files in your own ~/bin directory with the names and contents shown in this note. Make sure they have their executable bit set.
5. Check that the script files point to the directories you set up in steps 1 and 2, and make sure all cron-related references point to your bin directory, not mine.
6. git clone all the projects you want to back up with this scheme as subdirectories of your work directory from step 2. It does not matter which branch is checked out locally. Do not actually work in these directories: do not edit sources, commit changes or git push from them.
7. Check out another copy of the repository somewhere else and use that for editing, debugging and committing. Once you push a commit upstream from that directory, the backup script will pick up the changes at the next restart of your computer and fetch them into your project directory under /data/git-backups/work.
8. Add the git-backup-all script to your own personal crontab so it runs every time you start up your workhorse.
9. With some regularity, copy all files in the storage subdirectory created in step 2 to a safe location. These are the actual backup files.
10. If you want to add a new project to the backup list, simply git clone it into /data/git-backups/work/. It will be picked up automatically.
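
The directory setup from the first three steps could be sketched roughly as follows. This is only an illustration: it uses a throwaway directory created with mktemp as the base (an assumption made so the snippet can be tried without touching /data); in real use you would point BASE at your own base directory, such as /data/git-backups.

```shell
#!/bin/bash
# Sketch of the first three steps: create the base directory and its two
# subdirectories. BASE is a throwaway location for this demo (an assumption);
# in real use set it to something like /data/git-backups.
BASE="$(mktemp -d)/git-backups"

mkdir -p "$BASE/work" "$BASE/storage"

# Both directories must be writable by the current (non-root) user.
for dir in "$BASE/work" "$BASE/storage"; do
  if [ -w "$dir" ]; then
    echo "ok: $dir is writable"
  else
    echo "problem: $dir is not writable" >&2
  fi
done
```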

Finding the most recent commit

The git fetch --all command copies the latest state of all branches from every configured remote repository to your local repository. After this, git show is applied to each known remote branch and its output is tailored to print the commit timestamp. The timestamps are formatted as year-month-day_hour-minute with leading zeroes for each separate part, so that an alphabetical sort has the effect of sorting them chronologically as well.

Since we don't need to know the timestamps of all commits to the branch we are investigating, but only the most recent one, we use the Linux head command. When given a branch name, git show prints the commit at the tip of that branch, which is its most recent commit: first our formatted timestamp, then the diff. So we simply pick the topmost line of the output for the branch.

We do this for all branches and then once again select the most recent timestamp from amongst these.

How do we know which branches exist in the repository? We don't, but we can find out the remote branch names with git branch -r. This prints a list of all branches known in the remote repository. Using some shell glue to pipe these branch names into the above, we finally get a list of timestamps that represent the date and time of the most recent commit for each known remote branch. But git branch -r gives us one name too many, namely origin/HEAD. This is actually a reference and not a branch name so we use grep -v to weed it out.
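
To see the filtering step in isolation, here is a small sketch that runs grep -v on simulated git branch -r output. The branch names are made-up sample data, not taken from a real repository.

```shell
#!/bin/bash
# Simulated output of `git branch -r` (sample data, not from a real repo).
sample_branches='  origin/HEAD -> origin/master
  origin/develop
  origin/master'

# grep -v drops every line that contains HEAD, leaving the real branch names.
branches=$(printf '%s\n' "$sample_branches" | grep -v HEAD)

# Unquoted expansion splits the list on whitespace, one word per branch.
for branch in $branches; do
  echo "branch: $branch"
done
```

Note that grep -v HEAD would also drop any branch whose name happens to contain the string HEAD.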

The timestamp script

All in all, the bash script to find the timestamp for the most recent commit to the current local git repository looks like this:

filename: git-backup-lastcommit

#! /bin/bash
# Print the timestamp of the most recent commit across all remote branches.
for branch in `git branch -r | grep -v HEAD`; do
  echo -e `git show --format="%cd" --date=format:"%Y-%m-%d_%H-%M" $branch | head -n 1`;
done | sort -r | head -n 1

The output of this shell script is a formatted timestamp for the most recent commit across all remote branches. Note that it does not do a git fetch --all to synchronize state from remote to local. This is done in the script one level up.

A blow by blow explanation:
* #! /bin/bash in Linux, the very first line of a script may specify which program must be used to interpret it.
* for branch in list ; do commands ; done repeat the enclosed commands for every whitespace-separated item in the list. We name the variable for the current item branch and reference it as $branch. do and done are the opening and closing brackets of the bash for loop syntax.
* git branch -r list all branches known in the remote repository. This is the list the for loops over, except that one entry is filtered out first.
* grep -v HEAD print all lines not containing the string HEAD. This suppresses the origin/HEAD line, which is a reference and not a branch.
* combining the above enclosed in backticks ` commands ` tells bash to run the enclosed commands and substitute their output as the list for the for loop.
* echo -e prints the result of whatever is placed inside of the following backticks.
* git show format output $branch | head -n 1 performs a git show for the current branch name in the for loop and then truncates the output of that git show to only the first line.
* --format="%cd" --date=format:"%Y-%m-%d_%H-%M" formats the output of the git show command. The %cd tells git we want the commit timestamp, and --date=format:"%Y-%m-%d_%H-%M" specifies how to format this timestamp. Notice that we need to sort timestamps alphabetically in order to sort them chronologically, so the order of the constituent parts and the presence of leading zeroes are important.
* done | sort -r | head -n 1 done is the closing bracket to the bash for .. do loop. The loop prints one timestamp for each branch. We take this list of timestamp strings and sort -r them in reverse alphabetical order. Assuming our timestamp format is correct, this has the effect of also sorting them in reverse chronological order. Finally, we have a list of sorted timestamps and of this list we print the topmost (head -n 1). This is the timestamp of the most recent commit across all remote branches.
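
The shape of the loop and the sort can be tried without a git repository. In the sketch below, tip_stamp is a hypothetical stand-in for the git show ... | head -n 1 part, and the branch names and timestamps are invented for illustration.

```shell
#!/bin/bash
# Hypothetical stand-in for `git show ... $branch | head -n 1`:
# returns a made-up tip timestamp for each sample branch.
tip_stamp() {
  case "$1" in
    origin/master)  echo "2021-01-23_08-11" ;;
    origin/develop) echo "2021-10-21_17-05" ;;
    origin/hotfix)  echo "2021-01-02_21-56" ;;
  esac
}

# Same shape as the real script: one stamp per branch, newest first, keep one.
latest=$(
  for branch in origin/master origin/develop origin/hotfix; do
    tip_stamp "$branch"
  done | sort -r | head -n 1
)

echo "$latest"
```

With this sample data the script prints 2021-10-21_17-05, the stamp of the made-up develop branch.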

Making an actual backup file for a repo

The next step in our backup process (for one single repo) is to use the above script to get a timestamp for the repo and check to see if a backup file using that timestamp in its name already exists. If not, we tar and gzip the whole repo into a single tar.gz file using the new timestamp in its name. This will be our new backup file for this repo, reflecting the most recent changes.

We want to make the above process agnostic of which directory name we are dealing with, so that we can re-use it in a big outer for-loop over all repositories under consideration. And of course, we need to know where to look for already existing backup files (and where to save them if necessary).

All in all, the second script looks like this:

filename: git-backup-one

#! /bin/bash

# cron provides only a minimal environment, so set PATH explicitly.
PATH='/bin:/usr/bin'
PROJECT=${PWD##*/}
echo "PROJECT = $PROJECT"

git fetch --all
git pull

LASTCOMMITSTAMP=`/home/mbp/bin/git-backup-lastcommit`
LASTBACKUP="../../storage/$PROJECT-$LASTCOMMITSTAMP.tar.gz"

echo "LASTCOMMITSTAMP = $LASTCOMMITSTAMP"
echo "LASTBACKUP = $LASTBACKUP"

# Create a backup only if none exists yet for this commit timestamp.
if [ ! -f "$LASTBACKUP" ]; then
    echo 'making new backup'
    tar -czhf "$LASTBACKUP" "../$PROJECT/"
fi

Once again, a blow-by-blow explanation:
* #! /bin/bash see above.
* PATH='/bin:/usr/bin' peculiarities of running a cron job: cron provides only a minimal environment, so we define what we need ourselves for the duration of the cron job. Here we tell bash where it may look for executables such as git and tar.
* PROJECT=${PWD##*/} this is some bash magic that stores the name of the current working directory in the variable named PROJECT. $PWD holds the full path of the working directory, and the ##*/ part removes the longest leading match of the pattern */, that is, everything up to and including the last slash. Which is to say, $PROJECT now holds the name of the project directory without the path in front of it. The project name, as it were.
* echo "PROJECT = $PROJECT" report the project we currently are processing. When the script runs under cron, this echo does not show up anywhere but when you run the script from the command line it does and is handy for debugging. More echoes follow for the same reason.
* git fetch --all update the status for this project from remote. The --all tells git to fetch from all configured remotes, not just origin. Either way, branches that are not yet known locally are fetched as well.
* git pull merge the remote changes into the local branch that is currently checked out. This line is not essential for a correct working of the backup but it is convenient for debugging purposes when you inspect the working project dir for expected changes.
* LASTCOMMITSTAMP=`/home/mbp/bin/git-backup-lastcommit` execute the git-backup-lastcommit script and store the output in the variable named LASTCOMMITSTAMP. Again, cron must be told in explicit detail where to find this script. The backticks surrounding the name of this script cause its output to be captured as text and stored in the variable.
* LASTBACKUP="../../storage/$PROJECT-$LASTCOMMITSTAMP.tar.gz" here we define the filename and relative path for the zipfile that should hold the most recent backup for this project. This file may or may not exist. Ingredients for the filename are a timestamp $LASTCOMMITSTAMP and a project name $PROJECT. The relative path is a result of how I chose to set up the directory for my git backup projects:

/data/git-backups/
                 |
                 work/
                 |   |
                 |   green/
                 |   |    | .git/
                 |   |    | README.md
                 |   |    | src/
                 |   |    | ...
                 |   |
                 |   splitify/
                 |   |       | .git/
                 |   |       | README.md
                 |   |       | src/
                 |   |       | ...
                 |   ...
                 storage/
                        | green-2021-01-02_21-56.tar.gz
                        | green-2021-01-23_08-11.tar.gz
                        | splitify-2019-07-30_19-34.tar.gz
                        | splitify-2020-08-05_11-05.tar.gz
                        | ...

* if [ ! -f "$LASTBACKUP" ]; then ... fi tests whether a file with this name exists and then inverts that boolean. If the inverted condition is true (so when the file does not exist), do what is enclosed. fi is the closing bracket for the opening then. Note the semicolon before then, which bash requires when then is on the same line as the test.
* echo 'making new backup' a bit of reporting, useful when debugging.
* tar -czhf "$LASTBACKUP" "../$PROJECT/" here all files in ../$PROJECT/ are tarred and gzipped into a single file named $LASTBACKUP. The h option makes tar follow symbolic links and archive the files they point to. This is the new backup file for the most recent commit known in the remote repository.
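
The name derivation and the existence check can be tried in a throwaway copy of the directory layout. The sketch below is an illustration only: the project name green and the timestamp are made up, and the timestamp is hard-coded where the real script would call git-backup-lastcommit.

```shell
#!/bin/bash
# Demo in a throwaway tree that mirrors the work/ and storage/ layout.
BASE=$(mktemp -d)
mkdir -p "$BASE/work/green" "$BASE/storage"
echo "hello" > "$BASE/work/green/README.md"

cd "$BASE/work/green" || exit 1

PROJECT=${PWD##*/}                  # strips everything up to the last slash
LASTCOMMITSTAMP="2021-01-02_21-56"  # made-up stamp; normally from git-backup-lastcommit
LASTBACKUP="../../storage/$PROJECT-$LASTCOMMITSTAMP.tar.gz"

# Run the check twice: only the first pass should create the archive.
for pass in 1 2; do
  if [ ! -f "$LASTBACKUP" ]; then
    echo "pass $pass: making new backup"
    # 2>/dev/null silences tar's notice about the leading ../ in member names.
    tar -czhf "$LASTBACKUP" "../$PROJECT/" 2>/dev/null
  else
    echo "pass $pass: backup already exists, nothing to do"
  fi
done
```

The second pass finds the file created by the first and skips the tar step, which is the behaviour that keeps repeated cron runs from producing duplicate backups.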

The outer loop: scanning all projects

As noted above, I have a general root directory where I keep all projects of interest checked out. There is one final script that scans all these projects and then invokes the other scripts project by project. This outer script is the one that is invoked directly by cron.

filename: git-backup-all

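#!/bin/bash

WORKDIR="/data/git-backups/work"
BACKUP_ONE="/home/mbp/bin/git-backup-one"

# Stop right away if the work directory is missing.
cd "$WORKDIR" || exit 1

for project in ./**
do
    echo "Processing $project"
    # Skip stray files and anything else we cannot cd into.
    cd "$project" || continue
    "$BACKUP_ONE"
    cd ..
done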

The blow-by-blow:
* #! /bin/bash see above.
* WORKDIR="/data/git-backups/work" this is the directory where I checkout all projects that I want to backup in this way. Note that I do not actually work on the projects here, I use a different project directory for that.
* BACKUP_ONE="path/git-backup-one" yep, cron must be told explicitly where to find this script.
* cd $WORKDIR go look in the directory that holds the git project work directories.
* for project in ./** a for loop in bash that scans all files and subdirectories in the current directory and invokes the enclosed script lines for each of them. Unless the globstar shell option is enabled, ** behaves just like a single *, so only the top level of $WORKDIR is scanned. Inside the loop, the loop variable is referenced as $project. Note that if you have a stray file or some directory that is not a git repository in your $WORKDIR this will result in errors.
* do ... done opening and closing braces in bash for loop syntax.
* echo "Processing $project" once again, cron does not show this output but the line is printed when you run the script in a terminal window.
* cd $project dive into the git repository / project that we will process in this iteration of the loop.
* $BACKUP_ONE invoke this script for this project directory. The script requires no input parameters as it uses PROJECT=${PWD##*/} to find out the name of the current project directory.
* cd .. move one level up out of the project directory to get to the $WORKDIR again.
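
As a variation, not the script as published, the loop can be restricted to directories by ending the glob with a slash, and the cd can be confined to a subshell so that no cd .. is needed. The sketch below tries this in a throwaway directory with two made-up project directories and one stray file; the echo stands in for the call to $BACKUP_ONE.

```shell
#!/bin/bash
# Throwaway work directory with two "projects" and one stray file.
WORKDIR=$(mktemp -d)
mkdir "$WORKDIR/green" "$WORKDIR/splitify"
touch "$WORKDIR/notes.txt"

cd "$WORKDIR" || exit 1

processed=""
# The trailing slash makes the glob match directories only.
for project in ./*/; do
  # The subshell keeps the cd local, so no `cd ..` is needed afterwards.
  name=$(cd "$project" && echo "${PWD##*/}")
  echo "Processing $name"
  processed="$processed $name"
done
```

The stray notes.txt is never visited, so it cannot trigger the errors mentioned above.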

cron

All of the above is nice, but how do we make the overall outer script run automatically? This is where a bit of cron savvy kicks in. As your own user (not root) run crontab -e. This drops you into an edit session with your favourite editor attacking your user's personal cron table. Depending a bit on your Linux distro (I am running Ubuntu), you should add this line to the edited file:

@reboot            sleep 120 &&         /home/mbp/bin/git-backup-all

What does this do?
* @reboot run the following every time the system restarts.
* sleep 120 wait 120 seconds. Why? Mmm. As your system starts up, lots of things are going on simultaneously as mandated by the default scripting of your Linux distro. Give them a chance to settle down before adding your own little bit of chaos to the mayhem.
* && wait for one command to finish successfully before commencing on the next.
* <path>/git-backup-all after waiting, run this script. On my work machine, the script is located in /home/mbp/bin/. Where you decide to place your version of the script is up to you, but keep in mind that cron does not know about your environment, so you need to provide it with a full path to the executable or script.

Restore

So here we are. Our old machine crashed and/or the origin upstream git repository has become inaccessible. What to do with our backup zip files?

Make sure you have a copy of the copy, and preferably two. If we are going to fiddle with backups, we don't want one wrong action to destroy the only backup copy we have.
Unzip your backup in a clean directory:

  tar xzvf projectname-timestamp.tar.gz

You now have a complete history of all commits to the code repository, but you will need to apply some git magic in order to make them visible locally. Basically you want to sync remote branches with your local. Note that remote branch in this case means actually a copy of the remote branch that was downloaded and included in your zipfile at the time that the backup script ran. So you don't need actual access to the remote repository. Instead, you are accessing in the git database a local copy of the remote repository that was downloaded (git fetch --all) when the zipfile was created. Assuming you want to restore branch master to the state it has in the backup for the remote:
  git checkout master        # make sure your local repo is on master branch
  git merge origin/master    # merge changes from backup remote master into current branch
Your local master branch is now up to date with the latest commit to this branch in the remote repository (or to be more exact, the latest commit in the backup that you unzipped).
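
Before running any git commands on a restored copy, it can help to look inside the archive and to unpack it somewhere harmless. The following sketch builds a small stand-in archive (made-up project name and timestamp) so that it is self-contained, then lists and extracts it; with a real backup you would skip the first part and point ARCHIVE at your own file.

```shell
#!/bin/bash
# Build a small stand-in archive so the demo is self-contained.
SRC=$(mktemp -d)
mkdir -p "$SRC/green/src"
echo "readme" > "$SRC/green/README.md"
ARCHIVE="$SRC/green-2021-01-02_21-56.tar.gz"   # made-up name in the backup format
tar -czf "$ARCHIVE" -C "$SRC" green

# 1. List the contents without extracting anything.
tar -tzf "$ARCHIVE"

# 2. Extract into a fresh, empty directory so nothing gets overwritten.
RESTORE=$(mktemp -d)
tar -xzf "$ARCHIVE" -C "$RESTORE"
ls "$RESTORE/green"
```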

Troubleshooting

A script invoked by cron does not work the same way as a script invoked from the command line. When cron runs on your user's behalf (it executes the commands in your personal crontab) it has the same file permissions as your user account, so reading and writing files should normally not be a problem. However, a shell started by cron does not have the same environment you have when typing commands in a terminal. This means that when your script relies on certain other scripts or executables, these either may not be found (the $PATH in the cron environment is minimal) or they will produce unexpected output because some environment variable is not set so they revert to default behaviour.
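
One way to get a feel for this from a terminal is to run a command with a nearly empty environment. The helper below, run_like_cron, is a hypothetical name introduced for this sketch, and the PATH value it sets is an assumption about what cron provides; check the defaults of your own cron before relying on it.

```shell
#!/bin/bash
# Run a command with an almost empty environment, roughly like cron would.
# The PATH value below is an assumption; check your own cron's defaults.
run_like_cron() {
  env -i HOME="$HOME" SHELL=/bin/sh PATH=/usr/bin:/bin /bin/sh -c "$1"
}

# Variables from your interactive shell are not visible inside.
export MY_SETTING="only in my terminal"
run_like_cron 'echo "PATH=$PATH"; echo "MY_SETTING=${MY_SETTING:-not set}"'
```

Running the backup script the same way, for example run_like_cron /home/mbp/bin/git-backup-all, may surface missing executables or variables before cron does.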
The lowest level script is git-backup-lastcommit. You can run this manually from the command line by cd'ing into any of the checked out project directories. It only inspects and does not change things. When run, this script should output a string representing a timestamp. The value of this timestamp should reflect when the last commit message across all branches was added to the remote branches as they were when the most recent git fetch --all was run for this repo. If you want to know the actual timestamp for the last commit in the remote repo, you'll have to manually synchronize the local git database using git fetch --all and rerun the script. Of course, this does make changes to your local repo.
The next level script is git-backup-one. This script relies on correct directory structures being in place and readable / writeable by the process that was spawned by cron. Normally this should not be a problem but I have seen issues caused by umask having a different value running scripts in cron than in my regular account.
Another potential problem source is that we drive git show to format dates in a certain way. We do this so that the string resulting from this formatting honors the expectation that the Linux sort command will sort such strings implicitly in chronological order, while in fact sorting them in alphabetical order. This is a plausible expectation, but not an ironclad guarantee.
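
If you want to reduce the room for surprises, one option is to check the shape of the timestamp and to pin the sort order to plain byte comparison with LC_ALL=C. The sketch below is a suggestion, not part of the published scripts; is_valid_stamp and newest_stamp are hypothetical helper names and the timestamps are sample data.

```shell
#!/bin/bash
# Returns success only for stamps shaped like 2021-01-02_21-56.
is_valid_stamp() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Pick the newest stamp; LC_ALL=C forces plain byte-order comparison.
newest_stamp() {
  LC_ALL=C sort -r | head -n 1
}

stamps='2021-01-23_08-11
2021-10-21_17-05
2021-01-02_21-56'

latest=$(printf '%s\n' "$stamps" | newest_stamp)
if is_valid_stamp "$latest"; then
  echo "latest: $latest"
else
  echo "unexpected timestamp format: $latest" >&2
fi
```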