Here's a proof of concept showing how git projects can be hosted on a CDN.
Specifically, a cloneable repo and HTML code browser are hosted on
Rackspace Cloud Files (other CDNs like S3/CloudFront would work too).
Project collaboration now takes place over email, which works for the
Linux kernel and should work for most small projects also.
Before going further, here's the end
result, a browseable and cloneable repository of a small curses
game I wrote recently.
Many, many, many git projects are hosted outside of walled
gardens like Github, for all sorts of reasons.
CDNs are among my favorite utility technologies. This blog is
hosted on CloudFiles, and the prospect of hosting code projects is
interesting for all the same reasons. A CDN is a low-level workhorse
and unlikely to have outages or security issues, compared to "smarter"
code hosting solutions. I don't have to maintain a server, and the
monthly charges are a few quarters and dimes.
While using a CDN doesn't technically count as self-hosting, it is
by no means a walled garden either. The CDNs out there are close to
being interchangeable, and exporting is easier than importing.
Anyway, the solution here could be self-hosted if you want.
Code browser
Git ships with a CGI script for browsing repos over HTTP, but we
want a tool that generates the static HTML. A quick web search turned
up the competent git2html. While the
tool doesn't feel quite finished (the CSS is still a TODO), it's very functional.
Let's run the tool on a small repo, and look at the output locally.
erik@msi ~/tmp $ git2html.sh
Usage /home/erik/src/git2html/git2html.sh [-prlbq] TARGET
Generate static HTML pages in TARGET for the specified git repository.
  -p  Project's name
  -r  Repository to clone from.
  -l  Public repository link, e.g., 'http://host.org/project.git'
  -b  List of branches to process (default: all).
  -q  Be quiet.
  -f  Force rebuilding of all pages.
erik@msi ~/tmp $ git2html.sh -p mountain -l http://foo.example.com/mountain.git -r /home/erik/src/mountain mountain
Rebuilding all pages as output template changed.
Cloning into '/home/erik/tmp/mountain/repository'...
done.
Note: checking out 'origin/master'.
You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.
If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:
git checkout -b new_branch_name
HEAD is now at 4ca5f13... Explicitly set foreground/background for terminals that need it
Deleted branch master (was 4ca5f13).
From /home/erik/src/mountain
* [new branch] master -> refs/origin/master
warning: refname 'origin/master' is ambiguous.
HEAD is now at 4ca5f13... Explicitly set foreground/background for terminals that need it
warning: refname 'origin/master' is ambiguous.
Branch master (1/1): processing (2 commits).
Commit 4ca5f1339aaf6b69cffe06c77f0e5259aa3897f1 (1/2): processing.
Commit 4b83f06f9041e9445f0d15930ea42ad9fbc96f7a (2/2): processing.
erik@msi ~/tmp $ git2html.sh -p mountain -l http://foo.example.com/mountain.git -r /home/erik/src/mountain mountain
warning: refname 'origin/master' is ambiguous.
HEAD is now at 4ca5f13... Explicitly set foreground/background for terminals that need it
warning: refname 'origin/master' is ambiguous.
Branch master (1/1): processing (2 commits).
Commit 4ca5f1339aaf6b69cffe06c77f0e5259aa3897f1 (1/2): processing.
Commit 4ca5f1339aaf6b69cffe06c77f0e5259aa3897f1 (1/2): already processed.
Commit 4b83f06f9041e9445f0d15930ea42ad9fbc96f7a (2/2): processing.
Commit 4b83f06f9041e9445f0d15930ea42ad9fbc96f7a (2/2): already processed.
erik@msi ~/tmp $ find mountain -type f | wc -l
49
You can see that the script starts with a checkout into a "repository"
directory. We'll push this along with the generated markup, and the
world can clone from it (with some extra steps).
The run also produces some git warnings about the ambiguous refname,
which I frankly haven't bothered to debug since the output seems to be
correct anyway.
Running the tool a second time results in some "already processed"
lines, showing that the output is updateable incrementally: the tool
skips regenerating markup when it can.
Finally, notice that 49 files are generated, which seems a little
excessive for a repo with two commits and two files in its tree.
After testing with some larger repos, it looks like eight files are
generated per commit. This is the cost of pregenerating every diff,
every tree, etc.
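That ratio makes the output easy to estimate up front. Here's a sketch, assuming the roughly-eight-files-per-commit ratio observed above holds for other repos (estimate_pages is a hypothetical helper of my own naming):

```shell
#!/bin/sh
# Rough sizing sketch for git2html output, assuming the ~8 generated
# files per commit observed above. estimate_pages is a made-up helper.
estimate_pages() {
    # $1: path to a git checkout
    commits=$(git -C "$1" rev-list --count HEAD)
    echo $((commits * 8))
}
```

For the two-commit mountain repo this predicts 16 files; presumably the rest of the 49 are per-file pages and indexes.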
Syncing to CloudFiles
The file count made me look again at sync tools. For the blog
engine I wrote some scripts using the pyrax Python lib, which
is Rackspace's sanctioned lib for their cloud API (which differs in
subtle ways from a stock OpenStack implementation). But this approach
is slow and unidirectional. Searching for other sync tools, I was
disappointed at the slim pickings, and was preparing to write
something in erlang.
I searched one last time and hit the jackpot with cloudfuse, a FUSE
(filesystem in userspace) implementation for Linux. It built without
warnings and worked the first time I tried it. On top of that, it's
much faster than I was expecting. I've put it into my fstab for easy
re-mounting on demand (no need to invoke the cloudfuse binary directly):
erik@msi ~/tmp $ cat /etc/fstab
...
cloudfuse /home/erik/var/cf fuse username=foouser,api_key=fookey,region=ORD,user,noauto 0 0
...
Cloudfuse isn't a perfect solution. For example, I've given up
fine-grained control of the MIME types that I enjoy with pyrax. It's
good at quickly and easily syncing a lot of files, though, so I can
overlook this minor problem.
A nice property of cloudfuse is that it's bidirectional. If I want
to migrate off of CloudFiles, retrieving the whole thing is as trivial
as a tar command. I'm also quite sure that FUSE modules are available
for other CDNs, making the data easily portable.
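Concretely, a full export through the mount is just tar over the mountpoint. A sketch (export_container is a name of my choosing; the mountpoint is the one from my fstab):

```shell
#!/bin/sh
# Sketch: exporting the whole container through the cloudfuse mount.
# export_container is a hypothetical helper; pass the mountpoint and
# the archive file to create.
export_container() {
    tar czf "$2" -C "$1" .
}
# e.g.: export_container ~/var/cf cloudfiles-backup.tar.gz
```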
Fixing minor problems
Trying to run git2html.sh directly in a cloudfuse mount results in
an "Operation not implemented" error. It turns out that git2html.sh
very reasonably symlinks HEAD to the parent commit in the HTML output,
but a CDN has no notion of a symlink. The solution was to generate into
a staging directory, and then copy to the cloudfuse mount with
cp --recursive --dereference, which causes symlinks to be
followed through to their backing files.
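A toy demonstration of what --dereference buys us, with made-up paths mimicking git2html's symlinked output:

```shell
#!/bin/sh
# Demonstrate cp --recursive --dereference turning a symlink into a
# regular file. The HEAD.html name mimics git2html's symlinked output.
stage=$(mktemp -d)
cdn=$(mktemp -d)
echo "commit page" > "$stage/4ca5f13.html"
ln -s 4ca5f13.html "$stage/HEAD.html"   # a symlink, which a CDN can't store
cp --recursive --dereference "$stage/." "$cdn/"
# HEAD.html in the copy is now a regular file with the same contents
```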
A second fix is required in the staging directory. While
git2html.sh has cloned the source repository, we ultimately want folks
to clone from the CDN, which git calls a "dumb HTTP" transport. We
must run git update-server-info, which makes git generate the packed
metadata that "smart" transports would otherwise generate on the fly.
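Concretely, update-server-info writes static files like info/refs, which a dumb HTTP client fetches first to discover branches. A throwaway-repo sketch:

```shell
#!/bin/sh
# Sketch: the static metadata a dumb HTTP transport needs. A smart
# transport enumerates refs on the fly; here git writes them to a file.
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" -c user.email=e@example.com -c user.name=e \
    commit -q --allow-empty -m "initial"
git -C "$repo" update-server-info
# A dumb client fetches .git/info/refs to discover branches
cat "$repo/.git/info/refs"
```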
Third, folks are used to cloning from bare repositories, whose
clone path ends in "project.git". However, git2html.sh clones
non-bare into "proj/repository/.git". Thus the correct value for
the -l option looks like
"http://blog.mackdanz.net/code/mountain/repository/.git", which leaves
the user cloning into a directory called "repository", regardless of
the project name. Fortunately, git2html.sh isn't picky about the
value of its -l argument, so we can sneak in a second word: appending
the project name means the cloning user ends up in a directory named
for the project. In other words, we'll invoke this:
git2html.sh -p mountain -l 'http://blog.mackdanz.net/code/mountain/repository/.git mountain' -r /home/erik/src/mountain mountain
so that the clone instructions render like this:
Clone this repository using:
git clone http://blog.mackdanz.net/code/mountain/repository/.git mountain
Automation
I like to keep my home directory looking Unix-y: code goes in
~/src, data in ~/var, scripts in ~/bin, config in ~/etc, etc.
Having all my code in a flat tree under ~/src makes it easy to
create a script (~/bin/synccodetocdn) that can publish any code
project to CDN:
#!/bin/bash
set -e
set -x

mountpoint=~/var/cf
if ! findmnt "$mountpoint" >/dev/null; then
    echo "No cloudfuse mount. Run:"
    echo "  mount $mountpoint"
    exit 1
fi

projroot=$mountpoint/blog/code
stageroot=~/tmp/stageroot
mkdir -p "$projroot" "$stageroot"

syncrepo() {
    local repo=$1
    # Two words here: the clone URL plus the directory the user clones into.
    local pubcloneurl="http://blog.mackdanz.net/code/${repo}/repository/.git ${repo}"

    # Generate the HTML (and the checkout) into the staging directory.
    pushd "$stageroot"
    git2html.sh -p "$repo" -l "$pubcloneurl" -r ~/src/"$repo" "$repo"

    # Generate the metadata needed for dumb HTTP clones.
    pushd "$repo/repository/.git"
    git update-server-info
    popd
    popd

    # Copy to the CDN, following symlinks through to their backing files.
    pushd "$projroot"
    cp --recursive --dereference "$stageroot/$repo" .
    popd
}

repos="mountain"
for repo in $repos; do
    syncrepo "$repo"
done
One inefficiency here is that cp will write every
file, even unchanged ones. Since CloudFiles supports ETags, it should
be possible to push only new and changed files. Maybe cloudfuse's "stat"
call could include an ETag check, or it could report the ETag
(sometimes implemented as an encoded date) as the file's ctime. Maybe
this is somehow handled already and I just need to change my cp to
rsync (which can also delete files from the target, a necessary
behavior cp can't provide).
Changing workflow
I've collaborated over email before, specifically when sending a patch
to password-store, although I haven't done it regularly for a project
of my own. Git supports it well, and documentation is plentiful.
Note especially that git has a "request-pull" command, which
predates Github and works over the internet. Github's pull requests
are internal only, designed to be a lovely wall for their garden.
Summary
This project demonstrates that it's possible to host major parts of
an open source code project on a static CDN, deferring to email for
the collaborative bits.
I'm not sure what I'll do with this next. I could say "proven" and
be done with it. At the other extreme, I could move all my Github
code to CloudFiles, and do the same with some home projects. That
amount of effort feels premature for something I'm calling a proof. I
have some interest now in maintenance of git2html.sh and cloudfuse,
both of which could use a few patches from me and also a friendly
ebuild. Stay tuned, I guess.