Distributed Version Control in JavaScript

or any other language that takes your fancy

Tony Garnock-Jones tonyg@lshift.net

LShift Ltd.

Morals, Right Up Front

Version Control, even distributed, isn't hard if you factor it right
- diff tools: ~360 LOC
- graph utilities: ~40 LOC
- DVCS code: ~420 LOC
- time spent: three Saturday afternoons
Version Control can seem jolly difficult if you tangle all the pieces together (CVS, SVN, ...)
Merging is the most important bit for correctness; storage design is important for usability
Little Unix-style composable tools are the way to go (but user interface is also very important)

Modern DVCSs factor apart networking, synchronisation, history, storage, and merging; older VCSs tangled them all together, which led to much implementation and conceptual complexity.

Basic Algorithms

Diff and patch
- can be used for resolution (= the process of deciding what the user did to a file)
- can be used to compress multiple stored versions
- "two-way merge"
Diff3, a 3-way merge algorithm
- used for history-sensitive textual merge
- other merge algorithms exist; some are better!
Least Common Ancestor
- used to find the most appropriate ancestral revision to use in a merge

Diff

Longest Common Subsequence
```
hello, world
head   co ld
    
```

Diff produces a delta (a.k.a. "patch", "hunk", "chunk", "diff")

var delta = Diff.diff_patch("hello, world", "head cold");

/* [{file1: {offset:2, length:4, chunk:["l", "l", "o", ","]},
     file2: {offset:2, length:2, chunk:["a", "d"]}},
    {file1: {offset:7, length:1, chunk:["w"]},
     file2: {offset:5, length:1, chunk:["c"]}},
    {file1: {offset:9, length:1, chunk:["r"]},
     file2: {offset:7, length:0, chunk:[]}}] */

diff uses a Longest Common Subsequence algorithm to find a short description of the differences between two files. The notion of minimum edit distance is a related idea.
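To make the LCS idea concrete, here is a minimal sketch of the textbook dynamic-programming algorithm. The function name `longestCommonSubsequence` is hypothetical and is not part of the Diff library shown in these slides, which may well use a different (and faster) algorithm.

```javascript
// Hypothetical helper, not the Diff library's own code.
// Works on strings or arrays, comparing elements with ===.
function longestCommonSubsequence(a, b) {
  var m = a.length, n = b.length;
  var i, j;

  // table[i][j] = length of the LCS of a[i..] and b[j..]
  var table = [];
  for (i = 0; i <= m; i++) {
    table.push(new Array(n + 1).fill(0));
  }
  for (i = m - 1; i >= 0; i--) {
    for (j = n - 1; j >= 0; j--) {
      table[i][j] = (a[i] === b[j])
        ? table[i + 1][j + 1] + 1
        : Math.max(table[i + 1][j], table[i][j + 1]);
    }
  }

  // Walk the table from the front to recover one LCS.
  var result = [];
  i = 0;
  j = 0;
  while (i < m && j < n) {
    if (a[i] === b[j]) {
      result.push(a[i]);
      i++;
      j++;
    } else if (table[i + 1][j] >= table[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return result;
}
```

For the example above, `longestCommonSubsequence("hello, world", "head cold").join("")` gives `"he old"`. Everything outside that common subsequence is what a diff reports as changed. This version uses O(mn) time and space, which is fine for a sketch but not for large files.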

It so happens that the output of diff often makes sense to a human trying to figure out how a file has been changed. How lucky!

Note that Diff.diff_patch can operate equally well on lists of strings as on lists of characters (strings). It doesn't work very well when given single strings, as in the example above, but it does work.

Sometimes called two-way merge: every difference is a conflict

Bram Cohen has invented a diff algorithm that works well for programming-language (or other line-oriented) text. It uses uniquely occurring lines to anchor the LCS.
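The anchoring step can be sketched as follows, assuming the usual description of this approach: collect the lines that occur exactly once in each file, then keep the longest run of them that appears in the same order in both. The function name `uniqueLineAnchors` is hypothetical, and the quadratic longest-increasing-subsequence search is chosen for clarity rather than speed.

```javascript
// Hypothetical sketch of anchor selection; not code from the Diff library.
function uniqueLineAnchors(linesA, linesB) {
  // Map each line to how often it occurs and where it first occurs.
  function countLines(lines) {
    var counts = new Map();
    lines.forEach(function (line, index) {
      var entry = counts.get(line) || { count: 0, index: index };
      entry.count++;
      counts.set(line, entry);
    });
    return counts;
  }
  var countsA = countLines(linesA);
  var countsB = countLines(linesB);

  // Candidate pairs: lines unique in both files, ordered by position in A.
  var pairs = [];
  linesA.forEach(function (line, indexA) {
    var inB = countsB.get(line);
    if (countsA.get(line).count === 1 && inB && inB.count === 1) {
      pairs.push({ a: indexA, b: inB.index });
    }
  });

  // Longest increasing subsequence on the B positions.
  var best = pairs.map(function () { return 1; });
  var prev = pairs.map(function () { return -1; });
  var end = -1;
  for (var i = 0; i < pairs.length; i++) {
    for (var k = 0; k < i; k++) {
      if (pairs[k].b < pairs[i].b && best[k] + 1 > best[i]) {
        best[i] = best[k] + 1;
        prev[i] = k;
      }
    }
    if (end === -1 || best[i] > best[end]) {
      end = i;
    }
  }

  // Follow the back-pointers to list the anchors in file order.
  var anchors = [];
  for (var at = end; at !== -1; at = prev[at]) {
    anchors.unshift(pairs[at]);
  }
  return anchors;
}
```

Each anchor pairs an index in the first file with an index in the second. A full diff would then recurse on the stretches between consecutive anchors, falling back to an ordinary LCS where no unique lines remain.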

Patch

Patch applies a delta: A + delta = B


js> uneval(Diff.patch("hello, world", delta).join(""))
"head cold"

Inverting a delta is possible: B - delta = A


js> Diff.invert_patch(delta);
js> uneval(Diff.patch("head cold", delta).join(""))
"hello, world"

In some revision-control systems, e.g. darcs, inverting a patch is a central operation. Darcs in particular has a full (and very useful!) "theory of patches", where patch inversion, commutation and merging are developed formally.
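Given the delta format printed on the Diff slide (a list of hunks, each with a `file1` and a `file2` side), applying and inverting a delta can be sketched in a few lines. The names `applyDelta` and `invertDelta` are hypothetical stand-ins, not the library's Diff.patch and Diff.invert_patch. In particular, this sketch returns a new inverted delta, whereas the session above suggests Diff.invert_patch modifies its argument in place.

```javascript
// Hypothetical re-implementations over the hunk format shown earlier:
//   { file1: {offset, length, chunk}, file2: {offset, length, chunk} }
// Hunks are assumed to be sorted by file1.offset and non-overlapping.
function applyDelta(file, delta) {
  var result = [];
  var position = 0; // next unread index in `file`
  delta.forEach(function (hunk) {
    // Copy the unchanged run that precedes this hunk.
    for (; position < hunk.file1.offset; position++) {
      result.push(file[position]);
    }
    // Emit the replacement and skip what it replaces.
    hunk.file2.chunk.forEach(function (item) {
      result.push(item);
    });
    position += hunk.file1.length;
  });
  // Copy whatever follows the last hunk.
  for (; position < file.length; position++) {
    result.push(file[position]);
  }
  return result;
}

// Swapping the two sides of every hunk turns "A to B" into "B to A".
function invertDelta(delta) {
  return delta.map(function (hunk) {
    return { file1: hunk.file2, file2: hunk.file1 };
  });
}
```

With the delta from the Diff slide, `applyDelta("hello, world", delta).join("")` yields `"head cold"`, and `applyDelta("head cold", invertDelta(delta)).join("")` yields `"hello, world"` again.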

Variations

Diff.diff_comm - works like a simple Unix comm(1)
Diff.diff_patch - works like a simple Unix diff(1)
Diff.invert_patch - inverts a patch produced by diff_patch
Diff.patch - works like a (very) simple Unix patch(1)
Diff.diff_indices - like diff_patch, but only gives offset and length information

Diff Demo

Diff3 and Three-Way Merging

History-sensitive merge
Some changes are conflicts; others are not
Automatically smart about which is which, in a fairly understandable and predictable way
Sometimes blows up badly - for details, see
- Khanna, Kunal and Pierce, "A Formal Investigation of Diff3"
- The Revision Control wiki, revctrl.org
Better merge algorithms than diff3 exist (revctrl.org has the details)

Diff3 and Three-Way Merging

"this"	"base"	"other"	Result	Notes
A	A	A	A	no changes
A	A	B	B	"other" wins
B	A	A	B	"this" wins
B	A	B	B or conflict	accidental clean merge
B	A	C	conflict	"true" conflict
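For a single region, the table can be sketched as a small decision function. The name `mergeRegion` is hypothetical; the `ok` and `conflict` result shapes are borrowed from the Diff.diff3_merge output format purely for illustration. In the "B A B" row this sketch chooses the clean merge rather than reporting a conflict.

```javascript
// Hypothetical sketch of the decision table for one region.
// Values are compared with ===, so pass strings or other primitives.
function mergeRegion(thisVersion, base, other) {
  if (thisVersion === other) {
    // Either nobody changed anything, or both sides made the same change
    // (the "accidental clean merge" row).
    return { ok: thisVersion };
  }
  if (thisVersion === base) {
    return { ok: other }; // only "other" changed: "other" wins
  }
  if (other === base) {
    return { ok: thisVersion }; // only "this" changed: "this" wins
  }
  // Both sides changed, in different ways: a "true" conflict.
  return { conflict: { a: thisVersion, o: base, b: other } };
}
```

A real diff3 first has to split the three files into aligned regions, using two diffs against the base, before a rule like this can be applied to each region. That alignment step is where most of the subtlety lies.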

Diff3 and Three-Way Merging

var base =
  "the quick brown fox jumped over a dog".split(/\s+/);

var derived1 =
  "the quick fox jumps over some lazy dog".split(/\s+/);
var derived2 =
  "the quick brown fox jumps over some record dog".split(/\s+/);

var mergeResult = Diff.diff3_merge(derived1, base, derived2, true);

/* [{ok:["the", "quick", "fox", "jumps", "over"]},
    {conflict:{a:["some", "lazy"],   aIndex:5,
               o:["a"],              oIndex:6,
               b:["some", "record"], bIndex:6}},
    {ok:["dog"]}] */

Diff3 Demo

Least Common Ancestor

LCA is defined for trees, and efficient algorithms for it are known. It has also been defined for DAGs, which is the case we have in a DVCS, but in a DAG the LCA need not be unique, which causes problems when choosing an ancestor for a merge.
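One simple way to pick a common ancestor in a DAG is sketched below: collect every ancestor of the first revision, then search breadth-first from the second revision until one of them turns up. The function names and the `parents` map (revision ID to list of parent IDs) are assumptions made for this illustration, and this is not necessarily the algorithm used in the demo code.

```javascript
// Hypothetical sketch. `parents` maps a revision ID to an array of parent IDs.
function ancestorsOf(parents, id) {
  var seen = new Set([id]); // a revision counts as its own ancestor here
  var queue = [id];
  while (queue.length > 0) {
    var current = queue.shift();
    (parents[current] || []).forEach(function (parent) {
      if (!seen.has(parent)) {
        seen.add(parent);
        queue.push(parent);
      }
    });
  }
  return seen;
}

function naiveCommonAncestor(parents, a, b) {
  var ancestorsOfA = ancestorsOf(parents, a);
  // Breadth-first from b: the first revision that a can also reach wins.
  var seen = new Set([b]);
  var queue = [b];
  while (queue.length > 0) {
    var current = queue.shift();
    if (ancestorsOfA.has(current)) {
      return current;
    }
    (parents[current] || []).forEach(function (parent) {
      if (!seen.has(parent)) {
        seen.add(parent);
        queue.push(parent);
      }
    });
  }
  return null; // no shared history at all
}
```

On a tree-shaped history this returns the usual LCA. On a criss-cross history such as `{ x: ["root"], y: ["root"], m1: ["x", "y"], m2: ["y", "x"] }`, both `x` and `y` are equally good candidates for `m1` and `m2`, and which one comes back depends on the argument order and on the order in which parents are listed.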

DVCS Components

History
- ancestry DAG
Merging
- choosing an ancestor
- conflict detection and handling
Synchronisation
- copying history between repositories
- network transfer, file system, ...
Storage
- of file versions
- of changesets/patches

History

History is a DAG of changesets.

Each changeset should record

Its own unique identity (a UUID)
Which specific versions of files are alive, with their paths and metadata
Which files are dead (deleted)
IDs of parent changeset(s)
Timestamp, comment, committer, ...

Many modern DVCSs use some function of the contents of an object to identify the object, e.g. a SHA-1 hash. This has a lot of nice properties, and is a good choice. JavaScript doesn't have particularly good support for binary data, which makes hashing (and canonical binary representations!) awkward, so I chose to use simple random UUIDs for identifiers, instead.
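A changeset record along these lines might look like the sketch below. The field names and both function names are hypothetical, not the data model of the demo code. The identifier is a random version-4 style UUID built from `Math.random()`, which is adequate for a demonstration but is not a cryptographically strong source of randomness.

```javascript
// Hypothetical sketch of a changeset record with a random identifier.
function randomUuid() {
  // Version-4 UUID layout: "4" marks the version, "y" carries the variant bits.
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, function (c) {
    var r = Math.floor(Math.random() * 16);
    var v = (c === "x") ? r : ((r & 0x3) | 0x8);
    return v.toString(16);
  });
}

function makeChangeset(fields) {
  return {
    id: randomUuid(),
    // IDs of the parent changeset(s); empty for the first commit.
    parents: (fields.parents || []).slice(),
    // Live files: path -> { version: <file version ID>, ...metadata }.
    alive: Object.assign({}, fields.alive),
    // Paths of files deleted as of this changeset.
    dead: (fields.dead || []).slice(),
    timestamp: fields.timestamp || new Date().toISOString(),
    committer: fields.committer || null,
    comment: fields.comment || ""
  };
}
```

A merge changeset is simply one whose `parents` list has more than one entry; following `parents` links from any changeset walks the ancestry DAG.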

History

Merging

Resolution step
Merge step
Deciding what makes a good merge algorithm is an actively researched area
DieDieDieMerge

Synchronisation

What do you know that I do not? ("pull")
What do I know that you do not? ("push")
Updating a working-copy is a separate operation (a merge)
Can synchronise using any transport or representation that can
- inform others about the revision IDs it holds
- export revisions and their contents in a standard form
- import revisions submitted to them in standard form
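The three capabilities above are enough to sketch pull and push. Everything here is hypothetical: the `makeRepository` object, its method names, and the use of JSON text as the "standard form". A real transport would sit between the two repositories.

```javascript
// Hypothetical in-memory repository offering the three capabilities.
function makeRepository() {
  var revisions = {}; // revision ID -> revision record
  return {
    // Inform others about the revision IDs held here.
    revisionIds: function () {
      return Object.keys(revisions);
    },
    // Export revisions in a standard form (JSON text, as an assumption).
    exportRevisions: function (ids) {
      return ids.map(function (id) {
        return JSON.stringify(revisions[id]);
      });
    },
    // Import revisions submitted in that standard form.
    importRevisions: function (exported) {
      exported.forEach(function (text) {
        var revision = JSON.parse(text);
        revisions[revision.id] = revision;
      });
    }
  };
}

// "What do you know that I do not?"
function pull(local, remote) {
  var known = new Set(local.revisionIds());
  var missing = remote.revisionIds().filter(function (id) {
    return !known.has(id);
  });
  local.importRevisions(remote.exportRevisions(missing));
  return missing;
}

// "What do I know that you do not?" is the same question asked in reverse.
function push(local, remote) {
  return pull(remote, local);
}
```

Neither function touches a working copy. After a pull the new revisions are merely present in the local history, and bringing the working copy up to date remains a separate merge.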

Storage

Storage is the database that holds all information about the current and past state of the repository, in every branch, for every commit.

Design storage around query patterns: user interface is central
Design for efficient synchronisation with other repositories
Finally, efficient use of disk space can be important, too (less and less as time goes by)

Storage

Narrow API - interface to merging, synchronisation, history is fairly simple
Must be robust; should enjoy ACID properties
Can use compression (e.g. gzip) under the covers
Can store full snapshots, or deltas from newer to older versions, or both, as required
- tradeoff space/speed/convenience
- Mercurial does well here: deltas, with periodic snapshots when deltas grow too large, giving O(1) arbitrary version retrieval
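The deltas-plus-periodic-snapshots idea can be sketched as follows. This is an illustration of the general technique only, not Mercurial's actual storage format. The delta encoding (common prefix and suffix plus a replaced middle), the `DELTA_OVERHEAD` constant, and the rule "take a snapshot once the deltas since the last snapshot would outweigh the full text" are all assumptions made for the example.

```javascript
// Hypothetical sketch of a single-file store using deltas with snapshots.
var DELTA_OVERHEAD = 8; // assumed bookkeeping cost per stored delta

// Describe newText as: keep a prefix and suffix of oldText, replace the middle.
function makeTextDelta(oldText, newText) {
  var max = Math.min(oldText.length, newText.length);
  var prefix = 0;
  while (prefix < max && oldText[prefix] === newText[prefix]) {
    prefix++;
  }
  var suffix = 0;
  while (suffix < max - prefix &&
         oldText[oldText.length - 1 - suffix] === newText[newText.length - 1 - suffix]) {
    suffix++;
  }
  return {
    prefix: prefix,
    suffix: suffix,
    middle: newText.slice(prefix, newText.length - suffix)
  };
}

function applyTextDelta(oldText, delta) {
  return oldText.slice(0, delta.prefix) + delta.middle +
         oldText.slice(oldText.length - delta.suffix);
}

function makeFileStore() {
  var entries = [];  // each entry is { snapshot: text } or { delta: ... }
  var chainSize = 0; // total cost of the deltas since the last snapshot

  function get(version) {
    // Find the nearest snapshot at or before `version`, then replay deltas.
    var start = version;
    while (!("snapshot" in entries[start])) {
      start--;
    }
    var text = entries[start].snapshot;
    for (var i = start + 1; i <= version; i++) {
      text = applyTextDelta(text, entries[i].delta);
    }
    return text;
  }

  function add(text) {
    var delta = null;
    if (entries.length > 0) {
      delta = makeTextDelta(get(entries.length - 1), text);
    }
    var cost = delta === null ? 0 : delta.middle.length + DELTA_OVERHEAD;
    if (delta === null || chainSize + cost > text.length) {
      entries.push({ snapshot: text }); // chain too heavy: start afresh
      chainSize = 0;
    } else {
      entries.push({ delta: delta });
      chainSize += cost;
    }
    return entries.length - 1; // the new version number
  }

  function isSnapshot(version) {
    return "snapshot" in entries[version];
  }

  return { add: add, get: get, isSnapshot: isSnapshot };
}
```

Because a snapshot is forced whenever the delta chain grows heavier than the text itself, retrieving any version replays a bounded amount of delta data rather than the whole history of the file.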

Storage

Example user-interface-led design criteria:

retrieving an old version of a single file should be fast
retrieving snapshots of old versions of the whole repository should be fast
retrieving history for a single file or the whole repository should be fast
avoid dragging in too many unrelated records when performing single-file operations (think Wikipedia-scale system: how to avoid querying the whole repo when viewing a single page?)

DVCS Demo

The End

Any questions?

Ambiguous LCA

Here's a problem case. The LCA of "e" and "i" is either "c" or "h". Both "c" and "h" are two steps away from the root.

Note that the path from "e" to "i" through "c" is three steps long, while the path through "h" is two steps long. This could mean that "h" is a more suitable ancestor for use in 3-way merging.

The algorithm I've implemented is very naive and inefficient. It also answers "c" or "h" depending on the order of arguments you give it.

Criss-cross Merge

Truncating History

History can be truncated. The missing piece may be held by other servers; failing that, merging can fall back to a two-way merge.

Open Source Show 'n Tell, 5 June 2008