Distributed Version Control in JavaScript

or any other language that takes your fancy

Tony Garnock-Jones tonyg@lshift.net

LShift Ltd.

Morals, Right Up Front

Modern DVCSs factor apart networking, synchronisation, history, storage, and merging; older VCSs tangled them all together, which led to much implementation and conceptual complexity.

Basic Algorithms

  • Diff and patch
    • can be used for resolution (= the process of deciding what the user did to a file)
    • can be used to compress multiple stored versions
    • "two-way merge"
  • Diff3, a 3-way merge algorithm
    • used for history-sensitive textual merge
    • other merge algorithms exist; some are better!
  • Least Common Ancestor
    • used to find the most appropriate ancestral revision to use in a merge

Diff

  • Longest Common Subsequence
    hello, world
    head   co ld
        
  • Diff produces a delta (a.k.a. "patch", "hunk", "chunk", "diff")
    var delta = Diff.diff_patch("hello, world", "head cold");
    
    /* [{file1: {offset:2, length:4, chunk:["l", "l", "o", ","]},
         file2: {offset:2, length:2, chunk:["a", "d"]}},
        {file1: {offset:7, length:1, chunk:["w"]},
         file2: {offset:5, length:1, chunk:["c"]}},
        {file1: {offset:9, length:1, chunk:["r"]},
         file2: {offset:7, length:0, chunk:[]}}] */
        

diff uses a Longest Common Subsequence algorithm to find a short description of the differences between two files. The notion of minimum edit distance is a related idea.
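The LCS step can be sketched with a small dynamic-programming table. This is a minimal illustration, not the code behind Diff.diff_patch: the name longestCommonSubsequence is hypothetical, and the quadratic table is chosen for clarity rather than speed.

```javascript
// Hypothetical helper, for illustration only.
// Returns one longest common subsequence of the arrays a and b.
function longestCommonSubsequence(a, b) {
  // table[i][j] = length of the LCS of a.slice(i) and b.slice(j)
  var table = [];
  var i, j;
  for (i = 0; i <= a.length; i++) {
    table.push([]);
    for (j = 0; j <= b.length; j++) {
      table[i].push(0);
    }
  }
  for (i = a.length - 1; i >= 0; i--) {
    for (j = b.length - 1; j >= 0; j--) {
      table[i][j] = (a[i] === b[j])
        ? table[i + 1][j + 1] + 1
        : Math.max(table[i + 1][j], table[i][j + 1]);
    }
  }
  // Walk the table from the start, collecting matched elements.
  var result = [];
  i = 0;
  j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      result.push(a[i]);
      i++;
      j++;
    } else if (table[i + 1][j] >= table[i][j + 1]) {
      i++;
    } else {
      j++;
    }
  }
  return result;
}

var common = longestCommonSubsequence("hello, world".split(""),
                                      "head cold".split(""));
// common.join("") is "he old", the characters aligned on the slide
```

Everything outside the common subsequence is what ends up in the delta.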

It so happens that the output of diff often makes sense to a human trying to figure out how a file has been changed. How lucky!

Note that Diff.diff_patch can operate on lists of strings as well as on lists of characters (that is, plain strings). It doesn't work very well when given a plain string, as in the example above, but it does work.

Diff is sometimes called two-way merge: with no common ancestor to consult, every difference is a conflict.

Bram Cohen has invented a diff algorithm that works well for programming-language (or other line-oriented) text. It uses uniquely occurring lines to anchor the LCS.
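A minimal sketch of the anchoring idea: find the lines that occur exactly once in each file. The helper names countLines and uniqueCommonLines are hypothetical, and this is only the first step. A full implementation would go on to pick an order-preserving subset of these anchors and diff the gaps between them.

```javascript
// Hypothetical helpers, for illustration only.

// Count occurrences of each line, remembering where it was first seen.
function countLines(lines) {
  var counts = Object.create(null);
  lines.forEach(function (line, index) {
    if (!counts[line]) {
      counts[line] = {count: 0, index: index};
    }
    counts[line].count++;
  });
  return counts;
}

// Lines that occur exactly once in a and exactly once in b,
// in the order they appear in a.
function uniqueCommonLines(a, b) {
  var inA = countLines(a);
  var inB = countLines(b);
  var anchors = [];
  a.forEach(function (line, index) {
    if (inA[line].count === 1 && inB[line] && inB[line].count === 1) {
      anchors.push({line: line, aIndex: index, bIndex: inB[line].index});
    }
  });
  return anchors;
}

var anchors = uniqueCommonLines(
  ["{", "x", "}", "{", "y", "}"],
  ["{", "y", "}", "{", "x", "z", "}"]);
// The braces repeat, so only "x" and "y" qualify as anchors.
```

Repeated lines such as lone braces never become anchors, which is why this approach tends to suit program text.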

Patch

  • Patch applies a delta — A + delta = B
    
    js> uneval(Diff.patch("hello, world", delta).join(""))
    "head cold"
        
  • Inverting a delta is possible — B – delta = A
    
    js> Diff.invert_patch(delta);
    js> uneval(Diff.patch("head cold", delta).join(""))
    "hello, world"
        

In some revision-control systems, e.g. darcs, inverting a patch is a central operation. Darcs in particular has a full (and very useful!) "theory of patches", where patch inversion, commutation and merging are developed formally.
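To make the delta format concrete, here is a minimal sketch of applying and inverting a delta shaped like the diff_patch output shown earlier. The names applyDelta and invertDelta are hypothetical stand-ins, not the library's own functions, and invertDelta returns a new delta rather than changing its argument.

```javascript
// Hypothetical helpers, for illustration only.

// Apply a delta (a list of hunks with file1/file2 offset, length, chunk)
// to the array file1, producing the array for file2.
function applyDelta(file1, delta) {
  var result = [];
  var position = 0; // next unread index in file1
  delta.forEach(function (hunk) {
    // Copy the unchanged run before this hunk, then the replacement.
    result = result.concat(file1.slice(position, hunk.file1.offset));
    result = result.concat(hunk.file2.chunk);
    position = hunk.file1.offset + hunk.file1.length;
  });
  return result.concat(file1.slice(position));
}

// Inverting a delta just swaps the two sides of every hunk.
function invertDelta(delta) {
  return delta.map(function (hunk) {
    return {file1: hunk.file2, file2: hunk.file1};
  });
}

var delta = [
  {file1: {offset: 2, length: 4, chunk: ["l", "l", "o", ","]},
   file2: {offset: 2, length: 2, chunk: ["a", "d"]}},
  {file1: {offset: 7, length: 1, chunk: ["w"]},
   file2: {offset: 5, length: 1, chunk: ["c"]}},
  {file1: {offset: 9, length: 1, chunk: ["r"]},
   file2: {offset: 7, length: 0, chunk: []}}
];

var forwards = applyDelta("hello, world".split(""), delta).join("");
var backwards = applyDelta("head cold".split(""), invertDelta(delta)).join("");
// forwards is "head cold"; backwards is "hello, world"
```

Because each hunk records both sides, inversion needs no access to either file.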

Variations

  • Diff.diff_comm - works like a simple Unix comm(1)
  • Diff.diff_patch - works like a simple Unix diff(1)
  • Diff.invert_patch - inverts a patch produced by diff_patch
  • Diff.patch - works like a (very) simple Unix patch(1)
  • Diff.diff_indices - like diff_patch, but only gives offset and length information

Diff Demo

Diff3 and Three-Way Merging

  • History-sensitive merge
  • Some changes are conflicts; others are not
  • Automatically smart about which is which, in a fairly understandable and predictable way
  • Sometimes blows up badly - for details, see
  • Better merge algorithms than diff3 exist (revctrl.org has the details)

Diff3 and Three-Way Merging

"this"  "base"  "other"  Result         Notes
A       A       A        A              no changes
A       A       B        B              "other" wins
B       A       A        B              "this" wins
B       A       B        B or conflict  accidental clean merge
B       A       C        conflict       "true" conflict
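The table can be read as a small decision function over a single region of the file. This is a sketch under assumptions: the name mergeValue is hypothetical, the result shape mimics the ok/conflict records used elsewhere in these slides, and the accidental clean merge row is resolved to B rather than reported as a conflict.

```javascript
// Hypothetical helper, for illustration only.
// Decide the merged value for one region, given its three versions.
function mergeValue(thisValue, baseValue, otherValue) {
  if (thisValue === otherValue) {
    // No changes at all, or both sides made the same change
    // (the "accidental clean merge" row; a stricter tool could conflict here).
    return {ok: thisValue};
  }
  if (thisValue === baseValue) {
    return {ok: otherValue}; // only "other" changed
  }
  if (otherValue === baseValue) {
    return {ok: thisValue}; // only "this" changed
  }
  // Both sides changed, in different ways.
  return {conflict: {a: thisValue, o: baseValue, b: otherValue}};
}

var rows = [
  mergeValue("A", "A", "A"),
  mergeValue("A", "A", "B"),
  mergeValue("B", "A", "A"),
  mergeValue("B", "A", "B"),
  mergeValue("B", "A", "C")
];
```

The first four rows come back as ok records and only the last is a conflict.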

Diff3 and Three-Way Merging

var base =
  "the quick brown fox jumped over a dog".split(/\s+/);

var derived1 =
  "the quick fox jumps over some lazy dog".split(/\s+/);
var derived2 =
  "the quick brown fox jumps over some record dog".split(/\s+/);

var mergeResult = Diff.diff3_merge(derived1, base, derived2, true);

/* [{ok:["the", "quick", "fox", "jumps", "over"]},
    {conflict:{a:["some", "lazy"],   aIndex:5,
               o:["a"],              oIndex:6,
               b:["some", "record"], bIndex:6}},
    {ok:["dog"]}] */
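One way to consume a result of this shape is to flatten it into text with conflict markers. The following is a sketch only: renderMergeResult is a hypothetical helper, and the marker strings are borrowed from the familiar Unix convention rather than taken from the library.

```javascript
// Hypothetical helper, for illustration only.
// Flatten a list of {ok: [...]} and {conflict: {a, o, b}} regions into
// a single string, wrapping each conflict in markers.
function renderMergeResult(mergeResult) {
  var words = [];
  var conflictCount = 0;
  mergeResult.forEach(function (region) {
    if (region.ok) {
      words = words.concat(region.ok);
    } else {
      conflictCount++;
      words = words.concat(["<<<<<<<"], region.conflict.a,
                           ["======="], region.conflict.b,
                           [">>>>>>>"]);
    }
  });
  return {text: words.join(" "), conflictCount: conflictCount};
}

// The merge result from the slide above, written out as data.
var rendered = renderMergeResult([
  {ok: ["the", "quick", "fox", "jumps", "over"]},
  {conflict: {a: ["some", "lazy"],   aIndex: 5,
              o: ["a"],              oIndex: 6,
              b: ["some", "record"], bIndex: 6}},
  {ok: ["dog"]}
]);
// rendered.text is
// "the quick fox jumps over <<<<<<< some lazy ======= some record >>>>>>> dog"
```

A conflictCount of zero would mean the merge was clean and the text can be used as is.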

Diff3 Demo

Least Common Ancestor

Tree with LCA marked

LCA is defined for trees, and efficient algorithms for computing it are known. It has also been defined for DAGs, which is the case we have in a DVCS, but that definition leads to some problems in our case.
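A naive sketch shows where the trouble comes from. It assumes history is given as a map from revision ID to a list of parent IDs (a hypothetical shape), and it returns every candidate rather than picking one. In a tree there is always exactly one candidate; in a DAG there can be several.

```javascript
// Hypothetical helpers, for illustration only. Not efficient.

// All ancestors of id (including id itself), as a set-like object.
function ancestorsOf(parents, id) {
  var seen = {};
  var stack = [id];
  while (stack.length > 0) {
    var current = stack.pop();
    if (seen[current]) continue;
    seen[current] = true;
    (parents[current] || []).forEach(function (p) { stack.push(p); });
  }
  return seen;
}

// Common ancestors of x and y that are not themselves a proper ancestor
// of some other common ancestor.
function leastCommonAncestors(parents, x, y) {
  var ax = ancestorsOf(parents, x);
  var ay = ancestorsOf(parents, y);
  var common = Object.keys(ax).filter(function (id) { return ay[id]; });
  return common.filter(function (candidate) {
    return !common.some(function (other) {
      return other !== candidate && ancestorsOf(parents, other)[candidate];
    });
  }).sort();
}

// A tree: one answer.
var tree = {r: [], m: ["r"], p: ["m"], q: ["m"]};
var treeAnswer = leastCommonAncestors(tree, "p", "q"); // ["m"]

// A made-up criss-cross DAG: two equally good answers.
var dag = {root: [], a: ["root"], b: ["root"], x: ["a", "b"], y: ["a", "b"]};
var dagAnswer = leastCommonAncestors(dag, "x", "y"); // ["a", "b"]
```

A merge needs a single base revision, so something has to break the tie when more than one candidate comes back.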

DVCS Components

  • History
    • ancestry DAG
  • Merging
    • choosing an ancestor
    • conflict detection and handling
  • Synchronisation
    • copying history between repositories
    • network transfer, file system, ...
  • Storage
    • of file versions
    • of changesets/patches

History

History is a DAG of changesets.

Each changeset should record

  • Its own unique identity (a UUID)
  • Which specific versions of files are alive, with their paths and metadata
  • Which files are dead (deleted)
  • IDs of parent changeset(s)
  • Timestamp, comment, committer, ...

Many modern DVCSs use some function of the contents of an object to identify the object, e.g. a SHA-1 hash. This has a lot of nice properties, and is a good choice. JavaScript doesn't have particularly good support for binary data, which makes hashing (and canonical binary representations!) awkward, so I chose to use simple random UUIDs for identifiers, instead.
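As an illustration, a changeset record along these lines might look like the sketch below. All field and function names are hypothetical, and the UUID generator uses Math.random, which is adequate for a demo but not for anything security-sensitive.

```javascript
// Hypothetical helpers and field names, for illustration only.

// A random (version 4 style) UUID built from Math.random.
function randomUuid() {
  return "xxxxxxxx-xxxx-4xxx-yxxx-xxxxxxxxxxxx".replace(/[xy]/g, function (c) {
    var r = Math.floor(Math.random() * 16);
    var v = (c === "x") ? r : ((r & 0x3) | 0x8);
    return v.toString(16);
  });
}

// Build a changeset record holding the items listed on the slide.
function makeChangeset(parentIds, alive, dead, info) {
  return {
    id: randomUuid(),            // its own unique identity
    parents: parentIds.slice(),  // IDs of parent changeset(s)
    alive: alive,                // path -> {version: ..., metadata: ...}
    dead: dead.slice(),          // paths of deleted files
    timestamp: info.timestamp,
    comment: info.comment,
    committer: info.committer
  };
}

var first = makeChangeset([], {"README": {version: "v1"}}, [],
  {timestamp: 0, comment: "initial import", committer: "alice"});
var second = makeChangeset([first.id], {"README": {version: "v2"}}, ["TODO"],
  {timestamp: 1, comment: "update README, drop TODO", committer: "bob"});
```

A merge changeset would simply list two parent IDs, which is what turns the history into a DAG rather than a chain.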

History

History DAG

Merging

  • Resolution step
  • Merge step
  • Deciding what makes a good merge algorithm is an actively researched area
  • DieDieDieMerge

Synchronisation

  • What do you know that I do not? ("pull")
  • What do I know that you do not? ("push")
  • Updating a working-copy is a separate operation (a merge)
  • Can synchronise using any transport or representation that can
    • inform others about the revision IDs it holds
    • export revisions and their contents in a standard form
    • import revisions submitted to them in standard form
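The three requirements above are enough to sketch pull and push between two in-memory repositories. The Repository shape and method names here are hypothetical, and JSON merely stands in for "a standard form".

```javascript
// Hypothetical in-memory repository, for illustration only.
function Repository() {
  this.revisions = {}; // revision ID -> revision record
}

// Inform others about the revision IDs held.
Repository.prototype.revisionIds = function () {
  return Object.keys(this.revisions);
};

// Export the named revisions in a standard form (JSON, as an assumption).
Repository.prototype.exportRevisions = function (ids) {
  var self = this;
  return JSON.stringify(ids.map(function (id) { return self.revisions[id]; }));
};

// Import revisions submitted in that standard form.
Repository.prototype.importRevisions = function (exported) {
  var self = this;
  JSON.parse(exported).forEach(function (revision) {
    self.revisions[revision.id] = revision;
  });
};

// "What do you know that I do not?"
function pull(local, remote) {
  var missing = remote.revisionIds().filter(function (id) {
    return !Object.prototype.hasOwnProperty.call(local.revisions, id);
  });
  local.importRevisions(remote.exportRevisions(missing));
  return missing;
}

// "What do I know that you do not?" is a pull seen from the other side.
function push(local, remote) {
  return pull(remote, local);
}

var mine = new Repository();
var yours = new Repository();
mine.revisions = {r1: {id: "r1", parents: []}, r2: {id: "r2", parents: ["r1"]}};
yours.revisions = {r1: {id: "r1", parents: []}, r3: {id: "r3", parents: ["r1"]}};

var pulled = pull(mine, yours); // ["r3"]
var pushed = push(mine, yours); // ["r2"]
```

Neither operation touches a working copy; afterwards both repositories hold r1, r2 and r3, and bringing r2 and r3 together is left to a separate merge.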

Storage

Storage is the database that holds all information about the current and past state of the repository, in every branch, for every commit.

  • Design storage around query patterns: user interface is central
  • Design for efficient synchronisation with other repositories
  • Finally, efficient use of disk space can be important, too (less and less as time goes by)

Storage

  • Narrow API - interface to merging, synchronisation, history is fairly simple
  • Must be robust; should enjoy ACID properties
  • Can use compression (e.g. gzip) under the covers
  • Can store full snapshots, or deltas from newer to older versions, or both, as required
    • tradeoff space/speed/convenience
    • Mercurial does well here: deltas, with periodic snapshots when deltas grow too large, giving O(1) arbitrary version retrieval
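The delta-plus-snapshot idea can be sketched as follows. This is not Mercurial's format: the VersionStore name is hypothetical, the delta is a single splice found by trimming the common prefix and suffix, and a snapshot is taken at a fixed interval rather than when the deltas grow too large. The point is only that retrieval never replays more than a bounded number of deltas.

```javascript
// Hypothetical helpers, for illustration only.

// A delta is one splice: trim the common prefix and suffix, keep the middle.
function makeSplice(oldLines, newLines) {
  var start = 0;
  while (start < oldLines.length && start < newLines.length &&
         oldLines[start] === newLines[start]) {
    start++;
  }
  var oldEnd = oldLines.length;
  var newEnd = newLines.length;
  while (oldEnd > start && newEnd > start &&
         oldLines[oldEnd - 1] === newLines[newEnd - 1]) {
    oldEnd--;
    newEnd--;
  }
  return {start: start, deleteCount: oldEnd - start,
          insert: newLines.slice(start, newEnd)};
}

function applySplice(lines, delta) {
  var copy = lines.slice();
  Array.prototype.splice.apply(
    copy, [delta.start, delta.deleteCount].concat(delta.insert));
  return copy;
}

function VersionStore(snapshotInterval) {
  this.snapshotInterval = snapshotInterval;
  this.entries = []; // each entry is {snapshot: [...]} or {delta: {...}}
}

// Store a new version; returns its index.
VersionStore.prototype.add = function (lines) {
  var index = this.entries.length;
  if (index % this.snapshotInterval === 0) {
    this.entries.push({snapshot: lines.slice()});
  } else {
    this.entries.push({delta: makeSplice(this.get(index - 1), lines)});
  }
  return index;
};

// Retrieve a version: find the nearest earlier snapshot, replay deltas.
VersionStore.prototype.get = function (index) {
  var base = index;
  while (!this.entries[base].snapshot) {
    base--;
  }
  var lines = this.entries[base].snapshot.slice();
  for (var i = base + 1; i <= index; i++) {
    lines = applySplice(lines, this.entries[i].delta);
  }
  return lines;
};

var store = new VersionStore(3);
var versions = [["a"], ["a", "b"], ["a", "x", "b"], ["x", "b"], ["x", "b", "c"]];
versions.forEach(function (v) { store.add(v); });
```

With an interval of 3, versions 0 and 3 are stored whole and any retrieval replays at most two deltas.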

Storage

Example user-interface-led design criteria:

  • retrieving an old version of a single file should be fast
  • retrieving snapshots of old versions of the whole repository should be fast
  • retrieving history for a single file or the whole repository should be fast
  • avoid dragging in too many unrelated records when performing single-file operations (think Wikipedia-scale system: how to avoid querying the whole repo when viewing a single page?)

DVCS Demo

The End

Any questions?

Ambiguous LCA

 

Difficult LCA example

Here's a problem case. The LCA of "e" and "i" is either "c" or "h". Both "c" and "h" are two steps away from the root.

Note that the path from "e" to "i" through "c" is three steps long, while the path through "h" is two steps long. This could mean that "h" is a more suitable ancestor for use in 3-way merging.

The algorithm I've implemented is very naive and inefficient. It also answers "c" or "h" depending on the order of arguments you give it.

Criss-cross Merge

Criss-cross Merge Diagram

Truncating History

History can be truncated: the missing piece may be held by other servers, or the merge can fall back to two-way merge.