Java Database, Versioned

Introduction

This article explores the implementation of a mechanism for versioning records in a database. The database used is Berkeley DB JE and the implementation is available in rez.

History

This project has a long, slow-burning history stretching back over a decade. Originally started in C and Berkeley DB, it has undergone multiple test implementations which tried and discarded many ideas. The intended target is, and always was, a personal micro-server for serving documents, diaries, and software on the World Wide Web.

The result so far should be a library which is quite "sane", very "small", yet mostly "complete".

Several ideas were tried and rejected, including capabilities such as "cheap copies" as seen in Subversion. These are conceptually difficult to understand and add unnecessary complication to the database design. This and some other ideas were thrown in the bin where they belong.

Features

The design starting point is the ability to support multiple revisions of named objects, and to support branching.

Starting from and including these minimalist core requirements, a number of further features are desirable: symbolic names for objects, tags for revisions, an author-based audit trail, and space and time efficient storage.

The design presented is capable of all these goals.

Berkely DB JE DPL

The library is a layer atop the Berkeley DB JE DPL interface. This interface uses annotated plain old Java objects (POJOs) to describe RDBMS-like tables.

It supports many of the typical RDBMS features including primary and secondary keys, sequences, foreign key constraints, ordered access and transactional integrity.

All of these features are utilised in this library.
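For reference, opening the environment and store and deriving the indices requires only a few lines. The following is a minimal sketch, not the actual rez initialisation code; the directory, store name, and field names are placeholders, and the entity classes are those defined later in this article.

import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.persist.EntityStore;
import com.sleepycat.persist.PrimaryIndex;
import com.sleepycat.persist.SecondaryIndex;
import com.sleepycat.persist.StoreConfig;

public class Store {

    final Environment env;
    final EntityStore store;
    final PrimaryIndex<Long, Revision> revisionIndex;
    final SecondaryIndex<Long, Long, Revision> revisionByBranch;

    public Store(File dir) {
        EnvironmentConfig envConfig = new EnvironmentConfig();
        envConfig.setAllowCreate(true);
        envConfig.setTransactional(true);
        env = new Environment(dir, envConfig);

        StoreConfig storeConfig = new StoreConfig();
        storeConfig.setAllowCreate(true);
        storeConfig.setTransactional(true);
        store = new EntityStore(env, "rez", storeConfig);

        // Primary and secondary indices are derived from the annotations.
        revisionIndex = store.getPrimaryIndex(Long.class, Revision.class);
        revisionByBranch = store.getSecondaryIndex(revisionIndex, Long.class, "branchID");
    }
}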

Revision Tree

The following textual diagram illustrates the requirement and will be referred to in the following sections.

It shows a 'typical' scenario of a set of documents being authored and updated over time.

        object 1                         object 2
        trunk     branch1  branch2       trunk      branch1
 2      create                           
 3      update
 4                update
 5      update                           create        
 6      update
 7      update                           update
 8                update                            update
 9                update
10                         update

The vertical axis represents time divided into individual revision steps. Without loss of generality it has intentionally been formatted so that updates to multiple objects are only shown at the same step if they are written to the same branch.

The goal is to be able to support storage and retrieval of multiple objects across multiple revisions and branches, ideally in a space and time efficient manner.

How?

Revision Record

All that is required to encode this is a simple tree node which has an identifier and a parent link. The identifier is the revisionid and the parent link is the branch id.

The following shows how the data can be encoded this way. A revision id of 1 is reserved for the trunk branch.

    revisionid    branchid
         2          1
         3          1
         4          3
         5          1
         6          1
         7          1
         8          7
         9          7
        10          9

This covers adding to the table, but how is branch membership determined?

A revision is on a given branch if it has a sub-branch in common with the test branch. This can be simplified to testing whether the part of the branch a given revision belongs to is in common with the test branch. If the obvious constraint is added that all revision numbers are monotonically increasing, then this becomes a short series of trivial range tests.

For example, to test if any revision is on branch 9, the following history table is required. It is simply the branch to trunk path excluding the trunk since that can be inferred by design.

  history = [9,7]
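
Because a branch id is simply the revision id of its fork point, this history table can be derived by walking the parent links back to the trunk. A minimal sketch, assuming a branchOf() helper which returns the branchid field stored in a given Revision record (in DPL terms, revisionIndex.get(id).branchID):

  // historyOf(9) yields [9, 7] for the table above.
  List<Long> historyOf(long branchid) {
    List<Long> history = new ArrayList<>();
    while (branchid != TRUNKID) {        // trunk (1) is excluded by design
      history.add(branchid);
      branchid = branchOf(branchid);     // assumed: parent link lookup
    }
    return history;
  }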

Branch membership can be tested using the following code. It requires only the two values defined for each revision record: the revisionid and the branchid.

  boolean isOnBranch(long revisionid, long branchid) {
    long tail = Long.MAX_VALUE;
    for (long test : history) {
      if (branchid == test && revisionid < tail)
        return true;
      tail = test;
    }
    return branchid == TRUNKID && revisionid < tail;
  }

Note that the cost of this test is only a function of how many branches there are from a given leaf to the trunk - regardless of how many total revisions or branches are stored. Even in the most complex of document trees this rarely exceeds the number of fingers on one (even animated) hand.

This is the core around which the entire library is built. The rest follows not-so-trivially from applying some basic and sane use-cases and making them work with this trivial data structure.

Data Model

To support the desired goals the data is split into several independent trees which share the same global version tree.

Revision

The revision record simply records the revisionID and branchID as defined previously.

@Entity
public class Revision {
    
    @PrimaryKey(sequence = "RevisionID")
    public long id;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Revision.class)
    public long branchID = Revision.TRUNKID;

    public long ctime;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Author.class)
    public long authorID;    
}

DPL enforces referential integrity through key constraints and the only other requirement is that id be assigned sequentially.

The other fields provide an audit trail as explained in a following section.

Tag

A tag simply assigns a symbolic name to a specific revision.

@Entity(version = 1)
public class Tag {

    @PrimaryKey
    public String name;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Revision.class)
    public long revisionID;

    public long ctime;
    
    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Author.class)
    public long authorID;    
}

Again, DPL provides trivial enforcement of uniqueness and integrity constraints.

Tags also have an Author and creation time.

Versioned Data Blob

As with other similar systems data versioning is accomplished by storing a chronologically ordered history of revisions in the database.

Ordering allows the changes to be stored as deltas for space efficiency; in the case of rez, reverse deltas are used, with DEZ1 as the delta encoding mechanism. Reverse deltas are particularly efficient if modifications consist largely of additions, as the reverse is simply a delete.

@Entity
public class Data {

    @PrimaryKey(sequence = "DataID")
    public long id;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Data.class)
    public long rootID;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Revision.class)
    public long revisionID;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Revision.class)
    public long branchID;

    public byte[] data;
}

Ordering is enforced by using sequential primary key identifiers and having a secondary key constraint against this primary key. The rootID value always refers to the first revision.

This independent blob history is tied to the revision history via the revisionID field. The corresponding branchID for a given revision is copied into the Data record for simplified retrieval; branch membership tests can then be performed without loading any Revision records.

The full record contains some information about the data stored and allows for compression, external storage, and delta algorithm selection. It doesn't contain anything regarding creation time or author as that is supplied implicitly by the Revision record.

Artifact

Whilst the previous description of the data storage is complete and sufficient to store versioned objects, it does not include a symbolic name.

Artifact records attach path-name and possibly other meta-data to sets of data objects. As with data records they are ordered chronologically and attached to the revision tree which allows for renaming and deletion.

@Entity(version = 1)
public class Artifact {

    @PrimaryKey(sequence = "ArtifactID")
    public long id;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Data.class)
    public long rootID;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Revision.class)
    public long revisionID;

    @SecondaryKey(relate = MANY_TO_ONE, relatedEntity = Revision.class)
    public long branchID;

    @SecondaryKey(relate = MANY_TO_ONE)
    public String name;

    @SecondaryKey(relate = MANY_TO_MANY)
    public Set<String> keywords = new HashSet<>();

    public Map<String, String> properties = new HashMap<>();
}

Again the DPL makes the implementation almost trivial via secondary key constraints. As with a data record both the revisionID and branchID are included for efficient retrieval.

The rez (0.1) library currently includes additional meta-data such as stringified properties and string flags in the Artifact record, but due to limitations this may be changed. For example, it prevents the records from being write-once.

Author

Even for personal or embedded use an audit trail is useful.

This is provided by an Author record which is attached to most updates, either indirectly by the Revision the update occurred in, or by explicit inclusion.

@Entity
public class Author {

    @PrimaryKey(sequence = "AuthorID")
    public long id;

    public long ctime;
    public long mtime;

    @SecondaryKey(relate = ONE_TO_ONE)
    public String name;

    public byte[] pass;
}

Currently Author records are not versioned themselves but they could be using a mechanism similar to the Data records via a rootID.

Implementing common "use" cases

This section will cover basic implementation of the scenarios which demonstrate that this design meets the stated goals.

In most of the following cases, "on a given branch" is an implied requirement.

Checkpoint - add new

A checkpoint operation begins by creating a revision record. Once created, any number of modifications can be included. This operation is bounded by a transaction to ensure referential integrity is retained.

A Data object is created with both the id and rootID set to a new value from the DataID sequence. The other fields are initialised to match the authoring revision and the desired data value. As there is only one revision of the data it is stored in raw format.

An artifact is then created. Its rootID is set to match the Data.rootID and the other fields initialised to appropriate values.

Once both are written the operation is complete.
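
A minimal sketch of the whole operation follows. The index fields are those from the setup sketch earlier; branchid, authorid, name, and value are the caller's inputs, and the sequence is read directly so that rootID can refer to the record's own id.

Transaction txn = env.beginTransaction(null, null);
try {
    Revision rev = new Revision();
    rev.branchID = branchid;
    rev.ctime = System.currentTimeMillis();
    rev.authorID = authorid;
    revisionIndex.put(txn, rev);     // id assigned from the RevisionID sequence

    // Allocate the data id up-front so rootID can be self-referential.
    long dataid = store.getSequence("DataID").get(txn, 1);
    Data data = new Data();
    data.id = dataid;
    data.rootID = dataid;            // the first record is its own root
    data.revisionID = rev.id;
    data.branchID = rev.branchID;
    data.data = value;               // only one revision so far: stored raw
    dataIndex.put(txn, data);

    Artifact art = new Artifact();
    art.rootID = data.rootID;
    art.revisionID = rev.id;
    art.branchID = rev.branchID;
    art.name = name;
    artifactIndex.put(txn, art);     // id assigned from the ArtifactID sequence

    txn.commit();
} catch (RuntimeException e) {
    txn.abort();
    throw e;
}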

Issues

Additional checks are required to ensure the name is unique on the branch; this introduces a race condition which must be handled separately.

Checkpoint - update data

The most common case for checkpointing data is that the data changes and the meta-data does not.

The previously newest record is read and its value is replaced with a delta encoding the changes from the new value to its previous value. A new record is created with its rootID set to match, and the other fields initialised appropriately. It holds a raw copy of the latest revision.

Once both are written the operation is complete.
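
A sketch of the two writes, assuming a hypothetical encode(from, to) wrapper around the DEZ1 encoder and a hypothetical newest() helper which returns the last record of the data-by-rootID subindex:

// Replace the previously-raw newest record with a reverse delta and
// append a new raw record, within the checkpoint transaction.
Data prev = newest(rootid, txn);
prev.data = encode(value, prev.data);    // delta from new value to old value
dataIndex.put(txn, prev);

Data next = new Data();
next.rootID = rootid;
next.revisionID = rev.id;
next.branchID = rev.branchID;
next.data = value;                       // the latest revision is stored raw
dataIndex.put(txn, next);                // id assigned from the DataID sequence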

Issues

Again, ordering must be guaranteed outside of the database itself.

Checkpoint - update metadata

This is a trivial storage of a new artifact record with the updated values and a matching rootID.
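
In code this is little more than the following sketch, reusing the illustrative names from the checkpoint above:

Artifact art = new Artifact();
art.rootID = rootid;                 // same root as the existing artifact
art.revisionID = rev.id;
art.branchID = rev.branchID;
art.name = newName;                  // e.g. a rename
artifactIndex.put(txn, art);         // id assigned from the ArtifactID sequence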

Obviously this can be combined with a data update in the same transaction.

Currently in rez (0.0) deletion is implemented by writing a new metadata record with a deleted flag set but this may change, for example to write a record with a null name.

Locate an artifact by name

This is implemented in two phases. The first is finding the newest artifact with the given name on the target branch. The second is verifying that the given name is the current name of the target object; it may have been renamed since, for example.

The first is achieved by looking up the name in the artifact-by-name secondary index. This returns an ordered list of artifact ids which is iterated in reverse order until one matches the desired branch.

The second phase is performed in a similar manner but the artifact-by-rootID secondary index is scanned relative to the rootID of the found artifact instead.

If these both point to the same artifact then this is the result, otherwise that name is not present on this branch (at the latest revision).
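
A sketch of both phases, using plain cursor iteration rather than the stream wrappers described later; artifactByName and artifactByRoot are the two secondary indices, and isOnBranch() tests against the history of the target branch as defined earlier:

Artifact locate(String name) {
    // Phase one: newest artifact with this name on the target branch.
    Artifact byName = null;
    try (EntityCursor<Artifact> c = artifactByName.subIndex(name).entities()) {
        for (Artifact a = c.last(); a != null; a = c.prev()) {
            if (isOnBranch(a.revisionID, a.branchID)) {
                byName = a;
                break;
            }
        }
    }
    if (byName == null)
        return null;

    // Phase two: newest artifact for the same object on the target branch.
    Artifact newest = null;
    try (EntityCursor<Artifact> c = artifactByRoot.subIndex(byName.rootID).entities()) {
        for (Artifact a = c.last(); a != null; a = c.prev()) {
            if (isOnBranch(a.revisionID, a.branchID)) {
                newest = a;
                break;
            }
        }
    }

    // Same record: the name is live. Otherwise it was renamed or deleted.
    return (newest != null && newest.id == byName.id) ? byName : null;
}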

Iterate all artifact history

As artifacts refer to data objects the rootID is used as the artifact key. The full history of the artifact is contained in the subindex of the artifact-by-rootID secondary index. It is already ordered chronologically.

This list can be filtered by the branch if so desired.

Iterate all data history

The data records for a given blob are all contained in the data-by-rootID subindex of the data.rootID secondary index.

This is simply iterated in reverse order, with each record filtered by branch if desired.

Retrieve value for a branch

Because records are stored using delta compression in chronological order, all records until the desired one must be visited in reverse order.

Again the data-by-rootID subindex provides the necessary ordering and the records themselves provide sufficient information to calculate branch membership.

The index is iterated in reverse order, applying reverse deltas as appropriate, until the target branch is matched.
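
A sketch, where dataByRoot is the data-by-rootID secondary index and apply(delta, value) is a hypothetical wrapper around the DEZ1 decoder:

byte[] valueOnBranch(long rootid) {
    byte[] value = null;
    try (EntityCursor<Data> c = dataByRoot.subIndex(rootid).entities()) {
        for (Data d = c.last(); d != null; d = c.prev()) {
            // The newest record is raw; every older record is a reverse delta.
            value = (value == null) ? d.data : apply(d.data, value);
            if (isOnBranch(d.revisionID, d.branchID))
                return value;
        }
    }
    return null;    // this blob has no revision on the target branch
}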

Retrieve value for a snapshot

Again the data-by-rootID subindex is scanned and all records must be visited in order to reconstruct the value.

As the revision numbers are global they will typically be 'sparse' with respect to an individual blob so the specific snapshot revision is unlikely to match exactly. As long as the blob revision is on the branch then the youngest blob revision which is older than the snapshot revision is the correct one to use.
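
Only the stopping test of the previous sketch changes:

// Stop at the youngest on-branch record no newer than the snapshot.
if (isOnBranch(d.revisionID, d.branchID) && d.revisionID <= snapshot)
    return value;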

Iterate all artifacts on a branch

This uses the artifact-by-rootID index directly. This index will contain duplicated keys as all artifacts will be grouped by the rootID.

By first iterating over the duplicate keys in reverse order it can find the newest artifact that matches the branch. As soon as one is found, or if the duplicates are exhausted, it then moves to the next blob by stepping to the previous non-duplicate key.

An obvious optimisation may be to initially read the oldest artifact record and confirm it is on the target branch - if not then the rest of the key duplicates can be skipped. As renames are not common this is likely not worth the added complexity.
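
A sketch of the basic scan, assuming a visit() callback; prevDup() and prevNoDup() are the DPL cursor operations for moving within and between duplicate-key groups:

try (EntityCursor<Artifact> c = artifactByRoot.entities()) {
    Artifact a = c.last();
    while (a != null) {
        if (isOnBranch(a.revisionID, a.branchID)) {
            visit(a);             // newest live artifact for this rootID
            a = c.prevNoDup();    // skip the rest of this rootID group
        } else {
            Artifact p = c.prevDup();              // older record, same rootID
            a = (p != null) ? p : c.prevNoDup();   // group exhausted: move on
        }
    }
}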

Iterate all artifacts in a snapshot

This is similar to the previous case with the added proviso that the artifact revisionID must be less than or equal to the snapshot revisionID.

Iterate all artifacts in a path-tree

Artifacts are stored in a flat namespace. But as the names are sorted automatically by the btree index, a filesystem-like hierarchical view of the namespace can be achieved by performing indexed sub-range searches rather than full table scans.

As with other artifact resolution each matching result must be cross-checked to verify it is the latest active name on a branch.
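
A sketch of one such sub-range search, using the range form of entities(); the sentinel used to bound the range is an assumption about how names might be encoded:

// All artifact records whose name starts with "diary/", in name order.
String prefix = "diary/";
try (EntityCursor<Artifact> c =
        artifactByName.entities(prefix, true, prefix + '\uffff', true)) {
    for (Artifact a : c)
        consider(a);    // hypothetical: cross-check as described above
}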

Streams & DPL

Whilst previous iterations used lists, callbacks, and so on, rez uses Java 8 Streams for returning any iterated items. Why? Mostly because it's the new thing, I think. Potentially it also allows for streamed processing which reduces peak memory requirements.

These have been implemented using a small number of special case spliterators. They make use of the Stream.onClose() callback in order to close the DPL EntityCursor, so the streams must be explicitly closed by the caller. It is still unclear how to handle declared exceptions cleanly as streams seem to preclude their use at every level.

CursorPrevSpliterator
This simply calls EntityCursor.prev and passes any result onwards.

This is used to iterate through an index in reverse order.

The standard stream filter() and findFirst() can be used to further fine-tune the results.
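
A minimal sketch of how such a spliterator can be wired to a stream; this is illustrative rather than the actual rez implementation:

import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

import com.sleepycat.persist.EntityCursor;

static <T> Stream<T> reverse(EntityCursor<T> cursor) {
    Spliterator<T> s = new Spliterators.AbstractSpliterator<T>(
            Long.MAX_VALUE, Spliterator.ORDERED | Spliterator.NONNULL) {
        @Override
        public boolean tryAdvance(Consumer<? super T> action) {
            // On an uninitialised cursor prev() positions at the last record.
            T v = cursor.prev();
            if (v == null)
                return false;
            action.accept(v);
            return true;
        }
    };
    // The caller must close() the stream to release the cursor.
    return StreamSupport.stream(s, false).onClose(cursor::close);
}

Usage then composes with the standard operations, for example reverse(dataByRoot.subIndex(rootid).entities()).filter(...).findFirst().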

CursorPrevStopSpliterator
After calling EntityCursor.prev and passing on a result it invokes a termination test which finishes the stream.

It may be that findFirst() can be used to provide a similar functionality in practice but it doesn't return the same results. findFirst() only returns a single value whereas this spliterator provides the whole stream as output.

This is used to produce a truncated stream based on a stopping condition. For example, to calculate the value of a blob at a particular revision, its history from that point until now must be processed.

CursorScanSpliterator
This runs an inclusion predicate and if true passes the result onwards. If it fails it skips over any duplicate records and then continues testing.

This special case iterator is required because the inclusion test affects iteration order and cannot be codified using a filter() function or findFirst() operation.

This is used to scan all live artifacts on a given branch, skipping duplicate keys once one has been found.

Links

Contact

notzed on various mail servers, primarily gmail.com.


Copyright (C) 2016 Michael Zucchi, All Rights Reserved.