|
|
|
|
|
|
|
What is Multi-Master Replication?
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Automated Conflict Resolution
|
|
|
Assisted Conflict Resolution
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| This document provides a cursory overview the underlying technology making up a multi-mastered data repository. The techniques listed are specifically tuned toward XML data, although extensions to object, relational, and other data types are entirely possible. This multi-mastered data repository provides a foundation for data sharing between all components of the Ubiquity system. |
|
|
|
|
What is Multi-Master Replication?
|
|
|
|
| In essence, a system can claim "multi-master replication" as a feature if it satisfies two criteria: |
|
>
|
Replicated Data: Multiple, physically distinct copies of the data exist. |
|
>
|
Simultaneous Changes: Each copy of the data can be independently modified by a different "master". |
|
| At the price of temporary global inconsistency, a system equipped with "multi-master" capability can throw off the burdens of global locks and realize massive scalability benefits. However, despite the tremendous benefits of such a system, the complexity and difficulty of multi-mastered replication should not be understated. First, the benefits: |
|
>
|
No Global Locks: By far the biggest benefit is the elimination of "global locks", meaning anybody can make a change at any time without notifying anyone else. Seeing as how Internet-wide global locks require Internet access, eliminating global locks effectively eliminates dependency upon Internet access. Not only does this allow data to be modified when the Internet is not available, it allows servers to intentionally choose to delay synchronization until a convenient time. This intentional delaying allows servers clusters to scale far beyond what is possible with global locks. |
|
>
|
Local Access Speeds: Because the most commonly accessed data is cached locally, both data reads and writes operate at "local speeds". This means that modifications can occur without touching the network, particularly the Internet, achieving access latencies orders of magnitudes quicker. |
|
>
|
Storage Redundancy: A critical component of high availability clusters is redundant storage. Not only does this avoid loss of access in the event of partial failure, it greatly increases the chance of data recovery after a total failure. |
|
>
|
Offline Modifications: In addition to the scalability benefits afforded by the elimination of global locks, data can be safely modified in multiple, simultaneous locations, completely without network availability. This allows applications to be deployed "beyond the Internet", into very constrained form factors with only irregular network connectivity. Likewise, it allows for applications to operate effectively in extremely harsh networking environments where connections come and go in an erratic and unpredictable fashion. |
|
| Needless to say, single-mastered (meaning, globally locked) data is a major bottleneck for scalability, resilience, and speed. As detailed, systems that overcome the limitations of single-mastered data enjoy tremendous benefits. However, these systems only do so by overcoming substantial challenges. These challenges are discussed below: |
|
>
|
Global Inconsistency: Although difficult to quantify into concrete side-effects, applications that operate upon multi-mastered data must undergo a serious paradigm shift from single-mastered applications. This paradigm shift is introduced by the natural result of there being multiple simultaneous versions: there is no single correct version. As such, applications must anticipate the possibility that the data they work on is out-of-date or is currently being modified by some other application. In practice this complexity is effectively handled through Ubiquity's version control features, but it is still a real complexity that must be dealt with. |
|
>
|
Conflict Resolution: A key component of data sharing is learning of and integrating changes made by other applications. However, seeing as how any number of applications can change the same data at the same time, it's possible that simultaneous changes interfere with each other. For example, one application could set a particular value to be 23, while another one sets it to be 32 at the same time. This type of situation is called a "conflict". The process of choosing one, or the other, or some entirely different value is called "conflict resolution". Ubiquity has extensive technology to assist and simplify the conflict resolution process. |
|
| In summary, there are big benefits and big detriments to working in a multi-mastered, replicated data environment. However, Ubiquity has extensive technology support to mitigate the detriments, while at the same time amplifying the benefits. The result is an extremely powerful system for global data sharing. |
|
|
|
| Critical to Ubiquity's multi-master replication technology is a powerful, general-purpose version control system for XML documents. To quickly paraphrase the Data Structure document, data is stored in individual documents. Collections of uniquely named documents are called corpuses. There is a type of corpus, called a corpus class, that stores documents with complete version and concurrency control. |
|
| To best understand how this version controlled corpus works, it's illustrative to consider its very first use: |
|
1.
|
Application A contacts the corpus and requests a "workspace". Because A is the first application to use the corpus, this workspace is the very first workspace ever created by the corpus. |
|
2.
|
All corpus modifications, including modifications to the version control corpus, must occur within strictly defined transactional boundaries. As such, application A begins a transaction. |
|
3.
|
With the transaction begun, A can read, write, and operate upon the workspace freely. Of course, the workspace starts out empty, so all it can really do is create new documents and work with those. |
|
4.
|
Once A is done making its modifications, A "commits" the transaction. This indicates to the version control corpus that A is completely done with these modifications. |
|
5.
|
The corpus, at the end of this commit, now has its very first "version". Because this version was made by A, and it's the first version, let's call the version "A0". Once created, A0 is made static, and all future accesses must be read only. No application or server is allowed to modify a version once it's been committed. |
|
| At this point, the corpus has a single version, A0. This version is simply the final state of whatever documents were in the corpus at the time A committed the transaction. Now, let's consider the second time the corpus is used: |
|
1.
|
Application B contacts the corpus, and requests and receives a workspace. Even though the corpus has its first version, version A0, the workspace is empty. In this way, any application can start anew with a clean slate. |
|
2.
|
Application B doesn't want an empty workspace, however, so it asks the corpus what versions are available. The corpus informs the application that version A0 is available. B then "upgrades" its workspace to version A0. After the upgrade, B's workspace contains the exact same documents, with the exact same data, as A's workspace at the time A committed the transaction. |
|
3.
|
Before B can interact with the workspace, both for reading or writing, it must begin a transaction. B does this, modifies and removes some of the documents, while adding others. B then commits the transaction. |
|
4.
|
When B commits the transaction, the corpus creates its second version. Because the second version is merely a modification of the first version, let's call it B1. In this way it's clear that the version is composed of A0's original data, as well as B's changes. |
|
| Now there are two versions in the corpus: A0 and B1. Additionally, when B1 is created, A will be immediately notified: |
|
1.
|
At some point, A's corpus learns that B created version B1. Because A's corpus knows that A made A0, and B1 is a direct descendent of A0, it will notify A that a new version update is available. |
|
2.
|
When we last left it, A's workspace was at version A0. Upon learning of the presence of B1, A upgrades the workspace to the new version. This results in A's workspace being updated to the exact same state as B's workspace. |
|
| In this way applications A and B can safely share data through successive updates to the shared documents. |
|
|
|
| Because each application only learns of the other's update when the other completes its transaction, the data always remains in an internally consistent state. By "internally consistent" it is meant that that specific collection of documents represents a valid state of the shared data at some point in time. However, the data isn't always "globally consistent" as both applications might choose to begin transactions and modify data at the same time. In the event of simultaneous modification, each application sees a different version of the data. Each of these versions is internally consistent and valid, just different. Consider the following situation: |
|
1.
|
Application A, after updating its workspace to version B1, begins a transaction with the intent to determine what has changed and respond with more modifications. |
|
2.
|
However, at the same time, application B decides to also start a transaction. Because A hasn't yet completed its transaction, B's workspace is also at version B1. |
|
3.
|
After a time, A finishes its work and commits its transaction. As a result, A's corpus creates version A2 -- a direct modification of B1. |
|
4.
|
At around the same time, B finishes its work and commits version B2. |
|
5.
|
This results in two "next" version from version B1. Both A and B have made modifications to the same version at the same time. |
|
| Given two versions of the same data, it's natural to ask which one is "correct". Upon closer inspection, it appears that there are two answers: both, or neither. Regarding the first answer, both A2 and B2 contain valid data. Both A and B started with a valid version (B1), made valid modifications, and caused the creation of two more valid versions. For this reason, neither one or the other is "incorrect". However, at the same time it's obvious that neither versions paints a complete picture. A has made changes to A2 that do not appear in B2, and vice versa. As such, just as neither can be called "incorrect", neither can really be called "correct". When viewed graphically, the "path" of versions starting with A0 and B1 appears to "fork", resulting in new paths for both A2 and B2. This tendency to create "forked versions" is at the heart of the paradigm shift mentioned above, and is illustrated in Figure 1. |
|
| Eventually, both A and B learn that each of their changes were made at the exact same time as the other. Although both A and B have access to the other's version, neither can really make use of it. A can't upgrade to B's new version without sacrificing A's own changes, and vice versa. The two applications' workspaces have taken different paths. For one workspace to return to the same version as the other requires backtracking along its path, effectively giving up whatever changes sent it down the other path. Obviously, no combination of workspace upgrading and "downgrading" (reverting to a previous version) will result in the two applications returning to the same state without one repealing its changes. The solution to this problem is called "merging". Merging versions essentially requires creating a new version that both applications agree upon, as illustrated in Figure 2. |
|
Figure 2: Merged Version Fork
|
|
|
|
|
| Before getting into the complexities that arise when versions are merged, it's necessary to discuss how the version control technology works in the first place. In general, the technology is divided into roughly two tasks: (1) aggregating deltas and (2) applying deltas. However, before deltas can be discussed, it's necessary to define the document itself. |
|
|
|
|
Sample XML and Document Schema
|
|
|
|
| For the purposes of this example, assume that all documents in the version controlled corpus are stored in "validated" XML, meaning the XML form, or "schema", is well-defined. Between their hierarchical nature and completely specified schemas, validated XML documents are very easy to work with and make for a simplified example. To further simplify the example, assume that applications A and B have been sharing a single document. Let's define version A0 of this document to be as described in Figure 3. |
|
Figure 3: Version A0
|
<ADDRESSBOOK> <ENTRY> <NAME>David Barret</NAME> <CITY>San Jose, CA</CITY> </ENTRY> </ADDRESSBOOK> |
|
| Without formally defining the schema of this document, let's just say that it consists of one ADDRESSBOOK node, which contains any number of ENTRY nodes in any order. Also, each ENTRY node consists of a NAME and CITY node, both of which contain an arbitrary alphanumeric string. |
|
|
|
| Assume that the client applications modify the XML documents through the standard Document Object Model (DOM) API. The sample document could be created with the following steps: |
|
1.
|
Create ADDRESSBOOK root node |
|
2.
|
Create ENTRY node under ADDRESSBOOK |
|
3.
|
Create NAME node under ENTRY |
|
4.
|
Create "David Barret" node under NAME |
|
5.
|
Create CITY node after NAME |
|
6.
|
Create "San Jose, CA" node after CITY |
|
| Anybody that executes these instructions in order will recreate the exact same file. This observation leads to the creation of "deltas". A delta is simply a pre-recorded set of instructions that, when executed, produce a document. The above steps could be recorded with the steps described in Figure 4. |
|
Figure 4: Delta A0'
|
<ADD> <ADDRESSBOOK> <ENTRY> <NAME>David Barret</NAME> <CITY>San Jose, CA</CITY> </ENTRY> </ADDRESSBOOK> </ADD> |
|
| Without going into great detail, the XML delta script in Figure 4 describes how to create the XML data in Figure 3. By monitoring what operations the application performs upon the DOM API, a similar delta file can be created for any type of changes. Because this file is created a step at a time, this is called delta "aggregating". |
|
|
|
| Assume that the data created by application A corresponded to the XML in Figure 3. When A's corpus created version A0, it also created delta A0'. A0' is the delta "script" defined in Figure 4, which contains instructions to upgrade an empty workplace to version A0. Furthermore, assume that application B, when it made its first update to version A0, simply changed the CITY from "San Jose, CA" to "San Francisco, CA". This would result in B1 being described by Figure 5, with B1' in Figure 6. |
|
Figure 5: Version B1
|
<ADDRESSBOOK> <ENTRY> <NAME>David Barret</NAME> <CITY>San Francisco, CA</CITY> </ENTRY> </ADDRESSBOOK> |
|
Figure 6: Delta B1' (From Version A0)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <BEFORE PATH="ENTRY[0]"/> <DESCEND> <BEFORE PATH="CITY"/> <DESCEND> <REMOVE>San Jose, CA</REMOVE> <ADD>San Francisco, CA</ADD> </DESCEND> </DESCEND> </DESCEND> |
|
| As shown in Figure 6, the only difference between A0 and B1 is the CITY. In A0 the CITY is set to "San Jose, CA", whereas in B1 the CITY is "San Francisco, CA". In this way, any arbitrary changes to XML data can be recorded in delta scripts. Specifically, a program "executing" the B1' delta script on a version A0 workspace would follow these steps: |
|
1.
|
Advance the cursor to before the ADDRESSBOOK node |
|
2.
|
Descend into the ADDRESSBOOK node |
|
3.
|
Advance the cursor to before the first ENTRY node |
|
4.
|
Descend into the first ENTRY node |
|
5.
|
Advance the cursor to before the CITY node |
|
6.
|
Descend into the CITY node |
|
7.
|
Remove the "San Jose, CA" string under CITY |
|
8.
|
Add a "San Francisco, CA" string under CITY |
|
| This process of executing the delta script to upgrade a workspace of one version to another version is called "applying" the delta to the workspace. Applying the B1' delta a version A0 workspace containing upgrades it to version B1. Likewise, applying a delta in reverse is called "repealing" the delta. In this way, the workspace can be downgraded from version B1 to version A0 by repealing B1'. |
|
|
|
| Now let's return to the version fork where both A2 and B2 versions derive from the same version (B1). Without merging, it's not possible for either application to converge with the other application's version without first repealing a delta, which results in data loss. The solution is to find some common version that incorporates changes from both A and B. This common version can be created through a process called "delta merging". |
|
| First, assume that delta A2', created by application A from version B1, simply changed the entry name from "David Barrett" to "David Barrett". At the same time, assume that delta B2', created by application B from version B1, added a whole new entry. The resulting deltas are described in Figure 7 and Figure 8, respectively. |
|
Figure 7: Delta A2' (From version B1)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <BEFORE PATH="ENTRY[0]"/> <DESCEND> <BEFORE PATH="NAME"/> <DESCEND> <REMOVE>David Barret</REMOVE> <ADD>David Barrett</ADD> </DESCEND> </DESCEND> </DESCEND> |
|
Figure 8: Delta B2' (From version B1)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <AFTER PATH="ENTRY[0]"/> <ADD> <ENTRY> <NAME>John Wolthuis</NAME> <CITY>San Diego, CA</CITY> </ENTRY> </ADD> </DESCEND> |
|
| Deltas A2' and B2' result in versions A2 and B2, which are shown in Figure 9and Figure 10. |
|
Figure 9: Version A2
|
<ADDRESSBOOK> <ENTRY> <NAME>David Barrett</NAME> <CITY>San Francisco, CA</CITY> </ENTRY> </ADDRESSBOOK> |
|
Figure 10: Version B2
|
<ADDRESSBOOK> <ENTRY> <NAME>David Barret</NAME> <CITY>San Francisco, CA</CITY> </ENTRY> <ENTRY> <NAME>John Wolthuis</NAME> <CITY>San Diego, CA</CITY> </ENTRY> </ADDRESSBOOK> |
|
| Note how both versions A2 and B2 are valid data files. They each have data that some application at some point thought was correct. An application given either version of the document would work fine. However, the version in Figure 9 is missing the second entry, and the version in Figure 10 has a misspelled name (delta A2' changed "Barret" to "Barrett"). As such, both are in some part correct, but neither are fully correct. With this in mind, all that must be done to find a common merged version is create a document that has both the correct name from Figure 9 and new entry from Figure 10. Somehow we need to create the resultant merged version AB3, as shown in Figure 11. |
|
Figure 11: Version AB3
|
<ADDRESSBOOK> <ENTRY> <NAME>David Barrett</NAME> <CITY>San Francisco, CA</CITY> </ENTRY> <ENTRY> <NAME>John Wolthuis</NAME> <CITY>San Diego, CA</CITY> </ENTRY> </ADDRESSBOOK> |
|
| This brings up the question: why "AB3"? First, it should be noted that this version nomenclature is exclusively used for this simple example -- a real implementation would surely use some much more powerful version naming scheme. However, "AB3" is meant to imply that both A and B made the change leading to the new version. It's not pretty, but it works. Besides, it's just a name. |
|
| Now, the real question is: what are the deltas that lead from versions A2 and B2 to AB3, the merged version? Interestingly enough, each previous delta can simply be applied directly to the other version, as there is no conflict. The resulting deltas are displayed in Figure 12 and Figure 13. |
|
Figure 12: Delta AB3' (From version A2)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <AFTER PATH="ENTRY[0]"/> <ADD> <ENTRY> <NAME>John Wolthuis</NAME> <CITY>San Diego, CA</CITY> </ENTRY> </ADD> </DESCEND> |
|
Figure 13: Delta BA3' (From version B2)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <BEFORE PATH="ENTRY[0]"/> <DESCEND> <BEFORE PATH="NAME"/> <DESCEND> <REMOVE>David Barret</REMOVE> <ADD>David Barrett</ADD> </DESCEND> </DESCEND> </DESCEND> |
|
| By examining the delta script in Figure 12 and Figure 13 alongside the versions in Figure 9 (A2) and Figure 10 (B2), it's apparent that the scripts would work just fine as written. For example, the first delta (delta AB3') descends the ADDRESSBOOK node, skips past the first entry, and then adds a second entry. This works just as well on version A2 just as it does on the version it was designed to modify, B1. Likewise, the second delta (BA3') descends the ADDRESSBOOK node, then the first ENTRY, then the NAME, and changes "Barret" to "Barrett". Again, this works without change on B2, even though it has one more entry than the delta was originally conceived for (version B1). The end result of this a single new "merged" version that succeeds both forked versions, version AB3. The version map with the new deltas and versions is illustrated in Figure 14. |
|
Figure 14: Merged Versions
|
|
|
| Additionally, a new delta can be created that leads directly from version B1 to the new merged version, AB3. This new delta is named ?3', and applying it to version B1 brings the workspace to AB3. The question mark is meant to imply that it's a merged delta: although it works just fine, it wasn't actually created by either application A or B. The ?3' merged-delta is displayed in Figure 15. |
|
Figure 15: Delta ?3' (From version B1)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <BEFORE PATH="ENTRY[0]"/> <DESCEND> <BEFORE PATH="NAME"/> <DESCEND> <REMOVE>David Barret</REMOVE> <ADD>David Barrett</ADD> </DESCEND> </DESCEND> <AFTER PATH="ENTRY[0]"/> <ADD> <ENTRY> <NAME>John Wolthuis</NAME> <CITY>San Diego, CA</CITY> </ENTRY> </ADD> </DESCEND> |
|
|
|
| The simultaneous changes made by application A and B are said to not "conflict" because those same changes can be applied to the results of the other changes (1) successfully, and (2) correctly. Each of these criteria can be taken in turn. |
|
| First, delta A2' can be successfully applied to B2, meaning the delta from one fork can be applied to the result of another fork, without logically breaking. The instructions describe valid operations that can be actually executed and don't require changing the exact same node, modifying a deleted node, or any other conflicting behavior. For this reason, it's said there are no "structural conflicts", as the structure of the version supports the proper execution of the delta script. |
|
| Second, and this one is much more subtle, delta A2' can be correctly applied to B2, meaning the resultant version contains valid data. Now, what is or is not valid data is very complex and dependent upon many human factors. For example, if for some arbitrary reason the "John Wolthuis" entry should only be added if "David Barrett" is in "San Jose, CA", the merging would violate this constraint and suffer from a "validity conflict". Some types of extremely complex relationship might be impossible to detect automatically. However, most intra-data relationships (meaning data dependencies within a single document) are not so complex. Using the document schema, it's generally possible to describe most real-world data constraints (such as the types of elements allowed, in what order, what ranges of values, etc). Accordingly, as long as all constraints are accurately captured by the schema, any document that conforms to the schema's constraints is guaranteed to contain valid data. Ideally, through the use of robust schemas, validity constraints become identical to structural constraints; no document can be structurally sound without containing valid data. |
|
|
|
| The ideal situation is to create applications in such a fashion that conflicts rarely occur. This strategy, which is surprisingly successful, is called "conflict avoidance". The key to avoiding conflicts is also the key to good database design: normalized data. Without detailing the various normalized forms, normalized data is essentially orthogonal (independent). In this way, changes to one datum have no repercussions or dependencies upon other data. For example, consider the names on a guest list. The individual entries in a list are completely independent; an entry can be added, removed, or modified completely independent of any other entry. As such, the guest list is said to be normalized. Of course, normalized data is only valuable to an automated system if it knows the data is in fact normalized. Given an arbitrary list of names it's not possible to tell whether the list is normalized (and the names are independent), or if the names are arranged in some complex pattern. As such, this is where the schema becomes valuable. |
|
| A document schema describes exactly what can and cannot occur within the document, governing the arrangement and values of all the document's data. The schema for the list might merely say "any number of NAMEs in any order, where each NAME contains a single alphanumeric string". Given this schema, it's apparent that any name can be removed or changed, or any new name can be added, without "breaking" the list's structure. Alternatively, the schema might be "any even number of NAMEs in alphabetical order, where each NAME contains a single alphanumeric string beginning with 'Mr.' or 'Mrs.'". Obviously, this second schema has more complex constraints on the actions that can be performed on the list. |
|
| As such, the more simple and normal a given schema is, the less likely conflicts will occur. |
|
|
|
| Before it was said that delta A2' could be safely applied to version B2, although it wasn't explained how this can be automatically determined. To justify this statement, consider the ADDRESSBOOK schema, as defined before: |
|
| [The ADDRESSBOOK schema] consists of one ADDRESSBOOK node, which contains any number of ENTRY nodes in any order. Also, each ENTRY node consists of a NAME and CITY node, both of which contain an arbitrary alphanumeric string. |
|
| With this in mind, let's return to the example at hand. Delta A2', like B2' is known to modify version B1. Also, version B2 is known to be just version B1 after delta B2' has been applied. Given this knowledge, delta A2' can be safely applied to version B2 if A2' doesn't conflict with delta B2'. For this reason, A2' and B2' must be compared. If they do not conflict, they can each be applied to the others' results, thereby merging the version fork. This conflict comparison is guided by ADDDRESSBOOK schema. Specifically, the individual instructions within the two delta scripts, called "left" (A2') and "right" (B2') for convenience, are iterated in parallel to perform a line-by-line search for conflicts: |
|
1.
|
The first two lines of both deltas are . This merely indicates that the parser should move its internal cursor to the same starting location within the ADDRESSBOOK for both deltas. It doesn't actually make any changes, so it cannot create a conflict. |
|
2.
|
The next line has on the left and on the right. The left indicates that the cursor should move forward to just before the ENTRY, while the right indicates that the cursor should move just after the ENTRY. Again, because these just move the internal cursor and make no changes, this line produces no conflicts. |
|
3.
|
Because the right cursor has advanced beyond the left cursor, anything the left does cannot possibly conflict with the right (aside from addressing, which is discussed later). As such, the left's changes to ENTRY[0]'s name can be safely performed. After the modification, the cursor is advanced forward to meet up with the right's cursor, which is after the ENTRY[0] node. |
|
4.
|
Incidentally, because the left has no more instructions, nothing that the right does can possibly conflict with the left. As such, the right's cursor is advanced to the end of the script. With both scripts completed, and no conflicts found, the comparison process is complete. |
|
| Basically, by scanning the two deltas in parallel, it can be positively determined that the two deltas do not conflict. Once determined, the deltas can each be applied to the opposite fork's workspace to bring both to the same version. In this way, the two forked versions can be merged into a new, single version as illustrated in Figure 14. The actual technology that performs this parallel comparison is called the "delta merger". From a functional standpoint, the delta merger has four inputs and three outputs. First, the inputs: |
|
>
|
Common Ancestor Version: Merging forked versions requires knowledge of the shared "ancestor" version -- essentially the version both deltas have "in common". For this example, the common ancestor version is B1. |
|
>
|
Document Schema: Also, the merger must know the structure, or schema of the document. |
|
>
|
Left and Right Pre-Deltas: Finally, the merger must actually have the two deltas to merge. These deltas are the actual deltas that, when applied to the common ancestor, produce the two forked versions. Because these deltas lead to the forked state, they are called "pre-deltas". For this example, the left and right pre-deltas are A2' and B2', respectively. |
|
>
|
Left and Right Post-Deltas: The whole purpose of the delta merger is to come up with deltas that, when applied to the two forked versions, lead to the same merged version. Because these deltas lead to the merged version, they are called "post-deltas". For this example, the left and right post-deltas are AB3' and BA3', respectively. |
|
>
|
Merged-Delta: Also, for future convenience, a single delta that leads directly from the common ancestor version to the new, merged version can be created. For this example, the merged-delta is called ?3'. |
|
| These deltas and versions are illustrated in Figure 16. |
|
Figure 16: Pre-, Post-, and Merged-Deltas
|
|
|
|
|
|
Automated Conflict Resolution
|
|
|
|
| However, conflicts can and do happen. Yet even when they do, they can usually be resolved automatically and with no application assistance. In the previous delta merging example, the post-deltas output by the delta merger were simply swaps of the inputs; the left post-delta was identical to the right pre-delta and vice versa. However, this is rarely the case. Indeed, while the post-deltas will surely bear some resemblance to the pre-deltas, the delta merger is free to make any type of modifications. The only requirement is that in the end the post-deltas take the forked versions to the same merged version, regardless what that version is. Thus, a more intelligent delta merger would be made aware of common conflicts, as well as supplied with the ability to automatically resolve them. For example, consider another fork after version AB3, as illustrated in Figure 17 and Figure 18. |
|
Figure 17: Delta A4' (From version AB3)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <BEFORE PATH="ENTRY[1]"/> <DESCEND> <BEFORE PATH="CITY"/> <DESCEND> <REMOVE>San Diego, CA</REMOVE> <ADD>San Francisco, CA</ADD> </DESCEND> </DESCEND> </DESCEND> |
|
Figure 18: Delta B4' (From version AB3)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <AFTER PATH="ENTRY[0]"/> <ADD> <ENTRY> <NAME>Erin Smith</NAME> <CITY>Dallas, TX</CITY> </ENTRY> </ADD> </DESCEND> |
|
| Just like before, application A modifies an existing ENTRY at the same time application B creates a new ENTRY. However, unlike before B adds the ENTRY to the list before the ENTRY A is modifying. This minor change actually causes a major conflict: |
|
1.
|
The first two lines of both deltas are . This merely indicates that the parser should move its internal cursor to the same starting location within the ADDRESSBOOK for both deltas. It doesn't actually make any changes, so it cannot create a conflict. |
|
2.
|
The next line has on the left and on the right. The left indicates that the cursor should move to just after the second ENTRY, while the right indicates that a it should move just to just before the first ENTRY. Again, because these just move the internal cursor and make no changes, this line produces no conflicts. |
|
3.
|
The left delta has advanced beyond the left cursor, implying that the right should be free to make modifications without conflict. Often this is the case, as the right could add, remove, and modify nodes with impunity. However, because it adds a node at the same level that the left cursor is currently traversing, the indexing of the left is affected by changes by the right. Thus, a conflict is created. |
|
| Consider the left delta: the line indicates that the cursor should move to just before the second ENTRY. However, before that instruction can be executed, the right adds a new node before the second ENTRY. This causes the indexing of those nodes to change. What was previously ENTRY[1] becomes ENTRY[2]. This type of conflict is called an "indexing conflict", as changes made by one delta affect the indices used by another. Luckily, this type of conflict can be easily resolved by simply rewriting all references to ENTRY[1] by the left delta to be ENTRY[2]. By training the delta merger to recognize and accommodate these common conflicts, many deltas can be merged entirely automatically. Some additional conflicts and their automated resolutions are listed below: |
|
>
|
Ordering Conflict: If two deltas attempt to add a node in the same place, for example, and the schema requires that nodes be sorted in a particular order, a naive merger may inadvertently add them in the wrong order. An intelligent merger would be aware of the ordering requirements, and apply the additions in the proper order. |
|
>
|
Distinctness Conflict: If two deltas attempt to add the exact same node, for example, and the schema requires that each node by completely unique, a naive merger might violate the schema. An intelligent merger can instead recognize that the nodes are identical, and choose to only add one, or raise a conflict. |
|
>
|
Remove-Remove Conflict: Nodes are generally implied to represent finite quantities, and two deltas attempting to remove the exact same node at the same time would generate a "remove-remove conflict". However, an intelligent merger might instead allow nodes to be labeled "intangible", such that redundant removes cause no conflict. |
|
>
|
Modify-Remove Conflict: Information loss is assumed to always a problem, and one delta modifying a node that is simultaneously removed should by default cause a conflict. However, applications could pre-authorize a merger to disregard information for particular nodes in this situation, avoiding a conflict. |
|
>
|
Modify-Modify Conflict: In the event two applications make modifications to the exact same data at the exact same time, a "modify-modify conflict" should by default be generated. However, applications could pre-authorize a merger to automatically choose one over the other or somehow combine values, thereby avoiding a conflict. |
|
| These are just a few possible conflicts and automated resolutions. Additionally, these only assume very basic modification functionality. Assuming the delta script were made more robust, such as including MOVE and COPY instructions, more conflicts would surely arise, each with their own possible automated solutions. |
|
| Regardless, assuming the delta merger were aware of the "indexing conflict", it would merge the A4' and B4' pre-deltas to create the post- and merged-deltas illustrated in Figure 19, Figure 20, and Figure 21. |
|
Figure 19: Delta AB5' (From version A4)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <AFTER PATH="ENTRY[1]"/> <ADD> <ENTRY> <NAME>Erin Smith</NAME> <CITY>Dallas, TX</CITY> </ENTRY> </ADD> </DESCEND> |
|
Figure 20: Delta BA5' (From version B4)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <BEFORE PATH="ENTRY[2]"/> <DESCEND> <BEFORE PATH="CITY"/> <DESCEND> <REMOVE>San Diego, CA</REMOVE> <ADD>San Francisco, CA</ADD> </DESCEND> </DESCEND> </DESCEND> |
|
Figure 21: Delta ?5' (From version AB3)
|
<BEFORE PATH="ADDRESSBOOK"/> <DESCEND> <AFTER PATH="ENTRY[1]"/> <ADD> <ENTRY> <NAME>Erin Smith</NAME> <CITY>Dallas, TX</CITY> </ENTRY> </ADD> <BEFORE PATH="ENTRY[2]"/> <DESCEND> <BEFORE PATH="CITY"/> <DESCEND> <REMOVE>San Diego, CA</REMOVE> <ADD>San Francisco, CA</ADD> </DESCEND> </DESCEND> </DESCEND> |
|
Figure 22: Version AB5
|
<ADDRESSBOOK> <ENTRY> <NAME>David Barrett</NAME> <CITY>San Francisco, CA</CITY> </ENTRY> <ENTRY> <NAME>Erin Smith</NAME> <CITY>Dallas, TX</CITY> </ENTRY> <ENTRY> <NAME>John Wolthuis</NAME> <CITY>San Francisco, CA</CITY> </ENTRY> </ADDRESSBOOK> |
|
Figure 23: Final Version Map
|
|
|
|
|
|
Assisted Conflict Resolution
|
|
|
|
| Sadly, not all conflicts can be automatically resolved. When validity constraints cannot be record in the schema, particularly when data is intentionally denormalized, it becomes increasingly difficult to find automatic solutions. In these situations the application itself must play a role in resolving the conflict. |
|
| For example, consider what would happen if applications A and B modify the CITY associated with "Erin Smith" to different values at the same time. The delta merger might have no way to automatically resolve the conflict. However, this scenario can be predicted at "compile time", when the applications are created. Because the schema is known, and all the automated responses are known, and all potential conflicts are known, every potential conflict that has no automated resolution is also known. The application designer can then "plug-in" a custom code fragment to execute when that conflict occurs. Indeed, the application could refuse to execute until either a predefined automated resolution or a custom, "assisted" resolution exists for every potential conflict. |
|
| This custom code fragment would be supplied with all the information available to the delta merger such that it can make its informed decision. This information includes: |
|
>
|
The previous value, from the common ancestor, |
|
>
|
The "proposed" values, from the left and right pre-deltas, |
|
>
|
Complete access to the common ancestor data, as well as the left and right pre-deltas, |
|
>
|
Complete access to the current left and right post-deltas, as well as the merged-delta, |
|
>
|
And complete access to all other documents at the same version of the common ancestor or left and right deltas. |
|
| Basically, all data that could possibly be of use to the decision making process is supplied to the assisted conflict resolution code fragment. And, if that's not enough, it is always free to open a dialog box and ask the user directly. In essence, if the code fragment cannot come to a decision regarding how to resolve the conflict, then nothing can. |
|
|
|
|