2-3 Tree

Files:  tree23.c, tree23.h


A B-tree maintains its balance by ensuring that the paths from the root to
leaves are all equal in length.  The 2-3 tree is one of several variations of
B-trees.    A node in a 2-3 tree has either two or three children, except for
the special cases where the tree is empty or contains only one node.  All data
items in the tree are stored as leaf items of the tree.  From the way the tree
is maintained, leaf items are all stored at the same level in the tree.  The
internal nodes of the tree are used for guiding the search when trying to
locate a data item in the tree.  For this purpose, each internal node of the
tree has two keys.  A node of a 2-3 tree is illustrated below:


                            -----------------------
                           |           |           |
                           | key_item1 | key_item2 |
                           |           |           |
                           |-----------------------|
                           |      |        |       |
                           | left | middle | right |
                           |      |        |       |
                            --/--------|--------\--
                             /         |         \
                           ...        ...        ...


Pointers to the left, middle, and right child nodes are maintained.  Note that
if the nodes children are leaves items of the tree, then these pointers will
point to data items, rather than nodes.  The keys of a node are determined as
follows:

   key1 - holds the value of the smallest (or left-most) data item in the
          subtree rooted at the middle child.
   key2 - holds the value of the smallest (or left-most) data item in the
          subtree rooted at the right child.

For the purpose of flexibility, the 2-3 tree implementation described here uses
pointers to the corresponding data item for each key:

   key_item1 - points to the minimum data item in the subtree rooted at the
               middle child.
   key_item2 - points to the minimum data item in the subtree rooted at the
               right child.

If a node in the 2-3 tree has three children, then all pointers will be used.
If a node in the 2-3 tree has two children, then the right child pointer, and
key_item2 will be NULL.  For the special case where there is only one data item
in the tree, the root node has all pointers NULL, except for the left child
pointer, which points to the data item.  For the case of an empty tree, the
root node still exists, but has all child pointers and key items set to NULL.

In the source files tree23.c and tree23.h, a union data type
tree23_link_t is used which allows child pointers to point to either a node,
or a data item.  In this way, p->left.node specifies to the left child node of
a node, p.  Similarly, p->left.item specifies the left child data item of a
node, p.  In order to know what kind of child pointers a node has, the
additional enumerated type field link_kind is used.  This uses the value
LEAF_LINK when child pointers specify data items, and the value INTERNAL_LINK
when the child pointer specify nodes.


The Find Operation
------------------

The find operation locates a data item stored in the 2-3 tree.  It requires a
pointer to a data item, key_item, and searches for a data item in the tree
which has the same key value.  The search begins from the root node.  At each
internal node in the search, we determine which child pointer needs to be
followed:

   - if key_item2 is not NULL and key(key_item) >= key(key_item2) then
        follow the right child pointer.
   - else if key(key_item) >= key(key_item1) then
        follow the middle child pointer.
   - else
        follow the left child pointer.

The search finishes once a leaf node is reached.  Special cases, where there
are only 0 or 1 items in the tree, are handled separately.  If a matching data
item is found, find returns a pointer to the item, otherwise find returns NULL.


The Find-Min Operation
----------------------

Find-min returns the minimum data item in the 2-3 tree.  It is possible to
implement find-min efficiently by maintaining a pointer to the minimum data
item in the tree.  This is the method used in the source file tree23.c.  If
a minimum pointer is not maintained, find-min can be implemented by searching
for the left-most data item in the tree.


The Insert Operation
--------------------

The insert operation inserts a new data item, u, into the 2-3 tree.  In order
to insert u into the tree, the correct insertion position must first be
located.  This is done using the same process as the find operation, eventually
arriving at some data item at leaf level in the tree.  Let p be the parent node
of this data item and let x be the left child of p, y the middle child, and z
the right child (if any).  One of these children is the item that was located
for the insertion position.  If this item has the same key as u, then insertion
fails, since duplication of keys in the tree is not allowed, otherwise
insertion can continue.  If p only has two children, then insertion of x is
simple since the number of nodes under p is easily increased to three.  The
following insertion cases are possible:

                         (y,_)   --->   (y,u) 
                         / |            / | \
  insert beside y       /  |           /  |  \
                       x   y ^        x   y   u
                             u


                         (y,_)   --->   (u,y)
                         / |            / | \
  insert beside x       /  |           /  |  \
                       x ^ y          x   u   y
                         u


                         (y,_)   --->   (x,y)     (special case)
                         / |            / | \
                        /  |           /  |  \
                     ^ x   y          u   x   y
                     u

In the above illustration, the node is shown as (k1,k2) where k1 and k2 are the
key items.  Insertion of u will always occur to the left of the item that was
located for the insertion position, except for the special case where u becomes
the new minimum (or left-most) item in the entire tree.  If u is not the
minimum node, then key(u) > key(x) since this comparison was performed during
the search when u was a key item of some node on the path to u.


If p has three children then the insertion of u is more complicated.  The
following leaf-level insertion cases are possible:


                     p (y,z)    --->  p (y,_)   q (u,_)   insert new node, q:
                       / | \            / |       / |       umin <- z
  insert beside z     /  |  \          /  |      /  |          u <- q
                     x   y   z ^      x   y     z   u      
                               u


                     p (y,z)    --->  p (y,_)   q (z,_)   insert new node, q:
                       / | \            / |       / |       umin <- u
  insert beside y     /  |  \          /  |      /  |          u <- q
                     x   y ^ z        x   y     u   z
                           u


                     p (y,z)    --->  p (u,_)   q (z,_)   insert new node, q:
                       / | \            / |       / |       umin <- y
  insert beside x     /  |  \          /  |      /  |          u <- q
                     x ^ y   z        x   u     y   z
                       u


                     p (y,z)    --->   p(y,z)   --->   p (x,_)   q (z,_) 
                       / | \            / | \            / |       / |
                      /  |  \          /  |  \          /  |      /  |
                   ^ x   y   z        u ^ y   z        u   x     y   z
                   u                    x

                    (special case)                        insert new node, q:
                                                            umin <- y
                                                               u <- q


Since a node cannot have 4 children a new internal node, q, is created which
takes two of the children, so that both nodes p and q have two children.  Node
q must then be inserted beside node p at the next level up in the 2-3 tree.

Insertion of internal nodes has slightly more complexity since the value of
keys in sub-trees must be kept track of.  For this purpose, umin is use for
keeping track of the minimum data item in sub-tree, u.  We use ymin and zmin to
indicate the keys corresponding to the minimum data items in subtrees y and z
respectively.  At the start of the insertion, these are available as the key
items of node p.  The different cases for inserting an internal node are shown
below:


                  p (ymin,zmin)  --->  p (ymin,_)  q (umin,_)   insert q:
                       / | \                / |         / |       umin <- zmin
  insert beside z     /  |  \              /  |        /  |          u <- q
                     x   y   z ^          x   y       z   u      
                               u


                  p (ymin,zmin)  --->  p (ymin,_)  q (zmin,_)   insert q:
                       / | \                / |         / |          u <- q
  insert beside y     /  |  \              /  |        /  |          
                     x   y ^ z            x   y       u   z
                           u


                    p (ymin,_)   --->  p (ymin,umin)   STOP
                         / |                / | \
                        /  |               /  |  \
                       x   y ^            x   y   u
                             u


                  p (ymin,zmin)  --->  p (umin,_)  q (zmin,_)   insert q:
                       / | \            / |             / |       umin <- ymin
  insert beside x     /  |  \          /  |            /  |          u <- q
                     x ^ y   z        x   u           y   z
                       u


                    p (ymin,_)   --->  p (umin,ymin)   STOP
                         / |                / | \
                        /  |               /  |  \
                       x ^ y              x   u   y
                         u


Note that there is no special case when inserting an internal node.  The
special case only arises for inserting a leaf data item that is smaller than
any other item in the tree.

When insertion occurs at the root level of the tree, a new root node, q, is
created, with the inserted node, u, and the old root node, r, as its children.
The pointer to the root node of the tree is then updated to q.


    r (k,_)  ^   --->   q (umin,_)    STOP
             u            /    |
                         /     |
                     r (k,_)   u
               

The Delete Operation
--------------------

The delete operation deletes a data item, u, from the 2-3 tree.  Item u is
located using the same process as the find operation.  As for insertion:
p denotes the parent of u; x, y, and z denote the left, middle and right
children of p respectively.  Note that u is one of x, y or z.  If the tree is
empty then delete fails.  There are also these special cases:

  Only one item in the tree (p is the root node):

                  p(_,_)    --->    p(_,_)    STOP
                   /
                  /
                 x
                 ^
                 u

  Only two items in the tree (p is the root node):

                    p(y,z)    --->    p(_,_)    STOP
                   / |                 
                  /  |                 
                 x   y                 x 
                     ^
                     u

                  p(y,z)     --->   p(_,_)    UPDATE_KEY and MERGE y
                   / |                 
                  /  |                 
                 x   y                 y
                 ^
                 u


For all other cases, if p has three children, then deletion is easy since two
children are left over.  However if p has only two children, deletion becomes
complicated since the 2-3 tree cannot allow a single child to be left over
after deletion.  The different cases when deleting a data item from the 2-3
tree are shown below:

  u = z:          p(y,z)    --->    p(y,_)    STOP
                   / | \             / |
                  /  |  \           /  |
                 x   y   z         x   y
                         ^
                         u


  u = y:          p(y,z)    --->    p(z,_)    STOP
                   / | \             / |
                  /  |  \           /  |
                 x   y   z         x   z
                     ^
                     u


                  p(y,z)    --->    p(_,_)    MERGE x
                   / |                 
                  /  |                 
                 x   y                 x 
                     ^
                     u


  u = x:          p(y,z)     --->   p(z,_)    UPDATE_KEY and STOP
                   / | \             / |
                  /  |  \           /  |
                 x   y   z         y   z
                 ^
                 u


                  p(y,z)     --->   p(_,_)    UPDATE_KEY and MERGE y
                   / |                 
                  /  |                 
                 x   y                 y
                 ^
                 u


Note that when u is a left child, there may be a key higher up in the tree
which needs to be updated.  The key corresponding to the most recent non-left
branch traversed on the path to p is the correct key to update.  The find
process can be extended to remember this key.  If u is the minimum item in the
tree, then no such key will exist, and there is no need to update a key.  In
this case, if a pointer to the minimum data item in the tree is being
maintained, it will need to be updated.

Consider the cases where p has just a single child left over after deletion.
Let this left over child be denoted by u, and its parent denoted by p.  For
now, assume that node p also has a parent, m.  Data item u must be merged with
child data items from a sibling of p.  Let this sibling be denoted by q, and
its children be denoted by x, y and z.  The different cases for the MERGE u
operation at leaf level are illustrated below:


  p is a right child:


           m(x,u)                  m(x,z)
            / | \                   / | \
           .  |  \                 .  |  \
          .   |   \               .   |   \
         .    |    \             .    |    \
              |     \                 |     \
              |      \                |      \
           q(y,z)  p(_,_)   --->   q(y,_)  p(u,_)    STOP
            / | \                   / |     / |
           /  |  \                 /  |    /  |
          x   y   z   u           x   y   z   u


           m(x,u)                  m(x,_)
            / | \                   / |
           .  |  \                 .  |
          .   |   \               .   |
         .    |    \             .    |
              |     \                 |
              |      \                |
           q(y,_)  p(_,_)   --->   q(y,u)    STOP    (node p is deleted)
            / |                     / | \
           /  |                    /  |  \
          x   y       u           x   y   u


  p is a middle child:


                   m(u,?)                  m(z,?)
                    / | \                   / | \
                   /  |  ?                 /  |  ?
                  /   |                   /   |    
                 /    |                  /    |     
                /     |                 /     |
               /      |                /      |
           q(y,z)  p(_,_)   --->   q(y,_)  p(u,_)    STOP
            / | \                   / |     / |
           /  |  \                 /  |    /  |
          x   y   z   u           x   y   z   u


                   m(u,k)                  m(k,_)
                    / | \                   / |
                   /  |  .                 /  .
                  /   |   .               /   .
                 /    |    .             /    .
                /     |                 /
               /      |                /             
           q(y,_)  p(_,_)   --->   q(y,u)            STOP   (node p is deleted)
            / |                     / | \
           /  |                    /  |  \
          x   y       u           x   y   u


                   m(u,_)          m(_,_)
                    / |               
                   /  |                
                  /   |                 
                 /    |                
                /     |              
               /      |          
           q(y,_)  p(_,_)   --->   q(y,u)    MERGE q    (node p is deleted)
            / |                     / | \
           /  |                    /  |  \
          x   y       u           x   y   u


  p is a left child:


                   m(x,?)                  m(y,?)
                    / | \                   / | \
                   /  |  ?                 /  |  ?
                  /   |                   /   |
                 /    |                  /    |
                /     |                 /     |
               /      |                /      |
           p(_,_)  q(y,z)   --->   p(x,_)  q(z,_)    STOP
                    / | \           / |     / |
                   /  |  \         /  |    /  |
              u   x   y   z       u   x   y   z


                   m(x,k)                  m(k,_)
                    / | \                   / |
                   /  |  .                 /  .
                  /   |   .               /   .
                 /    |    .             /    .
                /     |                 /     
               /      |                /      
           p(_,_)  q(y,_)   --->   q(x,y)            STOP
                    / |             / | \                (node p is deleted)
                   /  |            /  |  \
              u   x   y           u   x   y


                   m(x,_)          m(_,_)
                    / |             
                   /  |            
                  /   |               
                 /    |                
                /     |              
               /      |          
           p(_,_)  q(y,_)   --->   q(x,y)    MERGE q    (node p is deleted)
                    / |             / | \
                   /  |            /  |  \
              u   x   y           u   x   y


For two of the merging cases shown above, the parent of node p is left with
just one child, and merging must be carried out at the next level up in the
tree.  Maintaining keys in the tree requires slightly more insight when merging
nodes, as opposed to merging leaf data items.  Except for the maintenance of
keys, the merging of nodes, is exactly the same as for leaf level data items.
Labels x, y, z, and u, now denote nodes, rather than data items.  The labelling
xmin, ymin, umin, and zmin is used for labelling key data items to show where
keys are moved around in the tree.


  p is a right child:


        m(xmin,umin)                 m(xmin,zmin)
            / | \                        / | \
           .  |  \                      .  |  \
          .   |   \                    .   |   \
         .    |    \                  .    |    \
              |     \                      |     \
              |      \                     |      \
       q(ymin,zmin)  p(_,_)   --->   q(ymin,_)  p(umin,_)    STOP
            / | \                        / |     / |
           /  |  \                      /  |    /  |
          x   y   z   u                x   y   z   u


        m(xmin,umin)                 m(xmin,_)
            / | \                       / |
           .  |  \                     .  |
          .   |   \                   .   |
         .    |    \                 .    |
              |     \                     |
              |      \                    |
         q(ymin,_)  p(_,_)   --->   q(ymin,umin)    STOP
            / |                         / | \
           /  |                        /  |  \
          x   y       u               x   y   u


  p is a middle child:


                 m(umin,?)                  m(zmin,?)
                    / | \                       / | \
                   /  |  ?                     /  |  ?
                  /   |                       /   |    
                 /    |                      /    |     
                /     |                     /     |
               /      |                    /      |
      q(ymin,zmin)  p(_,_)   --->   q(ymin,_)  p(umin,_)    STOP
            / | \                       / |     / |
           /  |  \                     /  |    /  |
          x   y   z   u               x   y   z   u


                 m(umin,k)                    m(k,_)
                    / | \                      / |
                   /  |  .                    /  .
                  /   |   .                  /   .
                 /    |    .                /    .
                /     |                    /
               /      |                   /
        q(ymin,_)  p(_,_)   --->   q(ymin,umin)             STOP
            / |                        / | \
           /  |                       /  |  \
          x   y       u              x   y   u


                 m(umin,_)             m(_,_)
                    / |               
                   /  |                
                  /   |                 
                 /    |                
                /     |              
               /      |          
         q(ymin,_)  p(_,_)   --->   q(ymin,umin)    MERGE q
            / |                         / | \
           /  |                        /  |  \
          x   y       u               x   y   u


  p is a left child:


                 m(xmin,?)                        m(ymin,?)
                    / | \                            / | \
                   /  |  ?                          /  |  ?
                  /   |                            /   |
                 /    |                           /    |
                /     |                          /     |
               /      |                         /      |
           p(_,_)  q(ymin,zmin)   --->   p(xmin,_)  q(zmin,_)    STOP
                    / | \                    / |     / |
                   /  |  \                  /  |    /  |
              u   x   y   z                u   x   y   z


                 m(xmin,k)                       m(k,_)
                    / | \                         / |
                   /  |  .                       /  .
                  /   |   .                     /   .
                 /    |    .                   /    .
                /     |                       /     
               /      |                      /      
           p(_,_)  q(ymin,_)   --->   q(xmin,ymin)            STOP
                    / |                   / | \ 
                   /  |                  /  |  \
              u   x   y                 u   x   y


                 m(xmin,_)               m(_,_)
                    / |             
                   /  |            
                  /   |               
                 /    |                
                /     |              
               /      |          
           p(_,_)  q(ymin,_)   --->   q(xmin,ymin)    MERGE q 
                    / |                   / | \
                   /  |                  /  |  \
              u   x   y                 u   x   y


All of the merge cases illustrated above assumed that node p had a parent node,
m.  If node p is the root node, then m  will not exist.  For the case where
node p is the root node of the tree, node p is deleted, and node u becomes the
new root node in the tree.  This is illustrated below:


             p(_,_)    --->    u      STOP


                u


The Delete-Min Operation
------------------------

For the delete-min operation, the minimum data item in the tree is located and
deleted from the tree.  The source code for delete-min is simpler than that for
delete, since only merge cases where p is a left child can occur.  Unlike
find-min, maintaining a minimum pointer will not help improve the efficiency
of delete-min since the rearrangement of the 2-3 tree, which occurs after the
delete, takes O(log n) worst case time.  Note also that the minimum node must
be located in order to obtain the path to the minimum node.  The only way to
avoid performing the search for the minimum node would be to maintain parent
pointers for each node, or path information for the path to the minimum node.