Data Structures

Wikipedia List of Data Structures

Fixed Arrays

Fixed arrays aren't often used directly in higher-level programming languages because they're cumbersome to work with, but they're an important building block for understanding the more common dynamic array.

A very basic array requires just two pieces of data: 1) the starting location in memory of the first item, and 2) the number of items in the collection. This makes it possible to access every item in the array without touching memory that doesn't belong to it. An array of five items that starts at memory position 1000 lets the user access the third item by offsetting the starting position by two (the third item will be at position 1002, assuming each item occupies one unit of memory). Indexes greater than or equal to the size are out of bounds.
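As a rough illustration of that address arithmetic, here is a sketch that simulates raw memory with a plain Ruby array (the FixedArray class and the positions used are purely illustrative):

class FixedArray
  class OutOfBoundsError < StandardError; end

  # memory: simulated backing store, base: starting position, size: number of items
  def initialize(memory, base, size)
    @memory = memory
    @base = base
    @size = size
  end

  # constant-time access: just offset the starting position by the index
  def [](index)
    raise OutOfBoundsError if index < 0 || index >= @size
    @memory[@base + index]
  end
end

memory = Array.new(2000)
memory[1000, 5] = [10, 20, 30, 40, 50] # five items starting at position 1000
array = FixedArray.new(memory, 1000, 5)
array[2] # => 30, read from simulated position 1002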

Elements in the array can be accessed and updated in constant time, but adding and deleting items, whether at the beginning, middle, or end, takes linear time. The entire array must be rewritten at a different place in memory because no space was left over for additions; the memory just past the end of the array is likely already in use by something unrelated.

A fixed array can still be useful in extremely memory-limited environments, since no space is wasted. But memory is usually cheap, and it's unlikely to be a higher priority than performance.

Linked Lists

Linked lists solve the problem of insertions and deletions, but you lose constant lookup time. A linked list consists of nodes, each holding a value and a reference to the next node. It's generally advisable for the list to keep track of both its first node and its last node. Adding a new node to the beginning involves creating the new node with a reference to the formerly first node and making it the new head. Adding one to the end involves setting the formerly last node's reference to the new node and making it the new tail.

Since the nodes are scattered throughout memory, and only each node knows where its successor is, you cannot access any one node without first walking through all the nodes before it. This produces linear lookup time, and inserting a node in the middle involves a linear search as well.

module LinkedList
   class List
      def initialize(value)
         # the list tracks both its first and last nodes
         @first = Node.new(value, nil)
         @last = @first
      end

      # add a node to the front: the new node points at the old head
      def prepend(value)
         node = Node.new(value, @first)
         @first = node
      end

      # add a node to the back: the old tail points at the new node
      def append(value)
         node = Node.new(value)
         @last.next_node = node
         @last = node
      end

      # walk the chain from the head, yielding each value in order
      def each
         node = @first
         while node
            yield(node.value)
            node = node.next_node
         end
      end
   end

   class Node
      attr_accessor :value, :next_node

      def initialize(value, next_node = nil)
         @value = value
         @next_node = next_node
      end
   end
end
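A quick usage sketch of the list above (the values are arbitrary):

list = LinkedList::List.new(2)
list.append(3)
list.prepend(1)
list.each { |value| puts value } # prints 1, 2, 3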

Dynamic Arrays

Dynamic arrays get around the limitations of the previous two data structures by allocating extra space for new elements to fit into. The array then needs to track both the total capacity allocated to it and the number of slots currently in use.

The question then becomes "how much extra space should be allocated?" Allocating too much space is wasteful; allocating too little causes the array to rebuild itself too frequently. The solution is to double the size whenever there is no more space left (table doubling). This creates geometric expansion: an array of size two will grow to four, then eight, then sixteen, and so forth. The array must be recreated at a linear time cost every time it is doubled, but it doesn't need to be doubled every time something is added. In fact, the recreations become so infrequent that over a large number of additions the copying work averages out to constant time per individual addition. This is known as amortization.

Deletions are also amortized to constant time. When the number of elements falls below a threshold (commonly around a quarter to a third of capacity), the array rebuilds itself with less space. Insertions in the middle still take linear time, since every element after the insertion point must be shifted over, and the same goes for insertions at the beginning of the array (although this may depend on the implementation).
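Here is a minimal sketch of the doubling behaviour, using a plain Ruby array as a stand-in for the fixed backing store (the class and method names are just for illustration, and shrinking is omitted):

class DynamicArray
  def initialize
    @capacity = 2
    @size = 0
    @store = Array.new(@capacity) # fixed-size backing store
  end

  def push(value)
    grow if @size == @capacity # out of room: double before writing
    @store[@size] = value
    @size += 1
  end

  def [](index)
    index < @size ? @store[index] : nil
  end

  private

  # copy everything into a backing store twice as large (linear, but rare)
  def grow
    @capacity *= 2
    new_store = Array.new(@capacity)
    @size.times { |i| new_store[i] = @store[i] }
    @store = new_store
  end
end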

Stacks and Queues

Stacks and queues are important and frequently used data structures, but also simple to implement and conceptualize. Stacks are last in first out (LIFO). The last thing to go in is the first thing to come out, as if you were _stack_ing coins on top of each other--you would take one from the top, not the bottom. Queues are first in first out (FIFO). Just like a line at the grocery store, the first person in line will be the first person served.

Many languages have built-in stacks and queues that you can use. A simple dynamic array performs well as a stack (just use the pop method to take the last item off the array). A queue would require the shift method to take the first item from the array, but shift typically runs in linear time. A linked list is often a better fit for a queue (or even a stack), since adding at the back and removing from the front can both be done in constant time.
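A quick sketch of both using a plain Ruby array (whether shift is actually linear depends on the implementation; the point is the access pattern):

stack = []
stack.push(1)
stack.push(2)
stack.push(3)
stack.pop    # => 3, last in, first out

queue = []
queue.push(1)
queue.push(2)
queue.push(3)
queue.shift  # => 1, first in, first out (linear time in a naive dynamic array)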

Hash Tables

Hash tables are extraordinarily powerful data structures and are central to modern programming languages. But they still rely on plain arrays for their implementation. Hash tables are key-value stores (associative arrays) with constant expected lookup, insertion, and deletion time.

An array of buckets is the backbone of a hash table. It sets up separate buckets/slots for storing the data (i.e. the key and the value).

Hash tables use a hashing function to take a string (the key)--or anything that can be converted into a string, which in many languages can be any value--and convert it into an integer representation. That integer can then be reduced to an index by applying the modulo operator against the size of the underlying bucket array in which data will be stored. Since the hashing function is deterministic, the same string passed through it will always return the same integer, and will always map to the same position in the array.

There are more possible strings than there are 64-bit integers, so it's inevitable that two different strings will map to the same index. This is made even worse by the fact that the size of the array is far smaller still. Hash tables must compare the key that's saved in the array against the key that is trying to access it (or else they may overwrite an existing entry with a new entry that happens to map to the same index). They must also take steps to reduce such index collisions, and decide how to handle them when they're unavoidable.

A good hash function should provide a uniform distribution of hash values, meaning that strings aren't more likely to map to one integer than to any other.

There are many ways of dealing with collisions. Using linked lists (separate chaining) is a common method. In that case, each bucket holds a list and each entry becomes a node in it. When a collision is detected, a new node is added to the end of that bucket's list. This does create a worst-case linear run time: you need to iterate through each node in a list to find the one you're looking for, and if all keys map to the same index, you're doing no better than a plain linked list. A self-balancing binary search tree could also be used for each bucket, as Java's HashMap does for long chains, which guarantees logarithmic lookup within a bucket. However, the implementation becomes much more complicated and the performance will likely be worse for small tables.

A different approach is linear probing (a form of open addressing). When there is a collision, the table searches down the array of buckets until it finds an open one. This creates its own problems when the load factor (the ratio of used buckets to total buckets) gets high, but it can still be very efficient for small tables.

Most hash table implementations will expand (and shrink) the underlying bucket array once the load factor crosses a certain threshold, in a process similar to dynamic array table doubling. This greatly reduces the likelihood of collisions, and allows hash tables to maintain constant-time operations with high probability.
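Here is a minimal separate-chaining sketch. There is no resizing, the bucket count is fixed, and it leans on Ruby's built-in hash method, so the names and sizes are purely illustrative:

class ChainedHashTable
  Entry = Struct.new(:key, :value)

  def initialize(num_buckets = 16)
    @buckets = Array.new(num_buckets) { [] } # each bucket is a list of entries
  end

  def []=(key, value)
    bucket = @buckets[index_for(key)]
    # compare keys so a colliding entry doesn't overwrite an existing one
    entry = bucket.find { |e| e.key == key }
    if entry
      entry.value = value
    else
      bucket << Entry.new(key, value)
    end
  end

  def [](key)
    entry = @buckets[index_for(key)].find { |e| e.key == key }
    entry && entry.value
  end

  private

  # hash the key to an integer, then reduce it to a bucket index with modulo
  def index_for(key)
    key.hash % @buckets.size
  end
end

table = ChainedHashTable.new
table["name"] = "Ada"
table["name"] # => "Ada"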

Sets

Sets are collections without repeated values. They're not designed to retrieve values, but are instead used to simply test if a value is present. This can be useful, for example, when traversing a graph to keep track of visited nodes and ensure that no node gets visited multiple times.

They can be implemented in multiple ways using other data structures. Self-balancing binary search trees can preserve an order for the set, and will guarantee logarithmic run time for most operations. Hash tables are already very similar to sets: they don't allow duplicate keys, and they grant constant lookup time with high probability. The implementations can effectively be almost identical, with a hash needing to store a key and a value, and the set only needing the key.
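In Ruby, for example, the standard library's Set is itself backed by a Hash; a stripped-down version of the same idea looks something like this (SimpleSet is just an illustrative name):

class SimpleSet
  def initialize
    @hash = {} # only the keys matter; the value is a placeholder
  end

  def add(item)
    @hash[item] = true
    self
  end

  def include?(item)
    @hash.key?(item)
  end
end

visited = SimpleSet.new
visited.add(:node_a)
visited.include?(:node_a) # => true
visited.include?(:node_b) # => false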

Heaps

Heaps are tree-like data structures that are usually implemented with arrays. In a binary heap, each node has at most two child nodes. In max heaps, children have smaller keys than their parents, and the left and right children have no ordering relative to each other (the left key can be larger or smaller than the right key). In min heaps, children always have larger keys than their parents. Additionally, binary heaps are complete trees: every level is full except possibly the last, which is filled from left to right.

For max heaps, whenever the root node is removed the heap reorganizes itself so that the larger child rises to take its place. This means nodes can always be removed in sorted order, which is how heapsort works. It also makes heaps useful for building priority queues: a normal queue is simply first in, first out, but a priority queue lets you specify the order in which nodes are removed, regardless of when you put them in.

In general, heaps work by appending nodes to an array. The index of a node's left child can always be calculated by taking the node's index, multiplying it by two, and adding one; the index of the right child is the node's index times two plus two. Going the other way, the parent of the node at index i is at (i - 1) / 2, using integer division.

After appending a new node to the array, you compare it against its parent to see whether the heap condition is violated (the parent should be greater in a max heap, or smaller in a min heap). If it is, you swap the two and repeat the check with the new parent, recursively bubbling the node up until the condition is met.

module PriorityQueue
    def self.from_array(array)
        new_heap = Heap.new
        array.each { |val| new_heap << val }
        new_heap
    end

    class Node
        include Comparable
        attr_accessor :name, :priority

        def initialize(name, priority)
            @name = name
            @priority = priority
        end

        def <=>(other)
            @priority <=> other.priority
        end
    end

    class Heap
        attr_reader :elements

        def initialize
            @elements = []
        end

        def <<(element)
            @elements << element
            bubble_up(@elements.size - 1)
            @elements
        end

        def pop
            return nil if @elements.empty?

            # move the last element to the root, remove the old root, then restore order
            exchange(0, @elements.size - 1)
            max = @elements.pop
            bubble_down(0)
            max
        end

        private
        def bubble_up(index)
            parent_index = (index - 1) / 2

            # base case: stop at the root or when the parent is greater than the new node
            return if index == 0 || @elements[parent_index] >= @elements[index]

            exchange(index, parent_index)

            bubble_up(parent_index)
        end

        def bubble_down(index)
            child_index = (index * 2) + 1

            # base case: no children, so the node is already in place
            return if child_index > @elements.size - 1

            # pick the larger of the two children (if a right child exists)
            right_child_exists = child_index < @elements.size - 1
            left_element = @elements[child_index]
            right_element = @elements[child_index + 1]
            child_index += 1 if right_child_exists && right_element > left_element

            # stop once the parent is at least as large as its larger child
            return if @elements[index] >= @elements[child_index]

            exchange(index, child_index)
            bubble_down(child_index)
        end

        def exchange(source, target)
            # swap two elements of the array
            @elements[source], @elements[target] = @elements[target], @elements[source]
        end
    end
end
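A quick usage sketch of the priority queue above (the task names and priorities are arbitrary):

queue = PriorityQueue::Heap.new
queue << PriorityQueue::Node.new("low priority task", 1)
queue << PriorityQueue::Node.new("urgent task", 10)
queue << PriorityQueue::Node.new("normal task", 5)

queue.pop.name # => "urgent task"
queue.pop.name # => "normal task"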

Search Trees

Binary search trees are binary trees whose nodes obey the following invariant: the key of the left child is always less than its parent's, and the key of the right child is always greater than its parent's. This property allows for adding and searching in potentially logarithmic time.

module BinarySearchTree
    class Tree
        def initialize(value)
            @root = Node.new(value)
        end

        def add(val)
            @root.add(val)
        end

        def includes?(val)
            @root.includes?(val)
        end
    end

    class Node
        def initialize(val)
            @value = val
            @left = nil
            @right = nil
        end

       def add(new_val)
           if new_val < @value
               insert_left(new_val)
           elsif new_val > @value
               insert_right(new_val)
           else
               # do nothing if value already exists
               return false
           end
       end

       def includes?(val)
          if val < @value
              # recurse left subtree or false base case if nil
              @left ? @left.includes?(val) : false
          elsif val > @value
              @right ? @right.includes?(val) : false
          else
              # Base case: value is found, return true
              return true
          end
       end

       private
       def insert_left(new_val)
           # recurse the left subtree until it's empty, then set that value
           @left ? @left.add(new_val) : @left = Node.new(new_val)    
       end

       def insert_right(new_val)
           @right ? @right.add(new_val) : @right = Node.new(new_val)
       end
    end
end
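A brief usage sketch of the tree above (the values are arbitrary):

tree = BinarySearchTree::Tree.new(50)
tree.add(25)
tree.add(75)
tree.add(60)

tree.includes?(60) # => true
tree.includes?(10) # => false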

The main issue with the BST above is that the order in which elements are added matters, and will affect the performance of searches and additions. If elements are added in sorted order, each new element becomes the right child of the previous one, and you essentially end up with a linked list. Luckily, there are strategies for maintaining logarithmic lookup time by rebalancing the tree every time a node is added or deleted. Two well-known types of self-balancing trees are AVL trees and Red-black trees. Additionally, B-trees are a common form of non-binary, self-balancing tree.

AVL Trees

AVL trees get rebalanced whenever the heights of the two child subtrees of any node differ by more than one. This difference is known as the node's balance factor. Thus, if a node is missing both children, its balance factor is (0 - 0) zero. If it has a left child with no children of its own, and no right child, then its balance factor is (0 - 1) negative one. If the opposite is the case (a single right child), its balance factor is (1 - 0) positive one. When a balance factor hits two or negative two, the tree must be rebalanced. Remember that the number of nodes in each subtree doesn't matter; it's the height that matters.

Each node stores its own balance factor, and every time a node is added or deleted, that node's ancestors must be retraced to update their balance factors.

Rebalancing is handled by tree rotations, of which there are four types. A simple left rotation occurs when the right subtree is heavy: a parent node A becomes the left child of its former right child B, which becomes the root of the rebalanced subtree, and B's former left child is transferred over to become A's new right child.

If B's left child, C, has a higher balance factor than B's right child, D, then a double rotation is required. First, C becomes A's right child, B becomes C's right child, and C's old right child becomes B's left child. C then gets promoted again: A becomes C's left child and C's old left child becomes A's right child. At this point, the subtree is balanced again. Rotations can also be performed in the other direction (towards the right) when the conditions are mirrored.
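As a rough sketch of the simple left rotation described above (the AvlNode struct is hypothetical and only tracks its children; real AVL code would also update heights and balance factors):

AvlNode = Struct.new(:value, :left, :right)

# rotate the subtree rooted at node A to the left and return the new root B
def rotate_left(a)
  b = a.right
  a.right = b.left   # B's old left child becomes A's new right child
  b.left = a         # A becomes B's left child
  b                  # B is now the root of this subtree
end

# a right-heavy chain 1 -> 2 -> 3 becomes a balanced subtree rooted at 2
a = AvlNode.new(1, nil, AvlNode.new(2, nil, AvlNode.new(3)))
root = rotate_left(a)
root.value       # => 2
root.left.value  # => 1
root.right.value # => 3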

Red-black Trees

BSTs must obey four additional rules to be Red-black trees, which ensure logarithmic run time complexity.

  1. Each node is either red or black: The color can be stored on each node as a single bit (true/false).
  2. The root and the leaves are black: The leaves are nil nodes--they don't contain data and are the children of the bottom nodes in the tree.
  3. Both children of every red node are black.
  4. Every path from any node to its nil descendants contains the same number of black nodes.

Rules 3 and 4, taken together, result in the longest path from the root to a nil leaf being no more than twice the length of the shortest such path: the shortest possible path is all black nodes, the longest alternates red and black, and both must contain the same number of black nodes.

Nodes are inserted as red nodes, and then the new node and its ancestors are rotated and recolored to restore the properties above. There are four cases you will encounter that require these changes:

  1. When the new node, N, is the root node, it simply needs to be recolored black. This change satisfies rule 2.
  2. When both the parent, P, and uncle, U, of N are red, then P and U are recolored black, and the grandparent, G, of N is recolored red. This is applied recursively, since recoloring G can affect its ancestors.
  3. When P is red, U and G are black, and N is on the inside of the tree (either it's the right child of P while P is a left child, or vice versa), a rotation occurs. N gets promoted to P's place and P moves to the outside position below N. This will still violate rule 3, so case 4 runs to fix it.
  4. When P is red, U and G are black, and N is on the outside of the tree (either it's the left child of P while P is a left child, or the right child of P while P is a right child), a rotation around G occurs. P is promoted, G becomes P's other child (the one that's not N), and P's former other child (N's sibling) becomes G's inside child. Finally, P is recolored black and G red.

B-trees

Nodes in B-trees can have more than two children. This creates a flatter structure that optimizes reads and writes of large blocks of data, which makes them a good choice for use in databases.

B-trees must obey the following rules:

  1. Every node has at most m children.
  2. Every non-leaf node (except the root) has at least m/2 (rounded up) children.
  3. The root has at least two children if it is not a leaf node.
  4. A non-leaf node with k children contains k-1 keys.
  5. All leaves appear in the same level.

If m is 5, each node will have at most 4 keys. The root here can have 5 children, each with at least 2 keys and at most 4. For example, a root node of [50, 100, 150, 200] can have the children [10, 20, 30, 40], [60, 65, 72, 81], etc. The keys of each child node must fall within the range defined by its parent: the second child here must have keys between 50 and 100. Keys within a node must be maintained in sorted order.

Searching is similar to searching a BST. At each node you scan the keys to find the range the target value falls in, then descend into the corresponding child. Luckily, the number of keys per node is a small constant factor, so logarithmic search time is preserved.

When adding a new key, you simply search for its correct position and, if there is room in the leaf node, add it. If the node is full, you split it: choose the median among the existing elements plus the new element, put everything smaller into a new left node and everything larger into a new right node, and insert the median (the separation value) into the node's parent. If that causes the parent to overflow, repeat the process recursively.

Tries

A Trie (pronounced as either "tree" or "try," but saying "try" differentiates it from normal trees) is a type of search tree that can be used to implement other data structures (like associative arrays), but is most frequently used for incremental, prefix-based text search. In a prefix tree, for example, the key for each node is a character, and the path through a node's ancestors and descendants spells out possible words or phrases.
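A bare-bones prefix-tree sketch (only insert and prefix lookup; the class and method names are illustrative, and each node is just a hash from character to child):

class Trie
  def initialize
    @root = {}
  end

  def insert(word)
    node = @root
    word.each_char { |char| node = (node[char] ||= {}) }
    node[:terminal] = true # mark that a complete word ends here
  end

  # true if any inserted word starts with the given prefix
  def starts_with?(prefix)
    node = @root
    prefix.each_char do |char|
      node = node[char]
      return false unless node
    end
    true
  end

  # true only if the exact word was inserted
  def include?(word)
    node = @root
    word.each_char do |char|
      node = node[char]
      return false unless node
    end
    !!node[:terminal]
  end
end

trie = Trie.new
trie.insert("cart")
trie.starts_with?("car") # => true
trie.include?("car")     # => false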

Graphs

Graphs themselves are generally fairly simple data structures, but the algorithms used to analyze them can get very complex. They're like trees, except a node can have multiple parents (and cycles are allowed). There are multiple ways to represent them, but most of the time you're going to want an adjacency list. Vertices are the nodes in a graph, and the connections between vertices are edges. Edges can have weights (like roads between cities having variable lengths) and can be directed (you may be able to go from one vertex to another, but not come back).
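An adjacency list can be as simple as a hash from each vertex to the vertices it connects to. Here is a small directed, weighted sketch (the cities and distances are made up):

# each vertex maps to a hash of neighbor => edge weight
graph = Hash.new { |hash, key| hash[key] = {} }

def add_edge(graph, from, to, weight)
  graph[from][to] = weight # directed: only from -> to is recorded
end

add_edge(graph, "Springfield", "Shelbyville", 12)
add_edge(graph, "Springfield", "Capital City", 30)
add_edge(graph, "Shelbyville", "Capital City", 25)

graph["Springfield"] # => {"Shelbyville"=>12, "Capital City"=>30}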
