Back to Basics: Manage Collections with a Custom Dictionary

n Part 1 of this series, I showed you an elementary vector implementation for Brew as a stand-alone component you can use in your applications. This time, I give you a dictionary with a similar interface you can use in your Brew applications. You can download both the vector and hash here.

Introducing the Dictionary
The dictionary, also known as the associative array or hash, is a data structure that provides an interface for storing key-value associations. Typically, the name is a string, and the value may be anything, although in languages like C, you’re hard-pressed to store multiple data types in a single dictionary unless you use unions with a type identifier so you know the type of a specific entry.

Most dictionaries are based on a vector and a hash function that is responsible for helping determine where each item should be inserted in the vector by operating on the item’s key. For performance, my implementation does not reuse the vector implementation I presented previously, because the storage characteristics of the dictionary (more on this later) are different than your garden-variety vector. This implementation uses a collection of nodes each containing the name and item being stored, along with its hash for fast comparison:

typedef struct Node{  uint32 hash;  char *szKey;  void *pValue;} Node;

Choosing a good hash function is tricky, and there’s a lot of fact as well as folklore on the Internet. Theoretically, the problem is simple: you want a function that distributes the keys across the entire range of the collection’s indexes in relatively few operations. In practice, this is quite difficult, and much has been written on the topic. I began with a naïve implementation that worked for testing, and after some research, selected the Jenkins one-at-a-time hash. Of course, even good hash functions occasionally return the same value for two different inputs (that is, the inputs collide). When that happens, you need a tie-breaker to decide which item should be stored where. For simplicity, I use linear probing (walking to the next empty slot) to find the next available spot; the alternative, keeping items with the same hash in the same location chained via a linked list, just seemed more work than it was worth to implement.

Author’s Note: If you’re interested in learning more about selecting hash functions, I encourage you to read the resources cited in Related Resources.

I should remark that Brew provides a similar component, the IRamCache component, based on the IRecordStore implementation. However, it uses a singly-linked list for its implementation, meaning that the search time for different elements can be grossly different. Moreover, it copies keys and values, making it impossible to store entries such as pointers to other Brew interfaces.

The Dictionary Interface
The dictionary interface inherits from IBase just as the vector does, and is very similar (see Listing 1).

Those who read Part 1 will recognize the function names, although the signatures have changed to better meet the needs of a dictionary as opposed to a hash. While the vector and dictionary components were never meant to be caste-compatible, they need to be as similar as possible. This similarity ensures that developers using one kind of collection can easily transfer their knowledge to the other.

Under the hood, the dictionary structure is simple:

typedef struct CDictionary{  struct IDictionaryVtbl *pvt;  uint32 nRefs;  uint32 maskElements;  uint32 consumed;  Node *pElements;  PFNFREE pfnDeallocator;  uint32 cursor;  // The virtual table  IDictionaryVtbl vt;} CDictionary;

Of course, the CDictionary structure underlying the data must be cast-compatible with the IDictionary interface; this is the purpose of the first element. The second field is the reference count used when you implement the reference counting required by IBase. The remainder of the fields constitute the data fields for the dictionary itself:

  • maskElements: This field is a bit mask indicating the number of nodes in the collection for the dictionary.
  • consumed: This field indicates the number of nodes currently in use in the collection for the dictionary.
  • pElements: This field is a pointer to the collection of nodes containing the dictionary’s data.
  • pfnDeallocator: This field is a pointer to the destructor used by the dictionary to release a node’s value when it is deleted or replaced.
  • cursor: This field is used by the enumeration interface while an enumeration is in process to store the state of the enumeration.

Setting up a dictionary is not unlike setting up any other component: allocate memory for the object, initialize its fields, initialize the virtual table, and then link the pointer to the virtual table to the table itself (see Listing 2).

The only thing worth noting here is that the dictionary provides a default destructor, DefaultFree, which simply calls Brew’s FREE function. If the dictionary was to store something else?vectors, say?I’d pass in a different destructor (such as IVECTOR_Release()).

Inserting Items in the Dictionary
Inserting an item in a dictionary is more difficult than inserting it in a vector, because not only do you have to resize the dictionary when it becomes full, you also have to calculate where the item is going to reside. This involves hashing the key for the item as well as probing in the event of collisions.

While this article takes a bottom-up approach to writing the insertion function Dictionary_ReplaceAt, it’s easier to understand looking at it from the top down, walking through the different operations it has to perform depending on the state of the dictionary (see Listing 3).

Begin with the usual error checking: validating the incoming instance and other arguments. NULL or zero-length keys are not premitted, either; it makes no sense to the dictionary. Next, grow the dictionary if it needs more space and delete any existing item at the same location before inserting a new node. This code isn’t as efficient as it could be in the name of clarity, because Dictionary_Delete computes the hash for the object, something you’ll do again only a few lines later. But insertions are more expensive than deletions, and this dictionary, like most, is tuned for fast fetches, rather than insertions (for a data structure that better balances the cost of both insertions and fetches, consider a balanced tree). Once the node is built up on the stack, insert it in the collection using Dictionary_Insert, which creates the new node based on the node you pass.

Growing the collection of items is a little trickier than with a vector, for two reasons. First, you don’t want to wait until the collection is full?to do so would create lots of collisions as the dictionary fills. Instead, you want to grow the collection when it’s a little less than full. Most agree that a good threshold is at 75 percent of capacity:

static __inline boolean Dictionary_NeedsMore( CDictionary *pMe ){  return (boolean) ( ( 4 * pMe->consumed ) / pMe->maskElements > 3 ); }

The second challenge is that when the collection grows, the location of each item changes, because its location is a function of the hash value. Consequently, when re-sizing the collection, all of the items must be re-inserted. This is very expensive, and while it doesn’t occur very often, is one of the known performance penalties when using a hash table to store a dictionary (see Listing 4).

The first thing to note about how I’m growing the underlying collection is that I always double the size of the collection?that is, the size of the container for the collection is always a power of two. I do this for two reasons; first, many hash functions expect this, so this allows me to try different hash functions when I was initially developing the implementation. Second, it makes it exponentially less likely that another reallocation will be required. This model works well for the typical use of a dictionary, in which the client creates a dictionary, stashes away a bunch of name-value pairs, and then spends most of the time just querying the dictionary. The code allocates the new store, and if successful, iterates over the old store and re-inserts each existing node into the store, using the same insertion routine that Dictionary_ReplaceAt does.

Insertion is trickier, too, because if the hash function collides, you have to walk sequentially through the remaining nodes and wrap back around the beginning of the collection to find an empty slot (see Listing 5).

The code begins by computing the ideal location for the item based on its hash value. This operation explains why you use a bit mask instead of a count to store the number of elements in the collection; you can compute the appropriate position using a binary-AND, as opposed to the more expensive modulo. This is less important for insertion performance than it is for search performance, but you do the same in each case. Then, walk forward from that point on until you find a slot with no key; if you don’t find one probing forward, begin again at the beginning of the collection and walk until you’ve exhausted all elements (you’re guaranteed to find an empty element at some point during this search, because you reallocate the array when it’s only 75 percent full).

Accessing an Item
Accessing an item in the hash can be done one of three ways:

  • Using IDICTIONARY_Get and passing a key to the item you seek.
  • Enumerating across all items using IDICTIONARY_EnumInit and IDICTIONARY_EnumNext.
  • Mapping a function to all elements or a subset of all elements using IDICTIONARY_Comprehend or IDICTIONARY_ForEach.

These are the same general operations you can perform on the vector component described previously, although with the dictionary there’s one important distinction: there’s no sort order imposed on elements using any of these methods. Elements are returned in the order they reside in the collection, which is a function of the hash function, the key for each item, and the number of items in the collection, as you can see from the insertion implementation.

The implementation of IDICTIONARY_Get deserves attention, because it’s the basis for finding an element for deletion as well:

int Dictionary_Get( IDictionary *p, const char *szKey, void **ppObject ){  CDictionary *pMe = DICTIONARY_FROM_INTERFACE( p );  uint32 index = 0;  int result = EBADPARM;  if ( !pMe || !szKey || !ppObject || STRLEN( szKey ) == 0 ) return result;  *ppObject = NULL;  result = Dictionary_FindIndex( pMe, szKey, &index );  if ( result == SUCCESS )  {    *ppObject = pMe->pElements[ index ].pValue;  }  return result;}

Aside from error checking, this doesn’t do much; it defers its real work to Dictionary_FindIndex, copying the result at the returned index into the pointer provided by the caller. Dictionary_FindIndex must hash the key, test the element at that key, and if it doesn’t match, probe until it either finds the missing item or searches the entire collection (see Listing 6).

This code begins by computing the ideal location?the hash of the object modulo the number of items in the collection, and tests that element first. If it doesn’t match, it walks to the end of the list, and wraps around searching for the desired element only if it doesn’t find a match. Remember to check the hash values for equality first, but also check the equality of keys, in case the hash values collide!

Enumeration and mapping?also called completion, a concept taken from languages such as Haskell that provide a clear syntax for completing functions over lists?use a cursor and must check each node explicitly to see if it has an associated key. By testing against the key and not the item, users of the collection can differentiate between items that are not set, items that are set that have a NULL value, and those with a specific value (see Listing 7).

Initializing the enumerator consists of just setting the cursor to zero; each call to Dictionary_EnumNext returns the next key and value with a set key, incrementing the cursor one or more times as necessary until the next item in the collection has past.

The implementation of Dictionary_Comprehend works the same way, except that it also contains the enumeration loop directly.

Deleting an Item
To delete an element, it first needs to be found. Dictionary_Delete does this using Dictionary_FindIndex, which you saw previously:

int Dictionary_Delete( IDictionary *p, const char *szKey ){  CDictionary *pMe = DICTIONARY_FROM_INTERFACE( p );  uint32 index = 0;  int result = EBADPARM;  if ( !pMe || !szKey || STRLEN( szKey ) == 0 ) return result;  result = Dictionary_FindIndex( pMe, szKey, &index );  if ( result == SUCCESS )  {    if ( pMe->pfnDeallocator && pMe->pElements[ index ].pValue )     {      (pMe->pfnDeallocator)(pMe->pElements[ index ].pValue);    }    pMe->pElements[ index ].pValue = NULL;    FREE( pMe->pElements[ index ].szKey );    pMe->pElements[ index ].szKey = NULL;    pMe->pElements[ index ].hash = 0;    pMe->consumed--;  }  return result;}

After validating its arguments, the routine finds the index of the key in question, and if it’s found, uses the dictionary’s deallocator to free the item. Next, it nulls the pointer to the item and frees the key, nulling the key as well to indicate the slot is empty and available for reuse. Finally, the hash is set to zero. Dictionary_DeleteAll does the same, but across all set elements in the dictionary’s collection.

Useful and Reusable
Using Brew’s component-oriented approach makes it easy to create reusable data structures that you can share between your applications. Dictionaries and vectors benefit greatly from this approach because they can be foundation collection in many applications and share a common conceptual interface. Feel free to leverage these in your own applications!

Share the Post:
Share on facebook
Share on twitter
Share on linkedin

Overview

The Latest

microsoft careers

Top Careers at Microsoft

Microsoft has gained its position as one of the top companies in the world, and Microsoft careers are flourishing. This multinational company is efficiently developing popular software and computers with other consumer electronics. It is a dream come true for so many people to acquire a high paid, high-prestige job

your company's audio

4 Areas of Your Company Where Your Audio Really Matters

Your company probably relies on audio more than you realize. Whether you’re creating a spoken text message to a colleague or giving a speech, you want your audio to shine. Otherwise, you could cause avoidable friction points and potentially hurt your brand reputation. For example, let’s say you create a

chrome os developer mode

How to Turn on Chrome OS Developer Mode

Google’s Chrome OS is a popular operating system that is widely used on Chromebooks and other devices. While it is designed to be simple and user-friendly, there are times when users may want to access additional features and functionality. One way to do this is by turning on Chrome OS