Thursday, October 10, 2013

Notes on "sorting a hash table" in Python

Well, by definition you can not sort a hash table as "order" is not really important in that data structure. However, there are situations where data stored in a hashtable needs to be displayed in some ordered fashion. In such a case you can obtain a "sorted representation" of the hash table.
For instance, I had the following hashtable, which I wanted to display as tabular data:

td = {1: {"mykey1":[4,5,6, -2, 5, 6,7], "mykey2":[6,7,8]}, 2: {"mykey1":[5,7,8,9], "mykey2":[0, 9, 7, 6, 8]}, 3:{"mykey1":[5,7,8,9], "mykey2":[0, 9, 7, 6, 8,9]}}

My intention is to display:


  mykey1 mykey2
1 7 3
3 4 6
2 4 5

Essentially, the data is displayed based on the number of items in mykey1 and then sorted on mykey2. You could write code to flatten the hashtable, but you can write this more elegantly as follows:

def hashCountSort(h, k, r=False):

    def len_cmp(x, y):
      ln = 0
      for ck in k:
          ln = (len(x[1][ck]) - len(y[1][ck]))
          if (ln != 0): return ln
      return ln

    return sorted(h.iteritems(), cmp=len_cmp, reverse=r)

And use:
print hashCountSort(td, ["mykey1", "mykey2"], True)

Above, I use nested function in Python (essentially a closure) to define a len_cmp function, that iterates through a list of keys. Forward key comparisons are only made if the earlier key compare returned an equality. Also note that the iteritems() converts individual key, value pairs of the outer hashtable to an iteratable tuple list, with each tuple containing (key, value) items.

More generally one may write:

def hashSort(h, k, cmpf, r=False):

    def hash_cmp(x, y):
      res = 0
      for ck in k:
          res = cmpf(x[1][ck], y[1][ck])
          if (res != 0): return res
      return res

    return sorted(h.iteritems(), cmp=hash_cmp, reverse=r)

And use:
def cmp(x, y):
   return (len(x) - len(y))

print hashSort(td, ["mykey1", "mykey2"], cmp, True)

In the above example, I am using an external user provided function that may be defined by the user indicating how exactly the comparison function should be made for the purpose of sorting. One should note that the cost of such functions is high, and one may need to optimize if you have very large dataset.

Hope someone finds this useful ;)
Have a great weekend!