Performance

Data Structures: Hash Table (aka Dictionary)

i206

Fall 2010

John Chuang

Some slides adapted from Marti Hearst, Brian Hayes, Andreas Veneris, Glenn Brookshear, Nicolas Christin, or Ion Stoica.


Outline

What is a data structure

Basic building blocks: arrays and linked lists

Data structures (uses, methods, performance):

-List, stack, queue

-Dictionary

-Tree

-Graph


Dictionary

Also known as hash table, lookup table, associative array, or map

A searchable collection of key-value pairs

-each key is associated with one value

Many possible applications, e.g.,

-Address book

-Student record database

-Routing tables

-…


Dictionary Methods

get(k): if the dictionary has an entry with key k, return its associated value; else return null

put(k,v): insert entry (k,v) into the dictionary; if key is not already in dictionary, return null; else return old value associated with k

remove(k): if dictionary has an entry with key k, remove it from dictionary and return its associated value; else return null

size()

isEmpty()
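The five methods above can be sketched on top of Python's built-in dict. The class name `Dictionary` is an illustrative assumption, and Python's `None` stands in for null; this is a minimal sketch, not how any particular library defines the interface.

```python
class Dictionary:
    """Illustrative wrapper exposing the slide's method names over a Python dict."""

    def __init__(self):
        self._data = {}

    def get(self, k):
        # Return the value stored under k, or None if k is absent.
        return self._data.get(k)

    def put(self, k, v):
        # Store (k, v); return the previous value, or None if k was new.
        old = self._data.get(k)
        self._data[k] = v
        return old

    def remove(self, k):
        # Delete k and return its value, or None if k was absent.
        return self._data.pop(k, None)

    def size(self):
        return len(self._data)

    def isEmpty(self):
        return len(self._data) == 0
```

One limitation of this sketch: because None doubles as "not found", a stored value of None cannot be told apart from a missing key.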


Dictionary in Python

Example:

>>> agents = {'006': 'Jack Giddings', '007': 'James Bond'}
>>> agents['001'] = 'Edward Donne'
>>> agents['007']
'James Bond'


Desirable Properties

Fast search, insert, and delete (get, put, and remove)

Efficient space usage

Arrays: O(1) operations; not space efficient

Linked lists: O(n) operations; space efficient

Binary trees: O(log n) operations; space efficient

Can we do better?


Hash Table

A generalization of an array that, if properly designed, can realize fast insert/delete/search operations with good space efficiency

-average run-time of O(1)

-worst case run-time of O(n)

A hash table is:

-Good for storing and retrieving key-value pairs

-Not good for iterating through a list of items


Hash Table

A hash table consists of two major components:

-Bucket array (for storing entries)

-Hash function (for mapping keys to buckets)

[Figure: a table of buckets indexed by hash value; the buckets hold obj1 (key=15), Obj2 (key=36), Obj3 (key=4), Obj4 (key=2), and Obj5 (key=1).]


Hash Table Design

Bucket array is an array A of size N, where each cell of A is considered a "bucket"

Hash function h maps each key k to an integer in the range [0, N-1]

Store entry (k,v) in the bucket A[h(k)]

Search for entry (k,v) in the bucket A[h(k)]

Choose hash function such that

-Can take arbitrary objects (keys) as input

-Hash computation is fast

-Key mappings evenly distributed across [0, N-1]
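As a rough illustration of these three criteria, the sketch below builds h from Python's built-in `hash()` followed by a modulo. The bucket count `N = 8` and the sample keys are assumptions chosen for the example, not values from the slides.

```python
from collections import Counter

N = 8  # hypothetical bucket-array size

def h(k, n=N):
    """Map any hashable key to a bucket index in the range [0, n-1]."""
    return hash(k) % n

# Arbitrary hashable objects are accepted as keys.
for key in (42, "007", (3, 4), 2.5):
    assert 0 <= h(key) < N

# Count how many of the integer keys 0..999 land in each bucket.
counts = Counter(h(k) for k in range(1000))
for bucket in sorted(counts):
    print(bucket, counts[bucket])
```

In CPython, small non-negative integers hash to themselves, so these 1000 keys spread evenly across the 8 buckets. Real key sets are rarely this regular, which is why the choice of hash function matters.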


Example Hash Function

h(k) = k mod N

mod stands for modulo, the remainder of the division of two numbers. For example:

-8 mod 5 = 3

-9 mod 5 = 4

-10 mod 5 = 0

-15 mod 5 = 0

Observe that collisions are possible:

-two different keys hash to the same value
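The modulo examples above can be checked directly with Python's `%` operator. This is a small sketch using N = 5 as in the examples; the function name `h` follows the slide's notation.

```python
N = 5

def h(k):
    # The slide's example hash function: remainder of k divided by N.
    return k % N

for k in (8, 9, 10, 15):
    print(f"{k} mod {N} = {h(k)}")

# 10 and 15 are different keys with the same hash value: a collision.
assert h(10) == h(15)
```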


Example

h(k) = k mod 5

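Insert (2,x): 2 mod 5 = 2, so the entry goes in bucket 2

Insert (21,y): 21 mod 5 = 1, so the entry goes in bucket 1

Insert (34,z): 34 mod 5 = 4, so the entry goes in bucket 4

Insert (54,w): 54 mod 5 = 4, but bucket 4 already holds (34,z). There is a collision at array entry #4.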


Dealing with Collisions

Hashing with Chaining: every hash table entry contains a pointer to a linked list of the keys that hash to that entry

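[Figure: after Insert (54,w), bucket 4 holds a chain containing (54,w) and (34,z). After Insert (101,x), bucket 1 holds a chain containing (21,y) and (101,x); bucket 2 still holds only (2,x).]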
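Chaining can be sketched in a few lines of Python. The class name `ChainedHashTable` is hypothetical, and a `collections.deque` stands in for each linked list (an assumption made for brevity); the hash function is the k mod 5 example used in these slides.

```python
from collections import deque

class ChainedHashTable:
    """Hash table with chaining; each bucket's deque stands in for a linked list."""

    def __init__(self, n=5):
        self.n = n
        self.buckets = [deque() for _ in range(n)]

    def _h(self, k):
        return k % self.n  # example hash function from the slides

    def put(self, k, v):
        chain = self.buckets[self._h(k)]
        for i, (key, old) in enumerate(chain):
            if key == k:              # key already present: overwrite it
                chain[i] = (k, v)
                return old
        chain.appendleft((k, v))      # new entry goes at the head of the chain
        return None

    def get(self, k):
        for key, value in self.buckets[self._h(k)]:
            if key == k:
                return value
        return None

    def remove(self, k):
        chain = self.buckets[self._h(k)]
        for i, (key, value) in enumerate(chain):
            if key == k:
                del chain[i]
                return value
        return None

table = ChainedHashTable(5)
for k, v in [(2, "x"), (21, "y"), (34, "z"), (54, "w"), (101, "x")]:
    table.put(k, v)
```

Note that this `put` scans the chain for an existing key before inserting, so it is only O(1) when chains are short; skipping that check would allow duplicate keys.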


Dictionary Performance

What is the run time for insert/search/delete?

-Insert: It takes O(1) time to compute the hash function and insert at the head of the linked list

-Search: It is proportional to the length of the linked list

-Delete: Same as search


Load Factor

Average length of the linked lists is a function of the load factor

-Load factor = number of items in hash table / array size

In Python, the implementation details are transparent to the programmer (load factor kept under 2/3)

In Java, the programmer can set the initial table size (default = 16) and/or the load factor (default = 0.75)
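The load factor definition translates directly into code. The sketch below uses hypothetical helper names (`load_factor`, `chain_lengths`, `exceeds`) and a k mod size hash; the 16 and 0.75 figures are the defaults quoted above, used only as sample inputs and not as a model of any real resizing policy.

```python
def load_factor(num_items, array_size):
    # Load factor = number of items in hash table / array size.
    return num_items / array_size

def chain_lengths(keys, array_size):
    # Length of each bucket's chain under the example hash k mod array_size.
    lengths = [0] * array_size
    for k in keys:
        lengths[k % array_size] += 1
    return lengths

def exceeds(num_items, array_size, max_load=0.75):
    # True once the table holds more items than the maximum load factor allows.
    return load_factor(num_items, array_size) > max_load

keys = range(100)
for size in (10, 50, 200):
    lengths = chain_lengths(keys, size)
    average = sum(lengths) / size
    print(size, load_factor(len(keys), size), average, max(lengths))
```

Averaged over all buckets, the chain length equals the load factor, so keeping the load factor below a fixed bound keeps the average search cost constant.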


Distributed Hash Table (DHT)

Similar to traditional hash table data structure, except data is stored in distributed peer nodes

-Each node is analogous to a bucket in a hash table.

Applications: distributed search, e.g., peer-to-peer networks, content distribution networks (CDNs), etc.

DHTs are typically designed to scale to large numbers of nodes and to handle continual node arrivals and departures/failures.

Put(), Get() interface like a regular hash table:

-put(id, item);

-item = get(id);
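To make the node-as-bucket analogy concrete, here is a single-process toy. Everything in it is an assumption for illustration: the class name `ToyDHT`, the rule that an item lives on the first node at or after its id on a ring (wrapping around), and the `leave` method that hands items over when a node departs. A real DHT would do this over a network.

```python
from bisect import bisect_left

class ToyDHT:
    """Single-process stand-in for a DHT: each node id owns a local dict."""

    def __init__(self, node_ids, ring_size=8):
        self.ring_size = ring_size
        self.nodes = {n: {} for n in node_ids}  # node id -> local store

    def _owner(self, item_id):
        # Assumed placement rule: first node at or after item_id, wrapping around.
        ids = sorted(self.nodes)
        pos = bisect_left(ids, item_id % self.ring_size)
        return ids[pos % len(ids)]

    def put(self, item_id, item):
        self.nodes[self._owner(item_id)][item_id] = item

    def get(self, item_id):
        return self.nodes[self._owner(item_id)].get(item_id)

    def leave(self, node_id):
        # A departing node's items are re-inserted among the remaining nodes.
        orphaned = self.nodes.pop(node_id)
        for item_id, item in orphaned.items():
            self.put(item_id, item)

dht = ToyDHT([1, 4, 6], ring_size=8)
dht.put(5, "item-a")   # lands on node 6
dht.put(7, "item-b")   # wraps around to node 1
dht.leave(6)           # node 6 departs; item 5 moves to another node
```

The caller still sees only put and get; which node holds an item, and what happens when nodes leave, stays behind the interface.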


DHT Example: Chord

Nodes: n1(id=1), n2(id=2), n3(id=0), n4(id=6)

Items inserted: f1(id=7), f2(id=1)

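Finger tables (id+2^i is taken mod 8; succ is the first node at or after that value on the ring):

Finger Table, n1 (id=1)
i   id+2^i   succ
0   2        2
1   3        6
2   5        6

Finger Table, n2 (id=2)
i   id+2^i   succ
0   3        6
1   4        6
2   6        6

Finger Table, n3 (id=0)
i   id+2^i   succ
0   1        1
1   2        2
2   4        6

Finger Table, n4 (id=6)
i   id+2^i   succ
0   7        0
1   0        0
2   2        2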

Slide adapted from Ion Stoica, Nicolas Christin
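The finger tables for this four-node ring can be recomputed with a short script. It assumes a 3-bit identifier space (ids 0 to 7), which is what the three-row tables and wrap-around values imply; the names `successor` and `finger_table` are illustrative.

```python
from bisect import bisect_left

M = 3                  # bits in the identifier space (assumed from the 3-row tables)
RING = 2 ** M          # ids run from 0 to 7
NODES = [0, 1, 2, 6]   # node ids from the example, kept sorted

def successor(k):
    # First node at or after k on the ring, wrapping past the largest id.
    pos = bisect_left(NODES, k % RING)
    return NODES[pos % len(NODES)]

def finger_table(node_id):
    # Row i holds (i, (node_id + 2^i) mod RING, successor of that value).
    rows = []
    for i in range(M):
        start = (node_id + 2 ** i) % RING
        rows.append((i, start, successor(start)))
    return rows

for n in NODES:
    print(f"node {n}: {finger_table(n)}")
```

The same `successor` function gives the usual Chord placement for the inserted items: id 7 wraps around to node 0, and id 1 sits on node 1.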


DHT Example: Chord

Upon receiving a query for item id, a node:

Checks whether the item is stored locally

If not, forwards the query to the largest node in its successor table that does not exceed id
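The forwarding rule can be traced with a small simulation. The successor columns are hard-coded from the example's finger tables. Two things are assumptions of this sketch: "largest" and "does not exceed" are measured clockwise around an 8-id ring (so the rule still works across the wrap-around), and a node with no qualifying entry falls back to its immediate successor. Item placement (id 7 on node 0, id 1 on node 1) follows the usual Chord rule and is also assumed here.

```python
RING = 8  # assumed size of the identifier space

# Successor column of each node's finger table, as given in the example.
FINGERS = {1: [2, 6, 6], 2: [6, 6, 6], 0: [1, 2, 6], 6: [0, 0, 2]}

# Assumed placement: each item lives on the first node at or after its id.
ITEMS = {0: {7}, 1: {1}, 2: set(), 6: set()}

def clockwise(a, b):
    # Distance travelled going clockwise around the ring from a to b.
    return (b - a) % RING

def next_hop(node, item_id):
    target = clockwise(node, item_id)
    # Finger entries that move toward item_id without passing it.
    candidates = [f for f in FINGERS[node] if 0 < clockwise(node, f) <= target]
    if candidates:
        return max(candidates, key=lambda f: clockwise(node, f))
    return FINGERS[node][0]  # fallback: immediate successor

def query(start, item_id, max_hops=RING):
    # Return the list of nodes visited until the item is found locally.
    path = [start]
    node = start
    while item_id not in ITEMS[node]:
        if len(path) > max_hops:
            raise LookupError(f"item {item_id} not found")
        node = next_hop(node, item_id)
        path.append(node)
    return path

print(query(1, 7))
```

Starting at node 1, query(7) is forwarded to node 6 and then to node 0, where the item is stored.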


query(7)

Slide adapted from Ion Stoica, Nicolas Christin