Skip to content
Jun 08 2011

Set data structures & Implementation

Sets are an important mathematical concept. A set can be defined as a collection of distinct elements. In mathematical literature, curly braces are used to denote a set. So, a set S containing the elements 1, 2 and 3 is denoted by S = {1, 2, 3}.

Set illustration

A set supports the following basic operations,

  1. Union - The union of two sets is a set that contains elements from both sets. For instance, given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 ∪ 2 = {1, 2, 3, 4, 5}
  2. Intersection - The intersection of two sets is a set that contains elements common to both sets. For instance, given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 ∩ 2 = {3}
  3. Difference - The difference of two sets is a set that contains elements from the first set with the elements common to both sets removed. Given,
    • S1 = {1, 2, 3} and
    • S2 = {3, 4, 5}
    • S1 − 2 = {1, 2}

We also have the following additional operations,

  1. IsMember - Tells you if an element is a member of the set. It is denoted by the ∈ symbol. So,
    • 1 ∈ {1, 2, 3} but
    • 4 ∉ {1, 2, 3}

Implementing sets

Let's look at the different ways we can implement the Set data structure and analyse each implementation.

Bit-vectors

Bit vectors are an array of bits (x1, x2, ... xn) such that xi = 1 if i is a member of the set. For example,

S1 = {1, 2, 3} will be represented as 11100 while
S2 = {3, 4, 5} will be represented as 00111.

Advantages

  1. IsMember(), Add() and Delete() are O(1) operations.
  2. Union(), Intersection() and Difference() are at most O(N) where N is the size of the universe.
  3. For non-sparse sets, bit-vectors can represent the set quite tightly.
  4. Bit operations are typically faster.

Disadvantages

  1. Takes too much memory for large universes.
  2. Needs a mapping function for non-integer members.

Linked-list

A linked list implementation keeps a sorted list of elements for every set. This means,

S1 = {1, 2, 3} will be stored as 1 -> 2 -> 3 while
S2 = {3, 4, 5} will be represented as 3 -> 4 -> 5.

Properties

  1. IsMember() is at most O(n) operation where n is the size of the set.
  2. Union(), Intersection(), Difference(), Add() and Delete() are take at most O(mn) where m is the size of the first set and n is the size of the second set.

Advantages

  • For very large universes, this outperforms a Bit-vector implementation.

Disadvantages

  • Typically the slowest implementation compared to the rest.

Hashtables

A set can also be implemented as a hashmap. The time complexity of the various functions is dependent on the particular implementation of a hashtable.

Properties

  1. IsMember() is typically a O(1) operation.
  2. Add() and Difference() are typically O(n) where n is the size of the second set.
  3. Union() and Intersection() are typically O(mn) where m is the size of the first set and n is the size of the second one.

Ordered trees

Sets can also be implemented as ordered trees. Priority Queues are an example of an ordered tree. A Binary search tree implementation of a set would give you the following properties,

Properties

  1. IsMember() is an O(log n) operation.
  2. Add(), Difference(), Union() and Intersection() are O(log n).

Applications

  1. Information retrieval - Inverted index.
  2. AND & OR queries.

Get in touch

Email

skype twitter linkedin github amazon

Advert

Latest Articles

Building a url shortener in 10 mins
A 10 minute guide to building your own url shortener i.e. a service that lets you shorten a long url - http://pravin.insanitybegins.com to a url that takes up less number of characters - http://goo.gl/q3jEH
Going Text-mode
I spend a great deal of time working on the Terminal. Here, I list my must-have console applications for email, www browsing, and programming in Linux, Windows and Mac OSX.
Setting up Replica sets in MongoDB
A 5-minute guide to setting up replica sets on MongoDB. Covers downloading, installation on various flavours of linux, editing the config file and initiating replica sets!
Super quick Find & Replace
This article explores a number of ways of performing a find/replace and compares the various implementations for different sizes of the find/replace list and input text.
Set data structures & Implementation
Sets are an important concept and extremely useful in various computer science applications. We take a look at some of the ways a set data structure can be implemented.
Manage your wireless with nmcli
This article shows you how to use the command line tool nmcli to manage your wireless networks. If you prefer cli over gui, go ahead and take a look.
Binary heaps & Priority Queues
Heaps and Queues can be a powerful data structure. This article goes into the implementation of a binary heap and extends the data structure to act as a priority queue.
Recursive functions, Stack overflows and Python
Here we look at recursive functions in Python, and explore some ways of avoiding the maximum stack depth limit by using alternate functions like reduce().
Facebook puzzle - Find Sophie
Today we solve the FB engineering puzzle - Find Sophie. We start with a naive solution and improve the algorithm until we can pass the Facebook Puzzlebot. In closing, I leave you with open-ended questions on improving the algorithm further.
Palindromic sub-sequences in python
This bit of python code returns all palindromic subsequences in the input string. Nothing to see here, I was just having a meh moment.