How SETS are internally stored in Python?

Set in Python can be defined as the collection of similar items. In Python, these are basically used to include membership testing and eliminating duplicate entries.

The data structure used in SET is Hashing, a popular technique to perform insertion, deletion and traversal in O(1) on average. The operations on Hash Table are somewhat similar to Linked List. Sets in python are unordered list with duplicate elements removed.

In order to create a hash table from scratch, we start with some allocated memory, similar to what we started with for arrays.

For an array, if we want to insert data, we simply find the smallest unused bucket and insert our data there (and resize if necessary).

For hash tables, we must first figure out the placement of the data in this contiguous chunk of memory.

The placement of the new data is contingent on two properties of the data we are inserting: the hashed value of the key and how the value compares to other objects.

This is because when we insert data, the key is first hashed and masked so that it turns into an effective index in an array.

The mask makes sure that the hash value, which can take the value of any integer, fits within the allocated number of buckets.

So, if we have allocated 8 blocks of memory and our hash value is 28975, we consider the bucket at index 28975 & 0b111 = 7.

If, however, our dictionary has grown to require 512 blocks of memory, then the mask becomes 0b111111111 (and in this case, we would consider the bucket at index 28975 & 0b11111111).

Now we must check if this bucket is already in use. If it is empty, we can insert the key and the value into this block of memory. We store the key so that we can make sure we are retrieving the correct value on lookups. If it is in use and the value of the bucket is equal to the value we wish to insert (a comparison done with the cmp built-in), then the key/value pair is already in the hash table and we can return. However, if the values don’t match, then we must find a new place to put the data.