String Interning in Python (Optimization)

String interning is an optimization technique in which the interpreter keeps a global cache of certain strings and, whenever an equal string is needed again, reuses the cached object instead of creating a new one.


In Python, interning a string means keeping a single shared copy of it, which reduces memory use and can speed up operations such as comparisons and dictionary lookups.


In CPython, the standard Python implementation, the following are interned automatically:

  1. When Python code is compiled, identifiers (the names given to variables, functions, classes, constants, etc.) are interned automatically. An identifier must start with an underscore (_) or a letter and can contain only underscores, letters, and digits.
  2. Some string literals that look like identifiers are also interned automatically. This is an implementation detail and can vary between Python versions.
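As a rough illustration of what "looks like an identifier" means, the sketch below approximates the check as "contains only ASCII letters, digits, and underscores". The helper name looks_like_identifier is hypothetical, and the exact rule is a CPython implementation detail, so treat this as an approximation rather than the interpreter's actual logic.

```python
import re

# Assumption: "identifier-like" is approximated as a string made up only of
# ASCII letters, digits, and underscores. The real rule may differ by version.
_NAME_CHARS = re.compile(r"[A-Za-z0-9_]*")


def looks_like_identifier(text):
    """Return True if text contains only identifier-style characters."""
    return _NAME_CHARS.fullmatch(text) is not None


print(looks_like_identifier("welcome"))        # True
print(looks_like_identifier("welcome home"))   # False (contains a space)
print(looks_like_identifier("_long_name_1"))   # True
```

A string that passes this check is a likely candidate for automatic interning when it appears as a literal in source code; a string that fails it usually is not.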

To make the concept of string interning concrete, let's consider a few examples:

Example 1


x = "welcome"
y = "welcome"

print("Address of x: ", id(x))
print("Address of y: ", id(y))

Output

Address of x:  2779414734448
Address of y:  2779414734448

The addresses of x and y are the same, which means no new object is created for y = "welcome"; y refers to the cached object already created for x = "welcome". Here both strings are interned automatically because they look like identifiers.

Example 2


x = "welcome home"
y = "welcome home"

print("Address of x: ", id(x))
print("Address of y: ", id(y))

Output

Address of x:  2779413032368
Address of y:  2779413032880

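The addresses of x and y are different, which means the two strings are separate objects even though their values are equal. Here the strings are not interned because they contain a space and so do not look like identifiers. Note that this output comes from an interactive session, where each line is compiled separately; when both assignments are in the same script, CPython's compiler typically merges equal constants, so the addresses may match even without interning.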
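The result of Example 2 depends on how the code is compiled. The sketch below imitates both situations with the built-in compile() and exec(): once compiling each assignment separately (as an interactive session does) and once compiling both together (as a script does). The variable names separate and together are only for illustration, and the behavior shown is a CPython implementation detail, not a language guarantee.

```python
SOURCE_X = 'x = "welcome home"'
SOURCE_Y = 'y = "welcome home"'

# Compile each statement on its own, the way an interactive session does.
separate = {}
exec(compile(SOURCE_X, "<line 1>", "exec"), separate)
exec(compile(SOURCE_Y, "<line 2>", "exec"), separate)

# Compile both statements as one unit, the way a script does.
together = {}
exec(compile(SOURCE_X + "\n" + SOURCE_Y, "<script>", "exec"), together)

# Usually False on CPython, but some builds intern every string constant.
print(separate["x"] is separate["y"])

# CPython merges equal constants within one compiled unit, so this is True.
print(together["x"] is together["y"])
```

This is why identity checks on string literals should never be relied on in program logic: the same two lines can give different answers depending on where they are run.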

'==' vs 'is' in Python

In the example above, the values are the same but the addresses are different. This can be verified quickly using the == and is operators. The == operator compares values, while the is operator compares identity (whether both names refer to the same object in memory). Here is an example:


x = "welcome home"
y = "welcome home"

print(x is y) # address comparison
print(x == y) # value comparison

Output

False
True
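The difference between the two operators can also be seen with strings built at run time, which avoids any compile-time reuse of literals. This is a minimal sketch; the names parts, a, and b are only for illustration, and the identity result reflects CPython behavior.

```python
parts = ["welcome", " ", "home"]

# Each join() call builds a brand-new string object at run time.
a = "".join(parts)
b = "".join(parts)

print(a == b)  # True: the characters are equal
print(a is b)  # False on CPython: two separate objects
```

For everyday code, == is the right operator for comparing strings; is only answers whether two names point at the very same object.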

Example 3


x = "_this_is_very_long_string_to_check_string_interning_in_python_"
y = "_this_is_very_long_string_to_check_string_interning_in_python_"

print("Address of x: ", id(x))
print("Address of y: ", id(y))

Output

Address of x:  2779414791200
Address of y:  2779414791200

The addresses of x and y are the same: the strings are interned automatically because they look like identifiers. Their length does not matter here.

Manual String Interning

Not all strings are interned by Python, but we can force a string to be interned by using the sys.intern() function.

Interning Using sys.intern()


import sys

x = sys.intern("Learning to fly")
y = sys.intern("Learning to fly")

print("Address of x: ", id(x))
print("Address of y: ", id(y))

Output

Address of x:  2779414734448
Address of y:  2779414734448

The addresses of x and y are the same. Here the strings x and y are forced to be interned.
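The same effect can be shown with strings created at run time, where no automatic interning takes place. In this sketch (the names raw_a, raw_b, a, and b are only for illustration), two equal but separate strings become one shared object after passing through sys.intern().

```python
import sys

parts = ["Learning", " to ", "fly"]

# Two equal strings built at run time; these start out as separate objects.
raw_a = "".join(parts)
raw_b = "".join(parts)

# sys.intern() returns the single cached copy for any equal string.
a = sys.intern(raw_a)
b = sys.intern(raw_b)

print(a == b)  # True
print(a is b)  # True: both names refer to the interned object
```

Note that sys.intern() returns the interned string, so the result must be assigned; calling it without using the return value does not change the original variable.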

When To Use Manual String Interning

In general, don't intern strings manually unless you have a valid reason to do so, because Python already does this automatically in the relevant cases. However, there are some cases where manual string interning is useful:

  1. Dealing with a large number of strings where the same values repeat frequently. This scenario commonly occurs when tokenizing text in Natural Language Processing.
  2. Cases where strings need to be compared very often, because comparing with the is operator (address comparison) is much faster than with the == operator (character-by-character comparison). This is only safe when every string being compared has been interned.
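Both cases above can be sketched with a small tokenizing example. The sample sentence and the helper count_distinct_objects are hypothetical; the point is only that interning collapses repeated tokens into one object per unique word, after which identity comparison between tokens becomes valid.

```python
import sys

text = "the cat saw the dog and the bird"

# split() produces a separate string object for every token.
raw_tokens = text.split()

# Interning maps every repeated word to one shared object.
interned_tokens = [sys.intern(token) for token in raw_tokens]


def count_distinct_objects(tokens):
    """Count how many different string objects the list refers to."""
    return len({id(token) for token in tokens})


print(len(raw_tokens))                           # 8 tokens in total
print(count_distinct_objects(interned_tokens))   # 6, one per unique word

# Both occurrences of "the" are now the same object.
print(interned_tokens[0] is interned_tokens[3])  # True
```

With a real corpus the saving grows with how often words repeat, but it is worth measuring before adopting this approach, since interning itself has a small cost per call.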