Tour of Python Itertools

By Martin Heinz

Let’s explore two great Python libraries, itertools and more_itertools, and see how to leverage them for data processing…

There are lots of great Python libraries, but most of them don’t come close to what the built-in itertools module and the third-party more-itertools package provide. These two really are the whole kitchen sink when it comes to processing or iterating over data in Python. At first glance, however, the functions in those libraries might not seem that useful, so let's take a little tour of (in my opinion) the most interesting ones, including examples of how to get the most out of them!

Compress

You have quite a few options when it comes to filtering sequences. One of them is compress, which takes an iterable and a sequence of boolean selectors and outputs the items of the iterable for which the corresponding selector is true.

We can use this to apply the result of filtering one sequence to another, for example to create a list of dates where the corresponding count is greater than 3.
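A minimal sketch of that idea follows. The dates and counts are made-up sample data, and the variable names are only illustrative.

```python
from itertools import compress

# Hypothetical sample data: counts[i] belongs to dates[i].
dates = ["2020-01-01", "2020-02-04", "2020-02-01", "2020-01-24", "2020-01-08"]
counts = [1, 4, 3, 8, 0]

# Build the boolean selector from one sequence...
selector = [count > 3 for count in counts]

# ...and apply it to the other one.
busy_dates = list(compress(dates, selector))
print(busy_dates)  # ['2020-02-04', '2020-01-24']
```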

Accumulate

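accumulate yields the running result of applying a two-argument function across an iterable, a running sum by default. If you don’t care about the intermediate results, you could use functools.reduce (called fold in other languages), which keeps only the final value, so nothing but the current result has to be held in memory.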
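To make the difference concrete, here is a small sketch that runs both over the same made-up list of numbers: accumulate exposes every intermediate value, while reduce returns only the last one.

```python
from functools import reduce
from itertools import accumulate
import operator

data = [3, 4, 1, 3, 5, 6, 9, 0, 1]  # arbitrary sample numbers

# accumulate yields every intermediate result (addition is the default).
running_sum = list(accumulate(data))
running_max = list(accumulate(data, max))

# reduce folds the same data down to the final value only.
total = reduce(operator.add, data)

print(running_sum)  # [3, 7, 8, 11, 16, 22, 31, 31, 32]
print(running_max)  # [3, 4, 4, 4, 5, 6, 9, 9, 9]
print(total)        # 32
```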

Cycle

Tee

This function is handy whenever you need multiple independent iterators over the same stream of data. Be careful when using it, though, as it can be pretty costly in terms of memory: tee has to buffer every item that one iterator has consumed and another has not. It's also important not to use the original iterable after calling tee on it, because advancing it directly would make the new tee objects silently skip items.
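One common use is walking over neighbouring pairs of a stream. The helper below is a sketch of that pattern; the name pairs is just an illustrative choice.

```python
from itertools import tee

def pairs(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), ..."""
    first, second = tee(iterable)
    next(second, None)  # advance the second iterator by one item
    return zip(first, second)

# Note that only the tee objects are used, never the original iterable.
result = list(pairs([1, 2, 3, 4]))
print(result)  # [(1, 2), (2, 3), (3, 4)]
```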

more_itertools

Divide

Partition

As a first example, we can split a list of dates into recent ones and old ones using a simple lambda function. As a second example, we can partition files based on their extension, again using a lambda function that splits the file name into name and extension and checks whether the extension is in a list of allowed ones.
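Both cases might look roughly like this. The sample dates, the cutoff, the file names and the ALLOWED_EXTENSIONS tuple are all assumptions made for illustration. Keep in mind that partition returns the items that fail the predicate first.

```python
import os
from datetime import datetime

from more_itertools import partition

# Hypothetical sample data.
dates = [
    datetime(2015, 2, 15),
    datetime(2020, 1, 1),
    datetime(2020, 1, 15),
    datetime(2019, 11, 1),
    datetime(2020, 2, 1),
    datetime(2018, 5, 1),
]

# Items failing the predicate come first, items passing it come second.
old, recent = partition(lambda date: date >= datetime(2020, 1, 1), dates)
old, recent = list(old), list(recent)

files = ["foo.jpg", "bar.exe", "baz.gif", "text.txt", "data.bin"]
ALLOWED_EXTENSIONS = (".jpg", ".gif")  # illustrative allow-list

forbidden, allowed = partition(
    lambda name: os.path.splitext(name)[1] in ALLOWED_EXTENSIONS, files
)
forbidden, allowed = list(forbidden), list(allowed)

print(allowed)    # ['foo.jpg', 'baz.gif']
print(forbidden)  # ['bar.exe', 'text.txt', 'data.bin']
```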

Consecutive_groups

Suppose we have a list of dates, some of which are consecutive. To be able to pass these dates to the consecutive_groups function, we first have to convert them to ordinal numbers. Then, using a list comprehension, we iterate over the groups of consecutive ordinal dates created by consecutive_groups and convert them back to datetime.datetime using the map and fromordinal functions.
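A sketch of those steps, using a handful of made-up dates:

```python
from datetime import datetime

from more_itertools import consecutive_groups

# Hypothetical sample data: three runs of consecutive days.
dates = [
    datetime(2020, 1, 15),
    datetime(2020, 1, 16),
    datetime(2020, 1, 17),
    datetime(2020, 2, 1),
    datetime(2020, 2, 2),
    datetime(2020, 2, 4),
]

# consecutive_groups needs values that differ by 1, so use ordinals.
ordinals = [date.toordinal() for date in dates]

# Each group is a lazy iterator, so it is materialized right away
# while converting the ordinals back to datetime objects.
groups = [
    list(map(datetime.fromordinal, group))
    for group in consecutive_groups(ordinals)
]

for group in groups:
    print([date.strftime("%Y-%m-%d") for date in group])
```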

Side_effect

To try it out, we declare a simple function that increments a counter every time it’s invoked. This function is then passed to side_effect along with some unspecified iterable called events. Later, when the resulting event iterator is consumed, it calls increment_num_events for each item, giving us the final event count.
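Here is a minimal sketch of that setup. The contents of events are invented, and the counter is kept in a small dict (an assumption of this sketch) so the function can update it without a global statement.

```python
from more_itertools import side_effect

counter = {"events": 0}

def increment_num_events(_event):
    # Called once per item, purely for its side effect.
    counter["events"] += 1

events = ["login", "click", "click", "logout"]  # hypothetical events
event_iterator = side_effect(increment_num_events, events)

# Nothing has been counted yet: side_effect is lazy.
before = counter["events"]

processed = [event for event in event_iterator]  # consume the iterator

print(before, counter["events"])  # 0 4
```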

Collapse

Two examples show what collapse is good for. The first one generates a list of file and directory paths by collapsing the iterables returned by os.walk. The second takes a tree data structure in the form of nested lists and collapses it to get a flat list of all nodes of said tree.
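Both can be sketched as follows. To keep the os.walk part reproducible, the sketch builds a tiny temporary directory first. That setup, the file name a.txt and the tree values are assumptions for illustration.

```python
import os
import tempfile

from more_itertools import collapse

# 1) Flatten the (dirpath, dirnames, filenames) tuples from os.walk.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "a.txt"), "w"):
        pass
    # Strings are treated as atoms, so paths are not split into characters.
    entries = list(collapse(os.walk(root)))

print(entries)

# 2) Flatten a tree stored as nested lists.
tree = [40, [25, [10, 3, 17], [32, 30, 38]], [78, 50, 93]]
nodes = list(collapse(tree))
print(nodes)  # [40, 25, 10, 3, 17, 32, 30, 38, 78, 50, 93]
```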

Split_at

To demonstrate, we can simulate a text file using a list of lines. This “text file” contains lines with -------------, which is used as a delimiter. So, that's what we use as our predicate for splitting these lines into separate lists.
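A short sketch with invented line contents follows. Only the delimiter comes from the text above.

```python
from more_itertools import split_at

DELIMITER = "-------------"

# Hypothetical "text file" represented as a list of lines.
lines = [
    "first section, line 1",
    "first section, line 2",
    DELIMITER,
    "second section, line 1",
    DELIMITER,
    "third section, line 1",
    "third section, line 2",
]

# The delimiter lines themselves are dropped by default.
sections = list(split_at(lines, lambda line: line == DELIMITER))

for section in sections:
    print(section)
```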

Bucket

bucket groups the items of an iterable by a key function, here the item's type. We first declare a few types of shapes and create a list of them. Calling bucket on this list with a key function that returns each item's type creates a bucket object. This object supports lookup like a built-in Python dict. Each value in the bucket object is a generator, so we need to call list on it to actually get the values out.
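A minimal sketch of this, with three placeholder shape classes invented for the example:

```python
from more_itertools import bucket

# Hypothetical shape types used only for illustration.
class Cube:
    pass

class Circle:
    pass

class Triangle:
    pass

shapes = [Circle(), Cube(), Circle(), Circle(), Cube(), Triangle(), Triangle()]

# The built-in type serves as the key function: one bucket per class.
buckets = bucket(shapes, key=type)

# Lookup works like a dict, but each value is a lazy generator.
circles = list(buckets[Circle])
cubes = list(buckets[Cube])

print(len(circles), len(cubes))  # 3 2
```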

Map_reduce

This MapReduce implementation allows us to specify 3 functions: a key function (for categorizing), a value function (for transforming) and finally a reduce function (for reducing). The value and reduce functions can be omitted to produce the intermediate steps of the MapReduce process.
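The sketch below adds one function at a time to show those intermediate steps. The word list and the choice of len, str.upper and a counting lambda are assumptions made for illustration.

```python
from more_itertools import map_reduce

words = ["apple", "kiwi", "fig", "plum", "pear", "banana"]  # sample data

# Key function only: categorize words by their length.
by_length = map_reduce(words, keyfunc=len)

# Key and value functions: also transform each word.
upper_by_length = map_reduce(words, keyfunc=len, valuefunc=str.upper)

# All three: reduce each category to a count.
count_by_length = map_reduce(
    words, keyfunc=len, valuefunc=lambda word: 1, reducefunc=sum
)

print(dict(by_length))        # {5: ['apple'], 4: ['kiwi', 'plum', 'pear'], 3: ['fig'], 6: ['banana']}
print(dict(count_by_length))  # {5: 1, 4: 3, 3: 1, 6: 1}
```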

Sort_together

The input to the function is a list of iterables (columns) and key_list, which tells sort_together which of the iterables to use for sorting and with what priority. For a "table" with Date of Birth and Updated At columns, for instance, we can first sort by Date of Birth and then by Updated At.
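A small sketch of such a table, with invented names and dates. Two rows share a date of birth, so the second key decides their order.

```python
from more_itertools import sort_together

# Hypothetical table stored column by column.
names = ["Mary", "John", "Ben", "Andy"]
date_of_birth = ["1994-02-06", "1985-04-01", "1994-02-06", "2000-06-25"]
updated_at = ["2020-01-09", "2020-01-06", "2020-01-07", "2020-01-08"]

table = [names, date_of_birth, updated_at]

# Sort by column 1 (Date of Birth) first, then by column 2 (Updated At).
sorted_table = sort_together(table, key_list=(1, 2))

for column in sorted_table:
    print(column)
# ('John', 'Ben', 'Mary', 'Andy')
# ('1985-04-01', '1994-02-06', '1994-02-06', '2000-06-25')
# ('2020-01-06', '2020-01-07', '2020-01-09', '2020-01-08')
```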

Seekable

seekable is a function that wraps an iterable in an object that makes it possible to go back and forth through an iterator, even after some elements have been consumed. Once the whole iterator has been consumed, next raises StopIteration, but we can seek back and keep working with it.
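That behaviour can be sketched like this, using a short made-up list of letters:

```python
from more_itertools import seekable

it = seekable(iter(["a", "b", "c"]))

first = next(it)   # 'a'
rest = list(it)    # ['b', 'c'], the iterator is now exhausted

try:
    next(it)
    exhausted = False
except StopIteration:
    exhausted = True

# Items already seen are cached, so we can jump back to any index.
it.seek(0)
again = next(it)   # 'a'

it.seek(2)
last = next(it)    # 'c'

print(first, rest, exhausted, again, last)
```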

Filter_except

filter_except filters the items of the input iterable by passing each one to the provided function (for example float) and checking whether it raises one of the listed exceptions (for example TypeError or ValueError), keeping only the elements that pass the check.
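For instance, with a made-up list of mixed values. Note that the items themselves are kept as they are and are not converted:

```python
from more_itertools import filter_except

# Hypothetical input: some items convert to float, some do not.
data = ["1.5", "6", "not-important", None, "11", "1.23E-7", "remove-me", "25"]

# float("not-important") raises ValueError, float(None) raises TypeError.
numeric = list(filter_except(float, data, TypeError, ValueError))

print(numeric)  # ['1.5', '6', '11', '1.23E-7', '25']
```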

Unique_to_each

Here, we define a graph data structure using an adjacency list (actually a dict). We then pass the neighbours of each node as a set to unique_to_each. What it outputs, for each node, is the list of nodes that would become isolated if that node were removed.
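A sketch with a small invented graph, where C hangs only off B and D hangs only off E:

```python
from more_itertools import unique_to_each

# Hypothetical undirected graph: node -> set of neighbours.
graph = {
    "A": {"B", "E"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"E"},
    "E": {"A", "D"},
}

# For each node, find neighbours that appear in no other neighbour set.
unique = unique_to_each(*graph.values())
isolated_by_removal = dict(zip(graph, unique))

print(isolated_by_removal)
# {'A': [], 'B': ['C'], 'C': [], 'D': [], 'E': ['D']}
```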

Numeric_range

What is nice about numeric_range is that it behaves the same way as the basic range, but it is not limited to integers. You can specify start, stop and step arguments, for example decimals between 1.7 and 3.5 with a step of 0.3, or dates between 2020/2/10 and 2020/2/15 with a step of 2 days.
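Those two ranges can be sketched as follows. As with range, the stop value is excluded.

```python
from datetime import datetime, timedelta
from decimal import Decimal

from more_itertools import numeric_range

# Decimals from 1.7 up to (but excluding) 3.5 in steps of 0.3.
decimals = list(numeric_range(Decimal("1.7"), Decimal("3.5"), Decimal("0.3")))
print([str(value) for value in decimals])
# ['1.7', '2.0', '2.3', '2.6', '2.9', '3.2']

# Dates from 2020/2/10 up to (but excluding) 2020/2/15 in steps of 2 days.
dates = list(
    numeric_range(datetime(2020, 2, 10), datetime(2020, 2, 15), timedelta(days=2))
)
print([date.strftime("%Y/%m/%d") for date in dates])
# ['2020/02/10', '2020/02/12', '2020/02/14']
```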

Make_decorator

As an example, we can take the map_except function and create a decorator out of it. The decorator passes the result of the decorated function to map_except as its second argument (result_index=1). In our case, the decorated function is read_file, which simulates reading data from some file and outputs a list of strings that might or might not be floats. That output, however, is first passed through the decorator, which maps the items and filters out all the undesirable ones, leaving us with only floats.
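A sketch of that decorator follows. The name mapper_except, the file name and the strings returned by read_file are assumptions of this sketch, since read_file only simulates file access.

```python
from more_itertools import make_decorator, map_except

# The decorated function's result becomes argument 1 of map_except,
# i.e. map_except(float, <result>, ValueError, TypeError).
mapper_except = make_decorator(map_except, result_index=1)

@mapper_except(float, ValueError, TypeError)
def read_file(path):
    # Simulated file contents; a real version would read from `path`.
    return ["1.5", "6", "not-important", "11", "1.23E-7", "remove-me", "25"]

values = list(read_file("file.txt"))
print(values)  # [1.5, 6.0, 11.0, 1.23e-07, 25.0]
```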

Conclusion