Let’s explore two great Python libraries, itertools and more_itertools, and see how to leverage them for data processing…
There are lots of great Python libraries, but most of them don’t come close to what the built-in itertools and also more-itertools provide. These two libraries are really the whole kitchen sink when it comes to processing and iterating over data in Python. At first glance, however, the functions in those libraries might not seem that useful, so let's take a little tour of (in my opinion) the most interesting ones, including examples of how to get the most out of them!
You have quite a few options when it comes to filtering sequences. One of them is compress, which takes an iterable and a boolean selector and outputs the items of the iterable where the corresponding element in the selector is true. We can use this to apply the result of filtering one sequence to another, for example to create a list of dates where the corresponding count is greater than 3.
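Here is a minimal sketch of that dates-and-counts case. The data is made up for illustration; only compress itself comes from itertools.

```python
from itertools import compress

# Illustrative data (assumed for this sketch): one count per date.
dates = [
    "2020-01-01", "2020-02-04", "2020-02-01", "2020-01-24",
    "2020-01-08", "2020-02-10", "2020-02-15", "2020-02-11",
]
counts = [1, 4, 3, 8, 0, 7, 9, 2]

# Build the boolean selector from one sequence and apply it to the other.
selector = [count > 3 for count in counts]
busy_dates = list(compress(dates, selector))
print(busy_dates)
```

The selector does not have to be a list; any iterable of truthy/falsy values works.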
As the name suggests, accumulate collects the results of some (binary) function. Examples of this are a running maximum or a factorial. If you don’t care about the intermediate results, you could use functools.reduce (called fold in other languages), which keeps only the final value and is also more memory efficient.
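A short sketch of both ideas, using sample numbers chosen just for this illustration: accumulate keeps every intermediate result, while reduce returns only the last one.

```python
from functools import reduce
from itertools import accumulate
import operator

data = [3, 4, 1, 3, 5, 6, 9, 0, 1]  # sample values, assumed for illustration

# Running maximum: each output is the largest value seen so far.
running_max = list(accumulate(data, max))

# Running product of 1..5, i.e. 1!, 2!, 3!, 4!, 5!
factorials = list(accumulate(range(1, 6), operator.mul))

# reduce keeps only the final value (5! in this case).
final_factorial = reduce(operator.mul, range(1, 6))

print(running_max, factorials, final_factorial)
```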
cycle takes an iterable and creates an infinite cycle from it. This can be useful, for example, in a game where players take turns. Another cool thing you can do with cycle is to create a simple infinite spinner.
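Both uses can be sketched as follows. The player names are hypothetical, and the sketch is bounded with islice so it terminates; a real spinner would print each frame in a loop instead.

```python
from itertools import cycle, islice

# Turn order for a game (player names are made up for this sketch).
players = ["John", "Ben", "Martin", "Peter"]
turns = cycle(players)
first_six_turns = list(islice(turns, 6))

# Spinner frames: cycle repeats the four symbols forever,
# islice takes only the first eight so the example finishes.
spinner_frames = "".join(islice(cycle("-\\|/"), 8))

print(first_six_turns)
print(spinner_frames)
```

Because cycle never stops on its own, always pair it with something that ends the iteration (islice, a break, or a game-over condition).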
The final one from the itertools module is tee. This function creates multiple iterators from one, which allows us to remember what happened. An example of that is the pairwise function from the itertools recipes (and also from more_itertools), which returns pairs of values from the input iterable (the current value and the previous one).
tee is handy every time you need multiple separate pointers to the same stream of data. Be careful when using it though, as it can be pretty costly when it comes to memory. It is also important to note that you should not use the original iterable after you call tee on it, as that will mess up (unintentionally advance) those new tee iterators.
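A sketch of pairwise built on tee, modelled on the itertools recipes. On Python 3.10 and newer, itertools.pairwise is available directly, so this hand-rolled version is mainly useful for seeing how tee is applied.

```python
from itertools import tee

def pairwise(iterable):
    """Yield overlapping pairs: (s0, s1), (s1, s2), (s2, s3), ..."""
    previous, current = tee(iterable)
    next(current, None)  # advance the second iterator by one step
    return zip(previous, current)

pairs = list(pairwise([1, 2, 3, 4, 5]))
print(pairs)
```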
Now, let’s have a closer look at what the more_itertools library offers, as there are many interesting functions that you might not have heard about.
First up from more_itertools is divide. As the name suggests, it divides an iterable into a number of sub-iterables. The lengths of the sub-iterables might not be the same, as they depend on the number of elements being divided and the number of sub-iterables.
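A minimal sketch with seven made-up items split into three parts, which shows the uneven lengths. It assumes more_itertools is installed.

```python
from more_itertools import divide

data = ["first", "second", "third", "fourth", "fifth", "sixth", "seventh"]

# divide takes the number of parts first, then the iterable.
# Each part is an iterator, so we turn it into a list to inspect it.
parts = [list(part) for part in divide(3, data)]
print(parts)
```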
partition also divides our iterable, this time, however, using a predicate. For example, we can split a list of dates into recent ones and old ones using a simple lambda function. We can likewise partition files based on their extension, again using a lambda function that splits the file name into name and extension and checks whether the extension is in a list of allowed ones.
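Both cases can be sketched like this. The dates, the cutoff, the file names and the allowed extensions are all assumptions made for illustration. Note that partition returns the items for which the predicate is false first, then the ones for which it is true.

```python
from datetime import date
from os.path import splitext

from more_itertools import partition

# Dates: split into old and recent around an assumed cutoff.
dates = [date(2015, 1, 15), date(2020, 1, 5), date(2020, 1, 25),
         date(2019, 9, 9), date(2012, 7, 4)]
cutoff = date(2020, 1, 1)
old, recent = partition(lambda d: d >= cutoff, dates)
old, recent = list(old), list(recent)

# Files: split by whether the extension is in an allowed list.
files = ["foo.jpg", "bar.exe", "baz.gif", "text.txt", "data.bin"]
ALLOWED_EXTENSIONS = (".jpg", ".png", ".gif")
forbidden, allowed = partition(
    lambda name: splitext(name)[1] in ALLOWED_EXTENSIONS, files
)
forbidden, allowed = list(forbidden), list(allowed)

print(recent, allowed)
```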
If you need to find runs of consecutive numbers, dates, letters, booleans or any other orderable objects, then you might find consecutive_groups handy. Say we have a list of dates, some of which are consecutive. To be able to pass these dates to the consecutive_groups function, we first have to convert them to ordinal numbers. Then, using a list comprehension, we iterate over the groups of consecutive ordinal dates created by consecutive_groups and convert them back to dates.
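The steps above can be sketched as follows, with made-up dates and datetime.date as the date type (an assumption of this sketch).

```python
from datetime import date

from more_itertools import consecutive_groups

dates = [
    date(2020, 1, 15), date(2020, 1, 16), date(2020, 1, 17),
    date(2020, 2, 1), date(2020, 2, 2),
    date(2020, 2, 4),
]

# Convert to ordinals so that consecutive days differ by exactly 1.
ordinals = [d.toordinal() for d in dates]

# Each group is a lazy iterator, so it is consumed right away
# and converted back to date objects.
groups = [
    [date.fromordinal(o) for o in group]
    for group in consecutive_groups(ordinals)
]
print(groups)
```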
Let’s say you need to cause a side effect when iterating over a list of items. This side effect could be, e.g., writing logs, writing to a file or counting the number of events that occurred. For the last case, we declare a simple function that increments a counter every time it’s invoked. This function is then passed to side_effect along with a non-specific iterable called events. Later, when the events iterator is consumed, it calls increment_num_events for each item, giving us the final event count.
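A sketch of the event counter. The event names are hypothetical, and the counter lives in a dict rather than a global variable, which is a choice made for this sketch only.

```python
from more_itertools import consume, side_effect

counter = {"num_events": 0}

def increment_num_events(_event):
    # Called once per item as the wrapped iterator is consumed.
    counter["num_events"] += 1

events = ["login", "click", "click", "logout"]  # stand-in event stream
event_iterator = side_effect(increment_num_events, events)

# Nothing has been counted yet; the function only runs during iteration.
consume(event_iterator)
print(counter["num_events"])  # prints 4
```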
collapse is a more powerful version of another more_itertools function called flatten. collapse allows you to flatten multiple levels of nesting. It also allows you to specify a base type, so that you can stop flattening with one layer of lists/tuples remaining. One use case for this function would be flattening a Pandas DataFrame. Two more general-purpose examples: generating a list of file and directory paths by collapsing the iterables returned by os.walk, and taking a tree data structure in the form of nested lists and collapsing it to get a flat list of all nodes of said tree.
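A sketch of these uses. The tree values are made up, and the os.walk part runs against a temporary directory created just for this example so that it does not depend on your file system.

```python
import os
import tempfile

from more_itertools import collapse

# Tree as nested lists: every level is flattened into one list of nodes.
tree = [40, [25, [10, 3, 17], [32, 30, 38]], [78, 50, 93]]
nodes = list(collapse(tree))

# base_type=tuple tells collapse to treat tuples as atoms and keep them.
pairs = list(collapse([[(1, 2), (3, 4)], [(5, 6)]], base_type=tuple))

# os.walk yields (dirpath, dirnames, filenames) tuples; collapse merges
# them into one flat list of strings (strings are never split up).
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    open(os.path.join(root, "sub", "a.txt"), "w").close()
    entries = list(collapse(os.walk(root)))

print(nodes, pairs, len(entries))
```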
Back to splitting data. The split_at function splits an iterable into lists based on a predicate. This works like the basic split for strings, but here we have an iterable instead of a string and a predicate function instead of a delimiter. For instance, we can simulate a text file using a list of lines. This “text file” contains lines with -------------, which is used as a delimiter. So that's what we use as our predicate for splitting these lines into separate lists.
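A minimal sketch of the simulated text file. The line contents are placeholders invented for this example.

```python
from more_itertools import split_at

DELIMITER = "-------------"

# A "text file" represented as a list of lines.
lines = [
    "first", "second",
    DELIMITER,
    "third",
    DELIMITER,
    "fourth", "fifth",
]

# The delimiter lines themselves are dropped from the result.
sections = list(split_at(lines, lambda line: line == DELIMITER))
print(sections)
```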
If you need to split your iterable into multiple buckets based on some condition, then bucket is what you are looking for. It creates child iterables by splitting the input iterable using a key function. For example, we can bucket an iterable based on the type of its items. We first declare a few types of shapes and create a list of them. When we call bucket on this list with a key function that returns each item's type, we create a bucket object. This object supports lookup like the built-in Python dict. Also, each item in the bucket object is a generator, therefore we need to call list on it to actually get the values out of it.
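A sketch of bucketing by type. The shape classes are empty placeholders defined only for this example, and the built-in type is used as the key function.

```python
from more_itertools import bucket

# Placeholder shape classes, defined only for this sketch.
class Cube:
    pass

class Circle:
    pass

class Triangle:
    pass

shapes = [Circle(), Cube(), Circle(), Circle(), Cube(), Triangle(), Triangle()]

# The key function decides which bucket each item lands in.
by_type = bucket(shapes, key=type)

# Lookup works like a dict, but each bucket is a lazy iterator.
circles = list(by_type[Circle])
cubes = list(by_type[Cube])
print(len(circles), len(cubes))
```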
Probably the most interesting function in this library for all the data science people out there is map_reduce. I'm not going to go into detail on how MapReduce works, as that is not the purpose of this article and there are lots of articles about that already. What I'm gonna show you, though, is how to use it.
This MapReduce implementation allows us to specify 3 functions: a key function (for categorizing), a value function (for transforming) and finally a reduce function (for reducing). Some of these functions can be omitted to produce the intermediate steps of the MapReduce process.
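The three stages can be sketched with a small word list. The words and the choice of key, value and reduce functions are assumptions made for illustration.

```python
from more_itertools import map_reduce

words = ["apple", "avocado", "banana", "blueberry", "cherry"]

def first_letter(word):
    return word[0]

# Key function only: items are grouped by category.
grouped = map_reduce(words, first_letter)

# Key + value function: each grouped item is transformed.
lengths = map_reduce(words, first_letter, len)

# Key + value + reduce function: each group is reduced to a single value.
totals = map_reduce(words, first_letter, len, sum)

print(dict(grouped))
print(dict(lengths))
print(dict(totals))
```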
If you work with spreadsheets of data, chances are that you need to sort them by some column. This is a simple task for sort_together. It allows us to specify by which column(s) to sort the data. The input to the function is a list of iterables (columns) and key_list, which tells sort_together which of the iterables to use for sorting and with what priority. For example, we could first sort the "table" by the Date of Birth column and then by the Updated At column.
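A sketch of that two-key sort. The table contents are invented, and two rows share a date of birth so that the second key (Updated At) actually matters.

```python
from more_itertools import sort_together

# Columns of a small "table" (values are made up for this sketch).
columns = [
    ("John", "Ben", "Andy", "Mary"),                             # Name
    ("1994-02-06", "1985-04-01", "2000-06-25", "1994-02-06"),    # Date of Birth
    ("2020-01-06", "2019-03-07", "2020-01-08", "2018-08-15"),    # Updated At
]

# Sort by column 1 (Date of Birth) first, then by column 2 (Updated At).
sorted_columns = sort_together(columns, key_list=(1, 2))
names = sorted_columns[0]
print(names)
```

ISO-formatted date strings sort correctly as plain strings, which is why no date parsing is needed here.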
We all love iterators, but you should always be careful with them in Python, as one of their features is that they consume the supplied iterable. They don’t have to though, thanks to seekable. seekable is a function that wraps an iterable in an object that makes it possible to go back and forth through an iterator, even after some elements were consumed. For example, after going through the whole iterator we get a StopIteration exception, but we can seek back and keep working with it.
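A minimal sketch of exhausting an iterator and then seeking back. The sentence is a made-up input.

```python
from more_itertools import seekable

data = "This is example sentence for seeking back and forth".split()
it = seekable(data)

# Consume the whole iterator.
for _word in it:
    pass

# A further next() raises StopIteration, as with any exhausted iterator.
exhausted = False
try:
    next(it)
except StopIteration:
    exhausted = True

# seek() moves back to an absolute index in the cached items.
it.seek(3)
word = next(it)
print(exhausted, word)
```

seekable caches the items it has seen, so the convenience costs memory on long streams.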
Let’s look at the following scenario: you received mixed data that contains both text and numbers, all of it in string form. You, however, want to work only with the numbers (floats/ints). filter_except filters the items of the input iterable by passing each element to the provided function (float) and checking whether it throws an error (TypeError, ValueError) or not, keeping only the elements that passed the check.
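A sketch of this scenario with invented input strings. Note that filter_except yields the original items (still strings), so a separate map(float, ...) does the actual conversion.

```python
from more_itertools import filter_except

data = ["1.5", "6", "not-important", "11", "1.23E-7", "remove-me", "25", "trash"]

# Keep only the items for which float(item) does not raise
# TypeError or ValueError.
numeric_strings = list(filter_except(float, data, TypeError, ValueError))

# The surviving items are unchanged strings; convert them afterwards.
numbers = [float(item) for item in numeric_strings]
print(numbers)
```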
unique_to_each is one of the more obscure functions in the more_itertools library. It takes a bunch of iterables and returns the elements from each of them that aren't in the other ones. It's better to look at an example. Suppose we define a graph data structure using an adjacency list (actually a dict). We then pass the neighbours of each node as a set to unique_to_each. What it outputs is a list of nodes that would get isolated if the respective node was to be removed.
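A sketch with a small made-up graph. Zipping the result back onto the node names is an addition of this sketch to make the output easier to read.

```python
from more_itertools import unique_to_each

# Adjacency "list" as a dict: node -> set of neighbours (made-up graph).
graph = {
    "A": {"B", "E"},
    "B": {"A", "C"},
    "C": {"B"},
    "D": {"E"},
    "E": {"A", "D"},
}

# One result list per input set, in the same order as the inputs.
unique = unique_to_each(*graph.values())

# Pair each node with the neighbours that only it is connected to.
isolated_if_removed = dict(zip(graph, unique))
print(isolated_if_removed)
```

Here C is reachable only through B and D only through E, so removing B or E would isolate them.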
numeric_range has a lot of use cases, as it’s quite common that you need to iterate over a range of some non-integer values. What is nice about numeric_range is that it behaves the same way as the basic range. You can specify start, stop and step arguments, for example decimals up to 3.5 with a step of 0.3 or dates up to 2020/2/15 with a step of 2 days.
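A sketch of both kinds of range. The start values (1.0 and 2020/2/10) are assumptions chosen for this example, and Decimal is used instead of float to avoid rounding noise in the output.

```python
from datetime import datetime, timedelta
from decimal import Decimal

from more_itertools import numeric_range

# Decimals from an assumed start of 1.0 up to (not including) 3.5.
decimals = list(numeric_range(Decimal("1.0"), Decimal("3.5"), Decimal("0.3")))

# Dates from an assumed start of 2020/2/10 up to (not including) 2020/2/15.
days = list(
    numeric_range(datetime(2020, 2, 10), datetime(2020, 2, 15), timedelta(days=2))
)

print(decimals)
print(days)
```

As with range, the stop value itself is excluded.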
Last but not least, make_decorator enables us to use other itertools as decorators and therefore modify the outputs of other functions, producing iterators. As an example, we can take the map_except function and create a decorator out of it. This decorator will consume the result of the decorated function as its second argument (result_index=1). In our case, the decorated function is read_file, which simulates reading the data of some file and outputs a list of strings that might or might not be floats. The output, however, is first passed to the decorator, which maps and filters out all the undesirable items, leaving us with only floats.
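The decorator described above can be sketched like this. The name mapper_except, the file name and the returned strings are placeholders invented for this example; read_file does not touch the disk.

```python
from more_itertools import make_decorator, map_except

# result_index=1: the decorated function's result becomes the second
# positional argument of map_except, i.e. map_except(float, result, ...).
mapper_except = make_decorator(map_except, result_index=1)

@mapper_except(float, ValueError, TypeError)
def read_file(path):
    # Simulated file contents; a real version would read from `path`.
    return ["1.5", "6", "not-important", "11", "1.23E-7", "remove-me", "25", "trash"]

# The decorated function now returns an iterator of floats.
floats = list(read_file("file.txt"))
print(floats)
```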
I hope that you learned something new in this article, as
more_itertools can make your life a whole lot easier if you are processing lots of data frequently. Using these libraries and functions, however, requires some practice before you become efficient with them. So, if you think that you can make use of some of the things shown in this article, then go ahead and check out the itertools recipes, or just force yourself to use these functions as much as possible to get comfortable with them. 😉