Chaining in Python and R

30 april 2020

We used to think that the main objective of programming is to communicate our ideas to the machine. Nowadays, we realize that programming also has to do with communicating ideas to other developers. That’s where clean code comes into play.

In actuality, the vast majority for the cost of code isn’t in initial development, it is maintenance over the long-term. In software engineering there is the concept of “modifiability” which encompasses the idea that quality code is easily modified and extended.

Both Python and R support syntax constructs for writing code that can be easily understood by other developers, including using good variable names and coding styles to achieve “self documenting code” like the chaining construct we will discuss in this post. Self documenting code is essential as sooner or later will come the day other developers will either decide to abandon or maintain your code.

Example

Let’s assume you work for an insurance company selling its products through agents. Each agent is paid a commission, calculated as an agent specific percentage of written premiums.

You are asked to produce a quarterly report containing this bar chart summarizing the commission in dollars your insurance company has to pay each agent for their online sales.

What do the data look like? You have two data tables at your disposal. In the premium table (25,184 rows), each record represents a premium booking classified by channel and agent. The commission table (12 rows) features the brokerage – expressed as a percentage of premiums – for each agent.

Chain of data transformations

Taking one step back, we realize that producing the bar chart involves a chain of subsequent data transformations, where the output of each step is the input for the next:

filter the premium table on sales channel “online”;
merge the premium table with the commission table on agent;
for each premium, calculate the agent’s commission;
in the combined table, group the commissions by agent;
for each group, sum the commissions;
sort the commission paid; (yes, we want the agent receiving the highest commission on top of the bar chart)
plot the results as a bar chart.

Once we have this chain of data transformations in place, we can produce the bar chart our boss asked for with just one line (or “chain”) of code.

Chaining in Python

In Python, the data transformations described above are chained through the dot operator, using the pandas library:

Chaining in R

In R, the data transformations described above are chained through the forward pipe operator %>%, using the tidyverse package:

Conclusion

Chaining is a powerful construct supported both by Python and R. Chaining allows you to express the idea of data flowing through a pipeline almost 1-to-1 into your code, making your code more readable both for yourself and for other developers. Feel free to try it yourself, code and data can be found here.

Acknowledgements: I would like to express my gratitude to David Langer, hands-on writer and instructor on data analytics, for his valuable comments on this post.

Deel dit artikel op: