At the time of writing, the Python Package Index hosts more than 120,000 packages.
That is a stunning number, and it often raises the following question:
Which Python packages should I use to get the job done?
Last week I had the opportunity to visit PyCon UK 2017 in Cardiff. One of the cool things about attending these conferences is that you meet a lot of folks who can give you great pointers when it comes to selecting the best packages.
In this post I give you an overview of packages I wish I had known about.
Overall data science workflow
TPOT is a tool for automatically building and optimizing machine learning pipelines: it uses genetic programming to search over candidate pipelines and their settings for you.
nbdime makes version control for Jupyter notebooks easier by visualizing the differences between sequential versions of a notebook. Useful when several team members are cooperating on the same notebook.
unittest is the testing framework in Python's standard library, and a natural fit for test driven development (TDD). TDD aims at delivering better (production) code and basically entails iterating through the following three steps:
- write the simplest failing test;
- write the simplest production code that passes all tests;
- refactor both production and test code and make sure it still passes all tests.
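As a minimal sketch of what one pass through these three steps might leave behind, here is a tiny piece of production code together with its unittest test case. The function `mean` and the class `TestMean` are hypothetical names chosen for illustration; they do not come from the talk.

```python
import unittest


def mean(values):
    """Production code: the simplest implementation that passes the tests below."""
    if not values:
        raise ValueError("mean() requires at least one value")
    return sum(values) / len(values)


class TestMean(unittest.TestCase):
    # In TDD each of these tests is written first, seen to fail,
    # and only then is mean() extended just enough to make it pass.
    def test_mean_of_single_value(self):
        self.assertEqual(mean([4]), 4)

    def test_mean_of_several_values(self):
        self.assertEqual(mean([1, 2, 3]), 2)

    def test_empty_input_raises(self):
        with self.assertRaises(ValueError):
            mean([])
```

Saved as, say, `test_mean.py`, the tests can be run with `python -m unittest test_mean`. After every refactoring you run them again to check that everything still passes.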
dask processes data that is too big to fit into RAM. Easy to pick up when you already know the pandas package, thanks to its familiar interface.
lens aims to automate some of the common (and often time consuming) data exploration tasks, like visualizing features in histograms and creating heat maps to gain insight into the correlations between features. lens is built on top of dask.
keras is a user-friendly package for training neural networks.
hyperopt searches for optimal values of the hyperparameters, given an objective function to minimize.
hyperas is a user-friendly wrapper of hyperopt for keras.
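To make the idea of searching hyperparameters against an objective function concrete, here is a plain-Python sketch. Note that this is ordinary random search written by hand, not hyperopt's API or its search algorithms; `objective`, `random_search`, the parameter names and their ranges are all made up for illustration.

```python
import random


def objective(params):
    """Toy loss to minimize. A real objective would train a model
    with these hyperparameters and return its validation error."""
    return (params["learning_rate"] - 0.1) ** 2 + (params["depth"] - 5) ** 2 / 100


def random_search(objective, n_trials=200, seed=0):
    """Sample hyperparameters at random and keep the set with the lowest loss."""
    rng = random.Random(seed)  # seeded so that repeated runs give the same result
    best_params, best_loss = None, float("inf")
    for _ in range(n_trials):
        # Hypothetical search space: one continuous and one integer hyperparameter.
        params = {
            "learning_rate": rng.uniform(0.001, 1.0),
            "depth": rng.randint(1, 10),
        }
        loss = objective(params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


best_params, best_loss = random_search(objective)
print(best_params, best_loss)
```

The division of labour is the point here: you supply the objective function and the search space, and the search routine decides which combinations to try.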
Communicating the outcomes (and convincing the client)
ipywidgets is a package for making your notebooks interactive so that you (or the user) can easily play around with the data.
yellowbrick is a tool for visualizing model quality, for example with residual plots that show model performance for both the training and the test set.
ELI5 is a tool for communicating the outcomes of machine learning models (like XGBoost) to a non-technical audience. It shows graphically which features are most important and provides a local view of your data, including interactions between features.
LIME provides a model-agnostic way to explain models (as opposed to ELI5, which uses a model-specific approach).
ccxt is a package for connecting to an exchange to trade cryptocurrencies. At the time of writing, this package supports 91 different cryptocurrency exchanges.
backtrader is a package for testing your trading strategies on historical data.
Acknowledgements: I would like to express my gratitude to the following individuals for sharing their experience on these packages at PyCon UK 2017: Scott Stevenson (Jupyter notebooks and collaboration), Chris Medrela (TDD), Víctor Zabalza (data exploration), Solveiga Vivian-Griffiths (neural networks), Ian Ozsvald (communicating the outcomes and convincing the client) and James Campbell (algotrading).