New technologies are transforming the insurance industry. Tomorrow's winners will be those insurers able to convert vast amounts of data into actionable insights about clients and products alike. An increasing demand for machine learning methods, combined with a growing shortage of data scientists able to create, implement and communicate these methods, calls for a data pipeline that is as efficient as possible.
A common approach among data scientists is to go step by step through the data pipeline using scripting languages like Python and R. This approach can be painstakingly slow, and we see that the vast majority of machine learning models never make it to production. Reasons include a missing link between the business and the analytics department, and models that are overly static and cannot cope well with an ever-changing, fast-paced world.
For the past couple of years, however, there has been a new kid on the block: automated machine learning platforms ("autoML") that automate several of the most challenging steps in the pipeline, including preprocessing, model building, model tuning, and, last but not least, model deployment. These platforms also help less experienced users avoid common pitfalls such as overfitting and data leakage.
This post intends to give a brief overview of the autoML space, its benefits and its limitations.
The autoML space is growing fast. At the moment there are more than 30 vendors offering systems that promise a "one-click-data-in-model-out" solution to practical data-driven business problems. Some of these products have been summarized in the table below.
Upfront fees for automated machine learning platforms vary significantly among vendors. Open source packages are free of charge, while expenses for full-blown commercial solutions can add up to hundreds of thousands of dollars a year (including licenses and customer support).
The platforms available in the market today differ in the features they offer but they all automate - to a greater or lesser extent - the following steps in the data pipeline:
- Data connectivity
- Exploratory data analysis
- Model building (training, validation, evaluation, comparison and scoring)
- Deployment and communication
All platforms can read flat comma-separated files. Many can also read Excel spreadsheets and offer some form of connectivity to databases and/or the Hadoop Distributed File System (HDFS). Once the data has been uploaded to the system, several summary statistics such as the mean, the standard deviation and the number of missing values are generated. Some platforms offer more support for automated data cleaning (e.g. imputation of missing values) than others. The same goes for the automated application of data transformations. Once the data has been preprocessed, initiating the model building process can be as simple as hitting a start button. Depending on the platform, models are built for any, some or all of these categories: regression, classification, clustering and time series. The model building process results in a ranking of the models, accompanied by several numerical and graphical performance metrics such as ROC curves and lift charts. A powerful feature of many platforms is their support for quick deployment of the selected model through an API endpoint or automated code generation. Another feature that is important for getting the model accepted by the end user is the visualization of the model's predictions.
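To make the profiling and cleaning steps described above a little more concrete, here is a minimal sketch in plain Python of what a platform might do right after a CSV upload: compute the mean, standard deviation and missing-value count per numeric column, then fill the gaps with the column mean. The sample data, the column names and the `profile_and_impute` helper are hypothetical and chosen for illustration only. Mean imputation is just one simple strategy, and actual platforms may use different or more sophisticated methods.

```python
import csv
import io
import statistics

# Hypothetical sample data; column names and values are placeholders.
RAW_CSV = """policy_id,age,claim_amount
1,34,1200.0
2,,850.0
3,51,
4,45,430.0
"""


def profile_and_impute(csv_text, numeric_columns):
    """Return (summary, rows); missing numeric values are mean-imputed."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for col in numeric_columns:
        # Empty strings in the CSV are treated as missing values.
        observed = [float(r[col]) for r in rows if r[col] != ""]
        mean = statistics.mean(observed)
        summary[col] = {
            "mean": mean,
            "stdev": statistics.stdev(observed),  # sample standard deviation
            "missing": len(rows) - len(observed),
        }
        # Convert the column to floats, replacing missing entries by the mean.
        for r in rows:
            r[col] = float(r[col]) if r[col] != "" else mean
    return summary, rows


summary, cleaned = profile_and_impute(RAW_CSV, ["age", "claim_amount"])
for col, stats in summary.items():
    print(col, stats)
```

The summary is computed on the observed values only, before imputation, so the reported statistics are not distorted by the filled-in values.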
Although autoML can contribute significantly to a more effective and efficient data analysis process in terms of hours invested, it has its limitations. For example, autoML can never replace domain knowledge. Business analysts and data specialists will remain indispensable for translating a business problem into the appropriate data model, which in turn serves as input for the ML platform. Even with the application of autoML, finding the optimal model remains an iterative process that requires judgment, from both a domain and a technical data perspective. Furthermore, autoML platforms produce an abundance of numerical and graphical performance statistics. This output needs to be linked to the original business problem, again requiring both technical and domain knowledge. Also, once the model has been put into production, it cannot straightforwardly adapt itself to new circumstances.
Should the insurance industry embrace autoML?
Yes, we think so. With the classical approach, where the data pipeline is basically created from scratch over and over again for each specific situation, it can take months before the most appropriate model has been tested, validated and deployed. AutoML can reduce the time-to-market for data-driven solutions from months to weeks or even days. Instead of spending most of their valuable time on activities like data wrangling and model selection, data scientists can focus on adding more value to business development and innovative ideas.
AutoML: how to find the best platform?
With so many vendors in the autoML space, insurers face the challenging task of choosing the most appropriate platform for their specific situation. We think answering the following questions can be a good starting point when selecting a platform and vendor:
- What type of business problem needs to be solved?
- Who will be the users of the platform?
- What data sources will be used?
- What regulations should the business adhere to (data security & transparency)?
When choosing a platform, we recommend comparing at least three vendors using a list of predefined specifications that reflect the business domain.
So what does the advent of autoML mean for an insurance company with the ambition to become truly data-driven? It should be noted that there seems to be some misalignment between the perception of what data science does and its role in a corporate setting. Data science is not about data wrangling, nor about building complex models. The primary objective of data science is adding value to the business, using data. This requires spending less time on activities that can be performed faster (and often better) by a computer, leaving more time for creative domain-related tasks that really matter. This is where a well-selected autoML platform can make a difference. In terms of the overall data science process, the insurance sector should embrace a sizable shortening of the data-wrangling and model-building steps that usually take the majority of a data scientist's time, and put renewed focus on data collection and translating insights into action. Finally, as the hugely important and complicated questions mentioned above illustrate, autoML will always require a skilled data analyst to make the investment worthwhile. In summary, combining domain knowledge from the industry with the appropriate autoML platform may not result in a Holy Grail yet, but it may get close to a match made in heaven.