A castle built on soft sand? - ML on funky data
Updated: Sep 21, 2021
I was called in to help a senior data scientist complete an exciting application of machine learning designed to identify patterns and clusters in customer and prospective-customer data. The issue, as described to me, was packaging the promising data-science results into something that could be deployed in production. Great idea, I'm on it!
Something doesn't look right:
So I met with the Chief of Data Science, and the work looked impressive. Then I met with Product Management to discuss how to sell this new service to paying customers. In my head I was building a little map or flow diagram. I would have done this on a whiteboard, but we were meeting via remote conferencing (MS Teams in this instance), so no whiteboard. I had the ML piece diagrammed and the hoped-for product/service diagrammed, but the two didn't connect. Didn't connect at all. Hmm.....
The source of the disconnect was the data source for the wonderful new ML models. Our data scientist had shown me the training data; it was wonderfully feature-rich and varied. Then the product folks described getting consumer data from our clients. I have spent years working with that stuff and know it well. It rarely (let's be honest, "never") looks like that training data. The mandate was to create a pipeline to import consumer data from clients, feed it to the models, and return the results, all nicely packaged for easy consumption, but there was a disconnect between the expected data and the realistic input data.
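Had we mapped the flow earlier, a quick audit comparing what the models were trained on against what clients actually send would have surfaced the gap immediately. Here's a minimal sketch of that kind of check; the column names, toy records, and threshold are illustrative assumptions, not the actual data:

```python
# Hypothetical sketch: does production input look like the training data?
# Compares per-column missing rates between the two samples and flags
# columns that are far sparser in production. All names/values are made up.
from statistics import mean

def missing_rate(rows, column):
    """Fraction of rows where the column is absent or empty."""
    return mean(1.0 if not row.get(column) else 0.0 for row in rows)

def audit_inputs(train_rows, prod_rows, columns, max_gap=0.25):
    """Flag columns whose production missing-rate exceeds training by > max_gap."""
    flagged = []
    for col in columns:
        gap = missing_rate(prod_rows, col) - missing_rate(train_rows, col)
        if gap > max_gap:
            flagged.append((col, round(gap, 2)))
    return flagged

# Toy data: training is feature-rich; client data is sparse and patchy.
train = [{"age": 34, "zip": "02134", "last_purchase": "2021-06-01"}] * 10
prod = [{"age": 41}, {"zip": "10001"}, {"age": 29}, {}]

print(audit_inputs(train, prod, ["age", "zip", "last_purchase"]))
# → [('age', 0.5), ('zip', 0.75), ('last_purchase', 1.0)]
```

Every column is flagged here, which is exactly the "models trained on data that will never arrive" situation described above.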
Let's train on some realistic (or even real) data:
Before we built the application infrastructure, we had to get the models retrained on more realistic inputs. That took some effort, and some clever work to enrich our customer input data to give the models more to work with, but once that was accomplished, the rest was pretty straightforward. There is enough processing applied to the input data that I wondered whether our models were describing the customer's data or our processing, but I think we got it right. The clients seem to like it, and the new product has been well received.
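The enrichment step mentioned above might look something like the following sketch: deriving extra features from the sparse fields clients do send, and making missing values explicit rather than silent. The lookup table and field names are hypothetical, purely for illustration:

```python
# Hypothetical enrichment step: derive extra features from sparse client
# records so the retrained models have more to work with. The zip-to-region
# map and field names are illustrative assumptions, not the real pipeline.
ZIP_TO_REGION = {"0": "northeast", "1": "northeast", "9": "west"}  # toy map

def enrich(record):
    """Return a copy of the record with derived fields and explicit gaps."""
    out = dict(record)
    zip_code = out.get("zip")
    if zip_code:
        # Derive a coarse region feature from the first digit of the zip.
        out["region"] = ZIP_TO_REGION.get(zip_code[0], "unknown")
    out.setdefault("age", None)  # explicit missing beats silent absence
    return out

print(enrich({"zip": "02134"}))
# → {'zip': '02134', 'region': 'northeast', 'age': None}
```

Note the trade-off flagged in the paragraph above: every derived field like `region` is a product of our processing, so it's worth asking whether the model ends up describing the customer or the enrichment logic.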
A lesson learned: Map the whole process at the start.
The lesson here is to get your data scientists, data engineers, and customer service or customer success folks together at the very start. In the case described above, the data science group wanted to work on customer demographic and purchase data but was handed something developed in a research context that in no way reflected realistic input from a real customer, so the models were wonderfully successful but not realistic. I strongly recommend you pay close attention to who is building your training data and the work they are doing. Do you know what you are modeling?