High-volume Streaming Data


Situation: A new streaming data source could enhance customer data if it can be intelligently absorbed.

A new source of streaming data from multiple web and mobile sources might significantly enhance a company's understanding of customer and new prospect behavior and preferences. The data lake has heretofore been updated in batches, but the new streaming data is accumulating fast and has become overwhelming.

The Response - Assess the new data's usefulness, then prepare to process it as it arrives.

Step 1: A sample of the accumulated stream was separated out for intensive analysis. Speculation about what "we might find" or "could be learned" was put to the test. Is it really there? Is it a new and surprising insight, or just a validation of what was already known? Is the new information actionable or just "nice to know"?

We found all of those situations. Much of the imagined or "hoped for" learning was not supported by the data, or was found to be adequately covered by less cumbersome and less expensive data. But some very useful new learnings were identified.

Step 2: A process was designed to reduce small batches of streamed records to a select group of key counts, sums and dates. Essentially a "count them as they go by" strategy, this approach wrings the value out of the high-volume stream as it is received, removing the need to store large volumes of transient data.
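The "count them as they go by" reduction can be sketched in a few lines of Python. This is only an illustration, not the process actually built for the client: the event fields (customer_id, timestamp, amount) and the names CustomerAggregate and reduce_batch are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Iterable, Optional


@dataclass
class CustomerAggregate:
    """Key counts, sums and dates kept per customer (hypothetical fields)."""
    event_count: int = 0
    total_amount: float = 0.0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None


def reduce_batch(events: Iterable[dict]) -> Dict[str, CustomerAggregate]:
    """Collapse one small batch of raw events into per-customer aggregates.

    Assumes each event is a dict with 'customer_id', an ISO-8601
    'timestamp', and an optional numeric 'amount'.
    """
    aggregates: Dict[str, CustomerAggregate] = defaultdict(CustomerAggregate)
    for event in events:
        agg = aggregates[event["customer_id"]]
        ts = datetime.fromisoformat(event["timestamp"])
        agg.event_count += 1
        agg.total_amount += event.get("amount", 0.0)
        if agg.first_seen is None or ts < agg.first_seen:
            agg.first_seen = ts
        if agg.last_seen is None or ts > agg.last_seen:
            agg.last_seen = ts
    return dict(aggregates)
```

Only the aggregates returned by reduce_batch would be written to the data lake; the raw events in the batch can then be discarded.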

The resulting aggregates and counts were much compressed from the original stream, but even these were found to have short-lived value. After accumulating for some days, these records are further aggregated in a process similar to the original processing, and then removed from the data lake.
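A second-stage roll-up of this kind might look like the sketch below. The seven-day retention window, the weekly grain, the row layout and the function name roll_up are all assumptions chosen for illustration; the text above says only that the records accumulate "for some days" before being aggregated further.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# Hypothetical row layout: (customer_id, day, event_count, total_amount)
DailyRow = Tuple[str, date, int, float]


def roll_up(rows: List[DailyRow], today: date, retention_days: int = 7):
    """Fold daily rows older than the retention window into weekly totals.

    Returns (weekly_totals, kept_rows). Rows folded into weekly_totals
    are the ones that can be removed from the data lake afterwards.
    """
    cutoff = today - timedelta(days=retention_days)
    weekly_totals: Dict[Tuple[str, date], List[float]] = {}
    kept_rows: List[DailyRow] = []
    for customer_id, day, count, amount in rows:
        if day >= cutoff:
            # Still inside the retention window: keep the daily detail.
            kept_rows.append((customer_id, day, count, amount))
            continue
        # Key the roll-up by the Monday that starts the row's week.
        week_start = day - timedelta(days=day.weekday())
        totals = weekly_totals.setdefault((customer_id, week_start), [0, 0.0])
        totals[0] += count
        totals[1] += amount
    return weekly_totals, kept_rows
```

The same pattern can be repeated at coarser grains (weeks into months, for example) if the weekly totals also prove to be short-lived.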

Step 3: The processing was tested and refined by applying it to individual days' volumes from the accumulated backlog. With the backlog now cleared, the processing is applied to the stream as it arrives, several times each day. The results are stored in the data lake and incorporated in the company's customer knowledge. This new insight has become a trigger for new marketing and merchandising programs, and it is being explored as a source of new features for ML modeling.

What was Jay Dean's role?

As consulting analyst and architect, engaged by the CTO, Jay performed the analysis stage in cooperation with the company's analysis group, and then designed, tested and documented the new processing plan. The validated process was handed over to the DevOps team for ongoing implementation, while Jay engaged in additional analysis and exploration suggested by the insights from the initial stages.

Key Lessons

  1. Streaming data, especially that generated from online sources, can arrive in very high volume. The value of any single record is very low, but observations over time can be much more useful. However, accumulated records will become unwieldy and eventually so cumbersome that attempts to make use of the data are limited or even abandoned. Your data team will always have something else they need to do: "We'll get to it later." The better approach is to be realistic about what value can be extracted from the stream and develop ways to get that new knowledge as the stream data arrives.

  2. Having experienced the lower cost and seemingly limitless storage and processing power of a cloud platform, a data team can fall into the trap of keeping everything that flows in, telling themselves "you never know what you'll need". But there is a cost to this sort of data hoarding, both in direct fees to your cloud provider and in reduced usefulness of the unprocessed data pile. And the phrase "you never know" is ultimately a cop-out. Much can be known with a little effort, and decisions about priorities and capacity are a regular part of any business process. Resist the temptation to stuff streams, logs and other ephemeral data into your data lake and say "later". Think the problem through and reduce those streams to useful information at the doorstep.

  3. Despite the caveat expressed above, the space and processing power of new data tools and cloud platforms open a door to new sources of knowledge. Even after refining the processing as described above, keep a complete record of the stream for a small sample, perhaps a test panel of customers, so that later analysis can look for unanticipated trends and patterns only visible over time, or find new behaviors not observed in the first analysis. Again, you may find this a rich source of new features for machine learning.
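One way to hold a stable test panel, offered here as an assumption rather than as the method used in this engagement, is to hash each customer ID into a fixed bucket and keep full raw records only for IDs that fall below a threshold. The names in_test_panel and split_events, the salt value and the one percent default are all hypothetical.

```python
import hashlib
from typing import Iterable, List, Tuple


def in_test_panel(customer_id: str, panel_percent: float = 1.0,
                  salt: str = "panel-v1") -> bool:
    """Return True if this customer belongs to the full-retention panel.

    The hash is deterministic, so a customer stays in (or out of) the
    panel across batches and days without any lookup table.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < panel_percent * 100


def split_events(events: Iterable[dict],
                 panel_percent: float = 1.0) -> Tuple[List[dict], List[dict]]:
    """Separate panel events (keep raw) from the rest (aggregate only)."""
    panel, rest = [], []
    for event in events:
        target = panel if in_test_panel(event["customer_id"], panel_percent) else rest
        target.append(event)
    return panel, rest
```

Changing the salt would draw a different panel, which may be useful if the panel is meant to be refreshed periodically.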