High-volume Streaming Data

GETTING VALUE FROM THE "DATA FIREHOSE" WITHOUT BEING DROWNED.

Situation: A new streaming data source could enhance customer data if it can be intelligently absorbed.

A new source of streaming data from multiple web and mobile sources might significantly enhance a company's understanding of customer and new prospect behavior and preferences. The data lake has heretofore been updated in batches, but the new streaming data is accumulating fast and has become overwhelming.

The Response - Assess the new data's usefulness, then prepare to process it as it arrives.

Step 1: A sample of the accumulated stream was separated out for intensive analysis. Speculation about what "we might find" or "could be learned" was put to the test. Is it really there? Is it new and surprising insight, or just a validation of what was already known? Is the new information actionable or just "nice to know"?

We found all of those situations. Much of the imagined, or "hoped for," learning was not supported by the data, or was found to be adequately covered by less cumbersome and expensive sources. But some very useful new learnings were identified.

Step 2: A process was designed to reduce small batches of streamed records to a select group of key counts, sums and dates. Essentially a "count them as they go by" strategy, this approach wrings the value out of the high-volume stream as it is received, eliminating the need to store large volumes of transient data.
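The "count them as they go by" idea can be sketched in a few lines of Python. This is a minimal illustration, not the process actually built for the client: the field names (customer_id, timestamp, amount) and the choice to aggregate per customer are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class CustomerSummary:
    # Hypothetical set of "key counts, sums and dates" kept per customer.
    event_count: int = 0
    total_amount: float = 0.0
    first_seen: Optional[datetime] = None
    last_seen: Optional[datetime] = None


def summarize_batch(events):
    """Reduce one small batch of raw events to per-customer summaries."""
    summaries = defaultdict(CustomerSummary)
    for event in events:
        s = summaries[event["customer_id"]]
        ts = event["timestamp"]
        s.event_count += 1
        s.total_amount += event.get("amount", 0.0)  # events without an amount add 0
        if s.first_seen is None or ts < s.first_seen:
            s.first_seen = ts
        if s.last_seen is None or ts > s.last_seen:
            s.last_seen = ts
    return dict(summaries)


# Example batch (made-up records, for illustration only).
batch = [
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 1, 9, 0), "amount": 20.0},
    {"customer_id": "c1", "timestamp": datetime(2024, 1, 1, 9, 5), "amount": 5.0},
    {"customer_id": "c2", "timestamp": datetime(2024, 1, 1, 9, 7)},
]
result = summarize_batch(batch)
```

Only the summaries are written to the data lake; the raw batch can be discarded once it has been counted.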

The resulting aggregates and counts were much more compact than the original stream, but even these were found to have short-lived value. After accumulating for some days, these records are further aggregated in a process similar to the initial processing, and then removed from the data lake.
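The second-stage aggregation might look something like the sketch below. It assumes daily rows keyed by (customer_id, day) and a seven-day retention window; the source says only "some days", so keep_days and the row layout are illustrative choices, not details of the original design.

```python
from datetime import date, timedelta


def roll_up(daily, today, keep_days=7):
    """Fold daily rows older than keep_days into one long-term row per customer.

    daily maps (customer_id, day) -> {"count": int, "total": float}.
    Returns (recent_daily, long_term_rollup); the old daily rows are
    simply not carried forward, which is how they leave the data lake.
    """
    cutoff = today - timedelta(days=keep_days)
    recent, rollup = {}, {}
    for (customer_id, day), row in daily.items():
        if day >= cutoff:
            recent[(customer_id, day)] = row  # still inside the retention window
            continue
        agg = rollup.setdefault(
            customer_id, {"count": 0, "total": 0.0, "last_day": day}
        )
        agg["count"] += row["count"]
        agg["total"] += row["total"]
        agg["last_day"] = max(agg["last_day"], day)
    return recent, rollup


# Example with made-up daily aggregates.
daily = {
    ("c1", date(2024, 1, 1)): {"count": 3, "total": 30.0},
    ("c1", date(2024, 1, 2)): {"count": 1, "total": 5.0},
    ("c1", date(2024, 1, 20)): {"count": 2, "total": 8.0},
}
recent, rollup = roll_up(daily, today=date(2024, 1, 21), keep_days=7)
```

The same shape of logic serves both stages: read a bounded set of rows, reduce them to a smaller set, and drop the inputs.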

Step 3: The processing was tested and refined by applying it in one-day batches from the accumulated backlog. With the backlog now cleared, the processing is applied to the stream as it arrives, several times each day. The results are stored in the data lake and incorporated into the company's customer knowledge base. This new insight has become a trigger for new marketing and merchandising programs, and it is being explored as a source of new features for ML modeling.

What was Jay Dean's role?

As consulting analyst and architect engaged by the CTO, Jay performed the analysis stage in cooperation with the company's analysis group, and then designed, tested and documented the new processing plan. The validated process was handed over to the DevOps team for ongoing implementation, while Jay engaged in additional analysis and exploration suggested by the insights from the initial stages.

Key Lessons

  1. Streaming data, especially that generated from online sources, can arrive in very high volume. The value of any single record is very low, but observations over time can be much more useful. However, accumulated records will become unwieldy and eventually so cumbersome that attempts to make use of the data are limited or even abandoned. Your data team will always have something else they need to do: "We'll get to it later." A better approach is to be realistic about what value can be extracted from the stream and develop ways to capture that knowledge as the stream data arrives.

  2. Experiencing the lower cost and seemingly limitless storage and processing power of a cloud platform, a data team can fall into a trap. They will keep everything that flows in, telling themselves, "You never know what you'll need." But there is a cost to this sort of data hoarding, both in direct fees to your cloud provider and in the reduced usefulness of the unprocessed data pile. And the phrase "you never know" is ultimately a cop-out. Much can be known with a little effort, and decisions about priorities and capacity are a regular part of any business process. Resist the temptation to stuff streams, logs and other ephemeral data into your data lake and say "later". Think the problem through and reduce those streams to useful information at the doorstep.

  3. Despite the caveat expressed above, the space and processing power of new data tools and cloud platforms open a door to new sources of knowledge. Even after refining the processing as described above, keep the complete, unaggregated records for a small sample of the stream, perhaps a test panel of customers, so that later analysis can look for unanticipated trends and patterns only visible over time, or find new behaviors not observed in the first analysis. Again, you may find this a rich source of new features for Machine Learning.
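One way to implement the test panel from lesson 3 is to pick panel members with a stable hash of the customer ID, so the same customers are retained in full every time the stream is processed. The sketch below is an assumption about how this could be done, not a description of the client's system; the salt value, the 1% default, and the customer_id field name are all hypothetical.

```python
import hashlib


def in_panel(customer_id, panel_percent=1, salt="panel-v1"):
    """Return True if this customer belongs to the full-retention panel.

    The decision depends only on the salt and the customer ID, so it is
    the same on every run and on every machine.
    """
    digest = hashlib.sha256(f"{salt}:{customer_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < panel_percent


def split_stream(events, panel_percent=1):
    """Separate events to keep in full (panel) from events to aggregate only."""
    panel, rest = [], []
    for event in events:
        if in_panel(event["customer_id"], panel_percent):
            panel.append(event)
        else:
            rest.append(event)
    return panel, rest


# Rough check of panel size on synthetic IDs (illustration only).
panel_size = sum(in_panel(f"cust-{i}", panel_percent=10) for i in range(10000))
```

Panel events are stored as received, while every event, panel or not, still flows through the regular aggregation. Changing the salt selects a different panel, which is useful if the panel ever needs to be refreshed.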