Migration to Cloud

SAME DATA - ALL NEW STRUCTURE AND PROCESS

Situation: A large-scale Identity Warehouse in SQL databases, due to move to the Cloud

This data warehouse, built and maintained over years, was hosted on servers in the company datacenter running SQL Server. Data about persons and their key contact points (addresses, phones, emails, and social IDs) was rigorously normalized and indexed. The database is maintained in a "current version" state and has many uses and users, so tables are heavy with "flags" and other markers of specific properties and limits.

The data warehouse was slated to be relocated to an IaaS cloud provider (AWS). While it's possible to move the entire data structure to the cloud intact and host it on virtual servers, that approach fails to exploit the full power and potential of the cloud platform (and would likely increase costs).

Response: Rethink how the data is structured and maintained - Use a Data Lake model

Data in Zones:

Data comes into the lake through persistent zones based on the state of processing; Raw, Standardized, QA, Curated, and Application Zones are examples. After trying various technologies, Databricks (Spark) was chosen for compute, query, and analysis purposes. Tables were rebuilt as partitioned parquet files.
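A minimal sketch of the zone-plus-partition layout described above, in plain Python. The lowercase zone names and the `load_date` partition column are assumptions for illustration, not the project's actual naming.

```python
from pathlib import PurePosixPath

# Hypothetical zone names based on the examples in the text.
ZONES = ["raw", "standardized", "qa", "curated", "application"]

def partition_path(zone: str, table: str, load_date: str) -> str:
    """Build the storage path for one partition of a table in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    # Hive-style key=value partition folders let Spark prune partitions on read.
    return str(PurePosixPath("lake", zone, table, f"load_date={load_date}"))
```

For example, `partition_path("curated", "identity", "2020-01-01")` yields `lake/curated/identity/load_date=2020-01-01`.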

After a review of actual data lake use by paying customers, key tables were extensively de-normalized. Many support tables containing additional demographic or other attributes are available in the lake to be joined to the main tables.
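The join of a support table onto a main table can be sketched in plain Python (standing in for Spark DataFrames). The table and column names here are hypothetical.

```python
def denormalize(identities, demographics):
    """Left-join a demographic attribute table onto the main identity table."""
    demo_by_id = {row["person_id"]: row for row in demographics}
    out = []
    for person in identities:
        attrs = demo_by_id.get(person["person_id"], {})
        # Copy the attributes onto the main row; persons with no match keep
        # their original columns only (a left join).
        merged = {**person, **{k: v for k, v in attrs.items() if k != "person_id"}}
        out.append(merged)
    return out
```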

Undoing all that careful normalization:

Denormalized tables and new, special-purpose versions introduce redundant data, which traditionally creates data integrity issues. These issues were avoided by entirely rebuilding the complete structure with each influx of new data (careful partitioning helps).
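The rebuild-on-load pattern can be sketched as: each load writes a complete new version of the table, then a pointer is swapped to it in one step. The class and field names below are hypothetical, assumed for illustration.

```python
class RebuiltTable:
    """Sketch of a derived table that is regenerated, never updated in place."""

    def __init__(self):
        self.versions = {}   # load_id -> full table contents
        self.current = None  # pointer to the live version

    def load(self, load_id, rows, transform):
        # Build the entire denormalized structure from scratch ...
        self.versions[load_id] = [transform(r) for r in rows]
        # ... then publish it in one step; readers never see a half-update,
        # so redundant copies of a value cannot drift apart between loads.
        self.current = load_id

    def read(self):
        return self.versions[self.current]
```

Because nothing is ever modified in place, the classic update anomalies of redundant data simply have no chance to occur.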

With more complete versions of the data stored in the earlier zones, the Application zone, intended to support customer-facing applications, is very refined and optimized for that purpose. Nothing that is old, superseded, invalid, or not licensed for the app is present in that zone. Sloppy application coding cannot inadvertently expose the wrong data.
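The Application-zone filter described above amounts to a whitelist applied when the zone is built. A minimal sketch, assuming hypothetical flag names (`is_current`, `valid`, `source`) on each record:

```python
def application_zone(rows, licensed_sources):
    """Keep only records that a customer-facing app is allowed to see."""
    return [
        r for r in rows
        if r["is_current"]                   # not old or superseded
        and r["valid"]                       # not invalid
        and r["source"] in licensed_sources  # licensed for the app
    ]
```

Since filtering happens once, at build time, an application querying this zone cannot reach the excluded records at all, no matter how its queries are written.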

Build for the actual business need:

The core Identity tables had been placed in a "graph" structure. However, looking at the actual use of the graph in revenue-generating jobs indicated that much simpler structures could deliver the same result for customers, so that structure was altered. The simplified graph is less able to "do anything", but it very efficiently supports the actual business workloads.
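One common simplification of this kind is to replace a general graph (supporting arbitrary traversals) with a precomputed flat table mapping each identity to a cluster id, since a single lookup is all many workloads need. The sketch below is an illustration of that idea using union-find, not the project's actual method.

```python
def build_clusters(edges):
    """Flatten identity-link edges into an id -> cluster_id lookup table."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # merge the two clusters

    # Final flat table: one row per identity, ready to join like any other table.
    return {x: find(x) for x in parent}
```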

Helping users work with the new structures:

Extensive pre-defined table structures were created in Hive to allow analysts, app developers, and the services team to use familiar SQL queries. Migrating and re-architecting the data added power and reduced costs, but the highly experienced team was harder to "re-engineer". Bringing the data to them in a familiar form and a familiar query language made them immediately productive.
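The idea can be sketched with SQLite as a small stand-in for Hive: the rebuilt tables are exposed under pre-defined names, so users keep writing ordinary SQL and never see the storage details. Table, view, and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE identity (person_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO identity VALUES (?, ?)",
                 [(1, "Ann"), (2, "Bo")])

# A pre-defined view gives analysts a stable, familiar SQL surface.
conn.execute("CREATE VIEW curated_identity AS "
             "SELECT person_id, name FROM identity")
rows = conn.execute(
    "SELECT name FROM curated_identity ORDER BY person_id").fetchall()
```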

Adding new "Big Data" sources:

With the new power of the cloud platform and Spark, new data sources could be incorporated into the data lake and exploited to improve the company's products and services, including some very large-scale streaming sources. Some of these are so large that they must be processed, analyzed, and turned into useful metrics and aggregates shortly after receipt. Storing the raw streamed data over time and linking it directly to data lake tables would tax the performance of even these powerful systems, and is effectively useless. The aggregates and counts, however, are incorporated into the data lake and can enhance the base table, just as the other attribute tables do.
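The reduce-then-join pattern above can be sketched as: collapse the event stream to small per-key counts, which are then joinable like any other attribute table. The field names (`source_id`, `event_count`) are assumptions for illustration.

```python
from collections import Counter

def aggregate_stream(events, key="source_id"):
    """Reduce a high-volume event stream to per-key counts shortly after
    receipt; only these small aggregates are kept and joined into the lake."""
    counts = Counter(e[key] for e in events)
    return [{key: k, "event_count": n} for k, n in sorted(counts.items())]
```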

What was Jay Dean's role?

As Chief Architect and also "Product Owner" for the data lake development project, Jay worked daily with the data developers, authored the User Stories and guided the priorities for this massive program. Also, as Jay frequently performs exploratory analysis on the company data, and on new data sources, Jay was able to act as a key user and tester for the new data structures. 

As elements were completed, they were tested not only for integrity, completeness, and fulfillment of the requirements, but also for usefulness to an analyst or application. Many changes were implemented only after the first version had been subject to extensive internal use.

Key Findings:

1. When building a warehouse on owned servers, one must anticipate all the potential uses of the data system in future years. But in a cloud environment, adding new storage or processing capacity is quick. New and expanded versions of some data, or a special purpose data set, can be generated and made available in hours. One need not anticipate what *might* be needed a year from now. Rather, you can create a data structure to serve today's needs, and let tomorrow's dollars pay for tomorrow's needs. This data lake evolved to be less of an all-purpose datastore, becoming more tightly focused on the key business needs of today.

2. The need to capture and retain good metadata and process metrics is underappreciated. Also, a data lake becomes a big, hard-to-navigate place pretty quickly. Create and maintain data guides and dictionaries, or your users will become lost and frustrated. They will need help understanding when to use each zone, or each special-purpose view.

3. Cloud-based relational data systems can directly query the same partitioned files as your data lake, and give users and applications a familiar structure. Data that requires frequent updates can be stored in a separate data mart and then bulk-updated back to the lake. In addition, new services from cloud providers are making new architectures possible (Delta Lake from Databricks, for example). Expect your data lake to be as "fluid" as the metaphor implies.