In today's digital age, data is king. Every organization, big or small, collects vast amounts of data, and it is essential to have the right infrastructure in place to store, manage, and analyze this data effectively. The data infrastructure industry plays a vital role in providing businesses with the tools they need to process and make sense of their data. In this write-up, we'll briefly explore the data infrastructure industry, focusing on the fundamental concepts and key technologies involved. While the data infrastructure industry can be quite complex, this post will aim to provide a high-level understanding of the industry's main components and how they work together.
Historically, each company stored its data in large data centers that it owned and operated. These on-premise ("on-prem") warehouses were the predecessors to the cloud data warehouses we see today, where companies lease data center resources from a third-party provider. The shift to the cloud was driven by exploding data volumes and plummeting storage and compute costs (think ~$700K/TB for compute and ~$4K/TB for storage in 2000 vs. ~$2K/TB and <$15/TB today). The biggest cloud providers, from most to least popular, are Amazon's AWS, Microsoft's Azure, and Google's Google Cloud Platform (GCP). The cloud provides lower cost and better scalability than on-prem solutions because these providers charge via a consumption-based model, leaving businesses with few fixed costs for storing their data.
The shift to the cloud also changed how we upload data to warehouses. Originally, data was uploaded via an extract, transform, load (ETL) process, where transformations happened before the data reached the warehouse. This helped standardize data in the warehouse but was expensive, tedious, and hard to scale. So we saw a shift to an extract, load, transform (ELT) process, where transformation happens in the cloud data warehouse (CDW) after the data has been loaded, meaning CDW platforms (e.g., Amazon Redshift, Snowflake, and Google BigQuery) don't need external resources to transform data.
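The ETL/ELT distinction can be sketched in a few lines of Python. This is a toy illustration, not any vendor's API: the "warehouse" is just a list, and the extract/transform functions are hypothetical stand-ins.

```python
# Toy sketch contrasting ETL and ELT. The "warehouse" and "lake" are
# plain Python lists; extract() and transform() are made-up stand-ins.

def extract():
    # Raw rows as they arrive from a source system.
    return [{"name": " Ada ", "spend": "120"}, {"name": "Grace", "spend": "45"}]

def transform(rows):
    # Clean up whitespace and types.
    return [{"name": r["name"].strip(), "spend": int(r["spend"])} for r in rows]

def etl(warehouse):
    # ETL: transform happens *before* loading into the warehouse.
    warehouse.extend(transform(extract()))

def elt(lake, warehouse):
    # ELT: raw rows land first; transformation runs later, "inside"
    # the warehouse platform.
    lake.extend(extract())
    warehouse.extend(transform(lake))

wh1, raw, wh2 = [], [], []
etl(wh1)
elt(raw, wh2)
assert wh1 == wh2  # same end state; the difference is *where* transform runs
```

The end result is identical; what changed with ELT is that the transform step moved into the CDW, which scales compute on demand.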
First came the cloud data warehouse, where data is stored in structured form (ETL is needed before data is loaded). The data warehouse affords data consistency and fast, easy analytics (via SQL). In a move driven by Snowflake, warehouse compute and storage were decoupled, allowing companies to scale their storage and compute resources independently and lowering the barriers to maintaining a database even further. As data volumes exploded, the data lake architecture arose. The data lake lets companies store large amounts of both structured and unstructured (raw, unformatted) data; it offers more flexibility and is cheaper than the warehouse architecture. Most recently came the data lakehouse architecture, often supported by an open-source storage layer running on top of the data lake (such as Delta Lake). On top of this layer, companies can still do data processing (via Spark DataFrames) for ML and data science and access data with SQL for business intelligence (BI). The lakehouse thus affords the benefits of both the data lake (flexibility to store structured and unstructured data, low-cost storage) and the data warehouse (easy BI through SQL).
Key data storage players include Snowflake (warehouse) and Databricks (lake, lakehouse).
Snowflake is the older platform, meaning most data infra tools were built for it first. Data is stored in a semi-structured format, which is better suited to BI. Though Snowflake has historically had limited AI/ML support, it is launching Snowpark to let developers program in Python, Scala, or Java directly on Snowflake. This means developers no longer need to export data from Snowflake to perform AI/ML tasks (a move to compete with Databricks).
On Databricks, data can be stored in any form. Databricks has many of the same integrations as Snowflake and can be thought of as the more developer-friendly option given its superior AI/ML support; it is also the cheaper option, especially at scale. Because Databricks started as a data lake, SQL queries (which need structured data) were originally run outside the platform. Recently, however, Databricks launched Databricks SQL to let developers perform SQL and BI analysis directly on Databricks (a move to compete with Snowflake).
So, we see both platforms converging to similar offerings, both competing to become the go-to data cloud.
Amazon Redshift and Google BigQuery are other popular cloud data warehouse solutions.
When we think of the modern data stack (MDS), we can think of four main components: 1) data collection and loading, 2) data storage, 3) data transformation, and 4) data modeling and analysis.
1) Data collection and loading: data is collected from sources such as Salesforce, MySQL databases, Excel, Google Analytics, and Stripe logs. For the most part, this data is unstructured and must be loaded into the data storage provider via custom Python scripts or a data ingestion tool. Fivetran is a commonly used ingestion tool that lets users send information from data sources to data destinations in a few simple steps, without code.
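A "custom Python script" of the kind mentioned above might look like the sketch below. Everything here is hypothetical: `fetch_stripe_like_events()` stands in for a real API client, and an in-memory SQLite database stands in for the cloud data warehouse.

```python
import json
import sqlite3

# Hypothetical stand-in for a real API client (e.g., pulling payment events).
def fetch_stripe_like_events():
    return [
        {"id": "evt_1", "type": "charge", "amount_cents": 1200},
        {"id": "evt_2", "type": "refund", "amount_cents": -300},
    ]

def load(events, conn):
    # ELT-style: land the raw payload as-is; transform later in the warehouse.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS raw_events (id TEXT PRIMARY KEY, payload TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO raw_events VALUES (?, ?)",
        [(e["id"], json.dumps(e)) for e in events],
    )

conn = sqlite3.connect(":memory:")  # stands in for the CDW
load(fetch_stripe_like_events(), conn)
print(conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0])  # 2
```

Tools like Fivetran exist precisely so teams don't have to write and maintain scripts like this for every source.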
2) Data storage: data is loaded into storage providers (Redshift, BigQuery, Snowflake, etc.), discussed above.
3) Data transformation: once data has landed in a storage provider, it may need to be cleaned via transformations so that it can be properly analyzed. Most CDWs allow basic SQL transformations directly on their platform. An example would be joining an Orders table and a Customers table to create a Customer_orders table, aggregating the total number of orders and amount spent per customer, then keeping only the important columns (customer_id, num_orders, total_spend). Other tools like dbt (the most popular data transformation tool) integrate directly with CDWs and can be used for more complex, modular transformations.
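The join-and-aggregate transformation described above can be written out in SQL. Here it runs against an in-memory SQLite database (standing in for a CDW), with a few made-up rows so the result is concrete:

```python
import sqlite3

# In-memory SQLite stands in for a cloud data warehouse; rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);
""")

# Join Orders and Customers, aggregate per customer, keep only the key columns.
conn.execute("""
CREATE TABLE customer_orders AS
SELECT c.customer_id,
       COUNT(o.order_id) AS num_orders,
       SUM(o.amount)     AS total_spend
FROM customers c
JOIN orders o ON o.customer_id = c.customer_id
GROUP BY c.customer_id
""")

print(conn.execute("SELECT * FROM customer_orders ORDER BY customer_id").fetchall())
# [(1, 2, 65.0), (2, 1, 15.0)]
```

In a dbt project, that `SELECT` would live in its own model file, letting the transformation be versioned, tested, and composed with others.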
4) Data modeling and analysis: this is where the BI analysis and AI/ML methods sit. Data visualization also falls under this category. While BI modeling, AI/ML, and data visualization can take many forms, the most basic use case is creating data dashboards, through a platform such as Looker, to visualize data in storage. Examples include creating a line graph to plot sales over time, a bar graph to visualize how many users were acquired each month, and a pie chart to show the distribution of acquired users by acquisition source.
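Behind a chart like "users acquired per month" is a simple aggregation over the stored data. The sketch below computes the inputs for the bar chart and pie chart described above; the signup records are made-up illustrative data, and a BI tool like Looker would do this grouping for you.

```python
from collections import Counter

# Made-up signup records; in practice these would come from the warehouse.
signups = [
    {"user": "u1", "month": "2023-01", "source": "ads"},
    {"user": "u2", "month": "2023-01", "source": "organic"},
    {"user": "u3", "month": "2023-02", "source": "ads"},
]

by_month = Counter(s["month"] for s in signups)    # bar chart: users per month
by_source = Counter(s["source"] for s in signups)  # pie chart: acquisition mix

print(dict(by_month))   # {'2023-01': 2, '2023-02': 1}
print(dict(by_source))  # {'ads': 2, 'organic': 1}
```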
As shown in the data infra roadmap below (created by Bessemer), other data services include observability (maintaining the health and security of data, e.g., Datadog and Monte Carlo) and orchestration (scheduling and coordinating data pipelines, e.g., Airflow).
In conclusion, the data infrastructure industry is a crucial component of modern technological advancements, and it is constantly evolving to meet the increasing demands of data-driven businesses. This write-up has only provided a glimpse into this complex and rapidly changing field, and there is much more to be explored and learned.
1: Conversations with Oliver Zhang
2: Bessemer data infra roadmap