Amazon Redshift was announced in November of 2012 and became the first cloud data warehouse, opening a whole new segment of technology. What is a cloud data warehouse exactly?
On-premises data warehouse
While cloud data warehouses are relatively new, at least from this decade, the data warehouse concept is not. A data warehouse is a data store designed for storing large quantities of data over a large period of time. It centralizes data from multiple systems into a single source of truth. It is often loaded in batches, with minimal updates, and read from repeatedly. Check out this ancient diagram of a data warehouse.
Given the data processing needs of a data warehouse, they tend to be implemented on massively parallel processing (MPP) systems. The MPP architecture replies upon a shared nothing concept for distributing data across various slices. Compute nodes are layered on top of the storage and processes queries for data residing in its local slice. The control node is responsible for taking a query and dividing it up into smaller queries to be run in parallel on the compute nodes.
So that is what is in the cloud, right? Hold on, we are almost there.
Data warehouse appliance
Before understanding the cloud data warehouse, it is important to understand data warehouse appliances. The term data warehouse appliance may have been coined by the founder of Neteeza but the first data warehouse appliance was likely made by Teradata in 1990. Data warehouse appliances follow the MPP architecture, which first came about in the early 1980s.
These data warehouse appliances have been the best options for large-scale data processing for some time. There are a variety of options, as well. To name a few…
These appliances delivered on the business need but also delivered eye-widening invoices and challenging scale-out options resulting in under-utilized systems. There are a number of benefits to a cloud data warehouse, however, I would argue that the two most important reasons to look into a cloud data warehouse are addressing the cost prohibitive nature of the data warehouse appliances and gaining the enhanced flexibility which is native to the cloud.
Cloud data warehouse
The modern data warehouse architecture has not changed must since the relic that I showed you in the first section but it has grown much. Rather than just ingesting data from a few operational systems, a data lake must be supported, third-party data, non-relational data, internet of things (IoT) data, social listening, machine learning, and predictive analytics are in play.
Each of the public cloud providers, with data warehouse offers, have very different means of implementing the same MPP concept. A cloud data warehouse is more than a data warehouse appliance in the cloud, however. These providers deliver a platform and ecosystem to house and support the cloud data warehouse, connecting the warehouse to data types/sources and services which are challenging to implement on-premises.
Who am I talking about?
- Amazon RedShift
- Microsoft Azure SQL Data Warehouse (ADW)
- Teradata IntelliCloud (deployed on Teradata, AWS, or Azure)
- Google BigQuery
- Oracle Database Cloud Exadata Service
- IBM Db2 Warehouse in the Cloud (formerly dashDB)
It is not possible to compare or list all the services in this blog post due to the number of cloud data warehouse providers and the extent in which they vary in their services and features. I will, however, highlight a few of my favorite features which are natively available in most of the cloud providers.
Companies like Microsoft, Amazon, and Google operate on such a large-scale that, to the individual customer, there are effectively no limits to the data storage capacity and compute power which can be leveraged. These companies offer solutions at the petabyte and exabyte scale. By the time anyone needs a zettabyte or yottabyte, it will be available.
In addition to the potential scale, scaling up an existing implementation is far easier, and faster, than it is on-premises. Each provider implements their systems differently and fulfills scaling requests at different degrees of speed and disruptive activities.
With Amazon Redshift, you can add nodes to your cluster and each node adds CPU, memory and disk space. The storage capacity jumps in increments as small as 160 GBs and as large as 16 TBs. Adding a node requires data to be redistributed, which can take a couple of hours, and provides compute power in a non-linear fashion.
With Microsoft Azure SQL Data Warehouse, the storage scales seamlessly and compute power is scaled by adding Data Warehouse Units (DWUs). Under the covers, more Azure SQL Databases are stood up to support the throughput but the details are abstracted away more than with Redshift. The scale operations offer a linear increase in throughput and are disruptive but only take minutes because the data does not need to be redistributed.
By renting out their hardware at a large-scale, cloud providers do not need to lock their customers into contracts for using it. This means that you can scale up and down to suit your needs. As mentioned once already, each cloud provider will offer diverse types of services and capabilities but I will, again, use Microsoft and Amazon as examples.
ADW requires all connections to be killed when changing scale but, outside of that, it will only take minutes to scale up or down. This opens opportunities to perform scale operations multiple times a day. No longer do you need hardware to meet your peak times and then under-utilize that hardware the rest of the day/week/month. You can, for example, scale up for a large set of ingestion processes over the weekend and then scale down for the rest of the week to save money.
Redshift also provides the ability to scale up and down. In Redshift, you do not need to kill queries to add nodes and the process is online, but puts the system into a read-only state until the data redistribution process is complete. Scaling down takes a bit longer. You must take a snapshot of the cluster and restore from the snapshot to a smaller cluster. This may take longer than with ADW but is orders of magnitude faster that it would be to order hardware and provision it on-premises.
Platform as a Service
Cloud data warehouses are platform as a service offerings. That means that the cloud provider is managing the infrastructure while you work only within the data warehouse software itself. To some, this is scary. You will no longer control when patches happen to the underlying hardware. However, you will never have to worry about performing the patches yourself. They are all very much on top of their game too. This post shows that all the Azure infrastructure was patched on the same day that the Meltdown vulnerability was announced. AWS and the other providers reacted in a comparable manner, making their servers more secure than most other on-premises data centers will be for many more months to come.
In addition to patches, they handle seamless replacement of end of life hardware. When new hardware is provisioned, you may see performance gains that you did not have to pay for due to newer generations of hardware natively performing better. Finally, these cloud providers can afford to invest heavily into engineering / optimizing their data centers and continue to drive costs down. Typically, customers never see prices increase, only decrease, unless you opt to elevate your performance tier.
I have conversed with a couple of cloud providers in a pre-sales capacity. With every company, they pushed very hard to sell me on their roadmap and the value of the cloud ecosystem. There is such emphasis on this because entering a relationship with a cloud provider is the same as working with a large development team who will build you services and products without being paid and then hope that you will like it enough to begin using them. They innovate and push the bleeding edge forward and then you get to pick and choice which pieces were successful and which pieces meet your business needs.
If one day your business decides that they want to incorporate R for predictive analytics, then the entry cost is extremely low. Depending upon the provider, you might be up and running in less than an hour with seamless integration to your data warehouse.
Cloud data warehouses are an exciting and evolving segment of technology. There is great value to any business who is in need of a data warehouse and enticing to organizations with existing data warehouse appliances coming up on their end of life. When investigating the various cloud providers, be sure to understand what is most important to your business. Each provider has its pros/cons. There is not a single clear winner among them. Having a weighted list of features and service level objectives is critical when choosing a vendor. I also recommend a full-scale proof of concept with your top 2-3 providers before making a final decision. Data warehouses, whether on-premises or in the cloud, are a significant investment, with significant value.