Microsoft Azure provides a convenient feature when creating a new Azure SQL Data Warehouse database. You can create the database from a backup of the Microsoft sample database, AdventureWorksDW. This is super fast and easy. It also has huge tables like dbo.FactInternetSalesReason with 64,515 rows!
Oh, hold on I think I typo’ed. No, actually that is all. For the scale and performance of SQL Data Warehouse, 64,515 rows in a fact table is quite small.
On four separate occasions, I decided to build a larger SQL Data Warehouse for testing and demonstrations. My first attempt was using the copy of StackOverflow’s database which Brent Ozar maintains as a BitTorrent, here. Once I spent all the time to download the database and then export it in a format which is easy to upload and load into SQL Data Warehouse, I realized that I still needed to enhance the scale of the tables. Also, it is not a data warehouse database. I wanted a star schema and some data which is more like the types of scenarios that I wanted to demonstrate.
I then pivoted to using the Azure provided copy of AdventureWorksDW but then went about extending the table sizes. I made it work but I was disappointed with the amount of effort it would require to make the data look real and proportionally increasing all the dimensions with the facts.
Calling a cab
The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.
I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GBs) and they download into easy to import comma separated values (CSV) files.
Procedure for data import
Downloading the files is a matter of calling a URL for each file, one per month. I did this manually because queueing up the file downloads did not take very long. However, if you wish you do something more scripted, you can just call the below URL with the year and month number incremented.
The complete data dictionary is also available on the TLC website in PDF format.
Create an Azure blob storage account and create a container within to hold your CSV files.
Download and install Microsoft’s AzCopy utility. You can do this in Command Prompt or PowerShell but the install does not append to your PATH environment variable, which makes going straight to the utility easiest.
Run a command, similar to this one, to upload your entire directory of files to blob storage.
AzCopy /Source:C:\temp\active-upload /Dest:https://sqlhammersandbox.blob.core.windows.net/inbox /DestKey:N6xsupersecretprimarystorageaccountkeyqhk9Bmgdw== /Pattern:"*.csv" /NC:1
Remove the ‘/NC 1’, if you have good or great upload speeds. I am limiting my upload to one file at a time because my upload is terrible and the default parallel copy was timing out.
SQL Data Warehouse Setup
With at least one file in blob storage, this T-SQL script will create the necessary objects for loading the data. Just update the CREDENTIAL SECRET, master key password, and LOCATION.
--/* CREATE MASTER KEY ENCRYPTION BY PASSWORD='MyP@ssword123secretword'; CREATE DATABASE SCOPED CREDENTIAL AzureBlobsqlhammersandbox WITH IDENTITY = 'credential', Secret = 'N6xsupersecretprimarystorageaccountkeyqhk9Bmgdw==' CREATE EXTERNAL DATA SOURCE dsinbox WITH ( TYPE = HADOOP, LOCATION = N'wasbs://firstname.lastname@example.org/', CREDENTIAL = AzureBlobsqlhammersandbox ) CREATE EXTERNAL FILE FORMAT csv WITH ( FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS ( FIELD_TERMINATOR =',' ,FIRST_ROW = 3 ,DATE_FORMAT = 'yyyy-MM-dd HH:mm:ss' ) ); --*/ --/* CREATE EXTERNAL TABLE ext_Yellow_Cab_NYC ( vendorid tinyint null, tpep_pickup_datetime datetime2(3) null, tpep_dropoff_datetime datetime2(3) null, passenger_count tinyint null, trip_distance_miles decimal(19,2) null, ratecodeid tinyint null, store_and_fwd_flag char null, pulocationid int null, dolocationid int null, payment_type tinyint null, fare_amount decimal(19,2) null, extra decimal(19,2) null, mta_tax decimal(19,2) null, tip_amount decimal(19,2) null, tolls_amount decimal(19,2) null, improvement_surcharge decimal(19,2) null, total_amount decimal(19,2) null ) WITH ( LOCATION='/', DATA_SOURCE = dsinbox, FILE_FORMAT = csv ) ; select TOP 10 * from ext_Yellow_Cab_NYC --*/ --/* CREATE TABLE [dbo].[Yellow_Cab_NYC] ( vendorid tinyint null, tpep_pickup_datetime datetime2(3) null, tpep_dropoff_datetime datetime2(3) null, passenger_count tinyint null, trip_distance_miles decimal(19,2) null, ratecodeid tinyint null, store_and_fwd_flag char null, pulocationid int null, dolocationid int null, payment_type tinyint null, fare_amount decimal(19,2) null, extra decimal(19,2) null, mta_tax decimal(19,2) null, tip_amount decimal(19,2) null, tolls_amount decimal(19,2) null, improvement_surcharge decimal(19,2) null, total_amount decimal(19,2) null ) WITH ( DISTRIBUTION = HASH ( vendorid ), CLUSTERED COLUMNSTORE INDEX ) --*/
With all of your files in Azure blob storage, simply INSERT INTO your table from the EXTERNAL table.
begin tran insert into Yellow_Cab_NYC select * from ext_Yellow_Cab_NYC commit
This data set is not a star schema, being that it is only one table. However it is; large, can be sized based on how much data you feel like loading, and is realistic without needing to be massaged.
The only thing left to do is enjoy analyzing the taxi fares of NYC.