Amazon Redshift Temporary Tables

My company is in the middle of a migration to Amazon Redshift. Data warehouse design standards have been in question now that our platform is changing. One development team asked me whether they would be allowed to use temporary tables or not. They know that different database platforms handle temporary tables differently and there was concern that temporary table usage in Redshift would be an anti-pattern.

Should we use temporary tables in Redshift?

Yes.

There is nothing inherently wrong with using temporary tables in Amazon Redshift. They are effectively regular tables that are dropped automatically at the end of the session.
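For the comparison below, the temporary table can be populated from the permanent table with a pair of statements along these lines (assuming the demo.recent_sales table used throughout this post):

```sql
-- Session-scoped table; dropped automatically when the session ends.
create temporary table temp_recent_sales (orderkey bigint);

insert into temp_recent_sales (orderkey)
select orderkey
from demo.recent_sales;
```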

To prove the point, the two queries below read identical data, but one uses the demo.recent_sales permanent table and the other uses the temp_recent_sales temporary table.

explain
select *
from demo.recent_sales rs
inner join demo.factsales fs on fs.orderkey = rs.orderkey
inner join demo.dimsales ds on ds.orderkey = fs.orderkey;

XN Hash Join DS_DIST_NONE  (cost=101482942.03..222138377.88 rows=161 width=406)
  ->  XN Seq Scan on dimsales ds  (cost=0.00..53624637.44 rows=5362463744 width=144)
  ->  XN Hash  (cost=101482941.81..101482941.81 rows=90 width=262)
        ->  XN Hash Join DS_DIST_NONE  (cost=0.75..101482941.81 rows=90 width=262)
              ->  XN Seq Scan on factsales fs  (cost=0.00..45103528.96 rows=4510352896 width=254)
              ->  XN Hash  (cost=0.60..0.60 rows=60 width=8)
                    ->  XN Seq Scan on recent_sales rs  (cost=0.00..0.60 rows=60 width=8)

explain
select *
from temp_recent_sales rs
inner join demo.factsales fs on fs.orderkey = rs.orderkey
inner join demo.dimsales ds on ds.orderkey = fs.orderkey;

----- Update statistics by running the ANALYZE command on these tables -----
----- Tables missing statistics: temp_recent_sales -----
XN Hash Join DS_DIST_NONE  (cost=101482942.03..222138377.88 rows=161 width=406)
  ->  XN Seq Scan on dimsales ds  (cost=0.00..53624637.44 rows=5362463744 width=144)
  ->  XN Hash  (cost=101482941.81..101482941.81 rows=90 width=262)
        ->  XN Hash Join DS_DIST_NONE  (cost=0.75..101482941.81 rows=90 width=262)
              ->  XN Seq Scan on factsales fs  (cost=0.00..45103528.96 rows=4510352896 width=254)
              ->  XN Hash  (cost=0.60..0.60 rows=60 width=8)
                    ->  XN Seq Scan on temp_recent_sales rs  (cost=0.00..0.60 rows=60 width=8)

The plans wouldn’t pass a text compare test, but the important pieces are the same in each.

  • They both use DS_DIST_NONE.
  • Each step in the plan reads the exact same number of rows.
  • The same scan and join operators are used.
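The one real difference the second plan surfaces is the missing-statistics warning. A freshly loaded temporary table has no statistics, so running ANALYZE after populating it gives the planner accurate row counts for subsequent joins:

```sql
-- Refresh statistics on the temp table so the planner can
-- estimate row counts instead of warning about missing stats.
analyze temp_recent_sales;
```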

How to get the most out of your temporary tables

Read / write behavior

One difference between regular tables and temporary tables is how they are typically used. Temporary tables are session scoped, which means that adding them to a process or report will likely cause them to be created multiple times. While temporary tables behave much like regular tables, most regular tables are not rewritten every time they are queried.

The disk writes involved in populating the temporary table might be more expensive than the reads would be if you modified your query to fold the logic into one larger query. The frequency of the report or process will factor into how much of a performance hit you take by using temporary tables. If you are using temporary tables to make debugging a procedure easier or to enhance readability, make sure you understand the I/O cost of performing the writes and then reading that data back in a subsequent query.
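As a sketch of the single-query alternative, the same logic can often be folded into one statement with a common table expression, which avoids the extra write and read-back entirely (table and column names here follow the examples in this post):

```sql
-- CTE keeps the intermediate result in the plan rather than
-- materializing it to disk as a separate temp table.
with recent as (
    select orderkey
    from demo.recent_sales
)
select *
from recent rs
inner join demo.factsales fs on fs.orderkey = rs.orderkey;
```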

Ideal schema

Typically, temporary tables are created so that they can be joined into a query later in the session. It is important to create the temporary table with the best column encoding, distribution key, and sort key for the coming join operation.

The goal is to combine control of the table schema with ease of use and adaptability. It is easy to make sure the temporary table is built correctly: a CREATE TABLE statement followed by an INSERT INTO statement provides complete control.

create temporary table temp_recent_sales (orderkey bigint encode lzo distkey);

insert into temp_recent_sales (orderkey)
select orderkey
from demo.recent_sales;

The manual method might not be ideal, however. Any change to the distribution key of the underlying table(s) might change how the query should be joined. It would be convenient if the schema elements could be discovered dynamically each time the table was created.

The Create Table As (CTAS) method requires few lines of code and allows the temporary table to adapt from one creation to the next.

create temporary table temp_recent_sales as
select * from demo.recent_sales;

It is time to inspect the schema to verify that the temporary table matches expectations.

--Inspect base table
set search_path to demo;

select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'recent_sales';

--Inspect temp table
set search_path to public;

create temporary table temp_recent_sales as
select * from demo.recent_sales;
select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'temp_recent_sales';

Everything except the column encoding matches the base table. There is a discrepancy, but loading a single table into a temporary table is uncommon. What happens when the query involves joins?

create temporary table temp_recent_sales
as
select rs1.orderkey from demo.recent_sales rs1
full outer join demo.recent_sales rs2 on rs1.orderkey = rs2.orderkey;

select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'temp_recent_sales';

As soon as a join is included in the query, the distribution key is lost. This is the worst possible outcome because the distribution key is critical for join performance.
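If CTAS is still preferred, Redshift does allow the distribution style and key to be declared explicitly as table attributes in the CTAS statement itself, which avoids falling back to the default distribution (the column encodings are still chosen automatically):

```sql
-- Declare the distribution key explicitly so the join result
-- is not distributed with the default style.
create temporary table temp_recent_sales
diststyle key distkey (orderkey)
as
select rs1.orderkey
from demo.recent_sales rs1
full outer join demo.recent_sales rs2 on rs1.orderkey = rs2.orderkey;
```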

Another option is to use the Create Table Like method. With this method, the temporary table is created empty and an INSERT INTO statement is used to populate it.

create temporary table temp_recent_sales (like demo.recent_sales);

select "column", type, encoding, distkey, sortkey
from pg_table_def where tablename = 'temp_recent_sales';

insert into temp_recent_sales (orderkey)
select orderkey
from demo.recent_sales;

This will not work in all cases because the temporary table schema rarely needs to match the base schema of a single table exactly. While the use cases are more limited, the result is better: the distribution key, sort key, and encoding of all columns match.

Takeaway

For most temporary table use cases, it is best to manually create the temporary table and then populate it with an INSERT statement. The CTAS approach has limitations with column encodings, the distribution key, and the sort key. The Create Table Like method works great and is convenient, but it has limited use cases.
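Put together, the recommended pattern from this post looks like the following sketch (the encoding, distribution key, and sort key shown here are placeholders to be chosen for the upcoming join):

```sql
-- 1. Create the temp table with explicit schema control.
create temporary table temp_recent_sales (
    orderkey bigint encode lzo distkey sortkey
);

-- 2. Populate it.
insert into temp_recent_sales (orderkey)
select orderkey
from demo.recent_sales;

-- 3. Refresh statistics before joining against it.
analyze temp_recent_sales;
```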
