Monday, February 10, 2014

Redshift table optimizing with SORTKEY and DISTKEY.

People ask if it is good idea to split a big table into multiple tables based on values of column, for example date. This is an interesting question as Redshift supplies DISTKEY option for table definition already.

Keeping in mind that Redshift is a MPP database management system, which distributes data among its cluster. "The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is executed." (See detail from Redshift doc). So, "partition" a huge table in Redshift should be taken cared by DISTKEY.

Another key for optimizing Redshift table design is SORTKEY. I will think DISTKEY and SORTKEY together as "index" in Redshift. Also, it is good to keep in mind the order of sort keys. For example, if sort key is defined as SORTKEY(col1, col2), it should be good practice to know col1 has less variant values than col2 does. Also, making joined columns as sort key will help on performance of table join as a merge join may happen (see detail from Redshift doc).

Redshift have awesome online document to describe how to tune table. BTW, Redshift is a relational database. But, it is most likely not good idea to use it as your transaction database. Better to use Redshift for data warehouse instead of data mart platform.

http://stackoverflow.com/questions/17115069/what-does-it-mean-to-have-multiple-sortkey-columns/17510565#17510565