Amazon Redshift is a massively popular data warehouse service that lives on their AWS platform, making it easy to set up and run a data warehouse. Redshift clusters can range in size from the hundred-gigabyte scale up to the petabyte scale, and can be set up without having to purchase, install and manage the hardware yourself. This guide will demonstrate how to set up a Redshift instance, load data, and run some basic queries on the data once it’s loaded.
First, we’ll assume you have an AWS account set up already. If not, you can head here first and choose Create an AWS account. Don’t worry, we’ll wait.
Once you’ve got your console set up, you can launch a Redshift instance and start querying your data. But since you’ll probably be pulling data for from a separate AWS instance (i.e. an S3 bucket), you’ll need to create and attach an IAM role for your Redshift warehouse. This will allow you to share data securely between your S3 bucket and your Redshift cluster.
CREATING AN IAM ROLE
Log in to your IAM console and/or go to the Roles pane on the left side of the console (direct link here). Clicking Create Role will show a list of possible AWS services. Next, select Redshift. On the next page, under “select your use case”, select Redshift - Customizable and then select Next: Permissions at the bottom right of the screen.
On the following page, you’ll see a scrollable list of pre-defined access policies in alphabetical order. Scroll down and choose AmazonS3ReadOnlyAccess, then select Next in the bottom right of the screen again.
On the following page, you’ll create a tag for your IAM role. You can use this panel to create tags for your IAM role. Enter something easy to remember in the “Key” field and then select next (again in the bottom right corner of the screen). You can also skip this screen if you don’t feel like adding a tag.
Then you’ll create a name for the role. Again, choose something easy to remember for this role. We’ll use myRedshiftRole for the purposes of this tutorial. Amazon enters a default description in the Role Description form, but you can add whatever you like here. Once you’ve entered the relevant information, select Create role in the bottom right of the page.
Next, you’ll see a screen listing all your AWS IAM roles. If you followed all the previous steps, myRedshiftRole should be there.
LAUNCHING YOUR CLUSTER
In order to launch a Redshift cluster, head to the Redshift dashboard. We’ll assume you’re the primary user of this AWS account and have all relevant permissions for the purposes of this tutorial, but if not, you may need to reconfigure some of your IAM user settings. You can read about that here. Once you’re in the Redshift dashboard, you should see Quick launch cluster and Launch cluster near the top of the screen.
We’ll select Quick launch cluster. If you’re already a user of AWS, you probably have a preferred region for your AWS instances, but you can set or change the region you want to launch your Redshift instance in before proceeding to the next step. Otherwise, this will launch a Redshift cluster in your default region.
Now you’ll need to configure your cluster. For the purposes of this tutorial, we’ll stick with the lowest-end defaults that AWS offers, but you can easily customize your launch here.
Unless you need something bigger, leave the Node type (default is dc2.large) and number of compute nodes at the default values. Of course, depending on your needs, you might need a larger and/or more powerful cluster. As the size of your data grows, you will need to add nodes to your cluster. It’s worth noting, however, that resizing your Redshift cluster will involve some downtime during which you won’t be able to take in new data, so keep your future needs in mind if you’re using this guide to set up a cluster you think you might be using for some time.
For the purposes of this tutorial, stick with the default Master user name (“awsuser”) and set your own password for the cluster. Leave the database port set to 5439. At the bottom of the list, you’ll see Available IAM roles. This is where you’ll attach the IAM role you created earlier. Select it from the drop-down list and select Launch cluster in the bottom right of the launch screen. Your cluster will likely take a few minutes to spin up, but you can monitor its status by going to the clusters pane of the Redshift dashboard. You should see something that looks like this:
Once it’s ready, you can start with data ingestion and query writing. Before, that, though, you’ll want to configure access to your cluster. It’s going to live on the internet, so you’ll want to make sure it isn’t open to just anyone.
CONTROLLING ACCESS TO YOUR REDSHIFT CLUSTER
This tutorial assumes you launched your cluster from the AWS EC2-VPC platform, but the steps are relatively similar to those you would use to launch from EC2-Classic.
First, find your cluster in Redshift console > Clusters. Select the cluster you just launched and head to the Configuration tab. Head to Cluster Properties and choose a security group under VPC Security Groups. You should see a section that looks like this:
Click the linked VPC security group and you’ll see this screen:
This will open a new pane with a space to list the rules you’ve set for this security group. Select the Inbound tab. Choose Edit, then select the following:
- Type: Custom TCP Rule
- Protocol: TCP
- Port Range: 5439 (unless you configured your cluster to use a different port, in which case use that one)
- Source: Select Custom IP and enter an IP address or set of addresses that you will likely be connecting from. If you’re just playing with this to run through the demonstration, you can enter 0.0.0.0/0, but please do not do this in production, as it will open the port to any computer with internet access and leave a gaping security hole in your data warehouse
Hit Save in the lower right corner
Now you’re all set to connect to your Redshift cluster using Amazon’s proprietary query editor, your own favorite SQL client, or whatever other method you prefer.
LOADING DATA FROM S3 INTO YOUR REDSHIFT CLUSTER
AWS quite helpfully provides some sample data in an easy-to-access S3 bucket for the purposes of demoing a Redshift cluster, so we’ll use their data for the next part of this tutorial. The principles are essentially the same for any data stored in an S3 bucket, though, so if you’d prefer to use this guide to work with your own data, feel free.
To get started loading data in, we’ll first need to configure the schema for the table we’re going to build. Amazon provides a number of pipe-delimited text files in their sample S3 bucket, but for now, we’ll just run through building a table for one of them. Let’s set up the table--we’ll call it users:
create table users( userid integer not null distkey sortkey, username char(8), firstname varchar(30), lastname varchar(30), city varchar(30), state char(2), email varchar(100), phone char(14), likesports boolean, liketheatre boolean, likeconcerts boolean, likejazz boolean, likeclassical boolean, likeopera boolean, likerock boolean, likevegas boolean, likebroadway boolean,
Next, we’ll need the name of the IAM role (AKA the ARN) we created for our Redshift instance at the beginning of this tutorial. This is because the most straightforward way to pull S3 data into our Redshift cluster is by using the COPY command, like so:
copy users from 's3://awssampledbuswest2/tickit/allusers_pipe.txt'
delimiter '|' region 'us-west-2';
Make sure to change the value of aws_iam_role to whatever ARN you defined on your end before running this command. It should have a structure something like arn:aws:iam::123456789012:role/myRedshiftRole. You can copy it directly from the IAM roles section of your AWS console. After that, you’ll have a table in your database named users, and you can start running SQL queries.
For example, suppose you wanted to see how many users in each city like broadway (Postgres dialect):
SELECT city, COUNT() as num_users FROM users WHERE likebroadway
But don’t stop there. Play around with whatever data you’d like in your new Redshift instance.
It may seem like a lot all together, but setting up Redshift boils down to a few steps that are fairly straightforward on their own:
- Creating an IAM role for your Redshift instance
- Attaching that role
- Determining the current and potential future size requirements for your Redshift cluster
- Launching your cluster
- Launching an S3 bucket with your data
- Loading data from an S3 bucket using the COPY command
- Installing and configuring a SQL client (if necessary)
- Querying your data
THE EASY WAY
At this point, depending on your familiarity with Redshift, SQL and AWS, your head may be spinning from all the steps involved in this process. This guide was adapted from the AWS docs in order to simplify the process for new users, but it still has quite a few moving parts.
You might be wondering if there’s an easier way to do all of this, one that doesn’t require getting involved with the nitty-gritty of defining table schema, managing security groups and installing SQL client drivers. In that case, you might be interested in using Panoply’s data warehouse solution, which lives on top of Redshift and automates most of the nitty-gritty of setting up a Redshift cluster on your own, meaning you can go directly from data ingestion to analysis. You can load structured data straight from your local machine, an S3 bucket you already maintain, or a number of other data sources, including both SQL and NoSQL-style databases. With Panoply, the process looks more like:
- Create a Panoply data warehouse
- Load data directly from wherever it’s stored using our simple web interface
- Query the data directly using our web-based SQL editor
- Connect directly to your favorite BI analysis tool
Everything else is automated behind the scenes, including scaling, table configuration and data warehouse maintenance.
Check out next week's post to see an in-depth look at how it works.