Getting Started: Greenplum OSS

Choose an Instance Family

There are four instance families supported for Greenplum: the c3/c4, i2/i3, d2, and r4 families. Read more below to narrow down which instance family best fits your needs.

    • C3/C4 Instance Family
      • The c3 instance family features the highest performing processors. It offers the lowest price per unit of compute performance of all the supported instance families, as well as the fastest CPU clock speeds. This is a great choice if you want to test out a small cluster on AWS or if your workload is especially CPU intensive; it gives the most bang for your buck.
    • I2/I3 Instance Family
      • The i2 instance family offers fast SSD storage optimized for very high random I/O performance. It provides high IOPS at a low cost. If your workloads perform lots of reading and writing, these instances will give you the best performance.
    • D2 Instance Family
      • The d2 instance family delivers high disk throughput and offers the lowest price per unit of disk throughput of all the instance families. These instances use traditional magnetic spinning hard drives instead of solid state drives, which lets them provide more capacity. This family is a good choice for workloads with a very large amount of data.
    • R4 Instance Family
      • The r4 instance family delivers high performance and large memory capacities for intensive analytics and business intelligence. These instances are great for analytic workloads over large datasets.

Choosing Instance Type From Family

Once you’ve narrowed down the instance family you want to use, you’ll need to choose the instance type. Instance types within a family share the same basic characteristics; what differs is the number of virtual CPUs, the amount of RAM, the network performance, the type of storage, and the amount of usable capacity for the Greenplum database. Besides those obvious differences, there are a few other factors to consider.

 

      • Storage
        • Ephemeral storage is temporary: if a node stops or is terminated, the storage is erased. Ephemeral disks have higher throughput per node than EBS but are more volatile and require more nodes for high availability.
        • EBS storage is permanent. You can shut down and reboot the cluster at will without losing data. EBS does have throughput and IOPS caps, however, which have to be tuned for your business case; this is usually addressed by adding more segment nodes.
        • The cluster can be run on either ephemeral or EBS storage.
      • Cost
        • Cost can be a huge factor in deciding which instance type to choose. In addition to the AWS Marketplace costs, there is the cost of running the cluster, which is charged per instance and per hour. Determining your budget for running Greenplum on AWS should help narrow down your choices. If you’re new to Greenplum and just want to test out its functionality, we recommend the c3.4xlarge or c3.8xlarge for ephemeral databases, or the i3.2xlarge or c3.4xlarge for EBS databases, because they have the best price-to-performance ratio.
      • Capacity
        • You’ll need to plan how much data you intend to store and leave some room for growth. The Usable Capacity Per Segment Host / Mirroring column below shows how much inserted data a single instance can hold. Determine the maximum capacity you want to support and divide it by the usable capacity per segment host; this gives the number of segment hosts required for that instance type to offer that capacity (see the worked example after this list). If capacity and storage matter most to you, go with one of the d2 instance types; otherwise, consider the other instance families.
      • Workloads
        • Knowing the type of workloads you run is important when choosing the right instance type. If you’re a DBA, you may already know the current bottlenecks of your existing system or the types of problems you’re able to solve. Refer to the instance families above to choose the right family based on your workload.
    • Summary of Instance Types
      • https://aws.amazon.com/ec2/instance-types/
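
As a worked example of the capacity arithmetic above, assume a hypothetical instance type offering 3 TB of usable capacity per segment host (with mirroring) and a target of 12 TB of data including growth; the numbers are illustrative only:

TARGET_TB=12            # total capacity you want to support (assumed)
USABLE_PER_HOST_TB=3    # usable capacity per segment host for the chosen instance type (assumed)
# Number of segment hosts = ceiling(target / usable capacity per host)
echo $(( (TARGET_TB + USABLE_PER_HOST_TB - 1) / USABLE_PER_HOST_TB ))    # prints 4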

Cluster Creation and RAID Creation/Wait Time

Spinning up the cluster takes a little time, primarily depending on the number of segment servers and the instance type you choose for the Greenplum cluster. Below is a table showing the instance types in a cluster size of 3 nodes (2 segment servers and 1 master). The CloudFormation Stack Creation Time is how long the cluster took from launch to usability. After a cluster is usable, it may still be syncing or building the RAID arrays in the background; this is what the RAID Sync / Build Wait Time refers to. Some clusters take longer to finish this step depending on the size and type of the instance storage (HDD or SSD). Until the RAID arrays are completely synced or built, reads and writes to the database will take longer than normal, but the database will still be functional.
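
If you want to check whether that background sync has finished, and assuming the template assembles the instance storage with Linux software RAID (mdadm), a quick sketch of what to run on each host is shown below; the /dev/md0 device name is an assumption:

# Show the sync/rebuild progress of all software RAID arrays on this host
cat /proc/mdstat
# Inspect one array in detail (replace /dev/md0 with the actual device)
sudo mdadm --detail /dev/md0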

Example: Cluster Creation Times

Instance Type   # Segment Servers   vCPUs / Segment Server   CloudFormation Stack Creation Time   RAID Sync / Build Wait Time
c3.4xlarge      2                   8                        0:10:21                              0:52:00
c3.8xlarge      4                   8                        0:11:17                              1:46:12
i2.2xlarge      2                   4                        0:11:36                              1:02:00
i2.4xlarge      4                   4                        0:10:06                              0:35:05
i2.8xlarge      6                   5.3
d2.2xlarge      2                   4                        0:42:34                              3:54:00
d2.4xlarge      2                   8                        0:45:25                              3:47:13
d2.8xlarge      6                   6                        0:48:55                              4:08:53

Sample Network and Disk Benchmarks

The tables below show the results of a network bandwidth test and disk performance tests, gathered by benchmarking small clusters of each instance type to get a general sense of network and disk performance. The network tests were done with a simple upload and download network throughput program run between two instances in the cluster. The disk I/O tests were done using Greenplum’s gpcheckperf utility and fall into two categories: total bandwidth and average bandwidth.

The Disk Read / Write Total Bandwidth is the total I/O across the cluster. It depends on the Total Instance Count, so more instances in your cluster means a higher total bandwidth; this gives a good idea of what the cluster can read and write in parallel. The Disk Read / Write Bandwidth Avg is the average I/O per instance, which is simply the total read or write bandwidth divided by the total instance count.
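
As a rough sketch of how these disk and network numbers can be collected with gpcheckperf (the hostfile and data directory paths below are assumptions for illustration):

# Disk I/O and memory stream tests across the hosts listed in the hostfile, with per-host results
gpcheckperf -f /home/gpadmin/hostfile_segments -r ds -D -d /data1/primary -d /data2/primary
# Parallel-pair network throughput test between the same hosts
gpcheckperf -f /home/gpadmin/hostfile_segments -r N -d /tmp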

Instance Type   Total Instance Count   # Segment Hosts   Network Test (Gb/s)   Disk Write Total Bandwidth   Disk Read Total Bandwidth   Disk Write Bandwidth Avg   Disk Read Bandwidth Avg
c3.4xlarge      3                      2                 1.993                 196.02 MB/s                  854.77 MB/s                 65.34 MB/s                 284.92 MB/s
c3.8xlarge      3                      2                 9.283                 183.28 MB/s                  892.1 MB/s                  61.09 MB/s                 297.36 MB/s
i2.2xlarge      3                      2                 0.995                 1257.38 MB/s                 1128.26 MB/s                419.13 MB/s                376.08 MB/s
i2.4xlarge      3                      2                 46.8                  2062.83 MB/s                 4624.66 MB/s                687.61 MB/s                1541.55 MB/s
i2.8xlarge      3                      2
d2.2xlarge      3                      2                 2.491                 1785.96 MB/s                 1944.42 MB/s                595.32 MB/s                648.14 MB/s
d2.4xlarge      3                      2                 5                     3237.73 MB/s                 3507.0 MB/s                 1079.24 MB/s               1169.0 MB/s
d2.8xlarge      3                      2                 9.056                 4901.66 MB/s                 6644.03 MB/s                1633.88 MB/s               2214.67 MB/s
c4.8xlarge      5                      4                                       3793.02 MB/s                 3841.99 MB/s                952.41 MB/s                960.27 MB/s

Sample TPC-DS Benchmarks

This last table summarizes unofficial TPC-DS benchmarks that were run on each instance family to compare how the instance types perform on this specific type of benchmark. The benchmark first generates the data, runs the DDL, loads the data, and then runs 100 queries. The results below are the sum of how long each query took for that portion of the test. It is worth noting that the dense storage instances are significantly slower at creating the tables and loading the data, but the actual queries in the single user SQL tests are faster. This is most likely due to how this benchmark was implemented and how the loads are run. The data generation and SQL tests, along with the disk I/O tests above, show that this family of instances is a very good option for running a Greenplum database cluster.

Instance Type   TPC-DS Data Size   # Segment Hosts   Generate Data   DDL       Load Data   Single User SQL Tests
c3.4xlarge      100 GB             2                 0:37:16         0:00:21   0:31:46     0:03:12
c3.8xlarge      100 GB             2                 0:18:33         0:00:19   0:21:51     0:03:04
i2.2xlarge      100 GB             2                 0:20:02         0:00:14   0:16:28     0:03:25
i2.4xlarge      100 GB             2                 0:39:29         0:00:07   0:29:20     0:03:31
i2.8xlarge
d2.2xlarge      100 GB             2                 0:18:32         0:13:11   1:26:50     0:03:08
d2.4xlarge      100 GB             2                 0:36:53         0:08:58   1:27:54     0:03:15
d2.8xlarge      100 GB             2                 0:12:27         0:12:09   1:04:28     0:03:08
c4.8xlarge      100 GB             4                 0:09:23         0:00:27   0:09:00     0:00:51

Our benchmarking suite is available here: git@github.com:zdata-inc/oss-greenplum-tpcds.git
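
If you want to reproduce these runs, the suite can be cloned directly; consult the repository itself for the data generation and query steps:

git clone git@github.com:zdata-inc/oss-greenplum-tpcds.git
cd oss-greenplum-tpcds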

Connecting to Your Cluster

1. Using the master IP and the private key generated during stack creation, SSH into your host:

ssh -i <path/to/private-key.pem> ec2-user@<master-ip>

2. Make sure you’re connecting from an IP in the RemoteAccess CIDR whitelist that was specified at creation time.
Then change to the gpadmin user:

sudo su - gpadmin

3. Add a database user:
> psql -d postgres

> create user myname with password 'password';

> alter user myname login;

4. Add a line to the master’s pg_hba.conf for the client IP address you will connect from:
host all myname <client-ip>/32 md5
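
For example, a minimal sketch of appending that entry as gpadmin, where 203.0.113.5 stands in for your client IP and $MASTER_DATA_DIRECTORY points at the master's data directory:

# Append an access rule for the new user (replace 203.0.113.5 with your client IP)
echo "host all myname 203.0.113.5/32 md5" >> $MASTER_DATA_DIRECTORY/pg_hba.conf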

5. Reload the Greenplum configuration:
> gpstop -u

You may now reconnect with the database user you created on port 5432 using a Postgres or Greenplum client.
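
For example, from a machine inside the RemoteAccess CIDR (the master IP below is a placeholder):

psql -h <master-ip> -p 5432 -U myname -d postgres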