As an ETL/ELT Architect at zData Inc., I spend much of my time experimenting with new systems and technologies. In the ever-changing landscape of data warehousing, it is my belief that data warehouses will continue to thrive and not fade away. The challenges that enterprises face with huge volumes and varieties of data will still exist but may be addressed in new ways.
The exploration of new tools becomes a daily encounter and choosing/deciphering between the correct toolsets for your enterprise can be daunting. It is natural to feel amazed by current storage capacity (in Petabytes) and the speed at which the data can be processed and retrieved. This makes it easy to get lost when trying to identify which toolset is the right one to use for your enterprise.
Through my learnings I hope to provide education and information on new toolsets and technologies that should be helpful in choosing the right ongoing solutions. This will be helpful when looking to buy for the problem that needs to be solved, rather than buying because of market and technology hype
I have categorized the modern data tools into 3 types.
- Hadoop File System
- In-memory DB
- Cloud Data Warehousing.
This is a high-level overview; I would recommend everyone to dive deeper into each technology for further insights.
Hadoop File System can really store huge amounts of data and can be scaled to multiple nodes (in the thousands ) with unlimited memory. It uses the Shared Nothing Architecture and the data can be replicated to any number of times. The common but not a critical error anymore, is the failure of nodes, which is overcome by replication. Map-Reduce is the famous algorithm to access data from and to HDFS. There are a numbers of tools in the market to access HDFS as easily as accessing from traditional database. It can store structured, semi-structured and unstructured data and likes to have bigger chunks of data moved and retrieved from it rather than in small chunks.
Recommended Places: – Big Data storage includes huge volumes of data both unstructured and semi-structured. In the typical data warehousing world it would be for archiving history data and when needed can be pulled out into a small data mart. The data can be used to generate reports or if you are willing to compromise the speed, direct reporting can be done from HDFS.
Not Suitable For: – Slowly changing dimension. Fast and quick reporting out of a relatively small data set.
Leading Providers: – Hortonworks, Cloudera, and Pivotal
Cassandra, Couchbase, and Mongo DB are a very good examples of in-memory Database. In-Memory DB also has the same principle of shared nothing architecture, high scalability and replication factor to the users choice. The biggest advantage of this is a huge memory amount allowing the retrieval of data to be much quicker. The idea behind the in-memory DB. is to store the data as denormalized as possible so the data retrieval is quicker, The mantra here is writes are pretty cheaper, so redundancy is compromised but the read should be direct. They do no support joins, so every table (or column family) should support for each user report.
Data modeling should follow Top down technique meaning all user queries and request should be present well in advance and then the model should be done for those queries. Few tools even support storing the data in the order you want which helps in doing some direct and meaningful limits on the queries. It supports updates and it handles semi-structured data (JSON) really well. It is very suitable for having lot many user requests at the same time.
Recommended Places: – Quick Data retrieval, Pre-defined the set of reports (airline booking for example), JSON semi-structured data.
Not Suitable For: – Really big data like Petabytes, which is not the suitable use case, not for storage. Joins and normalized data models.
Leading Providers: – Cassandra, Couchbase, and Mongo DB.
Cloud Data Warehousing
As the name suggests, the data warehousing concepts still work here, but in a much cheaper, scalable and easily managed way. Cloud data warehousing gives the power of MPP (powerful Analytics) databases like Teradata, Netezza, and Greenplum at a much cheaper way with providing an open opportunity for more scalability. High capacity storage has been getting cheaper and cheaper but the appliances (inbuilt software and hardware) like Teradata, IBM Netezza (Pure Data for Analytics) and Pivotal Greenplum on DCA are expensive and scaling them for multiple nodes/segments are still a challenge. Now, these are available in the cloud with different names and much support (like semi-structured JSON data) and room for more scalability. Security has been a concern when the enterprise hears the word “cloud”, but there is something called local cloud where all these characteristics can be leveraged in a local environment as well, but the cloud is getting more secure these days than the local handling and management. Added advantage here is the movement of all of your current data in any database including Oracle, db2 etc. to cloud data warehouse is pretty seamless.
Recommended Places: – Traditional data warehousing wants to move to a bigger database and with scope for more scalability in the future at a much cheaper way.
Not Suitable For – Unstructured data.
Leading Providers: – Snowflake, IBM Dash DB and Open source Greenplum on any cloud environment (for ex-AWS).
Leading Analytical tools support almost all of the above technologies. But it is Enterprise, which has to decide what we need, to make it optimal for Business Intelligence.