Tuesday, March 06, 2012

SAP HANA: Simply Explained

This article is for SAP HANA beginners, who don’t really understand what it is about and are lost in all the marketing and technical jargon.

What is SAP HANA?
SAP HANA is the latest in-memory analytics product from SAP. Using HANA, companies can do ad hoc analysis of large volumes of data in real time.

What is in-memory?
In-memory means all the data is stored in memory (RAM). No time is wasted loading data from hard disk into RAM, or keeping some data in RAM and temporarily moving the rest to disk during processing. Everything is in memory all the time, which gives the CPUs quick access to the data for processing.

What is real-time analytics?
Using HANA, companies can analyze their data as soon as it is available. In the past, professionals had to wait at least a few hours before they could analyze the data being generated around the company. To put this in perspective, let us take an example: suppose a supermarket chain wants to give you discount coupons based on your shopping habits when you visit. Before: they could only mail the coupons to your address or give you coupons for your next purchase. Now: while you are checking out, your entire shopping history can be processed and a discount can be applied to your current purchase. Imagine the customer loyalty for such a chain!

So is SAP making/selling the software or the hardware?
SAP has partnered with leading hardware vendors (HP, Fujitsu, IBM, Dell, etc.) to sell SAP-certified hardware for HANA. SAP itself sells licenses and related services for the SAP HANA product, which includes the SAP HANA database, SAP HANA Studio and other software to load data into the database. Also, as already announced, the vision is to run all the application-layer enterprise software on the HANA platform; that is, ERP, BW, CRM, SCM, etc. will use HANA as their database.

Can I just increase the memory of my traditional Oracle database to 2TB and get similar performance?
Well, NO. You might have performance gains due to more memory available for your current Oracle/Microsoft/Teradata database but HANA is not just a database with bigger RAM. It is a combination of a lot of hardware and software technologies. The way data is stored and processed by the In-Memory Computing Engine (IMCE) is the true differentiator. Having that data available in RAM is just the icing on the cake.

Is HANA really fast? How is it possible that HANA is so fast?
HANA is fast for many reasons. The following picture [1] depicts a very simplified version of what’s inside the In-Memory Computing Engine.


Column Storage
While the traditional databases store the relational table one row after another, IMCE stores tables in columns. Hopefully the following figure explains the difference between the two storage mechanisms easily.
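
As a rough sketch of the two layouts, the snippet below flattens the same small table once row by row and once column by column. The table, its column names and its values are made up for illustration; this is not how IMCE lays out memory internally, only the ordering idea.

```python
# Hypothetical three-row Sales table, used only for illustration.
rows = [
    ("DE", "Widget", 100),
    ("US", "Gadget", 250),
    ("DE", "Gadget", 75),
]

# Row storage: all values of one record sit next to each other.
row_store = [value for record in rows for value in record]

# Column storage: all values of one column sit next to each other.
columns = {
    "country": [r[0] for r in rows],
    "product": [r[1] for r in rows],
    "revenue": [r[2] for r in rows],
}
column_store = [value for column in columns.values() for value in column]

print(row_store)
print(column_store)
```

In the column layout, all revenue values are contiguous, so summing revenue only has to walk one short list.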

Frankly, storing data in columns is not a new technology, but it has not been leveraged to its full potential YET. Columnar storage is read optimized, that is, read operations can be processed very fast. However, it is not write optimized, as a new insert might require moving a lot of data to make room for the new data. HANA handles this well with delta merge (which in itself is a topic for an entire article coming next), so let us just assume here that columnar storage performs very well for reads and that write operations are taken care of by the IMCE in some other way. Columnar storage creates a lot of opportunities, as follows:
  1. Compression: As the values written next to each other are of the same type, there is no need to write the same values again and again. There are many compression algorithms in HANA, with the default being the dictionary algorithm, which for example maps long strings to integers.

    Example of dictionary algorithm:
    You have a Country column in your Customer table in your database. Let’s say you have 10 million customers from 100 countries. In standard row-based storage you will need 10 million string values stored in memory. With dictionary compression, each of the 100 country values is assigned an integer index, and now you need only 10 million integers + the 100 string values + the mapping between them. This is a lot of compression in terms of bytes stored in memory. There are more advanced compression algorithms (run-length encoding, RLE, etc.) which reduce even the storage for the 10 million integers.
    Now imagine a scenario with 100 tables and a few thousand columns. You get the picture. The less data there is to scan, the faster the processing. The official tests show a compression of 5-10x, that is, a table which used to take 10GB of space would now need only 1-2GB of storage space.
  2. Partitioning: SAP HANA supports two types of partitioning. A single column can be split across many HANA servers, and different columns of a table can be placed on different HANA servers. Columnar storage easily enables this partitioning.
  3. Data stripping: Often, when querying a table, a lot of its columns are not used. For example, you may just want the revenue information from a Sales table which stores a lot of other information as well. Columnar storage ensures that the unnecessary data is not read or processed. As the tables are stored in vertical fashion, no time is wasted picking the relevant information out of a lot of unnecessary data.

  4. Parallel Processing: It is always performance critical to make full use of the resources available. With the growing number of CPUs, the more work they can do in parallel, the better the performance. Columnar storage enables parallel processing, as different CPUs can take one column each and do the required operations (aggregations etc.) in parallel. Or multiple CPUs can each take a part of a partitioned column and work in parallel for faster output.
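
The four points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table and the function names (`dictionary_encode`, `run_length_encode`, `parallel_sum`) are hypothetical, run-length encoding stands in as one example of a further compression step, and Python threads show the structure of partitioned work rather than real multi-core speedup. It is not HANA's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def dictionary_encode(values):
    """Replace each value with a small integer index into a dictionary."""
    dictionary = sorted(set(values))
    index_of = {value: i for i, value in enumerate(dictionary)}
    codes = [index_of[value] for value in values]
    return dictionary, codes


def run_length_encode(codes):
    """Collapse runs of equal codes into [code, run_length] pairs."""
    runs = []
    for code in codes:
        if runs and runs[-1][0] == code:
            runs[-1][1] += 1
        else:
            runs.append([code, 1])
    return runs


def parallel_sum(column, partitions=2):
    """Split one column into chunks and sum the chunks in worker threads."""
    size = max(1, -(-len(column) // partitions))  # ceiling division
    chunks = [column[i:i + size] for i in range(0, len(column), size)]
    with ThreadPoolExecutor(max_workers=partitions) as pool:
        return sum(pool.map(sum, chunks))


# Hypothetical column-stored table: one list per column.
table = {
    "country": ["DE", "DE", "DE", "US", "US", "IN"],
    "revenue": [100, 75, 20, 250, 30, 60],
}

# Compression: distinct strings are stored once, rows hold small integers.
dictionary, codes = dictionary_encode(table["country"])
runs = run_length_encode(codes)
print(dictionary, codes, runs)

# Data stripping + parallel processing: only the revenue column is read,
# and it is summed in two partitions.
total_revenue = parallel_sum(table["revenue"])
print(total_revenue)
```

Note that the revenue query never touches the country column at all, which is the point of item 3.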

Multiple Engines
SAP HANA has multiple engines inside its computing engine for better performance. As SAP HANA supports both SQL and OLAP reporting tools, there are separate SQL and OLAP engines to perform the respective operations. There is a separate calculation engine to do the calculations. There is a planning engine used for financial and sales reporting. Above all of these sits something like a controller, which breaks the incoming request into multiple pieces and sends sub-queries to the engines that are best suited to them. There are separate row and column engines to process operations on tables stored in row format and tables stored in column format.
Caution: Currently, you can't perform a join between a table stored in row format and a table stored in column format. Also, the query/reporting designer needs to be careful about which engines are used by a query. Performance drops if, for example, the SQL engine has to do the job of the calculation engine because the controller was not able to optimize the query perfectly.

What is ad hoc analysis?
In traditional data warehouses, such as SAP BW, a lot of pre-aggregation is done for quick results. That is, the administrator (IT department) decides which information might be needed for analysis and prepares the results for the end users. This gives fast performance, but the end user has no flexibility. Performance drops dramatically if the user wants to analyze data that has not already been pre-aggregated. With SAP HANA and its speedy engine, no pre-aggregation is required. The user can perform any kind of operation in their reports and does not have to wait hours for the data to be ready for analysis.
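
A minimal sketch of the difference, assuming a made-up list of sales records and a hypothetical `aggregate` helper: a pre-aggregated result answers only the question that was anticipated, while an ad hoc query groups the raw data by whatever the user asks for at that moment.

```python
from collections import defaultdict

# Hypothetical raw sales records: (country, product, revenue).
sales = [
    ("DE", "Widget", 100),
    ("US", "Gadget", 250),
    ("DE", "Gadget", 75),
    ("US", "Widget", 30),
]


def aggregate(records, key_index):
    """Sum revenue grouped by the field at position key_index."""
    totals = defaultdict(int)
    for record in records:
        totals[record[key_index]] += record[2]
    return dict(totals)


# Pre-aggregation: decided in advance, computed once, fast to look up,
# but it can only answer "revenue per country".
revenue_by_country = aggregate(sales, key_index=0)
print(revenue_by_country)

# Ad hoc: the user asks a new question and the raw data is grouped
# on the fly, here by product instead of country.
revenue_by_product = aggregate(sales, key_index=1)
print(revenue_by_product)
```

The ad hoc path is only practical when scanning the raw data is fast enough, which is what the in-memory columnar design described above is meant to provide.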

I hope the above information helps you get a better understanding of SAP HANA. Please let me know your comments/suggestions.

1: The picture is obviously a much simplified version of the engine and there is much more to it than represented in the picture.
 
Disclaimer: I am an SAP employee and a certified HANA consultant. All the opinions expressed here are completely my own and are not influenced by my employer.