Tuesday, March 06, 2012

SAP HANA: Simply Explained

This article is for SAP HANA beginners, who don’t really understand what it is about and are lost in all the marketing and technical jargon.

What is SAP HANA?
SAP HANA is the latest in-memory analytics product from SAP; using HANA, companies can do ad hoc analysis of large volumes of data in real time.

What is in-memory?
In-memory means all the data is stored in memory (RAM). There is no time wasted loading data from the hard disk into RAM, or keeping some data in RAM and some temporarily on disk while processing. Everything is in memory all the time, which gives the CPUs quick access to the data for processing.

What is real-time analytics?
Using HANA, companies can analyze their data as soon as it is available. In the past, professionals had to wait at least a few hours before they could analyze the data being generated around the company. To put this in perspective, let us take an example: suppose a supermarket chain wants to start giving you discount coupons, based on your shopping habits, when you visit the store. Before: they could only mail coupons to your address or hand you coupons for your next purchase. Now: while you are checking out, your entire shopping history can be processed and a discount can be applied to your current purchase. Imagine the customer loyalty for such a chain!

So is SAP making/selling the software or the hardware?
SAP has partnered with leading hardware vendors (HP, Fujitsu, IBM, Dell, etc.) to sell SAP-certified hardware for HANA. SAP sells the licenses and related services for the SAP HANA product, which includes the SAP HANA database, SAP HANA Studio and other software to load data into the database. Also, as already announced, the vision is to run all the application-layer enterprise software on the HANA platform; that is, ERP/BW/CRM/SCM etc. will use HANA as their database.

Can I just increase the memory of my traditional Oracle database to 2TB and get similar performance?
Well, NO. You might see performance gains due to more memory being available for your current Oracle/Microsoft/Teradata database, but HANA is not just a database with bigger RAM. It is a combination of a lot of hardware and software technologies. The way data is stored and processed by the In-Memory Computing Engine (IMCE) is the true differentiator. Having the data available in RAM is just the icing on the cake.

Is HANA really fast? How is it possible that HANA is so fast?
HANA is fast for many reasons. The following picture [1] depicts a very simplified version of what's inside the In-Memory Computing Engine.


Column Storage
While traditional databases store a relational table one row after another, the IMCE stores tables column by column. Hopefully the following figure explains the difference between the two storage mechanisms.
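
To make the difference concrete, here is a minimal Python sketch (the table contents are made up, and this only illustrates the two layouts; it is not HANA's actual storage format):

# Illustrative sketch only -- made-up data, not HANA's internal format.

# Row storage: one complete record after another.
rows = [
    ("C001", "Alice", "Germany", 120.0),
    ("C002", "Bob",   "France",   80.0),
    ("C003", "Carol", "Germany",  95.0),
]

# Column storage: one array per column.
columns = {
    "id":      ["C001", "C002", "C003"],
    "name":    ["Alice", "Bob", "Carol"],
    "country": ["Germany", "France", "Germany"],
    "revenue": [120.0, 80.0, 95.0],
}

# A query that only needs revenue touches a single contiguous array in the
# column layout, instead of scanning every full record as in the row layout.
total_from_rows    = sum(record[3] for record in rows)
total_from_columns = sum(columns["revenue"])
assert total_from_rows == total_from_columns    # 295.0 either way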

Frankly, storing data in columns is not a new technology, but it has not been leveraged to its full potential YET. Columnar storage is read-optimized, that is, read operations can be processed very fast. However, it is not write-optimized, as a new insert might require moving a lot of data to make room for the new values. HANA handles this well with the delta merge (which is in itself a topic for an entire article, coming next), so let us just assume here that columnar storage performs very well for reads and that write operations are taken care of by the IMCE in other ways. Columnar storage creates a lot of opportunities, as follows:
  1. Compression: As the data written next to each other is of the same type, there is no need to write the same values again and again. There are many compression algorithms in HANA, the default being the dictionary algorithm, which, for example, maps long strings to integers (see the short sketch right after this list).

    Example of the dictionary algorithm:
    You have a Country column in the Customer table of your database. Let's say you have 10 million customers from 100 countries. In standard row-based storage you would need 10 million string values stored in memory. With dictionary compression, the 100 country values are assigned an integer-based index, and now you need only 10 million integers + the 100 string values + the mapping between them. That is a lot of compression in terms of bytes stored in memory. There are more advanced compression algorithms (run-length encoding, etc.) which reduce even the 10-million-integer storage.
    Now imagine a scenario with 100 tables and a few thousand columns. You get the picture. Less data to scan and move translates directly into faster processing. The official tests show a compression of 5-10x; that is, a table which used to take 10GB of space would now need only 1-2GB.
  2. Partitioning: SAP HANA supports two types of partitioning: a single column can be partitioned across many HANA servers, and different columns of a table can be placed on different HANA servers. Columnar storage makes this partitioning easy.
  3. Data stripping: Often, when querying a table, a lot of columns are not used; for example, when you just want the revenue information from a Sales table that also stores a lot of other information. Columnar storage ensures that the unnecessary data is not read or processed. Because the tables are stored column-wise, no time is wasted reading through a lot of irrelevant data just to pick out the information you need.

  4. Parallel Processing: It is always performance-critical to make full use of the resources available. With the current growth in the number of CPU cores, the more work they can do in parallel, the better the performance. Columnar storage enables parallel processing: different CPUs can each take one column and perform the required operations (aggregations etc.) on it in parallel, or multiple CPUs can work on different partitions of a column in parallel for faster output.
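
To make the dictionary compression from point 1 concrete, here is a minimal Python sketch (the country values are made up; this only illustrates the idea, not HANA's implementation):

# Illustrative sketch of dictionary encoding -- not HANA's implementation.
countries = ["Germany", "France", "Germany", "India", "France", "Germany"]

# Build the dictionary: each distinct value gets one small integer code.
dictionary = sorted(set(countries))                  # ['France', 'Germany', 'India']
code_of = {value: i for i, value in enumerate(dictionary)}

# The column itself is now stored as integers plus one copy of each string.
encoded = [code_of[value] for value in countries]    # [1, 0, 1, 2, 0, 1]

# Reads stay cheap: decoding is a simple index lookup.
decoded = [dictionary[code] for code in encoded]
assert decoded == countries

# With 10 million customers and only 100 countries, the column shrinks from
# 10 million strings to 10 million small integers plus the 100 strings.
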
Multiple Engines
SAP HANA has multiple engines inside its computing engine for better performance. As SAP HANA supports both SQL and OLAP reporting tools, there are separate SQL and OLAP engines to perform the respective operations. There is a separate calculation engine to do calculations, and a planning engine used for financial and sales planning. Above all of these sits something like a controller, which breaks an incoming request into multiple pieces and sends sub-queries to the engines that are best at each job. There are also separate row and column engines to process operations on tables stored in row format and on tables stored in column format.
Caution: currently, you cannot perform a join between a table stored in row format and a table stored in column format. Also, the query/report designer needs to be careful about which engines a query ends up using: performance suffers if, for example, the SQL engine has to do the calculation engine's job because the controller could not optimize the query perfectly.

What is ad hoc analysis?
In traditional data warehouses, such as SAP BW, a lot of pre-aggregation is done for quick results. That is, the administrator (the IT department) decides which information might be needed for analysis and prepares the results for the end users. This gives fast performance, but the end user has no flexibility, and performance drops dramatically if the user wants to analyze data that has not already been pre-aggregated. With SAP HANA and its speedy engine, no pre-aggregation is required. Users can perform any kind of operation in their reports and do not have to wait hours for the data to be prepared for analysis.
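
As a rough illustration of the difference, here is a hypothetical Python sketch (made-up data; this is not how BW or HANA work internally): a pre-aggregated store can only answer the questions it was built for, while aggregating the detail data on the fly can answer any grouping the user thinks of.

# Hypothetical sketch: pre-aggregation vs. ad hoc aggregation (made-up data).
sales = [
    {"region": "EMEA", "product": "A", "year": 2011, "revenue": 100},
    {"region": "EMEA", "product": "B", "year": 2011, "revenue": 250},
    {"region": "APAC", "product": "A", "year": 2012, "revenue": 175},
]

# Traditional approach: IT pre-aggregates revenue by region, and only by region.
pre_aggregated = {}
for row in sales:
    pre_aggregated[row["region"]] = pre_aggregated.get(row["region"], 0) + row["revenue"]
# A new question such as "revenue by product and year" cannot be answered from
# pre_aggregated; the prepared results would have to be rebuilt first.

# Ad hoc approach: aggregate the detail data on the fly for any grouping asked for.
def ad_hoc_total(rows, group_by):
    totals = {}
    for row in rows:
        key = tuple(row[column] for column in group_by)
        totals[key] = totals.get(key, 0) + row["revenue"]
    return totals

print(ad_hoc_total(sales, ["product", "year"]))
# {('A', 2011): 100, ('B', 2011): 250, ('A', 2012): 175}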

I hope the above information is useful for getting a better understanding of SAP HANA. Please let me know your comments/suggestions.

[1]: The picture is obviously a much simplified version of the engine and there is much more to it than represented in the picture.
 
Disclaimer: I am an SAP employee and a certified HANA consultant. All the opinions expressed here are completely my own and have no influence of my employer.
 

Saturday, February 25, 2012

All about SAP HANA Certification

Last year, SAP introduced HANA, based on in-memory technology, with which customers can analyze large volumes of data in seconds. This post is about the certifications available for SAP HANA.

As of February 2012, SAP has made available only one Associate Consultant certification, with the code C_HANAIMP_10. (There are two certifications available now: C_HANAIMP_1 for application consultants and C_HANATEC_1 for technical consultants.) The C_HANAIMP_1 certification tests knowledge of all aspects of SAP HANA 1.0 for the profile of an SAP HANA Associate Consultant.


The topics in the exam include: business use cases for the HANA platform, loading data into the HANA database, modeling and creating views on top of base tables to make sense of the data, creating reports on these views using various tools such as the SAP BusinessObjects tools, optimizing the performance of these reports, user management, security and data access privileges.

Below I have gathered, to the best of my knowledge, all the available information related to the certification exam:

Q: How many questions are there in the exam?
A: 80

Q: What is the duration of exam?
A: 180 minutes

Q: What are the types of questions asked?
A: Judging from the sample questions provided by SAP, the questions have single-choice or multiple-choice answers. Also, as shown in the sample questions, the number of correct answers is indicated for each question.

Q: What is the passing percentage?
A: 59%, which means that you must get at least 48 of the 80 questions completely correct (59% of 80 is 47.2, rounded up to 48).

Q: If a question has 3 correct answers, and I answer 2 correct answers, will I get partial marks?
A: No. You MUST select all the correct answers to get the marks for a question. The scoring is binary: either you answer a question right or you answer it wrong, nothing in between. On this note, make sure to pay attention to the indicated number of correct answers and choose exactly that many. Choosing fewer or more answers than indicated will get you 0 points, irrespective of the correctness of your choices.

Q: What are the available trainings?
A: The following trainings are available and recommended:
  1. TZHANA: It's a 2-day classroom course that gives a very good overview of all the different components in HANA. The best part about the training was that there were a lot of exercises, which gave us a good feel for the overall tool.
  2. RHANA: Personally, I think this is just TZHANA compressed into an e-learning format. If you recently did TZHANA you can skip it, though it's a great way to revise TZHANA if you did it a long time before the exam date.
  3. OHA10: It's a self-paced training; all the material is provided online. Obviously some content overlaps with TZHANA, but extra and more recent content is available. I would say this is a nice-to-have (and not a must-have) training for passing the certification exam. Of course, I would recommend going through it in the long term.
  4. TZH300: I passed my certification before doing this course, so clearly it's not a must-have. The course content includes: how to do transformations using SLT, advanced modeling with HANA Studio, importing and exporting objects, join types, advanced SQLScript and CE functions, etc.

    Please note that HA100 will replace TZHANA and HA300 will replace TZH300 in the near future.
Q: Which topics are tested in the exam and how is the exam distributed among all the topics?
A: The following table gives a good idea of which topics to focus on (source). I have added the approximate number of questions, calculated from the percentages given on the certification page.

Topic                          % of exam    Approx. no. of questions
Business content               <=8%         6
Data Modeling                  >=12%        19
Data Provisioning              >=12%        15
Optimization                   >=12%        15
Reporting                      >=12%        19
Security and Authorization     <=8%         6

Q: Is going through the trainings/material enough, or do I need hands-on experience with data modeling/provisioning?
A: As they say, if you read, you might forget, but if you try, you remember for a long, long time. That said, hands-on experience will surely be helpful and is recommended for passing the exam, but don't waste too much time just trying to get access. If you really understand the overall concepts you have a decent shot at the exam.

Q: How much expertise on BO tools is needed?
A: It's an exam about HANA and its interaction with the BO tools. Expertise in the BO tools is not expected, just an overall understanding of the tools and of how and when they are useful. To stress my point and not cross any legal boundaries, I will just copy-paste the course content of TZHANA on reporting: SAP HANA Interfaces to BI Client tools, including SAP Business Objects Explorer, SAP Crystal Reports, SAP Business Objects Dashboards and SAP Business Objects Web Intelligence.

Resources:
  1. Go to https://www.experiencesaphana.com/ and play around. Go to the “Try” section and spend some time playing with all the “test drives” available. It might not help for the exam, but it definitely gives an idea of how stuff works.
  2. Some really useful videos: HANA Modeler (29 mins, gives a very good overview of data modeling in HANA) and Demo of BusinessObjects 4.0 tools consuming data from HANA (13 mins; if you are not used to the Information Design Tool (IDT), it's a good watch).
  3. Training: All HANA related trainings are listed here with sign-up links.
  4. If you have access to a HANA system and some extra time, a lot of guides, especially for the reporting tools, are available here: http://help.sap.com/hana_appliance . The best guide to get started with on that page is the Modeling Guide, using which you can try out workflows for all aspects of data modeling. (Don't be disappointed by the locked files available only to Partners and Employees; there are many open-to-all guides as well.)
  5. SAP employees and partners can access the OHA10 course by going to: http://service.sap.com/okp
    If the left hand menu doesn't open automatically, go to: SAP Consultant Education -> Early Product Training -> SAP Online Knowledge Products and search for HANA
  6. You can also request access to a HANA system from SAP. All FAQs related to this access are here.
Extra resources (not directly useful for exam, but useful in the long term)
  1. Documents: The business case for HANA and a good overview of the SAP HANA architecture.
  2. Some very informative documents are available on https://www.experiencesaphana.com/. Go to Browse -> Content -> Documents. Note: the filtering doesn't work properly, so spend some time finding the relevant documents.
Please let me know in the comments if I have missed something or if you like/dislike the information above. Also, these notes are from February 2012 and the course/exam contents may have changed since then.

Please note that, as an SAP employee, I cannot share any documents, and all exam participants have to sign an NDA promising not to share any exam contents.

Disclaimer: I am an SAP employee and a certified HANA consultant. I have just gathered the information available freely on the internet. All the opinions expressed here are completely my own and have nothing to do with my employer.

Thursday, February 23, 2012

Gmail still sucks!

In the name of a new design layout, the new Gmail restructuring is a big letdown. This entire wait for a more "cozy" look is ridiculous. I have been a loyal Gmail user since 2002, and the interface has remained more or less the same. Gmail was a breakthrough in its time due to the concept of conversations and, of course, the 1GB of space, but not much has happened since then. A big overhaul is required because the way users use their mail has changed significantly over the years.

All the emails we receive can be broadly classified into three categories:
1. Mails I will never look at again (notifications, signups, evening plans, informational mails, etc.)
2. Mails I need in the short term (for reference, to reply to or read later, etc.)
3. Mails I need to refer to in the long term

Even with more than 10 labels and 25 filters, I have to search for any email older than a week, and thanks to the amazing search functionality this works quite well. Google needs to harness this capability to the fullest. With the current layout, my eye concentration is something like this:
The important point to note is that a lot of space is wasted in the current layout, even on my gigantic screen. The whole concept of listing emails needs to be forgotten, because nobody looks through them or goes to the third page to find an old email. They search. It works!

Another point to wonder about is: how are current users using email? Or, what is the most important activity they can't do without email? With all the facebooks and twitters, my conclusion is: attachments. There is currently no other platform for securely exchanging documents. This needs to be highlighted and given more importance by placing a box up front showcasing all the recently exchanged (sent/received) documents. With a little imagination, I came up with a layout like the one below (which I am sure can be improved):

Remove pagination, remove earlier mails; if I need them, I will search for them. Show me the things I need the most right after logging in. The next step would be the possibility of adding widgets to the email interface, such as calendar, maps, documents and all the other Google properties. (Sidenote: Google tried this long ago with iGoogle, but that concept was the opposite: it took your email out to be used alongside widgets. Here the idea is to make email more powerful by bringing in more and more tools to make it more useful.)

Priority Inbox is a step in the right direction, but it's too little, too late. Some time ago, a Google engineer posted a big rant about how Google is missing out on building platforms. Gmail, with its millions of users, could be a good place to start!

I may be wrong, but I don't think a lot of Gmail users worry about the latest HD themes.