Big data analysis using Hive

By YSN

Hive is a data warehouse system for accessing and querying data stored in the Hadoop Distributed File System (HDFS). Hive uses a language called Hive Query Language (HQL), with the same grammar and predicates as SQL.
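For illustration, here is a minimal HQL query (the table and column names are hypothetical), which any SQL user can read directly:

SELECT card_id, COUNT(*) AS txn_count
FROM card_transactions
GROUP BY card_id;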

Our experience using Hive to analyse large data sets (big data) of bank-card transactions has given us the opportunity to exploit the best features Hive offers to date for generating in-depth analytical reports on card transactions, yielding insights into several dimensions of customers' card usage.

Here is a brief recap of the other parts of the Hadoop framework mentioned in this write-up. Hadoop has basically two parts: 1. a distributed file system (HDFS), and 2. MapReduce, a computing and processing framework. Hive provides the data warehouse facility on top of Hadoop.

I will share here some of the best features of Hive that proved very handy for generating analytical reports from large data sets in HDFS that had been processed (cleaned and transformed) using Spark and Spark SQL.

Hive queries written in HQL are converted into corresponding MapReduce jobs and run on the cluster to produce the result. Hive provides settings that control how query statements are converted into MapReduce jobs, for efficient reading and processing of files on the cluster.
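The plan Hive generates for a query can be inspected with the EXPLAIN statement, which prints the stages the query is compiled into (table name as in the hypothetical example above):

EXPLAIN
SELECT card_id, COUNT(*) AS txn_count
FROM card_transactions
GROUP BY card_id;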

The sequence-file output format of MapReduce is not suitable for optimal Hive query performance, since the full row has to be read even if only one column is required. The best format presently supported by Hive is the ORC file format.

The Hive settings we used are:

set hive.cli.print.header=true; -- prints column names in the query output

set hive.exec.parallel=true; -- if a query's MapReduce stages are not interdependent, they can be executed in parallel

set hive.auto.convert.join=true; -- enables map joins, where the smaller table of a join is held in memory

set hive.vectorized.execution.enabled=true; -- enables vectorized query execution

The vectorization setting is very significant for our present discussion: it makes Hive process a batch of rows (1,024 by default) together instead of processing one row at a time. Vectorized execution requires the data to be stored in a supported columnar format such as ORC.
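On Hive 2.3 and later (an assumption about the cluster version), whether a query will actually be vectorized can be verified from the plan, using the ORC table created below:

-- prints only the vectorization-related parts of the plan
EXPLAIN VECTORIZATION ONLY
SELECT Card_ID, COUNT(*) FROM card_all_data_orc GROUP BY Card_ID;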

The tables on which the queries run should have been created with the specification STORED AS ORC.

Example:

CREATE EXTERNAL TABLE IF NOT EXISTS card_all_data_orc (Customer_ID BIGINT, Card_ID STRING, ...)

STORED AS ORC;

Since the data to be loaded is mostly available in text or CSV format, loading the ORC tables is a two-step process:

Create a normal (text-format) table and load the data into it from the .csv file; then insert the data from this table into the ORC table, as illustrated below:

INSERT OVERWRITE TABLE card_all_data_orc SELECT * FROM all_data;
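For completeness, here is a sketch of the staging step that precedes the INSERT above; the file path is hypothetical, and the column list is elided as in the earlier example. Note that the ROW FORMAT DELIMITED clause belongs here, on the text staging table, where the field delimiter is actually meaningful:

-- staging table in plain text, matching the layout of the CSV file
CREATE TABLE IF NOT EXISTS all_data (Customer_ID BIGINT, Card_ID STRING, ...)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- load the raw CSV file into the staging table
LOAD DATA INPATH '/data/cards/all_data.csv' INTO TABLE all_data;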

The ORC (Optimized Row Columnar) file format is ideal for big data processing using Hive. The advantage of the ORC format comes mainly from its large blocks of data (a file contains groups of row data called stripes, 250 MB by default); large blocks result in efficient reads from HDFS. The other significant advantage of the ORC format is that it stores statistics (min, max, sum, count) for each stripe, which lets Hive skip stripes that cannot satisfy a query's filter.
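Because of these statistics, a selective filter can skip entire stripes whose value range cannot match. A sketch, assuming predicate pushdown is enabled via hive.optimize.index.filter (the customer ID value is hypothetical):

-- enables use of ORC stripe/row-group statistics to skip data
set hive.optimize.index.filter=true;

SELECT Card_ID
FROM card_all_data_orc
WHERE Customer_ID = 1234567890;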

Hive also supports non-clustered indexes, which are helpful for faster data processing and retrieval.
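A sketch of the index syntax (the index name is hypothetical; note that CREATE INDEX was removed in Hive 3.0, so this applies to older releases only):

-- create a compact index on the customer column, then build it
CREATE INDEX idx_customer ON TABLE card_all_data_orc (Customer_ID)
AS 'COMPACT' WITH DEFERRED REBUILD;

ALTER INDEX idx_customer ON card_all_data_orc REBUILD;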