Configure Apache Phoenix in CDH 5.4

By Prasad Khode

Apache Phoenix is an open source relational database layer on top of a NoSQL store such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.
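As a rough illustration of the JDBC access described above, here is a minimal Java sketch that creates a table, upserts a row, and queries it. It is not taken from the original post: the table name demo_users, the row values, and the ZooKeeper host localhost:2181 are hypothetical, and the connecting method assumes the Phoenix client JAR is on the classpath and a cluster is reachable.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class PhoenixJdbcSketch {

    // Hypothetical table and columns, used only for illustration.
    static final String CREATE_SQL =
        "CREATE TABLE IF NOT EXISTS demo_users (id BIGINT NOT NULL PRIMARY KEY, name VARCHAR)";
    static final String UPSERT_SQL = "UPSERT INTO demo_users (id, name) VALUES (?, ?)";
    static final String QUERY_SQL = "SELECT id, name FROM demo_users ORDER BY id";

    // Phoenix JDBC URLs take the form jdbc:phoenix:<ZooKeeper quorum>[:<port>].
    static String jdbcUrl(String zkQuorum, int zkPort) {
        return "jdbc:phoenix:" + zkQuorum + ":" + zkPort;
    }

    // Assumes the Phoenix client JAR is on the classpath and the cluster is reachable.
    static void createUpsertQuery(String url) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            try (Statement stmt = conn.createStatement()) {
                stmt.execute(CREATE_SQL);
            }
            try (PreparedStatement ps = conn.prepareStatement(UPSERT_SQL)) {
                ps.setLong(1, 1L);
                ps.setString(2, "alice");
                ps.executeUpdate();
            }
            // Phoenix connections do not auto-commit by default.
            conn.commit();
            try (Statement stmt = conn.createStatement();
                 ResultSet rs = stmt.executeQuery(QUERY_SQL)) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1) + " " + rs.getString(2));
                }
            }
        }
    }

    public static void main(String[] args) throws SQLException {
        String url = jdbcUrl("localhost", 2181);
        if (args.length > 0 && args[0].equals("--connect")) {
            // Only attempt a connection when explicitly requested.
            createUpsertQuery(url);
        } else {
            System.out.println(url);
        }
    }
}
```

Run without arguments, the program only prints the JDBC URL; pass --connect to execute the statements against a live cluster.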

Follow these steps to configure Apache Phoenix in Cloudera Distribution for Hadoop (CDH):

1. Log in to Cloudera Manager, click Hosts, then Parcels.
2. Select Edit Settings.
3. Click the + sign next to an existing Remote Parcel Repository URL, add the URL of the Phoenix parcel repository, and click Save Changes.
4. Select Hosts, then Parcels.
5. In the list of Parcel Names, CLABS_PHOENIX is now available. Select it and choose Download.
6. The first cluster is selected by default. To choose a different cluster for distribution, select it. Find CLABS_PHOENIX in the list, and click Distribute.
7. If you want to use secondary indexing, add the following to the hbase-site.xml advanced configuration snippet. Go to the HBase service, click Configuration, and search for HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml. Paste in the following XML, then save the changes.

8. Restart the HBase service.

Using Apache Phoenix Utilities:

Executing an Oozie workflow of Spark jobs in a shell action

By Anusha Jallipalli

Oozie is a server-based workflow engine that runs in a Java servlet container to schedule and manage Hadoop jobs.

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs (including Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop, and DistCp) as well as system-specific jobs (such as Java programs and shell scripts).

Here we discuss a simple workflow that takes input from HDFS and performs a word count using a Spark job. In this workflow, job-1 passes its output to job-2, as programmed in the shell script.

Big data analysis using Hive


Hive is a data warehouse system for accessing and querying data stored in the Hadoop file system. Hive uses a language called Hive Query Language (HQL), whose grammar and predicates closely follow those of SQL.

Our experience using Hive to analyze large data sets (big data) of bank-card transactions has given us the opportunity to draw on the best features of Hive available to date for generating in-depth analytical reports on card transactions, with insights into several dimensions of customer card usage.

Here is a brief recap of the other parts of the Hadoop framework mentioned in this write-up. Hadoop basically has two parts: (1) a distributed file system, the Hadoop Distributed File System (HDFS), and (2) MapReduce, a computing and processing framework. Hive provides a data warehouse facility on top of Hadoop.

I will share here some of the best features of Hive that were very handy for generating analytical reports from the large data sets in HDFS, processed (cleaned and transformed) using Spark and Spark SQL…

Big data ingestion using Apache Sqoop

By Prasad Khode

Apache Sqoop is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases.

We can use Sqoop to import data from external structured data stores into the Hadoop Distributed File System or related systems like Hive and HBase. Similarly, Sqoop can be used to extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses. …

Debug MapReduce Code in Eclipse

By Prasad Khode

I am involved in the development of a critical solution for a financial organization. The solution involves analysis of a multitude of transactions and data coming from multiple sources. Hadoop is a flexible option for handling big data, and it implements MapReduce efficiently.

Here I share my experience of executing and debugging MapReduce code in Eclipse, just like any other Java program.

When we run MapReduce code in Eclipse, Hadoop runs in a special mode called LocalJobRunner, under which all the Hadoop daemons run in a single JVM (Java Virtual Machine) instead of several different JVMs.

The default file paths are set to local file system paths rather than HDFS paths.
