Manage files in HDFS using WEBHDFS REST APIs

By  Rajashekar Yedla

Web services have become indispensable in current application development for exchanging data across applications and web applications. Various application programming interfaces (APIs) are emerging to expose web services. Representational State Transfer (REST), used by browsers, is logically the choice for building APIs.

I share my understanding and experience in using the WebHDFS REST API. (more…)
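
As a taste of what the post covers, here is a minimal sketch that lists an HDFS directory through the WebHDFS REST API from plain Java; the NameNode host, port, path and user name are placeholders, not values from the post.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsListStatus {
        public static void main(String[] args) throws Exception {
            // List /user/demo via the WebHDFS LISTSTATUS operation.
            // Host, port, path and user.name are placeholders; adjust for your cluster.
            URL url = new URL(
                    "http://namenode-host:50070/webhdfs/v1/user/demo?op=LISTSTATUS&user.name=hdfs");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");

            // The response body is a JSON document describing the directory entries.
            try (BufferedReader reader =
                    new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            conn.disconnect();
        }
    }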

Read More

Integration of HBase and Hive – An intro to inserting JSON data into HBase from Hive

By Anusha Jallipalli

Here is how JSON data is inserted into an HBase table using Hive.

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive cannot drop that table directly; you will have to use the HBase shell to drop such a table.

Registering the table is the first step. As part of the registration, you also need to specify a column mapping, which links Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.

Step 1: Create a new HBase table which is to be managed by Hive (more…)
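
To make the registration and column mapping concrete, here is a minimal sketch that runs the Hive DDL through the HiveServer2 JDBC driver; the connection URL, the Hive and HBase table names and the details:json mapping are assumptions for illustration, not the exact statements from the post.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class RegisterHBaseTableInHive {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            // HiveServer2 connection; host, port, database and credentials are placeholders.
            Connection conn = DriverManager.getConnection(
                    "jdbc:hive2://hive-host:10000/default", "hive", "");

            try (Statement stmt = conn.createStatement()) {
                // Register an existing HBase table (created in Step 1) as an EXTERNAL Hive table.
                // ':key' maps to the HBase rowkey; 'details:json' holds the raw JSON string.
                stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS hbase_json_table "
                        + "(rowkey STRING, json STRING) "
                        + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                        + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,details:json') "
                        + "TBLPROPERTIES ('hbase.table.name' = 'student_json')");
            }
            conn.close();
        }
    }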

Read More

Configure Apache Phoenix in CDH 5.4

By Prasad Khode

Apache Phoenix is an open source, relational database layer on top of a NoSQL store such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.
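
As an illustration of the JDBC driver in action, here is a minimal sketch of connecting to Phoenix and running an upsert and a query from Java; the ZooKeeper quorum, table and columns are hypothetical.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class PhoenixJdbcExample {
        public static void main(String[] args) throws Exception {
            // The connection string points at the ZooKeeper quorum used by HBase (placeholder).
            Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");

            try (Statement stmt = conn.createStatement()) {
                stmt.executeUpdate("CREATE TABLE IF NOT EXISTS user_profile ("
                        + "id BIGINT NOT NULL PRIMARY KEY, name VARCHAR, email VARCHAR)");
                stmt.executeUpdate("UPSERT INTO user_profile VALUES (1, 'John', 'john@example.com')");
                conn.commit();  // Phoenix connections are not auto-commit by default

                try (ResultSet rs = stmt.executeQuery("SELECT id, name, email FROM user_profile")) {
                    while (rs.next()) {
                        System.out.println(rs.getLong("id") + " " + rs.getString("name"));
                    }
                }
            }
            conn.close();
        }
    }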

Installation:
Following are the steps to configure Apache Phoenix in Cloudera Distribution for Hadoop (CDH):

1. Login to Cloudera Manager, click on Hosts, then Parcels.
2. Select Edit Settings.
3. Click the + sign next to an existing Remote Parcel Repository URL, and add the URL http://archive.cloudera.com/cloudera-labs/phoenix/parcels/1.0/. Then click Save Changes.
4. Select Hosts, then Parcels.
5. In the list of Parcel Names, CLABS_PHOENIX is now available. Select it and choose Download.
6. The first cluster is selected by default. To choose a different cluster for distribution, select it. Find CLABS_PHOENIX in the list, and click Distribute.
7. If you want to use secondary indexing, add the following to the hbase-site.xml advanced configuration snippet. Go to the HBase service, click Configuration, and choose/search for HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml. Paste in the XML shown after this list, then save the changes.

8. Restart the HBase service.
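
The snippet referenced in step 7 is not reproduced in this excerpt; the property below is the one commonly required for Phoenix secondary indexing (verify it against the Cloudera Labs Phoenix documentation for your release):

    <property>
      <name>hbase.regionserver.wal.codec</name>
      <value>org.apache.hadoop.hbase.regionserver.wal.IndexedWALEditCodec</value>
    </property>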

Using Apache Phoenix Utilities: (more…)

Read More

CRUD operations in MongoDB using Java

By Narmada Yalagandula

MongoDB is a document-oriented database. It provides high performance, high availability, and easy scalability. It stores data in JSON-like documents instead of the tables and records of an RDBMS. The basic unit of data in MongoDB is the document, made up of key-value pairs. Collections are sets of documents and play the role of tables in an RDBMS.

Here I present how to connect to MongoDB using Java and perform CRUD operations on MongoDB collections.

We connect to MongoDB after adding the MongoDB Java driver as a dependency in the pom.xml file.
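
A minimal sketch of the CRUD operations using the MongoDB Java driver (3.x API assumed); the host, database, collection and documents are placeholders:

    import com.mongodb.MongoClient;
    import com.mongodb.client.MongoCollection;
    import com.mongodb.client.MongoDatabase;
    import org.bson.Document;

    import static com.mongodb.client.model.Filters.eq;

    public class MongoCrudExample {
        public static void main(String[] args) {
            // Connect to a MongoDB instance (host and port are placeholders).
            MongoClient client = new MongoClient("localhost", 27017);
            MongoDatabase db = client.getDatabase("testdb");
            MongoCollection<Document> users = db.getCollection("users");

            // Create
            users.insertOne(new Document("name", "John").append("email", "john@example.com"));

            // Read
            Document found = users.find(eq("name", "John")).first();
            System.out.println(found);

            // Update
            users.updateOne(eq("name", "John"),
                    new Document("$set", new Document("email", "john@new-example.com")));

            // Delete
            users.deleteOne(eq("name", "John"));

            client.close();
        }
    }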

(more…)

Read More

CRUD Operations in HBase

By Anusha Jallipalli

HBase is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. I discuss here how to connect to HBase from Java and perform several basic operations on it.

Here we use the hbase-client library to connect to HBase from Java, so we add hbase-client as a dependency in the pom.xml file.

The following files need to be placed in the HBase environment to use Java for managing data in the HBase tables. (more…)
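
A minimal sketch of the basic operations using the hbase-client API (HBase 1.x assumed); the table name, column family and values are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseCrudExample {
        public static void main(String[] args) throws Exception {
            // hbase-site.xml on the classpath supplies the ZooKeeper quorum.
            Configuration conf = HBaseConfiguration.create();
            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("student_info"))) {

                // Create / update: put one column into a row.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("details"), Bytes.toBytes("firstName"), Bytes.toBytes("John"));
                table.put(put);

                // Read: get the row back.
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println(Bytes.toString(
                        result.getValue(Bytes.toBytes("details"), Bytes.toBytes("firstName"))));

                // Delete: remove the row.
                table.delete(new Delete(Bytes.toBytes("row1")));
            }
        }
    }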

Read More

Integrate Kafka, Spark Streaming and HBase (Kafka -> Spark Streaming -> HBase)

By Anusha Jallipalli

Apache Kafka is a publish-subscribe messaging system. It is a distributed, partitioned, replicated commit log service.

Spark Streaming is a sub-project of Apache Spark. Spark is a batch processing platform similar to Apache Hadoop. Spark Streaming is a real-time processing tool that runs on top of the Spark engine.

Create a POJO class as below:

The StreamingToHbase program receives four parameters as input: <zkQuorum> <group> <topics> <numThreads>

zkQuorum: a list of one or more ZooKeeper servers that make up the quorum
group: the name of the Kafka consumer group
topics: a list of one or more Kafka topics to consume from
numThreads: the number of threads the Kafka consumer should use (more…)
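
A minimal sketch of wiring these parameters into a receiver-based Kafka stream with the Spark 1.x spark-streaming-kafka API; the HBase write itself would live inside a foreachRDD call, and the batch interval and class name are assumptions:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.spark.SparkConf;
    import org.apache.spark.streaming.Durations;
    import org.apache.spark.streaming.api.java.JavaDStream;
    import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
    import org.apache.spark.streaming.api.java.JavaStreamingContext;
    import org.apache.spark.streaming.kafka.KafkaUtils;

    public class StreamingToHbaseSketch {
        public static void main(String[] args) throws Exception {
            String zkQuorum = args[0], group = args[1], topics = args[2];
            int numThreads = Integer.parseInt(args[3]);

            SparkConf conf = new SparkConf().setAppName("StreamingToHbase");
            JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(10));

            // Map each topic to the number of consumer threads to use.
            Map<String, Integer> topicMap = new HashMap<>();
            for (String topic : topics.split(",")) {
                topicMap.put(topic, numThreads);
            }

            // Receiver-based Kafka stream of (key, message) pairs.
            JavaPairReceiverInputDStream<String, String> messages =
                    KafkaUtils.createStream(jssc, zkQuorum, group, topicMap);

            // Keep only the message payload; writing each batch to HBase would be done
            // in a foreachRDD(...) on this stream, building Puts as in plain hbase-client code.
            JavaDStream<String> lines = messages.map(tuple -> tuple._2());
            lines.print();

            jssc.start();
            jssc.awaitTermination();
        }
    }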

Read More

Reading Data from HBase table using Spark

By Anusha Jallipalli

HBase is a data model, similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).

HBase is a column-oriented database built on top of HDFS. It is open source and horizontally scalable. HBase is used to access very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.

Let us consider a table named “student_info” in HBase with the column family “details” and the column qualifiers “sid, firstName, lastName, branch, emailId”. Create a POJO class as below: (more…)
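
A minimal sketch of reading that table into Spark with TableInputFormat and newAPIHadoopRDD (Spark 1.x and HBase 1.x Java APIs assumed; the printed column is just an example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.TableInputFormat;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadHBaseWithSpark {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("ReadHBaseWithSpark"));

            // hbase-site.xml on the classpath supplies the ZooKeeper quorum;
            // TableInputFormat.INPUT_TABLE names the table to scan.
            Configuration hbaseConf = HBaseConfiguration.create();
            hbaseConf.set(TableInputFormat.INPUT_TABLE, "student_info");

            // Each element is a (rowkey, Result) pair for one HBase row.
            JavaPairRDD<ImmutableBytesWritable, Result> rows = sc.newAPIHadoopRDD(
                    hbaseConf, TableInputFormat.class, ImmutableBytesWritable.class, Result.class);

            // Pull one qualifier out of each Result, e.g. details:firstName.
            rows.foreach(tuple -> System.out.println(Bytes.toString(
                    tuple._2().getValue(Bytes.toBytes("details"), Bytes.toBytes("firstName")))));

            sc.close();
        }
    }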

Read More

Using Apache Spark to read data from Cassandra tables

By Prasad Khode

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure (SPOF): the system keeps running even if part of it fails. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. Spark is well suited to machine learning algorithms, as it allows user programs to load data into a cluster’s memory and query it repeatedly.

Let us consider a table named “table_user” in our Cassandra database with the columns “user_first_name, user_last_name, user_email, date_of_birth”. We create a POJO class as below:

To read data from the Cassandra tables using Apache Spark:

  1. Add the Apache Spark and Cassandra dependencies in pom.xml
  2. Configure a SparkConf object with the Cassandra database details
  3. Use the SparkConf object to create a JavaSparkContext object
  4. Use the JavaSparkContext object to read the data from the Cassandra table
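
A minimal sketch of these four steps using the DataStax spark-cassandra-connector Java API; the keyspace name and connection host are assumptions, and rows are read here as generic CassandraRow objects rather than mapped to the POJO:

    import com.datastax.spark.connector.japi.CassandraRow;
    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import static com.datastax.spark.connector.japi.CassandraJavaUtil.javaFunctions;

    public class ReadCassandraWithSpark {
        public static void main(String[] args) {
            // Step 2: point the SparkConf at the Cassandra cluster (host is a placeholder).
            SparkConf conf = new SparkConf()
                    .setAppName("ReadCassandraWithSpark")
                    .set("spark.cassandra.connection.host", "127.0.0.1");

            // Step 3: build the JavaSparkContext from that configuration.
            JavaSparkContext sc = new JavaSparkContext(conf);

            // Step 4: read the table; each element is a CassandraRow keyed by column name.
            JavaRDD<CassandraRow> rows = javaFunctions(sc)
                    .cassandraTable("my_keyspace", "table_user");

            rows.foreach(row -> System.out.println(
                    row.getString("user_first_name") + " " + row.getString("user_email")));

            sc.close();
        }
    }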

(more…)

Read More

Analyze Big data using Apache Spark SQL

By  Rajashekar Yedla

Apache Spark SQL is a powerful data processing engine and in-memory computing framework for processing and analyzing large volumes of data quickly. We fetch the elements of an RDD into a Spark SQL table and query that table. We can write only SELECT queries on a Spark SQL table; no other SQL operations are possible. A SELECT query on Spark SQL returns an RDD. It has rich APIs in three languages (Java, Scala and Python).

We use Spark SQL extensively to perform ETL on Big Data, as it lets us dispense with writing complex Spark code.

Working with Spark SQL:

As in Spark, to start with Spark SQL we first have to get the data into an RDD (Resilient Distributed Dataset). Once the RDD is available, we create a Spark SQL table with the desired RDD elements as table records; we achieve this using the SQLContext. We then implement the business logic by writing appropriate SELECT queries on the Spark SQL tables. The output of a query is another RDD, and the elements of the output RDD can be saved as a text file or as an object file, as needed. (more…)
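
A minimal sketch of that flow with the Spark 1.x Java API (the bean class, table name, query and output path are hypothetical):

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;
    import org.apache.spark.sql.DataFrame;
    import org.apache.spark.sql.SQLContext;

    public class SparkSqlExample {

        // Hypothetical bean whose fields become the columns of the Spark SQL table.
        public static class Employee implements java.io.Serializable {
            private String name;
            private int age;
            public Employee() { }
            public Employee(String name, int age) { this.name = name; this.age = age; }
            public String getName() { return name; }
            public void setName(String name) { this.name = name; }
            public int getAge() { return age; }
            public void setAge(int age) { this.age = age; }
        }

        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("SparkSqlExample"));
            SQLContext sqlContext = new SQLContext(sc);

            // 1. Get the data into an RDD.
            JavaRDD<Employee> employees = sc.parallelize(Arrays.asList(
                    new Employee("John", 28), new Employee("Jane", 35)));

            // 2. Register the RDD as a Spark SQL table.
            DataFrame df = sqlContext.createDataFrame(employees, Employee.class);
            df.registerTempTable("employee");

            // 3. Run a SELECT query; the rows come back as another RDD.
            DataFrame result = sqlContext.sql("SELECT name FROM employee WHERE age > 30");
            JavaRDD<String> names = result.javaRDD().map(row -> row.getString(0));

            // 4. Save the output RDD, e.g. as a text file (path is a placeholder).
            names.saveAsTextFile("/tmp/spark-sql-output");

            sc.close();
        }
    }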

Read More

Executing an Oozie workflow of Spark jobs in a shell action

By Anusha Jallipalli

Oozie is a server-based workflow engine that runs in a Java servlet container to schedule and manage Hadoop jobs.

Oozie Workflow jobs are Directed Acyclical Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs (such as Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs such as Java programs and shell scripts.

Here we discuss a simple workflow which takes input from HDFS and performs a word count using a Spark job. Job 1 passes its output to job 2, as programmed in the shell script. (more…)
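
For orientation, here is a skeletal workflow.xml with a single shell action that launches such a script (which would in turn call spark-submit for the chained jobs); the action name, script name and properties are placeholders, not the workflow from the post:

    <workflow-app xmlns="uri:oozie:workflow:0.4" name="spark-wordcount-wf">
        <start to="wordcount-shell"/>
        <action name="wordcount-shell">
            <shell xmlns="uri:oozie:shell-action:0.2">
                <job-tracker>${jobTracker}</job-tracker>
                <name-node>${nameNode}</name-node>
                <exec>run-wordcount.sh</exec>
                <file>run-wordcount.sh#run-wordcount.sh</file>
            </shell>
            <ok to="end"/>
            <error to="fail"/>
        </action>
        <kill name="fail">
            <message>Shell action failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
    </workflow-app>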

Read More