Processing Big data with Apache Spark

By Rajashekar Yedla

Apache Spark is used for processing and streaming over large data sets, including data in HDFS, HBase, and Cassandra, to perform ETL and advanced analytics. Spark has rich APIs that support the Java, Scala, and Python languages. At Bimarian we use Apache Spark extensively to perform ETL on Big Data. Because Spark is an in-memory computing framework, we use it to avoid unnecessary writes and reads to disk, and we have seen very good performance on complex ETL operations in our Big Data applications.

Working with Spark:

Working with Spark begins by fetching data into a Spark RDD. RDDs (Resilient Distributed Datasets) are immutable, distributed collections of records, loaded from storage such as HDFS or HBase. The data in an RDD is transformed according to the business logic, producing another RDD; the final output of a Spark program is thus an RDD holding the transformed data as desired. The elements of the output RDD can be saved as a text file, object file, etc. (more…)
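The read → transform → write flow described above can be illustrated without a cluster. Below is a minimal pure-Python sketch (plain lists stand in for RDDs, and `parse_record` / `is_valid` are hypothetical business-logic names) mirroring the `map` → `filter` chain that Spark's RDD API exposes:

```python
# Sketch of the RDD transformation model: each step takes one
# immutable collection and produces a new one, never mutating in place.

def parse_record(line):
    # "map" step: raw text line -> structured record (hypothetical CSV format)
    name, amount = line.split(",")
    return {"name": name, "amount": float(amount)}

def is_valid(record):
    # "filter" step: keep only records satisfying the business rule
    return record["amount"] > 0

raw = ["alice,10.5", "bob,-3.0", "carol,7.25"]  # stands in for the input RDD
parsed = [parse_record(line) for line in raw]   # rdd.map(parse_record)
valid = [r for r in parsed if is_valid(r)]      # rdd.filter(is_valid)

print(valid)
```

In actual Spark the same chain would be written against a SparkContext, e.g. `sc.textFile(path).map(parse_record).filter(is_valid).saveAsTextFile(out)`, with the transformations evaluated lazily across the cluster.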

Read More

Big data analysis using Hive


Hive is a data warehouse system for accessing and querying data stored in the Hadoop file system. Hive uses a language called Hive Query Language (HQL), whose grammar and predicates closely follow those of SQL.

Our experience using Hive to analyze large data sets (big data) of bank-card transactions has given us the opportunity to apply the best features Hive offers to date, generating in-depth analytical reports that throw light on several dimensions of customers' card usage.

Here is a brief recap of the other parts of the Hadoop framework mentioned in this write-up. Hadoop is basically two parts: 1. a distributed file system (the Hadoop Distributed File System, referred to as HDFS), and 2. MapReduce, a computing and processing framework. Hive provides the data warehouse facility on top of Hadoop.

I will share here some of the best features of Hive that proved very handy for generating analytical reports from the large data sets in HDFS, processed (cleaned and transformed) using Spark and Spark SQL… (more…)

Read More

Big data ingestion using Apache Sqoop

By Prasad Khode

Apache Sqoop is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases.

We can use Sqoop to import data from external structured data stores into the Hadoop Distributed File System or related systems such as Hive and HBase. Similarly, Sqoop can be used to extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses. … (more…)
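To make the import direction concrete, here is a sketch that assembles (but does not execute) a typical Sqoop import command in Python. The JDBC URL, table name, and target directory are placeholders, while `--connect`, `--table`, `--target-dir`, and `--num-mappers` are standard Sqoop import options:

```python
# Assemble a typical "sqoop import" invocation as a command list.
# All connection details below are placeholders, not real endpoints.
cmd = [
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source RDBMS (placeholder)
    "--table", "transactions",                  # table to import (placeholder)
    "--target-dir", "/user/etl/transactions",   # HDFS destination (placeholder)
    "--num-mappers", "4",                       # parallel import tasks
]
print(" ".join(cmd))
```

The export direction is symmetric: `sqoop export` with `--export-dir` pointing at the HDFS data to push back into the relational store.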

Read More

Unit Testing MapReduce with the MRUnit framework

By Anusha Jallipalli

Hadoop has become an indispensable framework for processing Big Data. Our clients have recognized the need to process their huge data volumes to generate revenue, which keeps us engaged in developing solutions on the Hadoop framework.

In this discussion, I present a simple and straightforward way of unit-testing Hadoop MR programs from the Eclipse IDE.

MapReduce jobs are relatively simple. In the map phase, a function is applied to each input record, producing one or more key-value pairs. In the reduce phase, the key-value pairs are grouped by key and a function is applied over each group.
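The two phases can be sketched end to end without Hadoop at all. The following is a minimal word-count example in plain Python, with a sort standing in for the framework's shuffle step; the shapes of `mapper` and `reducer` match the description above:

```python
from itertools import groupby
from operator import itemgetter

def mapper(record):
    # Map phase: one input record -> one or more (key, value) pairs
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Reduce phase: one key and all its values -> an aggregated pair
    return (key, sum(values))

records = ["big data", "big deal"]
pairs = [kv for rec in records for kv in mapper(rec)]
pairs.sort(key=itemgetter(0))                      # shuffle/sort stand-in
result = [reducer(k, (v for _, v in grp))
          for k, grp in groupby(pairs, key=itemgetter(0))]

print(result)   # [('big', 2), ('data', 1), ('deal', 1)]
```

In a real Hadoop job the grouping and sorting between the two phases is done by the framework; only the mapper and reducer bodies are user code.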

Testing mappers and reducers is like testing any other function: a given input should produce an expected output. … (more…)
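That input-to-expected-output idea is easy to demonstrate. Here is a minimal sketch (the full post covers the MRUnit specifics) using hypothetical word-count functions and plain assertions:

```python
def wordcount_mapper(line):
    # Hypothetical mapper under test: emits (word, 1) per word
    return [(word, 1) for word in line.split()]

def wordcount_reducer(key, values):
    # Hypothetical reducer under test: sums the counts for one key
    return (key, sum(values))

# Unit tests: each given input must yield the expected output
assert wordcount_mapper("a b a") == [("a", 1), ("b", 1), ("a", 1)]
assert wordcount_reducer("a", [1, 1]) == ("a", 2)
print("all mapper/reducer tests passed")
```

MRUnit plays the analogous role for Java mappers and reducers, with its `MapDriver` and `ReduceDriver` classes supplying the inputs and asserting the expected outputs.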

Read More

Debug MapReduce Code in Eclipse

By Prasad Khode

I am involved in developing a critical solution for a financial organization. The solution involves analysis of a multitude of transactions and data coming from multiple sources. Hadoop is the most flexible option for handling big data, and it efficiently implements MapReduce.

I share here my experience of executing and debugging MapReduce code in Eclipse just like any other Java program.

When we run MapReduce code in Eclipse, Hadoop runs in a special mode called LocalJobRunner, in which all the Hadoop daemons run in a single JVM (Java Virtual Machine) instead of several different JVMs.

The default file paths are set to local file system paths, not HDFS paths. … (more…)

Read More