Building a Fault-Tolerant Web Application on AWS

By Hari Doredla


The Star Interactive platform is a web application that enables interactive exchange of messages between a celebrity and fans. Recently, some changes were introduced to this application. A celebrity tweets messages that must reach millions of fans across the globe, and the fans tweet messages back to the celebrity. As fans started tweeting back, the load on the web server spiked sharply and the server stopped responding to requests from the massive number of fans.


We chose the AWS cloud as the best platform on which to build the web application, giving us better performance at minimal cost along with high availability (fault tolerance), rather than scaling up our own web servers with Squid as a load balancer.

AWS: Amazon Web Services is a collection of remote computing services that make up a cloud computing platform, offered over the Internet by… (more…)

Read More

Processing Big Data with Apache Spark

By Rajashekar Yedla

Apache Spark is used for streaming and batch processing over large data sets, including data in HDFS, HBase, and Cassandra, to perform ETL and advanced analytics. Spark has rich APIs supporting the Java, Scala, and Python languages. At Bimarian we use Apache Spark extensively to perform ETL on Big Data. Because Spark is an in-memory computing framework, we use it to avoid unnecessary writes and reads to disk, and we have seen very good performance on complex ETL operations in our Big Data applications.

Working with Spark:

A Spark program begins by loading data into a Spark RDD. RDDs (Resilient Distributed Datasets) are immutable, resilient, distributed collections of records that can be built from data stored in HDFS or HBase. The data in an RDD is transformed as per the business logic, producing another RDD; the final output of a Spark program is thus an RDD containing the transformed data. The elements of the output RDD can be saved as a text file, object file, etc. … (more…)
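The transform chain described above can be sketched in plain Python. This is a hypothetical illustration on an in-memory list rather than a real cluster; with PySpark the equivalent calls would be sc.textFile(...), rdd.map(...), rdd.filter(...), and rdd.saveAsTextFile(...), and the sample records are invented for the example.

```python
# Plain-Python sketch of an RDD-style transform chain (illustrative only).
# In PySpark: sc.textFile(path).map(parse).filter(pred).saveAsTextFile(out)

records = ["10,books,250", "11,games,0", "12,music,125"]  # hypothetical CSV rows

# "map" step: parse each record into a (category, amount) pair,
# as rdd.map(parse) would do for every element of the RDD
parsed = [(r.split(",")[1], int(r.split(",")[2])) for r in records]

# "filter" step: keep only records with a positive amount,
# as rdd.filter(lambda kv: kv[1] > 0) would, producing a new collection
nonzero = [(cat, amt) for cat, amt in parsed if amt > 0]

print(nonzero)
```

Each step produces a new collection from the previous one, mirroring how every RDD transformation yields a new immutable RDD rather than modifying the input.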

Read More

Big Data analysis using Hive


Hive is a data warehouse system for accessing and querying data stored in the Hadoop file system. Hive uses a language called Hive Query Language (HQL), with grammar and predicates similar to those of SQL.

Our experience using Hive to analyze large data sets (big data) of bank-card transactions has given us the opportunity to exploit the best features Hive offers to date, generating in-depth analytical reports on card transactions that throw insight on several dimensions of customer card usage.

Here is a brief recap of the other parts of the Hadoop framework mentioned in this write-up. Hadoop has basically two parts: 1. a distributed file system (the Hadoop file system, referred to as HDFS), and 2. MapReduce, a computing and processing framework. Hive provides a data warehouse facility on top of Hadoop.

I will share here some of the best features of Hive that proved very handy for generating analytical reports from large data sets in HDFS that had been processed (cleaned and transformed) using Spark and Spark SQL… (more…)
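As a flavor of the kind of HQL this involves, here is a minimal sketch; the table and column names are hypothetical, not taken from our actual card-transaction schema:

```sql
-- Hypothetical external table over cleaned transaction files in HDFS
CREATE EXTERNAL TABLE card_txns (
  card_id   STRING,
  merchant  STRING,
  amount    DOUBLE,
  txn_date  STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/card_txns';

-- Spend per merchant, using SQL-style grammar and predicates
SELECT merchant, SUM(amount) AS total_spend
FROM card_txns
WHERE amount > 0
GROUP BY merchant
ORDER BY total_spend DESC;
```

Because the table is EXTERNAL, Hive only overlays a schema on the files already sitting in HDFS; dropping the table does not delete the underlying data.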

Read More

Install and Configure PostgreSQL on CentOS

By Harikrishna Doredla

Here is how to install PostgreSQL 9.3 on CentOS 6.5.

  • Check the CentOS server version:
    [root@hadoop3 init.d]# cat /etc/redhat-release
    CentOS release 6.5 (Final)
  1. Edit the repo file /etc/yum.repos.d/CentOS-Base.repo: in the [base] and [updates] sections, append an exclusion line for the distribution's PostgreSQL packages (the standard PGDG instructions use exclude=postgresql*).
  2. rpm -Uvh
  3. yum list postgres*
  4. yum install postgresql93-server postgresql93
  5. service postgresql-9.3 initdb
  6. service postgresql-9.3 status
  7. chkconfig --add postgresql-9.3
  8. chkconfig postgresql-9.3 on
  9. service postgresql-9.3 start
  10. su - postgres
  11. psql --version
    psql (PostgreSQL) 9.3.3
  12. Set a password for the postgres user after the first login:
    ALTER USER postgres WITH ENCRYPTED PASSWORD 'postgres';
  13. Enabling remote access:
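Enabling remote access typically involves two edits under the data directory, followed by a restart. This is a sketch assuming the default data directory for this install, /var/lib/pgsql/9.3/data; the subnet shown is an example value to adjust for your network:

```
# /var/lib/pgsql/9.3/data/postgresql.conf: accept connections on all interfaces
listen_addresses = '*'

# /var/lib/pgsql/9.3/data/pg_hba.conf: allow md5 password logins
# from a trusted subnet (example value)
host    all    all    192.168.1.0/24    md5
```

Then run service postgresql-9.3 restart for the changes to take effect.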


Read More

Big data ingestion using Apache Sqoop

By Prasad Khode

Apache Sqoop is a tool designed to efficiently transfer bulk data between Apache Hadoop and structured datastores such as relational databases.

We can use Sqoop to import data from external structured data stores into the Hadoop Distributed File System or related systems like Hive and HBase. Similarly, Sqoop can be used to extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses. … (more…)
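A minimal pair of Sqoop invocations might look like the following; the connection string, credentials, table names, and HDFS paths are hypothetical placeholders, not values from any real deployment:

```
# Import a MySQL table into HDFS (all names below are illustrative)
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table transactions \
  --target-dir /data/transactions \
  --num-mappers 4

# Export processed results from HDFS back to a relational table
sqoop export \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username etl_user -P \
  --table txn_summary \
  --export-dir /data/txn_summary
```

The --num-mappers flag controls how many parallel map tasks Sqoop launches, which is how it achieves bulk-transfer throughput.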

Read More

Kerberos authentication with Windows Active Directory

By Harikrishna Doredla

Enabling Kerberos authentication in Active Directory:

  • The domain controller is enabled for Kerberos service delegation by default.
  • Open Start -> Active Directory Users and Computers -> select your domain -> Domain Controllers -> right-click your default-first-site-name entry and click Properties; in the snap-in that opens, select the Delegation tab (see the figure below).
  • Make sure the "Trust this computer for delegation to any service (Kerberos only)" option is enabled. …


Read More

Unit Testing MapReduce with the MRUnit framework

By Anusha Jallipalli

Hadoop has become an indispensable framework for processing Big Data. Our clients have recognized the need to process their huge data volumes to generate revenue, which keeps us engaged in developing solutions on the Hadoop framework.

In this discussion, I present a simple and straightforward way of unit-testing Hadoop MapReduce programs from the Eclipse IDE.

MapReduce jobs are relatively simple. In the map phase, a function is applied to each input record, producing one or more key-value pairs. The reduce phase receives a group of key-value pairs and applies a function over that group.

Testing mappers and reducers is like testing any other function: a given input should produce an expected output. … (more…)
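As a sketch of how such a test reads with MRUnit: the WordCountMapper below is a hypothetical mapper assumed to emit a (word, 1) pair per token, and the class is not from the article itself.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Before;
import org.junit.Test;

public class WordCountMapperTest {
    // Drives the mapper in isolation: no cluster, no HDFS, just JUnit
    private MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;

    @Before
    public void setUp() {
        // WordCountMapper is a hypothetical mapper emitting (word, 1) per token
        mapDriver = MapDriver.newMapDriver(new WordCountMapper());
    }

    @Test
    public void emitsOnePerWord() throws Exception {
        mapDriver.withInput(new LongWritable(0), new Text("big data"))
                 .withOutput(new Text("big"), new IntWritable(1))
                 .withOutput(new Text("data"), new IntWritable(1))
                 .runTest();  // fails if actual output differs from expected
    }
}
```

runTest() feeds the input through the real map() method and asserts the emitted pairs match the declared expectations, which is exactly the "given input, expected output" discipline described above.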

Read More