Homage to Dr APJ Abdul Kalam

The Bimarian team salutes Dr APJ Abdul Kalam, an epitome of humanity. It is a befitting honor to the world that the United Nations declared Kalamji's birthday, October 15, as World Students' Day.

We would all love to see the speeches, quotes and books of this great teacher become part of the educational curriculum right from the primary school level. We trust that academicians across the world have already initiated this, and we wish Indian educationists to be the harbingers.

We appeal to the Information Technology community to share the thoughts of this yogi of humanity with the kids – their own children, nephews and nieces – as anecdotes from his books Wings of Fire, India 2020 and Ignited Minds, which we believe are precursors of the thought of how important a livable planet Earth is.

We solemnly pray that Heaven enjoys the benefit of the presence of this great soul of wisdom, who helped in creating a livable planet Earth.

Bimarian Team

Read More

Installing Cloud Foundry BOSH Command Line Interface (CLI) on CentOS 7

By Harikrishna Doredla

Cloud Foundry BOSH is an open source tool. The BOSH Command Line Interface (CLI) is used to interact with the Director and to bootstrap new BOSH environments. The CLI is written in Ruby and is provided by two gems:
  • bosh_cli – contains the main operator commands.
  • bosh_cli_plugin_micro – contains the bootstrapping commands.

Install the two gems following the steps below:  (more…)
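The installation itself comes down to installing the two gems named above – a sketch, assuming Ruby and RubyGems are already present on the CentOS 7 host:

```shell
# Assumes Ruby and RubyGems are already installed (e.g. via yum).
gem install bosh_cli --no-ri --no-rdoc
gem install bosh_cli_plugin_micro --no-ri --no-rdoc

# Verify the CLI is available on the PATH
bosh version
```

The `--no-ri --no-rdoc` flags simply skip local documentation generation to speed up the install.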

Read More

Analyze Big data using Apache Spark SQL

By  Rajashekar Yedla

Apache Spark SQL is a powerful data processing engine and in-memory computing framework for processing and analyzing large volumes of data quickly. We fetch the elements of an RDD into a Spark SQL table and query that table. We can write only SELECT queries on a Spark SQL table; no other SQL operations are possible, and a SELECT query on Spark SQL returns only an RDD. It has rich APIs in three languages (Java, Scala and Python).

We use Spark SQL extensively to perform ETL on Big Data, where it lets us dispense with writing complex Spark code.

Working with Spark SQL:

As in Spark, to start with Spark SQL we first have to get the data into an RDD (Resilient Distributed Dataset). Once the RDD is available, we create a Spark SQL table with the desired RDD elements as table records; we achieve this using SparkSqlContext. We then implement the business logic by writing appropriate SELECT queries on the Spark SQL tables. The output of a query is another RDD, whose elements can be saved as a Text file or an Object file, as needed. (more…)
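The flow described above – load records, expose them as a table, run a SELECT, collect the result – can be sketched without a Spark cluster using Python's built-in sqlite3 module as a stand-in for SparkSqlContext (the table name, columns and data below are made up for illustration):

```python
import sqlite3

# Records that would normally arrive as RDD elements (hypothetical schema).
records = [("card-1", 250.0), ("card-2", 75.0), ("card-3", 310.0)]

# Create an in-memory table from the records, as Spark SQL does from an RDD.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (card TEXT, amount REAL)")
conn.executemany("INSERT INTO transactions VALUES (?, ?)", records)

# Business logic expressed as a SELECT query, as on a Spark SQL table.
high_value = [row[0] for row in conn.execute(
    "SELECT card FROM transactions WHERE amount > 100 ORDER BY card")]
print(high_value)  # ['card-1', 'card-3']
```

In real Spark SQL the query result would be a new RDD rather than a Python list, but the shape of the pipeline is the same.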

Read More

Executing an Oozie workflow of Spark jobs in a shell action

By Anusha Jallipalli

Oozie is a server-based workflow engine that runs in a Java servlet container to schedule and manage Hadoop jobs.

Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.

Oozie is integrated with the rest of the Hadoop stack, supporting several types of Hadoop jobs (Java MapReduce, Streaming MapReduce, Pig, Hive, Sqoop and DistCp) as well as system-specific jobs such as Java programs and shell scripts.

Here we discuss a simple workflow that takes input from HDFS and performs a word count using a Spark job; job-1 passes its output to job-2 as programmed in the shell script. (more…)
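A minimal sketch of such a workflow definition, with a single shell action invoking a wrapper script that submits the Spark jobs (the names `wordcount-wf` and `run-spark-jobs.sh` are hypothetical placeholders):

```xml
<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="spark-shell-action"/>
    <action name="spark-shell-action">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <exec>run-spark-jobs.sh</exec>
            <file>${appPath}/run-spark-jobs.sh</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Workflow failed: [${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
```

The shell script itself would chain the two spark-submit invocations, feeding job-1's output directory to job-2.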

Read More

Building Fault-Tolerant Web Application on AWS

By Hari Doredla


The Star Interactive platform is a web application that enables an interactive exchange of messages between a celebrity and fans. Some changes were brought into this game: the celebrity tweets messages that should reach millions of fans across the globe, and the fans tweet messages back. As the fans started tweeting back, the load on the web server spiked sharply and the server stopped responding to requests from the massive number of fans.


We chose the AWS cloud as the best platform on which to build the web application, getting better performance with minimal cost and high availability (fault tolerance), rather than scaling up the web servers with Squid as a load balancer.

AWS: Amazon Web Services is a collection of remote computing services that make up a cloud computing platform, offered over the Internet by Amazon.com… (more…)

Read More

Processing Big data with Apache Spark

By  Rajashekar Yedla

Apache Spark is used for streaming over large data sets, including HDFS, HBase and Cassandra, to perform ETL and advanced analytics. Spark has rich APIs that support the Java, Scala and Python languages. At Bimarian we use Apache Spark extensively to perform ETL on Big Data. Since Spark is an in-memory computing framework, we use it to reduce unnecessary writes to and reads from the disk, and we have found very good performance on the complex ETL operations in our Big Data applications.

Working with Spark:

We begin by fetching data into a Spark RDD. RDDs (Resilient Distributed Datasets) are immutable, distributed collections of records that can be sourced from HDFS or HBase. The data in an RDD is transformed per the business logic, producing another RDD; the final output of a Spark program is thus an RDD holding the transformed data as desired. The elements of the output RDD can be saved as a Text file, an Object file, etc. (more…)
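When no cluster is at hand, the transformation chain behind the word-count example (flatMap → map → reduceByKey) can be mirrored over a plain Python list – the comments name Spark's RDD operations, but this is an illustration in ordinary Python, not Spark itself:

```python
from collections import Counter

# Input that would normally be an RDD loaded from HDFS or HBase.
lines = ["spark makes etl fast", "spark is in-memory"]

# flatMap: split each line into individual words.
words = [word for line in lines for word in line.split()]

# map + reduceByKey: pair each word with 1, then sum the counts per word.
counts = Counter(words)

print(counts["spark"])  # 2
```

In Spark each step would yield a new immutable RDD, evaluated lazily across the cluster; here every step materializes eagerly in local memory.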

Read More

Big data analysis using HIVE


Hive is a data warehouse system for accessing and querying data stored in the Hadoop File System. Hive uses a language called Hive Query Language (HQL), whose grammar and predicates are the same as those used in SQL.

Our experience using Hive to analyze large data sets (big data) related to bank-card transactions has given us the opportunity to garner the best features of Hive available to date, generating in-depth analytical reports on card transactions that throw insight on several dimensions of customers' card usage.

Here is a brief recap of the other parts of the Hadoop framework mentioned in this write-up. Hadoop has basically two parts: 1. a distributed file system (the Hadoop File System, referred to as HDFS), and 2. MapReduce, a computing and processing framework. Hive provides the data warehouse facility on top of Hadoop.

I will share here some of the best features of Hive that were very handy for generating analytical reports from the large data sets in HDFS, processed (cleaned and transformed) using Spark and Spark SQL. … (more…)
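To give the flavor, a report such as "monthly spend per card" comes down to a single HQL query – the table and column names below are hypothetical, not the actual bank-card schema:

```sql
-- Hypothetical card-transactions table; the functions used are standard HQL.
SELECT card_id,
       substr(txn_date, 1, 7) AS txn_month,
       COUNT(*)               AS txn_count,
       SUM(amount)            AS total_spend
FROM   card_transactions
GROUP  BY card_id, substr(txn_date, 1, 7);
```

Hive compiles such a query into MapReduce jobs over the files in HDFS, so no hand-written job code is needed for this class of report.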

Read More

Install and Configure PostgreSQL on CentOS

By Harikrishna Doredla

Here is how you go about installing PostgreSQL 9.3 on CentOS 6.5:

  • Check the CentOS server version:
    [root@hadoop3 init.d]# cat /etc/redhat-release
    CentOS release 6.5 (Final)
    [root@hadoop3 init.d]#
  1. Edit the repo file /etc/yum.repos.d/CentOS-Base.repo and append a line, as below, to the [base] and [updates] sections:
  2. rpm -Uvh http://yum.postgresql.org/9.3/redhat/rhel-6-x86_64/pgdg-centos93-9.3-1.noarch.rpm
  3. yum list postgres*
  4. yum install postgresql93-server postgresql93
  5. service postgresql-9.3 initdb
  6. service postgresql-9.3 status
  7. chkconfig --add postgresql-9.3
  8. chkconfig postgresql-9.3 on
  9. service postgresql-9.3 start
  10. su - postgres
  11. psql --version
      psql (PostgreSQL) 9.3.3
  12. Set a password for the first login after installation:
      ALTER USER postgres WITH ENCRYPTED PASSWORD 'postgres';
  13. Enable remote access:
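Enabling remote access typically involves two configuration files in the data directory (/var/lib/pgsql/9.3/data/ by default for this build) – a sketch to be adapted to your own network; the subnet below is an example, not a recommendation:

```
# postgresql.conf – listen on all interfaces instead of localhost only
listen_addresses = '*'

# pg_hba.conf – allow md5 password logins from a trusted subnet (example range)
host    all    all    192.168.1.0/24    md5
```

A `service postgresql-9.3 restart` is needed afterwards for the changes to take effect.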


Read More

Big data ingestion using Apache Sqoop

By Prasad Khode

Apache Sqoop is a tool designed to efficiently transfer bulk data to and from Apache Hadoop and structured datastores such as relational databases.

We can use Sqoop to import data from external structured data stores into the Hadoop Distributed File System or related systems like Hive and HBase. Similarly, Sqoop can be used to extract data from Hadoop and export it to external structured data stores such as relational databases and enterprise data warehouses. … (more…)
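A typical import invocation looks like the sketch below; the connection string, credentials, table name and paths are placeholders for illustration:

```shell
# Import a table from MySQL into HDFS (all identifiers are placeholders).
sqoop import \
  --connect jdbc:mysql://db.example.com/sales \
  --username sqoop_user -P \
  --table transactions \
  --target-dir /user/hadoop/transactions \
  --num-mappers 4
```

The `-P` flag prompts for the password at run time, and `--num-mappers` controls how many parallel map tasks split the import.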

Read More