My first encounter with AWS Device Farm

By Sandhya

Device Farm is an app testing service, one of the many AWS services. It lets you test Android apps on real, physical phones of many makes and models.

My maiden attempt to test a native app using AWS Device Farm was interesting and successful. I tested the app using the Appium framework, which supports cross-platform testing. I will share my experience with the Appium framework in my next post.

Here is how I used AWS Device Farm to test a native Android app with an Appium automation test; my code is below. (more…)



By Prasad Khode

A MySQL event is a task that acts on the data at a scheduled time. Events are not like triggers, which run in response to a change in data. Events are scheduled to run once or any number of times, much like a cron job.

MySQL events were added in MySQL 5.1.6. The event scheduler is a process that runs in the background and looks for events to execute. Before creating or scheduling an event in MySQL, we first need to verify that the event scheduler is enabled; it is OFF by default in MySQL versions before 8.0.

To check whether the scheduler is enabled, use the following command: (more…)


Configure Apache Phoenix in CDH 5.4

By Prasad Khode

Apache Phoenix is an open source relational database layer on top of a NoSQL store such as Apache HBase. Phoenix provides a JDBC driver that hides the intricacies of the NoSQL store, enabling users to create, delete, and alter SQL tables, views, indexes, and sequences; upsert and delete rows singly and in bulk; and query data through SQL.

Follow these steps to configure Apache Phoenix in Cloudera Distribution for Hadoop (CDH):

1. Log in to Cloudera Manager, click on Hosts, then Parcels.
2. Select Edit Settings.
3. Click the + sign next to an existing Remote Parcel Repository URL, and add the URL: Click Save Changes.
4. Select Hosts, then Parcels.
5. In the list of Parcel Names, CLABS_PHOENIX is now available. Select it and choose Download.
6. The first cluster is selected by default. To choose a different cluster for distribution, select it. Find CLABS_PHOENIX in the list, and click Distribute.
7. If you want to use secondary indexing, add the following to the hbase-site.xml advanced configuration snippet. Go to the HBase service, click Configuration, and search for HBase Service Advanced Configuration Snippet (Safety Valve) for hbase-site.xml. Paste in the following XML, then save the changes.

8. Restart the HBase service.

Using Apache Phoenix Utilities: (more…)


CRUD operations in MongoDB using Java

By Narmada Yalagandula

MongoDB is a document-oriented database. It provides high performance, high availability, and easy scalability. It stores data in JSON-like documents instead of the tables and records of an RDBMS. The basic unit of data in MongoDB is the document, which comprises key-value pairs. Collections are sets of documents and play the role that tables play in an RDBMS.

Here I show how to connect to MongoDB using Java and perform CRUD operations on MongoDB collections.

To connect to MongoDB, we add the MongoDB driver as a dependency in the pom.xml file.



CRUD Operations in HBase

By Anusha Jallipalli

HBase is a data model similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. Here I discuss how to connect to HBase using Java and perform several basic operations on it.

We will use hbase-client to connect to HBase from Java, so we should add hbase-client as a dependency in the pom.xml file.

The following files need to be placed in the HBase environment in order to manage data in HBase tables from Java. (more…)


Bloom Filters – Implementation in Java

By  Ananth Kumar Ganesna


A Bloom filter is a very compact data structure that supports approximate membership queries on a set, allowing false positives.

The term Bloom filter names a data structure that supports membership queries on a set of elements. It was introduced by Burton Bloom [1970]. It differs from ordinary dictionary data structures, as the result of a membership query might be “true” although the element is not actually contained in the set. Since the data structure is randomized by using hash functions, reporting a false positive occurs with a certain probability, called the false positive rate (FPR).
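
For reference, the false positive rate mentioned above has a standard textbook estimate. The symbols are not from the original post and are introduced here as assumptions: m is the number of bits in the filter, k is the number of hash functions, n is the number of elements inserted, and the hash functions are assumed to behave like independent uniform random functions.

```latex
\mathrm{FPR} \approx \left(1 - e^{-kn/m}\right)^{k},
\qquad
k_{\mathrm{opt}} = \frac{m}{n}\,\ln 2
```

For example, with m/n = 10 bits per element and k = 7, this estimate gives an FPR of roughly 0.8%.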

Use cases:

Caching is important for CPU-intensive operations: it stores the results of such operations so that they can be accessed quickly.

Sometimes, because of I/O cost, you do not want to hit the database repeatedly and would prefer to cache the results, updating the cache only when the underlying data changes.

Similarly, there are other use cases where we need a quick lookup to decide what to do with an incoming request. For example, consider having to identify whether a URL points to a malware site. There could be many such URLs, and caching all of them in memory on one instance would require a lot of space.

Another use case is identifying whether a user-typed string refers to a place in the USA. In “museum in Washington”, for example, Washington is the name of a place in the USA. Should we keep all the places in the USA in memory and then look them up? How big would the cache be? Is it effective to do this without any database support?

This is where we need to move beyond the basic map data structure and look for answers in a more advanced data structure such as the Bloom filter.

You can think of a Bloom filter as being like any other Java collection: you can put items in it and ask whether an item is already present (like a HashSet). If the Bloom filter says that it does not contain the item, then the item is definitely not present. But if it says that it has seen the item, that may be wrong (a false positive). With careful design, the probability of such a wrong answer can be controlled. (more…)
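
The behaviour described above can be sketched in a few lines of Java. This is a minimal illustration, not the implementation from the full post: the class name SimpleBloomFilter, the method names, and the choice of String.hashCode() plus FNV-1a combined by double hashing are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

/** Minimal Bloom filter for strings; an illustrative sketch, not production code. */
final class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;    // m: number of bits in the filter
    private final int numHashes;  // k: bit positions set per item

    SimpleBloomFilter(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    /** Records the item by setting its k bit positions. */
    void add(String item) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        for (int i = 0; i < numHashes; i++) {
            bits.set(position(h1, h2, i));
        }
    }

    /** Returns false if the item was definitely never added; true means "possibly added". */
    boolean mightContain(String item) {
        int h1 = item.hashCode();
        int h2 = fnv1a(item);
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(h1, h2, i))) {
                return false;
            }
        }
        return true;
    }

    // Double hashing: derive the i-th bit position from two base hashes.
    private int position(int h1, int h2, int i) {
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // 32-bit FNV-1a over the UTF-8 bytes, used as the second base hash.
    private static int fnv1a(String s) {
        int hash = 0x811C9DC5;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xFF);
            hash *= 16777619;
        }
        return hash;
    }

    public static void main(String[] args) {
        SimpleBloomFilter places = new SimpleBloomFilter(1 << 16, 5);
        places.add("washington");
        places.add("boston");
        // Added items are always reported as possibly present (no false negatives).
        System.out.println(places.mightContain("washington"));
        // An item that was never added is reported absent unless a false positive occurs.
        System.out.println(places.mightContain("atlantis"));
    }
}
```

The filter stores only bits, never the items themselves, which is why it is so much smaller than a HashSet holding every malware URL or place name. The trade-off is that items cannot be listed or removed.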


Selenium WebDriver – Installation & Configuration

By Sandhya

Selenium WebDriver is an elegant programming interface with a compact, object-oriented API. It is an efficient tool for testing web applications in different browsers, irrespective of the programming language of the application. WebDriver makes direct calls to the browser using each browser’s native support.

I will discuss here the installation and configuration of WebDriver, illustrating the process with a simple example.

Step 1:

Check whether Java is installed on your machine [enter the command java -version at the command prompt]. If Java is not installed, download and install the Java JDK [download JDK].



Now set the Java path. [How to set path]

Step 2:

Download the Eclipse IDE [Download Eclipse IDE]. Extract the downloaded zip file; no installation is required to use Eclipse.


Step 3:

Download the Selenium Java client driver [JavaClientDriver] and extract it.

(more…)


Integrate Kafka, Spark Streaming and HBase (Kafka -> Spark Streaming -> HBase)

By Anusha Jallipalli

Apache Kafka is a publish-subscribe messaging system. It is a distributed, partitioned, replicated commit log service.

Spark Streaming is a sub-project of Apache Spark. Spark is a batch processing platform similar to Apache Hadoop. Spark Streaming is a real-time processing tool that runs on top of the Spark engine.

Create a pojo class as below:

The StreamingToHbase program takes 4 parameters as input: <zkQuorum> <group> <topics> <numThreads>

zkQuorum: a list of one or more ZooKeeper servers that make up the quorum

group: the name of the Kafka consumer group

topics: a list of one or more Kafka topics to consume from

numThreads: the number of threads the Kafka consumer should use (more…)
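
The StreamingToHbase program itself is not shown in this excerpt. As a rough sketch of how the four parameters might be read, the hypothetical helper below parses them into a small holder object. The class name StreamingArgs, its field names, and the assumption that topics arrive as a comma-separated string are all illustrative. The topic-to-thread-count map mirrors the shape that Spark's receiver-based Kafka stream API expects.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical holder for the four command-line parameters described above. */
final class StreamingArgs {
    final String zkQuorum;                    // ZooKeeper servers, e.g. "host1:2181,host2:2181"
    final String group;                       // Kafka consumer group name
    final Map<String, Integer> topicThreads;  // topic name -> number of consumer threads

    private StreamingArgs(String zkQuorum, String group, Map<String, Integer> topicThreads) {
        this.zkQuorum = zkQuorum;
        this.group = group;
        this.topicThreads = Collections.unmodifiableMap(topicThreads);
    }

    /** Parses <zkQuorum> <group> <topics> <numThreads>; topics are assumed comma-separated. */
    static StreamingArgs parse(String[] args) {
        if (args.length != 4) {
            throw new IllegalArgumentException(
                "Usage: StreamingToHbase <zkQuorum> <group> <topics> <numThreads>");
        }
        int numThreads = Integer.parseInt(args[3]);
        Map<String, Integer> topicThreads = new LinkedHashMap<>();
        for (String topic : args[2].split(",")) {
            String name = topic.trim();
            if (!name.isEmpty()) {
                topicThreads.put(name, numThreads);  // same thread count for every topic
            }
        }
        return new StreamingArgs(args[0], args[1], topicThreads);
    }

    public static void main(String[] args) {
        StreamingArgs parsed = parse(new String[] {"localhost:2181", "demo-group", "topicA,topicB", "2"});
        System.out.println(parsed.zkQuorum + " " + parsed.group + " " + parsed.topicThreads);
    }
}
```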


Reading Data from HBase table using Spark

By Anusha Jallipalli

HBase is a data model, similar to Google’s Bigtable, designed to provide quick random access to huge amounts of structured data. It leverages the fault tolerance provided by the Hadoop Distributed File System (HDFS).

HBase is a column-oriented database built on top of HDFS. It is open source and horizontally scalable. HBase is used to access very large tables (billions of rows by millions of columns) atop clusters of commodity hardware.

Suppose we have a table named “student_info” in HBase with the column family “details” and column qualifiers “sid, firstName, lastName, branch, emailId”. Create a POJO class as below: (more…)
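
The POJO listing is not included in this excerpt. A minimal sketch of what such a class could look like follows; the table, column family, and qualifier names come from the text above. The class name StudentInfo, the fromColumns helper, and the plain Map of qualifier to value (a stand-in for a decoded HBase row) are assumptions for illustration. It implements Serializable because Spark generally needs to serialize objects that it ships between nodes.

```java
import java.io.Serializable;
import java.util.Map;

/** Hypothetical POJO for one row of the "student_info" table, "details" column family. */
final class StudentInfo implements Serializable {
    private static final long serialVersionUID = 1L;

    static final String TABLE = "student_info";
    static final String COLUMN_FAMILY = "details";

    private final String sid;
    private final String firstName;
    private final String lastName;
    private final String branch;
    private final String emailId;

    StudentInfo(String sid, String firstName, String lastName, String branch, String emailId) {
        this.sid = sid;
        this.firstName = firstName;
        this.lastName = lastName;
        this.branch = branch;
        this.emailId = emailId;
    }

    /** Builds a StudentInfo from qualifier -> value pairs; missing qualifiers become null. */
    static StudentInfo fromColumns(Map<String, String> columns) {
        return new StudentInfo(
            columns.get("sid"),
            columns.get("firstName"),
            columns.get("lastName"),
            columns.get("branch"),
            columns.get("emailId"));
    }

    String getSid() { return sid; }
    String getFirstName() { return firstName; }
    String getLastName() { return lastName; }
    String getBranch() { return branch; }
    String getEmailId() { return emailId; }

    @Override
    public String toString() {
        return "StudentInfo{sid=" + sid + ", firstName=" + firstName + ", lastName=" + lastName
            + ", branch=" + branch + ", emailId=" + emailId + "}";
    }
}
```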


Using Apache Spark to read data from Cassandra tables

By Prasad Khode

Apache Cassandra is an open source distributed database management system designed to handle large amounts of data across many commodity servers. It provides high availability with no single point of failure (SPOF), meaning the failure of one part does not stop the entire system. Cassandra offers robust support for clusters spanning multiple data centers, with asynchronous masterless replication allowing low-latency operations for all clients.

Apache Spark is an open-source cluster computing framework originally developed in the AMPLab at UC Berkeley. In contrast to Hadoop’s two-stage disk-based MapReduce paradigm, Spark’s in-memory primitives provide performance up to 100 times faster for certain applications. Spark is well suited to machine learning algorithms, as it allows user programs to load data into a cluster’s memory and query it repeatedly.

Suppose we have a table named “table_user” in our Cassandra database with the columns “user_first_name, user_last_name, user_email, date_of_birth”. We create a POJO class as below:
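
The POJO listing is not included in this excerpt, so the following is only a sketch of a JavaBean-style class for this table. The class name TableUser, the camelCase property names, and the use of java.util.Date for date_of_birth are assumptions; the actual column type is not stated in the text. In a real project this would normally be a public class in its own file.

```java
import java.io.Serializable;
import java.util.Date;

/** Hypothetical bean for a row of "table_user"; property names mirror the column names. */
class TableUser implements Serializable {
    private static final long serialVersionUID = 1L;

    private String userFirstName;  // column user_first_name
    private String userLastName;   // column user_last_name
    private String userEmail;      // column user_email
    private Date dateOfBirth;      // column date_of_birth (type assumed)

    // A no-argument constructor keeps the class usable as a plain JavaBean.
    TableUser() { }

    public String getUserFirstName() { return userFirstName; }
    public void setUserFirstName(String userFirstName) { this.userFirstName = userFirstName; }

    public String getUserLastName() { return userLastName; }
    public void setUserLastName(String userLastName) { this.userLastName = userLastName; }

    public String getUserEmail() { return userEmail; }
    public void setUserEmail(String userEmail) { this.userEmail = userEmail; }

    public Date getDateOfBirth() { return dateOfBirth; }
    public void setDateOfBirth(Date dateOfBirth) { this.dateOfBirth = dateOfBirth; }

    @Override
    public String toString() {
        return "TableUser{userFirstName=" + userFirstName + ", userLastName=" + userLastName
            + ", userEmail=" + userEmail + ", dateOfBirth=" + dateOfBirth + "}";
    }
}
```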

To read data from the Cassandra tables using Apache Spark:

  1. Add Apache Spark & Cassandra dependencies in pom.xml
  2. Configure SparkConf object with the Cassandra database details
  3. Use SparkConf object and create JavaSparkContext object
  4. Use JavaSparkContext object and read the data from Cassandra table.

