Manage files in HDFS using the WebHDFS REST API

By Rajashekar Yedla

Web services have become indispensable in modern application development for exchanging data across applications and web applications. Various application programming interfaces (APIs) are emerging to expose web services, and Representational State Transfer (REST), the style already used by browsers, is the logical choice for building them.


I share here my understanding and experience of using the WebHDFS REST API. (more…)
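To give a flavour of the API, here is a minimal sketch (not from the original post) that lists a directory, creates a directory and writes a file over WebHDFS using Python requests; the NameNode address, user name and paths are placeholder assumptions.

    import requests

    # Placeholder NameNode HTTP address and HDFS user (adjust for your cluster).
    NAMENODE = "http://namenode.example.com:50070"
    USER = "hdfs"

    # List a directory: GET /webhdfs/v1/<path>?op=LISTSTATUS
    r = requests.get("{0}/webhdfs/v1/user/{1}?op=LISTSTATUS&user.name={1}".format(NAMENODE, USER))
    print(r.json())

    # Create a directory: PUT /webhdfs/v1/<path>?op=MKDIRS
    r = requests.put("{0}/webhdfs/v1/user/{1}/demo?op=MKDIRS&user.name={1}".format(NAMENODE, USER))
    print(r.json())  # {"boolean": true} on success

    # Create a file in two steps: the first PUT (op=CREATE) returns a redirect to a
    # DataNode, and the second PUT against that Location uploads the actual data.
    init = requests.put(
        "{0}/webhdfs/v1/user/{1}/demo/hello.txt?op=CREATE&user.name={1}".format(NAMENODE, USER),
        allow_redirects=False)
    requests.put(init.headers["Location"], data=b"hello from webhdfs")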

Read More

Solr Data Import Handler for Search in XML Files

By Raghavendran Pedapati

Apache Solr is a web application built around Lucene, a powerful search library that provides full-text indexing. The significant aspect of Lucene search is its inverted index, a keyword-centric data structure: word -> pages rather than page -> words.

Solr not only takes advantage of Lucene features such as the inverted index, spell checking, hit highlighting and advanced analysis/tokenization capabilities, but also builds on them through the Solr API to become a powerful search application in its own right. One of its advanced features is faceting, i.e. arranging search results into categories with numerical counts of the key terms.

With its easy scaling and distribution, Solr is thus a natural choice for developing sophisticated and efficient search applications.

The DataImportHandler (DIH) is a mechanism for importing structured data from a data store into Solr. It is often used with relational databases, but it can also handle XML with its XPathEntityProcessor. We can pass incoming XML through an XSL stylesheet, as well as parse and transform the XML with built-in DIH transformers. We could translate our arbitrary XML to Solr’s standard input XML format via XSL, map/transform the arbitrary XML to the Solr schema fields directly in the DIH config file, or use a combination of both. DIH is flexible.
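As a small illustration (not part of the original post), once a DIH request handler is configured for a core, the import can be triggered and monitored over plain HTTP; the core name and handler path below are assumptions.

    import requests

    # Placeholder Solr core with a DIH handler registered at /dataimport.
    DIH_URL = "http://localhost:8983/solr/xmlcore/dataimport"

    # Kick off a full import of the XML sources: clean the index first, commit at the end.
    requests.get(DIH_URL, params={"command": "full-import", "clean": "true",
                                  "commit": "true", "wt": "json"})

    # Poll the handler for progress.
    status = requests.get(DIH_URL, params={"command": "status", "wt": "json"}).json()
    print(status.get("status"), status.get("statusMessages"))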

I will discuss here how to deploy the Solr DIH to search XML files. (more…)

Read More

Calabash-android testing framework

By Sandhya

Calabash lets you write and execute automated acceptance tests of mobile apps on both Android and iOS. In essence, Calabash is a bridge that allows the Cucumber framework to run tests on iOS and Android devices, and its libraries enable the test code to work with both native and hybrid apps. The significant aspect of Cucumber is its support for Behaviour Driven Development (BDD): test scenarios are written in the plain-English grammar of the Gherkin language, with the keywords Given, When, and Then used to describe each scenario.

I discuss here how to run a simple test case using Calabash on Android apps. (more…)

Read More

Integration of HBase and Hive – an intro to inserting JSON data into HBase from Hive

By Anusha Jallipalli

Here is how JSON data is inserted into an HBase table using Hive.

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive cannot drop that table directly; you will have to use the HBase shell to drop such a table.

Registering the table is the first step. As part of the registration, you also need to specify a column mapping, which links Hive column names to the HBase table’s rowkey and columns; do so using the hbase.columns.mapping SerDe property.
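As a hedged sketch of what such a registration could look like (the table, column and connection details are illustrative, not the post's own example, which uses a Hive-managed table starting at Step 1 below), the DDL can be issued from Python through PyHive:

    from pyhive import hive

    # Maps an existing HBase table "json_table" into Hive as an EXTERNAL table:
    # ":key" binds the Hive column "rowkey" to the HBase rowkey, and "cf:json"
    # binds "json_data" to qualifier "json" in column family "cf". Dropping the
    # Hive table removes only the Hive metadata, as noted above.
    DDL = """
    CREATE EXTERNAL TABLE IF NOT EXISTS hbase_json_table (
      rowkey STRING,
      json_data STRING
    )
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,cf:json')
    TBLPROPERTIES ('hbase.table.name' = 'json_table')
    """

    conn = hive.Connection(host="localhost", port=10000, username="hive")
    conn.cursor().execute(DDL)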

Step 1: Create a new HBase table that is to be managed by Hive. (more…)

Read More

Standalone Spark cluster setup in the AWS cloud

By Harikrishna Doredla

Here I discuss how to set up a standalone Spark cluster in AWS using EC2.

Let’s assume we are setting up a 3-node standalone cluster. The nodes, with their instance types and hourly prices, are: one m4.xlarge ($0.239 per hour) and two m4.large ($0.12 per hour each).

Each node has a 100 GB EBS volume.
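The instances can of course be launched from the EC2 console; purely as an illustration (the AMI, key pair and security group IDs below are placeholders, not values from this post), the same three nodes could be provisioned with boto3:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # assumed region

    def launch(instance_type, count):
        # Each node gets a 100 GB gp2 EBS root volume, as described above.
        return ec2.run_instances(
            ImageId="ami-xxxxxxxx",            # placeholder AMI
            InstanceType=instance_type,
            MinCount=count, MaxCount=count,
            KeyName="spark-cluster-key",       # placeholder key pair
            SecurityGroupIds=["sg-xxxxxxxx"],  # placeholder security group
            BlockDeviceMappings=[{"DeviceName": "/dev/xvda",
                                  "Ebs": {"VolumeSize": 100, "VolumeType": "gp2"}}])

    launch("m4.xlarge", 1)  # one m4.xlarge
    launch("m4.large", 2)   # two m4.large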

Servers Info


Read More

Scrapy installation on CentOS and Windows

By Harikrishna Doredla

Scrapy is an application framework for crawling web sites and extracting structured data. I discuss here the steps to install Scrapy in both CentOS and Windows environments, including installation of its dependencies.

Scrapy installation on CentOS 6.5

Scrapy needs Python 2.7 or above, but CentOS 6.5 ships with Python 2.6, so we need to install Python 2.7+ to run Scrapy code. Here are the steps to install Python 2.7.11: first install the Scrapy dependencies, then proceed with the Python 2.7.11 installation. (more…)

Read More

Selenium integration with Jenkins in Ubuntu

By Harikrishna Doredla

Jenkins is an open-source continuous integration tool written in Java. It runs across platforms: Windows, Linux, Mac OS and Solaris. The advantage of Jenkins lies in its ability to check out test scripts from repositories such as SVN, GitHub and Bitbucket and to build and run tests from those scripts. Jenkins can also be configured to schedule test runs, much like cron jobs, as and when desired.

Prerequisites for Selenium integration:

  • Ubuntu server and Jenkins
  • Java, Maven, and browser packages for Chrome and Firefox

Here is how we go about creating the environment with the above prerequisites.
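Once the environment is ready, a sanity check can be as small as the following sketch (the target URL is arbitrary and chromedriver is assumed to be on the PATH); this is the kind of script Jenkins would check out and run:

    from selenium import webdriver

    # Launch a browser, load a page, make a trivial assertion, then clean up.
    driver = webdriver.Chrome()
    try:
        driver.get("https://www.google.com")
        assert "Google" in driver.title
    finally:
        driver.quit()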


Read More

Scrapy – Framework for Web Crawling and Scraping

By Bhanu Prathap Maruboina

What is Scrapy?

  • Scrapy is an application framework for crawling designated web sites and extracting structured data, which can then be used for data mining, data processing, or archiving of historical data.
  • It has built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods for extraction via regular expressions (a minimal spider using these selectors follows this list).
  • It offers an interactive shell console (IPython-aware) for trying out CSS and XPath expressions to scrape data, which is very useful when writing or debugging spiders.
  • It has built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple back ends (FTP, S3, the local file system).
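Here is the minimal spider sketch referred to above; the target is the public Scrapy demo site, and the field names are arbitrary choices for illustration:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["http://quotes.toscrape.com/"]

        def parse(self, response):
            # CSS selectors pick out each quote block on the page ...
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").extract_first(),
                    # ... and an XPath expression pulls out the author name.
                    "author": quote.xpath(".//small[@class='author']/text()").extract_first(),
                }

            # Follow the pagination link, if there is one.
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

Saved as quotes_spider.py, it can be run with: scrapy runspider quotes_spider.py -o quotes.json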

Scrapy installation (Linux/Windows)

Prerequisites for installation:

  • Python 2.7
  • pip and setuptools Python packages
  • lxml
  • OpenSSL


Read More

Drools – a business rules management system

By Anusha Jallipalli

Drools is a collection of tools that allow us to separate and reason over logic and data found within business processes. The two important keywords to notice are Logic and Data. Let us understand some terms used in the Drools rule engine before we run a Drools example:

The Drools rule engine uses a rule-based approach to implement an Expert System.

Expert Systems are knowledge-based systems that convert acquired knowledge into a knowledge base that can be used for reasoning.

Rules are pieces of knowledge often expressed as, “When some conditions occur, then do some tasks.”

KnowledgeBase is an interface that manages a collection of rules, processes, and internal types; its fully qualified name is org.drools.KnowledgeBase. In Drools, these resources are commonly referred to as knowledge definitions, or simply knowledge.

The KnowledgeBuilder interface is responsible for building a KnowledgePackage from knowledge definitions (rules, processes, types). It lives in the org.drools.builder package and reports errors through two methods: hasErrors and getErrors.

The Knowledge Session in Drools is the core component that fires the rules; it holds all the rules and other resources.

Facts are inserted into the session, and when a rule’s conditions are met, that rule is fired. A Knowledge Session is created from the KnowledgeBase.

Sessions are of two types:

Stateless Knowledge Session – can be called like a function: you pass in some data and then receive results back.

Stateful Knowledge Session – lives longer and allows iterative changes over time.

Here is an example of Drools:


Read More

Sync a Windows folder with an AWS S3 bucket

By Harikrishna Doredla

Content in local Windows folders can be synchronized with AWS S3 buckets. This is helpful for ensuring that the files on the local computer are identical to those in cloud storage (AWS S3). The synchronization steps are discussed here both as an introduction and as reference guidelines for configuring AWS S3.

Steps for synchronization

  1. AWS account administrator tasks:
     • Create an IAM user with S3 permissions only and generate its security credentials (access key and secret key).
     • Create a bucket in your nearest region and share the bucket and region information.
  2. Install the AWS CLI on the Windows server, following the guidelines under “To install the AWS CLI using the MSI installer”.
  3. Configure the AWS CLI tool.
  4. Once the CLI configuration is done, prepare a bat file with the AWS S3 sync command (a hedged Python equivalent of this step appears after this list).

     Sample bat file (test.bat):

  5. This IAM user has permission only to place files into the remote S3 folder (data, once synced, cannot be deleted).
  6. Schedule this bat file in Task Scheduler to run on the desired schedule.
  7. On payment of additional charges, the following facilities can be availed:
     • Cross-region replication for the bucket, with versioning
     • Event notifications for delete/create object operations on the bucket
  8. We can sync as many folders as we like by using a bat file for each folder on the local Windows system.
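The sample bat file above boils down to a single aws s3 sync invocation; here is a hedged Python equivalent of that step (the folder path and bucket name are placeholders, not values from the original post):

    import subprocess

    LOCAL_FOLDER = r"C:\data\reports"          # placeholder local folder
    S3_TARGET = "s3://my-sync-bucket/reports"  # placeholder bucket/prefix

    # Mirror the local folder to S3; only new or changed files are uploaded.
    subprocess.run(["aws", "s3", "sync", LOCAL_FOLDER, S3_TARGET], check=True)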

Read More