Scrapy – Framework for Web Crawling and Scraping

By Bhanu Prathap Maruboina

What is Scrapy?

  • Scrapy is an application framework for crawling designated web sites and extracting data for uses such as data mining, data processing, or archiving historical data.
  • It has built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods for extracting with regular expressions.
  • It provides an interactive shell console (IPython aware) for trying out CSS and XPath expressions, which is very useful when writing or debugging your spiders.
  • It has built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple storage areas (FTP, S3, local file system).
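To illustrate the idea of selecting data with XPath expressions, here is a minimal sketch that does not use Scrapy itself. It relies on Python's standard-library xml.etree.ElementTree, which supports only a limited XPath subset and needs well-formed markup, as a stand-in for Scrapy's selectors. The sample page and the function name extract_post_links are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not taken from any real site).
SAMPLE_PAGE = """
<html>
  <body>
    <div class="post"><a href="/scrapy">Scrapy</a></div>
    <div class="post"><a href="/drools">Drools</a></div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>
"""


def extract_post_links(markup):
    """Return the title and URL of every link inside a div with class 'post'."""
    root = ET.fromstring(markup.strip())
    links = []
    # ElementTree's XPath subset supports attribute predicates like this one.
    for anchor in root.findall(".//div[@class='post']/a"):
        links.append({"title": anchor.text, "url": anchor.get("href")})
    return links


if __name__ == "__main__":
    for link in extract_post_links(SAMPLE_PAGE):
        print(link["title"], link["url"])
```

In a Scrapy spider the same selection would typically be written against the response object, for example response.xpath("//div[@class='post']/a/@href").getall(), and Scrapy's parser also copes with HTML that is not well formed.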

Scrapy Installation (Linux/Windows)

Prerequisites for installation:

  • Python 2.7
  • pip and setuptools Python packages
  • lxml
  • OpenSSL



Drools – a business rules management system

By Anusha Jallipalli

Drools is a collection of tools that allows us to separate and reason over the logic and data found within business processes. The two important keywords to notice are logic and data. Let us understand some terms used in the Drools rules engine before we run a Drools example:

Drools Rule Engine uses the rule-based approach to implement an Expert System.

Expert systems are knowledge-based systems: knowledge is first acquired and then converted into a knowledge base that can be used for reasoning.

Rules are pieces of knowledge often expressed as, “When some conditions occur, then do some tasks.”

KnowledgeBase is an interface that manages a collection of rules, processes, and internal types. Its fully qualified name is org.drools.KnowledgeBase, so it lives in the package org.drools. In Drools, these rules, processes, and types are commonly referred to as knowledge definitions or knowledge.

The KnowledgeBuilder interface is responsible for building a KnowledgePackage from knowledge definitions (rules, processes, types). It is contained inside the package org.drools.builder. It reports errors through two methods: hasErrors and getErrors.

The knowledge session is the core component in Drools that fires the rules. It is the knowledge session that holds all the rules and other resources.

Facts are inserted into the session, and when a specified condition is met, the corresponding rule gets fired. A knowledge session is created from the KnowledgeBase.

There are two types of session:

Stateless Knowledge Session – which can be called like a function, passing some data into it and then receiving some results back.

Stateful Knowledge Session – which lives longer and allows iterative changes over time.
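The difference between the two session types can be sketched with a toy rule engine in Python. This is not Drools code and does not use the Drools API; the names Rule, run_stateless, StatefulSession, and the discount rule are all hypothetical, chosen only to mirror the "when some conditions occur, then do some tasks" idea.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Rule:
    """A rule: when the condition holds for a fact, run the action on it."""
    name: str
    when: Callable[[dict], bool]
    then: Callable[[dict], None]


def run_stateless(rules, fact):
    """Stateless style: pass data in, apply the rules once, get the result back."""
    for rule in rules:
        if rule.when(fact):
            rule.then(fact)
    return fact


@dataclass
class StatefulSession:
    """Stateful style: facts stay in the session and can change between firings."""
    rules: List[Rule]
    facts: List[dict] = field(default_factory=list)

    def insert(self, fact):
        self.facts.append(fact)

    def fire_all_rules(self):
        """Apply every matching rule to every fact; return how many rules fired."""
        fired = 0
        for fact in self.facts:
            for rule in self.rules:
                if rule.when(fact):
                    rule.then(fact)
                    fired += 1
        return fired


# Hypothetical rule: orders above 100 get a discount of 10, applied only once.
DISCOUNT_RULE = Rule(
    name="large-order-discount",
    when=lambda order: order["total"] > 100 and "discount" not in order,
    then=lambda order: order.update(discount=10),
)

if __name__ == "__main__":
    print(run_stateless([DISCOUNT_RULE], {"total": 150}))

    session = StatefulSession(rules=[DISCOUNT_RULE])
    small_order = {"total": 50}
    session.insert(small_order)
    print(session.fire_all_rules())  # nothing matches yet
    small_order["total"] = 120       # the fact changes over time
    print(session.fire_all_rules())  # the rule now fires for the updated fact
```

The stateless call behaves like a function, while the stateful session keeps its facts, so a later change to a fact can trigger a rule that did not fire earlier. A real rules engine tracks matches far more efficiently than these nested loops.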

Here is an example of Drools:



Sync Windows folder with AWS S3 bucket

By Harikrishna Doredla

Content in local Windows folders can be synchronized with AWS S3 buckets. This feature helps ensure that the files on the local computer are identical to those in cloud storage (AWS S3). The synchronization steps are described here as a reference guide for configuring AWS S3.

Steps for synchronization

  1. AWS account administrator tasks
  • Create one IAM user with S3 permissions (only) and generate the security credentials (Access Key and Secret Key)
  • Create one bucket in your nearest region and share the bucket and region information
  2. Install the AWS CLI on the Windows server:
  3. Follow the guidelines given in “To install the AWS CLI using the MSI installer”
  4. Configure the AWS CLI tool:
  5. Once the CLI configuration is done, prepare a bat file with the AWS S3 sync command.

        Sample bat file (test.bat):

7. This user has permission to place files into the remote S3 folder (once synced, data cannot be deleted).
8. Schedule this bat file in Task Scheduler to run as per the schedule – please refer:

9. On payment of additional charges, the following facilities are available:
  • Cross-region replication for the bucket, with versioning enabled
  • Event notifications for delete/create object operations on the bucket
10. We can sync as many folders as we like by using one bat file for each folder on the local Windows system.
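As a rough illustration of the sync step, the sketch below builds the aws s3 sync command line in Python and can run it through the AWS CLI. It assumes the CLI is already installed and configured as in the steps above; the folder path, the bucket name example-bucket, and the function names are placeholders, not values from the original setup.

```python
import subprocess


def build_sync_command(local_folder, bucket, prefix=""):
    """Build the argument list for `aws s3 sync <local_folder> s3://<bucket>/<prefix>`."""
    destination = f"s3://{bucket}/{prefix}".rstrip("/")
    return ["aws", "s3", "sync", local_folder, destination]


def run_sync(local_folder, bucket, prefix=""):
    """Run the sync; raises CalledProcessError if the AWS CLI exits with an error."""
    command = build_sync_command(local_folder, bucket, prefix)
    subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder values; printing the command avoids touching any real bucket.
    print(" ".join(build_sync_command(r"C:\data", "example-bucket", "backups")))
```

Calling run_sync once per local folder mirrors the idea of keeping one bat file per folder, and a script like this can be scheduled in Task Scheduler in the same way as a bat file.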


My first encounter with AWS Device Farm

By Sandhya

Device Farm is an app testing service, one of the many services AWS offers. It facilitates testing an Android app on real, physical phones of many kinds.

My maiden attempt to test a native app using AWS Device Farm was interesting and successful. I tested the app using the Appium framework, which supports cross-platform testing. I will share my experience with the Appium framework in my next post.

Here is how AWS Device Farm is used to test an Android native app with an Appium automation test; my code is as below.
