By Bhanu Prathap Maruboina
What is Scrapy?
- Scrapy is an application framework for crawling web sites and extracting structured data, which can then be used for data mining, information processing, or historical archival.
- It has built-in support for selecting and extracting data from HTML/XML sources using extended CSS selectors and XPath expressions, with helper methods to extract using regular expressions.
- It includes an interactive shell console (IPython aware) for trying out CSS and XPath expressions, which is very useful when writing or debugging your spiders.
- It has built-in support for generating feed exports in multiple formats (JSON, CSV, XML) and storing them in multiple backends (FTP, S3, local file system).
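- Prerequisites for installation:
- Python 2.7 (the version this walkthrough was written against; Scrapy 2.0 and later require Python 3)
- The pip and setuptools Python packages
- With those in place, install Scrapy from the command line:
pip install Scrapy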
- Scrapy's architecture is made up of the following components, each described briefly here.
- The Engine is responsible for controlling the data flow among all the components of the system, and triggers events when certain actions occur.
- The Scheduler receives requests from the engine and enqueues them, then feeds them back to the engine whenever the engine asks for them.
- The Downloader is responsible for fetching web pages and feeding them to the engine which, in turn, feeds them to the spiders.
- Spiders are custom classes written to parse responses and extract either items (a.k.a. scraped items) or additional URLs (requests) to follow. Each spider handles a specific domain (or group of domains).
- The Item Pipeline is responsible for processing the items once they are extracted (or scraped) by the spiders. Typical tasks include cleansing, validation and persistence (like storing the item in a database).
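The hand-offs between these components can be pictured with a toy model. The sketch below is plain Python and is not how Scrapy is implemented (the real engine is asynchronous); the function names, the FAKE_WEB dictionary, and the example.com URLs are all made up for illustration.

```python
from collections import deque

# Hypothetical stand-in for the web: URL -> (page title, outgoing links).
FAKE_WEB = {
    "http://example.com/": ("Home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("Page A", ["http://example.com/"]),
    "http://example.com/b": ("Page B", []),
}


def download(url):
    """Downloader: fetch a page and hand the response back to the engine."""
    title, links = FAKE_WEB[url]
    return {"url": url, "title": title, "links": links}


def parse(response):
    """Spider: yield one scraped item, then the follow-up requests."""
    yield {"item": {"url": response["url"], "title": response["title"]}}
    for link in response["links"]:
        yield {"request": link}


def pipeline(item):
    """Item pipeline: clean the item before it is stored."""
    item["title"] = item["title"].strip().upper()
    return item


def run_engine(start_urls):
    """Engine: move requests, responses, and items between the components."""
    scheduler = deque(start_urls)  # Scheduler: queue of pending requests
    seen = set(start_urls)         # never schedule the same URL twice
    stored = []
    while scheduler:
        response = download(scheduler.popleft())
        for result in parse(response):
            if "item" in result:
                stored.append(pipeline(result["item"]))
            elif result["request"] not in seen:
                seen.add(result["request"])
                scheduler.append(result["request"])
    return stored


items = run_engine(["http://example.com/"])
print([item["title"] for item in items])  # ['HOME', 'PAGE A', 'PAGE B']
```

The point of the model is the loop in run_engine: the spider never fetches pages itself, it only yields items and new requests, and the engine decides where each one goes next.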
Steps for scraping:
Creating a new Scrapy project
scrapy startproject tripadvisorcrawl
Defining the Items you will extract
import scrapy

class TripItem(scrapy.Item):  # illustrative class name, defined in items.py
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
Writing a spider to crawl a site and extract Items
- To create a Spider, you must subclass scrapy.Spider and define these attributes:
- name: identifies the Spider. It must be unique within the project, that is, you can’t give the same name to two Spiders.
- start_urls: a list of URLs where the Spider will begin to crawl. Subsequent URLs are generated successively from the data contained in the start URLs.
- parse(): a method of the spider, which is called with the downloaded Response object of each start URL. The response is passed to the method as its first and only argument.
Writing an Item Pipeline to store the extracted Items
- Register the pipeline class under ITEM_PIPELINES in the project’s settings file, and implement its process_item() method to push each scraped item into the database.
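A storage pipeline can be sketched with nothing but the standard library. The example below assumes SQLite as the database, a table called items, and a default file name of items.db; none of these come from the original project. The open_spider, process_item, and close_spider method names are the hooks Scrapy calls on a pipeline.

```python
import sqlite3


class SQLitePipeline:
    """Store each scraped item in a local SQLite table."""

    def __init__(self, db_path="items.db"):
        # db_path is an assumption; point it at your own database file.
        self.db_path = db_path

    def open_spider(self, spider):
        # Called when the spider starts: open the connection, create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items "
            "(title TEXT, link TEXT, description TEXT)"
        )

    def process_item(self, item, spider):
        # Called for every item; return it so later pipelines still receive it.
        self.conn.execute(
            "INSERT INTO items (title, link, description) VALUES (?, ?, ?)",
            (item.get("title"), item.get("link"), item.get("desc")),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called when the spider finishes.
        self.conn.close()


# In settings.py (the module path assumes the project created above):
# ITEM_PIPELINES = {"tripadvisorcrawl.pipelines.SQLitePipeline": 300}
```

The number in ITEM_PIPELINES sets the order in which pipelines run, with lower values running first.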
- Go to the project’s top level directory and run the spider:
scrapy crawl tripspider
- Spiders are classes which define how a certain site (or a group of sites) will be scraped, including how to perform the crawl (i.e. follow links) and how to extract structured data from their pages (i.e. scraping items).
- scrapy.Spider is the simplest spider, and the one from which every other spider must inherit.
- It doesn’t provide any special functionality. It just provides a default start_requests() implementation that sends requests for the URLs in the start_urls spider attribute and calls the spider’s parse method for each of the resulting responses.
- name: A string which defines the name for this spider. The spider name is how the spider is located (and instantiated) by Scrapy, so it must be unique.
- allowed_domains: An optional list of strings containing domains that this spider is allowed to crawl. Requests for URLs not belonging to the domain names specified in this list won’t be followed if OffsiteMiddleware is enabled.
- start_urls: A list of URLs where the spider will begin to crawl when no particular URLs are specified. Subsequent URLs are generated successively from the data contained in the start URLs.
- start_requests(): This method must return an iterable with the first Request(s) to crawl for this spider.
- Scrapy calls start_requests() when the spider is opened for scraping and no particular URLs are specified.
- CrawlSpider is the spider most commonly used for crawling regular websites, because it provides a mechanism for following links by defining a set of rules.
- rules: A list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. If multiple rules match the same link, the first one is used, according to the order in which they are defined in this attribute.
- Item objects are simple containers for collecting the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.
- Declaring Items
- Items are declared using a simple class definition and Field objects. Here is an example:
import scrapy

class Product(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()
    stock = scrapy.Field()
    last_updated = scrapy.Field(serializer=str)
- Each item pipeline component (sometimes referred to as just an “Item Pipeline”) is a Python class that implements a simple method. It receives an item, performs an action on it, and decides whether the item continues through the pipeline or is dropped and no longer processed.
- Typical uses of item pipelines are:
- cleansing HTML data
- validating scraped data (checking that the items contain certain fields)
- checking for duplicates (and dropping them)
- storing the scraped item in a database
- When you’re scraping web pages, the most common task you need to perform is to extract data from the HTML source. Scrapy does this with selectors, which pick out parts of the document using XPath or CSS expressions.
- XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
- To try out selectors, we will use the Scrapy shell, which provides interactive testing.
- First, let’s open the shell:
scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
- Read the HTML code of that page, then construct an XPath expression for selecting the text inside the title tag, such as //title/text().