Scrapy

Table of Contents

  1. Basic Crawling
  2. Mobile App
  3. Spider Recipes
  4. Scrapinghub
  5. Configuration & Management
  6. Programming Scrapy
  7. Pipeline Recipe
  8. Official Tutorial

Basic Crawling

$UR^2IM$

scrapy shell -s USER_AGENT="Mozilla/5.0" <URL>

response.body[:50]

Actual values are obtained via extract() or re().
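
For example, in the shell (a quick sketch; //h1 is just an example selector):

response.xpath('//h1/text()')                 # a SelectorList, not the values
response.xpath('//h1/text()').extract()       # list of matched text strings
response.xpath('//h1/text()').re('[.0-9]+')   # regex applied to the matched text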

Scrapy Project

The shell is merely a utility that aids testing; real code starts with a project.

scrapy startproject properties

This chapter focuses on items.py and spiders directory.

Defining ITEMS

from scrapy.item import Item, Field

class PropertiesItem(Item):
    # Primary fields
    title = Field()
    price = Field()
    description = Field()
    address = Field()
    image_urls = Field()
    
    # Calculated fields
    images = Field()
    location = Field()
    
    # Housekeeping fields
    url = Field()
    project = Field()
    spider = Field()
    server = Field()
    date = Field()

Writing Spiders

As a rule of thumb, write one spider per website, or per section of a website if it is large. A spider's code implements the $UR^2IM$ process. TIP: spider or project? A project groups items and spiders; it is designed for scraping the same item type across many sites, so the item above can be reused generally.

scrapy genspider basic web    # CrawlSpider template: scrapy genspider -t crawl <name> web

TIP: Scrapy creates many subdirectories, but all commands assume the project root directory, where the scrapy.cfg file lives. Whenever 'packages and modules' are mentioned, they map to the directory structure, e.g. properties.spiders.basic lives under the properties/spiders directory.

The self argument in parse() gives access to the spider's attributes and methods.

Start coding by using log() to output the info for each primary field:

def parse(self, response):
    self.log("title: %s" % response.xpath('//*[@itemprop="name"][1]/text()').extract())
    # similarly for others

scrapy crawl basic

self.log() output appears as DEBUG lines, handy for checking that the selectors are correct.

scrapy parse

scrapy parse uses the most suitable spider to parse any URL given as an argument, but it is best to specify the spider explicitly:

scrapy parse --spider=crawl <URL>

This outputs similar info to the above and is often used for debugging.

Populating ITEM

A slight modification that 'unlocks' tons of functionality: instantiate a PropertiesItem, populate it inside parse(), and return it.

from properties.items import PropertiesItem

# inside parse()
item = PropertiesItem()
item['title'] = response.xpath('//*[@itemprop="name"][1]/text()').extract()
# ... and the rest of the fields
return item

Now scrapy crawl basic returns not just log lines but a dict representation of the item. Scrapy is built around ITEMs, which PIPELINEs then consume for further functionality.

Saving to Files

scrapy crawl basic -o items.json   # also supports .jl, .csv, .xml

CSV and XML are popular for spreadsheet apps; JSON for its expressiveness and its ties to JavaScript. .jl files hold one JSON object per line, so they can be read more efficiently.
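
For example, a .jl feed can be consumed one line at a time without loading the whole file into memory (a minimal sketch; the items.jl filename just matches the command above):

import json

# Stream items from a JSON Lines feed: one JSON object per line,
# so there is no need to parse the whole file at once.
with open('items.jl') as f:
    for line in f:
        item = json.loads(line)
        print(item.get('title'))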

To save on cloud:

scrape crawl basic -o "ftp://user:pass@ftp.scrapybook.com/item-s.json"

scrapy crawl basic -o "s3://aws_key:aws_secret@scrapy-book/items.json"

scrapy parse also picks up the new item output; you will appreciate it even more while debugging URLs that give unexpected results.

Clean Up - ITEM LOADER and housekeeping fields

The ItemLoader class replaces all the messy-looking extract() and xpath() operations.

from scrapy.loader import ItemLoader

def parse(self, response):
    l = ItemLoader(item = PropertiesItem(), response = response)
    l.add_xpath('title', '//*[@itemprop="name"][1]/text()')
    l.add_xpath('price', './/*[@itemprop="price"][1]/text()', re='[,.0-9]+')
    return l.load_item()

More than just cleaner, it declares the intent of each action very clearly. ItemLoaders provide many handy ways of combining data, formatting, and cleaning up. Note they are actively developed, so keep abreast of the documentation.

Processors are fast, simple functions for post-processing the values extracted by selectors.

Processor Functionality
Join() Concatenates multiple results into one
MapCompose(unicode.strip) Chains Python functions, e.g. removes leading and trailing whitespace
MapCompose(unicode.title) Title-cases the results
MapCompose(float) Converts strings to floats
MapCompose(lambda i: i.replace(',', ''), float) Strips commas, then converts to float; lambdas give full flexibility
MapCompose(lambda i: urlparse.urljoin(response.url, i)) Converts relative URLs to absolute ones using response.url as base

Any Python callable can be used as a processor. These are simply functions shipped with Scrapy, so they can be tried in the shell:

# scrapy shell someweb
>>> from scrapy.loader.processors import MapCompose, Join
>>> Join()(['hi', 'John'])
u'hi John'

Let’s see how to add them inside parse()

# RECALL to import relevant modules
import datetime, socket, urlparse

# for processing items
l.add_xpath('title', 'XPATH', MapCompose(unicode.strip))  # processors come after the XPath

# for easy, housekeeping fields
l.add_value('url', response.url)
l.add_value('project', self.settings.get("BOT_NAME"))
l.add_value('spider', self.name)
l.add_value('server', socket.gethostname())
l.add_value('date', datetime.datetime.now())

The resulting items look perfect, and the code might at first glance seem complex. But it is worth it: similar power would require far more code in other languages, while here it is only about 25 lines.

Processors and ItemLoaders can feel like alien Python at first, but they are worth the effort for any serious web scraping journey.

Creating Contracts

Contracts are like unit tests for spiders: a quick way to check for broken code, for instance whether an old spider still works. Contracts are written in the comment just after a function's name (the docstring), on lines starting with @:

def parse(self, response):
    """ This function parses a property page.
    
    @url http://web:9312/properties/property_000000.html
    @returns items 1
    @scrapes title price description address image_urls
    @scrapes url project spider server date
    """

It means 'check this URL and you should find one item with values for the fields listed here'.

scrapy check verifies whether the contracts hold: scrapy check basic. In case of error:

FAIL: [basic] parse (@scrapes post-hook)
------------------------------------------------------------------------------------
ContractFail: 'url' field is missing

A failure means either the code or a selector is broken. Contracts make a good first line of checks.

RECAP CODE

import datetime
import urlparse
import socket
import scrapy

from scrapy.loader.processors import MapCompose, Join
from scrapy.loader import ItemLoader
from scrapy.http import Request

from properties.items import PropertiesItem


class BasicSpider(scrapy.Spider):
    name = "manual"
    allowed_domains = ["web"]

    # Start on the first index page
    start_urls = (
        'http://web:9312/properties/index_00000.html',
    )

    def parse(self, response):
        # Get the next index URLs and yield Requests
        next_selector = response.xpath('//*[contains(@class,"next")]//@href')
        for url in next_selector.extract():
            yield Request(urlparse.urljoin(response.url, url))

        # Get item URLs and yield Requests
        item_selector = response.xpath('//*[@itemprop="url"]/@href')
        for url in item_selector.extract():
            yield Request(urlparse.urljoin(response.url, url),
                          callback=self.parse_item)

    def parse_item(self, response):
        """ This function parses a property page.

        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """

        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"][1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                    MapCompose(lambda i: urlparse.urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()

With CrawlSpider

import datetime
import urlparse
import socket

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader

from properties.items import PropertiesItem


class EasySpider(CrawlSpider):
    name = 'crawl'
    allowed_domains = ["web"]

    # Start on the first index page
    start_urls = (
        'http://web:9312/properties/index_00000.html',
    )

    # Rules for horizontal and vertical crawling
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item')
    )

    def parse_item(self, response):
        """ This function parses a property page.

        @url http://web:9312/properties/property_000000.html
        @returns items 1
        @scrapes title price description address image_urls
        @scrapes url project spider server date
        """

        # Create the loader using the response
        l = ItemLoader(item=PropertiesItem(), response=response)

        # Load fields using XPath expressions
        l.add_xpath('title', '//*[@itemprop="name"][1]/text()',
                    MapCompose(unicode.strip, unicode.title))
        l.add_xpath('price', './/*[@itemprop="price"][1]/text()',
                    MapCompose(lambda i: i.replace(',', ''), float),
                    re='[,.0-9]+')
        l.add_xpath('description', '//*[@itemprop="description"][1]/text()',
                    MapCompose(unicode.strip), Join())
        l.add_xpath('address',
                    '//*[@itemtype="http://schema.org/Place"][1]/text()',
                    MapCompose(unicode.strip))
        l.add_xpath('image_urls', '//*[@itemprop="image"][1]/@src',
                    MapCompose(lambda i: urlparse.urljoin(response.url, i)))

        # Housekeeping fields
        l.add_value('url', response.url)
        l.add_value('project', self.settings.get('BOT_NAME'))
        l.add_value('spider', self.name)
        l.add_value('server', socket.gethostname())
        l.add_value('date', datetime.datetime.now())

        return l.load_item()

MORE URLS

The first option is to hardcode a list of URLs in start_urls = ().

Up a notch would be start_urls = [i.strip() for i in open('todo.urls.txt').readlines()]

Crawling Direction

Check via Shell first

Two-Direction Crawling

import urlparse

from scrapy.http import Request

def parse(self, response):
    # Get next index URLs and yield Requests (horizontal crawling)
    next_selector = response.xpath('//*[contains(@class,"next")]//@href')
    for url in next_selector.extract():
        yield Request(urlparse.urljoin(response.url, url))

    # Get item URLs and yield Requests (vertical crawling)
    item_selector = response.xpath('//*[@itemprop="url"]/@href')
    for url in item_selector.extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item)

yield DOES NOT exit the function; execution continues with the for loop. Python generator magic.

For testing purposes, stop after a certain number of items:

scrapy crawl manual -s CLOSESPIDER_ITEMCOUNT=90

It first reads the index page, then spawns many Requests, which get executed. Scrapy uses a LIFO strategy to process requests (depth-first crawl): the last request submitted is processed first. This is convenient for most cases, e.g. processing each listing page before moving to the next index page; otherwise we would fill a huge queue of pending listing pages.

This can be modified with the Request priority argument: greater than 0 means higher than default, less than 0 lower. In general, the scheduler executes higher-priority requests first, but don't spend much time thinking about which exact request should be executed first; most applications need no more than one or two priority levels. Note also that URLs are subject to duplicate filtering, which is usually what we want. To request the same URL more than once, pass dont_filter=True to Request().
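
A minimal sketch of both knobs inside a callback (the XPath and URLs are illustrative):

import urlparse

from scrapy.http import Request

def parse(self, response):
    # Boost listing-page requests so they are fetched before more index pages
    for url in response.xpath('//*[@itemprop="url"]/@href').extract():
        yield Request(urlparse.urljoin(response.url, url),
                      callback=self.parse_item, priority=1)

    # Re-request an already-seen URL, bypassing the duplicate filter
    yield Request(urlparse.urljoin(response.url, 'index_00000.html'),
                  callback=self.parse, dont_filter=True)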

Two-Direction Crawling with CrawlSpider

The seemingly tedious code of the basic spider can be simplified with the CrawlSpider class. After genspider with the crawl template, the extra code inherited on the surface is:

rules = (
    Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
)

def parse_item(self, response):
    pass

TIP: why learn the manual approach above? yield plus Request with a callback is such a useful and core technique, used repeatedly later, that it is worth knowing.

Now write two rules, one for horizontal and one for vertical crawling:

rules = (
    Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item')
)

What differs is that the a and href constraints are missing, because LinkExtractor looks for those two by default. Also note that callbacks are now strings, not method references as in Request(callback=self.parse_item). Unless a callback is set, a Rule follows the extracted URLs, meaning it scans target pages for extra links and follows them. If a callback is set, the Rule does NOT follow the links from target pages. If you need to follow links anyway, either return/yield them from the callback method or set follow=True in the Rule(). This is useful when listing pages contain both items and extra navigation links, as in the sketch below!!
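
A hedged sketch of such a rule, combining a callback with follow=True so link discovery continues on the scraped pages:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule

rules = (
    # Scrape each listing page AND keep following the links found on it;
    # without follow=True, a Rule with a callback stops following.
    Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
         callback='parse_item', follow=True),
)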

From Scrapy to a MOBILE APP

Choosing a Mobile App Framework

Feeding scraped data to an app is easy with the right tools. There are many frameworks, such as PhoneGap, Appcelerator, jQuery Mobile, and Sencha Touch.

This tutorial uses Appery.io for its iOS, Android, Windows Phone and HTML5 compatibility and its ease of use via PhoneGap and jQuery Mobile. Its paid service bundles both mobile and backend services, meaning no need to configure a database, write REST APIs, or use other languages to write them.

For details, see the Learning Scrapy source code on GitHub.

Quick Spider Recipes

Previous sections extracted info from pages and stored it as items; that covers 80% of use cases. This section covers more special usage, focusing on two important classes, Request and Response, the two Rs in the $UR^2IM$ process model.

A spider that logs in

When a website has a login mechanism, the two Rs are the key to extracting data; start by inspecting the network traffic in the browser's dev tools.

Once logged in correctly, a Request Method: POST appears in the network requests. Inspect the data, including the Form Data; the cookie storing the login details is set under the Request Headers. A single operation such as a login may therefore involve several server round-trips: the POST, HTTP 302 redirects, etc. Scrapy handles most of these operations automatically, with little code needed. Inherit from CrawlSpider and define a new spider: class LoginSpider(CrawlSpider): name = 'login'

The FormRequest class sends the initial request that logs in by performing a POST; it is similar to Request, with an extra formdata argument to pass the data (i.e. user and pass):

import datetime
import urlparse
import socket

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy.http import FormRequest

from properties.items import PropertiesItem


class LoginSpider(CrawlSpider):
    name = 'login'
    allowed_domains = ["web"]

    # Start with a login request
    def start_requests(self):
        return [
            FormRequest(
                "http://web:9312/dynamic/login",
                formdata={"user": "user", "pass": "pass"}
            )]

    # Rules for horizontal and vertical crawling
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class,"next")]')),
        Rule(LinkExtractor(restrict_xpaths='//*[@itemprop="url"]'),
             callback='parse_item')
    )

    def parse_item(self, response):
        # stay the same

That's it, really. The default parse() of CrawlSpider handles the Response and uses the Rules exactly as before. So little code is needed because Scrapy handles cookies transparently for us: as soon as we log in, it passes them on to subsequent requests exactly the way a browser does.

Naturally, some login mechanisms are more complex, e.g. involving a HIDDEN form value that must be POSTed along. This means two requests: visit the form page, then the login page, passing some data through. The new spider's start_requests() returns a simple Request to the form page and manually handles the response by setting its callback to our handler method, parse_welcome(), below. In it, the helper FormRequest.from_response() method creates a FormRequest pre-populated with all the fields and values from the original form. FormRequest.from_response() roughly emulates a submit click on the first form of the page, with all the fields left blank.

TIP: it is worth getting familiar with the documentation of from_response(); it has many features, like formname and formnumber, designed to help you select the desired form when several are present.

This effortless feature uses the formdata argument to fill in the user and pass fields and return the FormRequest:

import datetime
import urlparse
import socket

from scrapy.loader.processors import MapCompose, Join
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from scrapy.http import Request, FormRequest

from properties.items import PropertiesItem


class NonceLoginSpider(CrawlSpider):
    name = 'noncelogin'
    allowed_domains = ["web"]

    # Start on the welcome page
    def start_requests(self):
        return [
            Request(
                "http://web:9312/dynamic/nonce",
                callback=self.parse_welcome)
        ]

    # Post welcome page's first form with the given user/pass
    def parse_welcome(self, response):
        return FormRequest.from_response(
            response,
            formdata={"user": "user", "pass": "pass"}
        )

When run, we observe first a GET to the /hidden_login page and then the POST, followed by a redirection to the /hidden_login_success page, which leads to /gated as before.

A spider that uses JSON APIs and AJAX pages

Hidden elements are often managed dynamically by JSON objects. As before, inspect the network traffic, often in the form of an api.json request. More complex APIs may require login, POSTs, or return more interesting data structures. In any case, JSON is one of the easiest formats to parse, as no XPath is needed for extraction.

Python provides a great JSON parsing module: import json, then json.loads(response.body) parses the JSON and converts it to an equivalent object made of Python primitives, lists, and dicts.

Once the API is found, make a spider that works directly on it.

start_urls = ('http://someurl/api.json',)

More complex needs can be handled using the previous mechanisms. At this point, Scrapy opens this URL and calls parse() with the Response as argument.

import urlparse
import socket
import json

import scrapy
from scrapy.http import Request

class ApiSpider(scrapy.Spider):
    name = 'api'
    allowed_domains = ["web"]

    # Start on the first index page
    start_urls = (
        'http://web:9312/properties/api.json',
    )

    # Format the URLs based on the API call response
    def parse(self, response):
        base_url = "http://web:9312/properties/"
        js = json.loads(response.body)
        for item in js:
            id = item["id"]
            title = item["title"]
            url = base_url + "property_%06d.html" % id
            yield Request(url, meta={"title": title}, callback=self.parse_item)

%06d is a useful piece of Python string formatting: %d formats a number as digits and 06 pads it to 6 characters by prepending zeros if necessary. If id is 5, it becomes 000005; if 34322, it becomes 034322. We then yield a new Request for the correctly joined URL, with a callback.
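
For instance, in a Python shell:

>>> "property_%06d.html" % 5
'property_000005.html'
>>> "property_%06d.html" % 34322
'property_034322.html'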

Passing arguments between responses

If info from the JSON API needs to be stored on the item, how do we pass it from parse() to the parse_item() method?

The meta dict passed to Request() is used for this purpose, carrying the index-page info along. For example, let's set a title key on this dict to store the title from the JSON object: title = item["title"]

yield Request(url, meta = {"title": title}, callback = self.parse_item)

Inside parse_item(), we can use this value instead of the XPath expression used before:

l.add_value('title', response.meta['title'], MapCompose(unicode.strip, unicode.title))

Notice the switch from calling add_xpath() to add_value() when using meta

30x Faster Spider

Avoid scraping every single listing page if roughly the same info can be extracted from the index page!

TIP: if a website offers 10, 50, or 100 listings per index page via a URL parameter such as &show=50, set it to the maximum before horizontal crawling.

This is a real design decision, since most websites throttle requests.

Demo of such mechanism:

def parse(self, response):
    # Get next index URLs and yield Requests
    # same as before
    
    # Iterate through products and create PropertiesItems
    selectors = response.xpath(
        '//*[@itemtype="http://schema.org/Product"]')
    # The difference: iterate over the ~30 product selectors on the page
    # and yield the item built by parse_item() for each one
    for selector in selectors:
        yield self.parse_item(selector, response)

def parse_item(self, selector, response):
    # Create the loader using the selector (not the response)
    l = ItemLoader(item=PropertiesItem(), selector=selector)
    
    # Load fields using XPath
    # NOTE: each XPath expression is now RELATIVE to the current selector ('.')
    l.add_xpath('title', './/*[@itemprop="name"][1]/text()',
    	MapCompose(unicode.strip, unicode.title))
    # etc
    make_url = lambda i : urlparse.urljoin(response.url, i)
    l.add_xpath('image_urls', './/*[@itemprop="image"][1]/@src',
               MapCompose(make_url))
    # Housekeeping: mostly the same, but the url now comes from the selector
    l.add_xpath('url', './/*[@itemprop="url"][1]/@href',
                MapCompose(make_url))
    
    return l.load_item()

Slight changes made:

- parse() now iterates over the product selectors and yields the items returned by parse_item() directly, instead of yielding per-page Requests
- parse_item() receives a selector (plus the response) and builds the ItemLoader with selector= instead of response=
- all XPath expressions become relative ('.//...') to the current selector

Spider crawling based on Excel file

When scraping data from many sites where only the XPath expressions change, it is overkill to have one spider per site. Can a single spider be used?

Create a new project with a generic name and a spider called fromcsv. Create a CSV file whose fields contain the relevant URL and the XPath expressions for the items to extract, and save it in the project root directory.

Read the CSV into Dict:

import csv

with open("data.csv", "rU") as f:
    reader = csv.DictReader(f)
    for line in reader:
        print(line)

Modification to spider:

import csv

import scrapy
from scrapy.http import Request
from scrapy.loader import ItemLoader
from scrapy.item import Item, Field


class FromcsvSpider(scrapy.Spider):
    name = "fromcsv"

    def start_requests(self):
        with open(getattr(self, "file", "todo.csv"), "rU") as f:
            reader = csv.DictReader(f)
            for line in reader:
                request = Request(line.pop('url'))
                request.meta['fields'] = line
                yield request

    def parse(self, response):
        item = Item()
        l = ItemLoader(item=item, response=response)
        for name, xpath in response.meta['fields'].iteritems():
            if xpath:
                item.fields[name] = Field()
                l.add_xpath(name, xpath)

        return l.load_item()

Observations:

- there are no start_urls; start_requests() reads the CSV (the file name can be overridden with -a file=...) and yields one Request per row, stashing the remaining columns in request.meta['fields']
- the Item has no predefined fields; parse() creates Field() objects dynamically for every non-empty XPath column and loads them with add_xpath()

Deploying to Scrapinghub

  1. + Service
  2. Scrapy Cloud -> Project Naming -> Create
  3. Open Project -> menu on the left [JOBS, SPIDERS, COLLECTIONS, USAGE, REPORTS, ACTIVITY, PERIODIC JOBS, SETTINGS]
  4. Setting -> Scrapy Deploy -> COPY data into project’s scrapy.cfg
  5. pip install shub
  6. shub login (with API keys)
  7. shub deploy -> Run your spiders at https://dash.scrapinghub.com/p/28814/
  8. SPIDERS -> spiders uploaded
  9. Schedule -> view all info or Stop

Programmatic Access to Scrapinghub Jobs/Data

Inspecting URL of jobs and spiders to understand entry points.

curl -u <API>: https://storage.scrapinghub.com/items/<project id>/<spider id>/<job id>

Leave the password blank if prompted. This allows writing applications/services that use Scrapinghub as a data-storage backend. Be mindful of the storage time limit of your cloud plan.

Scheduling Recurring Crawls

  1. PERIODIC JOBS -> Add -> set-up

Configuration and Management

Settings

For information on default settings and their priority, see scrapy/settings/default_settings.py in the source code.

Project-level setting tuning is most practical.

Spider-level settings go in the custom_settings attribute of each spider definition, as sketched below.

Last-minute modifications can be passed on the command line with -s, e.g. -s CLOSESPIDER_PAGECOUNT=3.
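
A minimal sketch of spider-level custom_settings (the spider name and values are illustrative):

import scrapy

class FastSpider(scrapy.Spider):
    name = "fast"
    # These override the project-wide settings, for this spider only
    custom_settings = {
        'CONCURRENT_REQUESTS_PER_DOMAIN': 8,
        'DOWNLOAD_DELAY': 0.25,
    }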

TESTING

scrapy settings --get CONCURRENT_REQUESTS -s CONCURRENT_REQUESTS=19

scrapy shell -s CONCURRENT_REQUESTS=19

Essential Settings

ANALYSIS

CODE DETAIL
(Logging) LOG_LEVEL Log levels by severity: DEBUG -> INFO -> WARNING -> ERROR -> CRITICAL; this sets the threshold of what gets displayed. Often INFO, as DEBUG can be verbose.
(Logging) LOGSTATS_INTERVAL Prints the number of items and pages scraped per minute; sets the logging frequency, default 60 seconds. That may be too infrequent; 5 seconds is common for short runs.
(Logging) LOG_ENABLED
(Logging) LOG_FILE Where logs are written; unless set, they go to stderr, except if logging is disabled via LOG_ENABLED = False.
(Logging) LOG_STDOUT Set to True to record all stdout (e.g. "print" messages) in the log.
(Stats) STATS_DUMP Enabled by default; dumps the values from the Stats Collector to the log once the spider finishes.
(Stats) DOWNLOADER_STATS Controls whether stats are recorded for the downloader.
(Stats) DEPTH_STATS Controls whether stats are collected for site depth.
(Stats) DEPTH_STATS_VERBOSE Verbose version of the above.
(Stats) STATSMAILER_RCPTS A list of e-mails (e.g. ["my@gmail.com"]) to send the stats to when the crawl finishes.
(Telnet) TELNETCONSOLE_ENABLED A Python shell into the running process, enabled by default.
(Telnet) TELNETCONSOLE_PORT Determines the port used to connect to the console. Example: to look at Scrapy's internal state while it runs, a log line like DEBUG: Telnet console listening on 127.0.0.1:6023 means telnet is on and listening on port 6023. From another terminal, connect with telnet localhost 6023 to get a Python console inside the running Scrapy process, where components such as the engine can be inspected via the engine variable, or est() for a quick overview. Very useful on remote machines: engine.pause(), engine.unpause(), engine.stop()

PERFORMANCE

Code Detail
CONCURRENT_REQUESTS Maximum number of concurrent requests; mostly protects our own server's outbound capacity.
CONCURRENT_REQUESTS_PER_DOMAIN More restrictive; protects remote servers by limiting the number of concurrent requests per unique domain or IP.
CONCURRENT_REQUESTS_PER_IP If non-zero, the per-domain setting above is ignored. Note these limits are NOT per second: with 16 concurrent requests and an average request taking 0.25 s, the effective limit is 16/0.25 = 64 requests per second.
CONCURRENT_ITEMS Maximum number of items per response processed concurrently. With CONCURRENT_REQUESTS = 16 and this at 100, up to 1600 items could be writing to the database, etc., at the same time.
DOWNLOAD_TIMEOUT Time waited before cancelling a request; 180 seconds by default, which seems excessive, so reducing it to 10 seconds is advised.
DOWNLOAD_DELAY Defaults to 0; increase it for a more conservative download speed. Some sites measure request frequency to detect bots.
RANDOMIZE_DOWNLOAD_DELAY If a download delay is set, this (enabled by default) applies +/-50% jitter to it.
DNSCACHE_ENABLED For faster DNS lookups, an in-memory DNS cache is enabled by default.

CLOSING

Code Detail
CLOSESPIDER_ERRORCOUNT These CLOSESPIDER_* settings stop the spider automatically once the given condition is met; often set on the command line (-s) while testing.
CLOSESPIDER_ITEMCOUNT
CLOSESPIDER_PAGECOUNT
CLOSESPIDER_TIMEOUT In seconds

HTTP CACHE

Code Detail
HTTPCACHE_ENABLED The HttpCacheMiddleware (deactivated by default) provides a low-level cache for HTTP requests/responses; enable it as sketched below.
HTTPCACHE_DIR Relative path to the project root.
HTTPCACHE_POLICY Setting it to scrapy.contrib.httpcache.RFC2616Policy enables a much more sophisticated caching policy that respects the site's hints according to RFC 2616 (with the two settings above also enabled).
HTTPCACHE_STORAGE scrapy.contrib.httpcache.DbmCacheStorage
HTTPCACHE_DBM_MODULE Adjusts the DBM module used (defaults to anydbm).
HTTPCACHE_EXPIRATION_SECS
HTTPCACHE_IGNORE_HTTP_CODES
HTTPCACHE_IGNORE_MISSING
HTTPCACHE_IGNORE_SCHEMES
HTTPCACHE_GZIP
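
A minimal settings.py sketch enabling the cache (the values are illustrative):

# settings.py - local HTTP cache, handy while developing and re-running spiders
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'            # relative to the project root
HTTPCACHE_EXPIRATION_SECS = 0          # 0 = cached responses never expire
HTTPCACHE_IGNORE_HTTP_CODES = [500, 503]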

CRAWLING STYLE

Code Detail
DEPTH_LIMIT Maximum crawl depth; 0 means no limit.
DEPTH_PRIORITY Allows a BREADTH-FIRST crawl: setting it to a positive number switches scheduling from LIFO to FIFO, e.g. DEPTH_PRIORITY = 1. Useful for news sites where recent data matters most; combined with DEPTH_LIMIT = 3 it allows a quick scan of the latest items on a portal (see the sketch after this table).
SCHEDULER_DISK_QUEUE Following the above example: = scrapy.squeue.PickleFifoDiskQueue
SCHEDULER_MEMORY_QUEUE Following the above example: = scrapy.squeue.FifoMemoryQueue
ROBOTSTXT_OBEY
COOKIES_ENABLED The CookiesMiddleware takes care of all cookie operations, enabling logins, etc. For more 'stealthy' crawling, disable this.
REFERER_ENABLED Defaults to True, enabling population of the Referer header. Together with DEFAULT_REQUEST_HEADERS this is useful for odd sites that ban requests unless particular headers are present!!
USER_AGENT
DEFAULT_REQUEST_HEADERS Set along with Referer Headers above.
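
A hedged sketch of the breadth-first configuration from the DEPTH_PRIORITY row (assuming a recent Scrapy, where the queue classes live in scrapy.squeues):

# settings.py - switch the scheduler from LIFO to FIFO for a breadth-first crawl
DEPTH_LIMIT = 3
DEPTH_PRIORITY = 1
SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleFifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.FifoMemoryQueue'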

FEEDS

CODE DETAIL
FEED_URI scrapy crawl fast -o "%(name)s_%(time)s.jl" will auto-name the output file. Custom variables defined on the spider are also allowed, e.g. %(foo)s if foo is defined. Also works for S3 and FTP (e.g. =s3://mybucket/file.json along with the AWS settings below).
FEED_FORMAT Auto-assigned based on the URI extension, or set here.
FEED_STORE_EMPTY Bool controlling whether empty feeds are stored.
FEED_EXPORT_FIELDS Restricts/orders the exported fields, especially useful for CSV with fixed header columns.
FEED_URI_PARAMS Defines a function to post-process any params in the URI.

MEDIA DOWNLOAD

CODE DETAIL
IMAGES_STORE The directory where images are stored (relative to the project root). The URLs of the images for each ITEM should be in its image_urls field (the field name can be overridden via IMAGES_URLS_FIELD).
IMAGES_EXPIRES
IMAGES_THUMBS E.g. generate one icon-sized and one medium-sized thumbnail per image.
IMAGES_URLS_FIELD
IMAGES_RESULT_FIELD The item field where the resulting image info is stored (defaults to images).
IMAGES_MIN_HEIGHT
IMAGES_MIN_WIDTH
FILES_STORE Other media, same style as Image. Both can be set at once.
FILES_EXPIRES
FILES_URLS_FIELD
FILES_RESULT_FIELD

Example - downloading images

To use the image functions: pip install image. To enable the image pipeline, edit the project's settings.py and add the lines below.

ITEM_PIPELINES = {
    ...
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'
IMAGES_THUMBS = { 'small': (30, 30) }

The Item already has an image_urls field set, so just run:

scrapy crawl fast -s CLOSESPIDER_ITEMCOUNT=90

AWS

CODE DETAIL
AWS_ACCESS_KEY_ID Used when: download URLs start with s3:// instead of http:// etc., when s3:// paths are used to store files with the media pipelines, or when output Item feeds are stored in an s3:// directory.
AWS_SECRET_ACCESS_KEY

PROXY

CODE DETAIL
http_proxy HttpProxyMiddleware uses these settings in accordance with Unix’s convention, enabled as default.
https_proxy
no_proxy

Example - Using proxies and Crawlera’s clever proxy

DynDNS (or a similar service) provides a free online tool to check your current IP. Using the shell, make a request to checkip.dyndns.org to see it:

scrapy shell http://checkip.dyndns.org

Inside response.body see Current IP Address:

To start proxying requests, exit the shell and use the export command to set a new proxy. Test a free proxy by searching HMA's public proxy list (http://proxylist.hidemyass.com/), e.g. assuming a proxy at 10.10.1.1 port 80 from the list:

env | grep http_proxy

Should have nothing set, then

export http_proxy=http://10.10.1.1:80

Rerunning the shell will show the new IP. Crawlera is Scrapinghub's proxy service, augmented by smart configurations for scraping:

export http_proxy=myusername:password@proxy.crawlera.com:8010

Further settings

Programming Scrapy

Up to this point, the spiders we wrote had the main task of defining how to crawl and extract. Beyond spiders, Scrapy offers mechanisms to fine-tune most aspects of its functionality, for example when facing:

  1. Copy-pasting lots of code among spiders of the same project, where the repeated code is more related to the data (e.g. performing calculations on fields) than to the data sources
  2. Having to write scripts that postprocess items, to drop duplicate entries or calculate values
  3. Having repeated code across projects to deal with infrastructure, e.g. logging in and transferring files to proprietary repositories, adding items to databases, or triggering postprocessing operations when crawls complete

Scrapy's developers designed its architecture to allow this kind of customisation, starting with the engine that powers Scrapy: Twisted.

SCRAPY IS A TWISTED APPLICATION

The Twisted Python framework is unusual because it is event-driven and encourages writing asynchronous code.

DO NOT write BLOCKING code: anything that accesses files, databases, or the web, spawns new processes, or simply sleep()s.

Twisted provides methods to perform all of these operations and more without blocking execution.

Imagine a typical synchronous scraping application with 4 threads where, at any moment, 3 are blocked waiting for responses and 1 is blocked on a database write persisting an item. At any moment, it is quite unlikely to find a general-purpose thread of a scraping app doing anything but waiting for some blocking operation to complete. When it completes, some computation may run for a few microseconds, and then the thread blocks again on other operations, likely lasting a few milliseconds. Overall the server is not idle, as it runs tens of applications utilising thousands of threads, so after some careful tuning, CPUs remain reasonably utilised.

MULTI-THREADING (4 threads)

  • Thread 1: blocked on web request #330
  • Thread 2: blocked on database access #79
  • Thread 3: blocked on web request #330
  • Thread 4: blocked on web request #312

TWISTED (1 thread)

  • Thread 1: blocked waiting for any of the resources to free up
    • Hanging : R329, D79, R330, R312, F32, … 1000’s more…

The Twisted approach favours using as few threads as possible (ideally one), relying on modern OS I/O multiplexing functions (select(), poll(), epoll()) as the 'hanger'. A call returns immediately, BUT with a hook rather than the actual value, i.e. deferred = i_dont_block(); we hang whatever functionality we wish to run on it, to be executed whenever the value becomes available. A Twisted application is made of chains of such deferred operations. Since it is single-threaded, it does not suffer the cost of context switches and saves the resources (like memory) that extra threads require. In other words, with this nonblocking infrastructure we get performance similar to having thousands of threads.

OS developers have spent decades optimising thread operations, so the performance argument is not as strong as it once was; but writing correct thread-safe code for complex applications is certainly very hard. Once you get over the change of mindset toward deferreds/callbacks, Twisted code is significantly simpler than threaded code, and the inlineCallbacks generator utility makes it simpler still.
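
A hedged sketch of the inlineCallbacks style: deferreds are yielded instead of chained explicitly (the 3-second deferLater stands in for any nonblocking operation):

from twisted.internet import defer, reactor, task

@defer.inlineCallbacks
def install_wordpress(customer):
    print("Start installation for", customer)
    # yield a Deferred; execution resumes here when it fires, without blocking
    yield task.deferLater(reactor, 3, lambda: None)
    print("All done for", customer)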

NOTE: arguably, the most successful nonblocking I/O system until now is Node.js, mainly for its high performance/concurrency. Every Node.js app uses just nonblocking APIs.

DEFERREDS AND DEFERRED CHAINS

Deferreds are the most essential mechanism Twisted offers for writing asynchronous code. Twisted APIs use deferreds to let us define sequences of actions to be called when certain events occur.

from twisted.internet import defer

d = defer.Deferred()
d.called
# False

d.callback(3)
d.called
# True

d.result
# 3

A Deferred is, at its core, a thing representing a value that is not available yet. When we fire d by calling d.callback(3), its called state becomes True and its result attribute is set to the value passed to callback.

d = defer.Deferred()
def foo(v):
    print("foo called")
    return v+1

d.addCallback(foo)
d.called
# False
d.callback(3)
# foo called
d.called
# True
d.result
# 4

The most powerful feature of deferreds is that we can chain other operations to be called when the value is set; above we added foo() as a callback of d.
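
For instance, a short chain in the same interactive style (a sketch):

d = defer.Deferred()
d.addCallback(lambda v: v + 1)     # first link in the chain
d.addCallback(lambda v: v * 10)    # receives the previous callback's result
d.callback(3)
d.result
# 40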

Understanding Twisted and nonblocking I/O

# ~*~ Twisted - A Python tale ~*~

from time import sleep

# Hello, I'm a developer and I mainly setup Wordpress.
def install_wordpress(customer):
	# Our hosting company Threads Ltd. is bad. I start installation and ...
	print("Start installation for", customer)
	# ...then wait till the installation finishes successfully. It is 
	# boring and I'm spending most of my time waiting while consuming
	# resources (RAM and CPU cycles). It's because the process is BLOCKING
	sleep(3)
	print("All done for", customer)
	
# I do this all day for our customers
def developer_day(customers):
	for customer in customers:
		install_wordpress(customer)
		
developer_day( ["Bill", "Elon", "Steve", "Mark"])

# Let's run it
$ ./deferreds.py 1
...
* Elapsed time: 12.03 seconds

What we got is sequential execution: 4 customers at 3 seconds each means 12 seconds overall. That doesn't scale well.

from twisted.internet import reactor, defer, task

# Twisted has a slightly different approach
def schedule_install(customer):
    def schedule_install_wordpress():
        def on_done():
            print("Callback: Finished installation for", customer)
        print("Scheduling: Installation for", customer)
        return task.deferLater(reactor, 3, on_done)

    def all_done(_):
        print("All done for", customer)

    # For each customer, schedule these processes on the CRM; that is all we have to do
    d = schedule_install_wordpress()
    d.addCallback(all_done)
    return d

def twisted_developer_day(customers):
    work = [schedule_install(customer) for customer in customers]
    join = defer.DeferredList(work)
    join.addCallback(lambda _: reactor.stop())

twisted_developer_day(["Customer %d" % i for i in xrange(15)])

reactor.run()

This processes all 15 customers in parallel: 45 seconds' worth of work finishes in just three seconds! The trick is replacing every blocking call to sleep() with its Twisted counterpart, task.deferLater(), plus callbacks.

Guide to programming scrapy:

Problem Solution
Something specific to the website being crawled Modify the spider
Modifying or storing items - domain-specific, may be reused across projects Write an item pipeline
Modifying or dropping Requests/Responses - domain-specific, may be reused across projects Write a spider middleware
Executing Requests/Responses - generic, e.g. to support a custom login scheme or a special way of handling cookies Write a downloader middleware
All other problems Write an extension

Example 1 - a very simple pipeline

Problem: there are lots of spiders, and the database needs dates in string (ISO) format for indexing; changing every individual spider would be too much code.

Write a postprocess item pipeline:

from datetime import datetime

class TidyUp(object):
    def process_item(self, item, spider):
        item['date'] = map(datetime.isoformat, item['date'])
        return item

A simple class with a process_item() method. Add it in tidyup.py inside a pipelines directory.

NOTE: The placement of code is free, but a separate directory is a good idea.

Now edit settings.py and set

ITEM_PIPELINES = { 'properties.pipelines.tidyup.TidyUp': 100 }

The number 100 in the dict defines the order in which pipelines are connected: a pipeline with a smaller number processes items before this one.

The resulting date data will be ['2015-11-08T14:47:04.148232'] as ISO string.

Signals

Signals are a mechanism for adding callbacks to events that happen in the system, such as a spider opening or an item being scraped. Hook into them using crawler.signals.connect() (see the example below). There are just 11 of them, and maybe the easiest way to understand them is to see them in action. Below is a project with an extension hooked to all signals, plus an item pipeline, a downloader middleware, and a spider middleware, each logging every method invocation.

def parse(self, response):
    for i in range(2):
        item = HooksasyncItem()
        item['name'] = "Hello %d" % i
        yield item
    raise Exception("dead")

The item pipeline is configured to raise a DropItem exception for the second item.

This illustrates when certain signals get sent via logs:

$ scrapy crawl test
...many lines....

# First we get those two signals...
INFO: Extension, signals.spider_opened fired
INFO: Extension, signals.engine_started fired

# Then for each URL we get a request_scheduled signal
INFO: Extension, signals.request_scheduled fired

# when the download completes we get
INFO: Extension, signals.response_downloaded fired
INFO: DownloaderMiddleware process_response called for example.com

# Work in between
INFO: Extension, signals.response_received fired
INFO: SpiderMiddleware process_spider_input called for ...

# here our parse() gets called, then the SpiderMiddleware output hook
INFO: SpiderMiddleware process_spider_output called for url

# For every Item going through the pipelines successfully...
INFO: Extension, signals.item_scraped fired

# For every item that gets dropped via the DropItem exception
INFO: Extension, signals.item_dropped fired

# If your spider throws something else...
INFO: Extension, signals.spider_error fired

# ... the above process repeats for each URL
# ... till we run out of them. Then...
INFO: Extension, signals.spider_idle fired

# by hooking spider_idle you can schedule further Requests; if you don't, the spider ends
INFO: Closing spider (finished)
INFO: Extension, signals.spider_closed fired

# ...stats get printed and finally the engine stops
INFO: Extension, signals.engine_stopped fired

There are only 11 signals, but every default Scrapy middleware is implemented using just them, so they must be sufficient. Note that for every signal except spider_idle and the error/request-related ones, handlers can also return deferreds instead of actual values.

Example 2 - an extension measuring throughput and latencies

The built-in Log Stats extension (scrapy/extensions/logstats.py in the source code) is a good starting point. To measure latencies, hook the request_scheduled, response_received, and item_scraped signals.

from time import time

from twisted.internet import task

from scrapy.exceptions import NotConfigured
from scrapy import signals


class Latencies(object):
    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        self.crawler = crawler
        self.interval = crawler.settings.getfloat('LATENCIES_INTERVAL')
        if not self.interval:
            raise NotConfigured
        cs = crawler.signals
        cs.connect(self._spider_opened, signal=signals.spider_opened)
        cs.connect(self._spider_closed, signal=signals.spider_closed)
        cs.connect(self._request_scheduled, signal=signals.request_scheduled)
        cs.connect(self._response_received, signal=signals.response_received)
        cs.connect(self._item_scraped, signal=signals.item_scraped)
        self.latency, self.proc_latency, self.items = 0, 0, 0

    def _spider_opened(self, spider):
        self.task = task.LoopingCall(self._log, spider)
        self.task.start(self.interval)

    def _spider_closed(self, spider, reason):
        if self.task.running:
            self.task.stop()

    def _request_scheduled(self, request, spider):
        request.meta['schedule_time'] = time()

    def _response_received(self, response, request, spider):
        request.meta['received_time'] = time()

    def _item_scraped(self, item, response, spider):
        self.latency += time() - response.meta['schedule_time']
        self.proc_latency += time() - response.meta['received_time']
        self.items += 1

    def _log(self, spider):
        irate = float(self.items) / self.interval
        latency = self.latency / self.items if self.items else 0
        proc_latency = self.proc_latency / self.items if self.items else 0
        spider.logger.info(("Scraped %d items at %.1f items/s, avg latency: "
                            "%.2f s and avg time in pipelines: %.2f s") %
                           (self.items, irate, latency, proc_latency))
        self.latency, self.proc_latency, self.items = 0, 0, 0

The first two methods are key, since they are typical: they initialise the middleware using the Crawler object and appear in almost every nontrivial middleware. from_crawler(cls, crawler) is the way to grab the crawler object. Then notice that __init__() accesses crawler.settings and raises NotConfigured if the setting is not set. Many FooBar extensions check a corresponding FOOBAR_ENABLED setting and raise if it is not set or False. This is a very common pattern: it allows a middleware to be wired into the matching settings.py setting (ITEM_PIPELINES, for example) yet stay disabled by default, unless enabled by its flag setting. Many default middlewares (AutoThrottle or HttpCache, for example) use this pattern. In this case, the extension remains disabled unless LATENCIES_INTERVAL is set.

__init__() then registers callbacks for all the signals of interest using crawler.signals.connect() and initialises a few member variables; the rest of the class implements the signal handlers.

NOTE: compared with multithreaded code, notice the absence of mutexes; single-threaded code is easier to reason about and scales well to more complex scenarios.

Add this extension in a latencies.py module at the same level as settings.py, and enable it by adding to settings.py:

EXTENSIONS = { 'properties.latencies.Latencies' : 500, }

LATENCIES_INTERVAL = 5

Now the crawl log will print INFO lines as designed.

EXTENDING BEYOND MIDDLEWARES

Inspecting default_settings.py in the source code reveals quite a few class names among the settings. Scrapy extensively uses a dependency-injection-like mechanism that allows customisation and extension of its internal objects. For example, one may want to support more protocols for URLs, beyond the file, HTTP, HTTPS, S3, and FTP handlers defined in the DOWNLOAD_HANDLERS_BASE setting. THE MOST DIFFICULT PART IS DISCOVERING WHAT THE INTERFACE FOR YOUR CUSTOM CLASSES MUST BE (I.E. WHICH METHODS TO IMPLEMENT), AS MOST INTERFACES ARE NOT EXPLICIT. You have to read the source code and see how these classes get used. Your best bet is starting with an existing implementation and altering it to your needs. That said, these interfaces have become more and more stable with recent versions.

PIPELINE Recipes

The previous section was about middlewares; this one is about pipelines, showcasing consuming REST APIs, interfacing with databases, performing CPU-intensive tasks, and interfacing with legacy services.

Using REST APIs

REST is a set of technologies used to create modern web services; its main advantage is that it is simpler and more lightweight than SOAP and other alternatives. Software designers noticed a similarity between the CRUD operations (Create, Read, Update, Delete) that web services often provide and the basic HTTP methods (GET, POST, PUT, DELETE), and also that much of the information required for a typical web-service call can be compacted into a resource URL. For example, http://api.mysite.com/customer/john is a resource URL that identifies the target server (api.mysite.com), the fact that we are performing operations related to the customers table on that server, and, more specifically, something to do with the entry named john (the row's primary key). This, together with other web concepts such as secure authentication, statelessness, caching, and XML or JSON payloads, provides a powerful yet simple, familiar, and effortlessly cross-platform way to provide and consume web services. It is quite common for functionality we need in a Scrapy pipeline to be available in the form of a REST API.

USING treq

treq is a Python package that aims to be the equivalent of the Python requests package for Twisted-based applications. Prefer treq over Scrapy's Request/crawler.engine.download() API here: it is equally simple but has performance advantages.

PIPELINE WRITING TO ELASTICSEARCH

Start by writing items to an Elasticsearch server. Beginning with ES (even before MySQL) as a persistence mechanism may seem unusual, but it is actually the easiest thing one can do: ES can be schema-less, meaning we can use it without any configuration, and treq is sufficient for this case.

curl http://es:9200 returns JSON-like data. To delete the index: curl -XDELETE http://es:9200/properties

@defer.inlineCallbacks
def process_item(self, item, spider):
    data = json.dumps(dict(item), ensure_ascii=False).encode("utf8")
    yield treq.post(self.es_url, data)

The first two lines define a standard process_item() that is able to yield deferreds, as illustrated before.

The next line prepares the data for insertion by converting the item to a dict and encoding it as JSON. The last line uses treq's post() to perform a POST request that inserts the document into Elasticsearch.
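
A minimal sketch of the surrounding pipeline class, assuming the endpoint comes from the ES_PIPELINE_URL setting shown below (the class and setting names just mirror this section):

import json

import treq
from twisted.internet import defer
from scrapy.exceptions import NotConfigured


class EsWriter(object):
    @classmethod
    def from_crawler(cls, crawler):
        # Stay disabled unless ES_PIPELINE_URL is set in settings.py
        es_url = crawler.settings.get('ES_PIPELINE_URL')
        if not es_url:
            raise NotConfigured
        return cls(es_url)

    def __init__(self, es_url):
        self.es_url = es_url

    @defer.inlineCallbacks
    def process_item(self, item, spider):
        data = json.dumps(dict(item), ensure_ascii=False).encode("utf8")
        yield treq.post(self.es_url, data)
        defer.returnValue(item)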

To enable the pipeline, add it to the ITEM_PIPELINES setting inside settings.py and initialise the ES_PIPELINE_URL setting:

ITEM_PIPELINES = {
    'properties.pipelines.tidyup.TidyUp': 100,
    'properties.pipelines.es.EsWriter': 800,
}
ES_PIPELINE_URL = 'http://es:9200/properties/property'

NOTE: is it a good idea to use pipelines to insert items into a database? Generally no: databases often provide ways to bulk-insert entries that are orders of magnitude more efficient, and we should use them instead. That would mean buffering items and batch-inserting them, or performing the inserts as a post-processing step at the end of the crawl.

PIPELINE GEOCODES USING GOOGLE GEOCODING API

Say we have area names for our properties and we would like to geocode them, i.e. find their respective coordinates. The Google Geocoding API saves us the effort of maintaining a complex database and doing sophisticated text matching and spatial computations.

curl "https://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=london" will return JSON of info.

The Google API can be accessed with the same treq-based technique; save the following as geo.py inside the pipelines directory:

@defer.inlineCallbacks
def geocode(self, address):
    endpoint = 'http://web:9312/maps/api/geocode/json'

    params = [('address', address), ('sensor', 'false')]
    response = yield treq.get(endpoint, params=params)
    content = yield response.json()

    geo = content['results'][0]["geometry"]["location"]
    defer.returnValue({"lat": geo["lat"], "lon": geo["lng"]})

The endpoint above points to a fake local server for faster, less intrusive, offline, and more predictable execution. You can set endpoint to the real Google API URL to hit Google's servers, but keep in mind the strict limits on requests.

Now process_item() essentially becomes a single line: item['location'] = yield self.geocode(item["address"][0]), as in the sketch below.
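
A minimal sketch of that process_item(), assuming the geocode() helper above lives on the same pipeline class:

@defer.inlineCallbacks
def process_item(self, item, spider):
    # Geocode the first address string and store the coordinates on the item
    item["location"] = yield self.geocode(item["address"][0])
    defer.returnValue(item)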

Enable it with: ITEM_PIPELINES = { ..., 'properties.pipelines.geo.GeoPipeline': 400, ... }

Official Scrapy Tutorial

import scrapy

class QuotesSpider(scrapy.Spider):
	name = "quotes"
	allowed_domains = ["toscrape.com"]
	start_urls = ['http://quotes.toscrape.com']

	# Version1: Scraping by pages
	def parse(self, response):
		self.log(f'I just visited: {response.url}')
		for quote in response.css('div.quote'):
			item = {
				'author_name': quote.css('small.author::text').extract_first(),
				'text': quote.css('span.text::text').extract_first(),
				'tags': quote.css('a.tag::text').extract(),
			}
			yield item
		# follow pagination link
		next_page_url = response.css('li.next > a::attr(href)').extract_first()
		# stop at null page link
		if next_page_url:
			# response.urljoin('relative path') joins with response.url('abs path')
			next_page_url = response.urljoin(next_page_url)
			yield scrapy.Request(url=next_page_url, callback=self.parse)

	# Version2: Scraping through links
	def parse(self, response):
		# collecting all links wishing to click on a page
		# DEFAULT DUPEFILTER set to ignore duplicate pages
		urls = response.css('div.quote > span > a::attr(href)').extract()
		for url in urls:
			url = response.urljoin(url)
			yield scrapy.Request(url=url, callback=self.parse_details) # callback to be defined

		# follow pagination links
		# same as above

	def parse_details(self, response):
		yield {
			'name': response.css('h3.author-title::text').extract_first(),
			'birth': response.css('span.author-born-date::text').extract_first(),
		}

# Version3: Scraping Infinite Scrolling Pages; finding APIs powering AJAX-based inf-scroll
"""Concept
Use DevTools to inspect network traffic as scrolling happens; the AJAX calls mostly return JSON files;
explorable inside DevTool;
Preview by json library tools:
response.text 	:revealing JSON format
print(response.text) 	:readable format
import json
data = json.loads(response.text)
data.keys()		:prints keys
data['quotes'][0] 	:first element of quotes, a dict
data['quotes'][0]['author']['name'] 	:lowest-level data
"""
import json

class QuoteScrollSpider(scrapy.Spider):
	name = "quotes-scroll"
	api_url = 'http://quotes.toscrape.com/api/quotes?page={}'
	start_urls = [api_url.format(1)] # KEY step

	def parse(self, response):
		data = json.loads(response.text)
		for quote in data['quotes']:
			yield {
			'author': quote['author']['name'],
			'text': quote['text'],
			'tags': quote['tags']
			}
		if data['has_next']:
			next_page = data['page'] + 1
			yield scrapy.Request(url=self.api_url.format(next_page), callback=self.parse)

# Version4: Submitting Forms - POST requests such as logins
"""Concept
Inspecting network requests at login reveals a POST request with values such as username and password;
In the example case, there is also a hidden input 'type="hidden" name="csrf_token"', visible in the page source;
Its value is often a HASH of something, and in this case it is randomised on every page load;
Solution: submit the form with user/pass + the csrf_token scraped at page load
"""
class LoginSpider(scrapy.Spider):
	name = 'login-spider'
	login_url = 'http://quotes.toscrape.com/login'
	start_urls = [login_url]

	def parse(self, response):
		# extract the CSRF token (selector depending on context)
		token = response.css('input[name="csrf_token"]::attr(value)').extract_first()
		# create a python dict with form values
		data = {
			'csrf_token': token,
			'username': 'whatever',
			'password': 'whatever',
		}
		# submit a POST request to web (url may differ from login page)
		yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)


	def parse_quotes(self, response):
		"""Parse the main page after the spider logged in"""
		for q in response.css('div.quote'):
			yield {
				'author_name': q.css('small.author::text').extract_first(),
				'author_url': q.css('small.author ~ a[href*="goodreads.com"]::attr(href)').extract_first()
			}

# An even simpler way is to use FormRequest.from_response(), which parses the hidden fields directly!!
"""
This method reads the response object and creates a FormRequest that automatically includes all the pre-filled values from the form, along with the hidden ones. 
This is how our spider's parse_tags() method looks:
So, whenever you are dealing with forms containing some hidden fields and pre-filled values, use the from_response method because your code will look much cleaner.
"""
	def parse_tags(self, response):
	    for tag in response.css('select#tag > option ::attr(value)').extract():
	        yield scrapy.FormRequest.from_response(
	            response,
	            formdata={'tag': tag},
	            callback=self.parse_results,
	        )


# Version5: Scraping JS pages with Splash: scraping JS-based webs using Scrapy + Splash
"""Concept
JS-based pages return only static HTML when scraped by Scrapy, i.e. whatever is present when inspecting via PAGE SOURCE;
Live inspection (Elements panel) shows the JS-generated content, which Scrapy does not see;

Splash - JS Engine
	docker pull scrapinghub/splash
	docker run -p 8050:8050 scrapinghub/splash
		# now splash is LISTENING to the local 8050 port
		# so Spider can REQUEST to it, Splash fetches the page, execute JS code on it, then returning rendered pages to spider
	pip install scrapy-splash

Need to config SETTING:

	DOWNLOADER_MIDDLEWARES = {
	    'scrapy_splash.SplashCookiesMiddleware': 723,
	    'scrapy_splash.SplashMiddleware': 725,
	    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
	}

The middleware needs to take precedence over HttpProxyMiddleware,
which by default is at position 750, so we set the middleware positions to numbers below 750.

You then need to set the SPLASH_URL setting in your project's settings.py:
	SPLASH_URL = 'http://localhost:8050/'

Don’t forget, if you’re using a Docker Machine on OS X or Windows, you will need to set this to the IP address of Docker’s virtual machine, e.g.:
	SPLASH_URL = 'http://192.168.59.103:8050/'

Enable SplashDeduplicateArgsMiddleware to support cache_args feature: it allows to save disk space by not storing duplicate Splash arguments multiple times in a disk request queue. If Splash 2.1+ is used the middleware also allows to save network traffic by not sending these duplicate arguments to Splash server multiple times.
	SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
	}

Scrapy currently doesn’t provide a way to override request fingerprints calculation globally, so you will also have to set a custom DUPEFILTER_CLASS and a custom cache storage backend:
	DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
	HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

If you already use another cache storage backend, you will need to subclass it and replace all calls to scrapy.utils.request.request_fingerprint with scrapy_splash.splash_request_fingerprint.

Now that the Splash middleware is enabled, you can use SplashRequest in place of scrapy.Request to render pages with Splash.

For full list of arguments in HTTP API doc: http://splash.readthedocs.org/en/latest/api.html
By default the endpoint is set to 'render.json' but here overridden and set to 'render.html' for HTML response
"""

import scrapy
from scrapy_splash import SplashRequest

class QuotesJSSpider(scrapy.Spider):
	name = 'quotesjs'

	def start_requests(self):
		yield SplashRequest(
			url='http://quotes.toscrape.com/js',
			callback=self.parse,
			)
	def parse(self, response):
		for quote in response.css(...):
			yield ...

        

"""
**Running Custom JS**
Sometimes you need to press a button or close a modal to view the page properly.
Splash lets run custom JS code within the context of web page:

1) Using js_source Parameter
Code is run after page loaded but before page rendered, allowing use of JS code
to modify page being rendered:

EX: render page and mod its title dynamically
	yield SplashRequest(
	    'http://example.com',
	    endpoint='render.html',
	    args={'js_source': 'document.title="My Title";'},
	)

2) Splash Scripts
Splash supports LUA scripts via execute endpoint. Preferred way for preload libraries
choosing when to execute JS and retrieve output.

Sample script:
	function main(splash)
	    assert(splash:go(splash.args.url))
	    splash:wait(0.5)
	    local title = splash:evaljs("document.title")
	    return {title=title}
	end

Need to send script to execute endpoint, in lua_source arguments, returning a JSON object having title:
	{
	    "title": "Some title"
	}

Every script needs a main function as its entry point. It can return a Lua table, which is rendered as JSON, as here.
Using splash:go function to tell Splash to visit the URL, splash:evaljs function lets run JS within page context,
but if no need result then use splash:runjs instead

Test Splash scripts in browser at instance's index page set above, 
For mouse-click function: using splash:mouse_click

	function main(splash)
	    assert(splash:go(splash.args.url))
	    local get_dimensions = splash:jsfunc([[
	        function () {
	            var rect = document.getElementById('button').getClientRects()[0];
	            return {"x": rect.left, "y": rect.top}
	        }
	    ]])
	    splash:set_viewport_full()
	    splash:wait(0.1)
	    local dimensions = get_dimensions()
	    splash:mouse_click(dimensions.x, dimensions.y)
	    -- Wait split second to allow event to propagate.
	    splash:wait(0.1)
	    return splash:html()
	end

Here splash:jsfunc is defined to return the element's coordinates, made visible by splash:set_viewport_full; we then click the element and return the HTML
"""


"""
Run Spider on Cloud: deploy, run and manage crawlers in cloud
Above are single spider.py, now build a project
	scrapy startproject quotes_crawler
Example, move one of above spiders to project and run
	scrapy crawl spiderName

Scraping Hub as Cloud
	pip install shub

	shub login # then API key
	shub deploy # requires the Project ID, which is the digits in the URL of the project page
		# review under Code & Deploys
Run -> Spider -> Job Units etc
Inspect result under Job with downloadable formats
Schedule features
Also CLI
	shub schedule quotes # run the spider
"""

Udemy Video Summary

(1) Scrapy Architecture

Root/scrapy.cfg holds the [settings] path and [deploy] url, next to the project folder

Under the project folder: items.py, pipelines.py, settings.py, middlewares.py, and the spiders/ directory

(2) Avoiding Ban

a) DOWNLOAD_DELAY or via time.sleep(random.randrange(1,3)) at end of code (see the settings sketch after this list)
b) USER_AGENT 
c) Proxies scrapy-proxies package or use VPN
d) Professional work using ScrapingHub
e) Be mindful of regulation and rights
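
A hedged settings.py sketch for (a) and (b) (the values and user-agent string are illustrative):

# settings.py - polite-crawling knobs
DOWNLOAD_DELAY = 2                       # seconds between requests per domain
RANDOMIZE_DOWNLOAD_DELAY = True          # adds +/- 50% jitter to the delay
USER_AGENT = 'Mozilla/5.0 (compatible; mybot)'   # illustrative UA string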

(3) Runspider for standalone scripting

- Without use of ITEM/PIPELINE etc
- Print out or yield result

(4) scrapy.spiders.CrawlSpider has more functions such as RULE, which need importing itself

rules = (Rule(LinkExtractor(allow=('music'), deny_domains=('google.com')), callback='parse_be_defined', follow=False),) 

(5) scrapy.http.Request method used under ordinary parse(self,response) WITHOUT callback to loop through new Request(url) to parse() !!

(6) relative URL fixing, e.g. images

- Inspect HTML //img/@src to see relative path, e.g. ../../path/to/image.jpg
- Replace ../../ with actual image URL
    image_url = image_url.replace('../..', 'http://missingPath')

(7) Define functions to extract well-formated datapoints, e.g. tables of data

def product_info(response, value):
    return response.xpath('//th[text()="' + value + '"]/following-sibling::td/text()').extract_first()
then use it to extract and save into ITEM key-value pairs

EXAMPLE CODE

from scrapy import Spider
from scrapy.http import Request

def product_info(response, value):
    ...

class BooksSpider(Spider):
    name
    allowed_domains
    start_urls

	def parse(self, response):
		books = response.xpath('//h3/a/@href').extract()
		for book in books:
			abs_url = response.urljoin(book)
			yield Request(abs_url, callback=self.parse_book)
		# process next page
		next_url = response.xpath('//a[text()="next"]/@href').extract_first()
		abs_next_url = response.urljoin(next_url)
		yield Request(abs_next_url)
	
	def parse_book(self, response):
		title = response.css('h1::text').extract_first()
		price = response.xpath('//*[@class="price_color"]/text()').extract_first()
		# image_url: fix the relative URL as in (6) above
		rating = response.xpath('//*[contains(@class, "star-rating")]/@class').extract_first()
		rating = rating.replace('star-rating ', '')
	
		description = response.xpath('//*[@id="product_description"]/following-sibling::p/text()').extract_first()
	
		# product table as above
		upc = product_info(response, 'UPC')

(8) Arguments: e.g. isolating ‘book categories’

Update the above code: REMOVE start_urls and replace it with:

def __init__(self, category): # constructor!
    self.start_urls = [category]

THIS MAKES THE SPIDER ACCEPT AN ARGUMENT VIA __init__
used in Shell: 
scrapy crawl bookspider -a category="category_specific_URL"

(9) Scrapy Functions: executed at end of crawling

Anything needed to run, cleaning, sending, etc. Defined inside Spider.py

# EX: a post-processing function overriding the output filename.csv

import os
import glob

	def close(self, reason):
		csv_file = max(glob.iglob('*.csv'), key=os.path.getctime)
		os.rename(csv_file, 'foobar.csv')

>> ... -o item.csv

(10) Feeding

… -o items.csv/json/xml # items can be whatever

(11) Image Download via built-in ImagesPipeline

Best to first define an Item class in items.py with all the required datapoints + the image fields, then change settings.py:

ITEM_PIPELINES = {
	'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'local/folder'

Then add ItemLoader and items.definedClass in spider.py

