Jan 21
2010

Writing a spider in 10 mins using Scrapy

I came across Scrapy a few days back and have grown to really love it. This tutorial shows how to write a simple spider with Scrapy that scrapes product data from the Paul Smith website. All this in 10 minutes.

Let's begin

  1. Download and install scrapy and its dependencies.
  2. Once that is done, open a terminal and type python scrapy-ctl.py startproject paul_smith. This creates a Scrapy project.
  3. Navigate to ~/paul_smith/paul_smith/spiders and create the file paul_smith.py with the following contents:

    paul_smith.py
    
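    from scrapy.spider import BaseSpider

    class PaulSmithSpider(BaseSpider):
      # The crawl command in step 4 refers to the spider by this value.
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]

      def parse(self, response):
        # Save the raw page so its markup can be inspected in an editor.
        with open('paulsmith.html', 'wb') as f:
          f.write(response.body)

    SPIDER = PaulSmithSpider()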
  4. To run the spider, go to ~/paul_smith and type python scrapy-ctl.py crawl paulsmith.co.uk on the command line. This will fetch the page and save it to paulsmith.html.
  5. The next step is to parse the contents of the page. Open the page in your favourite editor and try to understand the pattern of the items we want to capture. You can see that <div class="yui-u"> contains the required information. We are going to modify our code like so:

    paul_smith.py
    
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector

    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]

      def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Select every <div class="yui-u"> anywhere in the page.
        sites = hxs.select('//div[@class="yui-u"]')
        for site in sites:
          # Print the markup of each matched <div>.
          print site.extract()

    SPIDER = PaulSmithSpider()

    You can read more on XPath Selectors here.

  6. Finally, looking at the HTML again, we can extract title, link, img-src & sale-price like so:

    paul_smith.py
    
    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import random

    class PaulSmithSpider(BaseSpider):
      domain_name = "paulsmith.co.uk"
      start_urls = ["http://www.paulsmith.co.uk/paul-smith-jeans-253/category.html"]

      def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//div[@class="yui-u"]')
        # Shuffle the matches so the items are printed in random order.
        random.shuffle(sites)
        for site in sites:
          # These paths are relative to the current <div>;
          # each extract() call returns a list of strings.
          title = site.select('a/strong[@class="thumbnail-text"]/text()').extract()
          hlink = site.select('a/@href').extract()
          price = site.select('a/strong[@class="sale"]/text()').extract()
          image = site.select('a/img/@src').extract()

          print title, hlink, image, price

    SPIDER = PaulSmithSpider()

    You can save this data to your datastore in whatever way you wish.

  7. The output of 3 random items scraped using the above code can be seen below.
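
Note that the code in step 6 shuffles every match and prints all of them. If only three random items are wanted, random.sample is one way to pick them. The snippet below is a minimal sketch of that idea in Python 3, not part of the spider: the items list and the pick_random helper are hypothetical, and the last two titles are placeholders.

```python
import random

# Hypothetical stand-ins for the titles the spider would collect.
# The last two entries are placeholders, not real products.
items = [
    "Shawl Collar Block Stripe Jumper",
    "Crew Neck Placement Stripe Jumper",
    "Tailored Fit, Organic Cotton Cravat Print Shirt",
    "Placeholder item A",
    "Placeholder item B",
]

def pick_random(seq, count=3):
    """Return up to `count` distinct elements of seq, in random order."""
    seq = list(seq)
    return random.sample(seq, min(count, len(seq)))

chosen = pick_random(items)
for title in chosen:
    print(title)
```

Inside the spider, the same effect could probably be had by slicing the shuffled list to its first three elements before the loop.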

Output

Shawl Collar Block Stripe Jumper
Sale: £ 74.00

Crew Neck Placement Stripe Jumper
Sale: £ 67.00

Tailored Fit, Organic Cotton Cravat Print Shirt
Sale: £ 74.00
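
Step 6 leaves the choice of datastore open. As one possible approach, here is a minimal sketch in Python 3 that stores records like the ones above in SQLite using the standard sqlite3 module. The table layout and the helpers first_or_none and save_items are my own assumptions. The titles and prices are taken from the output above; links and image URLs are left empty because they are not shown there.

```python
import sqlite3

def first_or_none(values):
    """extract() returns a list of matches; keep the first one, or None."""
    return values[0].strip() if values else None

def save_items(conn, rows):
    """Create the (hypothetical) items table if needed and insert rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(title TEXT, link TEXT, image TEXT, price TEXT)"
    )
    conn.executemany("INSERT INTO items VALUES (?, ?, ?, ?)", rows)
    conn.commit()

# Lists shaped like the extract() results in step 6:
# (title, hlink, image, price), each a list of strings.
scraped = [
    (["Shawl Collar Block Stripe Jumper"], [], [], ["Sale: £ 74.00"]),
    (["Crew Neck Placement Stripe Jumper"], [], [], ["Sale: £ 67.00"]),
    (["Tailored Fit, Organic Cotton Cravat Print Shirt"], [], [], ["Sale: £ 74.00"]),
]

# Flatten each list-valued field to a single value before storing it.
rows = [tuple(first_or_none(field) for field in record) for record in scraped]

conn = sqlite3.connect(":memory:")  # use a file path to keep the data
save_items(conn, rows)

stored = conn.execute("SELECT title, price FROM items ORDER BY rowid").fetchall()
print("stored %d rows" % len(stored))
```

An in-memory database keeps the example self-contained. Inside the spider, the print statement in parse would be replaced by building a row like the ones above.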

4 Responses

#1

Jens

March 3rd, 2010 at 10:00 pm

Hi
Trying out your tutorial but it seems like Paul Smith has made some changes to the site. I wonder if you can update your code above so it works. I’m just trying this out and a good example would be great! Keep up the good work. /Jens

#2

navid

April 15th, 2010 at 7:19 am

Can you place a note on the page that it’s out of synch with the actual website – hard to follow – perhaps you could redo it and snap some screen caps of the example you’re using so we get an idea of the type of clauses we’re looking for in the html.

#3

yk

June 5th, 2010 at 10:16 pm

@Jens, @navid

This links to the scrapy tutorial. It should help you out:
http://doc.scrapy.org/intro/tutorial.html#intro-tutorial

#4

Stephen Breen

June 21st, 2010 at 7:39 pm

Just for anyone that wants to try this example on the current site:

sites = hxs.select('//div[@class="yui-u"]')
becomes
sites = hxs.select('//div[@class="product-group-1"]')

title = site.select('a/strong[@class="thumbnail-text"]/text()').extract()
becomes
title = site.select('h3[@class="desc"]/text()').extract()

price = site.select('a/strong[@class="sale"]/text()').extract()
becomes
price = site.select('p[@class="price price-GBP"]/text()').extract()

If you want to get the euro price, use “price price-EUR” as the class id.
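
To see how the revised expressions in comment #4 behave without running a crawl, here is a minimal sketch in Python 3. It uses the limited XPath subset in the standard xml.etree.ElementTree module as a stand-in for Scrapy's selectors. The PAGE markup, the product names, the prices and the scrape helper are all hypothetical, shaped only to match the selectors above. Real pages are rarely well-formed XML, so treat this as an illustration of the matching logic only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup shaped to match the selectors in
# comment #4. It is not a copy of the real page.
PAGE = """
<html><body>
  <div class="product-group-1">
    <h3 class="desc">Example Jumper</h3>
    <p class="price price-GBP">10.00</p>
    <p class="price price-EUR">12.00</p>
  </div>
  <div class="product-group-1">
    <h3 class="desc">Example Shirt</h3>
    <p class="price price-GBP">20.00</p>
    <p class="price price-EUR">24.00</p>
  </div>
  <div class="sidebar"><h3 class="desc">Not a product</h3></div>
</body></html>
"""

def scrape(page, currency="GBP"):
    """Return (title, price) pairs for each product block in page."""
    root = ET.fromstring(page.strip())
    price_class = "price price-%s" % currency
    results = []
    # './/div[...]' plays the role of '//div[...]' in the spider.
    for site in root.findall(".//div[@class='product-group-1']"):
        # Paths are relative to the current <div>, as with site.select().
        title = site.findtext("h3[@class='desc']")
        price = site.findtext("p[@class='%s']" % price_class)
        results.append((title, price))
    return results

for title, price in scrape(PAGE):
    print(title, price)
```

The class test is an exact string comparison, so "price price-GBP" has to match the whole attribute value, including the space. Calling scrape(PAGE, "EUR") switches to the euro price in the same way the comment describes.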


This page and its contents are copyright © 2010, Pravin Paratey.