1

Now we scrape the data for every single book, not just the listing pages.

To do this, repeat what we did for next_page for each book link in relative_url: build the full URL, follow it, and pass the response to a callback called parse_book_page (we write its body in the next lesson, 5-2).

import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            # Book links omit the 'catalogue/' prefix on some pages, so add it when missing.
            relative_url = book.css('h3 a::attr(href)').get()
            if 'catalogue/' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            # Request each book page and hand its response to parse_book_page.
            yield response.follow(book_url, callback=self.parse_book_page)

        # Follow the pagination link, using the same prefix check as above.
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        # Placeholder: the extraction logic is added later.
        pass
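
The prefix check used in parse can be tried on its own, outside Scrapy. The sketch below is only an illustration: build_book_url and BASE_URL are hypothetical names that are not part of Scrapy or of the spider above, and the function simply mirrors the if/else logic with plain string handling.

```python
BASE_URL = "https://books.toscrape.com/"


def build_book_url(relative_url: str) -> str:
    """Return an absolute URL, adding the 'catalogue/' prefix when the href lacks it."""
    if "catalogue/" in relative_url:
        return BASE_URL + relative_url
    return BASE_URL + "catalogue/" + relative_url


if __name__ == "__main__":
    # Both forms of the href should resolve to the same book page.
    print(build_book_url("catalogue/a-light-in-the-attic_1000/index.html"))
    print(build_book_url("a-light-in-the-attic_1000/index.html"))
```

The same check covers the pagination link, so a value such as page-2.html would gain the catalogue/ prefix as well.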

2

Open the Scrapy shell to find the HTML elements that parse_book_page will yield:


scrapy shell

Fetch the page of a book so its HTML elements can be inspected:


fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

Select the title. Without ::text this returns the whole <h1> element, tags included:


response.css('.product_page h1').get()

3

How to crawl pages
CSS selectors & XPaths to extract complicated data
Saving data to CSV and JSON formats

4 `xpath`

XPath allows more complex selections than CSS selectors, such as moving from one element to its siblings. For example, to get the Poetry category from ul.breadcrumb at the top of the page, we can use this XPath:


response.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()

This targets the last li element, which has the class active, then gets its nearest preceding-sibling li, which is Poetry.

If we want something with no class or id, we can target an element near it. For example, we target the div with id product_description and get the text of the following-sibling p:


response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()

5 working with table <tr>

When working with table rows, put all the rows in a variable and access each row as rows[index]. Do not call .get() here, because it returns a string instead of a list of selectors:


table_rows = response.css('table tr')

Then you can access whatever is inside each row by index:

table_rows[0].css('td::text').get()

02

Every selector for the book page. Open the shell and fetch the book's URL:


fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')

title


response.css('.product_page h1::text').get()

description


response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()

table rows


table_rows = response.css('table tr')

type


table_rows[1].css('td ::text').get()

price


table_rows[2].css('td ::text').get()

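stars

The rating is stored in the element's class attribute, so read the class instead of the text. This returns a string of the form 'star-rating Three':


response.css('p.star-rating').attrib['class']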
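
The class string still contains the star-rating prefix, so one possible follow-up is to turn the rating word into a number. This is a sketch under assumptions: parse_star_rating and RATING_WORDS are hypothetical names, and the words are assumed to run from One to Five.

```python
from typing import Optional

# Word-to-number lookup for the rating words (assumed to be One to Five).
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def parse_star_rating(class_value: str) -> Optional[int]:
    """Return the numeric rating found in a class string, or None if there is none."""
    for token in class_value.split():
        if token in RATING_WORDS:
            return RATING_WORDS[token]
    return None


if __name__ == "__main__":
    print(parse_star_rating("star-rating Three"))
```

Returning None for an unrecognised class leaves it to the caller to decide how to handle a book with no rating.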
