05-01 Get data from next_page url
.xpath
table <tr>

1

Now we scrape the data from each book's own page, for every single book on every page reached through next_page_url.

To do this, do for each book link in relative_url what we did for next_page: build the full URL, follow it, and pass the response into a callback called parse_book_page (which we write in the next lesson, 5-2).

import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a::attr(href)').get()
            if 'catalogue' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield response.follow(book_url, callback=self.parse_book_page)
            
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        # Placeholder callback; the book fields are extracted here in lesson 5-2
        pass
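
The if/else checks above build absolute URLs by hand. As a minimal sketch of the same resolution step, the standard library's urljoin resolves a relative href against the URL of the page it appeared on. The page URLs and the helper name absolute_url are assumptions for illustration, not part of the spider.

```python
from urllib.parse import urljoin


def absolute_url(page_url, href):
    """Resolve href relative to the page it was found on (hypothetical helper)."""
    return urljoin(page_url, href)


# Front page: the href already starts with 'catalogue/' (as the if/else implies)
front = absolute_url(
    "https://books.toscrape.com/index.html",
    "catalogue/a-light-in-the-attic_1000/index.html",
)

# Later listing page: the href is relative to /catalogue/
later = absolute_url(
    "https://books.toscrape.com/catalogue/page-2.html",
    "a-light-in-the-attic_1000/index.html",
)

# The next_page link resolves the same way
next_page = absolute_url(
    "https://books.toscrape.com/catalogue/page-2.html",
    "page-3.html",
)

print(front)
print(later)
print(next_page)
```

Both book hrefs resolve to the same URL, so no 'catalogue' check is needed. In Scrapy, response.urljoin(href) resolves against response.url in this way, and response.follow also accepts relative URLs.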

2

Open the Scrapy shell and find the HTML elements that parse_book_page should yield.

scrapy shell

Fetch the page of a book to find its HTML elements.

fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
response.css('.product_page h1').get()

3

  • How to crawl pages
  • CSS selectors & XPaths to extract complicated data
  • Saving data to CSV and JSON Format

4 xpath

XPath allows more complex queries than CSS selectors. For example, to get the Poetry category from ul.breadcrumb at the top of the page, we can use this XPath:

response.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()

This targets the active (last) li element, then gets its nearest preceding-sibling li, which is Poetry.

If we want something with no class or id, we can target an element near it. For the description, target the div with id product_description and get the text of the following-sibling p:

response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
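
To see what the two sibling axes do, here is a rough sketch that spells out the same steps by hand with the standard library's ElementTree, which does not support preceding-sibling or following-sibling itself. The markup is a simplified, well-formed stand-in for a book page, and the two helper functions are hypothetical names used only for this illustration.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a book page (assumed structure, not the real markup)
PAGE = """
<article class="product_page">
  <ul class="breadcrumb">
    <li><a href="/">Home</a></li>
    <li><a href="/books">Books</a></li>
    <li><a href="/poetry">Poetry</a></li>
    <li class="active">A Light in the Attic</li>
  </ul>
  <div id="product_description"><h2>Product Description</h2></div>
  <p>Sample description text.</p>
</article>
""".strip()


def preceding_sibling(parent, anchor, tag):
    """Nearest earlier sibling of anchor with this tag, like preceding-sibling::tag[1]."""
    siblings = list(parent)
    for el in reversed(siblings[:siblings.index(anchor)]):
        if el.tag == tag:
            return el
    return None


def following_sibling(parent, anchor, tag):
    """First later sibling of anchor with this tag, like following-sibling::tag."""
    siblings = list(parent)
    for el in siblings[siblings.index(anchor) + 1:]:
        if el.tag == tag:
            return el
    return None


root = ET.fromstring(PAGE)

# Category: start at the active li, step back one li, read its link text
crumbs = root.find("ul[@class='breadcrumb']")
active = crumbs.find("li[@class='active']")
category = preceding_sibling(crumbs, active, "li").find("a").text

# Description: start at the div with the id, step forward to the next p
desc_div = root.find("div[@id='product_description']")
description = following_sibling(root, desc_div, "p").text

print(category)
print(description)
```

In the Scrapy shell the XPath expressions above do all of this in one line; the helpers only make the "start at a known element, then step to a sibling" idea explicit.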

5 working with table <tr>

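When working with table rows, put all rows in a variable, then access each row by rows[index]. Do not call .get() on the row selection: it returns only the first match as a string, which cannot be indexed by row.

table_rows = response.css('table tr')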

Then you can access whatever is inside each row by index

table_rows[0].css('td::text').get()
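
The same "collect all rows, then index" pattern can be tried without Scrapy. This is a minimal sketch using the standard library's ElementTree on a small, made-up table; the header names follow the kind of product table discussed here, but the values are placeholders and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up product table with placeholder values
TABLE = """
<table>
  <tr><th>UPC</th><td>sample-upc</td></tr>
  <tr><th>Product Type</th><td>Books</td></tr>
  <tr><th>Price (excl. tax)</th><td>51.77</td></tr>
</table>
""".strip()

root = ET.fromstring(TABLE)

# Keep every row in a list, so each one can be reached by index
table_rows = root.findall("tr")

first_cell = table_rows[0].find("td").text
product_type = table_rows[1].find("td").text

# Alternative: key each value by its header, so the code does not depend on row order
by_header = {row.find("th").text: row.find("td").text for row in table_rows}

print(first_cell)
print(product_type)
print(by_header["Price (excl. tax)"])
```

Indexing by position is short but breaks if the row order changes; looking values up by header text is a little longer and more robust.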

02

Every selector needed for the book page. Open the shell and fetch the book page URL:

fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
response.css('.product_page h1::text').get()
response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
table_rows = response.css('table tr')
product type
table_rows[1].css('td ::text').get()
price (excl. tax)
table_rows[2].css('td ::text').get()
stars

The star rating is not in the text; it is part of the element's class attribute, so read the class:

response.css('p.star-rating').attrib['class']
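
The class attribute comes back as one string, so the rating word still has to be pulled out of it. A minimal sketch, assuming the string looks like 'star-rating Three' (the exact format is an assumption, and rating_from_class is a hypothetical helper):

```python
WORD_TO_NUMBER = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_class(class_attr):
    """Return the star rating as an int, or None if no rating word is present."""
    for token in class_attr.split():
        if token in WORD_TO_NUMBER:
            return WORD_TO_NUMBER[token]
    return None


print(rating_from_class("star-rating Three"))
print(rating_from_class("star-rating"))
```

Returning None instead of raising keeps the spider running when a page has no recognizable rating; whether that is the right choice depends on how the data is used later.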