1

Get next page url


response.css('li.next a::attr(href)').get()

Store the result in a variable named next_page.

On the last page there is no next link, so the selector returns None.

Guard with the condition next_page is not None, then turn the relative path in next_page into a full URL before passing it to follow().

import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # Each book on a listing page sits in an article.product_pod element
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }

        # Relative href of the "next" link; None on the last page
        next_page = response.css('li.next a::attr(href)').get()

        if next_page is not None:
            # First attempt: prefix the site root to the relative href
            next_page_url = 'https://books.toscrape.com/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

2

If we run


scrapy crawl bookspider

the log shows item_scraped_count is only 40. The site lists 20 books per page, so the spider stopped after the first 2 pages.

Inspecting the href in li.next > a on different pages shows why: sometimes it includes catalogue/ and sometimes it does not.

Here the site behaves differently from what the tutorial shows.

The URL problem is not obvious from looking at the site HTML in the browser, but in the Scrapy shell we can fetch 2 different pages, run the selector from step 1 on each, and see that the link differs:


fetch('https://books.toscrape.com/')

fetch('https://books.toscrape.com/catalogue/page-5.html')
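
To make the difference concrete, here is a small sketch in plain Python (no Scrapy needed). The two href values are assumed examples of what the selector might return on each fetched page, and naive_next_url is a hypothetical helper that mirrors the concatenation from step 1.

```python
SITE_ROOT = "https://books.toscrape.com/"

# Assumed sample hrefs, keyed by the page they were found on.
hrefs = {
    "https://books.toscrape.com/": "catalogue/page-2.html",
    "https://books.toscrape.com/catalogue/page-5.html": "page-6.html",
}


def naive_next_url(href):
    """Step 1 approach: prefix the site root to whatever href was found."""
    return SITE_ROOT + href


for page, href in hrefs.items():
    # The second result lacks catalogue/, so the request would miss the real page.
    print(page, "->", naive_next_url(href))
```

The first href already carries catalogue/, so the concatenation happens to work. The second does not, which would explain why the crawl stops once it leaves the home page.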

3

Fix the URL passed to follow() with a condition that handles hrefs with or without catalogue/

import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        # Each book on a listing page sits in an article.product_pod element
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href'],
            }

        # Relative href of the "next" link; None on the last page
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                # href already includes catalogue/ (as on the home page)
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                # href is relative to catalogue/ (as on later pages)
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

item_scraped_count now shows 1000, so the spider followed the next link through every page.
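
As a possible alternative to checking for catalogue/ by hand, the href can be resolved against the URL of the page it was found on. The sketch below uses urljoin from the standard library; resolve_next is a hypothetical helper and the hrefs are assumed sample values. As far as I know, response.follow() also accepts a relative URL and resolves it the same way, so passing next_page straight to it may be enough.

```python
from urllib.parse import urljoin


def resolve_next(page_url, href):
    """Resolve a relative next-page href against the page it came from."""
    if href is None:
        # Last page: no next link was found
        return None
    return urljoin(page_url, href)


# Assumed sample values for the home page and a later catalogue page
print(resolve_next("https://books.toscrape.com/", "catalogue/page-2.html"))
print(resolve_next("https://books.toscrape.com/catalogue/page-5.html", "page-6.html"))
print(resolve_next("https://books.toscrape.com/catalogue/page-50.html", None))
```

Both hrefs end up under catalogue/ without any string check, because the resolution depends on where the link was found and not on what the href looks like.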
