Get the next page URL:
response.css('li.next a::attr(href)').get()
Store the result in a variable, next_page.
On the last page there is no next link and get() returns None, so
check that next_page is not None before following it.
Then turn the relative next_page path into a full URL for follow().
import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href']
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            next_page_url = 'https://books.toscrape.com/' + next_page
            yield response.follow(next_page_url, callback=self.parse)
If we run
scrapy crawl bookspider
we can see item_scraped_count is only 40, meaning it only
crawled the first 2 pages (20 books per page).
Inspecting the href of li.next > a explains why:
sometimes it starts with catalogue/ and other times it does not,
so here the site differs from the tutorial.
The problem is not obvious from looking at the site HTML, but
in the Scrapy shell we can fetch 2 different pages and see that the link is different:
fetch('https://books.toscrape.com/')
fetch('https://books.toscrape.com/catalogue/page-5.html')
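To make the difference between the two links concrete, here is a small sketch using only the standard library. The two href values are assumptions about what li.next > a returns on each page (one with catalogue/, one without), and naive_join is a hypothetical helper that mirrors the string concatenation used in the spider above.

```python
from urllib.parse import urljoin

# Assumed href values from li.next > a on each fetched page.
home_url = "https://books.toscrape.com/"
home_href = "catalogue/page-2.html"   # already contains catalogue/

inner_url = "https://books.toscrape.com/catalogue/page-5.html"
inner_href = "page-6.html"            # catalogue/ is missing


def naive_join(href):
    """Hypothetical helper: same concatenation as the first spider."""
    return "https://books.toscrape.com/" + href


# The naive join works for the home page link...
print(naive_join(home_href))
# ...but drops catalogue/ for links found on inner pages.
print(naive_join(inner_href))
# Resolving the href against the page it came from keeps catalogue/.
print(urljoin(inner_url, inner_href))
```

The second URL does not exist on the site, which would explain why the crawl stops after 2 pages.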
Fix the URL passed to follow() with a condition on whether next_page
already contains catalogue/:
import scrapy

class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            yield {
                'name': book.css('h3 a::text').get(),
                'price': book.css('.product_price .price_color::text').get(),
                'url': book.css('h3 a').attrib['href']
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)
item_scraped_count now shows 1000.
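The catalogue/ condition can also be pulled out of the spider and checked on its own. This is only a sketch: build_next_page_url and BASE_URL are hypothetical names, not part of Scrapy or the tutorial, and the sample hrefs are assumed values.

```python
BASE_URL = "https://books.toscrape.com/"


def build_next_page_url(next_page):
    """Return the full URL for a next-page href, or None on the last page."""
    if next_page is None:
        # Last page: li.next is absent, so there is nothing to follow.
        return None
    if "catalogue/" in next_page:
        # The href already carries the catalogue/ prefix.
        return BASE_URL + next_page
    # The href is relative to catalogue/, so add the prefix back.
    return BASE_URL + "catalogue/" + next_page


print(build_next_page_url("catalogue/page-2.html"))
print(build_next_page_url("page-3.html"))
print(build_next_page_url(None))
```

Another option is to skip the string building: response.follow() accepts relative URLs and resolves them against the current page, so passing next_page directly to it should also work.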