Now we scrape the data for every single book on every page reached through next_page_url.
To do this, do for each book link in relative_url what we did for next_page:
build the full URL, follow it, and pass the response to a callback called
parse_book_page (we write its body in the next lesson, 5-2).
import scrapy


class BookspiderSpider(scrapy.Spider):
    name = "bookspider"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com"]

    def parse(self, response):
        books = response.css('article.product_pod')
        for book in books:
            relative_url = book.css('h3 a::attr(href)').get()
            # Links on the home page include 'catalogue/', links on later pages do not
            if 'catalogue' in relative_url:
                book_url = 'https://books.toscrape.com/' + relative_url
            else:
                book_url = 'https://books.toscrape.com/catalogue/' + relative_url
            yield response.follow(book_url, callback=self.parse_book_page)

        # Follow the pagination link and parse the next listing page the same way
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            if 'catalogue/' in next_page:
                next_page_url = 'https://books.toscrape.com/' + next_page
            else:
                next_page_url = 'https://books.toscrape.com/catalogue/' + next_page
            yield response.follow(next_page_url, callback=self.parse)

    def parse_book_page(self, response):
        # Placeholder: the selectors found in the shell below go here
        pass
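The two if/else branches above rebuild absolute URLs by hand. As a side note, a relative href can also be resolved against the URL of the page it was found on, and response.follow accepts a relative href and does this itself. The sketch below uses only the standard library's urljoin to show what that resolution gives. The page-2.html and page-3.html paths are assumed examples of the site's pagination, not something taken from the spider above.

```python
from urllib.parse import urljoin

# Assumed example pages: the home page and a catalogue listing page.
home = "https://books.toscrape.com/"
listing = "https://books.toscrape.com/catalogue/page-2.html"

# On the home page the book href already contains 'catalogue/'.
from_home = urljoin(home, "catalogue/a-light-in-the-attic_1000/index.html")

# On a listing page the same book's href is relative to /catalogue/.
from_listing = urljoin(listing, "a-light-in-the-attic_1000/index.html")

# A pagination link found on a listing page resolves the same way.
next_page_url = urljoin(listing, "page-3.html")

print(from_home)
print(from_listing)
print(next_page_url)
```

Both book links resolve to the same absolute URL, which is what the 'catalogue' check in parse achieves manually.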
Open the Scrapy shell to find the HTML elements that parse_book_page should yield:
scrapy shell
Fetch the page of a book, then try selectors against it:
fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
response.css('.product_page h1').get()
XPath allows more complex selections than CSS selectors, such as moving to sibling elements.
For example, to get the Poetry category from ul.breadcrumb at the top of the page, we can use
this XPath:
response.xpath("//ul[@class='breadcrumb']/li[@class='active']/preceding-sibling::li[1]/a/text()").get()
This targets the active li (the last element in the breadcrumb) and then takes its nearest
preceding-sibling li, which is Poetry.
If we want something with no class or id, we can target an element next to it. For the
description, select the div with id product_description and get the text of the following-sibling p:
response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
When working with table rows, put all the rows in a variable and access each row
by index, e.g. table_rows[index]. Do not call .get() here, because it returns only
the first row as a string:
table_rows = response.css('table tr')
Then you can select whatever is inside each row by index.
All the selectors for the book page URL, run from the shell:
fetch('https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html')
response.css('.product_page h1::text').get()
response.xpath("//div[@id='product_description']/following-sibling::p/text()").get()
table_rows = response.css('table tr')
table_rows[1].css('td ::text').get()
table_rows[2].css('td ::text').get()
table_rows[3].css('td ::text').get()
Get the class attribute (the star rating is stored in the class name, e.g. 'star-rating Three')
response.css('p.star-rating').attrib['class']