Scrapy Reading Notes
Date: 2017-05-01
Overview
Notes from reading the official Scrapy documentation 🔗.
Hands-on practice is king.
Creating a project
scrapy startproject tutorial
├── scrapy.cfg          # deployment configuration file
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py    # pipeline file
    ├── settings.py     # project settings file
    └── spiders         # our spider files go here
        └── __init__.py
Creating a spider
In the tutorial/spiders directory, create a new spider file quotes_spider.py with the following content:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/'
    # ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Notes on the code above:
- The whole spider lives in a single class that inherits from scrapy.Spider, and only the scrapy module is imported.
- name: the spider's name, used for command-line interaction.
- start_requests: the entry point executed when the spider runs; the urls list holds all the pages to crawl, and each request is handed back to the Scrapy engine via yield.
- parse: the default callback that handles the returned responses.
- When the default parse callback is used, a start_urls list can be defined in place of the start_requests method, as the sketch below shows.
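As a quick illustration, here is the same spider rewritten with the start_urls shortcut (same behavior as the code above):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls defined, Scrapy builds the initial requests itself
    # and sends every response to the default parse() callback.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        with open('quotes-%s.html' % page, 'wb') as f:
            f.write(response.body)
        self.log('Saved file quotes-%s.html' % page)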
Running the spider
scrapy crawl quotes
Shell operations
Fetch a page, get the corresponding response, and drop into an interactive shell:
scrapy shell "http://quotes.toscrape.com/page/1/"
Print all of response's attribute keys: for i in vars(response): print(i)
-> status, _encoding, _cached_selector, _url, request, _body, _cached_ubody, headers, flags, _cached_benc
CSS selectors and XPath expressions
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').extract()
['Quotes to Scrape']
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
Extracting data
So far we have only saved whole pages to files; now let's try extracting data from them.
Update the earlier code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract()
            }
Saving the extracted data to a JSON file
Save the yielded items to a JSON file: scrapy crawl quotes -o quotes.json
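Note that running the same command twice appends to quotes.json and leaves it as invalid JSON; for repeated runs the JSON Lines format is more robust, since each item sits on its own line:
scrapy crawl quotes -o quotes.jl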
Following links to keep crawling
Define extra parse callbacks and extract further URLs so the crawl continues across pages.
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
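As a side note, if you are on Scrapy 1.4 or later, response.follow can replace the urljoin() + Request pair because it accepts relative URLs directly; a minimal sketch of the same parse method under that assumption:

    def parse(self, response):
        # response.follow resolves relative hrefs against the current page,
        # so the explicit response.urljoin() call is no longer needed.
        for href in response.css('.author + a::attr(href)').extract():
            yield response.follow(href, callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)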
Passing arguments to a spider
Run the spider with an argument: scrapy crawl quotes -a key=value
Read the argument inside start_requests: key = getattr(self, 'key', None)
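A minimal sketch of how this fits together, assuming a hypothetical tag argument passed as scrapy crawl quotes -a tag=humor:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # -a tag=humor shows up as an attribute on the spider instance;
        # getattr() supplies a default when the argument is omitted.
        tag = getattr(self, 'tag', None)
        url = 'http://quotes.toscrape.com/'
        if tag is not None:
            url = url + 'tag/' + tag + '/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}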
The Scrapy command line
Official documentation for all commands 🔗
- Help:
scrapy -h or scrapy <command> -h
- Global commands:
startproject, genspider, settings, runspider, shell, fetch, view, version
- Project-only commands:
crawl, check, list, edit, parse, deploy, bench
- Initialize a project:
scrapy startproject myproject
- Quickly generate a spider:
scrapy genspider [-t template] <name> <domain>
scrapy genspider -l lists the available templates: basic, crawl, csvfeed, xmlfeed
- Start a crawl:
scrapy crawl <name>
- List all spiders:
scrapy list
- Quick edit of a spider:
scrapy edit <name>
- Plain download of a page:
scrapy fetch <url>
--spider=spider_name: use the given spider when fetching the URL.
--headers: print the HTTP headers instead of the body.
--no-redirect: do not follow redirects.
- Open a page in the browser:
scrapy view <url>, or view(response) inside the shell
--spider=spider_name: use the given spider when fetching the URL.
--no-redirect: do not follow redirects.
- Start the shell:
scrapy shell <url>
- Run a spider file outside a project:
scrapy runspider myspider.py
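For example, a self-contained spider file (a hypothetical standalone_quotes.py modeled on the spiders above) can be run this way without creating a project:

# standalone_quotes.py -- run with: scrapy runspider standalone_quotes.py -o out.json
import scrapy


class StandaloneQuotesSpider(scrapy.Spider):
    name = 'standalone_quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yielded dicts are collected by the feed exporter selected with -o.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }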