Scrapy Reading Notes
Date: 2017-05-01
Overview
Notes from reading the official Scrapy documentation 🔗.
Hands-on practice is king.
Creating a project
scrapy startproject tutorial
├── scrapy.cfg          # deployment configuration file
└── tutorial
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py    # pipeline file
    ├── settings.py     # project settings file
    └── spiders         # our spider files go here
        └── __init__.py
Creating a spider
In the tutorial/spiders directory, create a new spider file quotes_spider.py with the following content:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    # start_urls = [
    #     'http://quotes.toscrape.com/page/1/',
    #     'http://quotes.toscrape.com/page/2/'
    # ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
Notes on the code above:
- The whole spider lives in a single class that inherits from scrapy.Spider, and only the scrapy module is imported.
- name: the spider's name, used for command-line interaction.
- start_requests: the entry point executed when the spider runs; the urls list holds all the pages to crawl, and each request is handed back to the Scrapy engine via yield.
- parse: the default callback that handles the returned responses.
- When the default parse callback is used, a start_urls list can be defined in place of the start_requests method, as the sketch below shows.
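As a quick illustration, here is the same spider rewritten with the start_urls shortcut (same behavior as the code above):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # With start_urls defined, Scrapy builds the initial requests itself
    # and sends every response to the default parse() callback.
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        with open('quotes-%s.html' % page, 'wb') as f:
            f.write(response.body)
        self.log('Saved file quotes-%s.html' % page)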
Running the spider
scrapy crawl quotes
Shell operations
Fetch a page, get the corresponding response, and drop into an interactive shell:
scrapy shell "http://quotes.toscrape.com/page/1/"
Print all of response's attribute keys: for i in vars(response): print(i)
-> status, _encoding, _cached_selector, _url, request, _body, _cached_ubody, headers, flags, _cached_benc
CSS selectors and XPath expressions
>>> response.css('title')
[<Selector xpath='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
>>> response.css('title').extract()
['<title>Quotes to Scrape</title>']
>>> response.css('title::text').extract()
['Quotes to Scrape']
>>> response.css('title::text').extract_first()
'Quotes to Scrape'
>>> response.css('title::text')[0].extract()
'Quotes to Scrape'
>>> response.xpath('//title')
[<Selector xpath='//title' data='<title>Quotes to Scrape</title>'>]
>>> response.xpath('//title/text()').extract_first()
'Quotes to Scrape'
>>> response.css('li.next a::attr(href)').extract_first()
'/page/2/'
Extracting data
So far we have only saved whole pages to files; now let's try extracting data from them.
Update the earlier code:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract()
            }
Saving the extracted data to a JSON file
Save the yielded items to a JSON file: scrapy crawl quotes -o quotes.json
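Note that running the same command twice appends to quotes.json and leaves it as invalid JSON; for repeated runs the JSON Lines format is more robust, since each item sits on its own line:
scrapy crawl quotes -o quotes.jl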
Following links to keep crawling
Define extra parse callbacks and extract further URLs so the crawl continues across pages.
import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)').extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).extract_first().strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
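As a side note, if you are on Scrapy 1.4 or later, response.follow can replace the urljoin() + Request pair because it accepts relative URLs directly; a minimal sketch of the same parse method under that assumption:

    def parse(self, response):
        # response.follow resolves relative hrefs against the current page,
        # so the explicit response.urljoin() call is no longer needed.
        for href in response.css('.author + a::attr(href)').extract():
            yield response.follow(href, callback=self.parse_author)

        # follow pagination links
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)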
Passing arguments to a spider
Run the spider with an argument: scrapy crawl quotes -a key=value
Read the argument inside start_requests: key = getattr(self, 'key', None)
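A minimal sketch of how this fits together, assuming a hypothetical tag argument passed as scrapy crawl quotes -a tag=humor:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        # -a tag=humor shows up as an attribute on the spider instance;
        # getattr() supplies a default when the argument is omitted.
        tag = getattr(self, 'tag', None)
        url = 'http://quotes.toscrape.com/'
        if tag is not None:
            url = url + 'tag/' + tag + '/'
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}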
The Scrapy command line
Official documentation for all commands 🔗
- Help:
scrapy -h or scrapy <command> -h
- Global commands:
startproject, genspider, settings, runspider, shell, fetch, view, version
- Project-only commands:
crawl, check, list, edit, parse, deploy, bench
- Initialize a project:
scrapy startproject myproject
- Quickly generate a spider:
scrapy genspider [-t template] <name> <domain>
scrapy genspider -l lists the available templates: basic, crawl, csvfeed, xmlfeed
- Start a crawl:
scrapy crawl <name>
- List all spiders:
scrapy list
- Quick edit of a spider:
scrapy edit <name>
- Plain download of a page:
scrapy fetch <url>
--spider=spider_name: use the given spider when fetching the URL.
--headers: print the HTTP headers instead of the body.
--no-redirect: do not follow redirects.
- Open a page in the browser:
scrapy view <url>, or view(response) inside the shell
--spider=spider_name: use the given spider when fetching the URL.
--no-redirect: do not follow redirects.
- Start the shell:
scrapy shell <url>
- Run a spider file outside a project:
scrapy runspider myspider.py
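For example, a self-contained spider file (a hypothetical standalone_quotes.py modeled on the spiders above) can be run this way without creating a project:

# standalone_quotes.py -- run with: scrapy runspider standalone_quotes.py -o out.json
import scrapy


class StandaloneQuotesSpider(scrapy.Spider):
    name = 'standalone_quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Yielded dicts are collected by the feed exporter selected with -o.
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
            }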