[Hands-On] Scraping All Colleges and Majors for the Shanghai Self-Study Examination
Published: 2017-04-23
Disclaimer
This hands-on piece is a simple application of BeautifulSoup, split into two parts: the Python implementation and the CSS styling.
Because the information is scraped from another website, the URLs in the code have been redacted, to spare that site from unsolicited crawling that could hurt its performance or data.
This script is for learning purposes only. Please credit the source when reposting.
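In that same spirit, if you adapt the script to a live site it is worth throttling your requests. A minimal sketch of a polite fetch wrapper (the one-second delay, the timeout, and the polite_get name are my own illustrative choices, not part of the original script):

# coding: utf-8
import time
import requests

def polite_get(url, delay=1.0):
    # Pause before every request so the target site is never hammered
    time.sleep(delay)
    return requests.get(url, timeout=10)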
Python Implementation
The script is simple enough that it needs little commentary; if you want to experiment, feel free to run it against my own blog instead.
# coding: utf-8
from bs4 import BeautifulSoup
import requests, re, time, json
import sys
reload(sys)
sys.setdefaultencoding("utf-8")  # Python 2: make UTF-8 the default str/unicode encoding

base_url = u'http://www.xxxx.xxx'  # target site redacted, see the disclaimer above
college_special_course = {}
html_com = '<!doctype html><meta charset="UTF-8"/><link href="style.css" rel="stylesheet" type="text/css" />'
pubtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())

# Fetch every major offered by the given college
def get_special(name, url):
    print name
    college_special_course[name] = {}
    cont = requests.get(base_url + url).content
    soup = BeautifulSoup(cont, 'lxml')
    all_special = soup.find('ul', {'class': 'lastedlistbox'}).find_all('a')
    for special in all_special:
        special_name = re.sub(u' ', '', special.get_text())  # strip spaces from the link text
        special_url = special.get('href')
        college_special_course[name][special_name] = []
        get_course(name, special_name, special_url)

# Fetch the detailed course table of the given major
def get_course(college_name, name, url):
    print name
    # '本页写入时间' means "page generated at"
    html_header = html_com + '<h1>' + name + '</h1>' + u'<strong>本页写入时间:' + pubtime + '</strong>'
    cont = requests.get(base_url + url).content
    soup = BeautifulSoup(cont, 'lxml')
    special_course = soup.find('div', {'class': 'zhenwen'})
    del special_course['class']  # drop the site's class so our own style.css applies
    # The first row holds the column titles; zip every following row against them
    titles = [cell.string for cell in special_course.find('table').select('tr')[0].find_all(['th', 'td'])]
    for trs in special_course.find('table').select('tr')[1:-1]:
        info = {}
        values = [td.string for td in trs.select('td')]
        for i in range(len(titles)):
            try:
                info[titles[i]] = values[i]
            except IndexError:
                info[titles[i]] = ''  # pad short rows so every record has the same keys
        college_special_course[college_name][name].append(info)
    for a in special_course.select('a'):
        a.unwrap()  # keep the link text but drop the <a> tags
    # Write a standalone page for this major
    with open(name + '.html', 'a') as o:
        html = str(html_header) + str(special_course).replace('\n', '')
        o.write(html)

# Fetch all college names and walk into each one
def get_college():
    get_all_college_url = u'/xxxx/xxxxxx'  # listing page path, redacted
    cont = requests.get(base_url + get_all_college_url).content
    soup = BeautifulSoup(cont, 'lxml')
    all_college = soup.find('div', {'class': 'content_in2013'}).find_all('a')
    for college in all_college:
        college_name = re.sub(u' ', '', college.get_text())
        college_url = college.get('href')
        get_special(college_name, college_url)

get_college()
# Dump the collected data object to a JSON file ('学校专业课程信息' means "college-major-course info")
with open('学校专业课程信息.json', 'a') as o:
    o.write(json.dumps(college_special_course))
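The one mildly interesting step is how get_course turns an HTML table into a list of dicts: the first row supplies the keys and every later row the values, with short rows padded out. Here is a self-contained sketch of that same technique against a made-up table (every name and value below is invented for illustration, not taken from the real site):

# coding: utf-8
from bs4 import BeautifulSoup

# A made-up table in the same shape the script expects on the real site
html = u'''<table>
<tr><th>Code</th><th>Course</th><th>Credits</th></tr>
<tr><td>00001</td><td>Example Course A</td><td>4</td></tr>
<tr><td>00002</td><td>Example Course B</td></tr>
</table>'''

rows = BeautifulSoup(html, 'lxml').find('table').select('tr')
titles = [cell.get_text() for cell in rows[0].find_all(['th', 'td'])]
courses = []
for tr in rows[1:]:
    values = [td.get_text() for td in tr.select('td')]
    # Pad short rows with '' so every dict carries the same keys
    courses.append(dict((titles[i], values[i] if i < len(values) else '')
                        for i in range(len(titles))))
print(courses)
# two dicts, one per data row, keyed by the column titles;
# the second row gets Credits='' because its cell is missing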
CSS Styling
The CSS is self-explanatory, so I won't analyze it line by line.
body {
    width: 90%;
    margin: 0 auto;
}
table {
    /* 'cellpadding' is an HTML attribute, not a CSS property; cell padding comes from the td rule below */
}
table th {
    width: 25%;
    background: royalblue;
    color: yellow;
}
table td {
    padding: .3em;
}
table tr {
    width: 100%;
}
table:nth-of-type(1) th {
    width: 30%;
}
table:nth-of-type(1) th:nth-of-type(1), table:nth-of-type(1) th:nth-of-type(5) {
    width: 5%;
}
table:nth-of-type(1) th:nth-of-type(2) {
    width: 12%;
}
table:nth-of-type(1) th:nth-of-type(6) {
    width: 18%;
}
table:nth-of-type(2) td, table:nth-of-type(3) td {
    width: 12.5%;
}
table tr:nth-of-type(odd) {
    background: lightgray;
}
table tr:nth-of-type(even) {
    background: lightcyan;
}
h1 {
    text-align: center;
}
Summary
Python really is the easiest language for this kind of work, data collection above all: a few short blocks of code were enough for the whole crawling job. I'm looking forward to building more crawlers, so if you have other interesting ideas, get in touch and we'll have some fun together!