          Table of Contents:
            1. Disclaimer
            2. Python Implementation
            3. CSS Styling
            4. Summary

          [Hands-on] Scraping All Colleges and Majors for Shanghai Self-Study Exams

          Reading Time: The full text has 544 words, estimated reading time: 3 minutes
          Creation Date: 2017-04-23
          Previous Article: Scrapy Reading Notes
           

          Disclaimer

          This hands-on exercise is a simple application of BeautifulSoup, split into a Python implementation and CSS styling.

          Because the information is scraped from another site, the URLs in this article have been redacted to protect that site from the performance and data damage that unrestrained crawling can cause.

          This script is for learning purposes only; please credit the source when reposting.
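          Since the whole point of redacting the URLs is to spare the target site from aggressive crawling, it is also worth throttling the requests themselves. Below is a minimal sketch of one way to do that; the `polite_get` name, the one-second default, and the pluggable `fetch` callable are all illustrative and not part of the original script:

```python
import time

# Timestamp of the most recent request (monotonic clock, so wall-clock
# adjustments cannot break the spacing).
_last_call = [0.0]

def polite_get(fetch, url, min_interval=1.0):
    """Call fetch(url), guaranteeing at least min_interval seconds
    between successive requests."""
    wait = _last_call[0] + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last_call[0] = time.monotonic()
    return fetch(url)
```

          In the script below, each `requests.get(base_url + url)` call could then be wrapped as `polite_get(requests.get, base_url + url)`.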

          Python Implementation

          The script is simple enough to need little explanation; if you are curious, feel free to experiment against my blog site.

          # coding: utf-8
          from bs4 import BeautifulSoup
          import requests, re, time, json
          
          base_url = 'http://www.xxxx.xxx'
          college_special_course = {}
          html_com = '<!doctype html><meta charset="UTF-8"/><link href="style.css" rel="stylesheet" type="text/css" />'
          pubtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
          
          # Fetch every major offered by the given college
          def get_special(name, url):
              print(name)
              college_special_course[name] = {}
              cont = requests.get(base_url + url).content
              soup = BeautifulSoup(cont, 'lxml')
              all_special = soup.find('ul', {'class': 'lastedlistbox'}).find_all('a')
              for special in all_special:
                  special_name = re.sub(' ', '', special.get_text())
                  special_url = special.get('href')
                  college_special_course[name][special_name] = []
                  get_course(name, special_name, special_url)
          
          # Fetch the detailed course table for one major
          def get_course(college_name, name, url):
              print(name)
              html_header = html_com + '<h1>' + name + '</h1>' + '<strong>Page written at: ' + pubtime + '</strong>'
              cont = requests.get(base_url + url).content
              soup = BeautifulSoup(cont, 'lxml')
              special_course = soup.find('div', {'class': 'zhenwen'})
              del special_course['class']
              # Read the column headers from the first row; iterating its cells
              # explicitly avoids picking up stray whitespace nodes
              header_row = special_course.find('table').select('tr')[0]
              titles = [cell.get_text(strip=True) for cell in header_row.find_all(['th', 'td'])]
              for trs in special_course.find('table').select('tr')[1:-1]:
                  info = {}
                  values = [td.string for td in trs.select('td')]
                  for i in range(len(titles)):
                      try:
                          info[titles[i]] = values[i]
                      except IndexError:
                          info[titles[i]] = ''
                  college_special_course[college_name][name].append(info)
              # Drop the <a> wrappers but keep their text
              for a in special_course.select('a'):
                  a.unwrap()
              # Write a standalone page for this major's courses
              with open(name + '.html', 'a', encoding='utf-8') as o:
                  html = html_header + str(special_course).replace('\n', '')
                  o.write(html)
          
          # Fetch the college list and drive the whole crawl
          def get_college():
              get_all_college_url = '/xxxx/xxxxxx'
              cont = requests.get(base_url + get_all_college_url).content
              soup = BeautifulSoup(cont, 'lxml')
              all_college = soup.find('div', {'class': 'content_in2013'}).find_all('a')
              for college in all_college:
                  college_name = re.sub(' ', '', college.get_text())
                  college_url = college.get('href')
                  get_special(college_name, college_url)
              # Dump the collected data as JSON
              with open('college_major_courses.json', 'a', encoding='utf-8') as o:
                  o.write(json.dumps(college_special_course, ensure_ascii=False))
          
          if __name__ == '__main__':
              get_college()
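          The script accumulates everything into one nested dict of the shape `{college: {major: [row dicts keyed by the table headers]}}` before dumping it to JSON. Here is a small sketch of consuming that structure; the sample names and headers below are made up for illustration and do not come from the scraped site:

```python
import json

# Hypothetical sample mirroring the structure the crawler builds
sample = {
    "Example University": {
        "Computer Science": [
            {"Course Code": "001", "Course Name": "Advanced Mathematics", "Credits": "4"},
        ],
    },
}

# ensure_ascii=False keeps any non-ASCII names human-readable in the file
text = json.dumps(sample, ensure_ascii=False, indent=2)

# Round-trip the JSON and walk the hierarchy
data = json.loads(text)
for college, majors in data.items():
    for major, courses in majors.items():
        print(college, major, len(courses))  # prints: Example University Computer Science 1
```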

          CSS Styling

          The CSS is straightforward and needs no detailed analysis.

          body{
            width: 90%;
            margin: 0 auto;
          }
          table{
            border-collapse: collapse; /* `cellpadding` is an HTML attribute, not a CSS property; cell padding is set on td below */
          }
          table th{
            width: 25%;
            background: royalblue;
            color: yellow;
          }
          table td{ 
            padding: .3em;
          }
          table:nth-of-type(1) th{ 
            width: 30%;
          }
          table:nth-of-type(1) th:nth-of-type(1),table:nth-of-type(1) th:nth-of-type(5){
            width: 5%; 
          }
          table:nth-of-type(1) th:nth-of-type(2){
            width: 12%;
          }
          table:nth-of-type(1) th:nth-of-type(6){
            width: 18%;
          }
          table:nth-of-type(2) td,table:nth-of-type(3) td{ 
            width: 12.5%;
          }
          table tr:nth-of-type(odd){
            background: lightgray;
          }
          table tr:nth-of-type(even){
            background: lightcyan;
          }
          h1{
            text-align: center;
          }

          Summary

          Python really is the simplest language for this kind of job: for a data-collection need like this one, a few short lines of code complete the whole crawler. I look forward to building more crawlers; if you have other interesting ideas, get in touch and let's have some fun!
