          Table of Contents:
            1. Disclaimer
            2. Python Implementation
            3. CSS Styling
            4. Summary

          [Hands-on] Scraping All Colleges and Majors for Shanghai Self-Study Exams

          Reading Time: The full text has 544 words, estimated reading time: 3 minutes
          Creation Date: 2017-04-23
          Previous Article: Scrapy Reading Notes
           

          Disclaimer

          This hands-on exercise is a simple application of BeautifulSoup, split into a Python implementation and CSS styling.

          Because the information is scraped from another site, the URLs in this article have been redacted to protect that site from the performance and data damage that unrestrained crawling can cause.

          This script is for learning purposes only; please credit the source when reposting.
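          Since the whole point of redacting the URLs is to spare the target site from aggressive crawling, it is also worth throttling the requests themselves. Below is a minimal sketch of one way to do that; the `polite_get` name, the one-second default, and the pluggable `fetch` callable are all illustrative and not part of the original script:

```python
import time

# Timestamp of the most recent request (monotonic clock, so wall-clock
# adjustments cannot break the spacing).
_last_call = [0.0]

def polite_get(fetch, url, min_interval=1.0):
    """Call fetch(url), guaranteeing at least min_interval seconds
    between successive requests."""
    wait = _last_call[0] + min_interval - time.monotonic()
    if wait > 0:
        time.sleep(wait)
    _last_call[0] = time.monotonic()
    return fetch(url)
```

          In the script below, each `requests.get(base_url + url)` call could then be wrapped as `polite_get(requests.get, base_url + url)`.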

          Python Implementation

          The script is simple enough to need little explanation; if you are curious, feel free to experiment against my blog site.

          # coding: utf-8
          from bs4 import BeautifulSoup
          import requests, re, time, json
          
          base_url = 'http://www.xxxx.xxx'
          college_special_course = {}
          html_com = '<!doctype html><meta charset="UTF-8"/><link href="style.css" rel="stylesheet" type="text/css" />'
          pubtime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
          
          # Fetch every major offered by the given college
          def get_special(name, url):
              print(name)
              college_special_course[name] = {}
              cont = requests.get(base_url + url).content
              soup = BeautifulSoup(cont, 'lxml')
              all_special = soup.find('ul', {'class': 'lastedlistbox'}).find_all('a')
              for special in all_special:
                  special_name = re.sub(' ', '', special.get_text())
                  special_url = special.get('href')
                  college_special_course[name][special_name] = []
                  get_course(name, special_name, special_url)
          
          # Fetch the detailed course table for one major
          def get_course(college_name, name, url):
              print(name)
              html_header = html_com + '<h1>' + name + '</h1>' + '<strong>Page written at: ' + pubtime + '</strong>'
              cont = requests.get(base_url + url).content
              soup = BeautifulSoup(cont, 'lxml')
              special_course = soup.find('div', {'class': 'zhenwen'})
              del special_course['class']
              # Read the column headers from the first row; iterating its cells
              # explicitly avoids picking up stray whitespace nodes
              header_row = special_course.find('table').select('tr')[0]
              titles = [cell.get_text(strip=True) for cell in header_row.find_all(['th', 'td'])]
              for trs in special_course.find('table').select('tr')[1:-1]:
                  info = {}
                  values = [td.string for td in trs.select('td')]
                  for i in range(len(titles)):
                      try:
                          info[titles[i]] = values[i]
                      except IndexError:
                          info[titles[i]] = ''
                  college_special_course[college_name][name].append(info)
              # Drop the <a> wrappers but keep their text
              for a in special_course.select('a'):
                  a.unwrap()
              # Write a standalone page for this major's courses
              with open(name + '.html', 'a', encoding='utf-8') as o:
                  html = html_header + str(special_course).replace('\n', '')
                  o.write(html)
          
          # Fetch the college list and drive the whole crawl
          def get_college():
              get_all_college_url = '/xxxx/xxxxxx'
              cont = requests.get(base_url + get_all_college_url).content
              soup = BeautifulSoup(cont, 'lxml')
              all_college = soup.find('div', {'class': 'content_in2013'}).find_all('a')
              for college in all_college:
                  college_name = re.sub(' ', '', college.get_text())
                  college_url = college.get('href')
                  get_special(college_name, college_url)
              # Dump the collected data as JSON
              with open('college_major_courses.json', 'a', encoding='utf-8') as o:
                  o.write(json.dumps(college_special_course, ensure_ascii=False))
          
          if __name__ == '__main__':
              get_college()
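          The script accumulates everything into one nested dict of the shape `{college: {major: [row dicts keyed by the table headers]}}` before dumping it to JSON. Here is a small sketch of consuming that structure; the sample names and headers below are made up for illustration and do not come from the scraped site:

```python
import json

# Hypothetical sample mirroring the structure the crawler builds
sample = {
    "Example University": {
        "Computer Science": [
            {"Course Code": "001", "Course Name": "Advanced Mathematics", "Credits": "4"},
        ],
    },
}

# ensure_ascii=False keeps any non-ASCII names human-readable in the file
text = json.dumps(sample, ensure_ascii=False, indent=2)

# Round-trip the JSON and walk the hierarchy
data = json.loads(text)
for college, majors in data.items():
    for major, courses in majors.items():
        print(college, major, len(courses))  # prints: Example University Computer Science 1
```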

          CSS Styling

          The CSS is straightforward and needs no detailed analysis.

          body{
            width: 90%;
            margin: 0 auto;
          }
          table{
            border-collapse: collapse; /* `cellpadding` is an HTML attribute, not a CSS property; cell padding is set on td below */
          }
          table th{
            width: 25%;
            background: royalblue;
            color: yellow;
          }
          table td{ 
            padding: .3em;
          }
          table:nth-of-type(1) th{ 
            width: 30%;
          }
          table:nth-of-type(1) th:nth-of-type(1),table:nth-of-type(1) th:nth-of-type(5){
            width: 5%; 
          }
          table:nth-of-type(1) th:nth-of-type(2){
            width: 12%;
          }
          table:nth-of-type(1) th:nth-of-type(6){
            width: 18%;
          }
          table:nth-of-type(2) td,table:nth-of-type(3) td{ 
            width: 12.5%;
          }
          table tr:nth-of-type(odd){
            background: lightgray;
          }
          table tr:nth-of-type(even){
            background: lightcyan;
          }
          h1{
            text-align: center;
          }

          Summary

          Python really is the simplest language for this kind of job: for a data-collection need like this one, a few short lines of code complete the whole crawler. I look forward to building more crawlers; if you have other interesting ideas, get in touch and let's have some fun!
