My first small Python program: parsing a web page

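I have wanted to learn Python for a long time but never found a suitable project to practice on. Python's syntax is interesting and concise to write, and today I had some free time, so I put this small project together while looking things up as I went. Since not many libraries support Python 3.x yet and most of the material I found targets Python 2.x, I used Python 2.7.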

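I had heard that network access in Python is easy, and this time I really experienced it. A few lines get the job done, unlike Java, where even the simplest request has to be wrapped in several layers before it can be used.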
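To give a feel for how short a page fetch is, here is a minimal sketch. It is written for Python 3, where `urllib.request` took over the role of the `urllib2` module used in this post. The helper name `fetch` and the fallback to UTF-8 are assumptions made for the illustration, not part of the original program.

```python
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the body of `url` decoded to text (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as response:
        # Use the charset from the Content-Type header, assuming UTF-8 if absent.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


# A data: URL keeps the example runnable without network access;
# an http:// URL is passed in exactly the same way.
print(fetch("data:text/plain,hello%20world"))
```

A real site may need extra request headers or error handling, which this sketch leaves out.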

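This time the target is the Qiushibaike (糗事百科) site: the program scrapes the latest posts from it and prints them in an organized form.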

Enough talk, here is the code:

import urllib2
import sgmllib

class Entry:
    # one scraped post; the class-level values act as defaults
    author = ''
    content = ''
    pic = ''
    up = 0
    down = 0
    tag = ''
    comment = 0

    def to_string(self):
        return '[Entry: author=%s content=%s pic=%s tag=%s up=%d down=%d comment=%d]' \
            % (self.author, self.content, self.pic, self.tag, self.up, self.down, self.comment)

class MyHTMLParser(sgmllib.SGMLParser):
    def reset(self):
        # SGMLParser.__init__ calls reset(), so per-instance state is set up here
        sgmllib.SGMLParser.reset(self)
        # all text chunks seen so far
        self.datas = []
        # all completed entries
        self.entries = []
        # the entry currently being filled in
        self.entry = Entry()
        # class of the most recent div that has not been handled yet
        self.div_tag_unclosed = ''

    def start_div(self, attrs):
        for name, value in attrs:
            if name != 'class':
                continue
            if value == 'author':
                # an author div marks the start of a new entry
                self.entry = Entry()
            if value in ('content', 'tags', 'up', 'down', 'comment', 'author', 'thumb'):
                self.div_tag_unclosed = value

    def end_div(self):
        if self.div_tag_unclosed == 'content':
            self.div_tag_unclosed = ''
            self.entry.content = self.datas.pop().strip()

    # a start_a handler must exist so that sgmllib tracks <a> and later calls end_a
    def start_a(self, attrs): pass

    def start_img(self, attrs):
        if self.div_tag_unclosed == 'thumb':
            for name, value in attrs:
                if name == 'src':
                    self.div_tag_unclosed = ''
                    # store into pic, the attribute that Entry defines and prints
                    self.entry.pic = value.strip()

    def end_img(self): pass

    def end_a(self):
        # the text of the link that just closed is the last chunk in datas
        if self.div_tag_unclosed == 'author':
            self.div_tag_unclosed = ''
            self.entry.author = self.datas.pop().strip()
        elif self.div_tag_unclosed == 'tags':
            self.div_tag_unclosed = ''
            self.entry.tag = self.datas.pop().strip()
        elif self.div_tag_unclosed == 'up':
            self.div_tag_unclosed = ''
            self.entry.up = int(self.datas.pop().strip())
        elif self.div_tag_unclosed == 'down':
            self.div_tag_unclosed = ''
            self.entry.down = int(self.datas.pop().strip())
        elif self.div_tag_unclosed == 'comment':
            # the comment count is the last field, so the entry is complete
            self.div_tag_unclosed = ''
            self.entry.comment = int(self.datas.pop().strip())
            self.entries.append(self.entry)

    def handle_data(self, data):
        # print 'data', data
        self.datas.append(data)

# request the URL
response = urllib2.urlopen('http://www.qiushibaike.com/8hr')
html = response.read()

# parse the HTML
parser = MyHTMLParser()
parser.feed(html)

# print all the entries
for entry in parser.entries:
    print entry.to_string()

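The whole program is simple: it uses urllib2 to make the network request and sgmllib to parse the HTML. Since this was my first Python program, I wrote it slowly, mostly because I kept wanting to put parentheses after if =-=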
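The `sgmllib` module was removed in Python 3. For readers on Python 3, the same idea (remember which `div` class is currently open, collect its text, store it when the `div` closes) can be sketched with the standard library's `html.parser`. This is an illustrative sketch under stated assumptions, not a port of the program above. The class name `EntryParser`, the dict-based entries, and the sample markup are all invented here, and the real site's HTML may look different. Nested `div` elements inside a field are not handled.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, loosely modeled on the class names the
# post's parser looks for; the real page structure is an assumption.
SAMPLE_HTML = """
<div class="author"><a href="/u/1">alice</a></div>
<div class="content">First post</div>
<div class="up"><a href="#">12</a></div>
<div class="down"><a href="#">3</a></div>
<div class="comment"><a href="#">5</a></div>
"""


class EntryParser(HTMLParser):
    TEXT_FIELDS = ("author", "content")
    COUNT_FIELDS = ("up", "down", "comment")

    def __init__(self):
        super().__init__()
        self.entries = []    # completed entries
        self.current = None  # entry being filled in
        self.field = ""      # class of the div we are inside, if tracked
        self.text = []       # text chunks collected for that div

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class", "")
        if css_class == "author":
            # An author div marks the start of a new entry.
            self.current = {"author": "", "content": "",
                            "up": 0, "down": 0, "comment": 0}
        tracked = self.TEXT_FIELDS + self.COUNT_FIELDS
        if css_class in tracked and self.current is not None:
            self.field = css_class
            self.text = []

    def handle_data(self, data):
        if self.field:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.field:
            return
        value = "".join(self.text).strip()
        if self.field in self.COUNT_FIELDS:
            value = int(value)
        self.current[self.field] = value
        if self.field == "comment":
            # The comment count is the last field of an entry.
            self.entries.append(self.current)
        self.field = ""


parser = EntryParser()
parser.feed(SAMPLE_HTML)
parser.close()
for entry in parser.entries:
    print(entry)
```

Collecting text per field, instead of popping the last chunk from one global list, means stray whitespace between tags cannot end up in a field.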

Article from sheling's blog on cnblogs (博客园): http://www.cnblogs.com/sheling

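Copyright of this article belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original article must be given in a prominent place on the page; otherwise the author reserves the right to pursue legal liability.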

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China Mainland license.

Author: Leo_wl

Source: http://www.cnblogs.com/Leo_wl/



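Disclaimer: this article comes from the internet and does not represent the position of 好得很程序员自学网. When reposting, please cite the source: http://www.haodehen.cn/did48217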

Updated: 2022-09-24