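My First Python Program: Parsing a Web Page
I have wanted to learn Python for a long time but never found a suitable project to practice on. Python's syntax is interesting and concise to write, and today I finally had time to put this small project together, looking things up as I went. Since there are not yet many libraries for Python 3.x and most of the material I found targets Python 2.x, I used Python 2.7.
I had heard that network access in Python is easy, and this time I really experienced it. A few lines are enough, unlike Java, where even the simplest request has to be wrapped in several layers before you can use it.
This time I take the 糗事百科 site (qiushibaike.com), scrape the latest posts from it, and print them in a tidy form.
Enough talk; here is the code: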
import urllib2
import sgmllib


class Entry:
    """One post scraped from the page."""
    author = ''
    content = ''
    pic = ''
    up = 0
    down = 0
    tag = ''
    comment = 0

    def to_string(self):
        return '[Entry: author=%s content=%s pic=%s tag=%s up=%d down=%d comment=%d]' \
            % (self.author, self.content, self.pic, self.tag, self.up, self.down, self.comment)


class MyHTMLParser(sgmllib.SGMLParser):
    # All state used by the parser is declared here.
    # These are class-level attributes, so the lists are shared between
    # instances; that is fine for the single parser created below.

    # every piece of text data seen so far
    datas = []
    # all the completed entries
    entries = []
    # the entry currently being filled in
    entry = Entry()
    # class of the last div whose content has not been consumed yet
    div_tag_unclosed = ''

    def start_div(self, attrs):
        for name, value in attrs:
            if name == 'class' and value == 'content':
                self.div_tag_unclosed = 'content'
            elif name == 'class' and value == 'tags':
                self.div_tag_unclosed = 'tags'
            elif name == 'class' and value == 'up':
                self.div_tag_unclosed = 'up'
            elif name == 'class' and value == 'down':
                self.div_tag_unclosed = 'down'
            elif name == 'class' and value == 'comment':
                self.div_tag_unclosed = 'comment'
            elif name == 'class' and value == 'author':
                # an author div marks the start of a new entry
                self.div_tag_unclosed = 'author'
                self.entry = Entry()
            elif name == 'class' and value == 'thumb':
                self.div_tag_unclosed = 'thumb'

    def end_div(self):
        if self.div_tag_unclosed == 'content':
            self.div_tag_unclosed = ''
            self.entry.content = self.datas.pop().strip()

    def start_a(self, attrs):
        pass

    def start_img(self, attrs):
        if self.div_tag_unclosed == 'thumb':
            for name, value in attrs:
                if name == 'src':
                    self.div_tag_unclosed = ''
                    # store in pic, the field that to_string() prints
                    self.entry.pic = value.strip()

    def end_img(self):
        pass

    def end_a(self):
        # the text of the link that just closed is the last item in datas
        if self.div_tag_unclosed == 'author':
            self.div_tag_unclosed = ''
            self.entry.author = self.datas.pop().strip()
        elif self.div_tag_unclosed == 'tags':
            self.div_tag_unclosed = ''
            self.entry.tag = self.datas.pop().strip()
        elif self.div_tag_unclosed == 'up':
            self.div_tag_unclosed = ''
            self.entry.up = int(self.datas.pop().strip())
        elif self.div_tag_unclosed == 'down':
            self.div_tag_unclosed = ''
            self.entry.down = int(self.datas.pop().strip())
        elif self.div_tag_unclosed == 'comment':
            # the comment count is the last field, so the entry is complete
            self.div_tag_unclosed = ''
            self.entry.comment = int(self.datas.pop().strip())
            self.entries.append(self.entry)

    def handle_data(self, data):
        # print 'data', data
        self.datas.append(data)


# request the url
response = urllib2.urlopen('http://www.qiushibaike.com/8hr')
html = response.read()

# parse HTML
parser = MyHTMLParser()
parser.feed(html)

# print all the entries
for entry in parser.entries:
    print entry.to_string()
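The whole program is simple: it uses urllib2 to make the network request and sgmllib to parse the HTML. Since this was my first Python program, I wrote it slowly, mostly because I kept wanting to put parentheses after every if =-=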
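For readers on Python 3, where sgmllib was removed and urllib2 was split into urllib.request and urllib.error, a minimal sketch of the same idea might look like the following. It is not the program above: it uses html.parser from the standard library, it runs on a made-up HTML fragment (SAMPLE_HTML) instead of the live site, and the names EntryParser and parse_entries are hypothetical. Only three of the fields are handled, and it assumes the divs are not nested.

```python
from html.parser import HTMLParser

# Hypothetical markup, loosely modeled on the div classes the parser above
# looks for. The real page structure is an assumption and may differ.
SAMPLE_HTML = """
<div class="author"><a href="#">alice</a></div>
<div class="content">first post</div>
<div class="up"><a href="#">12</a></div>
<div class="author"><a href="#">bob</a></div>
<div class="content">second post</div>
<div class="up"><a href="#">3</a></div>
"""


class EntryParser(HTMLParser):
    """Remembers which div is open and routes text into the current entry."""

    def __init__(self):
        super().__init__()
        self.entries = []    # per-instance list, not shared between parsers
        self.current = None  # the entry currently being filled in
        self.field = ''      # class of the div we are inside, if it is tracked

    def handle_starttag(self, tag, attrs):
        if tag != 'div':
            return
        css_class = dict(attrs).get('class')
        if css_class == 'author':
            # as in the original, an author div starts a new entry
            self.current = {'author': '', 'content': '', 'up': 0}
            self.entries.append(self.current)
        if css_class in ('author', 'content', 'up'):
            self.field = css_class

    def handle_endtag(self, tag):
        if tag == 'div':
            self.field = ''

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.field or self.current is None:
            return
        if self.field == 'up':
            self.current['up'] = int(text)
        else:
            self.current[self.field] += text


def parse_entries(html):
    parser = EntryParser()
    parser.feed(html)
    parser.close()
    return parser.entries


entries = parse_entries(SAMPLE_HTML)
for e in entries:
    print(e)
```

To run this against a real page, the string passed to parse_entries would come from urllib.request.urlopen(url).read().decode(...), with the encoding and the div classes checked against the actual markup first.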
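This article is from sheling's blog on cnblogs (博客园): http://www.cnblogs.com/sheling
Copyright belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be retained and a clearly visible link to the original must be given on the article page; otherwise the author reserves the right to pursue legal liability.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China Mainland License.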
Author: Leo_wl
Source: http://www.cnblogs.com/Leo_wl/