My first small Python program: parsing a web page
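I have wanted to learn Python for a long time but never found a suitable project to practice on. Python's syntax is fun and very concise to write, and today I finally had time to put this little project together, looking things up as I went. Since there are still not many libraries for Python 3.x and most of the material I found targets Python 2.x, I used Python 2.7.
I had heard that network access is easy in Python, and this time I really felt it. A few lines are enough, unlike Java, where even the simplest request has to be wrapped in several layers before it is usable.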
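For readers on Python 3, where urllib2 has been split into urllib.request and urllib.error, the "few lines" look roughly like the sketch below. The helper name fetch and its timeout default are my own choices for illustration, not something from the original program, and the data: URL is only there so the demo runs without touching the network.

```python
import urllib.request


def fetch(url, timeout=10):
    """Return the raw bytes served at `url` (hypothetical helper)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


# A data: URL keeps the demo offline; pass an http(s) URL to hit a real site.
body = fetch("data:text/plain,hello")
print(body)
```

The result is bytes, so a real page usually needs a .decode() with the page's encoding before it is parsed as text.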
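The target this time is the Qiushibaike (糗事百科) site: the program scrapes the latest posts from it, tidies them up, and prints them.
Enough talk, here is the code: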
import urllib2
import sgmllib


class Entry:
    """One post scraped from the page."""
    author = ''
    content = ''
    pic = ''
    up = 0
    down = 0
    tag = ''
    comment = 0

    def to_string(self):
        return '[Entry: author=%s content=%s pic=%s tag=%s up=%d down=%d comment=%d]' % (
            self.author, self.content, self.pic, self.tag,
            self.up, self.down, self.comment)
class MyHTMLParser(sgmllib.SGMLParser):
    def reset(self):
        # SGMLParser.__init__ calls reset(), so per-instance state is set up
        # here and two parsers never share the same lists
        sgmllib.SGMLParser.reset(self)
        # all the text chunks seen so far
        self.datas = []
        # all the entries
        self.entries = []
        # the entry being filled in
        self.entry = Entry()
        # class of the last tracked div whose value has not been read yet
        self.div_tag_unclosed = ''

    def start_div(self, attrs):
        for name, value in attrs:
            if name == 'class' and value == 'content':
                self.div_tag_unclosed = 'content'
            elif name == 'class' and value == 'tags':
                self.div_tag_unclosed = 'tags'
            elif name == 'class' and value == 'up':
                self.div_tag_unclosed = 'up'
            elif name == 'class' and value == 'down':
                self.div_tag_unclosed = 'down'
            elif name == 'class' and value == 'comment':
                self.div_tag_unclosed = 'comment'
            elif name == 'class' and value == 'author':
                # an author div marks the start of a new entry
                self.div_tag_unclosed = 'author'
                self.entry = Entry()
            elif name == 'class' and value == 'thumb':
                self.div_tag_unclosed = 'thumb'

    def end_div(self):
        # the content text sits directly inside its div
        if self.div_tag_unclosed == 'content':
            self.div_tag_unclosed = ''
            self.entry.content = self.datas.pop().strip()

    def start_a(self, attrs):
        pass

    def start_img(self, attrs):
        if self.div_tag_unclosed == 'thumb':
            for name, value in attrs:
                if name == 'src':
                    self.div_tag_unclosed = ''
                    # store into `pic`, the attribute that Entry declares and prints
                    self.entry.pic = value.strip()

    def end_img(self):
        pass

    def end_a(self):
        # every other field is wrapped in an <a>, so read it when the link closes
        if self.div_tag_unclosed == 'author':
            self.div_tag_unclosed = ''
            self.entry.author = self.datas.pop().strip()
        elif self.div_tag_unclosed == 'tags':
            self.div_tag_unclosed = ''
            self.entry.tag = self.datas.pop().strip()
        elif self.div_tag_unclosed == 'up':
            self.div_tag_unclosed = ''
            self.entry.up = int(self.datas.pop().strip())
        elif self.div_tag_unclosed == 'down':
            self.div_tag_unclosed = ''
            self.entry.down = int(self.datas.pop().strip())
        elif self.div_tag_unclosed == 'comment':
            self.div_tag_unclosed = ''
            self.entry.comment = int(self.datas.pop().strip())
            # the comment count is the last field, so the entry is complete
            self.entries.append(self.entry)

    def handle_data(self, data):
        # print 'data', data
        self.datas.append(data)
# request the url
response = urllib2.urlopen('http://www.qiushibaike.com/8hr')
html = response.read()
# parse HTML
parser = MyHTMLParser()
parser.feed(html)
parser.close()
# print all the entries
for entry in parser.entries:
    print entry.to_string()
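The whole program is very simple: urllib2 makes the network request and sgmllib parses the HTML. Since this is my first Python program, writing it was slow going, especially because I kept wanting to put parentheses after if =-=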
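sgmllib was removed in Python 3, so here is a rough sketch of the same idea (remember which div you are inside, then collect its text) on top of the standard html.parser module. Everything in it is an assumption made for illustration: the EntryParser class, the TRACKED set, the dict-based entries, and above all the SAMPLE markup, which only imitates the div classes the program looks for and is not the site's real HTML. Unlike the program above, it stores an entry as soon as the author div opens rather than when the comment count is read.

```python
from html.parser import HTMLParser

# div classes whose text we capture; assumed markup, not the live site's
TRACKED = {"author", "content", "up", "down"}


class EntryParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.entries = []    # finished entries, one dict each
        self.current = None  # entry being filled in
        self.field = ""      # tracked div we are currently inside, if any

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class")
        if css_class == "author":
            # as in the article, an author div marks the start of a new entry
            self.current = {}
            self.entries.append(self.current)
        if css_class in TRACKED and self.current is not None:
            self.field = css_class
            self.current[css_class] = ""

    def handle_endtag(self, tag):
        if tag == "div" and self.field:
            text = self.current[self.field]
            if self.field in ("up", "down"):
                # vote counts become integers; anything non-numeric falls back to 0
                self.current[self.field] = int(text) if text.isdigit() else 0
            self.field = ""

    def handle_data(self, data):
        # text outside a tracked div is ignored
        if self.field:
            self.current[self.field] += data.strip()


SAMPLE = """
<div class="author"><a href="#">Tom</a></div>
<div class="content">First story</div>
<div class="up"><a href="#">12</a></div>
<div class="down"><a href="#">3</a></div>
<div class="author"><a href="#">Ann</a></div>
<div class="content">Second story</div>
"""

parser = EntryParser()
parser.feed(SAMPLE)
parser.close()
for entry in parser.entries:
    print(entry)
```

This sketch does not handle nested divs: any closing div ends the current field, which is good enough for flat markup like the sample but would need a depth counter on a real page.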
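From sheling's blog on cnblogs (博客园): http://www.cnblogs.com/sheling
Copyright of this article belongs to the author. Reposting is welcome, but unless the author agrees otherwise, this notice must be kept and a link to the original must be placed in a prominent position on the article page; otherwise the author reserves the right to pursue legal liability.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 2.5 China Mainland license.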
Author: Leo_wl
Source: http://www.cnblogs.com/Leo_wl/