Extracting specific data from an HTML file with Python: implementation code

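I have started using HTMLParser in Python to extract data from a website.

I get everything I want, except the text between two HTML tags.

Here is an example of the HTML tag: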

<a class="Vocabulary">Swahili</a>

There are other tags that also start with <a. They have other attributes and values, so I do not want their data:

<a ...>Thilo Schadeberg</a>

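The tag is embedded in a table. I don't know whether that makes any difference compared with the other tags.

I only need the information from tags named 'a' that have the attribute class="Vocabulary", and I want the data inside the tag, which in the example is "Swahili".

So what I did is: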

class AllLanguages(HTMLParser):
    '''
    classdocs
    '''
    #counter for the languages
    #countLanguages = 0

    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None
        #self.text = ""

    def handle_starttag(self, tag, attr):
        #print("Encountered a start tag:", tag)
        if tag == 'a':
            for name, value in attr:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag
                    #self.lastname = name
                    #self.lastvalue = value
                    print(self.lasttag)
                    #print(self.lastname)
                    #print(self.lastvalue)
                    #return tag
                    print(self.countLanguages)

    def handle_endtag(self, tag):
        if tag == "a":
            self.inlink = False
            #print("".join(self.data))

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            #self.dataArray.append(data)
            print(data)

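The program prints the data contained in every <a> tag, but I only want the data from the <a> tags that have the right attributes.

How can I get this specific data?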

It looks like you forgot to reset self.inLink = False by default in handle_starttag:

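from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser


class AllLanguages(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.inLink = False
        self.dataArray = []
        self.countLanguages = 0
        self.lasttag = None
        self.lastname = None
        self.lastvalue = None

    def handle_starttag(self, tag, attrs):
        # Reset the flag on every start tag so that only a matching <a> sets it.
        self.inLink = False
        if tag == 'a':
            for name, value in attrs:
                if name == 'class' and value == 'Vocabulary':
                    self.countLanguages += 1
                    self.inLink = True
                    self.lasttag = tag

    def handle_endtag(self, tag):
        if tag == "a":
            # The attribute is inLink (capital L); assigning self.inlink would
            # create a new attribute and leave the flag set.
            self.inLink = False

    def handle_data(self, data):
        if self.lasttag == 'a' and self.inLink and data.strip():
            print(data)


parser = AllLanguages()
# Sample markup reconstructed from the question's description;
# attributes other than class are omitted.
parser.feed("""
<html>
<body>
<a class="Vocabulary">Swahili</a>
<a>Thilo Schadeberg</a>
<a class="Vocabulary">English</a>
<a class="Vocabulary">Russian</a>
</body>
</html>""")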

This prints:

Swahili

English

Russian
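If the goal is to keep the matching texts rather than print them, a variant along the following lines may be useful. It is only a sketch for Python 3: the class name VocabularyLinks, the attributes in_link and names, and the SAMPLE markup are all hypothetical and chosen for illustration. It converts the attribute list to a dict for the lookup and resets the flag in handle_endtag, so text that follows a closing </a> is not collected.

```python
from html.parser import HTMLParser


class VocabularyLinks(HTMLParser):
    """Collect the text of every <a class="Vocabulary"> element."""

    def __init__(self):
        super().__init__()
        self.in_link = False  # True only while inside a matching <a>
        self.names = []       # collected link texts

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict simplifies the lookup.
        if tag == "a" and dict(attrs).get("class") == "Vocabulary":
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False

    def handle_data(self, data):
        text = data.strip()
        if self.in_link and text:
            self.names.append(text)


# Hypothetical sample markup, modelled on the example in the question.
SAMPLE = """
<table><tr>
<td><a class="Vocabulary">Swahili</a> trailing text</td>
<td><a>Thilo Schadeberg</a></td>
<td><a class="Vocabulary">English</a></td>
</tr></table>
"""

parser = VocabularyLinks()
parser.feed(SAMPLE)
parser.close()
print(parser.names)  # ['Swahili', 'English']
```

Note that the comparison is an exact match on the class attribute; a tag such as class="Vocabulary extra" would need a check like "Vocabulary" in value.split() instead.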

Also, take a look at:

Scrapy

lxml

BeautifulSoup
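The libraries listed above are third-party packages, and their APIs are not shown here. As a rough stand-in using only the standard library, and assuming the markup happens to be well-formed XML, xml.etree.ElementTree can select elements by attribute value. This is a swapped-in technique, not one of the libraries named above, and the SAMPLE markup is hypothetical. Real-world HTML is often not well-formed, in which case ET.fromstring raises a ParseError.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup modelled on the question.
SAMPLE = """
<table>
  <tr>
    <td><a class="Vocabulary">Swahili</a></td>
    <td><a>Thilo Schadeberg</a></td>
    <td><a class="Vocabulary">English</a></td>
  </tr>
</table>
"""

root = ET.fromstring(SAMPLE.strip())

# ElementTree's limited XPath dialect supports attribute predicates.
links = root.findall(".//a[@class='Vocabulary']")
languages = [link.text.strip() for link in links if link.text]
print(languages)  # ['Swahili', 'English']
```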

Hope that helps.