Baidu Wenku scraper: crawl Baidu Wenku content with Python and export it to a Word document (low-end version)

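This is a fairly simple Wenku scraper, so its drawbacks are many and obvious, and it is admittedly pretty low-end. It can only scrape Word and txt documents (forget about ppt), it cannot handle documents with folded (collapsed) content, and VIP content is of course out of the question. Baidu really is greedy here; having money really does let you do whatever you want!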

The key point is the request header: the content is only returned when the request carries a crawler's User-Agent!

header = {'User-agent': 'Googlebot'}
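To see what this header looks like on an actual request object without touching the network, here is a minimal sketch. It uses the standard library's `urllib.request` rather than `requests`, purely so the snippet has no third-party dependency; `build_crawler_request` and the `example.com` URL are hypothetical names made up for illustration, not part of the original script.

```python
from urllib.request import Request


# Hypothetical helper (not in the original post): build a GET request that
# identifies itself as a search-engine crawler. Nothing is sent here.
def build_crawler_request(url, user_agent="Googlebot"):
    return Request(url, headers={"User-agent": user_agent})


# Placeholder URL, used only to construct the request object.
req = build_crawler_request("https://example.com/view/some-document.html")
print(req.get_method())               # GET
print(req.get_header("User-agent"))   # Googlebot
```

With `requests`, the same dictionary is simply passed as `headers=header`, as the full script does.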

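To output a Word document, you need the docx library!

The formatting is, admittedly, only barely acceptable, but something is better than nothing, don't you think?!

Install the docx library with pip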

pip install python_docx

Documentation: https://python-docx.readthedocs.io/en/latest/

Reference code:

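from docx import Document

def get_word(data):
    # data is a (title, paragraphs) tuple
    document = Document()
    document.add_heading(data[0])  # use the title as the heading

    for detail in data[1]:
        document.add_paragraph(detail)  # add one paragraph per text fragment

    document.save(f'{data[0]}.docx')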
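One thing the reference code leaves open is what happens when the title contains characters that are not allowed in file names, or when the scraped text contains blank fragments. The following is a minimal sketch of two optional pre-processing helpers; `safe_filename` and `clean_paragraphs` are hypothetical names introduced here, and the sample strings are made up.

```python
import re


# Hypothetical helpers, not part of the original post.
def safe_filename(title, replacement="_"):
    """Replace characters that common file systems reject in file names."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', replacement, title).strip()
    return cleaned or "untitled"


def clean_paragraphs(details):
    """Strip each text fragment and drop the ones that end up empty."""
    return [text.strip() for text in details if text.strip()]


title = safe_filename("Report: Q1/Q2?")
paragraphs = clean_paragraphs(["  first line ", "\n", "", "second line"])
print(title)        # Report_ Q1_Q2_
print(paragraphs)   # ['first line', 'second line']
```

Under these assumptions, the cleaned values could be passed on as `get_word((title, paragraphs))`, at the cost of the heading also showing the sanitized title.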

Full code for reference:

# -*- coding: UTF-8 -*-
# Baidu Wenku scraper
# 20200803 WeChat: huguo00289
# https://wenku.baidu.com/view/312ce9da0129bd64783e0912a216147916117e27.html

import re

import requests
from lxml import etree
from docx import Document


def get_detail(url):
    # url = 'https://wenku.baidu.com/view/312ce9da0129bd64783e0912a216147916117e27.html'
    header = {'User-agent': 'Googlebot'}  # crawler User-Agent; needed to get the content
    response = requests.get(url, headers=header).content.decode('gbk')
    # print(response)
    title_ze = r'<title>(.+?)_百度文库</title>'
    div_ze = r'<div class="bd doc-reader">(.+?)<div class="aside">'
    title = re.findall(title_ze, response, re.S)[0]
    div = re.findall(div_ze, response, re.S)[0]
    div = etree.HTML(div)
    details = div.xpath('//div//text()')  # every text node inside the reader block
    # detail = '\n'.join(details)
    data = title, details
    print(data)
    return data


def get_word(data):
    document = Document()
    document.add_heading(data[0])

    for detail in data[1]:
        document.add_paragraph(detail)  # add a paragraph

    document.save(f'{data[0]}.docx')


if __name__ == '__main__':
    url = "https://wenku.baidu.com/view/cb02b4a91837f111f18583d049649b6648d7092e"
    text = get_detail(url)
    get_word(text)
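The two regular expressions in `get_detail` do most of the work, and `re.findall(...)[0]` raises an `IndexError` when a page does not match (for example a folded or VIP document). As a rough sketch of the same extraction with a guard for that case, the snippet below runs the patterns against a made-up HTML string; `extract` and `sample` are hypothetical and only mimic the page structure the patterns expect.

```python
import re

# Same patterns as in get_detail, compiled once.
TITLE_RE = re.compile(r'<title>(.+?)_百度文库</title>', re.S)
BODY_RE = re.compile(r'<div class="bd doc-reader">(.+?)<div class="aside">', re.S)


# Hypothetical helper (not in the original post).
def extract(html):
    """Return (title, body_html), or None if either pattern does not match."""
    title = TITLE_RE.search(html)
    body = BODY_RE.search(html)
    if title is None or body is None:
        return None
    return title.group(1), body.group(1)


# Made-up sample page, shaped only to satisfy the two patterns.
sample = (
    '<html><head><title>示例文档_百度文库</title></head><body>'
    '<div class="bd doc-reader"><p>第一段</p><p>第二段</p></div>'
    '<div class="aside">sidebar</div></body></html>'
)
print(extract(sample))
# ('示例文档', '<p>第一段</p><p>第二段</p></div>')
print(extract('<html><title>login page</title></html>'))
# None
```

Note that the non-greedy body match stops at the first `<div class="aside">`, so the captured fragment still carries the closing `</div>` of the reader block; `etree.HTML` tolerates that kind of unbalanced fragment.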


WeChat official account: 二爷记

Python source code and tools, shared from time to time

Disclaimer: this article comes from the web and does not represent the views of 好得很程序员自学网. When reposting, please credit the source: http://www.haodehen.cn/did126123

Updated: 2022-11-28