This comes from the Huawei Cloud Developer Conference lab "Using a Python crawler to scrape images and text": data is scraped with the Scrapy framework and the results are saved to a MySQL database. The lab runs on an online cloud server, while everything here is reproduced locally, so wherever things differ, that's just yours truly's own tinkering!
step1. Set up the environment
1. Create a new folder huawei
2. Set up a Python virtual environment on the command line

python -m venv venv
3. Install the Scrapy framework

Installing Scrapy on 64-bit Windows 7 with "pip install scrapy" needs the related dependencies in place first, otherwise it errors out, most notably on Twisted. Pick the wheel that matches your Python version; yours truly runs Python 3.8, so the download was Twisted-20.3.0-cp38-cp38-win_amd64.whl, installed locally with "pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl" before running "pip install scrapy". Yes, yours truly went with the local-wheel install!
For the detailed steps of installing Scrapy on Windows, a search engine is your friend!
step2. Create a Scrapy project
Again on the command line:
1. Go into the project folder and activate the virtual environment

cd huawei
venv\Scripts\activate.bat
2. Still in cmd, create the Scrapy project (the generated layout is shown right after the commands)

scrapy startproject vmall_spider
cd vmall_spider
scrapy genspider -t crawl vmall "vmall.com"
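After these commands, the generated project should look roughly like this (vmall.py is the spider created by genspider; the layout comes from a stock Scrapy install and may differ slightly by version):

vmall_spider/
    scrapy.cfg
    vmall_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            vmall.py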
step3. Key source code
1. vmall.py (the core crawling code)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from vmall_spider.items import VmallSpiderItem


class VamllSpider(CrawlSpider):
    name = 'vmall'
    allowed_domains = ['vmall.com']
    start_urls = ['https://www.vmall.com/']
    rules = (
        Rule(LinkExtractor(allow=r'.*/product/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract the product title and main image URL from the product page
        title = response.xpath("//div[@class='product-meta']/h1/text()").get()
        image = response.xpath("//a[@id='product-img']/img/@src").get()
        item = VmallSpiderItem(
            title=title,
            image=image,
        )
        print("=" * 30)
        print(title)
        print(image)
        print("=" * 30)
        yield item
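To sanity-check the two XPath expressions without hitting the site, they can be run against a hand-written HTML fragment with Scrapy's Selector. The fragment below is invented and only mimics the class/id names used above; the real product page markup may differ:

from scrapy import Selector

# Made-up snippet that mirrors the structure the spider expects
sample_html = """
<div class="product-meta"><h1>HUAWEI Sample Phone</h1></div>
<a id="product-img"><img src="https://example.com/images/phone_001.jpg"/></a>
"""

sel = Selector(text=sample_html)
print(sel.xpath("//div[@class='product-meta']/h1/text()").get())  # HUAWEI Sample Phone
print(sel.xpath("//a[@id='product-img']/img/@src").get())         # https://example.com/images/phone_001.jpg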
2.items.py
import scrapy


class VmallSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image = scrapy.Field()
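Once the fields are declared, the item behaves like a dict. A tiny standalone illustration (the class is re-declared here only so the snippet runs outside the project):

import scrapy

class VmallSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image = scrapy.Field()

item = VmallSpiderItem(title='Sample product', image='https://example.com/p_1.jpg')
print(item['title'])  # Sample product
print(dict(item))     # {'title': 'Sample product', 'image': 'https://example.com/p_1.jpg'}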
3.pipelines.py
Data storage handling: downloads each image locally and writes the record to MySQL
import os
from urllib import request

import pymysql


class VmallSpiderPipeline:
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',    # cloud database elastic public IP (here: local MySQL)
            'port': 3306,           # database port
            'user': 'vmall',        # database user
            'password': '123456',   # database (RDS) password
            'database': 'vmall',    # database name
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None
        # Save downloaded images into an "images" folder next to the project package
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    @property
    def sql(self):
        # Lazily build the INSERT statement; the table/column names are assumed here,
        # adjust them to match your own schema
        if not self._sql:
            self._sql = "insert into goods(id, title, image) values (null, %s, %s)"
        return self._sql

    def process_item(self, item, spider):
        url = item['image']
        image_name = url.split('_')[-1]
        print("--------------------------image_name-----------------------------")
        print(image_name)
        print(url)
        # Download the image into the local images folder
        request.urlretrieve(url, os.path.join(self.path, image_name))
        # Write the title and image URL to MySQL
        self.cursor.execute(self.sql, (item['title'], item['image']))
        self.conn.commit()
        return item
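The pipeline needs a table to insert into. The schema below is only an assumption that matches the (title, image) insert above; it is created here with pymysql, though running the same SQL in Navicat works just as well:

import pymysql

# Assumed schema: a 'goods' table with an auto-increment id plus title and image columns;
# adjust names and lengths to whatever your actual table looks like
conn = pymysql.connect(host='127.0.0.1', port=3306, user='vmall',
                       password='123456', database='vmall', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        create table if not exists goods (
            id int primary key auto_increment,
            title varchar(255),
            image varchar(512)
        ) default charset = utf8
    """)
conn.commit()
conn.close()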
4. settings.py
Only the modified parts of the code:
BOT_NAME = 'vmall_spider'

SPIDER_MODULES = ['vmall_spider.spiders']
NEWSPIDER_MODULE = 'vmall_spider.spiders'

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # A dict keeps only one 'User-Agent' key, so a single UA is set here;
    # see the rotation sketch below for using several
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
}

ITEM_PIPELINES = {
    'vmall_spider.pipelines.VmallSpiderPipeline': 300,
}
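If the intent of listing several user agents was to rotate them between requests, the usual Scrapy way is a small downloader middleware. A minimal sketch, with the class name and priority being my own choices rather than part of the lab:

# vmall_spider/middlewares.py (hypothetical addition)
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
]

class RandomUserAgentMiddleware:
    # Scrapy calls process_request for every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'vmall_spider.middlewares.RandomUserAgentMiddleware': 543,
}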
5. Create start.py as the entry point for running and debugging
from scrapy import cmdline

cmdline.execute("scrapy crawl vmall".split())
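The same thing can also be done without cmdline.execute, through Scrapy's CrawlerProcess API; a rough equivalent, run from the project root so settings.py is picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the spider named 'vmall'
process = CrawlerProcess(get_project_settings())
process.crawl('vmall')
process.start()  # blocks until the crawl finishes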
step4. Local database configuration
Tool: phpstudy panel (XiaoPi panel)
Connection tool: Navicat for MySQL (a quick connectivity check from Python is sketched below)
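Before running the spider, it is worth confirming that the MySQL instance started by phpstudy accepts the credentials used in pipelines.py. A minimal check, assuming the vmall user and database have already been created in the panel:

import pymysql

# Same connection parameters as in pipelines.py
conn = pymysql.connect(host='127.0.0.1', port=3306, user='vmall',
                       password='123456', database='vmall', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("select version()")
    print(cursor.fetchone())  # prints the MySQL server version tuple
conn.close()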
Run it
Source:
Lab: Using a Python crawler to scrape images and text
https://lab.huaweicloud.com/testdetail.html?testId=468&ticket=ST-1363346-YzykQhBcmiNeURp6pgL0ahIy-sso
Follow yours truly's WeChat official account: 二爷记
Reply with the keyword "华为商城" in the account backend
to get the complete project.