This comes from the Huawei Cloud Developer Conference lab "Using a Python crawler to scrape images and text": data is scraped with the Scrapy framework and the results are saved to a MySQL database. The lab runs on an online cloud server, while everything here is reproduced locally, so wherever things differ, that's just yours truly's own tinkering!
step1. Set up the environment
1. Create a new folder huawei
2. Set up a Python virtual environment on the command line

python -m venv venv
3. Install the Scrapy framework

Installing Scrapy on 64-bit Windows 7 with "pip install scrapy" needs the related dependencies in place first, otherwise it errors out, most notably on Twisted. Pick the wheel that matches your Python version; yours truly runs Python 3.8, so the download was Twisted-20.3.0-cp38-cp38-win_amd64.whl, installed locally with "pip install Twisted-20.3.0-cp38-cp38-win_amd64.whl" before running "pip install scrapy". Yes, yours truly went with the local-wheel install!
For the detailed steps of installing Scrapy on Windows, a search engine is your friend!
step2. Create a Scrapy project
Again on the command line:
1. Go into the project folder and activate the virtual environment

cd huawei
venv\Scripts\activate.bat
2. Still in cmd, create the Scrapy project (the generated layout is shown right after the commands)

scrapy startproject vmall_spider
cd vmall_spider
scrapy genspider -t crawl vmall "vmall.com"
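After these commands, the generated project should look roughly like this (vmall.py is the spider created by genspider; the layout comes from a stock Scrapy install and may differ slightly by version):

vmall_spider/
    scrapy.cfg
    vmall_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            vmall.py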
step3. Key source code
1. vmall.py (the core crawling code)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from vmall_spider.items import VmallSpiderItem


class VamllSpider(CrawlSpider):
    name = 'vmall'
    allowed_domains = ['vmall.com']
    start_urls = ['https://www.vmall.com/']
    rules = (
        Rule(LinkExtractor(allow=r'.*/product/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Extract the product title and main image URL from the product page
        title = response.xpath("//div[@class='product-meta']/h1/text()").get()
        image = response.xpath("//a[@id='product-img']/img/@src").get()
        item = VmallSpiderItem(
            title=title,
            image=image,
        )
        print("=" * 30)
        print(title)
        print(image)
        print("=" * 30)
        yield item
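To sanity-check the two XPath expressions without hitting the site, they can be run against a hand-written HTML fragment with Scrapy's Selector. The fragment below is invented and only mimics the class/id names used above; the real product page markup may differ:

from scrapy import Selector

# Made-up snippet that mirrors the structure the spider expects
sample_html = """
<div class="product-meta"><h1>HUAWEI Sample Phone</h1></div>
<a id="product-img"><img src="https://example.com/images/phone_001.jpg"/></a>
"""

sel = Selector(text=sample_html)
print(sel.xpath("//div[@class='product-meta']/h1/text()").get())  # HUAWEI Sample Phone
print(sel.xpath("//a[@id='product-img']/img/@src").get())         # https://example.com/images/phone_001.jpg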
2.items.py
import scrapy


class VmallSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image = scrapy.Field()
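Once the fields are declared, the item behaves like a dict. A tiny standalone illustration (the class is re-declared here only so the snippet runs outside the project):

import scrapy

class VmallSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image = scrapy.Field()

item = VmallSpiderItem(title='Sample product', image='https://example.com/p_1.jpg')
print(item['title'])  # Sample product
print(dict(item))     # {'title': 'Sample product', 'image': 'https://example.com/p_1.jpg'}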
3.pipelines.py
Data storage handling: downloads each image locally and writes the record to MySQL
import os
from urllib import request

import pymysql


class VmallSpiderPipeline:
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',    # cloud database elastic public IP (here: local MySQL)
            'port': 3306,           # database port
            'user': 'vmall',        # database user
            'password': '123456',   # database (RDS) password
            'database': 'vmall',    # database name
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None
        # Save downloaded images into an "images" folder next to the project package
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    @property
    def sql(self):
        # Lazily build the INSERT statement; the table/column names are assumed here,
        # adjust them to match your own schema
        if not self._sql:
            self._sql = "insert into goods(id, title, image) values (null, %s, %s)"
        return self._sql

    def process_item(self, item, spider):
        url = item['image']
        image_name = url.split('_')[-1]
        print("--------------------------image_name-----------------------------")
        print(image_name)
        print(url)
        # Download the image into the local images folder
        request.urlretrieve(url, os.path.join(self.path, image_name))
        # Write the title and image URL to MySQL
        self.cursor.execute(self.sql, (item['title'], item['image']))
        self.conn.commit()
        return item
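The pipeline needs a table to insert into. The schema below is only an assumption that matches the (title, image) insert above; it is created here with pymysql, though running the same SQL in Navicat works just as well:

import pymysql

# Assumed schema: a 'goods' table with an auto-increment id plus title and image columns;
# adjust names and lengths to whatever your actual table looks like
conn = pymysql.connect(host='127.0.0.1', port=3306, user='vmall',
                       password='123456', database='vmall', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("""
        create table if not exists goods (
            id int primary key auto_increment,
            title varchar(255),
            image varchar(512)
        ) default charset = utf8
    """)
conn.commit()
conn.close()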
4. settings.py
Only the modified parts of the code:
BOT_NAME = 'vmall_spider'

SPIDER_MODULES = ['vmall_spider.spiders']
NEWSPIDER_MODULE = 'vmall_spider.spiders'

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # A dict keeps only one 'User-Agent' key, so a single UA is set here;
    # see the rotation sketch below for using several
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
}

ITEM_PIPELINES = {
    'vmall_spider.pipelines.VmallSpiderPipeline': 300,
}
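If the intent of listing several user agents was to rotate them between requests, the usual Scrapy way is a small downloader middleware. A minimal sketch, with the class name and priority being my own choices rather than part of the lab:

# vmall_spider/middlewares.py (hypothetical addition)
import random

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
]

class RandomUserAgentMiddleware:
    # Scrapy calls process_request for every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(USER_AGENTS)
        return None

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'vmall_spider.middlewares.RandomUserAgentMiddleware': 543,
}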
5. Create start.py as the entry point for running and debugging
from scrapy import cmdline

cmdline.execute("scrapy crawl vmall".split())
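The same thing can also be done without cmdline.execute, through Scrapy's CrawlerProcess API; a rough equivalent, run from the project root so settings.py is picked up:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings.py and run the spider named 'vmall'
process = CrawlerProcess(get_project_settings())
process.crawl('vmall')
process.start()  # blocks until the crawl finishes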
step4. Local database configuration
Tool: phpstudy panel (XiaoPi panel)
Connection tool: Navicat for MySQL (a quick connectivity check from Python is sketched below)
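Before running the spider, it is worth confirming that the MySQL instance started by phpstudy accepts the credentials used in pipelines.py. A minimal check, assuming the vmall user and database have already been created in the panel:

import pymysql

# Same connection parameters as in pipelines.py
conn = pymysql.connect(host='127.0.0.1', port=3306, user='vmall',
                       password='123456', database='vmall', charset='utf8')
with conn.cursor() as cursor:
    cursor.execute("select version()")
    print(cursor.fetchone())  # prints the MySQL server version tuple
conn.close()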
Run it
Source:
Lab: Using a Python crawler to scrape images and text
https://lab.huaweicloud.com/testdetail.html?testId=468&ticket=ST-1363346-YzykQhBcmiNeURp6pgL0ahIy-sso
Follow yours truly's WeChat official account: 二爷记
Reply with the keyword "华为商城" in the account backend
to get the complete project.