

Scrapy crawler: a Huawei Vmall (华为商城) product data crawler demo

This demo comes from the Huawei Cloud Developer Conference lab "Using a Python crawler to scrape images and text". It uses the Scrapy framework to crawl the data and a MySQL database to store it. The lab runs on a cloud server, while everything here is reproduced locally, so if anything differs, it is definitely this noob's own tinkering!

Step 1. Set up the environment

1. Create a folder named huawei

2. Create a Python virtual environment from the command line

python -m venv env
 

3. Install the Scrapy framework

On 64-bit Windows 7, installing Scrapy with "pip install scrapy" needs its dependencies in place first, otherwise it errors out, Twisted in particular. Pick the wheel that matches your Python version; this noob is on Python 3.8, so the download was Twisted-20.3.0-cp38-cp38-win_amd64.whl. Yes, it was installed locally from the wheel!

For the detailed steps of installing Scrapy on Windows, make good use of a search engine!
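
A quick way to confirm the install worked is to import both packages inside the activated virtual environment and print their versions (a small sanity check, not part of the original lab):

import scrapy
import twisted

# If both imports succeed, the wheel-based install went through
print("Scrapy:", scrapy.__version__)
print("Twisted:", twisted.__version__)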

Step 2. Create the Scrapy project

These steps are also done on the command line.

1. Change into the project directory and activate the virtual environment

cd huawei
env\Scripts\activate.bat
 

2. Still in cmd, create the Scrapy project and generate the CrawlSpider

scrapy startproject vmall_spider

cd vmall_spider

scrapy genspider -t crawl vmall "vmall.com"
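
After these commands the project looks roughly like this (default Scrapy template; vmall.py is the spider generated by genspider):

vmall_spider/
    scrapy.cfg
    vmall_spider/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            vmall.py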
 

Step 3. Key source code

1. vmall.py (the core crawling code)

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from vmall_spider.items import VmallSpiderItem


class VmallSpider(CrawlSpider):
    name = 'vmall'
    allowed_domains = ['vmall.com']
    start_urls = ['https://www.vmall.com/']
    # Follow every link that looks like a product page and hand it to parse_item
    rules = (
        Rule(LinkExtractor(allow=r'.*/product/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # Product title and main image URL, as laid out on the product detail page
        title = response.xpath("//div[@class='product-meta']/h1/text()").get()
        image = response.xpath("//a[@id='product-img']/img/@src").get()
        item = VmallSpiderItem(
            title=title,
            image=image,
        )
        print("=" * 30)
        print(title)
        print(image)
        print("=" * 30)
        yield item
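
The two XPath expressions assume a specific product-page layout. A quick way to see what they pull out, without touching the live site, is to run them against a hand-written HTML snippet that mirrors that structure (the sample markup below is invented purely for illustration):

from scrapy.http import HtmlResponse

# Fabricated HTML mimicking the structure the spider's XPath expects
sample_html = b"""
<html><body>
  <div class="product-meta"><h1>HUAWEI Test Phone</h1></div>
  <a id="product-img"><img src="https://example.com/img/phone_001.jpg"></a>
</body></html>
"""

response = HtmlResponse(url="https://example.com/product/1",
                        body=sample_html, encoding='utf-8')
print(response.xpath("//div[@class='product-meta']/h1/text()").get())  # HUAWEI Test Phone
print(response.xpath("//a[@id='product-img']/img/@src").get())         # .../phone_001.jpg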
 

2.items.py

import scrapy


class VmallSpiderItem(scrapy.Item):
    title = scrapy.Field()
    image = scrapy.Field()
 

3.pipelines.py

Stores the scraped data in MySQL and downloads the product images.

 
import pymysql
import os
from urllib import request


class VmallSpiderPipeline:
    def __init__(self):
        dbparams = {
            'host': '127.0.0.1',   # cloud database elastic public IP (a local MySQL here)
            'port': 3306,          # cloud database port
            'user': 'vmall',       # cloud database user
            'password': '123456',  # cloud database RDS password
            'database': 'vmall',   # database name
            'charset': 'utf8'
        }
        self.conn = pymysql.connect(**dbparams)
        self.cursor = self.conn.cursor()
        self._sql = None

        # Images are saved into an "images" folder next to the project package
        self.path = os.path.join(os.path.dirname(os.path.dirname(__file__)), 'images')
        if not os.path.exists(self.path):
            os.mkdir(self.path)

    @property
    def sql(self):
        # Table and column names are assumed here; adjust them to your own schema (see Step 4)
        if not self._sql:
            self._sql = "insert into goods(id, title, image) values(null, %s, %s)"
        return self._sql

    def process_item(self, item, spider):
        url = item['image']
        image_name = url.split('_')[-1]
        print("--------------------------image_name-----------------------------")
        print(image_name)
        print(url)
        # Download the image, then record the title and image URL in MySQL
        request.urlretrieve(url, os.path.join(self.path, image_name))
        self.cursor.execute(self.sql, (item['title'], item['image']))
        self.conn.commit()
        return item
 

4. settings.py

Only the changed parts are shown.

BOT_NAME = 'vmall_spider'

SPIDER_MODULES = ['vmall_spider.spiders']
NEWSPIDER_MODULE = 'vmall_spider.spiders'

ROBOTSTXT_OBEY = False

DOWNLOAD_DELAY = 3

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    # A dict can only hold one 'User-Agent' key; to rotate several user
    # agents, use a downloader middleware (see the sketch after this block)
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
}

ITEM_PIPELINES = {
    'vmall_spider.pipelines.VmallSpiderPipeline': 300,
}
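
Listing three user agents in DEFAULT_REQUEST_HEADERS does not rotate them, since later dict keys simply overwrite earlier ones. If rotation is what you are after, a small downloader middleware is the usual approach; here is a minimal sketch reusing those three user agents (the RandomUserAgentMiddleware class and the UA_LIST name are mine, not from the original lab):

# middlewares.py
import random

UA_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.16 (KHTML, like Gecko) Chrome/10.0.648.133 Safari/534.16',
]


class RandomUserAgentMiddleware:
    # Scrapy calls this once for every outgoing request
    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(UA_LIST)

Enable it in settings.py:

DOWNLOADER_MIDDLEWARES = {
    'vmall_spider.middlewares.RandomUserAgentMiddleware': 400,
}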

 

5. Create start.py as an entry point for running and debugging

from scrapy import cmdline

# Equivalent to running "scrapy crawl vmall" from the project root
cmdline.execute("scrapy crawl vmall".split())

 

Step 4. Local database setup

Tool: phpStudy panel (小皮面板)

Connection tool: Navicat for MySQL

Run it.
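
The pipeline in Step 3 inserts into a table that has to exist before the crawl starts. A minimal setup script, assuming the credentials from pipelines.py and a hypothetical goods table whose columns match the insert statement there (adjust the names to your actual schema):

import pymysql

# Connect to the local MySQL instance set up through the phpStudy panel
conn = pymysql.connect(host='127.0.0.1', port=3306, user='vmall',
                       password='123456', database='vmall', charset='utf8')
with conn.cursor() as cursor:
    # Assumed schema matching the insert in pipelines.py
    cursor.execute("""
        CREATE TABLE IF NOT EXISTS goods (
            id INT PRIMARY KEY AUTO_INCREMENT,
            title VARCHAR(255),
            image VARCHAR(512)
        ) DEFAULT CHARSET = utf8
    """)
conn.commit()
conn.close()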

Source:

Lab: Using a Python crawler to scrape images and text

https://lab.huaweicloud.com/testdetail.html?testId=468&ticket=ST-1363346-YzykQhBcmiNeURp6pgL0ahIy-sso

Follow this noob's WeChat official account: 二爷记

Reply with the keyword "华为商城" to get the complete project.

