Scrapy Project Example: Downloading Douban Images with ImagesPipeline
Initializing the project
Run the following commands to create the Scrapy project and generate the spider:

    scrapy startproject DoubanImgs
    cd DoubanImgs
    scrapy genspider download_douban douban.com
Generating the spider script
Open spiders/download_douban.py. The core logic is as follows:

    from scrapy import Request
    from scrapy.spiders import Spider

    from ..items import DoubanImgsItem


    class download_douban(Spider):
        name = 'download_douban'

        # Browser-like headers sent with every album page request.
        default_headers = {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
            # Only advertise encodings Scrapy can decode without extra packages.
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Host': 'www.douban.com',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
        }

        def __init__(self, url='1638835355', *args, **kwargs):
            # `url` is the album ID; override it with: scrapy crawl download_douban -a url=<album id>
            super().__init__(*args, **kwargs)
            self.allowed_domains = ['douban.com']
            self.url = url
            # Build 23 album page URLs up front; pages after the first use an offset in steps of 18.
            self.start_urls = []
            for i in range(23):
                if i == 0:
                    page_url = f'http://www.douban.com/photos/album/{url}'
                else:
                    page_url = f'http://www.douban.com/photos/album/{url}/{i*18}'
                self.start_urls.append(page_url)

        def start_requests(self):
            # Send the custom headers with each album page request.
            for url in self.start_urls:
                yield Request(
                    url=url,
                    headers=self.default_headers,
                    callback=self.parse
                )

        def parse(self, response):
            # Collect the src of every image in the album's photo list.
            list_imgs = response.xpath('//div[@class="photolst clearfix"]//img/@src').extract()
            if list_imgs:
                item = DoubanImgsItem()
                item['image_urls'] = list_imgs
                yield item

Edit DoubanImgs/settings.py as follows:
    # -*- coding: utf-8 -*-
    BOT_NAME = 'DoubanImgs'

    SPIDER_MODULES = ['DoubanImgs.spiders']
    NEWSPIDER_MODULE = 'DoubanImgs.spiders'

    ITEM_PIPELINES = {
        'DoubanImgs.pipelines.DoubanImgDownloadPipeline': 300,
    }

    IMAGES_STORE = '.'        # store images in the current directory (the project directory when run from there)
    IMAGES_EXPIRES = 90       # days before a previously downloaded image is fetched again
    HTTPCACHE_ENABLED = True  # enable the HTTP cache
    # HTTPCACHE_DIR = 'httpcache'  # change the cache directory if needed

DoubanImgs/items.py:
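    # -*- coding: utf-8 -*-
    from scrapy import Field, Item


    class DoubanImgsItem(Item):
        image_urls = Field()   # URLs of the images to download
        images = Field()       # information about the downloaded images
        image_paths = Field()  # local paths assigned in pipelines.py (item_completed)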
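To make the stored image paths a little more concrete: as far as I know, Scrapy's stock ImagesPipeline saves each download under `full/<SHA-1 of the URL>.jpg` relative to IMAGES_STORE, and that relative path is what appears as `path` in the download results. The sketch below reproduces that naming rule in plain Python, without Scrapy. `default_image_path` and the example URL are hypothetical names used only for illustration, and a pipeline that overrides `file_path()` would name files differently.

```python
import hashlib


def default_image_path(url: str) -> str:
    """Approximate the relative path Scrapy's stock ImagesPipeline uses.

    Assumption: the pipeline does not override file_path(), so the file
    name is the SHA-1 hex digest of the request URL plus a .jpg extension.
    """
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{image_guid}.jpg"


# Hypothetical image URL, used only for illustration.
example_url = "https://img.example.com/view/photo/m/public/p123.jpg"
example_path = default_image_path(example_url)
print(example_path)
```

Because the name depends only on the URL, the same image requested twice maps to the same file, which is what lets the pipeline skip recently downloaded images.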
DoubanImgs/pipelines.py:
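    # -*- coding: utf-8 -*-
    from scrapy import Request
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline


    class DoubanImgDownloadPipeline(ImagesPipeline):

        def get_media_requests(self, item, info):
            # Issue one download request per image URL collected by the spider.
            for image_url in item['image_urls']:
                yield Request(
                    url=image_url,
                    headers={
                        'accept': 'image/webp,image/*,*/*;q=0.8',
                        # Only advertise encodings Scrapy can decode without extra packages.
                        'accept-encoding': 'gzip, deflate',
                        'accept-language': 'zh-CN,zh;q=0.8,en;q=0.6',
                        'referer': image_url,
                        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36'
                    }
                )

        def item_completed(self, results, item, info):
            # results is a list of (success, info) tuples; keep the paths of successful downloads.
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            # The item class must declare an image_paths field for this assignment to work.
            item['image_paths'] = image_paths
            return item

scrapy startproject DoubanImgs creates the new project.
scrapy genspider download_douban douban.com creates the spider.
settings.py sets the image storage path, the image expiry time, and related parameters.
items.py defines the structure of the scraped image data.
pipelines.py defines the image download and storage logic.
scrapy crawl download_douban runs the crawl.

To improve download throughput or handle more pages, extend the spider logic or adjust the settings to fit your needs.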
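One way to handle more (or fewer) pages is to pull the hard-coded 23 out of the spider's `__init__`. The helper below is a minimal sketch that builds the same list of album page URLs for any page count. It mirrors the URL pattern used in the spider above as-is; whether Douban actually serves album pages at these addresses is not verified here. `album_page_urls`, `pages`, and `per_page` are hypothetical names introduced for this example.

```python
from typing import List


def album_page_urls(album_id: str, pages: int, per_page: int = 18) -> List[str]:
    """Build album page URLs using the same pattern as the spider above.

    The first page is the bare album URL; later pages append an offset
    that grows in steps of `per_page` (18 in the original spider).
    """
    base = f"http://www.douban.com/photos/album/{album_id}"
    urls = []
    for page in range(pages):
        if page == 0:
            urls.append(base)
        else:
            urls.append(f"{base}/{page * per_page}")
    return urls


# Same album ID as the spider's default argument.
for page_url in album_page_urls("1638835355", pages=3):
    print(page_url)
```

Inside the spider, `self.start_urls = album_page_urls(url, pages=23)` would reproduce the current behavior. Accepting the page count as an extra spider argument (passed with `-a`, and converted with `int()` since such arguments arrive as strings) would be one possible extension.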
Reposted from: http://xsbxz.baihongyu.com/