Python Web Scraping for Beginners (Part 9): A Multithreaded Scraper Walkthrough - Python

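Preface
The text and images in this article come from the internet and are provided for learning and exchange only, with no commercial purpose. If there is any problem, please contact us promptly so we can address it.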

Free online video tutorials with worked examples for Python web scraping, data analysis, and web development:

https://www.xin3721.com/eschool/pythonxin3721/

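Basic development environment
Python 3.6
PyCharm
wkhtmltopdf
Modules used
re
requests
concurrent.futures
Install Python and add it to your PATH environment variable, then install the required modules with pip. Since re and concurrent.futures are part of the standard library, only requests needs to be installed.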

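1. Defining the requirements
Who doesn't send a few memes when chatting these days? Memes are an important tool in conversation and a great way to bring friends closer. When a chat turns awkward, a well-timed meme makes the awkwardness disappear.

This article uses Python to batch-download meme images and keep them for later use.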


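2. Analyzing the page data

As the screenshot shows, the image data on Doutula (doutula.com) is all contained in a tags. We can try requesting the page directly and checking whether the data in the response also contains the image addresses.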

import requests


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def main(html_url):
    response = get_response(html_url)
    print(response.text)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)

Press Ctrl + F in the output to search for an image address.
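
The same check can also be done in code rather than by eye. The following is only a minimal sketch, assuming the page stores image addresses in an attribute such as data-original; the helper name count_attribute and the sample HTML are hypothetical and stand in for response.text.

```python
def count_attribute(html_text, attribute='data-original'):
    """Count how many times attribute="..." appears in the HTML text."""
    return html_text.count(attribute + '="')


# Hypothetical sample standing in for response.text
sample_html = (
    '<a><img data-original="https://example.com/a.jpg" '
    'data-backup="https://example.com/a_backup.jpg"></a>'
    '<a><img data-original="https://example.com/b.gif"></a>'
)

# A count above zero suggests the image addresses are in the raw HTML
print(count_attribute(sample_html))
```

If the count is zero for a real response, the images are probably loaded some other way and a plain request to the page would not be enough.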


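One thing to note here: in the response returned when the page is requested from Python, the image URL appears in these attributes:
data-original="image url"
data-backup="image url"

To extract the URLs you can use the parsel parsing library or re regular expressions. Earlier articles in this series used parsel, so this one uses a regular expression.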

urls = re.findall('data-original="(.*?)"', response.text)
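
To see what this pattern returns without touching the network, here is a small sketch run against a hypothetical HTML fragment. The markup and the example.com addresses are assumptions made up for illustration, not the site's real output.

```python
import re

# Hypothetical HTML fragment shaped like the response described above
sample_html = '''
<a href="/photo/1">
  <img data-original="https://example.com/images/smile.jpg"
       data-backup="https://example.com/backup/smile.jpg" alt="smile">
</a>
<a href="/photo/2">
  <img data-original="https://example.com/images/cry.gif" alt="cry">
</a>
'''

# (.*?) is non-greedy, so each match stops at the first closing quote
urls = re.findall(r'data-original="(.*?)"', sample_html)
print(urls)
```

Because the pattern has a single capture group, re.findall returns a list of the captured strings only, which is why the loop in the full listing can use each item directly as a download link.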

Complete code for scraping a single page

import os
import re

import requests


def get_response(html_url):
    # Send a browser-like User-Agent with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    # Download the image bytes and write them into the images folder
    image_content = get_response(image_url).content
    os.makedirs('images', exist_ok=True)  # create the folder if it is missing
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(image_content)
        print(image_name)


def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)
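
The listing above takes the file name from link.split('/')[-1], which works when the URL ends in a plain file name. As a hedged alternative, the sketch below uses urllib.parse so that a query string does not end up in the file name. The helper name image_name_from_url and the example.com URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlparse


def image_name_from_url(image_url):
    """Return the last path segment of a URL, ignoring any query string."""
    path = urlparse(image_url).path
    return posixpath.basename(path)


plain = 'https://example.com/images/smile.jpg'
with_query = 'https://example.com/images/smile.jpg?size=large'

print(image_name_from_url(plain))
print(image_name_from_url(with_query))
# For comparison, the simple split keeps the query string
print(with_query.split('/')[-1])
```

Whether this matters depends on the URLs the site actually returns; for addresses without a query string both approaches give the same name.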

Multithreaded scraping of every image on the site (if you have the storage for it)

There are 3,631 pages of data, with every kind of meme you could want.

import concurrent.futures
import os
import re

import requests


def get_response(html_url):
    # Send a browser-like User-Agent with the request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    # Download the image bytes and write them into the images folder
    image_content = get_response(image_url).content
    os.makedirs('images', exist_ok=True)  # create the folder if it is missing
    filename = os.path.join('images', image_name)
    with open(filename, mode='wb') as f:
        f.write(image_content)
        print(image_name)


def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    # ThreadPoolExecutor is the thread pool object
    # max_workers is the maximum number of worker threads
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    for page in range(1, 3632):
        url = f'https://www.doutula.com/photo/list/?page={page}'
        # submit adds a task to the thread pool
        executor.submit(main, url)
    # Wait for all submitted tasks to finish
    executor.shutdown()
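
In the listing above, any exception raised inside main is stored on the Future that submit returns and is never displayed, because the result is not read. The sketch below shows one possible way to collect results and errors with concurrent.futures.as_completed. To stay runnable offline it uses a hypothetical fake_download function in place of main, and the helper name run_pages is also an assumption.

```python
import concurrent.futures


def fake_download(page):
    """Stand-in for main(url): fails on one page to show error handling."""
    if page == 3:
        raise ValueError(f'page {page} failed')
    return f'page {page} done'


def run_pages(pages, max_workers=3):
    """Run fake_download for each page and collect results and errors."""
    results, errors = {}, {}
    # Leaving the with block calls shutdown(wait=True) automatically
    with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_page = {executor.submit(fake_download, p): p for p in pages}
        # as_completed yields each future as soon as it finishes
        for future in concurrent.futures.as_completed(future_to_page):
            page = future_to_page[future]
            try:
                results[page] = future.result()
            except Exception as exc:
                errors[page] = exc
    return results, errors


results, errors = run_pages(range(1, 6))
print(sorted(results))
print(errors)
```

With a real scraper, the pages recorded in errors could be retried later instead of being lost silently.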
