python带你采集爆火动漫弹幕,并且做词云图可视化分析- Python

前言 ?

大家早好、午好、晚好吖~

代码提供者: 青灯教育-巳月

知识点介绍:

requests模块的使用
pandas读取表格数据
pyecharts做词云图可视化

环境介绍:

python 3.8
pycharm
requests >>> pip install requests
pyecharts >>> pip install pyecharts
jieba >>> pip install jieba
pandas >>> pip install pandas

如果安装python第三方模块:

win + R 输入 cmd 点击确定, 输入安装命令 pip install 模块名 (pip install requests) 回车
在pycharm中点击Terminal(终端) 输入安装命令

相对应的安装包/安装教程/激活码/使用教程/学习资料/工具插件 可以加落落老师微信:xinlian_00

案例:

分析数据来源 (数据动态的还是静态的)

动态数据: 在当前网页源代码里面找不到的内容

静态数据: 当前网页源代码里面能够找到的内容

实现代码:

发送请求访问网站 requests
获取数据
解析数据提取想要的内容去掉不想要的内容
保存数据

代码

导入模块

import requests     # 发送请求 第三方模块
import csv          # 内置模块
2

源码、解答、教程加Q裙：261823976 点击蓝字加入【python学习裙】

请添加图片描述

f = open('弹幕.csv', mode='a', encoding='utf-8-sig', newline='')
csv_writer = csv.writer(f)
csv_writer.writerow(['nick', 'create_time', 'content'])
for page in range(0, 46):
    print(f"--------------正在爬取第{page}页--------------")
    url = f'https://dm.video.qq.com/barrage/segment/q0044rg63ub/t/v1/{page * 30000}/{page * 30000 + 30000}'
2
3
4
5
6

1. 发送请求

    response = requests.get(url)

2. 获取数据

    # .text: 文本内容 解析 不太方便
    # .json(): json格式的 方便接下来的数据解析
    # .content: 获取二进制数据 图片 音频 视频
    json_data = response.json()
2
3
4

3. 解析数据

    # xpath/css/re/json
    # 我们获取下来的数据:
    # 非结构化数据: css/xpath(用法) 网页源代码<div></div>  lxml parsel bs4
    # 结构化数据: json数据 Python字典键值对取值方式  {"": ""}
    barrage_list = json_data['barrage_list']
    # 列表 [{}, {}, {}...... {}]
    for barrage in barrage_list:
        content = barrage['content']
        create_time = barrage['create_time']
        nick = barrage['nick']
        print(nick, create_time, content)
2
3
4
5
6
7
8
9
10
11

4. 保存数据

        csv_writer.writerow([nick, create_time, content])

效果

词云图代码

from pyecharts.charts import WordCloud      # 导入的词云图模块
from pyecharts import options as opts       # pyecharts设置选项
import pandas as pd                         # 操作表格模块
import jieba    
2
3
4

f = open('弹幕.csv', encoding="utf-8-sig")
# 1. 读取数据
data = pd.read_csv(f)['content']
# 把所有的弹幕转成列表
data_str = str(data.values.tolist()).replace("'", '').replace(',', '').replace('[', '').replace(']', '').replace(' ', '')
words = jieba.lcut(data_str)
wordlist = []
for word in words:
    if len(word) >= 1:
        wordlist.append({'word': word, 'count': 1})
df = pd.DataFrame(wordlist)
dfword = df.groupby('word')['count'].sum()
word = dfword.index.tolist()
count = dfword.values.tolist()
c = (
    WordCloud()
    .add('', [list(z) for z in zip(word, count)])
)
c.render('1.html')
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19