Web Scraping --- Scraping a Biquge Novel as an Example


1. Sending the Request

import requests

url = "https://www.xzmncy.com/list/5418/2610707.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0"
}
# headers must be passed as a keyword argument;
# requests.get(url, headers) would send it as query parameters instead
response = requests.get(url, headers=headers)

This sends the request with requests; if response.status_code is 200, the request succeeded.
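A quick sanity check before moving on (a minimal sketch, reusing the response object from the request above):

if response.status_code == 200:
    print("Request OK")
else:
    print("Request failed with status:", response.status_code)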

2. Getting the Data

response = requests.get(url, headers=headers).text

The call above returns the content of the HTML page. The page is full of tags, so the information we want cannot be read off directly; the data has to be parsed first.
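To see why, you can print the start of the returned HTML (a small sketch; response here is the string from the step above):

print(response[:300])  # raw markup: tags everywhere, not clean text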

3. Parsing the Data

There are several ways to do this: css selectors, xpath, re regular expressions, and so on.

Let's look at how each of these three methods extracts a chapter title.

css
import parsel

selector = parsel.Selector(response)
# ::text extracts the text node inside the matched element
novel_title = selector.css(".bookname h1::text").get()

This method picks out the element with a css selector.

xpath
import parsel

selector = parsel.Selector(response)
# single quotes around the expression keep the inner double quotes valid
novel_title = selector.xpath('//*[@class="bookname"]/h1/text()').get()

Note the () after text: in xpath, text() is a function call and needs the parentheses.
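The parentheses matter: with text() you get just the string, without it you get the whole element markup (a small comparison sketch using the same selector):

# with text(): just the title string
selector.xpath('//*[@class="bookname"]/h1/text()').get()
# without text(): the full element, e.g. "<h1>...</h1>"
selector.xpath('//*[@class="bookname"]/h1').get()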

re
import re

# the page has only one <h1>, so the first match is the title
novel_title = re.findall("<h1>(.*?)</h1>", response)[0]

This works because the HTML page contains only one h1, so I match it directly. findall returns a list, so I index it with [0] at the end to get the string itself.
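A tiny standalone demo of why the [0] is needed (the sample HTML string is hypothetical):

import re

html = "<h1>Chapter One</h1>"  # hypothetical sample
matches = re.findall("<h1>(.*?)</h1>", html)
print(matches)     # ['Chapter One'] -- findall always returns a list
print(matches[0])  # Chapter One    -- index to get the string itself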

*Note: each of these methods has its pros and cons; choose the one that suits the data you are parsing.

4. Saving the Data

with open("file_name"+".txt",mode="w",encoding="utf-8") as f:  #w是写入但是覆盖,a是追加写入,写入文件末尾 wb是二进制写入
f.write(novel_context) #写入文件

The download_path in open(download_path, mode="...", encoding="utf-8") can also be an absolute path.
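For example, writing to an absolute path instead of the working directory (a sketch; the path and file name here are made up):

download_path = r"C:\Users\me\Desktop\novel.txt"  # hypothetical location
with open(download_path, mode="w", encoding="utf-8") as f:
    f.write(novel_context)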

Now that the approach is clear, let's put it into practice:

Scraping a Single Chapter

import parsel
import requests

url = "https://www.xzmncy.com/list/5418/2610707.html"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0"
}
response = requests.get(url, headers=headers).text

selector = parsel.Selector(response)
novel_title = selector.css(".bookname h1::text").get()  # parse the title with css

# each <p> inside #htmlContent is one paragraph of the chapter body
novel_context_list = selector.css("#htmlContent p::text").getall()
novel_context = "\n".join(novel_context_list)

Note how the join function works:

a=["1","2","8","9"]
print(" ".join(a)) #输出1 2 8 9
print("\n".join(a)) #输出1(换行)2(换行)8(换行)9

b={"a":1,"b":2}
print(" ".join(a)) #输出a b

(Note that the sequence items must all be strings; ints are not allowed.)
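If the items are not strings, join raises a TypeError; convert them first (a minimal sketch):

nums = [1, 2, 8, 9]
# "\n".join(nums) would raise TypeError: expected str instance, int found
print("\n".join(map(str, nums)))  # convert each int to str first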

Scraping the URL of Every Chapter

import requests
import re

url = "https://www.xzmncy.com/list/18753/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0"
}
response = requests.get(url, headers=headers).text

novel_name = re.findall("<h1>(.*?)</h1>", response)[0]
# every chapter link sits in a <dd><a href="...">title</a></dd> entry
novel_info = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', response)
for novel_url_part, novel_title in novel_info:
    # keep the path part of the href (its first 24 characters), then prepend the domain
    novel_url = "https://www.xzmncy.com" + novel_url_part[0:24]
    print(novel_url)
    print(novel_title)

On the novel's chapter-list page, each a tag holds one chapter's url. At this point the data we get looks like this:

(screenshot: the printed chapter urls and titles)

We can then use re to parse out each chapter's url and title.
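If you would rather not write regex against raw HTML, the same links can be pulled with a css selector (a sketch assuming the chapter list sits in dd > a elements, as the regex above implies; details may differ on the live page):

import parsel

selector = parsel.Selector(response)
hrefs = selector.css("dd a::attr(href)").getall()   # href attribute of each link
titles = selector.css("dd a::text").getall()        # link text = chapter title
for href, title in zip(hrefs, titles):
    print("https://www.xzmncy.com" + href, title)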

Complete Code

import requests
import re
import parsel

url = "https://www.xzmncy.com/list/18753/"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/116.0"
}
response = requests.get(url, headers=headers).text

novel_name = re.findall("<h1>(.*?)</h1>", response)[0]
novel_info = re.findall('<dd><a href="(.*?)">(.*?)</a></dd>', response)
for novel_url_part, novel_title in novel_info:
    novel_url = "https://www.xzmncy.com" + novel_url_part[0:24]
    novel_response = requests.get(novel_url, headers=headers).text
    selector = parsel.Selector(novel_response)
    novel_context_list = selector.css("#htmlContent p::text").getall()
    novel_context = "\n".join(novel_context_list)
    print("Saving " + novel_title)
    novel_title = "*" + novel_title
    # append every chapter to one txt file named after the novel
    with open(novel_name + ".txt", mode="a", encoding="utf-8") as f:
        f.write(novel_title)
        f.write("\n")
        f.write(novel_context)
        f.write("\n")
        f.write("\n")

Run the code and a txt file will appear in the current directory, containing the novel you wanted~

