Web scraping study notes
https://www.acwing.com/blog/content/41265/ (an expert's study notes)
A simple example using bs4:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "*"
}
url = "https://www.shicimingju.com/book/sanguoyanyi.html"
# Fetch the table-of-contents page and decode the raw bytes as UTF-8
page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
res = BeautifulSoup(page_text, 'lxml')
# One <li> per chapter (named li_list so the built-in `list` is not shadowed)
li_list = res.select('.book-mulu > ul > li')
with open('./a.txt', 'w', encoding='utf-8') as fp:
    for li in li_list:
        title = li.a.string
        de_url = 'https://www.shicimingju.com' + li.a['href']
        # Fetch and parse the detail page of this chapter
        de_page_text = requests.get(url=de_url, headers=headers).content.decode('utf-8')
        de_soup = BeautifulSoup(de_page_text, 'lxml')
        div_tag = de_soup.find('div', class_='chapter_content')
        content = div_tag.text.replace(" ", "")
        fp.write(title + '\n')
        fp.write(content + '\n\n')
Excerpt of the resulting a.txt:
第一回·宴桃园豪杰三结义 斩黄巾英雄首立功
滚滚长江东逝水,浪花淘尽英雄。是非成败转头空。青山依旧在,几度夕阳红。 白发渔樵江渚上,惯看秋月春风。一壶浊酒喜相逢。古今多少事,都付笑谈中。
——调寄《临江仙》
话说天下大势,分久必合,合久必分。
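The selector logic above can be tried offline before hitting the real site. The following is a minimal sketch, assuming a made-up HTML fragment (SAMPLE_HTML) that only imitates the table-of-contents layout; the helper name extract_chapters is hypothetical, and the chapter paths are placeholders rather than the site's real markup. It uses the standard-library html.parser instead of lxml, and urljoin in place of string concatenation so that both relative and absolute hrefs resolve correctly.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical stand-in for the table-of-contents page (not the site's real markup)
SAMPLE_HTML = """
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
    <li>no link here</li>
  </ul>
</div>
"""

BASE_URL = "https://www.shicimingju.com/book/sanguoyanyi.html"


def extract_chapters(html, base_url):
    """Return (title, absolute_url) pairs for every chapter link found."""
    soup = BeautifulSoup(html, "html.parser")
    chapters = []
    for li in soup.select(".book-mulu > ul > li"):
        link = li.a
        # Skip entries without a usable link instead of raising an exception
        if link is None or not link.get("href"):
            continue
        title = link.get_text(strip=True)
        chapters.append((title, urljoin(base_url, link["href"])))
    return chapters


chapters = extract_chapters(SAMPLE_HTML, BASE_URL)
for title, chapter_url in chapters:
    print(title, chapter_url)
```

Skipping entries without a link is a design choice for this sketch; in the script above, a missing a tag would raise an AttributeError instead.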
Below is a simple example using XPath (the site seems to have AI-based detection):
import requests
from lxml import etree

if __name__ == "__main__":
    url = "https://bj.58.com/ershoufang/"
    header = {
        'User-Agent': '*'
    }
    page_text = requests.get(url=url, headers=header).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath(
        '//*[@id="esfMain"]/section[@class="list-body"]'
        '/section[3]/section[1]/section[2]/div')  # locate the listing entries on the page
    with open('./a4.txt', 'w', encoding='utf-8') as fp:
        for li in li_list:
            # The leading "./" makes each query relative to the current listing
            title = li.xpath('./a/div[2]/div[1]/div[1]/h3/text()')[0]
            price = li.xpath('./a/div[2]/div[2]/p[@class="property-price-average"]/text()')[0]
            fp.write(title + '\t')
            fp.write(price + '\n')
Sample line from a4.txt:
富华家园 全南眼睛户型 带车位 交易仅 59174元/㎡
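Indexing an XPath result with [0] raises an IndexError whenever a listing lacks the expected node, which can happen if the page layout changes or a detection page is served instead. Below is a minimal offline sketch of a guarded lookup, assuming a made-up HTML fragment (SAMPLE_HTML) with invented listing names and a placeholder price; the helper first_text and the class name property are hypothetical and are not taken from the real 58.com markup.

```python
from lxml import etree

# Hypothetical listing markup; the second entry deliberately has no price node
SAMPLE_HTML = """
<html><body>
<div class="property">
  <div class="info"><h3>Listing A</h3></div>
  <p class="property-price-average">59174 CNY/m2</p>
</div>
<div class="property">
  <div class="info"><h3>Listing B</h3></div>
</div>
</body></html>
"""


def first_text(node, path, default=""):
    """Return the first XPath text match under node, or default if none."""
    results = node.xpath(path)
    return results[0].strip() if results else default


tree = etree.HTML(SAMPLE_HTML)
rows = []
for item in tree.xpath('//div[@class="property"]'):
    # ".//" searches all descendants of the current listing only
    title = first_text(item, './/h3/text()')
    price = first_text(
        item, './/p[@class="property-price-average"]/text()', default="N/A")
    rows.append((title, price))

for title, price in rows:
    print(title + '\t' + price)
```

Using ".//" rather than a fixed chain of div[2]/div[1] steps trades precision for robustness: the query still matches if wrapper elements are added, at the cost of possibly matching an unintended node.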