Python Crawler


Author: navystar, 2024-03-05 16:22:41, visible to everyone, 71 views

Learning web scraping

https://www.acwing.com/blog/content/41265/ (an expert's study notes)

A simple example using bs4:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "*"
}
url = "https://www.shicimingju.com/book/sanguoyanyi.html"
page_text = requests.get(url=url, headers=headers).content.decode('utf-8')
soup = BeautifulSoup(page_text, 'lxml')
# Each <li> in the table of contents links to one chapter
chapters = soup.select('.book-mulu > ul > li')
with open('./a.txt', 'w', encoding='utf-8') as fp:
    for li in chapters:
        # get_text() still returns a string when the <a> has nested tags
        title = li.a.get_text()
        de_url = 'https://www.shicimingju.com' + li.a['href']
        # Fetch and parse the chapter page
        de_page_text = requests.get(url=de_url, headers=headers).content.decode('utf-8')
        de_soup = BeautifulSoup(de_page_text, 'lxml')
        div_tag = de_soup.find('div', class_='chapter_content')
        # Remove all ASCII spaces from the chapter text
        content = div_tag.text.replace(" ", "")
        fp.write(title + '\n')
        fp.write(content + '\n\n')

Excerpt of the output written to a.txt:

第一回·宴桃园豪杰三结义斩黄巾英雄首立功
滚滚长江东逝水，浪花淘尽英雄。是非成败转头空。青山依旧在，几度夕阳红。白发渔樵江渚上，惯看秋月春风。一壶浊酒喜相逢。古今多少事，都付笑谈中。
——调寄《临江仙》
话说天下大势，分久必合，合久必分。
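
The script above builds each chapter URL by concatenating the site root with the `href` taken from the table of contents. Below is a minimal sketch of the same step using the standard library's `urllib.parse.urljoin`, which also copes with hrefs that are relative to the page or already absolute. `chapter_url` is a hypothetical helper, and the sample hrefs are made up for illustration rather than taken from the site.

```python
from urllib.parse import urljoin

# Page the table of contents was scraped from (same URL as in the script above).
BASE_URL = "https://www.shicimingju.com/book/sanguoyanyi.html"


def chapter_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a link found on the contents page against that page's URL."""
    return urljoin(base, href)


if __name__ == "__main__":
    # Hypothetical hrefs, shown only to illustrate the three common shapes.
    samples = (
        "/book/sanguoyanyi/1.html",    # root-relative
        "sanguoyanyi/2.html",          # relative to the page's directory
        "https://example.com/3.html",  # already absolute
    )
    for href in samples:
        print(chapter_url(href))
```

In the scraping loop this would replace the string concatenation, for example `de_url = chapter_url(li.a['href'])`, assuming the hrefs are meant to be resolved against the contents page.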

Below is a simple example using XPath (the site seems to have some kind of AI-based detection):

import requests
from lxml import etree

if __name__ == "__main__":
    url = "https://bj.58.com/ershoufang/"
    headers = {
        'User-Agent': '*'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    # Locate the listing cards; this path depends on the page layout
    listings = tree.xpath(
        '//*[@id="esfMain"]/section[@class="list-body"]'
        '/section[3]/section[1]/section[2]/div')
    with open('./a4.txt', 'w', encoding='utf-8') as fp:
        for item in listings:
            titles = item.xpath('./a/div[2]/div[1]/div[1]/h3/text()')
            prices = item.xpath('./a/div[2]/div[2]/p[@class="property-price-average"]/text()')
            # Skip cards without a title or price instead of raising IndexError
            if not titles or not prices:
                continue
            fp.write(titles[0] + '\t')
            fp.write(prices[0] + '\n')

Sample line from the output written to a4.txt:

富华家园全南眼睛户型带车位交易仅 59174元/㎡
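
The script writes each listing to `a4.txt` as a title, a tab, and a price string such as `59174元/㎡`. If the prices need to be compared or sorted later, a small parser along these lines may help. It is only a sketch: `parse_listing` is a hypothetical helper, and it assumes the price is a run of digits followed by a unit, which is all the sample line shows.

```python
import re
from typing import Optional, Tuple

# Leading digits of a price such as "59174元/㎡"; the unit is ignored.
_PRICE_RE = re.compile(r"\s*(\d+)")


def parse_listing(line: str) -> Optional[Tuple[str, int]]:
    """Split one 'title<TAB>price' line into (title, numeric price).

    Returns None when the line has no tab or the price has no leading digits.
    """
    # The price is the last field, so split on the final tab.
    title, sep, price_text = line.rstrip("\n").rpartition("\t")
    if not sep:
        return None
    match = _PRICE_RE.match(price_text)
    if match is None:
        return None
    return title.strip(), int(match.group(1))


if __name__ == "__main__":
    sample = "富华家园全南眼睛户型带车位交易仅\t59174元/㎡\n"
    print(parse_listing(sample))
```

Returning `None` for malformed lines keeps the caller in control of whether to skip or log them.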
