Web Scraping Notes
https://www.acwing.com/blog/content/41265/ (an expert's study notes)
About requests
Workflow: 1. specify the URL; 2. send the request; 3. get the response data; 4. persist the results.
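A minimal sketch of those four steps before the real examples (https://example.com and page.html are placeholders, not from the video):

import requests

if __name__ == "__main__":
    url = "https://example.com"  # 1. specify the URL
    resp = requests.get(url=url)  # 2. send the request
    text = resp.text  # 3. get the response data
    with open('./page.html', 'w', encoding='utf-8') as fp:
        fp.write(text)  # 4. persist the results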
Below is a simple example of crawling Sogou search (GET):
import requests

if __name__ == "__main__":
    url = "https://www.sogou.com/web"
    header = {
        'User-Agent': '*'  # placeholder; use a real browser UA string in practice
    }
    keyword = input("please input a word: ")
    param = {
        'query': keyword  # query-string parameter appended to the URL
    }
    res = requests.get(url=url, params=param, headers=header).text
    print(res)
Below is a simple example for Baidu Translate (POST):
import requests

if __name__ == "__main__":
    url = "https://fanyi.baidu.com/sug"
    header = {
        'User-Agent': '*'  # placeholder UA
    }
    keyword = input("please input a word: ")
    data = {
        'kw': keyword  # I used the keyword "love" here
    }
    res = requests.post(url=url, data=data, headers=header).json()
    with open('./a1.json', 'w', encoding='utf-8') as fp:
        for ex in res['data']:
            s1 = ex['k']  # the looked-up term
            s2 = ex['v']  # its dictionary entry
            s3 = "关键词:" + s1 + ":" + s2 + "\n"
            fp.write(s3)
Contents of a1.json:
关键词:love:vt.& vi. 喜欢; 爱,热爱; 爱戴; 赞美,称赞 vt. 喜欢; 喜爱; 喜好; 爱慕 n.
关键词:Love:[人名] [英格兰人姓氏] 洛夫来源于中世纪英语教名及古英语女子名Lufu,或相应的男子名Lufa,
关键词:LOVE:abbr. League of Victims and Emphathi-zers (pro-cap
关键词:loved:v. 喜欢; 喜爱; 爱,热爱( love的过去式和过去分词 ); 喜好
关键词:Loved:[电影]爱的诺言
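One thing worth noting: a1.json above actually holds plain text lines, not JSON. If you want real JSON on disk, a minimal sketch (reusing res from the example above; a1_raw.json is just a hypothetical filename):

import json

# Dump the raw response dict as actual JSON
with open('./a1_raw.json', 'w', encoding='utf-8') as fp:
    json.dump(res, fp, ensure_ascii=False, indent=2)  # ensure_ascii=False keeps Chinese readable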
Below is a simple example for the Douban movie chart:
import requests

if __name__ == "__main__":
    url = "https://movie.douban.com/j/chart/top_list"
    header = {
        'User-Agent': '*'  # placeholder UA
    }
    param = {
        'type': '24',            # genre id
        'interval_id': '100:90',
        'action': '',
        'start': '0',            # offset of the first movie
        'limit': '20'            # number of movies per request
    }
    res = requests.get(url=url, params=param, headers=header).json()
    with open('./a2.json', 'w', encoding='utf-8') as fp:
        for r in res:
            s1 = r['title']
            s2 = r['url']
            s3 = r['release_date']
            s4 = r['score']
            s6 = str(r['rank'])
            s8 = r['cover_url']
            # str(list).strip("[]") keeps the quoted items but drops the brackets
            s9 = str(r['actors']).strip("[]")
            s10 = str(r['types']).strip("[]")
            s11 = str(r['regions']).strip("[]")
            s12 = str(r['vote_count'])
            s7 = "是" if r['is_watched'] else "否"  # watched flag
            fp.write("影片:" + s1 + "\n")
            fp.write("观看地址:" + s2 + "\n")
            fp.write("上映时间:" + s3 + "\n")
            fp.write("类型:" + s10 + "\n")
            fp.write("地域:" + s11 + "\n")
            fp.write("评分:" + s4 + "\n")
            fp.write("排行:" + s6 + "\n")
            fp.write("图片地址:" + s8 + "\n")
            fp.write("参演人物:" + s9 + "\n")
            fp.write("当前有" + s12 + "条评价\n")
            fp.write("当前是否观看:" + s7 + "\n")
            fp.write("\n\n\n")
Contents of a2.json (first entry):
影片:美丽人生
观看地址:https://movie.douban.com/subject/1292063/
上映时间:2020-01-03
类型:'剧情', '喜剧', '爱情', '战争'
地域:'意大利'
评分:9.5
排行:1
图片地址:https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2578474613.jpg
参演人物:'罗伯托·贝尼尼', '尼可莱塔·布拉斯基', '乔治·坎塔里尼', '朱斯蒂诺·杜拉诺', '赛尔乔·比尼·布斯特里克', '玛丽萨·帕雷德斯', '霍斯特·布赫霍尔茨', '利迪娅·阿方西', '朱利亚娜·洛约迪切', '亚美利哥·丰塔尼', '彼得·德·席尔瓦', '弗朗西斯·古佐', '拉法埃拉·莱博罗尼', '克劳迪奥·阿方西', '吉尔·巴罗尼', '马西莫·比安奇', '恩尼奥·孔萨尔维', '吉安卡尔洛·科森蒂诺', '阿伦·克雷格', '汉尼斯·赫尔曼', '弗兰科·梅斯科利尼', '安东尼奥·普雷斯特', '吉娜·诺维勒', '理查德·塞梅尔', '安德烈提多娜', '迪尔克·范登贝格', '奥梅罗·安东努蒂', '沈晓谦', '张欣'
当前有1366433条评价
当前是否观看:否
Below is a simple example for the KFC restaurant list:
import json
import requests

if __name__ == "__main__":
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    header = {
        'User-Agent': '*'  # placeholder UA
    }
    idx = 1  # running id across all pages (renamed to avoid shadowing the id() builtin)
    with open('./a3.json', 'w', encoding='utf-8') as fp:
        for p in range(1, 29):  # pages 1..28 for this keyword
            data = {
                'cname': '',
                'pid': '',
                'keyword': '北京',
                'pageIndex': p,  # being lazy here: passing the int directly works too
                'pageSize': '10'
            }
            res = requests.post(url=url, data=data, headers=header).text
            stores = json.loads(res)  # equivalent to calling .json() on the response
            for l in stores['Table1']:
                l1 = l['addressDetail']
                l2 = l['cityName']
                l3 = str(l['pro'])  # str() turns a missing value (None) into "None"
                l4 = l['storeName']
                if l3 == "24小时":
                    l3 = "全天营业"
                elif l3 == "None":
                    l3 = "未知营业时间"
                fp.write("id: " + str(idx) + "\n")
                fp.write("餐厅名称:" + l4 + '\n')
                fp.write("餐厅地址:" + l1 + "\n")
                fp.write("餐厅城市:" + l2 + "\n")
                fp.write("点餐方式:" + l3 + "\n")
                fp.write("\n\n\n")
                idx = idx + 1
Contents of a3.json (first entry):
id: 1
餐厅名称:京通新城
餐厅地址:朝阳路杨闸环岛西北京通苑30号楼一层南侧
餐厅城市:北京市
点餐方式:24小时,Wi-Fi,点唱机,店内参观,礼品卡
Some parts of the video are out of date and can no longer be reproduced (I couldn't find the cosmetics site, and anti-crawling measures have since appeared, so that method no longer applies).
Data-parsing approaches
regex (see the sketch after this list)
bs4
xpath
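The notes don't include a regex example, so here is a minimal sketch of that approach, run against the same Douban Top 250 page as the bs4 example below; the rating_num class comes from that example, everything else is an assumption:

import re
import requests

headers = {
    "User-Agent": "*"  # placeholder UA
}
resp = requests.get("https://movie.douban.com/top250", headers=headers)
# Non-greedily capture the text inside each rating span
for score in re.findall(r'<span class="rating_num"[^>]*>(.*?)</span>', resp.text):
    print(score)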
A simple example of HTML parsing with bs4:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "*"  # placeholder UA
}
for s in range(0, 250, 25):  # the Top 250 list is paginated 25 movies per page
    resp = requests.get(f"https://movie.douban.com/top250?start={s}", headers=headers)
    html = resp.text
    soup = BeautifulSoup(html, "html.parser")
    ratings = soup.find_all("span", attrs={"class": "rating_num"})  # the score spans
    for t in ratings:
        t1 = t.string
        if "/" not in t1:
            print(t1)
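Incidentally, the "/" filter matters mainly when pulling titles rather than scores: each movie's alternate title on that page sits in a span of its own whose text starts with "/". A sketch of that variant, reusing soup from the loop above and assuming the title spans carry class="title" (an assumption, not verified here):

# Hypothetical variant: collect titles instead of scores
titles = soup.find_all("span", attrs={"class": "title"})
for t in titles:
    t1 = t.string
    if t1 and "/" not in t1:  # skip the "/ alternate title" spans
        print(t1)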
Below: using xpath with lxml to grab pictures of the lovely streamer ladies from a certain site 😍😍
import requests
from lxml import etree
from urllib import request

if __name__ == "__main__":
    url = "https://www.huya.com/g/4079"
    header = {
        'User-Agent': '*'  # placeholder UA
    }
    res = requests.get(url=url, headers=header).text
    data = etree.HTML(res)  # parse the HTML for xpath queries
    ans = data.xpath('//img[@class="pic"]')  # every thumbnail <img class="pic">
    for i in ans:
        newUrl = i.xpath('./@data-original')[0]  # real image URL (lazy-loaded)
        newName = i.xpath('./@alt')[0]
        request.urlretrieve(newUrl, r'D:\py3\demo\test_demo\img' + '\\' + newName + '.jpg')
        print("saved " + newName)
I won't post the other pictures here, in case this post gets taken down 😭