Python Web Scraping – News Crawling
1. Tech News Flashes
URL: http://www.citreport.com/
The target is the "Tech News Flash" (科技快报) module on the homepage.
1. First, fetch the whole page:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Fetch the homepage and parse it with lxml
html = urlopen("http://www.citreport.com/").read()
soup = BeautifulSoup(html, features='lxml')
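A side note: urllib sends a default User-Agent that some sites reject. If the fetch above ever returns an error page, a minimal sketch is to wrap the URL in a Request with a browser-like header (the header string is my assumption, not something citreport.com is known to require):

from urllib.request import Request, urlopen

# Send a browser-like User-Agent in case the default one is blocked
req = Request("http://www.citreport.com/",
              headers={"User-Agent": "Mozilla/5.0"})
html = urlopen(req).read()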
2. Inspect the page source
Press Ctrl+Shift+C in the browser to inspect the page source.
3. Copy the info from the highlighted element and match it:
# The news-flash module sits in a div with class "right-item news-flash-box"
Paper = soup.find('div', {'class': "right-item news-flash-box"})
print(Paper.get_text())
It's that simple.
4. Get the links of the news-flash items
# Collect every article link inside the module
Links = Paper.find_all("a", {"href": re.compile(r'http://..*?\.html')})
for link in Links:
    print(link['href'])
This prints the links of the news-flash items successfully.
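One caveat: if an item carries more than one <a> tag (say, a thumbnail plus a title link), find_all will return the same href twice. A small sketch that deduplicates while preserving order, assuming such duplicates can occur:

# Keep each href once, in first-seen order
seen = set()
unique_hrefs = []
for link in Links:
    href = link['href']
    if href not in seen:
        seen.add(href)
        unique_hrefs.append(href)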
5. Take one article as a template for scraping
# Take one of the article links as a sample and fetch it
url = Links[1]['href']
html = urlopen(url).read()
soup = BeautifulSoup(html, features='lxml')
6. Inspect the source again to find the markup of the title and body
Inspecting the source shows the title tag is:
<h1 class="ph">
and the article body tag is:
<div class="d">
So the code is:
# Extract the title block and the article body
P_title = soup.find('div', {"class": "h hm cl"}).get_text()
P_body = soup.find('td', {"id": "article_content"}).get_text()
And that works.
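One thing to watch: soup.find() returns None when a page does not contain the expected tag, and calling .get_text() on None raises AttributeError. A defensive sketch, assuming some articles may use a different template:

# Guard against pages that lack the expected title/body markup
title_tag = soup.find('div', {"class": "h hm cl"})
body_tag = soup.find('td', {"id": "article_content"})
if title_tag is not None and body_tag is not None:
    P_title = title_tag.get_text()
    P_body = body_tag.get_text()
else:
    print("Skipping page with unexpected layout:", url)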
7. Put it all together and loop over every article in this module:
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

# Fetch the homepage and locate the news-flash module
html = urlopen("http://www.citreport.com/").read()
soup = BeautifulSoup(html, features='lxml')
Paper = soup.find('div', {'class': "right-item news-flash-box"})
Links = Paper.find_all("a", {"href": re.compile(r'http://..*?\.html')})
Paper = Paper.get_text()

# Visit each article and append its title and body
for link in Links:
    url = link['href']
    html = urlopen(url).read()
    soup = BeautifulSoup(html, features='lxml')
    P_title = soup.find('div', {"class": "h hm cl"}).get_text()
    P_body = soup.find('td', {"id": "article_content"}).get_text()
    Paper += P_title + P_body

print(Paper)
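Since this loop hits the server once per article, it is polite (and safer) to pause between requests and to skip pages that fail to load. A minimal sketch of a hardened loop body, under the assumption that occasional network errors are acceptable to skip:

import time
from urllib.error import URLError

for link in Links:
    try:
        html = urlopen(link['href']).read()
    except URLError:
        continue  # skip unreachable pages instead of crashing
    soup = BeautifulSoup(html, features='lxml')
    P_title = soup.find('div', {"class": "h hm cl"}).get_text()
    P_body = soup.find('td', {"id": "article_content"}).get_text()
    Paper += P_title + P_body
    time.sleep(1)  # one request per second, to be polite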
2. NPR News Headlines
from bs4 import BeautifulSoup
from urllib.request import urlopen

# Fetch the NPR homepage
url = "https://www.npr.org/"
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, 'lxml')

# Each top story sits in a div with class "story-wrap"
story_today = soup.find_all('div', {"class": "story-wrap"})
HeadLine = ""
for i in story_today:
    HeadLine += i.get_text()
print(HeadLine)
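The original code imports time without using it; my guess is it was meant for timestamping the output. A sketch that writes each run's headlines to a dated file (the filename pattern is my own choice):

import time

# Save the collected headlines to e.g. npr_headlines_2019-01-01.txt
fname = "npr_headlines_" + time.strftime("%Y-%m-%d") + ".txt"
with open(fname, "w", encoding="utf-8") as f:
    f.write(HeadLine)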
3. Scraping info from a Bilibili space homepage
Reference: https://zhuanlan.zhihu.com/p/34716924
import ast
from urllib.request import urlopen

# Query the video-stats API for a given video ID (aid)
ID = "86328254"
url = "http://api.bilibili.com/archive_stat/stat?aid=" + ID
html = urlopen(url).read().decode('utf-8')

# The response looks like a dict literal; parse it
d = ast.literal_eval(html)
Cont = d['data']

View = "Views: " + str(Cont['view'])
Like = "Likes: " + str(Cont['like'])
Reply = "Replies: " + str(Cont['reply'])
Coin = "Coins: " + str(Cont['coin'])
Result = "\n".join([View, Like, Reply, Coin])
print(Result)
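Because the API actually returns JSON, json.loads (or the requests library, which the earlier sections already use) is more robust than ast.literal_eval, which fails on JSON literals like true and null. A sketch of the same query, assuming the endpoint from the reference above still responds:

import requests

# Same stats call, parsed as JSON instead of a Python literal
resp = requests.get("http://api.bilibili.com/archive_stat/stat",
                    params={"aid": "86328254"})
d = resp.json()
print("Views:", d['data']['view'])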