1. Overview
News is an important channel through which we learn about the outside world. In the past, we mostly got our news from newspapers and television. Back then, getting news carried a real cost and was also inefficient.
Today there are many convenient ways to get news, and the major platforms are flooded with large volumes of duplicated stories. Getting news now costs almost nothing. The problem has shifted to filtering the news and judging how credible it is.
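Below, Sina News (新浪新闻) serves as the collection target for a starting-point demonstration of the whole process, from collecting the news to analyzing it.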
2. Collection workflow
The workflow has four main steps:
2.1 Collection
From the Sina rolling news page, identify the API that serves the news, then collect the news concurrently.
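As a rough illustration of the concurrent-collection pattern only, here is a minimal sketch. Everything named in it is an assumption made for illustration: the response shape (`result` → `data`), the field names `title` and `intro`, and the `fetch_page` callable are hypothetical and may not match what the Sina endpoint actually returns. A stand-in fetcher replaces the real HTTP call so the sketch runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_items(payload):
    """Keep only the headline and summary of each news item in one response."""
    # Assumed (hypothetical) response shape:
    # {"result": {"data": [{"title": ..., "intro": ...}, ...]}}
    items = payload.get("result", {}).get("data", [])
    return [
        {"title": it.get("title", ""), "summary": it.get("intro", "")}
        for it in items
    ]


def collect(fetch_page, pages, max_workers=4):
    """Fetch several result pages concurrently and flatten them into one list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as `pages`.
        payloads = list(pool.map(fetch_page, pages))
    return [item for payload in payloads for item in extract_items(payload)]


def fake_fetch_page(page):
    """Stand-in for a real HTTP request; returns one canned item per page."""
    return {
        "result": {
            "data": [{"title": f"headline {page}", "intro": f"summary {page}"}]
        }
    }


news = collect(fake_fetch_page, pages=range(1, 4))
print(news)
```

In real use, `fetch_page` would send the HTTP request to whichever endpoint the rolling news page turns out to call and return the decoded JSON. A thread pool suits this kind of work because the time is spent waiting on network I/O rather than on computation.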
To keep things simple, only the headline and summary of each news item are collected here.
```python