Data analysis is the process of applying appropriate statistical methods to large volumes of collected raw data, studying and summarizing the data in detail in order to extract useful information and draw conclusions. Its purpose is to extract information that is not easy to infer directly. Once that information is understood, the mechanisms of the system that produced the data can be studied, and the system's likely responses and evolution can be predicted.
Data analysis began as a form of data protection but has developed into a methodology for data modeling. A model here means a translation of the system under study into mathematical form; once a mathematical or logical model is established, it can predict the system's responses with varying degrees of precision. A model's predictive power depends not only on the quality of the modeling but also on the ability to select high-quality data sets for analysis. Preprocessing tasks such as data acquisition, data extraction, and data preparation therefore also fall within the scope of data analysis and have a significant effect on the final results.
In data analysis, one of the best ways to understand data is to turn it into charts that convey the information, sometimes hidden, contained in the numbers. Data analysis can therefore be seen as a combination of modeling and graphical presentation. The model is used to predict the responses of the system under study, and it is tested against a data set whose outputs are already known. These data are not used to build the model; they are used to check whether the model reproduces the actually observed outputs, which reveals the model's error and clarifies its validity and limitations. The new model is then compared with the existing one, and if the new model performs better, the final step of data analysis, deployment, can proceed. In the deployment phase, predictions from the model are used to make the corresponding decisions while guarding against the potential risks the model predicts.
The data analysis process can be described in a few steps: transforming and processing the raw data, presenting the data visually, and modeling and prediction. Each step is critical to the steps that follow. Data analysis can therefore be broken down into the following stages: problem definition, data acquisition, data preprocessing, data exploration, data visualization, creation and selection of a predictive model, model evaluation, and deployment.
1. Problem definition
Before any analysis begins, the goal must be made explicit: the main questions this analysis is meant to study and the results it is expected to deliver. This is problem definition.
2. Data acquisition
After the problem has been defined, the first task is to obtain the data. Data must be selected with the goal of building a predictive model in mind, and this selection is crucial to the success of the analysis. The sample data collected should reflect the real situation as closely as possible, that is, describe how the system responds to real-world stimuli. If unsuitable data are chosen, or if the data set analyzed does not represent the system well, the resulting model will deviate from the system under study.
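Data can be obtained in the following ways:
① Use SQL statements to pull the relevant business data directly from the enterprise management database. For example, extract all sales data for 2017 together with the 20 best-selling products, or extract consumption data for users in the East China, South China, and western regions.
② Download open data sets published by research institutions, companies, and governments from dedicated websites. These data sets are usually fairly complete and of relatively high quality. Their drawback is that they are often published with a delay, but because they are objective and authoritative they remain very valuable.
③ Write a web crawler to collect data from the Internet. For example, a crawler can gather product sales and review information from Taobao, rental listings for a given city from a rental website, movie lists and ratings from Douban, or comment rankings from NetEase Cloud Music. Crawled data make it possible to analyze a particular industry or group of people, which is a well-targeted approach to market research and competitor analysis.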
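The SQL-based approach can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `sales` table, its columns, the sample rows, and the helper `top_products` are all hypothetical, and an in-memory SQLite database stands in for a real enterprise database.

```python
import sqlite3


def top_products(conn, year, limit=20):
    """Return (product, units_sold) rows for one year, best sellers first."""
    query = """
        SELECT product, SUM(quantity) AS units
        FROM sales
        WHERE strftime('%Y', sold_on) = ?
        GROUP BY product
        ORDER BY units DESC, product ASC
        LIMIT ?
    """
    return conn.execute(query, (str(year), limit)).fetchall()


# Hypothetical in-memory database standing in for the business database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, quantity INTEGER, sold_on TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("keyboard", 5, "2017-03-01"),
        ("mouse", 12, "2017-06-15"),
        ("keyboard", 9, "2017-11-20"),
        ("monitor", 3, "2016-12-31"),  # outside 2017, so it is excluded
    ],
)

rows = top_products(conn, 2017)
print(rows)
```

Passing the year and the limit as query parameters, rather than formatting them into the SQL string, keeps the query safe to reuse for other years or ranking sizes.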
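3. Data preprocessing
Most data obtained through data acquisition are incomplete or inconsistent "dirty data" that cannot be analyzed directly; using them as they are leads to unsatisfactory results. Data preprocessing turns the raw data from the acquisition stage into "clean" data through data cleaning and data transformation. Only clean data yield more accurate analysis results.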
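A minimal sketch of cleaning and transformation, using only the Python standard library. The record layout (`name`, `price`) and the cleaning rules are assumptions chosen for illustration; real projects define their own rules for what counts as incomplete or inconsistent.

```python
def clean_records(records):
    """Drop incomplete rows, normalize text, convert types, remove duplicates."""
    cleaned, seen = [], set()
    for rec in records:
        # Data transformation: normalize whitespace and letter case.
        name = (rec.get("name") or "").strip().lower()
        raw_price = rec.get("price")
        if not name or raw_price in (None, ""):
            continue  # incomplete record: a required field is missing
        try:
            price = float(raw_price)  # data transformation: text to number
        except (TypeError, ValueError):
            continue  # inconsistent record: price is not numeric
        key = (name, price)
        if key in seen:
            continue  # duplicate of a record already kept
        seen.add(key)
        cleaned.append({"name": name, "price": price})
    return cleaned


raw = [
    {"name": " Apple ", "price": "3.5"},
    {"name": "apple", "price": 3.5},    # duplicate after normalization
    {"name": "pear", "price": None},    # missing price
    {"name": "", "price": "2.0"},       # missing name
    {"name": "plum", "price": "n/a"},   # non-numeric price
    {"name": "Kiwi", "price": "1"},
]
clean = clean_records(raw)
print(clean)
```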
4. Data exploration and data visualization
Data exploration is essentially the search through charts or statistics for patterns, connections, and relationships in the data. Data visualization is one of the best ways to extract information. Presenting data visually makes it possible not only to grasp the key information quickly but also to reveal patterns and conclusions that simple statistics cannot show.
Data exploration includes a preliminary examination of the data, determining the data types (categorical or numerical), and choosing the most suitable analysis method for defining the model.
In general, besides a detailed study of the charts produced by data visualization, this stage may include one or more of the following activities.
■ Summarizing the data.
■ Grouping the data.
■ Exploring relationships between attributes.
■ Identifying patterns and trends.
■ Building regression models.
■ Building classification models.
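Two of these activities, grouping the data and exploring the relationship between attributes, can be sketched in plain Python. The sample rows and the column names `region`, `ads`, and `sales` are made up for illustration, and the Pearson correlation coefficient is just one common way to quantify a linear relationship.

```python
from collections import defaultdict
from statistics import mean


def group_means(rows, key, value):
    """Group rows by `key` and return the mean of `value` for each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {name: mean(values) for name, values in groups.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical sample data.
rows = [
    {"region": "east", "ads": 1, "sales": 12},
    {"region": "east", "ads": 2, "sales": 14},
    {"region": "south", "ads": 3, "sales": 16},
    {"region": "south", "ads": 4, "sales": 18},
]

by_region = group_means(rows, "region", "sales")
corr = pearson([row["ads"] for row in rows], [row["sales"] for row in rows])
print(by_region, corr)
```

A coefficient close to 1 or -1 suggests a strong linear relationship worth modeling; a value near 0 suggests that a linear model of these two attributes is unlikely to help.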
5. Creation and selection of a predictive model
A predictive model is a quantitative relationship between things, expressed in mathematical language or formulas and used for prediction. It reveals, to some extent, the underlying regularities between those things and serves as the direct basis for computing predicted values. In this stage, an appropriate statistical model is built or selected to predict the probability of a given outcome.
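Specifically, models serve two main purposes.
① Using a regression model to predict the values of the data the system produces.
② Using a classification or clustering model to classify new data.
By type of output, models fall into the following three kinds.
① Classification model: the output is categorical data.
② Regression model: the output is numerical data.
③ Clustering model: the output is descriptive data.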
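The difference between numerical and categorical output can be illustrated with a small sketch. It fits a straight line by ordinary least squares (a regression model) and then derives a category from the prediction with a threshold. The data, the threshold, and the helper names `fit_line`, `predict`, and `classify` are assumptions for illustration; practical classifiers are usually trained directly on labeled data rather than derived from a regression line.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


def predict(model, x):
    """Regression output: a numerical value."""
    slope, intercept = model
    return slope * x + intercept


def classify(model, x, threshold):
    """Classification output: a category derived from the numeric prediction."""
    return "high" if predict(model, x) >= threshold else "low"


# Hypothetical observations that follow y = 2x + 1 exactly.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

model = fit_line(xs, ys)
print(model)                              # slope and intercept
print(predict(model, 5))                  # numerical output
print(classify(model, 5, threshold=10))   # categorical output
```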
6. Model evaluation
Model evaluation is the testing phase. Part of the original data set is set aside as a validation set, which is then used to assess how well the model built from the previously collected data performs.
In general, the data used to build the model are called the training set, and the data used to validate it are called the validation set.
Comparing the model's outputs with the outputs of the real system gives an estimate of the error rate. Using different test sets reveals the range over which the model is valid. Predictions are in fact valid only within a certain range, or their accuracy varies with the range of the predicted values, so predicted and actual values agree to differing degrees.
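A hold-out evaluation of this kind might look like the following sketch. The data, the 25% validation fraction, and the one-nearest-neighbor classifier are all assumptions chosen to keep the example short; any model could be substituted for `nearest_label`.

```python
import random


def split(data, validation_fraction=0.25, seed=0):
    """Shuffle a copy of the data and split it into training and validation sets."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


def nearest_label(train, x):
    """Predict the label of the training point closest to x (1-nearest neighbor)."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]


def error_rate(train, validation):
    """Fraction of validation points whose predicted label differs from the true one."""
    wrong = sum(1 for x, label in validation if nearest_label(train, x) != label)
    return wrong / len(validation)


# Hypothetical labeled data: two well-separated groups of measurements.
data = [(x, "low") for x in range(10)] + [(x, "high") for x in range(20, 30)]

train, validation = split(data)
rate = error_rate(train, validation)
print(len(train), len(validation), rate)
```

Because the validation points never take part in building the model, the error rate measured on them estimates how the model behaves on data it has not seen.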
Model evaluation yields not only a measure of a model's accuracy and validity but also a basis for comparing it with other models. Of the many evaluation techniques, the best known is cross-validation. It divides the training set into several parts; each part is used in turn as the validation set while the remaining parts form the training set. Repeating this for every part gives a more reliable performance estimate for each candidate model, from which the best one can be chosen.
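The cross-validation procedure described above can be sketched as follows. The two candidate models (a constant mean predictor and a least-squares line), the synthetic data, and the choice of mean absolute error as the score are assumptions for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        validation = list(range(start, stop))
        train = [i for i in range(n) if i < start or i >= stop]
        yield train, validation
        start = stop


def cross_validate(xs, ys, k, fit, predict):
    """Average mean absolute error over k folds."""
    errors = []
    for train_idx, val_idx in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_error = sum(abs(predict(model, xs[i]) - ys[i]) for i in val_idx)
        errors.append(fold_error / len(val_idx))
    return sum(errors) / len(errors)


def fit_mean(xs, ys):
    """Candidate 1: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


def fit_line(xs, ys):
    """Candidate 2: least-squares straight line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data following y = 2x + 1.
xs = list(range(10))
ys = [2 * x + 1 for x in xs]

mean_error = cross_validate(xs, ys, 5, fit_mean, predict_mean)
line_error = cross_validate(xs, ys, 5, fit_line, predict_line)
best = "line" if line_error < mean_error else "mean"
print(mean_error, line_error, best)
```

The folds here are contiguous slices; when the data are ordered, for example by time or by class, shuffling the indices before splitting is usually advisable.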
7. Deployment
The final step of data analysis is deployment, whose aim is to present the results, that is, to state the conclusions of the analysis. In a business setting, deployment turns the analysis results into a solution that benefits the customer who purchased the data analysis service. In a scientific or technical setting, the results are turned into designs or scientific publications. In other words, deployment is essentially putting the results of data analysis into practice.
The results of data analysis can be deployed in many settings, and this stage is often referred to as writing the data report. The report should cover the following points in detail.
■ Analysis results.
■ Decision deployment.
■ Risk analysis.
■ Business impact assessment.
Today both Internet companies and traditional companies need data analysis. When a company has to make a business decision or launch a new product, it uses data analysis to consolidate and summarize disorganized data and settle on a specific direction. In business analysis, data analysis serves three main functions.
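① Status analysis. "Status" has two meanings: what has already happened and what is happening now. Analyzing a company's basic weekly or monthly reports reveals its overall operating situation, exposes problems in its operations, and clarifies its current state.
② Cause analysis. When status analysis reveals a hidden risk in the company, that risk must be analyzed to understand why it exists and how it arose.
③ Predictive analysis. After the current status and its causes have been analyzed, predictive analysis follows: the data currently at hand are used to predict future trends.
In short, these three functions amount to analyzing the company's past overall operations, analyzing the hidden risks that exist now, and predicting the company's future development.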
Original: https://blog.csdn.net/xuefu_78/article/details/123086935
Author: 优雅的心情
Title: 数据分析理论 (Data Analysis Theory)
