Data Analysis in Practice: Using pandas on Restaurant Rating Data

2022-11-03 · Artificial Intelligence

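To get more practice applying pandas to real data analysis, today we walk through how to use pandas to analyze restaurant rating data.

The data comes from the UCI ML Repository. It contains a little over a thousand records with 5 attributes:

userID: user ID

placeID: restaurant ID

rating: overall rating

food_rating: food rating

service_rating: service rating

We read the data with pandas: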

import pandas as pd

# Path to the ratings CSV, relative to the working directory
path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df

     userID  placeID  rating  food_rating  service_rating
0     U1077   135085       2            2               2
1     U1077   135038       2            2               1
2     U1077   132825       2            2               2
3     U1077   135060       1            2               2
4     U1068   135104       1            1               2
...     ...      ...     ...          ...             ...
1156  U1043   132630       1            1               1
1157  U1011   132715       1            1               0
1158  U1068   132733       1            1               0
1159  U1068   132594       1            1               1
1160  U1068   132660       0            0               0

1161 rows × 5 columns
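Before aggregating, it can help to confirm the column names and the range of the score columns. The snippet below is a minimal sketch that assumes the CSV has the five columns listed above; the three rows in `csv_text` are made up for illustration and are not taken from the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file: same five columns, made-up rows.
csv_text = """userID,placeID,rating,food_rating,service_rating
U1,100,2,2,2
U1,101,1,2,1
U2,100,0,1,0
"""
sample = pd.read_csv(io.StringIO(csv_text))

# Check the dtypes and the min/max of the three score columns.
score_cols = ['rating', 'food_rating', 'service_rating']
print(sample.dtypes)
print(sample[score_cols].agg(['min', 'max']))
```

On the real data, the same two print calls can be run on df instead of sample.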

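If we care about the overall rating and the food rating of each restaurant, we can start with the mean of those ratings per restaurant. Here we use the pivot_table method: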

mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
                                 aggfunc='mean')
mean_ratings[:5]

         food_rating  rating
placeID
132560          1.00    0.50
132561          1.00    0.75
132564          1.25    1.25
132572          1.00    1.00
132583          1.00    1.00
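For this particular use (one index column, aggfunc='mean'), pivot_table gives the same numbers as a groupby followed by mean. The sketch below checks that on a small made-up frame; `toy` and its values are hypothetical, not rows from the dataset.

```python
import pandas as pd

# Hypothetical ratings for two restaurants.
toy = pd.DataFrame({
    'placeID': [100, 100, 101, 101],
    'rating': [2, 1, 0, 1],
    'food_rating': [2, 2, 1, 0],
})

via_pivot = toy.pivot_table(values=['rating', 'food_rating'],
                            index='placeID', aggfunc='mean')
via_groupby = toy.groupby('placeID')[['rating', 'food_rating']].mean()

# Compare with an explicit column order so layout differences do not matter.
cols = ['rating', 'food_rating']
same = via_pivot[cols].equals(via_groupby[cols])
print(same)
```

Which form to use is mostly a matter of taste; pivot_table becomes more convenient once a columns argument is also needed.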

Next, let's count how many ratings each placeID received:

ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]

placeID
132560     4
132561     4
132564     4
132572    15
132583     4
132584     6
132594     5
132608     6
132609     5
132613     6
dtype: int64
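The same per-restaurant counts can also be obtained with value_counts on the placeID column. This is a minimal sketch on made-up IDs; note that value_counts orders by count by default, so sort_index is applied to match the groupby ordering.

```python
import pandas as pd

# Hypothetical placeID column: 100 appears twice, 101 once, 102 three times.
toy = pd.DataFrame({'placeID': [100, 100, 101, 102, 102, 102]})

by_size = toy.groupby('placeID').size()
by_value_counts = toy['placeID'].value_counts().sort_index()

print(by_size)
print(by_value_counts)
```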

If too few people rated a restaurant, its averages are not reliable. Let's keep only the restaurants with at least four ratings:

active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place

Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
            132609, 132613,
            ...
            135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
            135108, 135109],
           dtype='int64', name='placeID', length=124)
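The filter above uses >=, so a restaurant with exactly four ratings is kept. The sketch below shows how the boundary case changes between >= and > on a made-up count Series; the IDs and counts are hypothetical.

```python
import pandas as pd

# Hypothetical rating counts per restaurant.
counts = pd.Series([4, 3, 15, 5],
                   index=pd.Index([100, 101, 102, 103], name='placeID'))

# Boolean mask applied to the index, as in the code above.
at_least_four = counts.index[counts >= 4]
more_than_four = counts.index[counts > 4]

print(list(at_least_four))
print(list(more_than_four))
```

Restaurant 100, with exactly four ratings, appears only in the first result.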

Select the mean ratings for these restaurants:

mean_ratings = mean_ratings.loc[active_place]
mean_ratings

         food_rating    rating
placeID
132560      1.000000  0.500000
132561      1.000000  0.750000
132564      1.250000  1.250000
132572      1.000000  1.000000
132583      1.000000  1.000000
...              ...       ...
135088      1.166667  1.000000
135104      1.428571  0.857143
135106      1.200000  1.200000
135108      1.181818  1.181818
135109      1.250000  1.000000

124 rows × 2 columns

Sort by rating and take the 10 highest-rated restaurants:

top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]

         food_rating    rating
placeID
132955      1.800000  2.000000
135034      2.000000  2.000000
134986      2.000000  2.000000
132922      1.500000  1.833333
132755      2.000000  1.800000
135074      1.750000  1.750000
135013      2.000000  1.750000
134976      1.750000  1.750000
135055      1.714286  1.714286
135075      1.692308  1.692308
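As an alternative to sorting the whole frame and slicing, DataFrame.nlargest returns the top rows directly. This is a sketch on made-up means (the frame `toy_means` is hypothetical); when several restaurants tie on rating, the two approaches may order the tied rows differently.

```python
import pandas as pd

# Hypothetical per-restaurant means, with no ties in the rating column.
toy_means = pd.DataFrame(
    {'food_rating': [1.0, 2.0, 1.5, 1.8], 'rating': [0.5, 2.0, 1.8, 1.2]},
    index=pd.Index([100, 101, 102, 103], name='placeID'),
)

# Full sort followed by a slice, as in the article.
top_sorted = toy_means.sort_values(by='rating', ascending=False)[:2]
# Same selection without sorting every row.
top_nlargest = toy_means.nlargest(2, 'rating')

print(top_sorted)
print(top_nlargest)
```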

We can also compute the difference between the mean overall rating and the mean food rating, and store it in a column named diff:

mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']

sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]

         food_rating    rating      diff
placeID
132667      2.000000  1.250000 -0.750000
132594      1.200000  0.600000 -0.600000
132858      1.400000  0.800000 -0.600000
135104      1.428571  0.857143 -0.571429
132560      1.000000  0.500000 -0.500000
135027      1.375000  0.875000 -0.500000
132740      1.250000  0.750000 -0.500000
134992      1.500000  1.000000 -0.500000
132706      1.250000  0.750000 -0.500000
132870      1.000000  0.600000 -0.400000

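Reverse the order to see the 10 restaurants at the other end, where the overall rating exceeds the food rating by the most: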

sorted_by_diff[::-1][:10]

         food_rating    rating      diff
placeID
134987      0.500000  1.000000  0.500000
132937      1.000000  1.500000  0.500000
135066      1.000000  1.500000  0.500000
132851      1.000000  1.428571  0.428571
135049      0.600000  1.000000  0.400000
132922      1.500000  1.833333  0.333333
135030      1.333333  1.583333  0.250000
135063      1.000000  1.250000  0.250000
132626      1.000000  1.250000  0.250000
135000      1.000000  1.250000  0.250000
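Sorting by diff puts the most negative values first and, once reversed, the most positive values first. If the goal is instead the largest gap in either direction, one option is to rank by the absolute value of diff. The sketch below illustrates both on made-up means; `toy_means` and the abs-based ranking are illustrative additions, not part of the analysis above.

```python
import pandas as pd

# Hypothetical per-restaurant means.
toy_means = pd.DataFrame(
    {'food_rating': [2.0, 1.0, 0.5], 'rating': [1.25, 1.0, 1.0]},
    index=pd.Index([100, 101, 102], name='placeID'),
)
toy_means['diff'] = toy_means['rating'] - toy_means['food_rating']

# Ascending: food rated well above overall comes first.
sorted_by_diff = toy_means.sort_values(by='diff')
# Reversed: overall rated well above food comes first.
reversed_order = sorted_by_diff[::-1]
# Largest gap regardless of sign.
by_abs_gap = toy_means['diff'].abs().sort_values(ascending=False)

print(list(sorted_by_diff.index))
print(list(reversed_order.index))
print(list(by_abs_gap.index))
```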

Compute the standard deviation of rating for each restaurant and take the 10 largest:

# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Keep only the restaurants in active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]

placeID
134987    1.154701
135049    1.000000
134983    1.000000
135053    0.991031
135027    0.991031
132847    0.983192
132767    0.983192
132884    0.983192
135082    0.971825
132706    0.957427
Name: rating, dtype: float64
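One detail worth knowing when reading these numbers: pandas computes the sample standard deviation by default (ddof=1), which is undefined for a group with a single rating. The sketch below uses made-up ratings to show the difference from the population version (ddof=0). As an illustration only, four ratings split evenly between 0 and 2 would give 1.154701 under the default, the same value that tops the list above.

```python
import pandas as pd

# Hypothetical ratings: restaurant 100 has four, restaurant 101 has one.
toy = pd.DataFrame({
    'placeID': [100, 100, 100, 100, 101],
    'rating': [0, 0, 2, 2, 2],
})

grouped = toy.groupby('placeID')['rating']
sample_std = grouped.std()            # default ddof=1; NaN for one rating
population_std = grouped.std(ddof=0)  # divides by n instead of n - 1

print(sample_std)
print(population_std)
```

Restricting to restaurants with at least four ratings, as done above, also keeps these NaN values out of the ranking.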

Original: https://blog.csdn.net/superfjj/article/details/123131344
Author: flydean程序那些事
Title: Data Analysis in Practice: Using pandas on Restaurant Rating Data
