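To get more practice applying pandas to real-world data analysis, today we will walk through an analysis of a restaurant rating dataset with pandas.
The data comes from the UCI ML Repository. It contains a little over a thousand records with 5 attributes:
userID: user ID
placeID: restaurant ID
rating: overall rating
food_rating: food rating
service_rating: service rating
We use pandas to read the data: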
import numpy as np
import pandas as pd
path = '../data/restaurant_rating_final.csv'
df = pd.read_csv(path)
df
     userID  placeID  rating  food_rating  service_rating
0     U1077   135085       2            2               2
1     U1077   135038       2            2               1
2     U1077   132825       2            2               2
3     U1077   135060       1            2               2
4     U1068   135104       1            1               2
...     ...      ...     ...          ...             ...
1156  U1043   132630       1            1               1
1157  U1011   132715       1            1               0
1158  U1068   132733       1            1               0
1159  U1068   132594       1            1               1
1160  U1068   132660       0            0               0
1161 rows × 5 columns
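Before aggregating, it can help to sanity-check what was loaded. The sketch below is only an illustration: because the CSV file itself is not included here, it reads the first five rows of the output above from an in-memory string, and the names `sample_csv`, `sample` and `rating_cols` are made up for this example.

```python
import io

import pandas as pd

# The first five rows shown above, inlined so the sketch runs without the CSV file.
sample_csv = """userID,placeID,rating,food_rating,service_rating
U1077,135085,2,2,2
U1077,135038,2,2,1
U1077,132825,2,2,2
U1077,135060,1,2,2
U1068,135104,1,1,2
"""
sample = pd.read_csv(io.StringIO(sample_csv))

rating_cols = ['rating', 'food_rating', 'service_rating']

print(sample.shape)                            # (rows, columns)
print(sample.dtypes)                           # column types inferred by read_csv
print(sample[rating_cols].isna().sum())        # missing values per rating column
print(sample[rating_cols].agg(['min', 'max'])) # observed range of each rating
```

Running the same checks on the full `df` should show whether any rating is missing and what range the scores actually cover.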
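If we care about each restaurant's overall rating and food rating, we can start by looking at the mean of these ratings per restaurant. Here we use the pivot_table method: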
mean_ratings = df.pivot_table(values=['rating','food_rating'], index='placeID',
aggfunc='mean')
mean_ratings[:5]
         food_rating  rating
placeID
132560          1.00    0.50
132561          1.00    0.75
132564          1.25    1.25
132572          1.00    1.00
132583          1.00    1.00
Next, let's count how many ratings each placeID received:
ratings_by_place = df.groupby('placeID').size()
ratings_by_place[:10]
placeID
132560 4
132561 4
132564 4
132572 15
132583 4
132584 6
132594 5
132608 6
132609 5
132613 6
dtype: int64
If a restaurant has too few ratings, its averages are not reliable. Let's keep only the restaurants that received at least four ratings:
active_place = ratings_by_place.index[ratings_by_place >= 4]
active_place
Int64Index([132560, 132561, 132564, 132572, 132583, 132584, 132594, 132608,
132609, 132613,
...
135080, 135081, 135082, 135085, 135086, 135088, 135104, 135106,
135108, 135109],
dtype='int64', name='placeID', length=124)
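The count-then-filter step can be sketched on made-up data as follows. The frame `toy` and the threshold variable `min_ratings` are hypothetical; the second half shows an alternative that filters the rating rows directly instead of building an index of active restaurants first.

```python
import pandas as pd

# Hypothetical toy data: restaurant 1 has 4 ratings, 2 has 2, 3 has 5.
toy = pd.DataFrame({
    'placeID': [1] * 4 + [2] * 2 + [3] * 5,
    'rating': [2, 1, 0, 1, 2, 2, 1, 1, 0, 2, 1],
})
min_ratings = 4  # same threshold as in the article

# Approach used in the article: count rows per restaurant, keep qualifying IDs.
counts = toy.groupby('placeID').size()
active = counts.index[counts >= min_ratings]

# Alternative: broadcast each group's size back to its rows and mask the rows.
group_size = toy.groupby('placeID')['rating'].transform('size')
active_rows = toy[group_size >= min_ratings]

print(list(active))
print(sorted(active_rows['placeID'].unique()))
```

Note that `>=` keeps restaurants with exactly four ratings; use `>` if the intent is strictly more than four.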
Select the mean ratings for these restaurants:
mean_ratings = mean_ratings.loc[active_place]
mean_ratings
         food_rating    rating
placeID
132560      1.000000  0.500000
132561      1.000000  0.750000
132564      1.250000  1.250000
132572      1.000000  1.000000
132583      1.000000  1.000000
...              ...       ...
135088      1.166667  1.000000
135104      1.428571  0.857143
135106      1.200000  1.200000
135108      1.181818  1.181818
135109      1.250000  1.000000
124 rows × 2 columns
Sort by rating and take the 10 highest-rated restaurants:
top_ratings = mean_ratings.sort_values(by='rating', ascending=False)
top_ratings[:10]
         food_rating    rating
placeID
132955      1.800000  2.000000
135034      2.000000  2.000000
134986      2.000000  2.000000
132922      1.500000  1.833333
132755      2.000000  1.800000
135074      1.750000  1.750000
135013      2.000000  1.750000
134976      1.750000  1.750000
135055      1.714286  1.714286
135075      1.692308  1.692308
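As a side note, sorting the whole frame and slicing is not the only way to get the top rows. A minimal sketch using `DataFrame.nlargest`, on a made-up frame of mean ratings (`toy_means` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical mean ratings for four restaurants (no ties, to keep the order unambiguous).
toy_means = pd.DataFrame(
    {'food_rating': [1.0, 2.0, 1.5, 1.75],
     'rating': [0.5, 2.0, 1.8, 1.75]},
    index=pd.Index([101, 102, 103, 104], name='placeID'),
)

# Approach used in the article: full sort, then slice.
top2_sorted = toy_means.sort_values(by='rating', ascending=False)[:2]

# Alternative: ask directly for the n rows with the largest 'rating'.
top2_nlargest = toy_means.nlargest(2, 'rating')

print(top2_sorted)
print(top2_nlargest)
```

When several restaurants share the same mean, as the three 2.0 values above do, their relative order among themselves is not meaningful.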
We can also compute the difference between the mean overall rating and the mean food rating, and store it in a column named diff:
mean_ratings['diff'] = mean_ratings['rating'] - mean_ratings['food_rating']
sorted_by_diff = mean_ratings.sort_values(by='diff')
sorted_by_diff[:10]
         food_rating    rating      diff
placeID
132667      2.000000  1.250000 -0.750000
132594      1.200000  0.600000 -0.600000
132858      1.400000  0.800000 -0.600000
135104      1.428571  0.857143 -0.571429
132560      1.000000  0.500000 -0.500000
135027      1.375000  0.875000 -0.500000
132740      1.250000  0.750000 -0.500000
134992      1.500000  1.000000 -0.500000
132706      1.250000  0.750000 -0.500000
132870      1.000000  0.600000 -0.400000
Reverse the order to get the 10 restaurants whose overall rating exceeds their food rating by the largest margin:
sorted_by_diff[::-1][:10]
         food_rating    rating      diff
placeID
134987      0.500000  1.000000  0.500000
132937      1.000000  1.500000  0.500000
135066      1.000000  1.500000  0.500000
132851      1.000000  1.428571  0.428571
135049      0.600000  1.000000  0.400000
132922      1.500000  1.833333  0.333333
135030      1.333333  1.583333  0.250000
135063      1.000000  1.250000  0.250000
132626      1.000000  1.250000  0.250000
135000      1.000000  1.250000  0.250000
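To make the sign of diff concrete, here is a small worked sketch that reuses three (food_rating, rating) pairs copied from the tables in this article. The frame name `means` is made up for the example. A negative diff means diners rated the food higher than the restaurant overall; a positive diff means the opposite.

```python
import pandas as pd

# Three rows copied from the mean-rating tables in this article.
means = pd.DataFrame(
    {'food_rating': [2.0, 0.5, 1.2],
     'rating': [1.25, 1.0, 1.2]},
    index=pd.Index([132667, 134987, 135106], name='placeID'),
)

# diff = overall rating minus food rating
means['diff'] = means['rating'] - means['food_rating']

# Ascending sort: most negative diff first, most positive last.
by_diff = means.sort_values(by='diff')
print(by_diff)

# First row: food rated well above the overall score.
# Last row: overall score well above the food rating.
print(by_diff.index[0], by_diff.index[-1])
```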
Compute the standard deviation of rating for each restaurant and pick the 10 largest:
# Standard deviation of rating grouped by placeID
rating_std_by_place = df.groupby('placeID')['rating'].std()
# Filter down to active_place
rating_std_by_place = rating_std_by_place.loc[active_place]
# Order Series by value in descending order
rating_std_by_place.sort_values(ascending=False)[:10]
placeID
134987 1.154701
135049 1.000000
134983 1.000000
135053 0.991031
135027 0.991031
132847 0.983192
132767 0.983192
132884 0.983192
135082 0.971825
132706 0.957427
Name: rating, dtype: float64
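One detail worth keeping in mind when reading these numbers: pandas' `std()` computes the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below uses four hypothetical ratings; the individual ratings behind the table above are not shown in this article, so these values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings for one restaurant: two diners gave 0, two gave 2.
ratings = pd.Series([0, 2, 0, 2])

sample_std = ratings.std()              # pandas default: ddof=1, divides by n - 1
population_std = ratings.std(ddof=0)    # divides by n
numpy_std = np.std(ratings.to_numpy())  # NumPy default: ddof=0

print(sample_std, population_std, numpy_std)

# With ddof=1 a group containing a single rating has no defined spread (NaN),
# which is one more reason to filter out rarely rated restaurants first.
single = pd.Series([2]).std()
print(single)
```

On a 0 to 2 scale, a sample standard deviation above 1 therefore indicates that diners were split between the two extremes.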
Original: https://blog.csdn.net/superfjj/article/details/123131344
Author: flydean程序那些事
Title: 数据分析实际案例之:pandas在餐厅评分数据中的使用 (A practical data analysis case: using pandas on restaurant rating data)
