[Python] pandas module: getting started with Python for machine learning, working through 《Python机器学习手册》, part 03: data wrangling


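This book works like a reference manual or a dictionary: it clearly lays out which Python calls to use and in which scenarios. Even though it is a reference book, working through it once should give a deeper understanding of the Python libraries commonly used in machine learning and make them easier to apply in practice.

The code below follows the book and was run hands-on. The comments state what each statement does (written before the statement) and what each print outputs (written after the print). The order does not strictly follow the book: it was adjusted slightly based on how the code actually ran, and some of my own understanding was added.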

If you copy the code into your own environment and run it again, I believe your understanding will become deeper and clearer.

Each code block in this post is a complete run that can be copied and executed directly.

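This part covers basic uses of the pandas library for data processing.
After data has been loaded it needs to be processed, and the DataFrame format provided by pandas makes data wrangling convenient.
The topics are covered in the subsections below.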

03-1 Browsing and filtering a DataFrame

import pandas as pd

# Create a DataFrame
dataframe = pd.DataFrame()  # the table
dataframe['Name'] = ['A', 'B']  # dict-like: the column header acts as the key
dataframe['Age'] = [38, 25]
dataframe['Driver'] = [True, False]
print(dataframe)
#   Name  Age  Driver
# 0    A   38    True
# 1    B   25   False

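# Append a new row at the bottom
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
new_person = pd.Series(['C', 40, True], index = ['Name', 'Age', 'Driver'])
dataframe = pd.concat([dataframe, pd.DataFrame([new_person])], ignore_index = True)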
print(dataframe)
#   Name  Age  Driver
# 0    A   38    True
# 1    B   25   False
# 2    C   40    True

# Look at the first two rows
print(dataframe.head(2))
#   Name  Age  Driver
# 0    A   38    True
# 1    B   25   False

# Look at the dimensions (rows, columns)
print(dataframe.shape)
# (3, 3)

# Look at descriptive statistics (numeric columns only)
print(dataframe.describe())
#             Age
# count   3.000000
# mean   34.333333
# std     8.144528
# min    25.000000
# 25%    31.500000
# 50%    38.000000
# 75%    39.000000
# max    40.000000

# Select the first row
print(dataframe.iloc[0])
# Name         A
# Age         38
# Driver    True
# Name: 0, dtype: object

# Select several rows (like array slicing, by integer row position)
print(dataframe.iloc[1:3])
#   Name  Age  Driver
# 1    B   25   False
# 2    C   40    True

# Filter rows with a condition
print(dataframe[dataframe['Driver'] == True])
#   Name  Age  Driver
# 0    A   38    True
# 2    C   40    True

# Filter rows with several conditions
print(dataframe[(dataframe['Driver'] == True) & (dataframe['Age'] < 40)])
#   Name  Age  Driver
# 0    A   38    True

# Set the DataFrame's index (instead of the default 0..n-1 row numbers, use a column whose values are all unique)
print(dataframe) # note how the DataFrame changes before and after setting the index
#   Name  Age  Driver
# 0    A   38    True
# 1    B   25   False
# 2    C   40    True
dataframe = dataframe.set_index(dataframe['Name'])
print(dataframe) # note how the DataFrame changes before and after setting the index
#      Name  Age  Driver
# Name
# A       A   38    True
# B       B   25   False
# C       C   40    True

# Look up a row by its index label
print(dataframe.loc['A'])
# Name         A
# Age         38
# Driver    True
# Name: A, dtype: object
  • loc: selects by index label.
  • iloc: selects by integer row position, which starts at 0 and increases by 1.
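
The difference between the two accessors can be sketched with a small example. It assumes a toy frame named `people` (a hypothetical name, not from the book) whose index holds string labels, so that label-based and position-based selection are easy to tell apart.

```python
import pandas as pd

# Hypothetical toy data; the index holds labels instead of the default 0..n-1
people = pd.DataFrame({"Age": [38, 25, 40]}, index=["A", "B", "C"])

# loc selects by index label, iloc by integer position
age_by_label = people.loc["B", "Age"]
age_by_position = people.iloc[1, 0]

# Slicing differs too: loc includes the end label, iloc excludes the end position
rows_by_label = people.loc["A":"B"]
rows_by_position = people.iloc[0:1]

print(age_by_label, age_by_position)
print(len(rows_by_label), len(rows_by_position))
```

Both lookups reach the same cell, but the label slice returns two rows while the position slice returns one, which is an easy source of off-by-one mistakes when switching between the two.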

03-2 Modifying and computing on a DataFrame

This part covers replacing values, renaming columns, computing descriptive statistics, finding unique values, finding missing values, deleting rows and columns, and so on.

import pandas as pd

# --- Create the DataFrame
dataframe = pd.DataFrame()  # the table
dataframe['Name'] = ['A', 'B']  # dict-like: the column header acts as the key
dataframe['Age'] = [38, 25]
dataframe['Sex'] = ['Woman', 'Man']
dataframe['Code'] = ['one', 0]
print(dataframe)
#   Name  Age    Sex Code
# 0    A   38  Woman  one
# 1    B   25    Man    0

# --- Replacing
# Replace one value in a column
dataframe['Sex'] = dataframe['Sex'].replace('Woman', 'female')
print(dataframe['Sex'])
# 0    female
# 1       Man
# Name: Sex, dtype: object

# Replace several values in a column
dataframe['Sex'] = dataframe['Sex'].replace(['Woman', 'Man'], ['female', 'male'])
print(dataframe['Sex'])
# 0    female
# 1      male
# Name: Sex, dtype: object

# Replace across the whole DataFrame
# (newer pandas versions may leave Code as an object column after replace;
#  infer_objects() makes it numeric so that describe() below includes it)
dataframe = dataframe.replace('one', 1).infer_objects()
print(dataframe)
#   Name  Age     Sex  Code
# 0    A   38  female     1
# 1    B   25    male     0

# --- Renaming
# Rename a column; the argument is a dictionary, so several columns can be renamed at once
dataframe = dataframe.rename(columns = {'Code': 'Num'})
print(dataframe)
#   Name  Age     Sex  Num
# 0    A   38  female    1
# 1    B   25    male    0

# Rename all columns at once: build a dictionary of the column names
# (reading a missing key from a defaultdict creates it with the default '')
import collections
column_names = collections.defaultdict(str)
for name in dataframe.columns:
    column_names[name]
print(column_names)
# defaultdict(<class 'str'>, {'Name': '', 'Age': '', 'Sex': '', 'Num': ''})
column_names['Name'] = '姓名'  # "name"
column_names['Age'] = '年龄'  # "age"
column_names['Sex'] = '性别'  # "sex"
column_names['Num'] = '代码'  # "code"
dataframe = dataframe.rename(columns = column_names)
print(dataframe)
#   姓名  年龄      性别  代码
# 0  A  38  female   1
# 1  B  25    male   0

# --- Descriptive statistics
# Compute the common descriptive statistics
print(dataframe.describe())
#               年龄        代码
# count   2.000000  2.000000
# mean   31.500000  0.500000
# std     9.192388  0.707107
# min    25.000000  0.000000
# 25%    28.250000  0.250000
# 50%    31.500000  0.500000
# 75%    34.750000  0.750000
# max    38.000000  1.000000

# Compute the maximum, minimum, mean, sum and count separately
print('MaxNum: ', dataframe['年龄'].max())
print('MinNum: ', dataframe['年龄'].min())
print('Mean: ', dataframe['年龄'].mean())
print('Sum: ', dataframe['年龄'].sum())
print('Count: ', dataframe['年龄'].count())
# MaxNum:  38
# MinNum:  25
# Mean:  31.5
# Sum:  63
# Count:  2

# These methods can also be applied to the whole DataFrame
print(dataframe.count())
# 姓名    2
# 年龄    2
# 性别    2
# 代码    2
# dtype: int64

# Other descriptive statistics are available as well
# (numeric_only = True skips the text columns; pandas 2.0+ raises a TypeError without it)
print(dataframe.var(numeric_only = True)) # variance
# 年龄    84.5
# 代码     0.5
# dtype: float64
print(dataframe.std(numeric_only = True)) # standard deviation
# 年龄    9.192388
# 代码    0.707107
# dtype: float64
print(dataframe.sem(numeric_only = True)) # standard error of the mean
# 年龄    6.5
# 代码    0.5
# dtype: float64
print(dataframe.median(numeric_only = True)) # median
# 年龄    31.5
# 代码     0.5
# dtype: float64
print(dataframe.mode()) # mode
# print(dataframe.kurt()) # kurtosis
# print(dataframe.skew()) # skewness

# --- Unique values
# unique returns an array of all the distinct values in a column
# Add a row first (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
new_person = pd.Series(['C', 40, 'male', 1], index = ['姓名', '年龄', '性别', '代码'])
dataframe = pd.concat([dataframe, pd.DataFrame([new_person])], ignore_index = True)
print(dataframe)
#   姓名  年龄      性别  代码
# 0  A  38  female   1
# 1  B  25    male   0
# 2  C  40    male   1

# List the unique values
print(dataframe['性别'].unique())
# ['female' 'male']

# Show the unique values together with their counts
print(dataframe['性别'].value_counts())
# male      2
# female    1
# Name: 性别, dtype: int64

# Count how many unique values there are
print(dataframe['性别'].nunique())
# 2

# --- Missing values
# Find missing values
# Add a row that has no value for '代码' (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
new_person = pd.Series(['D', 20, 'female'], index = ['姓名', '年龄', '性别'])
dataframe = pd.concat([dataframe, pd.DataFrame([new_person])], ignore_index = True)
print(dataframe)
#   姓名  年龄      性别   代码
# 0  A  38  female  1.0
# 1  B  25    male  0.0
# 2  C  40    male  1.0
# 3  D  20  female  NaN

# Detect the missing values
print(dataframe.isnull())
#       姓名     年龄     性别     代码
# 0  False  False  False  False
# 1  False  False  False  False
# 2  False  False  False  False
# 3  False  False  False   True

# Replace the missing value NaN; the same approach also works for other missing-value markers such as -999, null or ''
import numpy as np
dataframe = dataframe.replace(np.nan, 0)
print(dataframe)
#   姓名  年龄      性别   代码
# 0  A  38  female  1.0
# 1  B  25    male  0.0
# 2  C  40    male  1.0
# 3  D  20  female  0.0

# --- Deleting
# To make repeated testing easier, the results are not assigned back. To keep a result: dataframe = dataframe.drop(...)
# Drop one column
print(dataframe.drop('代码', axis = 1))  # axis is the dimension: axis = 1 means columns, axis = 0 means rows
#   姓名  年龄      性别
# 0  A  38  female
# 1  B  25    male
# 2  C  40    male
# 3  D  20  female

# Drop several columns
print(dataframe.drop(['年龄', '性别'], axis = 1))
#   姓名   代码
# 0  A  1.0
# 1  B  0.0
# 2  C  1.0
# 3  D  0.0

# Drop a column by its position (useful when the column has no name)
print(dataframe.drop(dataframe.columns[1], axis = 1))
#   姓名      性别   代码
# 0  A  female  1.0
# 1  B    male  0.0
# 2  C    male  1.0
# 3  D  female  0.0

# Drop rows by condition, the same way rows are filtered with a condition
print(dataframe[dataframe['性别'] != 'male'])
#   姓名  年龄      性别   代码
# 0  A  38  female  1.0
# 3  D  20  female  0.0

# Drop rows by a condition on the index, here dropping the first row
print(dataframe[dataframe.index != 0])
#   姓名  年龄      性别   代码
# 1  B  25    male  0.0
# 2  C  40    male  1.0
# 3  D  20  female  0.0

# Drop duplicate rows
print(dataframe.drop_duplicates())  # only drops rows that are identical in every column
#   姓名  年龄      性别   代码
# 0  A  38  female  1.0
# 1  B  25    male  0.0
# 2  C  40    male  1.0
# 3  D  20  female  0.0
print(dataframe.drop_duplicates(subset = ['代码']))  # drops rows that repeat a value in '代码', keeping the first occurrence in row order; to check several columns, add their names to the list
#   姓名  年龄      性别   代码
# 0  A  38  female  1.0
# 1  B  25    male  0.0
print(dataframe.drop_duplicates(subset = ['代码'], keep = 'last'))  # keeps the last occurrence instead when dropping rows that repeat a value in '代码'
#   姓名  年龄      性别   代码
# 2  C  40    male  1.0
# 3  D  20  female  0.0
print(dataframe.duplicated(subset = ['代码']))  # shows which rows are duplicates, similar to isnull()
# 0    False
# 1    False
# 2     True
# 3     True
# dtype: bool
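
The missing-value steps above mention that markers such as -999 can be handled the same way as NaN. A minimal sketch of that idea follows, assuming a hypothetical frame `scores` in which -999 stands for "no measurement"; the column names are made up for illustration. It also shows `fillna` and `dropna`, two standard pandas methods that the walkthrough above does not use.

```python
import numpy as np
import pandas as pd

# Hypothetical data in which -999 marks a missing measurement
scores = pd.DataFrame({"Name": ["A", "B", "C"], "Score": [1.0, -999.0, 3.0]})

# Turn the marker into a real NaN so that pandas treats it as missing
scores["Score"] = scores["Score"].replace(-999.0, np.nan)

# Count the missing values per column
missing_per_column = scores.isnull().sum()

# Option 1: fill the gap, here with the mean of the remaining values
filled = scores.fillna({"Score": scores["Score"].mean()})

# Option 2: drop every row that still contains a missing value
dropped = scores.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```

Converting markers to NaN first matters because statistics such as `mean()` skip NaN but would happily include -999 as if it were real data.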

03-3 Grouping a DataFrame

import pandas as pd

# --- Create the DataFrame
dataframe = pd.DataFrame()  # the table
dataframe['Name'] = ['A', 'B', 'C', 'D']  # dict-like: the column header acts as the key
dataframe['Age'] = [38, 25, 40, 20]
dataframe['Sex'] = ['Woman', 'Man', 'Woman', 'Man']
dataframe['Code'] = [1, 0, 1, 1]
print(dataframe)
#   Name  Age    Sex  Code
# 0    A   38  Woman     1
# 1    B   25    Man     0
# 2    C   40  Woman     1
# 3    D   20    Man     1

# Group the rows, then apply a function to each group
# (numeric_only = True skips the Name column; pandas 2.0+ raises a TypeError without it)
print(dataframe.groupby('Sex').mean(numeric_only = True))
#         Age  Code
# Sex
# Man    22.5   0.5
# Woman  39.0   1.0
print(dataframe.groupby('Sex')['Code'].sum()) # group by Sex, then sum Code within each group
# Sex
# Man      1
# Woman    2
# Name: Code, dtype: int64
print(dataframe.groupby(['Sex', 'Code'])['Age'].sum()) # group by Sex, then by Code, then sum Age within each group
# Sex    Code
# Man    0       25
#        1       20
# Woman  1       78
# Name: Age, dtype: int64
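
When several statistics are needed per group, they can be collected in one pass with named aggregation. The sketch below rebuilds the same four-row table; the output column names `mean_age`, `total_code` and `members` are assumptions chosen for illustration, not names from the book.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Age": [38, 25, 40, 20],
    "Sex": ["Woman", "Man", "Woman", "Man"],
    "Code": [1, 0, 1, 1],
})

# Each keyword names an output column and maps it to (input column, function)
summary = people.groupby("Sex").agg(
    mean_age=("Age", "mean"),
    total_code=("Code", "sum"),
    members=("Name", "count"),
)
print(summary)
```

Because every output column names its input column explicitly, text columns such as Name never reach a numeric function by accident.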

Grouping by time period with resample

import pandas as pd
import numpy as np

# Create a date range
time_index = pd.date_range('09/15/2022', periods = 100000, freq = '30s') # arguments: start time, number of periods, interval
print(time_index, len(time_index))
# DatetimeIndex(['2022-09-15 00:00:00', '2022-09-15 00:00:30',
#                '2022-09-15 00:01:00', '2022-09-15 00:01:30',
#                '2022-09-15 00:02:00', '2022-09-15 00:02:30',
#                '2022-09-15 00:03:00', '2022-09-15 00:03:30',
#                '2022-09-15 00:04:00', '2022-09-15 00:04:30',
#                ...

#                '2022-10-19 17:15:00', '2022-10-19 17:15:30',
#                '2022-10-19 17:16:00', '2022-10-19 17:16:30',
#                '2022-10-19 17:17:00', '2022-10-19 17:17:30',
#                '2022-10-19 17:18:00', '2022-10-19 17:18:30',
#                '2022-10-19 17:19:00', '2022-10-19 17:19:30'],
#               dtype='datetime64[ns]', length=100000, freq='30S') 100000

# Create the DataFrame
dataframe = pd.DataFrame(index = time_index)  # the default 0..n-1 row numbers are replaced by the timestamps in time_index
dataframe['Sale_Amount'] = np.random.randint(1, 10, 100000) # random integers from 1 to 9 (the upper bound 10 is excluded)
print(dataframe.head(5))
#                      Sale_Amount
# 2022-09-15 00:00:00            2
# 2022-09-15 00:00:30            8
# 2022-09-15 00:01:00            7
# 2022-09-15 00:01:30            4
# 2022-09-15 00:02:00            8

# Group the rows by week and compute the sum for each week
# (the data is random, so the numbers differ on every run; the outputs shown below were recorded in separate runs)
print(dataframe.resample('W').sum())
#             Sale_Amount
# 2022-09-18        57265
# 2022-09-25       101364
# 2022-10-02       100891
# 2022-10-09       100686
# 2022-10-16       101337
# 2022-10-23        39322
print(dataframe.resample('2W').sum())
#             Sale_Amount
# 2022-09-18        57562
# 2022-10-02       201239
# 2022-10-16       201288
# 2022-10-30        38938
print(dataframe.resample('M').sum())  # group by month; pandas 2.2+ deprecates the alias 'M' in favor of 'ME'
#             Sale_Amount
# 2022-09-30       230240
# 2022-10-31       268787
# By default resample labels each group with the right edge of its time bin; passing label = 'left' labels it with the left edge instead
print(dataframe.resample('M', label = 'left').sum())
#             Sale_Amount
# 2022-08-31       229518
# 2022-09-30       269099
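
Because the sales data above is random, the printed sums cannot be checked by hand. The sketch below uses a small deterministic series instead (14 days with one sale per day, a made-up dataset) so the effect of the weekly bins and of the `label` argument can be verified directly.

```python
import pandas as pd

# Hypothetical data: one sale per day for 14 days, starting Thursday 2022-09-15
days = pd.date_range("2022-09-15", periods=14, freq="D")
sales = pd.DataFrame({"Sale_Amount": [1] * 14}, index=days)

# 'W' bins end on Sunday; by default each bin is labeled with its right edge
weekly = sales.resample("W").sum()

# Same bins, but labeled with the left edge
weekly_left = sales.resample("W", label="left").sum()

print(weekly)
print(weekly_left)
```

The first bin covers only Thursday to Sunday (4 days), the second a full week (7 days) and the third the remaining 3 days. Changing the label moves the index values without changing the sums.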

03-4 Iterating over a DataFrame and applying functions

import pandas as pd

# --- Create the DataFrame
dataframe = pd.DataFrame()  # the table
dataframe['Name'] = ['a', 'b', 'c', 'd']  # dict-like: the column header acts as the key
dataframe['Age'] = [38, 25, 40, 20]
dataframe['Sex'] = ['Woman', 'Man', 'Woman', 'Man']
dataframe['Code'] = [1, 0, 1, 1]
print(dataframe)
#   Name  Age    Sex  Code
# 0    a   38  Woman     1
# 1    b   25    Man     0
# 2    c   40  Woman     1
# 3    d   20    Man     1

# Iterate over a column
for name in dataframe['Name']:
    print(name.upper())
# A
# B
# C
# D

# Apply a function to every element of a column
def uppercase(x):
    return x.upper()

print(dataframe['Name'].apply(uppercase))
# 0    A
# 1    B
# 2    C
# 3    D
# Name: Name, dtype: object

# Apply a function to every group
# (newer pandas versions may warn here and leave the grouping column Sex out of the result)
print(dataframe.groupby('Sex').apply(lambda x: x.count()))
#        Name  Age  Sex  Code
# Sex
# Man       2    2    2     2
# Woman     2    2    2     2
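
As a complement to the loop and `apply` shown above, the sketch below compares `apply` with the vectorized `.str` accessor and uses `itertuples` to walk over whole rows. The frame `people` and the label format are made up for illustration.

```python
import pandas as pd

people = pd.DataFrame({"Name": ["a", "b", "c", "d"], "Age": [38, 25, 40, 20]})

# Element-wise function call through apply
upper_apply = people["Name"].apply(str.upper)

# The .str accessor does the same for text columns without a Python-level function
upper_vectorized = people["Name"].str.upper()

# itertuples yields one named tuple per row, giving access to several columns at once
labels = [f"{row.Name}:{row.Age}" for row in people.itertuples(index=False)]

print(upper_apply.equals(upper_vectorized))
print(labels)
```

For simple string operations the `.str` methods tend to read more clearly than `apply`; explicit iteration is mostly useful when a row has to be handed to code outside pandas.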

03-5 Concatenating multiple DataFrames

import pandas as pd

# --- Create the DataFrames
data_a = {'id': ['1', '2', '3'],
          'first': ['Alex', 'Amy', 'Allen'],
          'last': ['Anderson', 'Axkerman', 'Ali']}
dataframe_a = pd.DataFrame(data_a, columns = ['id', 'first', 'last'])
print(dataframe_a)
#   id  first      last
# 0  1   Alex  Anderson
# 1  2    Amy  Axkerman
# 2  3  Allen       Ali

data_b = {'id': ['4', '5', '6'],
          'first': ['Billy', 'Brian', 'Bran'],
          'last': ['Bonder', 'Black', 'Balwner']}
dataframe_b = pd.DataFrame(data_b, columns = ['id', 'first', 'last'])
print(dataframe_b)
#   id  first     last
# 0  4  Billy   Bonder
# 1  5  Brian    Black
# 2  6   Bran  Balwner

# Concatenate two DataFrames along the row axis
print(pd.concat([dataframe_a, dataframe_b], axis = 0))
#   id  first      last
# 0  1   Alex  Anderson
# 1  2    Amy  Axkerman
# 2  3  Allen       Ali
# 0  4  Billy    Bonder
# 1  5  Brian     Black
# 2  6   Bran   Balwner

# Concatenate two DataFrames along the column axis
print(pd.concat([dataframe_a, dataframe_b], axis = 1))
#   id  first      last id  first     last
# 0  1   Alex  Anderson  4  Billy   Bonder
# 1  2    Amy  Axkerman  5  Brian    Black
# 2  3  Allen       Ali  6   Bran  Balwner
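
In the row-wise result above the index values 0, 1, 2 appear twice, and the column-wise result only lines up because both frames share the same index. The sketch below illustrates both points with two shortened, made-up frames; `ignore_index` and `keys` are standard `pd.concat` parameters.

```python
import pandas as pd

left = pd.DataFrame({"id": ["1", "2"], "first": ["Alex", "Amy"]})
right = pd.DataFrame({"id": ["4", "5"], "first": ["Billy", "Brian"]})

# ignore_index=True renumbers the rows instead of repeating 0, 1, 0, 1
stacked = pd.concat([left, right], axis=0, ignore_index=True)

# keys adds an outer index level that records which frame a row came from
labeled = pd.concat([left, right], axis=0, keys=["a", "b"])
first_of_b = labeled.loc[("b", 0), "first"]

# Column-wise concatenation aligns rows on the index; rows without a partner get NaN
right_shifted = right.copy()
right_shifted.index = [1, 2]
side_by_side = pd.concat([left, right_shifted], axis=1)

print(stacked)
print(first_of_b)
print(side_by_side)
```

If the two frames are supposed to be matched row by row regardless of their index values, resetting the index on both before concatenating avoids the NaN padding.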

Original: https://www.cnblogs.com/camilia/p/16694966.html
Author: CAMILIA
Title: [Python]-pandas模块-机器学习Python入门《Python机器学习手册》-03-数据整理



Related reading

Title: [MATLAB Deep Learning Toolbox] Study notes: Iris Clustering

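Problem definition

This example shows how a self-organizing map (SOM) neural network can cluster iris flowers topologically.

Each iris flower is described by the following four features: [Note: I do not fully understand what each feature means.]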

  • Sepal length in cm
  • Sepal width in cm
  • Petal length in cm
  • Petal width in cm

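This is a clustering problem: samples are grouped according to their similarity.

[Note: in the classification problems from the previous posts, the classes were known from the start, such as the sex of a crab (2 classes), the type of wine (3 classes), letters (26 classes) and digits (10 classes). What distinguishes this problem is that the groups to be formed cannot be known in advance.]

Data preparation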

x = iris_dataset;

The dimensions of the dataset are shown below: x holds 150 samples, each consisting of the four features described above.

size(x)

ans =
4 150

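Clustering with a neural network

The selforgmap function is designed specifically for self-organizing classification; choosing enough neurons lets it capture enough detail.

An 8×8 hexagonal grid of neurons is used for the clustering.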

net = selforgmap([8 8]);
view(net)

The network is shown in the following figure:

[Figure: network diagram displayed by view(net)]

The training process is as follows:

[net,tr] = train(net,x);
nntraintool

This produces the following result:

Training stopped because the configured maximum number of iterations was reached.

[Figure: training result]

SOM Topology

Shows the topology of the neural network. Each neuron acts as one cluster, and neighboring neurons represent similar clusters.

SOM Neighbor Connections

SOM Neighbor Distances

Shows the Euclidean distance between each neuron and its neighbors. The brighter the color, the smaller the distance; the darker the color, the larger the distance.

SOM Input Planes

SOM Sample Hits

Shows the number of flowers assigned to each cluster.

SOM Weight Positions

Original: https://blog.csdn.net/bear_miao/article/details/121317561
Author: 明天已在HiaHia
Title: 【MATLAB深度学习工具箱】学习笔记--鸢尾花聚类Iris Clustering