The 50 Most Valuable Charts for Data Visualization
Contents
- The 50 Most Valuable Charts for Data Visualization
- Preface
- Setup
- Chapter 1: Correlation
- 01/50 Scatter plot
- 02/50 Bubble plot with Encircling
- 03/50-1 Scatter plot with line of best fit
- 03/50-2 Each regression line in its own column
- 04/50 Jittering with stripplot
- 05/50 Counts Plot
- 06/50 Marginal Histogram
- 07/50 Marginal Boxplot
- 08/50 Correlogram
- 09/50 Pairwise Plot
- Chapter 2: Deviation
- Chapter 3: Ranking
- Chapter 4: Distribution
- 20/50 Histogram for Continuous Variable
- 21/50 Histogram for Categorical Variable
- 22/50 Density Plot
- 23/50 Density Curves with Histogram
- 24/50 Joy Plot
- 25/50 Distributed Dot Plot
- 26/50 Box Plot
- 27/50 Dot + Box Plot
- 28/50 Violin Plot
- 29/50 Population Pyramid
- 30/50 Categorical Plots
- Attachments
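In 2018, the blogger Selva Prabhakaran published the post "Top 50 matplotlib Visualizations" on Machine Learning Plus, the machine learning site he runs:
https://www.machinelearningplus.com/plots/top-50-matplotlib-visualizations-the-master-plots-python/
Preface
The original post is entirely in English, has few comments, and contains several errors. I have therefore translated all of it and annotated the code. The code is recorded in a Jupyter Notebook, and the attachments (the .ipynb file and the data archive data.zip) are provided for download and study. Once polished, they may also be uploaded to github.com.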
Setup
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')
# Font sizes shared by the rcParams below
large = 22; med = 16; small = 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
# On Matplotlib 3.6 and later this style is named 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
%matplotlib inline
# Fix the display of Chinese characters and of the minus sign
plt.rcParams['font.sans-serif']=['Simhei']
plt.rcParams['axes.unicode_minus']=False
print(mpl.__version__)
print(sns.__version__)
2.1.0
0.8.0
The code also runs on the versions shown above, although some details differ; I point these out next to the relevant code.
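Because the style name depends on the installed Matplotlib version, one way to keep the setup portable is to pick the first name that is actually available. The following is only a sketch: the helper `pick_style` is a hypothetical name introduced for illustration, not part of Matplotlib, and the fallback to the built-in 'default' style is an assumption about what is acceptable for these plots.

```python
import matplotlib.pyplot as plt

def pick_style(candidates, available):
    """Return the first name in `candidates` found in `available`, else 'default'."""
    for name in candidates:
        if name in available:
            return name
    return "default"

# Try the newer name first, then the older one used in the setup code
style_name = pick_style(["seaborn-v0_8-whitegrid", "seaborn-whitegrid"],
                        plt.style.available)
plt.style.use(style_name)
```

Calling `plt.style.use(style_name)` in place of the hard-coded style name leaves the rest of the setup unchanged.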
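Chapter 1: Correlation
Correlation plots are used to visualize the relationship between two or more variables, that is, how one variable changes with respect to another.
01/50 Scatter plot
The scatter plot is a classic and fundamental plot for studying the relationship between two variables. If your data contains multiple groups, you may want to visualize each group in a different color. In matplotlib, you can conveniently do this with plt.scatter().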
# Import the dataset
midwest = pd.read_csv("./data/midwest_filter.csv")
# One color per unique category
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]
plt.figure(figsize=(16, 10), dpi=80, facecolor='w', edgecolor='k')
# Draw each category as its own series so that it gets its own legend entry
for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category==category, :],
                s=20, c=[colors[i]], label=str(category))
# Axis limits, tick sizes, labels and title
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000))
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.ylabel('Population', fontsize=22)
plt.xlabel('Area', fontsize=22)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)
plt.show()
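02/50 Bubble plot with Encircling
Sometimes you want to show a group of points within a boundary to emphasize their importance. In this example, you take the records that should be encircled from the dataframe and pass them to the encircle() function defined in the code below.
The outline (encircling) is drawn here as a polygon, but it could also be an ellipse; a later article will cover this in more detail.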
from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")
# Step 1: prepare the data, one color per unique category
midwest = pd.read_csv("./data/midwest_filter.csv")
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]
# Step 2: draw the bubble plot; the bubble size is read from the 'dot_size' column
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')
for i, category in enumerate(categories):
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :], s='dot_size', c=[colors[i]], label=str(category), edgecolors='black', linewidths=.5)
# Step 3: encircle a set of points with the polygon given by their convex hull
def encircle(x, y, ax=None, **kw):
    if not ax: ax=plt.gca()
    p = np.c_[x,y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices,:], **kw)
    ax.add_patch(poly)
# Select the records to be encircled
midwest_encircle_data = midwest.loc[midwest.state=='IN', :]
# Draw the polygon twice: once as a faint fill, once as an outline
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)
# Step 4: decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')
plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)
plt.show()
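To see what the convex hull contributes to the encircling step, it can help to run it on a handful of points. The sketch below uses made-up coordinates (a unit-free square with one interior point), chosen only for illustration; `hull.vertices` holds the indices of the points that lie on the boundary.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a square plus one point in the middle (illustrative data)
points = np.array([[0, 0], [2, 0], [2, 2], [0, 2], [1, 1]])
hull = ConvexHull(points)

# Indices of the boundary points; the interior point (index 4) is not among them
boundary_idx = sorted(hull.vertices.tolist())
# Coordinates of the polygon that would be drawn around the points
polygon_xy = points[hull.vertices]
```

The interior point is left out of the polygon, which is why the outline wraps the group without passing through every point.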
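03/50-1 Scatter plot with line of best fit
If you want to understand how two variables change with respect to each other, the line of best fit is the way to go. The plot below shows how the line of best fit differs among the groups in the data. To disable the grouping and draw a single line of best fit for the entire dataset, remove the hue='cyl' parameter from the sns.lmplot() call below.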
df = pd.read_csv("./data/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]
sns.set_style("white")
# 'height' is called 'size' in seaborn versions before 0.9
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                     height=7, aspect=1.6, robust=True, palette='tab10',
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()
03/50-2 Each regression line in its own column
Alternatively, you can show the line of best fit for each group in its own column. You can do this by setting the parameter col=groupingcolumn.
df = pd.read_csv("./data/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy",
data=df_select,
height=7,  # called 'size' in seaborn versions before 0.9
robust=True,
palette='Set1',
col="cyl",
scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.show()
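04/50 Jittering with stripplot
Often several data points have exactly the same X and Y values. As a result, the points are drawn on top of each other and hide one another. To avoid this, jitter the points slightly so that you can see them all. This is convenient to do with seaborn's stripplot().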
df = pd.read_csv("./data/mpg_ggplot2.csv")
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
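Conceptually, jittering just adds a small random offset to each coordinate before plotting. The sketch below illustrates the idea with NumPy on made-up values; it is not a description of seaborn's internal implementation, and the uniform noise and the 0.25 width are assumptions chosen to mirror the `jitter=0.25` argument above.

```python
import numpy as np

rng = np.random.default_rng(0)      # fixed seed so the offsets are reproducible
x = np.array([1, 1, 1, 2, 2])       # several points share the same x value
jitter = 0.25

# Shift every point by a random amount within [-jitter, +jitter]
x_jittered = x + rng.uniform(-jitter, jitter, size=x.shape)
```

Plotting `x_jittered` instead of `x` spreads coincident points sideways while keeping each one close to its true value.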
05/50 Counts Plot
Another option for avoiding overlapping points is to increase the size of each dot according to how many points lie at that spot. The larger the dot, the more points are concentrated there.
df = pd.read_csv("./data/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
sns.stripplot(x=df_counts.cty, y=df_counts.hwy, sizes=df_counts.counts*25, ax=ax)
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()
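The key step in the counts plot is turning repeated (hwy, cty) pairs into one row per pair with a count. A minimal sketch of that aggregation on a made-up five-row table (the values are illustrative, not taken from mpg_ggplot2.csv):

```python
import pandas as pd

# Illustrative data: two pairs occur twice, one pair occurs once
df = pd.DataFrame({"cty": [18, 18, 21, 21, 21],
                   "hwy": [29, 29, 30, 30, 31]})

# One row per unique (hwy, cty) pair, with the number of occurrences
counts = df.groupby(["hwy", "cty"]).size().reset_index(name="counts")

# Marker areas proportional to the counts, using the same factor as above
sizes = counts["counts"] * 25
```

The resulting `sizes` can be passed as the `s` argument of `plt.scatter(counts.cty, counts.hwy, s=sizes)` if you prefer to draw the plot with matplotlib directly.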
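06/50 Marginal Histogram
A marginal histogram has histograms of the X and Y variables along the axes. It is used to visualize the relationship between X and Y together with the univariate distributions of X and Y. This plot is often used in exploratory data analysis (EDA).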
df = pd.read_csv("./data/mpg_ggplot2.csv")
fig = plt.figure(figsize=(16, 10), dpi= 80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])
ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df,
cmap="tab10",
edgecolors='gray', linewidths=.5)
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)
xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()
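07/50 Marginal Boxplot
The marginal boxplot serves a purpose similar to the marginal histogram. However, the boxplot helps to pinpoint the median and the 25th and 75th percentiles (the quartiles) of X and Y.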
df = pd.read_csv("./data/mpg_ggplot2.csv")
fig = plt.figure(figsize=(16, 10), dpi= 80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])
ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)
# Marginal boxplots: hwy on the right (vertical), displ at the bottom (horizontal)
sns.boxplot(y=df.hwy, ax=ax_right, orient="v")
sns.boxplot(x=df.displ, ax=ax_bottom, orient="h")
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)
plt.show()
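The three values that the box of a boxplot marks can also be computed directly, which is a handy cross-check when reading the figure. A small sketch on made-up numbers, assuming NumPy's default (linear interpolation) percentile method:

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5])   # illustrative data

# 25th percentile, median and 75th percentile: the edges and center line of the box
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1                         # interquartile range, the height of the box
```

For the real data you could replace `values` with `df.hwy` or `df.displ`.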
08/50 Correlogram
The correlogram is used to visually inspect the amount of correlation between all pairs of variables in a given dataframe (or 2D array).
This kind of plot is also commonly called a heatmap.
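df = pd.read_csv("./data/mtcars.csv")
# Correlate the numeric columns only; recent pandas versions raise an error on text columns
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(12,10), dpi= 80)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
            cmap='RdYlGn', center=0, annot=True)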
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
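The heatmap only colors a correlation matrix, so it is worth looking at that matrix on a tiny example first. The columns below are made up for illustration: `y` rises exactly with `x`, and `z` falls exactly as `x` rises.

```python
import pandas as pd

df_small = pd.DataFrame({"x": [1, 2, 3, 4],
                         "y": [2, 4, 6, 8],    # perfectly increasing with x
                         "z": [4, 3, 2, 1]})   # perfectly decreasing with x

# Pearson correlation between every pair of columns
corr_small = df_small.corr()
```

Here the x/y entry is 1 and the x/z entry is -1, the two extremes of the color scale; with `center=0`, uncorrelated pairs land on the middle color.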
09/50 Pairwise Plot
The pairwise plot is a favorite in exploratory analysis for understanding the relationship between all possible pairs of numeric variables. It is a must-have tool for bivariate analysis.
df = sns.load_dataset('iris')
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df
, kind="scatter"
, hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()
df = sns.load_dataset('iris')
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df
, kind="reg"
, hue="species"
)
plt.show()
Chapter 2: Deviation
10/50 Diverging Bars
If you want to see how items vary with respect to a single metric and to visualize the order and size of that variation, diverging bars are a great tool. They help to quickly differentiate the performance of groups in your data, and they are intuitive and convey the point immediately.
df = pd.read_csv("./data/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
plt.figure(figsize=(14,10), dpi= 80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=5)
plt.gca().set(ylabel='$Model$', xlabel='$Mileage$')
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
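The `mpg_z` column in the code above is a z-score: each value minus the mean, divided by the standard deviation, so that bars to the left of zero are below average and bars to the right are above. A minimal sketch on three made-up values (pandas uses the sample standard deviation by default):

```python
import pandas as pd

mpg = pd.Series([10.0, 20.0, 30.0])          # illustrative values

# Standardize: mean 20, sample standard deviation 10
z = (mpg - mpg.mean()) / mpg.std()

# Same coloring rule as in the plot: red below average, green otherwise
colors = ['red' if v < 0 else 'green' for v in z]
```

Sorting by `z` before plotting is what produces the ordered, diverging shape of the chart.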
11/50 Diverging Texts
Diverging texts are similar to diverging bars. They are the better choice if you want to show the value of each item within the chart in a neat and presentable way.
df = pd.read_csv("./data/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'green' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
plt.figure(figsize=(14,14), dpi= 80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 2),
                 horizontalalignment='right' if x < 0 else 'left',
                 verticalalignment='center',
                 fontdict={'color':'red' if x < 0 else 'green', 'size':14})
plt.yticks(df.index, df.cars, fontsize=12)
plt.title('Diverging Text Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
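A few notes:
- The horizontal alignment argument uses a conditional expression like the one found in a list comprehension; the difference is that only a single value is needed here, so no list is built.
- The 'left' and 'right' alignments are easy to confuse.
- Take right alignment as an example: the position at which the text is placed is the end of the line, and right alignment means that the right edge of the text is aligned with that position.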
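The alignment rule is easiest to see with two labels anchored at the same x position. In the sketch below the anchor x=0.5 and the label strings are arbitrary choices for illustration; the vertical line marks the anchor.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(6, 3))
ax.axvline(0.5, color='gray', linestyle='--')   # the anchor position

# Right-aligned: the text ends at x=0.5, so it extends to the left of the line
t_right = ax.text(0.5, 0.7, "ha='right'", horizontalalignment='right')
# Left-aligned: the text starts at x=0.5, so it extends to the right of the line
t_left = ax.text(0.5, 0.3, "ha='left'", horizontalalignment='left')
```

This is why negative values, whose lines end on the left side, use 'right': the label then grows away from the line instead of overlapping it.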
12/50 Diverging Dot Plot
The diverging dot plot is also similar to the diverging bars. However, compared with diverging bars, the absence of bars reduces the contrast and disparity between the groups.
df = pd.read_csv("./data/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = ['red' if x < 0 else 'darkgreen' for x in df['mpg_z']]
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
plt.figure(figsize=(14,16), dpi= 80)
plt.scatter(df.mpg_z, df.index, s=450, alpha=.6, color=df.colors)
for x, y, tex in zip(df.mpg_z, df.index, df.mpg_z):
    t = plt.text(x, y, round(tex, 1), horizontalalignment='center',
                 verticalalignment='center', fontdict={'color':'white'})
plt.gca().spines["top"].set_alpha(.3)
plt.gca().spines["bottom"].set_alpha(.3)
plt.gca().spines["right"].set_alpha(.3)
plt.gca().spines["left"].set_alpha(.3)
plt.yticks(df.index, df.cars)
plt.title('Diverging Dotplot of Car Mileage', fontdict={'size':20})
plt.xlabel('$Mileage$')
plt.grid(linestyle='--', alpha=0.5)
plt.xlim(-2.5, 2.5)
plt.show()
13/50 Diverging Lollipop Chart with Markers
A lollipop chart with markers offers a flexible way of visualizing divergence: it emphasizes whichever significant data points you want to draw attention to and lets you annotate the reasoning directly in the chart.
df = pd.read_csv("./data/mtcars.csv")
x = df.loc[:, ['mpg']]
df['mpg_z'] = (x - x.mean())/x.std()
df['colors'] = 'black'
df.loc[df.cars == 'Fiat X1-9', 'colors'] = 'darkorange'
df.sort_values('mpg_z', inplace=True)
df.reset_index(inplace=True)
import matplotlib.patches as patches
plt.figure(figsize=(14,16), dpi= 80)
plt.hlines(y=df.index, xmin=0, xmax=df.mpg_z, color=df.colors, alpha=0.4, linewidth=1)
plt.scatter(df.mpg_z, df.index, color=df.colors, s=[600 if x == 'Fiat X1-9' else 300 for x in df.cars], alpha=0.6)
plt.yticks(df.index, df.cars)
plt.xticks(fontsize=12)
plt.annotate('Mercedes Models', xy=(0.0, 11.0), xytext=(1.0, 11), xycoords='data',
fontsize=15, ha='center', va='center',
bbox=dict(boxstyle='square', fc='firebrick'),
arrowprops=dict(arrowstyle='-[, widthB=2.0, lengthB=1.5', lw=2.0, color='steelblue'), color='white')
p1 = patches.Rectangle((-2.0, -1), width=.3, height=3, alpha=.2, facecolor='red')
p2 = patches.Rectangle((1.5, 27), width=.8, height=5, alpha=.2, facecolor='green')
plt.gca().add_patch(p1)
plt.gca().add_patch(p2)
plt.title('Diverging Bars of Car Mileage', fontdict={'size':20})
plt.grid(linestyle='--', alpha=0.5)
plt.show()
14/50 Area Chart
By coloring the area between the axis and the line, the area chart emphasizes not only the peaks and troughs but also the duration of the highs and lows. The longer the highs last, the larger the area under the line.
df = pd.read_csv("./data/economics.csv", parse_dates=['date']).head(100)
x = np.arange(df.shape[0])
y_returns = (df.psavert.diff().fillna(0)/df.psavert.shift(1)).fillna(0) * 100
plt.figure(figsize=(16,10), dpi= 80)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] >= 0, facecolor='green', interpolate=True, alpha=0.7)
plt.fill_between(x[1:], y_returns[1:], 0, where=y_returns[1:] <= 0, facecolor='red', interpolate=True, alpha=0.7)
plt.annotate('Peak \n1975', xy=(94.0, 21.0), xytext=(88.0, 28),
bbox=dict(boxstyle='square', fc='firebrick'),
arrowprops=dict(facecolor='steelblue', shrink=0.05), fontsize=15, color='white')
xtickvals = [str(m)[:3].upper()+"-"+str(y) for y,m in zip(df.date.dt.year, df.date.dt.month_name())]
plt.gca().set_xticks(x[::6])
plt.gca().set_xticklabels(xtickvals[::6], rotation=90, fontdict={'horizontalalignment': 'center', 'verticalalignment': 'center_baseline'})
plt.ylim(-35,35)
plt.xlim(1,100)
plt.title("Month Economics Return %", fontsize=22)
plt.ylabel('Monthly returns %')
plt.grid(alpha=0.5)
plt.show()
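The `y_returns` expression in the code above computes the month-over-month percentage change by hand. On a series without missing values, pandas' `pct_change()` gives the same numbers, as this small sketch on made-up data suggests:

```python
import pandas as pd

s = pd.Series([10.0, 11.0, 9.9])    # illustrative values: +10 %, then -10 %

# The manual formula from the area chart: difference divided by the previous value
manual = (s.diff().fillna(0) / s.shift(1)).fillna(0) * 100

# The built-in equivalent; the first element has no predecessor and is set to 0
builtin = s.pct_change().fillna(0) * 100
```

Positive entries are filled green and negative ones red in the chart, which is what the two `fill_between` calls with opposite `where` conditions achieve.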
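Chapter 3: Ranking
15/50 Ordered Bar Chart
The ordered bar chart conveys the rank order of the items effectively. Ranking plots are among the simplest plots in Python visualization; their main purpose is to help compare the magnitudes of variables.
Typical ranking plots include bar charts, slope charts, dumbbell plots, and so on.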
df_raw = pd.read_csv("./data/mpg_ggplot2.csv")
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()  # mean city mileage per manufacturer
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
import matplotlib.patches as patches
fig, ax = plt.subplots(figsize=(16,10), facecolor='white', dpi= 80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=20)
# Annotate each bar with its value
for i, cty in enumerate(df.cty):
    ax.text(i, cty+0.5, round(cty, 1), horizontalalignment='center')
ax.set_title('Bar Chart for City Mileage', fontdict={'size':22})
ax.set(ylabel='Miles Per Gallon', ylim=(0, 30))
plt.xticks(df.index, df.manufacturer.str.upper(), rotation=60, horizontalalignment='right', fontsize=12)
p1 = patches.Rectangle((.57, -0.005), width=.33, height=.13, alpha=.1, facecolor='green', transform=fig.transFigure)
p2 = patches.Rectangle((.124, -0.005), width=.446, height=.13, alpha=.1, facecolor='red', transform=fig.transFigure)
fig.add_artist(p1)
fig.add_artist(p2)
plt.show()
16/50 Lollipop Chart
The lollipop chart serves a purpose similar to the ordered bar chart, in a visually pleasing way.
df_raw = pd.read_csv("./data/mpg_ggplot2.csv")
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()  # mean city mileage per manufacturer
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.vlines(x=df.index, ymin=0, ymax=df.cty, color='firebrick', alpha=0.7, linewidth=2)
ax.scatter(x=df.index, y=df.cty, s=75, color='firebrick', alpha=0.7)
ax.set_title('Lollipop Chart for City Mileage', fontdict={'size':22})
ax.set_ylabel('Miles Per Gallon')
ax.set_xticks(df.index)
ax.set_xticklabels(df.manufacturer.str.upper(), rotation=60,
                   fontdict={'horizontalalignment': 'right', 'size':12})
ax.set_ylim(0, 30)
# Annotate each lollipop with its value
for row in df.itertuples():
    ax.text(row.Index, row.cty+.5, s=round(row.cty, 2), horizontalalignment= 'center', verticalalignment='bottom', fontsize=14)
plt.show()
17/50 Dot Plot
The dot plot conveys the rank order of the items. Because the dots are aligned along the horizontal axis, it is easier to see how far apart the points are.
df_raw = pd.read_csv("./data/mpg_ggplot2.csv")
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()  # mean city mileage per manufacturer
df.sort_values('cty', inplace=True)
df.reset_index(inplace=True)
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.hlines(y=df.index, xmin=11, xmax=26, color='gray', alpha=0.7, linewidth=1, linestyles='dashdot')
ax.scatter(y=df.index, x=df.cty, s=75, color='firebrick', alpha=0.7)
ax.set_title('Dot Plot for City Mileage', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon')
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'})
ax.set_xlim(10, 27)
plt.show()
18/50 Slope Chart
The slope chart is best suited for comparing the "before" and "after" positions of a given person or item.
import matplotlib.lines as mlines
df = pd.read_csv("./data/gdppercap.csv")
left_label = [str(c) + ', '+ str(round(y)) for c, y in zip(df.continent, df['1952'])]
right_label = [str(c) + ', '+ str(round(y)) for c, y in zip(df.continent, df['1957'])]
klass = ['red' if (y1-y2) < 0 else 'green' for y1, y2 in zip(df['1952'], df['1957'])]
# Draw a line from p1 to p2: red if the value fell, green otherwise
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0],p2[0]], [p1[1],p2[1]], color='red' if p1[1]-p2[1] > 0 else 'green', marker='o', markersize=6)
    ax.add_line(l)
    return l
fig, ax = plt.subplots(1,1,figsize=(14,14), dpi= 80)
ax.vlines(x=1, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.vlines(x=3, ymin=500, ymax=13000, color='black', alpha=0.7, linewidth=1, linestyles='dotted')
ax.scatter(y=df['1952'], x=np.repeat(1, df.shape[0]), s=10, color='black', alpha=0.7)
ax.scatter(y=df['1957'], x=np.repeat(3, df.shape[0]), s=10, color='black', alpha=0.7)
for p1, p2, c in zip(df['1952'], df['1957'], df['continent']):
    newline([1,p1], [3,p2])
    ax.text(1-0.05, p1, c + ', ' + str(round(p1)), horizontalalignment='right', verticalalignment='center', fontdict={'size':14})
    ax.text(3+0.05, p2, c + ', ' + str(round(p2)), horizontalalignment='left', verticalalignment='center', fontdict={'size':14})
ax.text(1-0.05, 13000, 'BEFORE', horizontalalignment='right', verticalalignment='center', fontdict={'size':18, 'weight':700})
ax.text(3+0.05, 13000, 'AFTER', horizontalalignment='left', verticalalignment='center', fontdict={'size':18, 'weight':700})
ax.set_title("Slopechart: Comparing GDP Per Capita between 1952 vs 1957", fontdict={'size':22})
ax.set(xlim=(0,4), ylim=(0,14000), ylabel='Mean GDP Per Capita')
ax.set_xticks([1,3])
ax.set_xticklabels(["1952", "1957"])
plt.yticks(np.arange(500, 13000, 2000), fontsize=12)
plt.gca().spines["top"].set_alpha(.0)
plt.gca().spines["bottom"].set_alpha(.0)
plt.gca().spines["right"].set_alpha(.0)
plt.gca().spines["left"].set_alpha(.0)
plt.show()
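19/50 Dumbbell Plot
The dumbbell plot conveys the "before" and "after" positions of various items along with their rank order. It is very useful if you want to visualize the effect of a particular project or initiative on different objects.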
import matplotlib.lines as mlines
df = pd.read_csv("./data/health.csv")
df.sort_values('pct_2014', inplace=True)
df.reset_index(inplace=True)
# Draw a sky-blue line segment from p1 to p2
def newline(p1, p2, color='black'):
    ax = plt.gca()
    l = mlines.Line2D([p1[0],p2[0]], [p1[1],p2[1]], color='skyblue')
    ax.add_line(l)
    return l
fig, ax = plt.subplots(1,1,figsize=(14,14), facecolor='#f7f7f7', dpi= 80)
ax.vlines(x=.05, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.10, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.15, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.vlines(x=.20, ymin=0, ymax=26, color='black', alpha=1, linewidth=1, linestyles='dotted')
ax.scatter(y=df['index'], x=df['pct_2013'], s=50, color='#0e668b', alpha=0.7)
ax.scatter(y=df['index'], x=df['pct_2014'], s=50, color='#a3c4dc', alpha=0.7)
for i, p1, p2 in zip(df['index'], df['pct_2013'], df['pct_2014']):
    newline([p1, i], [p2, i])
ax.set_facecolor('#f7f7f7')
ax.set_title("Dumbell Chart: Pct Change - 2013 vs 2014", fontdict={'size':22})
ax.set(xlim=(0,.25), ylim=(-1, 27), ylabel='Mean GDP Per Capita')
ax.set_xticks([.05, .1, .15, .20])
# Labels match the tick positions .05, .10, .15 and .20
ax.set_xticklabels(['5%', '10%', '15%', '20%'])
plt.show()
Chapter 4: Distribution
20/50 Histogram for Continuous Variable
A histogram shows the frequency distribution of a given variable. The representation below groups the frequency bars by a categorical variable, which gives greater insight into the continuous variable and the categorical variable together.
df = pd.read_csv("./data/mpg_ggplot2.csv")
x_var = 'displ'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [df[x_var].values.tolist() for i, df in df_agg]
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, 30, stacked=True, density=False, color=colors[:len(vals)])
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 25)
plt.xticks(ticks=bins[::3], labels=[round(b,1) for b in bins[::3]])
plt.show()
21/50 Histogram for Categorical Variable
The histogram of a categorical variable shows the frequency distribution of that variable. By coloring the bars, you can visualize the distribution in relation to another categorical variable represented by the colors.
df = pd.read_csv("./data/mpg_ggplot2.csv")
x_var = 'manufacturer'
groupby_var = 'class'
df_agg = df.loc[:, [x_var, groupby_var]].groupby(groupby_var)
vals = [df[x_var].values.tolist() for i, df in df_agg]
plt.figure(figsize=(16,9), dpi= 80)
colors = [plt.cm.Spectral(i/float(len(vals)-1)) for i in range(len(vals))]
n, bins, patches = plt.hist(vals, df[x_var].unique().__len__(), stacked=True, density=False, color=colors[:len(vals)])
plt.legend({group:col for group, col in zip(np.unique(df[groupby_var]).tolist(), colors[:len(vals)])})
plt.title(f"Stacked Histogram of ${x_var}$ colored by ${groupby_var}$", fontsize=22)
plt.xlabel(x_var)
plt.ylabel("Frequency")
plt.ylim(0, 40)
plt.xticks( rotation=90, horizontalalignment='left')
plt.show()
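22/50 Density Plot
Density plots are a commonly used tool for visualizing the distribution of a continuous variable. By grouping them by the "response" variable, you can inspect the relationship between X and Y. The case below is for illustration only: it shows how the distribution of city mileage varies with the number of cylinders.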
df = pd.read_csv("./data/mpg_ggplot2.csv")
plt.figure(figsize=(16,10), dpi= 80)
sns.kdeplot(df.loc[df['cyl'] == 4, "cty"], shade=True, color="g", label="Cyl=4", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 5, "cty"], shade=True, color="deeppink", label="Cyl=5", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 6, "cty"], shade=True, color="dodgerblue", label="Cyl=6", alpha=.7)
sns.kdeplot(df.loc[df['cyl'] == 8, "cty"], shade=True, color="orange", label="Cyl=8", alpha=.7)
plt.title('Density Plot of City Mileage by n_Cylinders', fontsize=22)
plt.legend()
plt.show()
23/50 Density Curves with Histogram
A density curve with a histogram brings together the information conveyed by the two plots, so you can have both in a single figure instead of two.
df = pd.read_csv("./data/mpg_ggplot2.csv")
plt.figure(figsize=(13,10), dpi= 80)
sns.distplot(df.loc[df['class'] == 'compact', "cty"], color="dodgerblue", label="Compact", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'suv', "cty"], color="orange", label="SUV", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
sns.distplot(df.loc[df['class'] == 'minivan', "cty"], color="g", label="minivan", hist_kws={'alpha':.7}, kde_kws={'linewidth':3})
plt.ylim(0, 0.35)
plt.title('Density Plot of City Mileage by Vehicle Type', fontsize=22)
plt.legend()
plt.show()
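24/50 Joy Plot
A Joy plot allows the density curves of different groups to overlap, which makes it a great way to visualize the distributions of a large number of groups in relation to each other. It is pleasing to the eye and conveys just the right information clearly. It can be built easily with the joypy package, which is based on matplotlib.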
import joypy
mpg = pd.read_csv("./data/mpg_ggplot2.csv")
plt.figure(figsize=(16,10), dpi= 80)
fig, axes = joypy.joyplot(mpg, column=['hwy', 'cty'], by="class", ylim='own', figsize=(14,10))
plt.title('Joy Plot of City and Highway Mileage by Class', fontsize=22)
plt.show()
25/50 Distributed Dot Plot
The distributed dot plot shows the univariate distribution of points segmented by group. The darker the points, the higher the concentration of data points in that region. By coloring the median differently, the real position of each group becomes apparent at once.
import matplotlib.patches as mpatches
df_raw = pd.read_csv("./data/mpg_ggplot2.csv")
cyl_colors = {4:'tab:red', 5:'tab:green', 6:'tab:blue', 8:'tab:orange'}
df_raw['cyl_color'] = df_raw.cyl.map(cyl_colors)
df = df_raw[['cty', 'manufacturer']].groupby('manufacturer').mean()
df.sort_values('cty', ascending=False, inplace=True)
df.reset_index(inplace=True)
df_median = df_raw[['cty', 'manufacturer']].groupby('manufacturer').median()
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)
ax.hlines(y=df.index, xmin=0, xmax=40, color='gray', alpha=0.5, linewidth=.5, linestyles='dashdot')
# For each manufacturer, draw every car as a translucent dot and the median as a red dot
for i, make in enumerate(df.manufacturer):
    df_make = df_raw.loc[df_raw.manufacturer==make, :]
    ax.scatter(y=np.repeat(i, df_make.shape[0]), x='cty', data=df_make, s=75, edgecolors='gray', c='w', alpha=0.5)
    ax.scatter(y=i, x='cty', data=df_median.loc[df_median.index==make, :], s=75, c='firebrick')
ax.text(33, 13, r"$red \; dots \; are \; the \: median$", fontdict={'size':12}, color='firebrick')
red_patch = plt.plot([],[], marker="o", ms=10, ls="", mec=None, color='firebrick', label="Median")
plt.legend(handles=red_patch)
ax.set_title('Distribution of City Mileage by Make', fontdict={'size':22})
ax.set_xlabel('Miles Per Gallon (City)', alpha=0.7)
ax.set_yticks(df.index)
ax.set_yticklabels(df.manufacturer.str.title(), fontdict={'horizontalalignment': 'right'}, alpha=0.7)
ax.set_xlim(1, 40)
plt.xticks(alpha=0.7)
plt.gca().spines["top"].set_visible(False)
plt.gca().spines["bottom"].set_visible(False)
plt.gca().spines["right"].set_visible(False)
plt.gca().spines["left"].set_visible(False)
plt.grid(axis='both', alpha=.4, linewidth=.1)
plt.show()
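26/50 Box Plot
Box plots are a great way to visualize distributions, showing the median, the 25th and 75th percentiles (the quartiles), and the outliers. However, you need to be careful when interpreting the size of a box, because it can give a distorted impression of the number of points in that group. Manually showing the number of observations for each box helps to overcome this drawback. For example, two boxes can look the same size even though one holds 5 observations and the other 47. It is therefore worth writing the number of observations for each group on the plot.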
df = pd.read_csv("./data/mpg_ggplot2.csv")
plt.figure(figsize=(13,10), dpi= 80)
sns.boxplot(x='class', y='hwy', data=df, notch=False)
# Write the number of observations just above the median line of each box
def add_n_obs(df, group_col, y):
    medians_dict = {grp[0]:grp[1][y].median() for grp in df.groupby(group_col)}
    xticklabels = [x.get_text() for x in plt.gca().get_xticklabels()]
    # Look counts up by label so that each box gets the count of its own group
    n_obs = df.groupby(group_col)[y].size()
    for x, xticklabel in enumerate(xticklabels):
        plt.text(x, medians_dict[xticklabel]*1.01, "#obs : "+str(n_obs[xticklabel]), horizontalalignment='center', fontdict={'size':14}, color='white')
add_n_obs(df,group_col='class',y='hwy')
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.ylim(10, 40)
plt.show()
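27/50 Dot + Box Plot
The dot + box plot conveys information similar to a grouped boxplot. In addition, the dots give a sense of how many data points lie within each group.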
df = pd.read_csv("./data/mpg_ggplot2.csv")
plt.figure(figsize=(13,10), dpi= 80)
sns.boxplot(x='class', y='hwy', data=df, hue='cyl')
sns.stripplot(x='class', y='hwy', data=df, color='black', size=3, jitter=1)
# Light vertical separators between the classes
for i in range(len(df['class'].unique())-1):
    plt.vlines(i+.5, 10, 45, linestyles='solid', colors='gray', alpha=0.2)
plt.title('Box Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.legend(title='Cylinders')
plt.show()
28/50 Violin Plot
The violin plot is a visually pleasing alternative to the box plot. The shape or area of a violin depends on the number of observations it holds. However, violin plots can be harder to read and are not commonly used in professional settings.
df = pd.read_csv("./data/mpg_ggplot2.csv")
plt.figure(figsize=(13,10), dpi= 80)
sns.violinplot(x='class', y='hwy', data=df, scale='width', inner='quartile')
plt.title('Violin Plot of Highway Mileage by Vehicle Class', fontsize=22)
plt.show()
29/50 Population Pyramid
A population pyramid can be used to show the distribution of groups ordered by volume. It can also be used to show the stage-by-stage filtering of a population, as it is below, where it shows how many people pass through each stage of a marketing funnel.
df = pd.read_csv("./data/email_campaign_funnel.csv")
plt.figure(figsize=(13,10), dpi= 80)
group_col = 'Gender'
order_of_bars = df.Stage.unique()[::-1]
colors = [plt.cm.Spectral(i/float(len(df[group_col].unique())-1)) for i in range(len(df[group_col].unique()))]
for c, group in zip(colors, df[group_col].unique()):
    sns.barplot(x='Users', y='Stage', data=df.loc[df[group_col]==group, :], order=order_of_bars, color=c, label=group)
plt.xlabel("$Users$")
plt.ylabel("Stage of Purchase")
plt.yticks(fontsize=12)
plt.title("Population Pyramid of the Marketing Funnel", fontsize=22)
plt.legend()
plt.show()
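30/50 Categorical Plots
The categorical plots provided by the seaborn library can be used to visualize the count distribution of two or more categorical variables in relation to each other.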
titanic = sns.load_dataset("titanic")
g = sns.catplot(x="alive", col="deck", col_wrap=4,
                data=titanic[titanic.deck.notnull()],
                kind="count", height=3.5, aspect=.8,
                palette='tab20')
# Set the title on the figure that belongs to this grid
g.fig.suptitle('sf')
plt.show()
titanic = sns.load_dataset("titanic")
sns.catplot(x="age", y="embark_town",
hue="sex", col="class",
data=titanic[titanic.embark_town.notnull()],
orient="h", height=5, aspect=1, palette="tab10",
kind="violin", dodge=True, cut=0, bw=.2)
Attachments
The attachments include the datasets and the Jupyter notebook source file.
Original: https://blog.csdn.net/cndrip/article/details/124087387
Author: cndrip
Title: The 50 Most Valuable Charts for Data Visualization (Part 1)
相关阅读
Title: Numpy报错:ImportError: numpy.core.multiarray failed to import
导入自定义的 python 模块时,出现以下报错:
ImportError: numpy.core.multiarray failed to import
from .cv2 import *
ImportError: numpy.core.multiarray failed to import
Cause:
The installed numpy version is too old or too new.
Solution:
- Check the numpy version:
pip show numpy
The numpy version in my current environment is: Version: 1.16.5
- Upgrade:
pip install -U numpy
(tensorflow) Robin-macbook-pro:~ robin$ pip install -U numpy
Collecting numpy
Downloading https://files.pythonhosted.org/packages/6a/9d/984f87a8d5b28b1d4afc042d8f436a76d6210fb582214f35a0ea1db3be66/numpy-1.19.5-cp36-cp36m-macosx_10_9_x86_64.whl (15.6MB)
|████████████████████████████████| 15.6MB 1.3MB/s
ERROR: tensorflow 1.13.1 has requirement protobuf>=3.6.1, but you'll have protobuf 3.6.0 which is incompatible.
Installing collected packages: numpy
Found existing installation: numpy 1.16.5
Uninstalling numpy-1.16.5:
Successfully uninstalled numpy-1.16.5
Successfully installed numpy-1.19.5
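That still did not work, so I downgraded numpy with pip install -U numpy==1.14.0 (the version before the upgrade was 1.16.5).
This not only caused several dependency conflicts but also had no effect: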
(tensorflow) Robin-macbook-pro:~ robin$ pip install -U numpy==1.14.0
Collecting numpy==1.14.0
Downloading https://files.pythonhosted.org/packages/33/c4/1ea5344793c159556110e42c94c9374cb08ce2a2727374cd467bd97f6579/numpy-1.14.0-cp36-cp36m-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.macosx_10_10_intel.macosx_10_10_x86_64.whl (4.7MB)
|████████████████████████████████| 4.7MB 230kB/s
ERROR: tensorflow 1.13.1 has requirement protobuf>=3.6.1, but you'll have protobuf 3.6.0 which is incompatible.
ERROR: pmdarima 1.3.0 has requirement numpy>=1.16, but you'll have numpy 1.14.0 which is incompatible.
ERROR: phik 0.9.8 has requirement numpy>=1.15.4, but you'll have numpy 1.14.0 which is incompatible.
ERROR: librosa 0.8.0 has requirement numpy>=1.15.0, but you'll have numpy 1.14.0 which is incompatible.
ERROR: astropy 4.0 has requirement numpy>=1.16, but you'll have numpy 1.14.0 which is incompatible.
Installing collected packages: numpy
Found existing installation: numpy 1.19.5
Uninstalling numpy-1.19.5:
Successfully uninstalled numpy-1.19.5
Successfully installed numpy-1.14.0
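Still no luck.
I updated numpy to the latest version again with pip install -U numpy, and also tried updating opencv. It was already at the latest version, so this did not help either.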
Later I found that the problem was:
- ① import numpy as numpy raises: ImportError: numpy.core.multiarray failed to import
- ② import cv2 raises: AttributeError: module 'logging' has no attribute 'Handler'
Strangest of all, running the same code (import numpy as np) from a different folder works without any problem:
(1)
/Users/robin/software/anaconda3/envs/tensorflow/bin/python3.6 /Users/robin/MLcode/Pycharm_Project/tensorflow/2021/0823_face_recognition_environment/test.py
Process finished with exit code 0
(2)
/Users/robin/software/anaconda3/envs/tensorflow/bin/python3.6 /Users/robin/MLcode/Pycharm_Project/tensorflow/2021/0823_face_recognition_environment/ui/test.py
Traceback (most recent call last):
File "/Users/robin/MLcode/Pycharm_Project/tensorflow/2021/0823_face_recognition_environment/ui/test.py", line 1, in <module>
import numpy as np
File "/Users/robin/software/anaconda3/envs/tensorflow/lib/python3.6/site-packages/numpy/__init__.py", line 187, in <module>
from .testing import Tester
File "/Users/robin/software/anaconda3/envs/tensorflow/lib/python3.6/site-packages/numpy/testing/__init__.py", line 10, in <module>
from unittest import TestCase
File "/Users/robin/software/anaconda3/envs/tensorflow/lib/python3.6/unittest/__init__.py", line 59, in <module>
from .case import (TestCase, FunctionTestCase, SkipTest, skip, skipIf,
File "/Users/robin/software/anaconda3/envs/tensorflow/lib/python3.6/unittest/case.py", line 278, in <module>
class _CapturingHandler(logging.Handler):
AttributeError: module 'logging' has no attribute 'Handler'
Process finished with exit code 1
In the end I gave up, created a new folder, and moved the files there. I will put it down to PyCharm acting up.
A whole afternoon wasted!
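The traceback fails inside the standard library's unittest with "module 'logging' has no attribute 'Handler'", and only when the script runs from the ui/ folder. One plausible explanation, which the original post does not confirm, is that a file named logging.py in that folder shadows the standard library module, since Python puts the script's directory first on sys.path. That would also fit the observation that moving the files to a new folder made the error go away. The sketch below checks where a standard library module name would resolve from. The helper name find_shadowed_module is hypothetical.

```python
import importlib.machinery
import os
import sys
import sysconfig


def find_shadowed_module(name, search_path=None):
    """For a standard library module name, return the file it resolves to
    if that file lies outside the standard library directory, else None."""
    if search_path is None:
        search_path = sys.path
    # PathFinder searches the given directories directly and ignores
    # sys.modules, so an already-imported module cannot hide a shadowing file.
    spec = importlib.machinery.PathFinder.find_spec(name, search_path)
    if spec is None or not spec.origin or not os.path.exists(spec.origin):
        return None
    stdlib_dir = os.path.realpath(sysconfig.get_paths()["stdlib"])
    origin = os.path.realpath(spec.origin)
    if origin.startswith(stdlib_dir + os.sep):
        return None
    return origin


if __name__ == "__main__":
    # Run this from the folder where the import fails
    for module_name in ("logging", "unittest"):
        print(module_name, "->", find_shadowed_module(module_name))
```

If the script prints a path inside the project for logging, renaming that file (and deleting any stale __pycache__ entry for it) would be the first thing to try.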
Original: https://blog.csdn.net/Robin_Pi/article/details/120544691
Author: Robin_Pi
Title: Numpy报错:ImportError: numpy.core.multiarray failed to import
