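A walkthrough of the Titanic data: feature analysis, data cleaning, filling missing values, encoding categorical features, converting continuous features to discrete ones, combining features, building models, and making predictions.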
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
train_df=pd.read_csv('train.csv')
test_df=pd.read_csv('test.csv')
combine=[train_df,test_df]
PassengerId=test_df['PassengerId']
Feature analysis
Analyze how the survival rate relates to several factors
train_df[['Pclass','Survived']].groupby(['Pclass'],as_index=False).mean().sort_values(by='Survived',ascending=False)
train_df[['Sex','Survived']].groupby(['Sex'],as_index=False).mean().sort_values(by='Survived',ascending=False)
train_df[['SibSp','Survived']].groupby(['SibSp'],as_index=False).mean().sort_values(by='Survived',ascending=False)
train_df[['Parch','Survived']].groupby(['Parch'],as_index=False).mean().sort_values(by='Survived',ascending=False)
train_df[['Embarked','Survived']].groupby(['Embarked'],as_index=False).mean().sort_values(by='Survived',ascending=False)
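The five groupby lines above repeat one pattern. A minimal sketch of that pattern as a helper, run on a few made-up rows rather than train.csv (the function name survival_rate_by and the sample data are assumptions for illustration, not part of the original post):

```python
import pandas as pd

def survival_rate_by(df, feature):
    """Mean of Survived per value of `feature`, highest rate first."""
    return (df[[feature, 'Survived']]
            .groupby(feature, as_index=False)
            .mean()
            .sort_values(by='Survived', ascending=False)
            .reset_index(drop=True))

# Made-up sample rows, only to show the shape of the result.
sample = pd.DataFrame({
    'Sex':      ['male', 'female', 'female', 'male', 'male', 'female'],
    'Pclass':   [3, 1, 3, 1, 3, 2],
    'Survived': [0, 1, 1, 0, 0, 1],
})

sex_rates = survival_rate_by(sample, 'Sex')
print(sex_rates)
```

Because Survived is 0 or 1, the group mean is the share of survivors in that group.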
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(1,2,1)
ax2=fig.add_subplot(1,2,2)
age1=train_df.Age[train_df['Survived']==0]
age2=train_df.Age[train_df['Survived']==1]
ax1.set_title('Survived=0')
ax1.hist(age1,bins=20)
ax1.set_xlabel('age')
ax1.set_yticks([10,20,30,40,50,60])
ax2.set_title('Survived=1')
ax2.hist(age2,bins=20)
ax2.set_xlabel('age')
ax2.set_yticks([10,20,30,40,50,60])
plt.show()
fare=train_df.Fare[train_df['Survived']==1]
plt.hist(fare)
plt.show()
Data cleaning
train_df=train_df.drop(['Ticket','Cabin'],axis=1)
test_df=test_df.drop(['Ticket','Cabin'],axis=1)
combine=[train_df,test_df]
train_df.head()
train_df.Name.str.extract(r' ([a-zA-Z]+)\.',expand=False)
for database in combine:
    database['Title']=database.Name.str.extract(r' ([a-zA-Z]+)\.',expand=False)
pd.crosstab(train_df['Title'],train_df['Sex'])
for database in combine:
    database['Title']=database['Title'].replace(['Lady','Countess','Capt','Col','Don','Dr','Major','Rev','Sir','Jonkheer','Dona'],'Rare')
    database['Title']=database['Title'].replace('Mlle','Miss')
    database['Title']=database['Title'].replace('Ms','Miss')
    database['Title']=database['Title'].replace('Mme','Mrs')
train_df[['Title','Survived']].groupby(['Title'],as_index=False).mean().sort_values(by='Survived',ascending=False)
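To make the title extraction concrete, here is a small sketch on three sample names written in the same "Surname, Title. Given names" layout as the Name column. It only illustrates the regular expression and one of the replacements; the variable names are assumptions.

```python
import pandas as pd

# Sample names in the "Surname, Title. Given names" layout.
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Sagesser, Mlle. Emma',
])

# The title is the run of letters between a space and the first period.
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Fold the French form into its English equivalent, as the loop above does.
titles = titles.replace('Mlle', 'Miss')
print(titles.tolist())
```

The raw string (r'...') keeps the backslash in `\.` from being read as a string escape.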
train_df.drop(['PassengerId','Name'],axis=1,inplace=True)
test_df.drop(['PassengerId','Name'],axis=1,inplace=True)
combine=[train_df,test_df]
train_df
Filling missing values
train_df.info()
The output shows that both Age and Embarked have missing values
freq=train_df.Embarked.dropna().mode()[0]
for database in combine:
    database.Embarked=database.Embarked.fillna(freq)
grp = train_df.groupby(['Pclass','Sex','Title'])['Age'].mean().reset_index()
def fill_age(x):
    return grp[(grp['Pclass']==x['Pclass']) & (grp['Sex']==x['Sex']) & (grp['Title']==x['Title'])].Age.values[0]
train_df['Age'] = train_df.apply(lambda x: fill_age(x) if np.isnan(x['Age']) else x['Age'] ,axis=1)
test_df['Age'] = test_df.apply(lambda x: fill_age(x) if np.isnan(x['Age']) else x['Age'] ,axis=1)
combine = [train_df,test_df]
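The row-by-row apply above can also be written in vectorized form. The sketch below is an alternative, not the post's method: it uses groupby(...).transform('mean') on a made-up frame, and it computes the means inside the same frame, whereas the post looks up training-set means for both train_df and test_df.

```python
import numpy as np
import pandas as pd

# Made-up rows; NaN marks a missing age.
df = pd.DataFrame({
    'Pclass': [3, 3, 3, 1, 1],
    'Sex':    ['male', 'male', 'male', 'female', 'female'],
    'Title':  ['Mr', 'Mr', 'Mr', 'Mrs', 'Mrs'],
    'Age':    [20.0, 30.0, np.nan, 40.0, np.nan],
})

# Mean age of each (Pclass, Sex, Title) group, aligned to the original rows.
group_mean = df.groupby(['Pclass', 'Sex', 'Title'])['Age'].transform('mean')

# Keep known ages and fill only the gaps.
df['Age'] = df['Age'].fillna(group_mean)
print(df['Age'].tolist())
```

A group whose ages are all missing would still be NaN after this step, so a fallback such as the overall mean may be needed on real data.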
Encoding categorical features
Map each category to an integer: 0, 1, 2, 3, ...
for database in combine:
    database['Sex'] = database['Sex'].map({'female':1,'male':0}).astype(int)
title_mapping = {'Mr':1 ,'Miss':2 ,'Mrs':3 ,'Master':4 ,'Rare':5}
for database in combine:
    database['Title'] = database['Title'].map(title_mapping).astype(int)
Converting continuous features to discrete features
Split the continuous values into several groups (classes)
train_df['AgeBand'] =pd.cut(train_df['Age'],5)
train_df['AgeBand']
Cutting the data splits Age into five bands, and each band is then replaced with 0, 1, 2, 3, or 4
for database in combine:
    database.loc[ database['Age'] <16,'Age']=0
    database.loc[(database['Age'] >=16)&(database['Age'] <32),'Age']=1
    database.loc[(database['Age'] >=32)&(database['Age'] <48),'Age']=2
    database.loc[(database['Age'] >=48)&(database['Age'] <64),'Age']=3
    database.loc[ database['Age'] >=64,'Age']=4
train_df.drop(['AgeBand'],axis=1,inplace=True)
combine=[train_df,test_df]
train_df
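The manual thresholds above can also be expressed with pd.cut and explicit bin edges. This is a sketch under the assumption that the bands are left-closed at 16, 32, 48 and 64, as in the loop; the sample ages are made up.

```python
import numpy as np
import pandas as pd

ages = pd.Series([0.42, 15.0, 16.0, 31.9, 32.0, 47.0, 48.0, 63.5, 64.0, 80.0])

# Left-closed bins [0,16), [16,32), [32,48), [48,64), [64,inf),
# matching the thresholds used in the loop above.
edges = [0, 16, 32, 48, 64, np.inf]

# labels=False returns the bin index (0..4) instead of an interval.
age_code = pd.cut(ages, bins=edges, right=False, labels=False)
print(age_code.tolist())
```

With right=False a value equal to an edge, such as 16.0, falls into the higher band, which is what the `>=16` comparison does.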
Use qcut to process the Fare data
train_df['FareBand']=pd.qcut(train_df['Fare'],4,duplicates='drop')
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].mean())
for dataset in combine:
    dataset.loc[ dataset['Fare'] <=7.91,'Fare']=0
    dataset.loc[(dataset['Fare'] >7.91)&(dataset['Fare'] <=14.454),'Fare']=1
    dataset.loc[(dataset['Fare'] >14.454)&(dataset['Fare'] <=31),'Fare']=2
    dataset.loc[dataset['Fare'] >31,'Fare']=3
    dataset['Fare'] = dataset['Fare'].astype(int)
train_df.drop(['FareBand'],axis=1,inplace=True)
combine=[train_df,test_df]
train_df
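Unlike cut, qcut chooses its edges from quantiles, so each band holds roughly the same number of rows. A minimal sketch on made-up fares (the values and variable names are assumptions):

```python
import pandas as pd

fares = pd.Series([5.0, 7.25, 8.05, 13.0, 26.0, 30.0, 71.28, 512.33])

# Four quantile-based bins; labels=False returns the bin index.
fare_code = pd.qcut(fares, 4, labels=False)

# retbins=True also returns the bin edges. Edges read off the training data
# this way are presumably where thresholds like 7.91 / 14.454 / 31 come from.
_, edges = pd.qcut(fares, 4, retbins=True)
print(fare_code.tolist())
print(edges)
```

Taking the edges from the training set and reusing them on the test set keeps both frames on the same scale.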
Combining features
Merge SibSp (siblings/spouses) and Parch (parents/children) into a single feature, Family
Then use Family to derive IsAlong, which flags whether a passenger is traveling alone
for database in combine:
    database['Family']=database['SibSp'] + database['Parch'] + 1
for database in combine:
    database['IsAlong']=1
    database.loc[database['Family']>1,'IsAlong']=0
train_df.drop(['SibSp','Parch','Family'],inplace=True,axis=1)
test_df.drop(['SibSp','Parch','Family'],inplace=True,axis=1)
train_df
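The two loops above can be condensed into a single boolean comparison. A sketch on made-up rows, keeping the post's column name IsAlong:

```python
import pandas as pd

df = pd.DataFrame({'SibSp': [1, 0, 0, 3], 'Parch': [0, 0, 2, 1]})

# Family counts the passenger plus any relatives on board.
df['Family'] = df['SibSp'] + df['Parch'] + 1

# IsAlong (the post's column name) is 1 when the passenger travels alone.
df['IsAlong'] = (df['Family'] == 1).astype(int)
print(df)
```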
Building models
X_train = train_df.drop('Survived',axis=1)
Y_train = train_df['Survived']
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train,Y_train)
Y_pred = logreg.predict(test_df)
acc_log = round(logreg.score(X_train,Y_train)*100,2)
from sklearn.svm import SVC
svc = SVC()
svc.fit(X_train,Y_train)
Y_pred = svc.predict(test_df)
acc_svc = round(svc.score(X_train,Y_train)*100,2)
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,Y_train)
Y_pred = decision_tree.predict(test_df)
acc_decision_tree = round(decision_tree.score(X_train,Y_train)*100,2)
from sklearn.ensemble import RandomForestClassifier
random_forest = RandomForestClassifier()
random_forest.fit(X_train,Y_train)
Y_pred = random_forest.predict(test_df)
acc_random_forest = round(random_forest.score(X_train,Y_train)*100,2)
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train,Y_train)
Y_pred = knn.predict(test_df)
acc_knn = round(knn.score(X_train,Y_train)*100,2)
from sklearn.naive_bayes import GaussianNB
gaussian = GaussianNB()
gaussian.fit(X_train,Y_train)
Y_pred = gaussian.predict(test_df)
acc_gaussian = round(gaussian.score(X_train,Y_train)*100,2)
Rank the models by score
models = pd.DataFrame({
    'Model':['Logistic Regression','Support Vector Machines','KNN','Naive Bayes','Decision Tree','Random Forest'],
    'Score':[acc_log,acc_svc,acc_knn,acc_gaussian,acc_decision_tree,acc_random_forest]
})
models.sort_values(by='Score',ascending=False)
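The scores in this ranking come from score(X_train, Y_train), that is, accuracy on the same rows the models were fitted on, which tends to favor models that can memorize the data, such as trees. As a hedged alternative (not what the post does), cross-validation scores each model on rows it has not seen. The sketch below uses synthetic data in place of X_train and Y_train; everything in it is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / Y_train: 200 rows, 4 integer-coded features.
rng = np.random.default_rng(0)
X = rng.integers(0, 4, size=(200, 4))
y = (X[:, 0] + X[:, 1] + rng.integers(0, 2, size=200) > 3).astype(int)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, model in candidates.items():
    # Mean accuracy over 5 folds, each scored on rows held out from fitting.
    cv_scores[name] = round(cross_val_score(model, X, y, cv=5).mean() * 100, 2)

print(cv_scores)
```

On the real data the same loop would take X_train and Y_train in place of X and y.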
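Model prediction
Based on the ranking above (accuracy on the training set), the decision tree is chosen to predict on the test data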
from sklearn.tree import DecisionTreeClassifier
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train,Y_train)
Y_pred = decision_tree.predict(test_df)
acc_decision_tree = round(decision_tree.score(X_train,Y_train)*100,2)
submission = pd.DataFrame({
    'PassengerId':PassengerId,
    'Survived':Y_pred
})
submission.to_csv('submission.csv',index=False)
submission
Original: https://blog.csdn.net/An_0330/article/details/121201245
Author: ximu VS code
Title: Titanic survival rate analysis in Python
Related reading
Title: Detailed steps for installing PyTorch
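(1) Windows: configuring tensorflow-GPU
See the linked guide for this step: Windows tensorflow-GPU configuration
Download of the conda and tensorflow-GPU versions I used: extraction code: 98ot
Environment: Windows 10 + Anaconda
Note: the Anaconda installation steps are omitted; the steps below assume Anaconda is already installed.
(2) Installing PyTorch
2.1 Create a virtual environment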
conda create --name pytorch python=3.8.1
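Note that pytorch here is the name of the virtual environment and can be anything you like. 3.8.1 is the Python version on my machine; adjust it to match the Python version you have installed.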
activate pytorch
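2.2 Install PyTorch
Link to the official PyTorch website
Note: if your laptop has a discrete NVIDIA GPU, select the matching CUDA version on the official site; otherwise choose CPU.
- The method I used
Create the .condarc file from inside the pytorch virtual environment by entering the following command: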
conda config --set show_channel_urls yes
- Then find the .condarc file in your user folder
Open it with Notepad and replace its contents with the following:
channels:
  - defaults
show_channel_urls: true
channel_alias: https://mirrors.tuna.tsinghua.edu.cn/anaconda
default_channels:
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/pro
  - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2
custom_channels:
  conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
  simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud
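Remember to save .condarc once it is configured.
- Or use the Tsinghua mirror
Some posts about installing PyTorch say that the Tsinghua source stopped mirroring, but it has since resumed service, so it can still be used. The Tsinghua Anaconda mirror help page is worth reading (skipping it is fine too, since the detailed steps are given below).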
conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
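Copy this command and enter it in Anaconda Prompt.
2.3 Verify the installation
There are two ways to check whether the installation succeeded.
- In the prompt
(1) With the pytorch environment shown at the left of the command line, enter python
(2) Then enter import torch; if no error is reported, PyTorch has been installed successfully.
- From Jupyter Notebook
First open Anaconda Prompt from the menu, then install the plugin.
On the command line, enter: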
conda install nb_conda
Then enter the pytorch environment you created and, on the command line, enter:
conda install ipykernel
After a successful installation, the pytorch environment shows up as an additional entry in Anaconda.
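The prompt check in 2.3 (enter python, then import torch) can also be scripted. The sketch below is an assumption-laden convenience, not part of the original steps: the function name torch_status is made up, and it reports the result instead of raising an error when torch is missing.

```python
import importlib.util

def torch_status():
    """Report whether torch is importable and, if so, its version and CUDA support."""
    # find_spec looks the package up without importing it.
    if importlib.util.find_spec('torch') is None:
        return {'installed': False}
    import torch
    return {
        'installed': True,
        'version': torch.__version__,
        'cuda_available': torch.cuda.is_available(),
    }

status = torch_status()
print(status)
```

Run it inside the activated pytorch environment; cuda_available being False on a machine with an NVIDIA card usually points to a CPU-only build or a driver mismatch.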
(3) References for this article (thanks to the authors):
WIN10下pytorch环境配置(安装了半天的血泪史)
WIn10+Anaconda环境下安装PyTorch(避坑指南)
win10下使用anaconda安装pytorch(清华镜像)
如何让Jupyter Notebook支持pytorch
Original: https://blog.csdn.net/weixin_54546190/article/details/120754242
Author: ☞源仔
Title: Detailed steps for installing PyTorch
