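Prerequisites:
1. Some experience programming in Python.
2. Familiarity with Python's main scientific libraries, especially numpy, pandas, and matplotlib.
3. Preferably, work in Jupyter. (If you don't have it, I suggest downloading Anaconda, which includes it.)
I. Download the data:
1. Just download one compressed file, housing.tgz, which contains housing.csv (with all the data already in it), then run tar xzf housing.tgz to extract the CSV file. The function below automates both steps: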
import os
import tarfile
import urllib.request
DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
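After that, just call the function. Jupyter works best in Google Chrome; with other browsers you may get an error (no permission to access the site).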
fetch_housing_data()
2. Load the data with pandas. The function returns a DataFrame object containing all the data.
import pandas as pd
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
load_housing_data(HOUSING_PATH)
Take a look at the data structure:
housing = load_housing_data()
housing.head()
housing.info()
housing.describe()
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50,figsize=(20,15))
plt.show()
3. Create a test set (typically 20% of the dataset; the larger the dataset, the smaller the proportion can be).
import numpy as np
np.random.seed(42)
def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
train_set, test_set = split_train_test(housing, 0.2)
len(train_set)
len(test_set)
from zlib import crc32
def test_set_check(identifier, test_ratio):
    # An instance goes to the test set if its hash falls in the lowest test_ratio share.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]
# Alternative implementation based on MD5 (it replaces the crc32 version above):
import hashlib
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio
# The same check written so that it works on both Python 2 and Python 3:
def test_set_check(identifier, test_ratio, hash=hashlib.md5):
    return bytearray(hash(np.int64(identifier)).digest())[-1] < 256 * test_ratio
housing_with_id = housing.reset_index()
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
test_set.head()
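To see why an identifier-based split is stable, here is a minimal self-contained sketch. It applies the same crc32 check to made-up integer IDs and verifies that rows already assigned to the test set stay there after new rows are appended. The IDs and the helper name in_test_set are illustrative assumptions, not part of the original code.

```python
from zlib import crc32

import numpy as np


def in_test_set(identifier, test_ratio):
    """Return True if the hash of the identifier falls in the lowest test_ratio share."""
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32


# Hypothetical identifiers: 1,000 original rows, then 200 more rows appended later.
old_ids = np.arange(1000)
new_ids = np.arange(1200)

old_test = {int(i) for i in old_ids if in_test_set(i, 0.2)}
new_test = {int(i) for i in new_ids if in_test_set(i, 0.2)}

# Every row that was in the test set before the update is still in it afterwards.
stable = old_test <= new_test
test_share = len(new_test) / len(new_ids)
print(stable, round(test_share, 3))
```

Because the decision depends only on each row's own identifier, adding or reordering rows never moves an existing row between the two sets, which a freshly shuffled random split cannot guarantee.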
4. Random split with Scikit-Learn, and a test set obtained by stratified sampling:
from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
test_set.head()
housing["median_income"].hist()
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].value_counts()
housing["income_cat"].hist()
from sklearn.model_selection import StratifiedShuffleSplit
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]
strat_test_set["income_cat"].value_counts() / len(strat_test_set)
housing["income_cat"].value_counts() / len(housing)
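As a rough illustration of what stratification guarantees, the sketch below runs StratifiedShuffleSplit on made-up category labels and compares the class proportions in the test set with the overall ones. The labels and their 70/20/10 mix are assumptions chosen for illustration, not the housing data.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Made-up category labels: 70% class 1, 20% class 2, 10% class 3.
labels = np.array([1] * 700 + [2] * 200 + [3] * 100)
X = np.zeros((len(labels), 1))  # the features play no role in the split itself

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(X, labels))

# Share of each class overall and inside the test set.
overall = np.bincount(labels)[1:] / len(labels)
in_test = np.bincount(labels[test_index])[1:] / len(test_index)
print(overall, in_test)
```

The two sets of proportions should agree closely, which is the property checked for income_cat in the code above.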
5. Next, compare the income category proportions in the full dataset, the stratified test set, and the random test set.
def income_cat_proportions(data):
    return data["income_cat"].value_counts() / len(data)
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
compare_props = pd.DataFrame({
    "Overall": income_cat_proportions(housing),
    "Stratified": income_cat_proportions(strat_test_set),
    "Random": income_cat_proportions(test_set),
}).sort_index()
compare_props["Rand. %error"] = 100 * compare_props["Random"] / compare_props["Overall"] - 100
compare_props["Strat. %error"] = 100 * compare_props["Stratified"] / compare_props["Overall"] - 100
compare_props
The results show that only the purely random test set deviates noticeably from the overall proportions. With the comparison done, we can drop the income_cat column and restore the data to its original state:
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
II. Explore the data
First, make a copy so that the original data is not damaged.
housing = strat_train_set.copy()
1. Visualize the geographical data:
housing.plot(kind="scatter", x="longitude", y="latitude")
housing.plot(kind="scatter",x="longitude",y="latitude",alpha=0.1)
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10,7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
             sharex=False)
plt.legend()
2. Look for correlations:
# numeric_only=True skips the text column ocean_proximity (needed on pandas >= 2.0)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
from pandas.plotting import scatter_matrix
attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1)
plt.axis([0, 16, 0, 550000])
# Save the figure to the working directory
plt.savefig("income_vs_house_value_scatterplot.png")
3. Experiment with attribute combinations (feature extraction):
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
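# numeric_only=True skips the text column ocean_proximity (needed on pandas >= 2.0)
corr_matrix = housing.corr(numeric_only=True)
corr_matrix["median_house_value"].sort_values(ascending=False)
housing.plot(kind="scatter", x="rooms_per_household", y="median_house_value",
             alpha=0.2)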
plt.axis([0, 5, 0, 520000])
plt.show()
housing.describe()
III. Prepare the data
First, go back to a clean training set (copy()). ^_^
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()
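1. Data cleaning (for missing values, I fill in the median computed from the training data):
sample_incomplete_rows = housing[housing.isnull().any(axis=1)].head().copy()
sample_incomplete_rows
median = housing["total_bedrooms"].median()
# Assign the result back instead of calling fillna(inplace=True) on a column selection
sample_incomplete_rows["total_bedrooms"] = sample_incomplete_rows["total_bedrooms"].fillna(median)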
sample_incomplete_rows
2. Scikit-Learn's design (filling in missing values with SimpleImputer):
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy="median")
housing_num = housing.drop("ocean_proximity", axis=1)
imputer.fit(housing_num)
imputer.statistics_
housing_num.median().values
X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns ,index=housing_num.index )
housing_tr.loc[sample_incomplete_rows.index.values]
imputer.strategy
housing_tr = pd.DataFrame(X, columns=housing_num.columns,
                          index=housing_num.index)
housing_tr.head()
3. Handle text and categorical attributes:
So far we have only dealt with numerical attributes. Now let's look at the text attribute.
housing_cat = housing[["ocean_proximity"]]
housing_cat.head(10)
from sklearn.preprocessing import OrdinalEncoder
ordinal_encoder =OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
housing_cat_encoded[:10]
ordinal_encoder.categories_
from sklearn.preprocessing import OneHotEncoder
cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
housing_cat_1hot
housing_cat_1hot.toarray()
cat_encoder.categories_
4. Custom transformers:
from sklearn.base import BaseEstimator, TransformerMixin
# Column indices of the attributes that get combined
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6
class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing to learn
    def transform(self, X):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
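5. Feature scaling:
(The scaling itself is done by StandardScaler in the pipeline of the next step. The code here looks up the column indices by name instead of hard-coding them, and wraps the extra attributes computed above in a DataFrame.)
col_names = "total_rooms", "total_bedrooms", "population", "households"
rooms_ix, bedrooms_ix, population_ix, households_ix = [
    housing.columns.get_loc(c) for c in col_names]
housing_extra_attribs = pd.DataFrame(
    housing_extra_attribs,
    columns=list(housing.columns)+["rooms_per_household", "population_per_household"],
    index=housing.index)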
housing_extra_attribs.head()
6. Transformation pipelines:
(The data transformations have to be executed in the right order.)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)
from sklearn.compose import ColumnTransformer
num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])
housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
housing_prepared.shape
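To make the point about ordering concrete, here is a small sketch on made-up data: a Pipeline applies its steps one after another, so it gives the same result as running the imputer and then the scaler by hand. The toy array is an assumption for illustration, and only two of the steps used above are included.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up numeric data with one missing value (illustrative only).
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
X_pipe = pipeline.fit_transform(X)

# The same two steps applied by hand, in the same order.
X_imputed = SimpleImputer(strategy="median").fit_transform(X)
X_manual = StandardScaler().fit_transform(X_imputed)

same = np.allclose(X_pipe, X_manual)
print(same)
print(X_pipe.mean(axis=0).round(6), X_pipe.std(axis=0).round(6))
```

After the last step every column has zero mean and unit standard deviation, which is what StandardScaler contributes in the numerical pipeline.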
IV. Select and train a model:
Now for the machine learning algorithms themselves.
Three models are trained in total: linear regression, a decision tree, and a random forest. The test set is then used to evaluate how well the trained model generalizes.
1. Train and evaluate on the training set:
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared,housing_labels)
some_data =housing.iloc[:5]
some_labels = housing_labels.iloc[: 5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:",lin_reg.predict(some_data_prepared))
from sklearn.metrics import mean_squared_error
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
lin_rmse
However, this result is not great: an RMSE of 68628.198 is rather large. Let's look at a decision tree next:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels,housing_predictions)
tree_rmse = np.sqrt(tree_mse)
tree_rmse
The result is 0.0, which most likely means severe overfitting.
2. Better evaluation with cross-validation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                         scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-scores)
def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())
display_scores(tree_rmse_scores)
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
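The scores returned by cross_val_score with scoring="neg_mean_squared_error" are negated MSE values, which is why the code above takes np.sqrt(-scores). A minimal sketch on synthetic data shows this; the data and coefficients are illustrative assumptions, not the housing set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

scores = cross_val_score(LinearRegression(), X, y,
                         scoring="neg_mean_squared_error", cv=5)

# Scikit-Learn scorers follow "greater is better", so the MSE comes back negated.
all_negative = bool(np.all(scores <= 0))
rmse_scores = np.sqrt(-scores)
print(all_negative, rmse_scores.mean())
```

The mean RMSE lands near the noise level of 0.1 that was used to generate y, as one would expect for a correctly specified linear model.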
You will then find that the decision tree is indeed overfitting, and that it actually performs worse than linear regression. Next, let's try a random forest:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared,housing_labels)
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
forest_rmse
from sklearn.model_selection import cross_val_score
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)
V. Fine-tune the model:
1. Grid search:
Tune the hyperparameters:
from sklearn.model_selection import GridSearchCV
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor(random_state=42)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)
grid_search.fit(housing_prepared, housing_labels)
grid_search.best_params_
grid_search.best_estimator_
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
2. Randomized search (suited to large hyperparameter search spaces):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_distribs = {
    'n_estimators': randint(low=1, high=200),
    'max_features': randint(low=1, high=8),
}
forest_reg = RandomForestRegressor(random_state=42)
rnd_search = RandomizedSearchCV(forest_reg, param_distributions=param_distribs,
                                n_iter=10, cv=5, scoring='neg_mean_squared_error', random_state=42)
rnd_search.fit(housing_prepared, housing_labels)
rnd_search.best_params_
cvres = rnd_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
3. Analyze the best model and its errors:
feature_importances = grid_search.best_estimator_.feature_importances_
feature_importances
Display the importance scores next to the corresponding attribute names:
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
sorted(zip(feature_importances, attributes), reverse=True)
4. Evaluate the system on the test set:
At this point we finally have a system that performs well. Let's run one final evaluation; this is the moment of truth.
Evaluate the final model
final_model = grid_search.best_estimator_
x_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()
x_test_prepared = full_pipeline.transform(x_test)
final_predictions = final_model.predict(x_test_prepared)
final_mse = mean_squared_error(y_test , final_predictions)
final_rmse = np.sqrt(final_mse)
final_rmse
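The result is decent, but a single point estimate of the generalization error does not show how precise it is.
To quantify this, compute a 95% confidence interval for the generalization error: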
from scipy import stats
confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                         loc=squared_errors.mean(),
                         scale=stats.sem(squared_errors)))
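For readers who want to see what the interval call computes, the sketch below rebuilds it by hand: mean of the squared errors plus or minus the Student's t critical value times the standard error, followed by a square root to return to RMSE units. The squared errors are randomly generated stand-ins (an assumption for illustration), not the real test-set errors.

```python
import numpy as np
from scipy import stats

# Made-up squared errors standing in for (final_predictions - y_test) ** 2.
rng = np.random.default_rng(0)
squared_errors = rng.normal(loc=0.0, scale=50000.0, size=500) ** 2

confidence = 0.95
m = len(squared_errors)
mean = squared_errors.mean()
sem = squared_errors.std(ddof=1) / np.sqrt(m)  # standard error of the mean

# Two-sided critical value of Student's t with m - 1 degrees of freedom.
tscore = stats.t.ppf((1 + confidence) / 2, df=m - 1)
tmargin = tscore * sem
manual = np.sqrt([mean - tmargin, mean + tmargin])

# The same interval obtained from the library call used above.
library = np.sqrt(stats.t.interval(confidence, m - 1, loc=mean,
                                   scale=stats.sem(squared_errors)))
print(manual, library)
```

Both approaches give the same pair of bounds; the interval is built on the mean squared error and only converted to RMSE at the end.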
VI. Launch!
full_pipeline_with_predictor = Pipeline([
    ("preparation", full_pipeline),
    ("linear", LinearRegression())
])
full_pipeline_with_predictor.fit(housing, housing_labels)
full_pipeline_with_predictor.predict(some_data)
Save the trained model so it can be reused later. ^^
my_model = full_pipeline_with_predictor
import joblib
joblib.dump(my_model, "my_model.pkl")
my_model_loaded = joblib.load("my_model.pkl")
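A quick way to check that saving worked is to confirm that the reloaded model predicts exactly what the original does. The sketch below does this with a tiny made-up model and a temporary file; the data, the file name toy_model.pkl, and the variable names are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Tiny made-up model standing in for the full pipeline.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Save to a temporary directory (hypothetical path) and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(model, path)
    loaded = joblib.load(path)

same_predictions = np.allclose(model.predict(X), loaded.predict(X))
print(same_predictions)
```

Files written by joblib are pickle-based, so they should only be loaded from sources you trust, and ideally with the same library versions that were used to save them.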
Closing remarks:
I learned all of this from a book titled 《机器学习实践》 ("Machine Learning in Practice"), and the content above essentially follows it. The book is cited below.
With my limited ability, the analysis is not very thorough; if you find any mistakes, please point them out in the comments. Thank you!
Full code: the version I typed up
Or: the original author's version
Finally:
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, 2nd Edition, by Aurélien Géron, published by O'Reilly, ISBN 978-1-492-03264-9.
I recommend buying a copy; it is very good. 🆗
Original: https://blog.csdn.net/qq_51153436/article/details/121527662
Author: 看到我你要笑一下
Title: Machine Learning Example (Predicting the Median House Price) (with Code)
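Related reading
Title: Installing the tensorflow and keras packages with Anaconda
1. Background
These two packages cannot be installed directly in Anaconda; the installation takes an extremely long time.
2. Preparation
Add the Tsinghua mirror
1. In the Anaconda prompt (you can find it with a system-wide search), run the conda config command, then locate the .condarc file in your user directory (if you cannot find it, it may be hidden; right-click to enable showing hidden files).
2. Change the contents of that file to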
channels:
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/peterjc123/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
- http://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
show_channel_urls: true
ssl_verify: true
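If an invaliderror is shown after changing the file as described above, first delete the .condarc file, then run conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/bioconda/
Finally, repeat step 2 above on the newly created .condarc file to change its contents.
3. In the Anaconda prompt, run
conda clean -i
// clears the index cache so that the index served by the mirror is used
3. Create a virtual environment
1. Create the environment
Open the Anaconda prompt and enter
conda create --name tensorflow python=3.6
// "tensorflow" is the name of the conda virtual environment you are creating
// if the spinning progress indicator stops spinning during creation, the command above was probably mistyped
// whatever Python version you installed before, 3.6 is recommended here for better compatibility. The Python, tensorflow, and keras versions must match for installation and use to go smoothly, and the tensorflow and keras versions installed below are the ones that fit Python 3.6. If you do not create the environment with Python 3.6, be sure to look up the tensorflow and keras versions that match your Python version. To stress it once more: the tensorflow and keras versions must be compatible with each other, otherwise they cannot be installed and used properly.
2. Enter the environment
conda activate tensorflow
// enters the virtual environment named "tensorflow"
3. Install the packages
Install tensorflow
pip install tensorflow==2.0.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
// the command above is a single command line
// if a Proceed (y/n) prompt appears on the bottom line during installation, just type y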
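Test whether tensorflow was installed successfully
1. First create a new project
2. Configure the environment
// reference path only; use your own installation path
3. Create a test.py under the project with the following content
import tensorflow as tf
A = tf.constant([[1, 2], [3, 4]])
B = tf.constant([[5, 6], [7, 8]])
C = tf.matmul(A, B)
print(C)
print(tf.__version__)  # note: version has two underscores on each side
4. If the script prints the product matrix [[19 22] [43 50]] and the version number, the installation succeeded
Install keras
pip install keras==2.3.1 -i https://pypi.tuna.tsinghua.edu.cn/simple
// the command above is a single command line
Test whether keras was installed successfully
1. Enter the Python environment
python
2. Import the keras package
import keras
3. If the import finishes without an error, the installation succeeded
4. Problems you may run into
1. Installing Anaconda from the official site is very slow
Installing from the Tsinghua mirror is recommended: https://mirrors.tuna.tsinghua.edu.cn/
// mirror list -> anaconda -> archive -> (click the arrow next to "date" to sort the versions from newest to oldest, then install the top one, i.e. the latest version)
2. Many files are missing after a repeated (non-first) installation of Anaconda
Delete the .condarc file in your user directory, uninstall the old package, and download it again
3. An error appears when testing the tensorflow package
Click the blue link in the error report to open the corresponding file, find the corresponding line, and change it as shown in the figure.
4. An error appears when installing the packages
This error is caused by a mismatch between the installed tensorflow and keras versions. Uninstall the existing tensorflow and keras packages, look up the tensorflow and keras versions that match the Python version of your virtual environment, and make sure the tensorflow version is compatible with the keras version.
5. Import problems
If importing tensorflow in the Python environment reports an error such as "importerror cannot found name", or the import does not finish after waiting a while, re-enter the Python environment and try import tensorflow as tf again.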
Original: https://blog.csdn.net/qq_48078689/article/details/124273084
Author: gcj_future
Title: Installing the tensorflow and keras packages with Anaconda
