Data Mining and Data Analysis Project: Lianjia Rental Data (3) Further Exploration and Summary


Looking back, the initial analysis had muddled logic and relied on a single model, so this part continues the analysis with improvements on both fronts.

This write-up has not been polished further; the code is in exploration2 in the resources.

Problem Background and Restatement

The idea comes from the following background. In the current rental market, tenants are often in a weak position: they pay agency fees and also bear extra risk. Part of that risk comes from unscrupulous agencies that, to cut costs, fail to meet their duty to protect information, or even publish false information for their own profit. Against a set of objective data filtered by Lianjia, the subjective recommendation labeled "must-see good home" (必看好房) therefore stands out. Renters generally cannot summarize and screen large amounts of objective data themselves, so they rely heavily on this recommendation. Together, these two points lead to the ultimate question, the one posed in the previous report: is the "must-see" label true and reliable, that is, are "must-see" homes really good?

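The next question is what a "must-see" home is generally expected to be. Rechecking Lianjia's own explanation, in its second-hand housing section the "must-see" label is described as a scarce, good home with high value for money. The previous report phrased the target in a more complicated way, "having a price advantage when other conditions are equal", which simply means high value for money. So the main question, whether "must-see" homes are really good, can be stated directly as: do "must-see" homes really offer better value for money?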
Two points are clearly puzzling. The first is scarcity, which was not discussed before. When "must-see" listings are analyzed one indicator at a time, they turn out not to be scarce at all, and sometimes the opposite holds. On the price indicator, for example, the few listings with exceptionally high or low prices are not labeled "must-see". On the district indicator, Chongming has few listings, yet none of them is labeled "must-see". The second point is that among the 20 listings shown on the first page of rental results under the composite ranking, only 5 are labeled "must-see", or 25%. This is about the same as the 24.87% share of "must-see" listings among all listings, so "must-see" listings have no advantage in the composite ranking, which again casts doubt on their supposedly higher value for money.

Basic Approach

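Back to the main question: do "must-see" listings offer better value for money? The most direct approach is to quantify value for money, split the sample into "must-see" and non-"must-see" groups, and compare the group means. A more rigorous approach treats the rental data collected by Lianjia as a sample of the rental market. If "must-see" and non-"must-see" are regarded as two classes, the question is then equivalent to a hypothesis test on the difference in means between samples drawn from two different populations.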
The difficulty with this approach is that we are not industry insiders and have no established system for scoring value for money. If a suitable method for constructing a value-for-money index were available, for instance one used by Lianjia's own staff, we could apply the approach above directly to judge whether Lianjia's "must-see" listings offer better value for money.

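Note that this comparison already treats "must-see" and non-"must-see" listings as samples from two different populations, rather than as two values of one feature within a single sample.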
Since no suitable value-for-money index is available, we go back to the definition of value for money: with all other attributes of a home equal, compare the price levels of "must-see" and non-"must-see" homes. The problem is that for a given "must-see" home, one cannot find a non-"must-see" home that matches it on every attribute except price, so there is nothing to compare its price against. We therefore train a price prediction model on the non-"must-see" data, with price as the target. Feeding in the other feature values of a "must-see" home amounts to generating a non-"must-see" home identical to it in everything except price, together with a predicted price for that home. Comparing this predicted price with the actual price of the "must-see" home compares their value for money and so verifies the conclusion.

That was the earlier approach, but price need not be the prediction target. One can instead predict area: build an area prediction model and compare the areas of "must-see" and non-"must-see" homes with all other factors equal. The target could also be a categorical value such as orientation. It is commonly held that, other things equal, a south-facing home is better value, in which case a classification model is needed. I do not think this works well, though, because a classification model has little explanatory power here: a north-versus-south comparison is not very convincing, and unlike regression it gives no measure of how much the value for money differs.
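The counterfactual comparison described in this section can be sketched roughly as follows. This is only a minimal illustration, not the code in exploration2: it assumes a single feature (area), a hand-written least-squares fit, and invented toy rows, and the names `fit_line` and `mean_deviation` are hypothetical.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one feature."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def mean_deviation(non_must_see, must_see):
    """Fit on non-"must-see" rows only, then return mean(predicted - actual)
    price over the "must-see" rows. A positive value means the "must-see"
    homes are cheaper than comparable non-"must-see" homes."""
    intercept, slope = fit_line(
        [row["area"] for row in non_must_see],
        [row["price"] for row in non_must_see],
    )
    return mean(intercept + slope * row["area"] - row["price"] for row in must_see)

# Toy data, invented purely for illustration (area in m², rent in yuan/month).
non_must_see = [
    {"area": 40, "price": 4000},
    {"area": 60, "price": 6000},
    {"area": 80, "price": 8000},
]
must_see = [
    {"area": 50, "price": 4800},
    {"area": 70, "price": 6700},
]

deviation = mean_deviation(non_must_see, must_see)
print(f"mean(predicted - actual) = {deviation:.1f}")
```

The key design point is that the model never sees a "must-see" row during fitting, so its prediction plays the role of the matching non-"must-see" home that cannot be found in the data.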

Verifying the Conclusion

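Below we try a different method (hypothesis test), a different prediction target (area), and a different model (KNN) to check whether they lead to similar conclusions.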

Hypothesis Test

Returning to the first idea, we try to construct a value-for-money index, so that a hypothesis test on the two groups of samples can be used to verify the conclusion.

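First, drawing on earlier experience, the data is restricted to listings with rent below 25,000 yuan per month. Second, we construct a value-for-money index of our own. For a given listing it is defined as
v = (district mean rent per m² + street mean rent per m²) / (2 × listing rent per m²)
so v > 1 means the listing is cheaper per square meter than the average of its district and street means. This gives index values v1 and v2 for "must-see" and non-"must-see" listings. Their means, variances, and sample sizes are (1.117614, 1.094020), (0.08818, 0.14374), and (3596, 9849) respectively. The resulting Z value is 3.77, which corresponds to a one-sided confidence level above 99.99%. In other words, with more than 99.99% confidence "must-see" listings have higher value for money, which supports the conclusion of our regression model.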
The result is stronger than expected, possibly because the value-for-money index was built only from factors believed to affect value for money strongly, which amplified the effect.
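The Z value above can be re-derived from the reported summary statistics. The sketch below assumes a large-sample two-sample Z test with unequal variances and treats 0.08818 and 0.14374 as variances, as stated in the text; the function name `two_sample_z` is hypothetical.

```python
import math

def two_sample_z(mean1, var1, n1, mean2, var2, n2):
    """Large-sample Z statistic for H1: mean1 > mean2, plus the one-sided p-value."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    z = (mean1 - mean2) / standard_error
    # Upper-tail probability of the standard normal distribution.
    p_one_sided = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_one_sided

# Summary statistics reported above: "must-see" first, non-"must-see" second.
z, p = two_sample_z(1.117614, 0.08818, 3596, 1.094020, 0.14374, 9849)
print(f"Z = {z:.2f}, one-sided p = {p:.1e}")
```

Under these assumptions the statistic rounds to 3.77 and the one-sided p-value is below 0.0001, matching the confidence level quoted in the text.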

KNN

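To verify with KNN, we simply replace the linear regression model with KNN for prediction. The resulting learning curve is shown below; R² is highest at k = 6 (5-fold cross-validation).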

[Figure: KNN learning curve, R² versus k]
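Choosing k from a cross-validated learning curve can be sketched as below. This is a self-contained illustration in plain Python on synthetic data, not the project's code: the single feature, the noise level, and the helper names `knn_predict`, `r2_score`, and `cv_r2` are all assumptions made for the example.

```python
import random

def knn_predict(train_X, train_y, x, k):
    """Average the targets of the k nearest training points (Euclidean distance)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return sum(train_y[i] for i in nearest) / k

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def cv_r2(X, y, k, folds=5, seed=0):
    """Mean R² of a k-nearest-neighbour regressor over `folds` shuffled folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        preds = [knn_predict(train_X, train_y, X[i], k) for i in test]
        scores.append(r2_score([y[i] for i in test], preds))
    return sum(scores) / folds

# Synthetic data (invented): rent grows with area, plus Gaussian noise.
rng = random.Random(42)
X = [[rng.uniform(20, 120)] for _ in range(200)]
y = [90 * x[0] + rng.gauss(0, 500) for x in X]

scores = {k: cv_r2(X, y, k) for k in range(1, 11)}
best_k = max(scores, key=scores.get)
print("best k:", best_k)
```

With several features on different scales, the features would normally be standardized before computing distances; that step is omitted here because the toy data has a single feature.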

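The evaluation metrics of KNN at k = 6 are compared with those of linear regression (LR) in the table below.
[Table: evaluation metrics of KNN (k = 6) versus linear regression]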

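From the table above, the KNN prediction model performs worse than LR, but both support the conclusion that "must-see" listings do offer better value for money.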

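Note, however, that the two models have similar deviations but clearly different percentages, which is puzzling. Scatter plots of predicted versus actual prices of "must-see" listings for the two models are shown below. LR underpredicts low-priced listings, while KNN overpredicts them. This may be caused by the uneven sample distribution: when predicting a low-priced listing, KNN is pulled toward the mid-priced listings, which are far more numerous, so its predictions for low-priced listings come out too high.
[Figures: predicted versus actual price of "must-see" listings, for LR and for KNN]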

Area Prediction

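Following the analysis in the Basic Approach section, we can likewise train an area prediction model on the non-"must-see" data, with area as the target. Feeding in the other feature values of a "must-see" home amounts to generating a non-"must-see" home identical to it in everything except area, together with a predicted area for that home. Comparing this predicted area with the actual area of the "must-see" home indicates which offers better value for money and so verifies the conclusion.
For this model we record a coefficient of determination of R² = 0.7579 on the non-"must-see" data. The mean deviation between the predicted and actual area of "must-see" listings (positive when the predicted area is larger) is -1.5355 m², that is, the predicted area is 1.5355 m² smaller than the actual area. "Must-see" listings therefore offer more area than comparable non-"must-see" listings, i.e. better value for money, which supports the conclusion. Area is objective data, so predicting it is somewhat contrary to common sense; it is used here only to verify the conclusion.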

Model Accuracy

Having verified the conclusion in several ways, the initial goal is largely achieved. We still want to return to the earlier price prediction model and explore whether its accuracy can be improved further.

Feature Engineering

The problem in feature engineering is similar to the one in the hypothesis test: we have no experience in constructing good value-for-money indicators. So we again start from removing irrelevant indicators and controlling possible multicollinearity among the indicators.

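We first consider possible multicollinearity. The three floor-category features and the bedroom, bathroom, and living-room count features have high variance inflation factors (VIF), largely matching the highly correlated indicators found in the correlation analysis. Keeping one floor-category feature and the bedroom count and dropping the others, we retrained the model, but R² did not directly improve.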
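For reference, a variance inflation factor can be computed as the corresponding diagonal entry of the inverse of the feature correlation matrix. The sketch below is a plain-Python illustration on invented toy columns; the helper names and the data are assumptions, and a real analysis would more likely rely on a statistics library.

```python
from statistics import mean, pstdev

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long numeric columns."""
    n = len(columns[0])
    standardized = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        standardized.append([(v - m) / s for v in col])
    return [
        [sum(a * b for a, b in zip(zi, zj)) / n for zj in standardized]
        for zi in standardized
    ]

def invert(matrix):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    n = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        p = aug[c][c]
        aug[c] = [v / p for v in aug[c]]
        for r in range(n):
            if r != c:
                factor = aug[r][c]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[c])]
    return [row[n:] for row in aug]

def vif(columns):
    """VIF_j = 1 / (1 - R_j²), read off the diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]

# Toy columns, invented for illustration.
bedrooms = [1, 2, 3, 1, 2, 3, 2, 1, 3, 2]
bathrooms = [1, 1, 2, 1, 2, 2, 1, 1, 2, 1]
floor = [3, 10, 5, 18, 7, 2, 15, 9, 12, 6]

vifs = vif([bedrooms, bathrooms, floor])
for name, value in zip(["bedrooms", "bathrooms", "floor"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

In this toy data the bedroom and bathroom counts move together, so their VIFs come out higher than that of the floor column, which is the pattern that motivates dropping one feature from a correlated group.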
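Second, we explore the threshold used in variance filtering. The original model used a variance threshold of 0.02_0.98, giving a model with 34 features and an R² of 0.6681. Lowering the threshold to 0.01_0.99 gives a model with 66 features and an R² of 0.7046, but the deviation value decreases along with it.
(The results of this part fell short of expectations, and no better method came to mind.)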
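The variance filtering step might look like the sketch below. It assumes, and this is only a guess at the notation, that a threshold written 0.02_0.98 means dropping 0/1 dummy features whose variance p(1 - p) does not exceed 0.02 × 0.98, i.e. features that are 1 in fewer than 2% or more than 98% of rows. The function name `variance_filter` and the toy columns are hypothetical.

```python
def variance_filter(columns, p_low=0.02):
    """Keep 0/1 columns whose variance p * (1 - p) exceeds p_low * (1 - p_low)."""
    threshold = p_low * (1 - p_low)
    kept = {}
    for name, values in columns.items():
        p = sum(values) / len(values)  # share of rows where the dummy is 1
        if p * (1 - p) > threshold:
            kept[name] = values
    return kept

# Toy one-hot columns, invented for illustration: 1000 rows each.
columns = {
    "district_rare": [1] * 15 + [0] * 985,     # p = 0.015
    "district_common": [1] * 300 + [0] * 700,  # p = 0.300
}

print(list(variance_filter(columns, p_low=0.02)))  # stricter threshold
print(list(variance_filter(columns, p_low=0.01)))  # looser threshold keeps more features
```

Lowering the threshold lets rarer categories back into the model, which is consistent with the feature count rising from 34 to 66 in the text.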

Feature Importance

Because verifying the conclusion required meaningful deviation values, no regularization was applied earlier. Now that the conclusion has been verified, we retrain the model with regularization from a model-evaluation standpoint, in order to explore which features matter most in the rent prediction model.

In the regression model trained with regularization, a feature's importance is the absolute value of its coefficient. The ten most important features and the absolute values of their coefficients are shown below. Apart from area and floor count, all of them are district categorical indicators. That area and district dominate the top ten is consistent with how well our value-for-money index, built from those same factors, supported the conclusion in the hypothesis test. Unexpectedly, floor count is also among the important features; its influence should indeed not be ignored in feature engineering.

[Figure: top ten features by absolute coefficient value]
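Ranking features by the absolute value of their coefficients takes only a few lines. The coefficient values and feature names below are invented placeholders, not results from the project, and the helper `top_features` is hypothetical.

```python
def top_features(coefficients, n=10):
    """Return the n (name, coefficient) pairs with the largest absolute coefficient."""
    return sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)[:n]

# Invented coefficients, for illustration only.
coefficients = {
    "area": 0.82,
    "district_A": -0.55,
    "district_B": 0.47,
    "floor_count": 0.21,
    "has_elevator": 0.03,
}

for name, coef in top_features(coefficients, n=3):
    print(f"{name}: |coef| = {abs(coef):.2f}")
```

Coefficient magnitudes are only comparable across features when the features share a scale, so this ranking presumes the inputs were standardized or are all 0/1 dummies before fitting.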

Original: https://blog.csdn.net/weixin_43840683/article/details/122717584
Author: weixin_43840683
Title: 数据挖掘与数据分析项目链家租房数据(三)进一步探索与归纳
