2022-11-13 | Python

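tsfresh is an open-source Python package for extracting features from time series data. It can extract more than 64 kinds of features, which makes it something of a Swiss Army knife for time series feature extraction. I have been studying it recently for work. There is no Chinese documentation yet and the meaning of some features is hard to grasp, so I am collecting here the ones I have understood so far. For the ones I have not figured out I have only written the heading, and I will add notes once I understand them.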

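Sum of squares of the time series.

Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the feature value
Return type: float
Function type: simple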

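Returns the sum of the absolute values of the consecutive changes in the series x.

Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the feature value
Return type: float
Function type: simple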
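The two features above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, assuming the descriptions refer to what tsfresh calls `abs_energy` and `absolute_sum_of_changes`; the helper names below are my own and the input is a plain list rather than a pandas.Series.

```python
def sum_of_squares(x):
    """Sum of the squared values of the series."""
    return float(sum(v * v for v in x))


def absolute_sum_of_changes(x):
    """Sum of |x[i+1] - x[i]| over all consecutive pairs."""
    return float(sum(abs(b - a) for a, b in zip(x, x[1:])))


series = [1, 3, 2, 5]
print(sum_of_squares(series))           # 1 + 9 + 4 + 25 = 39.0
print(absolute_sum_of_changes(series))  # 2 + 1 + 3 = 6.0
```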

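Computes the autocorrelation R(l) for a range of lags l and then applies the aggregation function f_agg (for example the variance or the mean) to those values. To some extent this measures how periodic the data is: if the autocorrelation at some lag is large, the time series is likely to be periodic.

R(l) = \frac{1}{(n-l)\sigma^{2}} \sum_{t=1}^{n-l} (X_t - \mu)(X_{t+l} - \mu)

Here n is the length of the time series, \sigma^{2} is its variance and \mu is its mean.
Parameters: x (pandas.Series) – the time series to calculate the feature of
Returns: the feature value
Return type: float
Function type: simple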
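As a rough sketch of the idea, the code below computes the lag-l autocorrelation as the sum of mean-centred products divided by (n - l) times the variance, then aggregates over several lags. The function names and the `maxlag` argument are assumptions made for this illustration, not a copy of the tsfresh implementation.

```python
from statistics import fmean, pvariance


def autocorrelation(x, lag):
    """Lag-`lag` autocorrelation: centred products over (n - lag) * variance."""
    n = len(x)
    mu = fmean(x)
    var = pvariance(x, mu)
    total = sum((x[t] - mu) * (x[t + lag] - mu) for t in range(n - lag))
    return total / ((n - lag) * var)


def agg_autocorrelation(x, f_agg, maxlag):
    """Aggregate the autocorrelations at lags 1..maxlag with f_agg."""
    maxlag = min(maxlag, len(x) - 1)
    return f_agg([autocorrelation(x, lag) for lag in range(1, maxlag + 1)])


periodic = [0, 1, 0, -1] * 5  # a signal with period 4
print(autocorrelation(periodic, 4))          # close to 1: one full period
print(autocorrelation(periodic, 2))          # close to -1: half a period
print(agg_autocorrelation(periodic, max, 4))
```

A large value at lag 4 and a strongly negative value at lag 2 is the pattern one would expect from a series that repeats every four samples.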

Splits the time series into chunks, aggregates each chunk (max, min, mean or median), and then fits a linear regression to the aggregated values, yielding pvalue, rvalue (correlation coefficient), intercept, slope and stderr (standard error of the estimated slope).
Parameters: x (pandas.Series) – the time series to calculate the feature of
param (list) – contains dictionaries {"attr": x, "chunk_len": l, "f_agg": f} with x, f being strings and l an int
Returns: the different feature values
Return type: pandas.Series
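The chunk-then-regress idea can be sketched as follows. This is a simplified illustration in plain Python that only returns slope, intercept and rvalue (pvalue and stderr are left out); the function names are my own, and at least two chunks are needed for the fit to be defined.

```python
from statistics import fmean


def linear_fit(y):
    """Least-squares line through (0, y[0]), (1, y[1]), ...: slope, intercept, r."""
    xs = range(len(y))
    mx, my = fmean(xs), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in xs)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(xs, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rvalue = sxy / (sxx * syy) ** 0.5 if syy > 0 else 0.0
    return slope, intercept, rvalue


def agg_linear_trend(x, chunk_len, f_agg):
    """Aggregate consecutive chunks of length chunk_len, then fit a line."""
    chunks = [x[i:i + chunk_len] for i in range(0, len(x), chunk_len)]
    return linear_fit([f_agg(chunk) for chunk in chunks])


series = [1, 2, 3, 4, 5, 6, 7, 8, 9]
slope, intercept, rvalue = agg_linear_trend(series, chunk_len=3, f_agg=max)
print(slope, intercept, rvalue)  # the chunk maxima 3, 6, 9 lie on y = 3 + 3 * i
```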

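Approximate entropy, which quantifies the regularity and the unpredictability of fluctuations in a time series.

Coefficients of an autoregressive (AR) model.

Autocorrelation coefficient at the given lag.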

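Divides the value range of the series into max_bins equal-width bins, assigns every value to its bin, and then computes the entropy of the resulting distribution:

-\sum_{k} p_k \log(p_k)

where p_k is the fraction of values that fall into the k-th bin.
This feature measures how evenly the sample values are distributed.
Parameters: x (pandas.Series) – the time series to calculate the feature of
max_bins (int) – the number of bins
Returns: the feature value
Return type: float
Function type: simple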
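A minimal sketch of the binning and entropy computation, in plain Python. It assumes equal-width bins between the minimum and maximum, with the maximum placed in the last bin (the convention numpy.histogram uses), and the natural logarithm; the function name is chosen for this illustration.

```python
import math


def binned_entropy(x, max_bins):
    """Entropy of the histogram of x over max_bins equal-width bins."""
    lo, hi = min(x), max(x)
    if hi == lo:
        return 0.0  # every value falls into a single bin
    width = (hi - lo) / max_bins
    counts = [0] * max_bins
    for v in x:
        # the maximum value is placed in the last bin
        k = min(int((v - lo) / width), max_bins - 1)
        counts[k] += 1
    probs = [c / len(x) for c in counts if c > 0]
    return -sum(p * math.log(p) for p in probs)


even = binned_entropy([1, 2, 3, 4], max_bins=4)    # one value per bin
skewed = binned_entropy([1, 1, 1, 4], max_bins=4)  # most values in one bin
print(even, skewed)
```

The evenly spread series reaches the maximum possible entropy for four bins, ln(4), while the concentrated one scores lower.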

Equivalent to:

A measure of the nonlinearity of the time series.

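First fixes a corridor in x between the quantiles ql and qh, then aggregates the consecutive changes (or their absolute values) of the series inside this corridor, for example by taking their mean.

Parameters:
x (pandas.Series) – the time series
ql (float) – the lower quantile of the corridor
qh (float) – the upper quantile of the corridor
isabs (bool) – whether to use the absolute values of the changes
f_agg (str, name of a numpy function (e.g. mean, var, std, median)) – the numpy aggregation function to apply (mean, variance, standard deviation, median)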
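The corridor idea can be sketched like this. It is a simplified illustration: the quantile uses linear interpolation, both corridor boundaries are treated as inclusive, and f_agg is passed as a callable instead of a string, all of which are assumptions made here rather than details confirmed by the source.

```python
from statistics import fmean


def quantile(sorted_x, q):
    """q-quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_x) - 1)
    low = int(pos)
    high = min(low + 1, len(sorted_x) - 1)
    return sorted_x[low] + (pos - low) * (sorted_x[high] - sorted_x[low])


def change_quantiles(x, ql, qh, isabs, f_agg=fmean):
    """Aggregate the changes between consecutive values that both lie in the corridor."""
    ordered = sorted(x)
    lower, upper = quantile(ordered, ql), quantile(ordered, qh)
    inside = [lower <= v <= upper for v in x]
    changes = [
        abs(b - a) if isabs else b - a
        for a, b, in_a, in_b in zip(x, x[1:], inside, inside[1:])
        if in_a and in_b
    ]
    return f_agg(changes) if changes else 0.0


series = [0, 1, 9, 2, 3, 10, 4]
result = change_quantiles(series, ql=0.0, qh=0.7, isabs=True)
print(result)  # only the steps 0 -> 1 and 2 -> 3 stay inside the corridor
```

The spikes 9 and 10 fall above the upper quantile, so every change that touches them is ignored.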

Estimates the complexity of the time series; a more complex series has more peaks and valleys.
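One common way to turn this into a number is the square root of the sum of squared consecutive differences. The sketch below assumes that this is the estimate meant here (tsfresh exposes such a feature as `cid_ce`) and leaves out any normalisation of the series.

```python
import math


def complexity_estimate(x):
    """Square root of the sum of squared consecutive differences."""
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(x, x[1:])))


flat = [0, 0, 0, 0, 0, 0]
wavy = [0, 1, 0, 1, 0, 1]
print(complexity_estimate(flat))  # 0.0
print(complexity_estimate(wavy))  # sqrt(5), about 2.236
```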

Number of values greater than the mean.

Number of values less than the mean.

Calculates the sum of squares of chunk i out of N chunks expressed as a ratio with the sum of squares over the whole series.

Takes as input parameters the number num_segments of segments to divide the series into and segment_focus which is the segment number (starting at zero) to return a feature on.

If the length of the time series is not a multiple of the number of segments, the remaining data points are distributed on the bins starting from the first. For example, if your time series consists of 8 entries, the first two bins will contain 3 and the last two values, e.g. [ 0., 1., 2.], [ 3., 4., 5.] and [ 6., 7.].

Note that the answer for num_segments = 1 is a trivial "1" but we handle this scenario in case somebody calls it. Sum of the ratios should be 1.0.

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
param – contains dictionaries {"num_segments": N, "segment_focus": i} with N, i both ints
Returns: the feature values
Return type: list of tuples (index, data)
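The segmenting rule and the ratio can be sketched in plain Python as follows. The helper names are chosen for this illustration; the splitting mirrors the rule described above, where leftover points go to the first segments.

```python
def split_into_segments(x, num_segments):
    """Split x into num_segments parts; leftover points go to the first parts."""
    base, extra = divmod(len(x), num_segments)
    segments, start = [], 0
    for i in range(num_segments):
        stop = start + base + (1 if i < extra else 0)
        segments.append(x[start:stop])
        start = stop
    return segments


def energy_ratio_by_chunks(x, num_segments, segment_focus):
    """Sum of squares of one segment divided by the sum of squares of x."""
    total = sum(v * v for v in x)
    focus = split_into_segments(x, num_segments)[segment_focus]
    return sum(v * v for v in focus) / total


series = [0., 1., 2., 3., 4., 5., 6., 7.]
print(split_into_segments(series, 3))
# [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0]]
ratios = [energy_ratio_by_chunks(series, 3, i) for i in range(3)]
print(ratios, sum(ratios))  # the ratios add up to 1 (up to rounding)
```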

Returns the spectral centroid (mean), variance, skew, and kurtosis of the absolute fourier transform spectrum.

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {"aggtype": s} where s str and in ["centroid", "variance", "skew", "kurtosis"]
Returns: the different feature values
Return type: pandas.Series

This function is of type: combiner

Calculates the Fourier coefficients of the one-dimensional discrete Fourier transform for real input, using the fast Fourier transform (FFT) algorithm.

The resulting coefficients are complex. This feature calculator can return the real part (attr=="real"), the imaginary part (attr=="imag"), the absolute value (attr=="abs") and the angle in degrees (attr=="angle").

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {"coeff": x, "attr": s} with x int and x >= 0, s str and in ["real", "imag", "abs", "angle"]
Returns: the different feature values
Return type: pandas.Series

This function is of type: combiner
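To make the four attributes concrete, here is a small sketch that evaluates a single coefficient with the direct DFT sum from the standard library. It is meant only to show what "real", "imag", "abs" and "angle" refer to; a real implementation would use an FFT routine, and the function name is reused here purely for illustration.

```python
import cmath
import math


def fft_coefficient(x, coeff, attr):
    """One attribute of the coeff-th discrete Fourier coefficient of a real series."""
    n = len(x)
    # direct DFT sum; an FFT computes the same value faster
    value = sum(v * cmath.exp(-2j * math.pi * coeff * t / n) for t, v in enumerate(x))
    if attr == "real":
        return value.real
    if attr == "imag":
        return value.imag
    if attr == "abs":
        return abs(value)
    if attr == "angle":
        return math.degrees(cmath.phase(value))
    raise ValueError(f"unknown attr: {attr!r}")


series = [1.0, 0.0, -1.0, 0.0]  # one cosine period sampled at 4 points
print(fft_coefficient(series, 0, "abs"))  # about 0: the series has zero mean
print(fft_coefficient(series, 1, "abs"))  # about 2: all energy sits at frequency 1
```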

Position of the first occurrence of the maximum.

Position of the first occurrence of the minimum.

Coefficients of the polynomial h(x) that has been fitted to the deterministic dynamics of the Langevin model, as described by [1].

For short time series this method is highly dependent on the parameters.

References

[1] Friedrich et al. (2000): Extracting model equations from experimental data. Physics Letters A 271, p. 217-222

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {"m": x, "r": y, "coeff": z} with x a positive integer, the order of the polynomial to fit for estimating the fixed points of the dynamics, y a positive float, the number of quantiles to use for averaging, and z a positive integer corresponding to the returned coefficient
Returns: the different feature values
Return type: pandas.Series

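Whether any value occurs more than once.

Whether the maximum occurs more than once.

Whether the minimum occurs more than once.

Whether the standard deviation is greater than r times the range (maximum minus minimum).

Position of the last occurrence of the maximum.

Position of the last occurrence of the minimum.

Length of x.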

Calculate a linear least-squares regression for the values of the time series versus the sequence from 0 to length of the time series minus one. This feature assumes the signal to be uniformly sampled. It will not use the time stamps to fit the model. The parameters control which of the characteristics are returned.

Possible extracted attributes are "pvalue", "rvalue", "intercept", "slope", "stderr", see the documentation of linregress for more information.

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
param (list) – contains dictionaries {"attr": x} with x a string, the attribute name of the regression model
Returns: the different feature values
Return type: pandas.Series

This function is of type: combiner

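Length of the longest consecutive subsequence of values greater than the mean.

Length of the longest consecutive subsequence of values less than the mean.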
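Both run-length features can be sketched with itertools.groupby, which groups consecutive equal flags. The function name and the `above` switch are choices made for this illustration.

```python
from itertools import groupby
from statistics import fmean


def longest_strike(x, above=True):
    """Length of the longest run of consecutive values above (or below) the mean."""
    mu = fmean(x)
    flags = [(v > mu) if above else (v < mu) for v in x]
    runs = [sum(1 for _ in run) for flag, run in groupby(flags) if flag]
    return max(runs, default=0)


series = [1, 5, 6, 7, 1, 8, 1, 1]  # the mean is 3.75
print(longest_strike(series, above=True))   # 3, from the run 5, 6, 7
print(longest_strike(series, above=False))  # 2, from the trailing 1, 1
```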

Largest fixed point of the dynamics, argmax_x {h(x) = 0}, estimated from the polynomial h(x) that has been fitted to the deterministic dynamics of the Langevin model, as described by

Friedrich et al. (2000): Extracting model equations from experimental data. Physics Letters A 271, p. 217-222

For short time series this method is highly dependent on the parameters.

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
m (int) – order of the polynomial to fit for estimating the fixed points of the dynamics
r (float) – number of quantiles to use for averaging
Returns: the largest fixed point of the deterministic dynamics
Return type: float

Maximum.

Mean of the absolute values of the consecutive changes.

Mean of the consecutive changes.

Median.

Minimum.

Calculates the number of crossings of x on m. A crossing is defined as two sequential values where the first value is lower than m and the next is greater, or vice-versa. If you set m to zero, you will get the number of zero crossings.

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
m (float) – the threshold for the crossing
Returns: the value of this feature
Return type: int
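The crossing count defined above can be sketched by flagging which values lie above m and counting how often that flag flips between neighbours. Treating values exactly equal to m as "not above" is an assumption of this sketch.

```python
def number_crossing_m(x, m):
    """Count how often consecutive values switch sides of the threshold m."""
    above = [v > m for v in x]
    return sum(1 for a, b in zip(above, above[1:]) if a != b)


series = [-2, -1, 1, 2, -1, 3]
print(number_crossing_m(series, 0))  # 3 zero crossings
```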

This feature calculator searches for different peaks in x. To do so, x is smoothed by a Ricker wavelet for widths ranging from 1 to n. It returns the number of peaks that occur at enough width scales and with a sufficiently high signal-to-noise ratio (SNR).

Parameters: x (numpy.ndarray) – the time series to calculate the feature of
n (int) – maximum width to consider
Returns: the value of this feature
Return type: int

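Number of peaks.

len(different values occurring more than once) / len(different values)
That is, the number of distinct values that occur more than once, divided by the total number of distinct values (each repeated value is counted only once).

Number of data points whose value occurs more than once / total number of data points.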
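The difference between the two ratios is easiest to see side by side. The sketch below uses collections.Counter; the function names are my own, and the second ratio is read here as "data points belonging to a repeated value, divided by all data points", which is an interpretation rather than something the text spells out.

```python
from collections import Counter


def reoccurring_values_ratio(x):
    """Distinct values seen more than once / number of distinct values."""
    counts = Counter(x)
    return sum(1 for c in counts.values() if c > 1) / len(counts)


def reoccurring_datapoints_ratio(x):
    """Data points whose value is seen more than once / number of data points."""
    counts = Counter(x)
    return sum(c for c in counts.values() if c > 1) / len(x)


series = [1, 1, 2, 3, 3, 3, 4]
print(reoccurring_values_ratio(series))      # 2 of 4 distinct values: 0.5
print(reoccurring_datapoints_ratio(series))  # 5 of 7 data points
```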

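Returns the q-quantile of x, that is, the value below which a fraction q of the data lies.

Number of values in x that lie between min and max.

Proportion of values that are more than r standard deviations away from the mean.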
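A minimal sketch of the last feature, assuming the population standard deviation and a strict "greater than" comparison; the function name is chosen for this illustration.

```python
from statistics import fmean, pstdev


def ratio_beyond_r_sigma(x, r):
    """Share of values farther than r standard deviations from the mean."""
    mu = fmean(x)
    sigma = pstdev(x)
    return sum(1 for v in x if abs(v - mu) > r * sigma) / len(x)


series = [0, 0, 0, 0, 10]  # mean 2, standard deviation 4
print(ratio_beyond_r_sigma(series, 1))  # only the 10 is that far out: 0.2
print(ratio_beyond_r_sigma(series, 3))  # nothing is 12 away from the mean: 0.0
```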

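Number of unique values in x divided by the length of x: len(set(x)) / len(x)

Standard deviation.

Number of data points that occur more than once.

Sum of the values that occur more than once.

Sum of all values.

Equivalent to:

Number of occurrences of value in x.

Whether the variance is greater than the standard deviation.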

Original: https://blog.51cto.com/xindoo/5484742
Author: xindoo
Title: python tsfresh特征中文详解 (tsfresh features explained in Chinese)

Title: Scraping proxy IPs from a website with Python and checking whether they work

Development environment

Python 3.8
PyCharm

Modules used

requests >>> pip install requests
parsel >>> pip install parsel

Structure of the proxy dictionary

proxies_dict = {
    "http": "http://" + ip + ":" + port,
    "https": "http://" + ip + ":" + port,
}
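As a small illustration of this structure, the hypothetical helper below builds the dictionary from an IP and a port. The helper name and the example address are placeholders introduced here; they are not part of the original script.

```python
def build_proxies(ip, port):
    """Build the mapping that requests expects in its `proxies` argument."""
    proxy = f"http://{ip}:{port}"
    # both schemes are routed through the same plain-HTTP proxy endpoint
    return {"http": proxy, "https": proxy}


# 203.0.113.5 is a documentation-only address, used here as a placeholder
print(build_proxies("203.0.113.5", "8080"))
# {'http': 'http://203.0.113.5:8080', 'https': 'http://203.0.113.5:8080'}
```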

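Implementation steps:

1. Import the modules

# HTTP request module
import requests  # third-party module: pip install requests
# regular expression module
import re  # built-in module
# data parsing module
import parsel  # third-party module: pip install parsel  >>> the selector library used by the Scrapy framework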

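2. Send a request to the target URL https://www.kuaidaili.com/free/

page = 1  # page number of the free proxy list to fetch
url = f'https://www.kuaidaili.com/free/inha/{page}/'  # the URL to request
# send a GET request to the URL with requests.get and store the returned data in response
response = requests.get(url)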

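3. Get the data: the response returned by the server (the page source)

print(response.text)

4. Parse the data and extract the content we want

Ways to parse the data:

Regular expressions: extract content directly from the string
XPath: extract content by tag nodes
CSS selectors: extract content by tag attributes

Use whichever is most convenient, or whichever you prefer.

Extracting data with regular expressions

re.findall() is the function used to extract data with a regular expression.
When in doubt, use .*? which matches any character except the newline \n; with re.S it matches newlines as well.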

ip_list = re.findall('(.*?)', response.text, re.S)
port_list = re.findall('(.*?)', response.text, re.S)
print(ip_list)
print(port_list)
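For a lazy group such as (.*?) to capture anything, it needs literal text on both sides to anchor it. The sketch below shows the idea on a made-up HTML snippet; the `data-title` attributes and the addresses are assumptions for illustration, and the markup of the real page may differ.

```python
import re

# Hypothetical table markup; the real page may be structured differently.
sample_html = """
<tr>
  <td data-title="IP">203.0.113.5</td>
  <td data-title="PORT">8080</td>
</tr>
<tr>
  <td data-title="IP">198.51.100.7</td>
  <td data-title="PORT">3128</td>
</tr>
"""

# the literal tags on both sides tell the lazy group where to start and stop
ip_list = re.findall(r'<td data-title="IP">(.*?)</td>', sample_html, re.S)
port_list = re.findall(r'<td data-title="PORT">(.*?)</td>', sample_html, re.S)
print(ip_list)    # ['203.0.113.5', '198.51.100.7']
print(port_list)  # ['8080', '3128']
```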

CSS selectors:

To extract data with CSS selectors, the downloaded HTML string (response.text) first has to be converted.

# #list > table > tbody > tr > td:nth-child(1)
# //*[@id="list"]/table/tbody/tr/td[1]
selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
ip_list = selector.css('#list tbody tr td:nth-child(1)::text').getall()
port_list = selector.css('#list tbody tr td:nth-child(2)::text').getall()
print(ip_list)
print(port_list)

Extracting data with XPath

selector = parsel.Selector(response.text)  # convert the HTML string into a Selector object
ip_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[1]/text()').getall()
port_list = selector.xpath('//*[@id="list"]/table/tbody/tr/td[2]/text()').getall()

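Build the proxies

lis = []  # every proxy that was scraped
for ip, port in zip(ip_list, port_list):
    # print(ip, port)
    proxy = ip + ':' + port
    proxies_dict = {
        "http": "http://" + proxy,
        "https": "http://" + proxy,
    }
    print(proxies_dict)
    lis.append(proxies_dict)

5. Check the quality of each proxy

lis_1 = []  # proxies that passed the check
for proxies_dict in lis:
    try:
        # a usable proxy has to answer within one second
        response = requests.get(url=url, proxies=proxies_dict, timeout=1)
        if response.status_code == 200:
            print('Current proxy IP: ', proxies_dict, 'is usable')
            lis_1.append(proxies_dict)
    except requests.RequestException:
        print('Current proxy IP: ', proxies_dict, 'request failed or timed out, check not passed')

print('Number of proxy IPs collected: ', len(lis))
print('Number of usable proxy IPs: ', len(lis_1))
print('Usable proxy IPs: ', lis_1)

In total 150 proxies were scraped, and testing showed that only one of them worked, so a paid service is still the better choice.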

Original: https://www.cnblogs.com/qshhl/p/15834836.html
Author: 松鼠爱吃饼干
Title: Python采集网站ip代理, 检测IP代理是否可用 (scraping proxy IPs with Python and checking whether they work)
