Kats is a toolkit for analyzing time series data: a lightweight, easy-to-use, and generalizable framework for performing time series analysis. Time series analysis is an essential part of data science and engineering work in industry, from understanding key statistics and characteristics, to detecting regressions and anomalies, to forecasting future trends. Kats aims to be a one-stop shop for time series analysis, covering detection, forecasting, feature extraction/embedding, multivariate analysis, and more.
A few clear advantages:
However, it offers no further deep learning models such as seq2seq or DeepAR. (From: Kats Time Series Analysis Tool: 01 Basic Usage)
Not sure whether it was a quirk of the author's Windows laptop, but installing Kats ran into quite a few problems.
Installing Kats raises this error:
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
This usually appears while installing prophet. In fact, you can skip Microsoft Visual C++ entirely and instead install:
conda install pystan==2.19.1.1
Note that pystan must be installed through conda; that is what actually gets around the Microsoft Visual C++ error, which otherwise cannot be resolved.
Another error:
SyntaxError: future feature annotations is not defined
This is a version issue (see the project's issues): Kats only supports Python 3.7/3.8, and the author was on 3.6, hence the error.
After switching to Python 3.7, the following problem appeared:
ImportError: cannot import name 'utils' from partially initialized module 'kats' (most likely due to a circular import)
Looking inside the installation folder, many supporting components were missing after the package install, so the author copied them in manually:
After copying, yet another error:
ModuleNotFoundError: No module named 'deprecated'
So yet another round of package installs was needed (pip install deprecated fixes this one). With that, everything finally worked...
The earlier post, Kats Time Series Analysis Tool: 01 Basic Usage, already translated the most basic of the official tutorials, kats_101_basics.ipynb.
Data source: data
There are two kinds of points of interest: outliers and changepoints.
Let's look at CUSUMDetector first.
A few parameters:
An example:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from kats.consts import TimeSeriesData
from kats.detectors.cusum_detection import CUSUMDetector

# synthesize data with simulation
np.random.seed(10)
df_increase_decrease = pd.DataFrame(
    {
        'time': pd.date_range('2019-01-01', periods=60),
        'increase': np.concatenate([np.random.normal(1, 0.2, 30), np.random.normal(2, 0.2, 30)]),
        'decrease': np.concatenate([np.random.normal(1, 0.3, 50), np.random.normal(0.5, 0.3, 10)]),
    }
)
# changepoint detection 1: increase
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','increase']])
detector = CUSUMDetector(tsd)
change_points = detector.detector(change_directions=["increase"])
plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()
# changepoint detection 2: decrease
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','decrease']])
detector = CUSUMDetector(tsd)
change_points = detector.detector(change_directions=["decrease"])
plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()
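Both calls return a list of changepoints. A minimal inspection sketch, assuming each changepoint exposes start_time and confidence as in the TimeSeriesChangePoint repr shown for RobustStatDetector below (attribute names can vary slightly across Kats versions):

# print when each changepoint occurred and how confident the test is
for cp in change_points:
    print(cp.start_time, cp.confidence)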
“Bayesian Online Changepoint Detection” (Adams & MacKay, 2007)
Bayesian Online Changepoint Detection (BOCPD) is a method for detecting sudden changes in a time series that persist over time.
An example:
from kats.utils.simulator import Simulator
sim = Simulator(n=450, start='2020-01-01', freq='H')
ts_bocpd = sim.level_shift_sim(noise=0.05, seasonal_period=1)
# plot the simulated data
ts_bocpd.plot(cols=['value'])
from kats.detectors.bocpd import BOCPDetector, BOCPDModelType, TrendChangeParameters
# Initialize the detector
detector = BOCPDetector(ts_bocpd)
changepoints = detector.detector(
    model=BOCPDModelType.NORMAL_KNOWN_MODEL  # this is the default choice
)
# Plot the data
plt.xticks(rotation=45)
detector.plot(changepoints)
plt.show()
cp = changepoints[0]
cp
The specific changepoint found:
BOCPDChangePoint(start_time: 2020-01-05T03:00:00.000000000, end_time: 2020-01-05T03:00:00.000000000, confidence: 0.9752663452994146, model: <BOCPDModelType.NORMAL_KNOWN_MODEL: 1>, ts_name: value)
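The full list of changepoints can also be filtered by confidence, using the attributes visible in the BOCPDChangePoint repr above:

# keep only the changepoints the detector is very sure about
high_conf = [cp for cp in changepoints if cp.confidence > 0.9]
for cp in high_conf:
    print(cp.start_time, cp.confidence)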
RobustStatDetector is also mainly for changepoint detection, but it can detect multiple changepoints, so its results end up fairly similar to BOCPDetector's.
# import package
from kats.detectors.robust_stat_detection import RobustStatDetector
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','increase']])
detector = RobustStatDetector(tsd)
change_points = detector.detector()
plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()
change_point = change_points[0]
change_point
Result:
TimeSeriesChangePoint(start_time: 2019-01-31T00:00:00.000000000, end_time: 2019-01-31T00:00:00.000000000, confidence: 0.9932728251020368, metric: -0.32345014754902635, index: 30)
# load the air_passengers data set
air_passengers_outlier_df = pd.read_csv("~/Kats/kats/data/air_passengers.csv")
air_passengers_outlier_df.columns = ["time", "value"]
# manually add an outlier on '1950-12-01'
air_passengers_outlier_df.loc[air_passengers_outlier_df.time == '1950-12-01', 'value'] *= 5
# manually add an outlier on '1959-12-01'
air_passengers_outlier_df.loc[air_passengers_outlier_df.time == '1959-12-01', 'value'] *= 4
# transform the outlier data into `TimeSeriesData` Object
air_passengers_outlier_ts = TimeSeriesData(air_passengers_outlier_df)
# detect outliers
from kats.detectors.outlier import OutlierDetector
ts_outlierDetection = OutlierDetector(air_passengers_outlier_ts, 'additive') # call OutlierDetector
ts_outlierDetection.detector() # apply OutlierDetector
# the detected outliers
ts_outlierDetection.outliers[0]
Output:
[Timestamp('1950-12-01 00:00:00'),
Timestamp('1959-11-01 00:00:00'),
Timestamp('1959-12-01 00:00:00')]
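Having located the outliers, you will usually want to clean them out. A sketch using OutlierDetector's remover method, where interpolate=True fills the removed points by interpolation (the column name of the returned series, y_0 here, is an assumption that may differ across Kats versions):

# remove the detected outliers and interpolate over the gaps
air_passengers_ts_clean = ts_outlierDetection.remover(interpolate=True)
air_passengers_ts_clean.plot(cols=['y_0'])
plt.show()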
This anomaly detection approach is useful for catching anomalies across multiple time series at once. Anomalies are detected as deviations from predicted steady-state behavior, where the steady-state behavior of the metric system is forecast by modeling the linear correlations among the time series with a Vector Autoregression (VAR) model.
from kats.detectors.outlier import MultivariateAnomalyDetector, MultivariateAnomalyDetectorType
from kats.models.var import VARParams
import warnings
warnings.filterwarnings(action='ignore',category=FutureWarning)
# load the demo data into a TimeSeriesData object
multi_anomaly_df = pd.read_csv("~/Kats/kats/data/multivariate_anomaly_simulated_data.csv")
multi_anomaly_ts = TimeSeriesData(multi_anomaly_df)
# Plot the data
multi_anomaly_ts.plot(cols=multi_anomaly_ts.value.columns.tolist())
# anomaly detection with a VAR model
np.random.seed(10)
params = VARParams(maxlags=2)
d = MultivariateAnomalyDetector(multi_anomaly_ts, params, training_days=60)
anomaly_score_df = d.detector()
d.plot()
# get the anomalous time points
anomalies = d.get_anomaly_timepoints(alpha=0.05)
anomalies
The detected anomalies:
[Timestamp('2020-01-23 00:00:00'),
Timestamp('2020-01-24 00:00:00'),
Timestamp('2020-01-25 00:00:00'),
Timestamp('2020-01-26 00:00:00'),
Timestamp('2020-01-27 00:00:00'),
Timestamp('2020-01-28 00:00:00')]
Related metrics:
d.get_anomalous_metrics(anomalies[0], top_k=3)
With the get_anomalous_metrics function we can then look at the top metrics that drove the multivariate anomaly. For the anomaly times found above, we can verify that the largest anomaly scores come from metrics 5 and 6.
Trend detection tries to identify significant, long-lasting changes in a time series. Rather than pinpointing changepoints, it identifies windows of gradual, long-term change.
MKDetector is the trend detection algorithm included in Kats; it is based on the non-parametric Mann-Kendall (MK) test. Essentially, MKDetector applies the MK test to a sliding window (whose size is set by the window_size parameter of the detector method) and returns the end point of every window in which the test is statistically significant.
Trend windows are detected based on the monotonic increase or decrease of the series within the window, not on the magnitude by which its values change.
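To make the monotonicity idea concrete, here is a minimal, self-contained sketch of the Mann-Kendall S statistic and Kendall's tau that the test is built on; this is just the textbook definition, not Kats internals. The Tau field in the MKChangePoint output further below is this same statistic.

import numpy as np

def mann_kendall_s_tau(x):
    # S sums sign(x[j] - x[i]) over all pairs i < j; tau normalizes S by the
    # number of pairs, so tau nears +1 (-1) for a monotonically increasing
    # (decreasing) window, regardless of how large the changes are
    x = np.asarray(x)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    tau = s / (n * (n - 1) / 2)
    return s, tau

# a strictly increasing window gives tau == 1.0
print(mann_kendall_s_tau([1, 2, 4, 7, 11]))  # -> (10.0, 1.0)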
from kats.utils.simulator import Simulator
sim = Simulator(n=365, start='2020-01-01', freq='D')
tsd = sim.trend_shift_sim(noise=200, seasonal_period=7, seasonal_magnitude=0.007,
                          cp_arr=[250], intercept=10000, trend_arr=[40, -20])
# plot the simulated data
tsd.plot(cols=['value'])
plt.show()
# detect downward trend change points
from kats.detectors.trend_mk import MKDetector
detector = MKDetector(data=tsd, threshold=.8)
# run detector
detected_time_points = detector.detector(direction='down', window_size=20, freq='weekly')
# plot the results
detector.plot(detected_time_points)
plt.show()
# the result
cp = detected_time_points[0]
cp
Output:
MKChangePoint(start_time: 2020-10-03 00:00:00, end_time: 2020-10-03 00:00:00, confidence: 0.9999998244557637, detector_type: kats.detectors.trend_mk.MKDetector, is_multivariate: False, trend_direction: decreasing, Tau: -0.8526315789473684)
import sys
sys.path.append("../")
import numpy as np
import pandas as pd
import pprint
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from kats.consts import TimeSeriesData
from statsmodels.tsa.seasonal import STL
from kats.utils.simulator import Simulator
from sklearn.preprocessing import StandardScaler
from kats.tsfeatures.tsfeatures import TsFeatures
import warnings
warnings.simplefilter(action='ignore')
# data generation
sim = Simulator(n=90, freq="D", start="2021-01-01")  # simulate 90 days of data
random_seed = 100
# generate 10 TimeSeriesData with arima_sim
np.random.seed(random_seed) # setting numpy seed
arima_sim_list = [sim.arima_sim(ar=[0.1, 0.05], ma = [0.04, 0.1], d = 1) for _ in range(10)]
# generate 10 TimeSeriesData with trend shifts
trend_sim_list = [
    sim.trend_shift_sim(
        cp_arr=[30, 60, 75],
        trend_arr=[3, 15, 2, 8],
        intercept=30,
        noise=50,
        seasonal_period=7,
        seasonal_magnitude=np.random.uniform(10, 100),
        random_seed=random_seed
    ) for _ in range(10)
]
# generate 10 TimeSeriesData with level shifts
level_shift_list = [
    sim.level_shift_sim(
        cp_arr=[30, 60, 75],
        level_arr=[1.35, 1.05, 1.35, 1.2],
        noise=0.05,
        seasonal_period=7,
        seasonal_magnitude=np.random.uniform(0.1, 1.0),
        random_seed=random_seed
    ) for _ in range(10)
]
ts_list = arima_sim_list + trend_sim_list + level_shift_list
# inspect one of the simulated series
ts = ts_list[0]
# plot the time series
ts.plot(cols=['value'])
plt.xticks(rotation = 45)
plt.show()
# time series features
# Step 1. initiate TsFeatures
model = TsFeatures()
# Step 2. use .transform() method, and apply on the target time series data
output_features = model.transform(ts)
output_features
The features include:
{'length': 90,
'mean': -4.973228083549793,
'var': 50.69499812650379,
'entropy': 0.2742447620827895,
'lumpiness': 10.258210327109449,
'stability': 45.07760417461487,
'flat_spots': 1,
'hurst': 0.41884368965647256,
'std1st_der': 0.8773588739369633,
'crossing_points': 5,
'binarize_mean': 0.43333333333333335,
'unitroot_kpss': 0.41641147078333146,
'heterogeneity': 73.29527168434541,
'histogram_mode': -11.841676172131818,
'linearity': 0.8346355269096618,
'trend_strength': 0.9853025999592567,
'seasonality_strength': 0.3521955818150291,
'spikiness': 0.00020455870537077636,
'peak': 1,
'trough': 6,
'level_shift_idx': 23,
'level_shift_size': 0.7134342301151566,
'y_acf1': 0.9597578784708428,
'y_acf5': 4.0361834721280365,
'diff1y_acf1': 0.1830233735938267,
'diff1y_acf5': 0.0794760417768679,
'diff2y_acf1': -0.4816907863327952,
'diff2y_acf5': 0.24476824866108501,
'y_pacf5': 0.9862593061001357,
'diff1y_pacf5': 0.07981792144706332,
'diff2y_pacf5': 0.36145785941160113,
'seas_acf1': 0.8149983814152568,
'seas_pacf1': 0.03034496255047374,
'firstmin_ac': 53,
'firstzero_ac': 30,
'holt_alpha': 0.9999999850969992,
'holt_beta': 0.13656079676064467,
'hw_alpha': nan,
'hw_beta': nan,
'hw_gamma': nan}
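If you only need a subset of features, TsFeatures can be restricted at construction time; a small sketch assuming the selected_features argument of the TsFeatures constructor:

# compute only the named features instead of the full set
model_small = TsFeatures(selected_features=['lumpiness', 'entropy', 'stability'])
model_small.transform(ts)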
Cluster Similar Time Series: with this many generated features, we can first reduce the dimensionality with PCA and then cluster.
# compute features for all 30 simulated series; TsFeatures.transform returns a
# dict per series, so a list of dicts stacks into a DataFrame
df_features = pd.DataFrame([model.transform(ts) for ts in ts_list])

# performing dimension reduction on the time series samples
ls_features = ['lumpiness', 'entropy', 'seasonality_strength', 'stability', 'level_shift_size']
df_dataset = df_features[ls_features].copy()
x_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df_dataset.values))
df_dataset['pca_component_1'] = x_2d[:, 0]
df_dataset['pca_component_2'] = x_2d[:, 1]
# plot the PCA projections of each time series
plt.figure(figsize=(15, 8))
# Plot ARIMA time series in red
ax = df_dataset.iloc[0:10].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='red')
# Plot trend shift time series in green
df_dataset.iloc[10:20].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='green', ax=ax)
# Plot level shift time series in yellow
df_dataset.iloc[20:].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='yellow', ax=ax)
plt.title('Visualization of the dimension reduced time series samples')
plt.legend(['ARIMA', 'Trend Shift', 'Level Shift'])
plt.show()
Clustering results:
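The scatter plot only colors points by their known generating process; to actually cluster the PCA projections, any standard algorithm works. A sketch with sklearn's KMeans, where k=3 is a hypothetical choice matching the three simulators used above:

from sklearn.cluster import KMeans

# cluster the 2-D PCA projections into 3 groups
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df_dataset['cluster'] = kmeans.fit_predict(x_2d)
df_dataset['cluster'].value_counts()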
Currently, the following 10 basic forecasting models are supported:
Every model follows the sklearn model API pattern: we create an instance of the model class, then call its fit and predict methods. Below are brief examples of the Prophet and Theta models; a deeper dive into Kats forecasting will follow in a later post.
Prophet needs little introduction: it is Facebook's most classic and widely used time series forecasting package. v1.0 has since been released, but Kats uses 0.7.
# import the Prophet model and its parameter class
from kats.models.prophet import ProphetModel, ProphetParams

# load the clean air_passengers data and build a TimeSeriesData object
air_passengers_df = pd.read_csv("~/Kats/kats/data/air_passengers.csv")
air_passengers_df.columns = ["time", "value"]
air_passengers_ts = TimeSeriesData(air_passengers_df)

# create a parameter instance; the parameters are the same as native Prophet's,
# here with multiplicative seasonality
params = ProphetParams(seasonality_mode='multiplicative', interval_width=0.80)
# create a model instance
model = ProphetModel(data=air_passengers_ts, params=params)
# fit the model
model.fit()
# forecast the next 30 months
forecast = model.predict(steps=30, freq='MS')
# plot the forecast
model.plot()
# the forecast is a pd.DataFrame: fcst is the point forecast,
# fcst_lower / fcst_upper are the lower / upper bounds
forecast.head()
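Since the forecast is a plain pd.DataFrame, a quick accuracy check needs no extra tooling. A hedged sketch of a manual holdout evaluation (Kats also ships backtesting utilities, not shown here): refit on all but the last 12 months and score the point forecasts with MAPE.

# hold out the last 12 observations, refit, and compute MAPE by hand
train_ts = TimeSeriesData(air_passengers_df[:-12])
holdout = air_passengers_df['value'].values[-12:]

m = ProphetModel(data=train_ts, params=params)
m.fit()
holdout_fcst = m.predict(steps=12, freq='MS')

mape = np.mean(np.abs(holdout_fcst['fcst'].values - holdout) / holdout)
print(f"holdout MAPE: {mape:.2%}")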
The Theta method (Assimakopoulos and Nikolopoulos, 2000) is a univariate forecasting method that fits two theta lines: a theta = 0 line (the linear regression of the series, capturing the long-term trend) and a theta = 2 line (which doubles the local curvature).
These are then combined to construct the forecast.
Before forecasting, the series is first tested for seasonality; if seasonality is detected, the series is deseasonalized and the computed forecast is then reseasonalized.
Hyndman and Billah (2003) showed that the Theta method performs close to simple exponential smoothing with drift.
In Kats, that underlying model is used to compute ThetaModel's prediction intervals.
Usage is similar to ProphetModel: initialize the model with its parameters, then call the fit and predict methods.
# import the model and its parameter class
from kats.models.theta import ThetaModel, ThetaParams

# create a parameter instance with the seasonal period
params = ThetaParams(m=12)
# create a model instance
model_theta = ThetaModel(data=air_passengers_ts, params=params)
# fit the model
model_theta.fit()
# forecast
forecast_theta = model_theta.predict(steps=30, alpha=0.2)
# visualize the forecast
model_theta.plot()
# show the forecast results
forecast_theta.head()
The author also tried the LSTM model; the results were mediocre, and there are rather few tunable parameters.
# LSTM model
from kats.models.lstm import LSTMModel, LSTMParams

# create a parameter instance
params = LSTMParams(hidden_size=16, time_window=24, num_epochs=100)
# create a model instance
lstm = LSTMModel(data=air_passengers_ts, params=params)
# fit the model
lstm.fit()
# forecast
fcst = lstm.predict(steps=30)
# visualize the forecast
lstm.plot()
# show the forecast results
fcst.head()
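To eyeball the three models side by side, the forecast frames can be overlaid on the observed series; a sketch assuming the forecast, forecast_theta, and fcst DataFrames from above each carry a time column alongside the fcst values (an assumption about the frame layout):

# overlay the three forecasts on the observed series
plt.figure(figsize=(12, 6))
plt.plot(pd.to_datetime(air_passengers_df['time']), air_passengers_df['value'], label='observed')
plt.plot(forecast['time'], forecast['fcst'], label='Prophet')
plt.plot(forecast_theta['time'], forecast_theta['fcst'], label='Theta')
plt.plot(fcst['time'], fcst['fcst'], label='LSTM')
plt.legend()
plt.show()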