Kats is a toolkit for analyzing time series data: a lightweight, easy-to-use, and generalizable framework for performing time series analysis. Time series analysis is an essential part of data science and engineering work in industry, from understanding key statistics and characteristics, to detecting regressions and anomalies, to forecasting future trends. Kats aims to be a one-stop shop for time series analysis, covering detection, forecasting, feature extraction/embedding, multivariate analysis, and more.
A few clear advantages:
However, it offers no further deep learning models such as seq2seq or DeepAR. (From: Kats Time Series Analysis Tool: 01 Basic Usage)
Not sure whether it was a quirk of the author's Windows laptop, but installing Kats ran into quite a few problems.
Installing Kats raises this error:
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
This usually appears while installing prophet. In fact, you can skip Microsoft Visual C++ entirely and instead install:
conda install pystan==2.19.1.1
Note that pystan must be installed through conda; that is what actually gets around the Microsoft Visual C++ error, which otherwise cannot be resolved.
Another error:
SyntaxError: future feature annotations is not defined
This is a version issue (see the project's issues): Kats only supports Python 3.7/3.8, and the author was on 3.6, hence the error.
After switching to Python 3.7, the following problem appeared:
ImportError: cannot import name 'utils' from partially initialized module 'kats' (most likely due to a circular import)
Looking inside the installation folder, many supporting components were missing after the package install, so the author copied them in manually:
After copying, yet another error:
ModuleNotFoundError: No module named 'deprecated'
So yet another round of package installs was needed (pip install deprecated fixes this one). With that, everything finally worked...
The earlier post, Kats Time Series Analysis Tool: 01 Basic Usage, already translated the most basic of the official tutorials, kats_101_basics.ipynb.
Data source: data
There are two kinds of points of interest: outliers and changepoints.
Let's look at CUSUMDetector first.
A few parameters:
An example:
# import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from kats.consts import TimeSeriesData
from kats.detectors.cusum_detection import CUSUMDetector

# synthesize data with simulation
np.random.seed(10)
df_increase_decrease = pd.DataFrame(
    {
        'time': pd.date_range('2019-01-01', periods=60),
        'increase': np.concatenate([np.random.normal(1, 0.2, 30), np.random.normal(2, 0.2, 30)]),
        'decrease': np.concatenate([np.random.normal(1, 0.3, 50), np.random.normal(0.5, 0.3, 10)]),
    }
)
# changepoint detection 1: increase
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','increase']])
detector = CUSUMDetector(tsd)
change_points = detector.detector(change_directions=["increase"])
plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()
# changepoint detection 2: decrease
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','decrease']])
detector = CUSUMDetector(tsd)
change_points = detector.detector(change_directions=["decrease"])
plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()
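Both calls return a list of changepoints. A minimal inspection sketch, assuming each changepoint exposes start_time and confidence as in the TimeSeriesChangePoint repr shown for RobustStatDetector below (attribute names can vary slightly across Kats versions):

# print when each changepoint occurred and how confident the test is
for cp in change_points:
    print(cp.start_time, cp.confidence)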
“Bayesian Online Changepoint Detection” (Adams & MacKay, 2007)
Bayesian Online Changepoint Detection (BOCPD) is a method for detecting sudden changes in a time series that persist over time.
An example:
from kats.utils.simulator import Simulator
sim = Simulator(n=450, start='2020-01-01', freq='H')
ts_bocpd = sim.level_shift_sim(noise=0.05, seasonal_period=1)
# plot the simulated data
ts_bocpd.plot(cols=['value'])
from kats.detectors.bocpd import BOCPDetector, BOCPDModelType, TrendChangeParameters
# Initialize the detector
detector = BOCPDetector(ts_bocpd)
changepoints = detector.detector(
    model=BOCPDModelType.NORMAL_KNOWN_MODEL  # this is the default choice
)
# Plot the data
plt.xticks(rotation=45)
detector.plot(changepoints)
plt.show()
cp = changepoints[0]
cp
The specific changepoint found:
BOCPDChangePoint(start_time: 2020-01-05T03:00:00.000000000, end_time: 2020-01-05T03:00:00.000000000, confidence: 0.9752663452994146, model: <BOCPDModelType.NORMAL_KNOWN_MODEL: 1>, ts_name: value)
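The full list of changepoints can also be filtered by confidence, using the attributes visible in the BOCPDChangePoint repr above:

# keep only the changepoints the detector is very sure about
high_conf = [cp for cp in changepoints if cp.confidence > 0.9]
for cp in high_conf:
    print(cp.start_time, cp.confidence)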
RobustStatDetector is also mainly for changepoint detection, but it can detect multiple changepoints, so its results end up fairly similar to BOCPDetector's.
# import package
from kats.detectors.robust_stat_detection import RobustStatDetector
tsd = TimeSeriesData(df_increase_decrease.loc[:,['time','increase']])
detector = RobustStatDetector(tsd)
change_points = detector.detector()
plt.xticks(rotation=45)
detector.plot(change_points)
plt.show()
change_point = change_points[0]
change_point
Result:
TimeSeriesChangePoint(start_time: 2019-01-31T00:00:00.000000000, end_time: 2019-01-31T00:00:00.000000000, confidence: 0.9932728251020368, metric: -0.32345014754902635, index: 30)
# load the air_passengers data set
air_passengers_outlier_df = pd.read_csv("~/Kats/kats/data/air_passengers.csv")
air_passengers_outlier_df.columns = ["time", "value"]
# manually add an outlier on '1950-12-01'
air_passengers_outlier_df.loc[air_passengers_outlier_df.time == '1950-12-01', 'value'] *= 5
# manually add an outlier on '1959-12-01'
air_passengers_outlier_df.loc[air_passengers_outlier_df.time == '1959-12-01', 'value'] *= 4
# transform the outlier data into `TimeSeriesData` Object
air_passengers_outlier_ts = TimeSeriesData(air_passengers_outlier_df)
# detect outliers
from kats.detectors.outlier import OutlierDetector
ts_outlierDetection = OutlierDetector(air_passengers_outlier_ts, 'additive') # call OutlierDetector
ts_outlierDetection.detector() # apply OutlierDetector
# the detected outliers
ts_outlierDetection.outliers[0]
Output:
[Timestamp('1950-12-01 00:00:00'),
Timestamp('1959-11-01 00:00:00'),
Timestamp('1959-12-01 00:00:00')]
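Having located the outliers, you will usually want to clean them out. A sketch using OutlierDetector's remover method, where interpolate=True fills the removed points by interpolation (the column name of the returned series, y_0 here, is an assumption that may differ across Kats versions):

# remove the detected outliers and interpolate over the gaps
air_passengers_ts_clean = ts_outlierDetection.remover(interpolate=True)
air_passengers_ts_clean.plot(cols=['y_0'])
plt.show()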
This anomaly detection approach is useful for catching anomalies across multiple time series at once. Anomalies are detected as deviations from predicted steady-state behavior, where the steady-state behavior of the metric system is forecast by modeling the linear correlations among the time series with a Vector Autoregression (VAR) model.
from kats.detectors.outlier import MultivariateAnomalyDetector, MultivariateAnomalyDetectorType
from kats.models.var import VARParams
import warnings
warnings.filterwarnings(action='ignore',category=FutureWarning)
# load the demo data into a TimeSeriesData object
multi_anomaly_df = pd.read_csv("~/Kats/kats/data/multivariate_anomaly_simulated_data.csv")
multi_anomaly_ts = TimeSeriesData(multi_anomaly_df)
# Plot the data
multi_anomaly_ts.plot(cols=multi_anomaly_ts.value.columns.tolist())
# anomaly detection with a VAR model
np.random.seed(10)
params = VARParams(maxlags=2)
d = MultivariateAnomalyDetector(multi_anomaly_ts, params, training_days=60)
anomaly_score_df = d.detector()
d.plot()
# get the anomalous time points
anomalies = d.get_anomaly_timepoints(alpha=0.05)
anomalies
The detected anomalies:
[Timestamp('2020-01-23 00:00:00'),
Timestamp('2020-01-24 00:00:00'),
Timestamp('2020-01-25 00:00:00'),
Timestamp('2020-01-26 00:00:00'),
Timestamp('2020-01-27 00:00:00'),
Timestamp('2020-01-28 00:00:00')]
Related metrics:
d.get_anomalous_metrics(anomalies[0], top_k=3)
With the get_anomalous_metrics function we can then look at the top metrics that drove the multivariate anomaly. For the anomaly times found above, we can verify that the largest anomaly scores come from metrics 5 and 6.
Trend detection tries to identify significant, long-lasting changes in a time series. Rather than pinpointing changepoints, it identifies windows of gradual, long-term change.
MKDetector is the trend detection algorithm included in Kats; it is based on the non-parametric Mann-Kendall (MK) test. Essentially, MKDetector applies the MK test to a sliding window (whose size is set by the window_size parameter of the detector method) and returns the end point of every window in which the test is statistically significant.
Trend windows are detected based on the monotonic increase or decrease of the series within the window, not on the magnitude by which its values change.
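To make the monotonicity idea concrete, here is a minimal, self-contained sketch of the Mann-Kendall S statistic and Kendall's tau that the test is built on; this is just the textbook definition, not Kats internals. The Tau field in the MKChangePoint output further below is this same statistic.

import numpy as np

def mann_kendall_s_tau(x):
    # S sums sign(x[j] - x[i]) over all pairs i < j; tau normalizes S by the
    # number of pairs, so tau nears +1 (-1) for a monotonically increasing
    # (decreasing) window, regardless of how large the changes are
    x = np.asarray(x)
    n = len(x)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    tau = s / (n * (n - 1) / 2)
    return s, tau

# a strictly increasing window gives tau == 1.0
print(mann_kendall_s_tau([1, 2, 4, 7, 11]))  # -> (10.0, 1.0)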
from kats.utils.simulator import Simulator
sim = Simulator(n=365, start='2020-01-01', freq='D')
tsd = sim.trend_shift_sim(noise=200, seasonal_period=7, seasonal_magnitude=0.007,
                          cp_arr=[250], intercept=10000, trend_arr=[40, -20])
# plot the simulated data
tsd.plot(cols=['value'])
plt.show()
# detect downward trend change points
from kats.detectors.trend_mk import MKDetector
detector = MKDetector(data=tsd, threshold=.8)
# run detector
detected_time_points = detector.detector(direction='down', window_size=20, freq='weekly')
# plot the results
detector.plot(detected_time_points)
plt.show()
# the result
cp = detected_time_points[0]
cp
Output:
MKChangePoint(start_time: 2020-10-03 00:00:00, end_time: 2020-10-03 00:00:00, confidence: 0.9999998244557637, detector_type: kats.detectors.trend_mk.MKDetector, is_multivariate: False, trend_direction: decreasing, Tau: -0.8526315789473684)
import sys
sys.path.append("../")
import numpy as np
import pandas as pd
import pprint
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from kats.consts import TimeSeriesData
from statsmodels.tsa.seasonal import STL
from kats.utils.simulator import Simulator
from sklearn.preprocessing import StandardScaler
from kats.tsfeatures.tsfeatures import TsFeatures
import warnings
warnings.simplefilter(action='ignore')
# data generation
sim = Simulator(n=90, freq="D", start="2021-01-01")  # simulate 90 days of data
random_seed = 100
# generate 10 TimeSeriesData with arima_sim
np.random.seed(random_seed) # setting numpy seed
arima_sim_list = [sim.arima_sim(ar=[0.1, 0.05], ma = [0.04, 0.1], d = 1) for _ in range(10)]
# generate 10 TimeSeriesData with trend shifts
trend_sim_list = [
    sim.trend_shift_sim(
        cp_arr=[30, 60, 75],
        trend_arr=[3, 15, 2, 8],
        intercept=30,
        noise=50,
        seasonal_period=7,
        seasonal_magnitude=np.random.uniform(10, 100),
        random_seed=random_seed
    ) for _ in range(10)
]
# generate 10 TimeSeriesData with level shifts
level_shift_list = [
    sim.level_shift_sim(
        cp_arr=[30, 60, 75],
        level_arr=[1.35, 1.05, 1.35, 1.2],
        noise=0.05,
        seasonal_period=7,
        seasonal_magnitude=np.random.uniform(0.1, 1.0),
        random_seed=random_seed
    ) for _ in range(10)
]
ts_list = arima_sim_list + trend_sim_list + level_shift_list
# inspect one of the simulated series
ts = ts_list[0]
# plot the time series
ts.plot(cols=['value'])
plt.xticks(rotation = 45)
plt.show()
# time series features
# Step 1. initiate TsFeatures
model = TsFeatures()
# Step 2. use .transform() method, and apply on the target time series data
output_features = model.transform(ts)
output_features
The features include:
{'length': 90,
'mean': -4.973228083549793,
'var': 50.69499812650379,
'entropy': 0.2742447620827895,
'lumpiness': 10.258210327109449,
'stability': 45.07760417461487,
'flat_spots': 1,
'hurst': 0.41884368965647256,
'std1st_der': 0.8773588739369633,
'crossing_points': 5,
'binarize_mean': 0.43333333333333335,
'unitroot_kpss': 0.41641147078333146,
'heterogeneity': 73.29527168434541,
'histogram_mode': -11.841676172131818,
'linearity': 0.8346355269096618,
'trend_strength': 0.9853025999592567,
'seasonality_strength': 0.3521955818150291,
'spikiness': 0.00020455870537077636,
'peak': 1,
'trough': 6,
'level_shift_idx': 23,
'level_shift_size': 0.7134342301151566,
'y_acf1': 0.9597578784708428,
'y_acf5': 4.0361834721280365,
'diff1y_acf1': 0.1830233735938267,
'diff1y_acf5': 0.0794760417768679,
'diff2y_acf1': -0.4816907863327952,
'diff2y_acf5': 0.24476824866108501,
'y_pacf5': 0.9862593061001357,
'diff1y_pacf5': 0.07981792144706332,
'diff2y_pacf5': 0.36145785941160113,
'seas_acf1': 0.8149983814152568,
'seas_pacf1': 0.03034496255047374,
'firstmin_ac': 53,
'firstzero_ac': 30,
'holt_alpha': 0.9999999850969992,
'holt_beta': 0.13656079676064467,
'hw_alpha': nan,
'hw_beta': nan,
'hw_gamma': nan}
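If you only need a subset of features, TsFeatures can be restricted at construction time; a small sketch assuming the selected_features argument of the TsFeatures constructor:

# compute only the named features instead of the full set
model_small = TsFeatures(selected_features=['lumpiness', 'entropy', 'stability'])
model_small.transform(ts)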
Cluster Similar Time Series: with this many generated features, we can first reduce the dimensionality with PCA and then cluster.
# compute features for all 30 simulated series; TsFeatures.transform returns a
# dict per series, so a list of dicts stacks into a DataFrame
df_features = pd.DataFrame([model.transform(ts) for ts in ts_list])

# performing dimension reduction on the time series samples
ls_features = ['lumpiness', 'entropy', 'seasonality_strength', 'stability', 'level_shift_size']
df_dataset = df_features[ls_features].copy()
x_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(df_dataset.values))
df_dataset['pca_component_1'] = x_2d[:, 0]
df_dataset['pca_component_2'] = x_2d[:, 1]
# plot the PCA projections of each time series
plt.figure(figsize=(15, 8))
# Plot ARIMA time series in red
ax = df_dataset.iloc[0:10].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='red')
# Plot trend shift time series in green
df_dataset.iloc[10:20].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='green', ax=ax)
# Plot level shift time series in yellow
df_dataset.iloc[20:].plot(x='pca_component_1', y='pca_component_2', kind='scatter', color='yellow', ax=ax)
plt.title('Visualization of the dimension reduced time series samples')
plt.legend(['ARIMA', 'Trend Shift', 'Level Shift'])
plt.show()
Clustering results:
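The scatter plot only colors points by their known generating process; to actually cluster the PCA projections, any standard algorithm works. A sketch with sklearn's KMeans, where k=3 is a hypothetical choice matching the three simulators used above:

from sklearn.cluster import KMeans

# cluster the 2-D PCA projections into 3 groups
kmeans = KMeans(n_clusters=3, random_state=0, n_init=10)
df_dataset['cluster'] = kmeans.fit_predict(x_2d)
df_dataset['cluster'].value_counts()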
Currently, the following 10 basic forecasting models are supported:
Every model follows the sklearn model API pattern: we create an instance of the model class, then call its fit and predict methods. Below are brief examples of the Prophet and Theta models; a deeper dive into Kats forecasting will follow in a later post.
Prophet needs little introduction: it is Facebook's most classic and widely used time series forecasting package. v1.0 has since been released, but Kats uses 0.7.
# import the Prophet model and its parameter class
from kats.models.prophet import ProphetModel, ProphetParams

# load the clean air_passengers data and build a TimeSeriesData object
air_passengers_df = pd.read_csv("~/Kats/kats/data/air_passengers.csv")
air_passengers_df.columns = ["time", "value"]
air_passengers_ts = TimeSeriesData(air_passengers_df)

# create a parameter instance; the parameters are the same as native Prophet's,
# here with multiplicative seasonality
params = ProphetParams(seasonality_mode='multiplicative', interval_width=0.80)
# create a model instance
model = ProphetModel(data=air_passengers_ts, params=params)
# fit the model
model.fit()
# forecast the next 30 months
forecast = model.predict(steps=30, freq='MS')
# plot the forecast
model.plot()
# the forecast is a pd.DataFrame: fcst is the point forecast,
# fcst_lower / fcst_upper are the lower / upper bounds
forecast.head()
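Since the forecast is a plain pd.DataFrame, a quick accuracy check needs no extra tooling. A hedged sketch of a manual holdout evaluation (Kats also ships backtesting utilities, not shown here): refit on all but the last 12 months and score the point forecasts with MAPE.

# hold out the last 12 observations, refit, and compute MAPE by hand
train_ts = TimeSeriesData(air_passengers_df[:-12])
holdout = air_passengers_df['value'].values[-12:]

m = ProphetModel(data=train_ts, params=params)
m.fit()
holdout_fcst = m.predict(steps=12, freq='MS')

mape = np.mean(np.abs(holdout_fcst['fcst'].values - holdout) / holdout)
print(f"holdout MAPE: {mape:.2%}")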
The Theta method (Assimakopoulos and Nikolopoulos, 2000) is a univariate forecasting method that fits two theta lines: a theta = 0 line (the linear regression of the series, capturing the long-term trend) and a theta = 2 line (which doubles the local curvature).
These are then combined to construct the forecast.
Before forecasting, the series is first tested for seasonality; if seasonality is detected, the series is deseasonalized and the computed forecast is then reseasonalized.
Hyndman and Billah (2003) showed that the Theta method performs close to simple exponential smoothing with drift.
In Kats, that underlying model is used to compute ThetaModel's prediction intervals.
Usage is similar to ProphetModel: initialize the model with its parameters, then call the fit and predict methods.
# import the model and its parameter class
from kats.models.theta import ThetaModel, ThetaParams

# create a parameter instance with the seasonal period
params = ThetaParams(m=12)
# create a model instance
model_theta = ThetaModel(data=air_passengers_ts, params=params)
# fit the model
model_theta.fit()
# forecast
forecast_theta = model_theta.predict(steps=30, alpha=0.2)
# visualize the forecast
model_theta.plot()
# show the forecast results
forecast_theta.head()
The author also tried the LSTM model; the results were mediocre, and there are rather few tunable parameters.
# LSTM model
from kats.models.lstm import LSTMModel, LSTMParams

# create a parameter instance
params = LSTMParams(hidden_size=16, time_window=24, num_epochs=100)
# create a model instance
lstm = LSTMModel(data=air_passengers_ts, params=params)
# fit the model
lstm.fit()
# forecast
fcst = lstm.predict(steps=30)
# visualize the forecast
lstm.plot()
# show the forecast results
fcst.head()
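To eyeball the three models side by side, the forecast frames can be overlaid on the observed series; a sketch assuming the forecast, forecast_theta, and fcst DataFrames from above each carry a time column alongside the fcst values (an assumption about the frame layout):

# overlay the three forecasts on the observed series
plt.figure(figsize=(12, 6))
plt.plot(pd.to_datetime(air_passengers_df['time']), air_passengers_df['value'], label='observed')
plt.plot(forecast['time'], forecast['fcst'], label='Prophet')
plt.plot(forecast_theta['time'], forecast_theta['fcst'], label='Theta')
plt.plot(fcst['time'], fcst['fcst'], label='LSTM')
plt.legend()
plt.show()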