Stress is the body's and mind's natural response to demanding or challenging situations. It is how the body reacts to external pressures or to internal thoughts and feelings. Stress can be triggered by many factors, such as work pressure, financial difficulties, relationship problems, health concerns, or major life events.
Stress-detection insights driven by data science and machine learning aim to predict the stress levels of individuals or populations. By analyzing a variety of data sources, such as physiological measurements, behavioral data, and environmental factors, predictive models can identify patterns and risk factors associated with stress.
This proactive approach enables timely intervention and tailored support. Stress prediction holds promise in healthcare, where it enables early detection and personalized intervention, and in occupational settings, where it can help optimize working conditions. It can also inform public health initiatives and policy decisions. By predicting stress, these models offer valuable insights for improving well-being and building resilience in individuals and communities.
Stress detection with machine learning begins with collecting, cleaning, and preprocessing data. Feature engineering techniques are then applied to extract meaningful information or to create new features that capture stress-related patterns. This may involve computing statistical measures, frequency-domain analysis, or time-series analysis to capture physiological or behavioral indicators of stress. Extracting or engineering relevant features improves model performance.
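As a hedged sketch of what such feature extraction might look like for a physiological signal, the snippet below computes a few statistical and frequency-domain features from a synthetic heart-rate trace; the signal, sampling rate, and feature choices are all assumptions for illustration, not part of the text dataset analyzed later in this article.
# Hypothetical example: statistical and frequency-domain features
# from a physiological signal (assumed 1 Hz heart-rate samples)
import numpy as np

def extract_features(signal, fs=1.0):
    # Statistical measures
    feats = {
        'mean': np.mean(signal),
        'std': np.std(signal),
        'min': np.min(signal),
        'max': np.max(signal),
    }
    # Frequency domain: frequency of the strongest non-DC component
    spectrum = np.abs(np.fft.rfft(signal - np.mean(signal)))
    freqs = np.fft.rfftfreq(len(signal), d=1/fs)
    feats['dominant_freq'] = freqs[np.argmax(spectrum)]
    return feats

# Synthetic 60-second heart-rate trace, for demonstration only
hr = 70 + 5*np.sin(2*np.pi*0.1*np.arange(60)) + np.random.randn(60)
print(extract_features(hr))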
Researchers train machine learning models such as logistic regression, support vector machines, decision trees, random forests, or neural networks on labeled data to classify stress levels. They evaluate model performance using metrics such as accuracy, precision, recall, and F1 score. Integrating a trained model into a real-world application enables real-time stress monitoring; continuous monitoring, updating, and user feedback are essential for improving accuracy.
When working with sensitive personal data related to stress, it is essential to consider ethics and privacy. Appropriate procedures for informed consent, data anonymization, and secure data storage should be followed to protect individuals' privacy and rights, and these concerns matter at every stage of the process. Machine-learning-based stress detection can enable early intervention, personalized stress management, and improved well-being.
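As one small, hedged illustration of the anonymization step, user identifiers can be replaced with salted one-way hashes before analysis; the user_id value and salt below are hypothetical.
# Minimal pseudonymization sketch (hypothetical 'user_id' and salt)
import hashlib

def pseudonymize(user_id, salt='replace-with-secret-salt'):
    # One-way hash so raw identifiers never enter the analysis pipeline
    return hashlib.sha256((salt + str(user_id)).encode()).hexdigest()[:16]

print(pseudonymize('user_42'))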
The "stress" dataset contains information related to stress levels. Without the dataset's specific structure and columns, I can only give a general overview of the data.
The dataset may contain numerical variables representing quantitative measurements, such as age, blood pressure, heart rate, or stress levels measured on a scale. It may also include categorical variables representing qualitative characteristics, such as gender, occupation category, or stress levels grouped into distinct classes (low, medium, high).
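For illustration only, a table with that mix of types might look like the hypothetical frame below; every column and value here is invented for demonstration and does not reflect the actual Stress.csv used later.
# Hypothetical mixed-type stress table (all columns invented for illustration)
import pandas as pd

demo = pd.DataFrame({
    'age': [29, 41, 35],                       # numerical
    'heart_rate': [72, 88, 95],                # numerical
    'gender': ['F', 'M', 'F'],                 # categorical
    'stress_level': ['low', 'medium', 'high']  # categorical target
})
print(demo.dtypes)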
# Array
import numpy as np
# Dataframe
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
# warnings
import warnings
warnings.filterwarnings('ignore')
# Data reading
stress_c = pd.read_csv('/human-stress-prediction/Stress.csv')
# Copy
stress = stress_c.copy()
# Data
stress.head()
The call below lets you quickly assess the data types and spot missing or null values. This summary is especially useful when working with large datasets or performing data cleaning and preprocessing tasks.
# Info
stress.info()
Check for null values in the "stress" dataset with stress.isnull().sum(), which returns the total count of null values in each column.
# Checking null values
stress.isnull().sum()
Generate statistical information about the "stress" dataset. Running this code produces a summary of descriptive statistics for every numeric column in the dataset.
# Statistical Information
stress.describe()
Exploratory data analysis (EDA) is a key step in understanding and analyzing a dataset. It lets us visually explore and summarize the data's main characteristics, patterns, and relationships.
# Pie charts for the distribution of 'subreddit' and 'label'
lst = ['subreddit', 'label']
plt.figure(figsize=(15, 12))
for i in range(len(lst)):
    plt.subplot(1, 2, i + 1)
    a = stress[lst[i]].value_counts()
    lbl = a.index
    plt.title(lst[i] + '_Distribution')
    plt.pie(x=a, labels=lbl, autopct="%.1f %%")
plt.show()
The Matplotlib and Seaborn libraries create a count plot for the "stress" dataset. It visualizes the number of stress instances in each subreddit, with the stress labels distinguished by color.
plt.figure(figsize=(20, 12))
sns.countplot(data=stress, x='subreddit', hue='label', palette='gist_heat')
# Set title and axis label after plotting so seaborn's defaults don't override them
plt.title('Subreddit wise stress count')
plt.xlabel('Subreddit')
plt.show()
Text preprocessing refers to converting raw text data into a cleaner, more structured format suitable for analysis or modeling. It involves a series of steps to remove noise, normalize the text, and extract relevant features. Here I have added all the libraries related to this text processing.
# Regular Expression
import re
# Handling string
import string
# NLP tool
import spacy
nlp=spacy.load('en_core_web_sm')
from spacy.lang.en.stop_words import STOP_WORDS
# Importing Natural Language Tool Kit for NLP operations
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from wordcloud import WordCloud, STOPWORDS
from nltk.corpus import stopwords
from collections import Counter
Some common techniques used in text preprocessing include lowercasing, removing punctuation, digits, and stop words, and lemmatization; the functions below implement these steps:
# Defining a function for preprocessing
def preprocess(text, remove_digits=True):
    text = re.sub(r'\W+', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r"(?<!\w)\d+", "", text)
    text = re.sub(r"-(?!\w)|(?<!\w)-", "", text)
    text = text.lower()
    nopunc = [char for char in text if char not in string.punctuation]
    nopunc = ''.join(nopunc)
    nopunc = ' '.join([word for word in nopunc.split()
                       if word.lower() not in stopwords.words('english')])
    return nopunc
# Defining a function for lemmatization
def lemmatize(words):
    words = nlp(words)
    lemmas = []
    for word in words:
        lemmas.append(word.lemma_)
    return lemmas
# Converting a list of tokens back into a string
def listtostring(s):
    str1 = ' '
    return (str1.join(s))

def clean_text(input):
    word = preprocess(input)
    lemmas = lemmatize(word)
    return listtostring(lemmas)
# Creating a feature to store clean texts
stress['clean_text']=stress['text'].apply(clean_text)
stress.head()
Machine learning model building is the process of creating a mathematical representation, or model, that can learn patterns from data and make predictions or decisions. It involves training the model on a labeled dataset and then using it to make predictions on new, unseen data.
Select or create relevant features from the available data. Feature engineering aims to extract meaningful information from raw data that helps the model learn patterns effectively.
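Since the features used in this article are TF-IDF vectors, here is a minimal, standalone sketch of how TF-IDF turns text into numeric features; the two toy sentences are invented and unrelated to the dataset.
# Toy TF-IDF example showing how text becomes numeric features
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["work stress is overwhelming", "relaxing weekend no stress"]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)          # sparse matrix, shape (2, n_terms)
print(tfidf.get_feature_names_out())   # vocabulary learned from the corpus
print(X.toarray().round(2))            # TF-IDF weights per document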
# Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Model building
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
    KFold, train_test_split, cross_val_score, cross_val_predict)
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn import preprocessing
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (StackingClassifier, RandomForestClassifier,
    AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
# Model evaluation
from sklearn.metrics import (confusion_matrix, classification_report,
    accuracy_score, f1_score, precision_score)
from sklearn.pipeline import Pipeline
# Time
from time import time
# Defining target & feature for ML model building
x=stress['clean_text']
y=stress['label']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
Choose an appropriate machine learning algorithm or model architecture based on the nature of the problem and the characteristics of the data. Different models, such as decision trees, support vector machines, or neural networks, have different strengths and weaknesses.
Train the selected model on the labeled data. This step involves feeding the training data to the model and letting it learn the patterns and relationships between the features and the target variable.
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer and build a Logistic Regression classifier
def model_lr_tf(x_train, x_test, y_train, y_test):
    global acc_lr_tf, f1_lr_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)
    ovr = LogisticRegression()
    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)
    # Model evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_lr_tf = accuracy_score(y_test, y_pred)
    f1_lr_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time :', time() - t0)
    print('Accuracy: ', acc_lr_tf)
    print(10 * '===========')
    print('Confusion Matrix: \n', conf)
    print(10 * '===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    return y_test, y_pred, acc_lr_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer and build a Multinomial Naive Bayes classifier
def model_nb_tf(x_train, x_test, y_train, y_test):
    global acc_nb_tf, f1_nb_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)
    ovr = MultinomialNB()
    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)
    # Model evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_nb_tf = accuracy_score(y_test, y_pred)
    f1_nb_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time() - t0)
    print('Accuracy: ', acc_nb_tf)
    print(10 * '===========')
    print('Confusion Matrix: \n', conf)
    print(10 * '===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    return y_test, y_pred, acc_nb_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer and build a Decision Tree classifier
def model_dt_tf(x_train, x_test, y_train, y_test):
    global acc_dt_tf, f1_dt_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)
    ovr = DecisionTreeClassifier(random_state=1)
    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)
    # Model evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_dt_tf = accuracy_score(y_test, y_pred)
    f1_dt_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time() - t0)
    print('Accuracy: ', acc_dt_tf)
    print(10 * '===========')
    print('Confusion Matrix: \n', conf)
    print(10 * '===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    return y_test, y_pred, acc_dt_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer and build a KNN classifier
def model_knn_tf(x_train, x_test, y_train, y_test):
    global acc_knn_tf, f1_knn_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)
    ovr = KNeighborsClassifier()
    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)
    # Model evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_knn_tf = accuracy_score(y_test, y_pred)
    f1_knn_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time() - t0)
    print('Accuracy: ', acc_knn_tf)
    print(10 * '===========')
    print('Confusion Matrix: \n', conf)
    print(10 * '===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    # Returned for consistency with the other model functions
    return y_test, y_pred, acc_knn_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer and build a Random Forest classifier
def model_rf_tf(x_train, x_test, y_train, y_test):
    global acc_rf_tf, f1_rf_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)
    ovr = RandomForestClassifier(random_state=1)
    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)
    # Model evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_rf_tf = accuracy_score(y_test, y_pred)
    f1_rf_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time() - t0)
    print('Accuracy: ', acc_rf_tf)
    print(10 * '===========')
    print('Confusion Matrix: \n', conf)
    print(10 * '===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    # Returned for consistency with the other model functions
    return y_test, y_pred, acc_rf_tf
# Self-defined function to convert the data into vector form with the TF-IDF
# vectorizer and build an Adaptive Boosting classifier
def model_ab_tf(x_train, x_test, y_train, y_test):
    global acc_ab_tf, f1_ab_tf
    # Text to vector transformation
    vector = TfidfVectorizer()
    x_train = vector.fit_transform(x_train)
    x_test = vector.transform(x_test)
    ovr = AdaBoostClassifier(random_state=1)
    # Fitting training data into the model & predicting
    t0 = time()
    ovr.fit(x_train, y_train)
    y_pred = ovr.predict(x_test)
    # Model evaluation
    conf = confusion_matrix(y_test, y_pred)
    acc_ab_tf = accuracy_score(y_test, y_pred)
    f1_ab_tf = f1_score(y_test, y_pred, average='weighted')
    print('Time : ', time() - t0)
    print('Accuracy: ', acc_ab_tf)
    print(10 * '===========')
    print('Confusion Matrix: \n', conf)
    print(10 * '===========')
    print('Classification Report: \n', classification_report(y_test, y_pred))
    # Returned for consistency with the other model functions
    return y_test, y_pred, acc_ab_tf
Model evaluation is a critical step in machine learning for assessing the performance and effectiveness of a trained model. It involves measuring how well each model generalizes to unseen data and whether it meets the intended goals.
Evaluate the trained models' performance on the test data. Compute evaluation metrics such as accuracy, precision, recall, and F1 score to assess how effective each model is at stress detection. Model evaluation offers insight into a model's strengths, weaknesses, and suitability for the intended task.
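As a quick reference for what these metrics capture, here is a toy example with invented labels, using the same scikit-learn calls the functions above rely on.
# Toy example: computing the four evaluation metrics on invented labels
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_hat  = [1, 0, 0, 1, 0, 1]
print('Accuracy :', accuracy_score(y_true, y_hat))   # fraction of correct predictions
print('Precision:', precision_score(y_true, y_hat))  # TP / (TP + FP)
print('Recall   :', recall_score(y_true, y_hat))     # TP / (TP + FN)
print('F1 score :', f1_score(y_true, y_hat))         # harmonic mean of precision and recall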
# Evaluating Models
print('********************Logistic Regression*********************')
print('\n')
model_lr_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Multinomial NB*********************')
print('\n')
model_nb_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Decision Tree*********************')
print('\n')
model_dt_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************KNN*********************')
print('\n')
model_knn_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Random Forest Bagging*********************')
print('\n')
model_rf_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
print('********************Adaptive Boosting*********************')
print('\n')
model_ab_tf(x_train, x_test, y_train, y_test)
print('\n')
print(30*'==========')
print('\n')
This is a key step in machine learning for determining the best-performing model for a given task. When comparing models, it is important to have a clear objective. Whether maximizing accuracy, optimizing speed, or prioritizing interpretability, the evaluation metrics and techniques should align with that specific goal.
Consistency is key in model performance comparison. Using the same evaluation metrics across all models ensures a fair and meaningful comparison, and so does splitting the data into training, validation, and test sets in the same way for every model. By evaluating all models on the same data subsets, researchers can compare their performance fairly.
With these factors in mind, researchers can conduct a comprehensive and fair comparison of model performance, which supports an informed model selection decision for the specific problem at hand.
# Creating a tabular format for better comparison
tbl = pd.DataFrame()
tbl['Model'] = pd.Series(['Logistic Regression', 'Multinomial NB',
                          'Decision Tree', 'KNN', 'Random Forest', 'Adaptive Boosting'])
tbl['Accuracy'] = pd.Series([acc_lr_tf, acc_nb_tf, acc_dt_tf, acc_knn_tf,
                             acc_rf_tf, acc_ab_tf])
tbl['F1_Score'] = pd.Series([f1_lr_tf, f1_nb_tf, f1_dt_tf, f1_knn_tf,
                             f1_rf_tf, f1_ab_tf])
tbl = tbl.set_index('Model')
# Best model on the basis of F1 score
tbl.sort_values('F1_Score', ascending=False)
Cross-validation is indeed a valuable technique for avoiding overfitting when training machine learning models. It provides a robust assessment of model performance by using multiple subsets of the data for training and testing, and it helps gauge a model's generalization ability by estimating its performance on unseen data.
# Using the cross-validation method to avoid overfitting
import statistics as st
vector = TfidfVectorizer()
x_train_v = vector.fit_transform(x_train)
x_test_v = vector.transform(x_test)
# Model building
lr = LogisticRegression()
mnb = MultinomialNB()
dct = DecisionTreeClassifier(random_state=1)
knn = KNeighborsClassifier()
rf = RandomForestClassifier(random_state=1)
ab = AdaBoostClassifier(random_state=1)
m = [lr, mnb, dct, knn, rf, ab]
model_name = ['Logistic R', 'MultiNB', 'DecTRee', 'KNN', 'R forest', 'Ada Boost']
results, mean_results, p, f1_test = list(), list(), list(), list()

# Model fitting, cross-validating, and evaluating performance
def algor(model):
    print('\n', i)
    pipe = Pipeline([('model', model)])
    pipe.fit(x_train_v, y_train)
    cv = StratifiedKFold(n_splits=5)
    n_scores = cross_val_score(pipe, x_train_v, y_train, scoring='f1_weighted',
                               cv=cv, n_jobs=-1, error_score='raise')
    results.append(n_scores)
    mean_results.append(st.mean(n_scores))
    print('f1-Score(train): mean=(%.3f), min=(%.3f), max=(%.3f), '
          'stdev=(%.3f)' % (st.mean(n_scores), min(n_scores),
                            max(n_scores), np.std(n_scores)))
    y_pred = cross_val_predict(model, x_train_v, y_train, cv=cv)
    p.append(y_pred)
    f1 = f1_score(y_train, y_pred, average='weighted')
    f1_test.append(f1)
    print('f1-Score(test): %.4f' % (f1))

for i in m:
    algor(i)
# Model comparison by visualization
fig, ax = plt.subplots(figsize=(20, 15))
plt.title('MODEL EVALUATION BY CROSS VALIDATION METHOD')
plt.xlabel('MODELS')
plt.ylabel('F1 Score')
plt.boxplot(results, labels=model_name, showmeans=True)
plt.show()
Because the models' F1 scores are very similar under both approaches, we now train the best-performing model, logistic regression, on the hold-out train/test split to build the final model.
x=stress['clean_text']
y=stress['label']
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=1)
vector = TfidfVectorizer()
x_train = vector.fit_transform(x_train)
x_test = vector.transform(x_test)
model_lr_tf=LogisticRegression()
model_lr_tf.fit(x_train,y_train)
y_pred=model_lr_tf.predict(x_test)
# Model Evaluation
conf=confusion_matrix(y_test,y_pred)
acc_lr=accuracy_score(y_test,y_pred)
f1_lr=f1_score(y_test,y_pred,average='weighted')
print('Accuracy: ',acc_lr)
print('F1 Score: ',f1_lr)
print(10*'===========')
print('Confusion Matrix: \n',conf)
print(10*'===========')
print('Classification Report: \n',classification_report(y_test,y_pred))
The dataset contains text messages or documents labeled as stressed or not stressed. The code loops over the two labels, creates a word cloud for each using the WordCloud library, and displays the visualizations. Each word cloud represents the most frequently used words in its respective category, with larger words indicating higher frequency.
The choice of colormaps ('winter', 'autumn', 'magma', 'viridis', 'plasma') determines the color scheme of the word clouds; since there are only two labels, only the first two colormaps are actually used. The resulting visualizations give a concise representation of the words most frequently associated with stressed and non-stressed messages or documents.
Below are word clouds representing the stressed and non-stressed words typically associated with stress detection:
for label, cmap in zip([0, 1],
                       ['winter', 'autumn', 'magma', 'viridis', 'plasma']):
    text = stress.query('label == @label')['text'].str.cat(sep=' ')
    plt.figure(figsize=(12, 9))
    wc = WordCloud(width=1000, height=600, background_color="#f8f8f8", colormap=cmap)
    wc.generate_from_text(text)
    plt.imshow(wc)
    plt.axis("off")
    plt.title(f"Words Commonly Used in ${label}$ Messages", size=20)
    plt.show()
New input data is preprocessed and its features extracted to match what the model expects. The predict function then generates predictions from the extracted features, which are printed or used as needed for further analysis or decision-making. (Note that the snippet below transforms the raw text directly with the fitted TF-IDF vectorizer; for full consistency with training, the same clean_text preprocessing should be applied first.)
data=["""I don't have the ability to cope with it anymore. I'm trying,
but a lot of things are triggering me, and I'm shutting down at work,
just finding the place I feel safest, and staying there for an hour
or two until I feel like I can do something again. I'm tired of watching
my back, tired of traveling to places I don't feel safe, tired of
reliving that moment, tired of being triggered, tired of the stress,
tired of anxiety and knots in my stomach, tired of irrational thought
when triggered, tired of irrational paranoia. I'm exhausted and need
a break, but know it won't be enough until I journey the long road
through therapy. I'm not suicidal at all, just wishing this pain and
misery would end, to have my life back again."""]
data=vector.transform(data)
model_lr_tf.predict(data)
Output:
array([1])
data=["""In case this is the first time you're reading this post...
We are looking for people who are willing to complete some
online questionnaires about employment and well-being which
we hope will help us to improve services for assisting people
with mental health difficulties to obtain and retain employment.
We are developing an employment questionnaire for people with
personality disorders; however we are looking for people from all
backgrounds to complete it. That means you do not need to have a
diagnosis of personality disorder – you just need to have an
interest in completing the online questionnaires. The questionnaires
will only take about 10 minutes to complete online. For your
participation, we’ll donate £1 on your behalf to a mental health
charity (Young Minds: Child & Adolescent Mental Health, Mental
Health Foundation, or Rethink)"""]
data=vector.transform(data)
model_lr_tf.predict(data)
Output:
array([0])
Applying machine learning techniques to predict stress levels offers personalized insight into mental well-being. By analyzing factors ranging from numerical measurements (blood pressure, heart rate) to categorical characteristics (e.g., gender, occupation), machine learning models can learn patterns and predict an individual's stress level. The ability to accurately detect and monitor stress supports proactive strategies and interventions for managing and strengthening mental well-being.
We have explored the insights that machine learning brings to stress prediction.
In conclusion, this stress prediction analysis provides valuable insight into stress levels and their prediction using machine learning. The findings can be used to develop stress management tools and interventions that promote overall well-being and improve quality of life.
Q1. What are the benefits of data-driven stress detection?
A: It enables early detection of stress and timely, personalized interventions, can help optimize work environments in occupational settings, and can inform public health initiatives and policy decisions.
Q2. What types of text data can be used for data-driven stress detection?
A: User-generated text such as social media posts (for example, the labeled Reddit posts grouped by subreddit analyzed in this article), as well as any messages or documents that can be labeled as stressed or not stressed.
Q3. What are the challenges of data-driven stress detection?
A: Working with sensitive personal data raises ethical and privacy concerns that require informed consent, anonymization, and secure storage, and models need continuous monitoring, updating, and user feedback to stay accurate on unseen data.