
A Collection of LaTeX Code for Common AI/Machine Learning Formulas

blmoistawinde
Published 2020-05-26 17:12:40
When writing AI/machine-learning papers or blog posts, I often need LaTeX for the equations. Yet when I, a seasoned "copy-paster", searched online, I surprisingly could not find a ready-made collection @-@

So I have collected the formulas I run into most often below, focusing on NLP and some common metric functions. Feel free to grab what you need, and corrections or additions are of course welcome.

Classical ML Equations in LaTeX

A collection of classical ML equations in LaTeX. Some of them come with brief notes and paper links. Hopefully it helps with writing papers and blog posts.

Better viewed at https://blmoistawinde.github.io/ml_equations_latex/

  • Classical ML Equations in LaTeX
    • Model
      • RNNs(LSTM, GRU)
      • Attentional Seq2seq
        • Bahdanau Attention
        • Luong(Dot-Product) Attention
      • Transformer
        • Scaled Dot-Product attention
        • Multi-head attention
      • Generative Adversarial Networks(GAN)
        • Minimax game objective
      • Variational Auto-Encoder(VAE)
        • Reparameterization trick
    • Activations
      • Sigmoid
      • Softmax
      • Relu
    • Loss
      • Regression
        • Mean Absolute Error(MAE)
        • Mean Squared Error(MSE)
        • Huber loss
      • Classification
        • Cross Entropy
        • Negative Loglikelihood
        • Hinge loss
        • KL/JS divergence
      • Regularization
        • L1 regularization
        • L2 regularization
    • Metrics
      • Classification
        • Accuracy, Precision, Recall, F1
        • Sensitivity, Specificity and AUC
      • Regression
      • Clustering
        • (Normalized) Mutual Information (NMI)
      • Ranking
        • (Mean) Average Precision(MAP)
      • Similarity/Relevance
        • Cosine
        • Jaccard
        • Pointwise Mutual Information(PMI)
    • Notes
    • Reference

Model

RNNs(LSTM, GRU)

encoder hidden state $h_t$ at time step $t$

$$h_t = RNN_{enc}(x_t, h_{t-1})$$

decoder hidden state $s_t$ at time step $t$

$$s_t = RNN_{dec}(y_t, s_{t-1})$$

LaTeX:
h_t = RNN_{enc}(x_t, h_{t-1})
s_t = RNN_{dec}(y_t, s_{t-1})

The $RNN_{enc}$ and $RNN_{dec}$ are usually either an LSTM or a GRU cell.
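As a rough illustration, a single recurrent step can be written in NumPy with a plain tanh cell standing in for the LSTM/GRU; all weight names below are made up for the example:

import numpy as np

def rnn_cell(x_t, h_prev, W_x, W_h, b):
    # one recurrent step: h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W_x, W_h, b = rng.normal(size=(d_h, d_in)), rng.normal(size=(d_h, d_h)), np.zeros(d_h)

h = np.zeros(d_h)                       # encoder hidden state h_0
for x_t in rng.normal(size=(5, d_in)):  # 5 input time steps
    h = rnn_cell(x_t, h, W_x, W_h, b)   # h_t = RNN_enc(x_t, h_{t-1})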

Attentional Seq2seq

The attention weight $\alpha_{ij}$, from the $i$th decoder step over the $j$th encoder step, yields the context vector $c_i$

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j$$

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}$$

$$e_{ij} = a(s_{i-1}, h_j)$$

LaTeX:
c_i = \sum_{j=1}^{T_x} \alpha_{ij}h_j

\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}

e_{ij} = a(s_{i-1}, h_j)

$a$ is a specific attention function, which can be one of the following:

Bahdanau Attention

Paper: Neural Machine Translation by Jointly Learning to Align and Translate

$$e_{ij} = v^T tanh(W[s_{i-1}; h_j])$$

LaTeX:
e_{ij} = v^T tanh(W[s_{i-1}; h_j])
Luong(Dot-Product) Attention

Paper: Effective Approaches to Attention-based Neural Machine Translation

If $s_i$ and $h_j$ have the same number of dimensions:

$$e_{ij} = s_{i-1}^T h_j$$

otherwise

$$e_{ij} = s_{i-1}^T W h_j$$

LaTeX:
e_{ij} = s_{i-1}^T h_j

e_{ij} = s_{i-1}^T W h_j

Finally, the output $o_t$ is produced by:

$$s_t = tanh(W[s_{t-1};y_t;c_t])$$

$$o_t = softmax(Vs_t)$$

LaTeX:
s_t = tanh(W[s_{t-1};y_t;c_t])
o_t = softmax(Vs_t)
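A minimal NumPy sketch of one decoder step, assuming Bahdanau-style (additive) scores; the shapes and weight names are illustrative only:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
T_x, d = 6, 8                           # encoder length, hidden size
H = rng.normal(size=(T_x, d))           # encoder states h_1..h_Tx
s_prev = rng.normal(size=d)             # previous decoder state s_{i-1}
W = rng.normal(size=(d, 2 * d))         # additive-attention projection
v = rng.normal(size=d)

# e_ij = v^T tanh(W [s_{i-1}; h_j])  (Bahdanau); use s_prev @ h_j instead for Luong
e = np.array([v @ np.tanh(W @ np.concatenate([s_prev, h_j])) for h_j in H])
alpha = softmax(e)                      # attention weights over encoder steps
c = alpha @ H                           # context vector c_i = sum_j alpha_ij h_j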

Transformer

Paper: Attention Is All You Need

Scaled Dot-Product attention

$$Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V$$

LaTeX:
Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V

where $d_k$ is the dimension of the key vector $k$ and query vector $q$, and $\sqrt{d_k}$ is the scaling factor.
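A minimal NumPy sketch of this formula (row-wise softmax over the scores; sizes are arbitrary):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # (n_queries, n_keys)
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(5, 8)), rng.normal(size=(5, 16))
out = scaled_dot_product_attention(Q, K, V)   # shape (3, 16)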

Multi-head attention

$$MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O$$

where $$head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)$$

LaTeX:
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)W^O

head_i = Attention(Q W^Q_i, K W^K_i, V W^V_i)
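Building on the same idea, a compact sketch of multi-head attention; the per-head projections $W^Q_i, W^K_i, W^V_i$ and the output projection $W^O$ are random placeholders here:

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1) @ V

rng = np.random.default_rng(0)
n, d_model, h = 4, 16, 4
d_k = d_model // h
Q = K = V = rng.normal(size=(n, d_model))                  # self-attention input
W_q, W_k, W_v = (rng.normal(size=(h, d_model, d_k)) for _ in range(3))
W_o = rng.normal(size=(h * d_k, d_model))

heads = [attention(Q @ W_q[i], K @ W_k[i], V @ W_v[i]) for i in range(h)]
multi_head = np.concatenate(heads, axis=-1) @ W_o          # shape (n, d_model)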

Generative Adversarial Networks(GAN)

Paper: Generative Adversarial Networks

Minimax game objective

$$\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] + \mathbb{E}_{z\sim p_{\text{generated}}(z)}[\log{(1 - D(G(z)))}]$$

LaTeX:
\min_{G}\max_{D}\mathbb{E}_{x\sim p_{\text{data}}(x)}[\log{D(x)}] +  \mathbb{E}_{z\sim p_{\text{generated}}(z)}[\log{(1 - D(G(z)))}]
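As a rough sanity check of what the objective measures, here is a tiny sketch that evaluates the value function on a batch of hypothetical discriminator outputs:

import numpy as np

d_real = np.array([0.9, 0.8, 0.95])   # D(x) on real samples (made-up values)
d_fake = np.array([0.1, 0.3, 0.2])    # D(G(z)) on generated samples
value = np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))
# The discriminator is trained to increase this value, the generator to decrease it
# (in practice the generator often maximizes log D(G(z)) instead, the "non-saturating" loss).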

Variational Auto-Encoder(VAE)

Paper: Auto-Encoding Variational Bayes

Reparameterization trick

To produce a latent variable $z$ such that $z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)$, we sample $\epsilon \sim \mathcal{N}(0,1)$, then $z$ is produced by

$$z = \mu + \epsilon \cdot \sigma$$

LaTeX:
z \sim q_{\mu, \sigma}(z) = \mathcal{N}(\mu, \sigma^2)
\epsilon \sim \mathcal{N}(0,1)
z = \mu + \epsilon \cdot \sigma

The above is for the 1-D case. For a multi-dimensional (vector) case we use:

$$\vec{\epsilon} \sim \mathcal{N}(0, \textbf{I})$$

$$\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})$$

LaTeX:
\epsilon \sim \mathcal{N}(0, \textbf{I})
\vec{z} \sim \mathcal{N}(\vec{\mu}, \sigma^2 \textbf{I})
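A minimal NumPy sketch of the trick, where mu and log_var stand in for encoder outputs:

import numpy as np

rng = np.random.default_rng(0)
mu = rng.normal(size=8)                  # encoder output: mean
log_var = rng.normal(size=8)             # encoder output: log sigma^2
sigma = np.exp(0.5 * log_var)

eps = rng.standard_normal(8)             # eps ~ N(0, I)
z = mu + eps * sigma                     # z ~ N(mu, sigma^2 I), differentiable w.r.t. mu and sigma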

Activations

Sigmoid

Related to Logistic Regression. For single-label/multi-label binary classification.

$$\sigma(z) = \frac{1} {1 + e^{-z}}$$

LaTeX:
\sigma(z) = \frac{1} {1 + e^{-z}}

Softmax

For multi-class single label classification.

$$\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K$$

LaTeX:
\sigma(z_i) = \frac{e^{z_{i}}}{\sum_{j=1}^K e^{z_{j}}} \ \ \ for\ i=1,2,\dots,K

Relu

$$Relu(z) = max(0, z)$$

LaTeX:
Relu(z) = max(0, z)
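The three activations are easy to sanity-check in NumPy (the softmax subtracts the max before exponentiating for numerical stability, which does not change the result):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))            # shift for numerical stability
    return e / e.sum()

def relu(z):
    return np.maximum(0, z)

z = np.array([-1.0, 0.0, 2.0])
sigmoid(z), softmax(z), relu(z)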

Loss

Regression

Below, $x$ and $y$ are $D$-dimensional vectors, and $x_i$ denotes the value on the $i$th dimension of $x$.

Mean Absolute Error(MAE)

$$\sum_{i=1}^{D}|x_i-y_i|$$

LaTeX:
\sum_{i=1}^{D}|x_i-y_i|
Mean Squared Error(MSE)

$$\sum_{i=1}^{D}(x_i-y_i)^2$$

LaTeX:
\sum_{i=1}^{D}(x_i-y_i)^2
Huber loss

It's less sensitive to outliers than the MSE, since it treats the error quadratically only inside an interval.

$$L_{\delta}= \left\{\begin{matrix} \frac{1}{2}(y - \hat{y})^{2} & if \left | (y - \hat{y}) \right | < \delta\\ \delta (\left | (y - \hat{y}) \right | - \frac{1}{2}\delta) & otherwise \end{matrix}\right.$$

LaTeX:
L_{\delta}=
    \left\{\begin{matrix}
        \frac{1}{2}(y - \hat{y})^{2} & if \left | (y - \hat{y})  \right | < \delta\\
        \delta (\left | (y - \hat{y}) \right | - \frac1 2 \delta) & otherwise
    \end{matrix}\right.
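A NumPy sketch of the three losses as written above (they are sums; divide by the number of dimensions if you want the mean):

import numpy as np

def mae(x, y):
    return np.sum(np.abs(x - y))         # divide by x.size for the mean

def mse(x, y):
    return np.sum((x - y) ** 2)          # divide by x.size for the mean

def huber(y, y_hat, delta=1.0):
    a = np.abs(y - y_hat)
    return np.where(a < delta, 0.5 * a ** 2, delta * (a - 0.5 * delta))

x, y = np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.0, 5.0])
mae(x, y), mse(x, y)
huber(x, y)                              # elementwise Huber losses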

Classification

Cross Entropy
  • In binary classification, where the number of classes $M$ equals 2, Binary Cross-Entropy(BCE) can be calculated as:

$$-{(y\log(p) + (1 - y)\log(1 - p))}$$

  • If $M > 2$ (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.

$$-\sum_{c=1}^My_{o,c}\log(p_{o,c})$$

LaTeX:
-{(y\log(p) + (1 - y)\log(1 - p))}

-\sum_{c=1}^My_{o,c}\log(p_{o,c})

  • M - number of classes
  • log - the natural log
  • y - binary indicator (0 or 1) of whether class label c is the correct classification for observation o
  • p - predicted probability that observation o is of class c
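A NumPy sketch of both forms for a single observation; y_onehot is a hypothetical one-hot label, and predictions are clipped to avoid log(0):

import numpy as np

def bce(y, p, eps=1e-12):
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def cross_entropy(y_onehot, p, eps=1e-12):
    return -np.sum(y_onehot * np.log(np.clip(p, eps, 1.0)))

bce(1, 0.8)                                                      # binary case
cross_entropy(np.array([0, 1, 0]), np.array([0.2, 0.7, 0.1]))    # M > 2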

Negative Loglikelihood

$$NLL(y) = -{\log(p(y))}$$

Minimizing the negative log-likelihood

$$\min_{\theta} \sum_y {-\log(p(y;\theta))}$$

is equivalent to Maximum Likelihood Estimation(MLE).

$$\max_{\theta} \prod_y p(y;\theta)$$

Here $p(y)$ is a scalar rather than a vector. It is the value of the single dimension where the ground truth $y$ lies. It is thus equivalent to cross entropy (see wiki).

LaTeX:
NLL(y) = -{\log(p(y))}

\min_{\theta} \sum_y {-\log(p(y;\theta))}

\max_{\theta} \prod_y p(y;\theta)
Hinge loss

Used in Support Vector Machine(SVM).

$$max(0, 1 - y \cdot \hat{y})$$

LaTeX:
max(0, 1 - y \cdot \hat{y})
KL/JS divergence

$$KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}$$

$$JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))$$

LaTeX:
KL(\hat{y} || y) = \sum_{c=1}^{M}\hat{y}_c \log{\frac{\hat{y}_c}{y_c}}

JS(\hat{y} || y) = \frac{1}{2}(KL(y||\frac{y+\hat{y}}{2}) + KL(\hat{y}||\frac{y+\hat{y}}{2}))
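A NumPy sketch for two discrete distributions over $M$ classes (assumes strictly positive probabilities; add a small epsilon in practice):

import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))     # KL(p || q)

def js(p, q):
    m = 0.5 * (p + q)
    return 0.5 * (kl(p, m) + kl(q, m))

y_hat = np.array([0.7, 0.2, 0.1])
y = np.array([0.5, 0.3, 0.2])
kl(y_hat, y), js(y_hat, y)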

Regularization

The $Error$ below can be any of the losses above.

L1 regularization

A regression model that uses the L1 regularization technique is called Lasso Regression.

$$Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|$$

LaTeX:
Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n |w_i|
L2 regularization

A regression model that uses the L2 regularization technique is called Ridge Regression.

$$Loss = Error(Y - \widehat{Y}) + \lambda \sum_1^n w_i^{2}$$

LaTeX:
Loss = Error(Y - \widehat{Y}) +  \lambda \sum_1^n w_i^{2}
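A sketch of attaching either penalty to a base error value; lam plays the role of $\lambda$:

import numpy as np

def l1_loss(error, w, lam=0.01):
    return error + lam * np.sum(np.abs(w))      # Lasso-style penalty

def l2_loss(error, w, lam=0.01):
    return error + lam * np.sum(w ** 2)         # Ridge-style penalty

w = np.array([0.5, -1.2, 0.0, 3.0])
base_error = 0.42                               # e.g. an MSE computed elsewhere
l1_loss(base_error, w), l2_loss(base_error, w)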

Metrics

Some of these overlap with the losses above, such as MAE and KL-divergence.

Classification

Accuracy, Precision, Recall, F1

$$Accuracy = \frac{TP+TN}{TP+TN+FP+FN}$$

$$Precision = \frac{TP}{TP+FP}$$

$$Recall = \frac{TP}{TP+FN}$$

$$F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}$$

LaTeX:
Accuracy = \frac{TP+TN}{TP+TN+FP+FN}
Precision = \frac{TP}{TP+FP}
Recall = \frac{TP}{TP+FN}
F1 = \frac{2*Precision*Recall}{Precision+Recall} = \frac{2*TP}{2*TP+FP+FN}
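A small sketch of these metrics computed directly from confusion-matrix counts:

def classification_metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

classification_metrics(tp=40, tn=45, fp=5, fn=10)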
Sensitivity, Specificity and AUC

$$Sensitivity = Recall = \frac{TP}{TP+FN}$$

$$Specificity = \frac{TN}{FP+TN}$$

LaTeX:
Sensitivity = Recall = \frac{TP}{TP+FN}
Specificity = \frac{TN}{FP+TN}

AUC is calculated as the Area Under the $Sensitivity$ (TPR) vs. $1-Specificity$ (FPR) curve.

Regression

MAE, MSE: see the equations above.

Clustering

(Normalized) Mutual Information (NMI)

Mutual Information is a measure of the similarity between two labelings of the same data. Where $|U_i|$ is the number of samples in cluster $U_i$ and $|V_j|$ is the number of samples in cluster $V_j$, the Mutual Information between clusterings $U$ and $V$ is given as:

$$MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N} \log\frac{N|U_i \cap V_j|}{|U_i||V_j|}$$

LaTeX:
MI(U,V)=\sum_{i=1}^{|U|} \sum_{j=1}^{|V|} \frac{|U_i\cap V_j|}{N}
\log\frac{N|U_i \cap V_j|}{|U_i||V_j|}

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation). In this function, mutual information is normalized by some generalized mean of H(labels_true) and H(labels_pred); see wiki.
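A NumPy sketch that computes MI from two label arrays via their contingency counts; in practice, sklearn.metrics.normalized_mutual_info_score gives the normalized version:

import numpy as np

def mutual_info(labels_u, labels_v):
    labels_u, labels_v = np.asarray(labels_u), np.asarray(labels_v)
    N = len(labels_u)
    mi = 0.0
    for u in np.unique(labels_u):
        for v in np.unique(labels_v):
            n_uv = np.sum((labels_u == u) & (labels_v == v))   # |U_i ∩ V_j|
            if n_uv == 0:
                continue
            n_u, n_v = np.sum(labels_u == u), np.sum(labels_v == v)
            mi += n_uv / N * np.log(N * n_uv / (n_u * n_v))
    return mi

mutual_info([0, 0, 1, 1, 2, 2], [0, 0, 1, 1, 1, 2])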

Skip RI, ARI for complexity.

Also skip metrics for related tasks (e.g. modularity for community detection[graph clustering], coherence score for topic modeling[soft clustering]).

Ranking

Skip nDCG (Normalized Discounted Cumulative Gain) for its complexity.

(Mean) Average Precision(MAP)

Average Precision is calculated as:

$$\text{AP} = \sum_n (R_n - R_{n-1}) P_n$$

LaTeX:
\text{AP} = \sum_n (R_n - R_{n-1}) P_n

where $P_n$ and $R_n$ are the precision and recall at the $n$th threshold.

MAP is the mean of AP over all the queries.
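A NumPy sketch of AP from relevance labels ranked by score, taking each rank position as a threshold:

import numpy as np

def average_precision(y_true, scores):
    order = np.argsort(scores)[::-1]                    # rank by descending score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                                   # true positives at each cutoff
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    prev_recall = np.concatenate([[0.0], recall[:-1]])
    return np.sum((recall - prev_recall) * precision)   # AP = sum_n (R_n - R_{n-1}) P_n

average_precision([1, 0, 1, 1, 0], [0.9, 0.8, 0.7, 0.4, 0.2])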

Similarity/Relevance

Cosine

$$Cosine(x,y) = \frac{x \cdot y}{|x||y|}$$

LaTeX:
Cosine(x,y) = \frac{x \cdot y}{|x||y|}
Jaccard

Similarity of two sets $U$ and $V$.

$$Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}$$

LaTeX:
Jaccard(U,V) = \frac{|U \cap V|}{|U \cup V|}
Pointwise Mutual Information(PMI)

Relevance of two events $x$ and $y$.

$$PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}$$

LaTeX:
PMI(x;y) = \log{\frac{p(x,y)}{p(x)p(y)}}

For example, $p(x)$ and $p(y)$ are the frequencies of words $x$ and $y$ appearing in a corpus, and $p(x,y)$ is the frequency of their co-occurrence.
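Small Python sketches of the three measures (the PMI probabilities are made-up example values):

import numpy as np

def cosine(x, y):
    return x @ y / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(U, V):
    return len(U & V) / len(U | V)

def pmi(p_xy, p_x, p_y):
    return np.log(p_xy / (p_x * p_y))

cosine(np.array([1.0, 2.0]), np.array([2.0, 3.0]))
jaccard({"a", "b", "c"}, {"b", "c", "d"})
pmi(p_xy=0.01, p_x=0.05, p_y=0.04)   # > 0 means x and y co-occur more often than by chance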

Notes

This repository currently contains only simple ML equations, mainly for deep learning and NLP, reflecting my personal research interests.

Due to time constraints, elegant equations from traditional ML approaches such as SVM, SVD, PCA, and LDA are not included yet.

Moreover, there is a trend towards more complex metrics that have to be computed with dedicated programs (e.g. BLEU, ROUGE, METEOR), iterative algorithms (e.g. PageRank), optimization (e.g. Earth Mover's Distance), or even learned models (e.g. BERTScore), and thus cannot be described by simple equations.

Reference

Pytorch Documentation

Scikit-learn Documentation

Machine Learning Glossary

Wikipedia

https://blog.floydhub.com/gans-story-so-far/

https://ermongroup.github.io/cs228-notes/extras/vae/

Thanks to a-rodin's solution for displaying LaTeX in GitHub markdown, which I have wrapped into latex2pic.py.
