前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >RDKit | 基于PCA探索化学空间

RDKit | 基于PCA探索化学空间

作者头像
DrugAI
修改2021-01-29 23:21:45
1.2K0
修改2021-01-29 23:21:45
举报
文章被收录于专栏:DrugAI

基于主成分分析(PCA,Principal Component Analysis)和聚类分析探索化学空间。

分析化合物数据库,发现化合物之间的共同描述符(物理化学特性)。


导入库

代码语言:javascript
复制
import osimport pandas as pdimport numpy as npimport matplotlib.pyplot as pltfrom matplotlib import gridspec
from rdkit import Chem, DataStructsfrom rdkit.Chem import Descriptors,Crippen
from sklearn.decomposition import PCAfrom sklearn.preprocessing import StandardScaler
import matplotlib.cm as cmfrom sklearn.metrics import silhouette_samples, silhouette_scorefrom sklearn.cluster import KMeans

载入分子数据

该库包含超过10 000 000个SMILES。可以将.smiles文件作为文本文件读取,将10000个分子保存在pandas中。

代码语言:javascript
复制
#Loading the moleculesdata=pd.DataFrame()with open('mol_parent.smi','r') as file:    for index,line in enumerate(file):        if 0<index<=10000:            data.loc[index,'SMILES']=line.split()[0]            data.loc[index,'Id']=line.split()[1]
代码语言:javascript
复制
data.head(10)

分子描述符计算

使用RDKit来计算2D和3D分子描述符。

代码语言:javascript
复制
# calculate the descriptors and add them to dataframefor i in data.index:    mol=Chem.MolFromSmiles(data.loc[i,'SMILES'])    data.loc[i,'MolWt']=Descriptors.ExactMolWt (mol)    data.loc[i,'TPSA']=Chem.rdMolDescriptors.CalcTPSA(mol) #Topological Polar Surface Area    data.loc[i,'nRotB']=Descriptors.NumRotatableBonds (mol) #Number of rotable bonds    data.loc[i,'HBD']=Descriptors.NumHDonors(mol) #Number of H bond donors    data.loc[i,'HBA']=Descriptors.NumHAcceptors(mol) #Number of H bond acceptors    data.loc[i,'LogP']=Descriptors.MolLogP(mol) #LogP
代码语言:javascript
复制
data.head(10)

计算分子描述符的PCA

选择将用于PCA计算的的值即计算的描述符。

代码语言:javascript
复制
descriptors = data.loc[:, ['MolWt', 'TPSA', 'nRotB', 'HBD','HBA', 'LogP']].values

一个非常重要的步骤是描述符的标度化。

代码语言:javascript
复制
descriptors_std = StandardScaler().fit_transform(descriptors)

计算PCA

代码语言:javascript
复制
pca = PCA()descriptors_2d = pca.fit_transform(descriptors_std)
代码语言:javascript
复制
descriptors_pca= pd.DataFrame(descriptors_2d)descriptors_pca.index = data.indexdescriptors_pca.columns = ['PC{}'.format(i+1) for i in descriptors_pca.columns]descriptors_pca.head(10)

检查解释方差,查看PCA中每个组的解释方差

代码语言:javascript
复制
print(pca.explained_variance_ratio_) print(sum(pca.explained_variance_ratio_))

[0.40340106 0.31163646 0.15708923 0.07870552 0.03381616 0.01535157]

1.0

代码语言:javascript
复制
plt.rcParams['axes.linewidth'] = 1.5plt.figure(figsize=(8,6))fig, ax = plt.subplots(figsize=(8,6))
var=np.cumsum(np.round(pca.explained_variance_ratio_, decimals=3)*100)plt.plot([i+1 for i in range(len(var))],var,'k-',linewidth=2)plt.xticks([i+1 for i in range(len(var))])plt.ylabel('% Variance Explained',fontsize=16,fontweight='bold')plt.xlabel('Pincipal Component (PC)',fontsize=16,fontweight='bold')ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)plt.tight_layout()plt.tick_params ('both',width=2,labelsize=12)
代码语言:javascript
复制
fig = plt.figure(figsize=(8,6))ax = fig.add_subplot(111)
ax.plot(descriptors_pca['PC1'],descriptors_pca['PC2'],'o',color='k')ax.set_title ('Principal Component Analysis',fontsize=16,fontweight='bold',family='sans-serif')ax.set_xlabel ('PC1',fontsize=14,fontweight='bold')ax.set_ylabel ('PC2',fontsize=14,fontweight='bold')
plt.tick_params ('both',width=2,labelsize=12)
plt.tight_layout()plt.show()

K-means聚类和主要特征识别

将PCA值从-1重新缩放到1,因为想要在特征(描述符)的协方差周期内分析数据。

代码语言:javascript
复制
# This normalization will be performed just for PC1 and PC2, but can be done for all the components.scale1 = 1.0/(max(descriptors_pca['PC1']) - min(descriptors_pca['PC1']))scale2 = 1.0/(max(descriptors_pca['PC2']) - min(descriptors_pca['PC2']))
# add the new values to our PCA coloumdescriptors_pca['PC1_normalized']=[i*scale1 for i in descriptors_pca['PC1']]descriptors_pca['PC2_normalized']=[i*scale2 for i in descriptors_pca['PC2']]
代码语言:javascript
复制
descriptors_pca.head(10)
代码语言:javascript
复制
fig = plt.figure(figsize=(8,6))ax = fig.add_subplot(111)
ax.plot(descriptors_pca['PC1_normalized'],descriptors_pca['PC2_normalized'],'o',color='k')ax.set_title ('Principal Component Analysis',fontsize=16,fontweight='bold',family='sans-serif')ax.set_xlabel ('PC1',fontsize=14,fontweight='bold')ax.set_ylabel ('PC2',fontsize=14,fontweight='bold')
plt.tick_params ('both',width=2,labelsize=12)
plt.tight_layout()plt.show()

K-means聚类

K-means聚类是一种算法,其中必须定义聚类的数量。然而,为了在数学上基于分布为一组点选择多个聚类,可以应用不同的算法。

代码语言:javascript
复制
range_n_clusters = [2, 3, 4, 5, 6, 7,8,9,10]for n_clusters in range_n_clusters:    fig, (ax1,ax2)= plt.subplots(1, 2)    fig.set_size_inches(8, 4)
    kmeans = KMeans(n_clusters=n_clusters, random_state=10)    cluster_labels = kmeans.fit_predict(descriptors_pca[['PC1_normalized','PC2_normalized']])    silhouette_avg = silhouette_score(descriptors_pca[['PC1_normalized','PC2_normalized']], cluster_labels)    print("For n_clusters =", n_clusters,          "The average silhouette_score is :", silhouette_avg)
    sample_silhouette_values = silhouette_samples(descriptors_pca[['PC1_normalized','PC2_normalized']], cluster_labels)
    y_lower = 10
    for i in range(n_clusters):        # Aggregate the silhouette scores for samples belonging to        # cluster i, and sort them        ith_cluster_silhouette_values = \            sample_silhouette_values[cluster_labels == i]
        ith_cluster_silhouette_values.sort()
        size_cluster_i = ith_cluster_silhouette_values.shape[0]        y_upper = y_lower + size_cluster_i
        color = cm.nipy_spectral(float(i) / n_clusters)        ax1.fill_betweenx(np.arange(y_lower, y_upper),                          0, ith_cluster_silhouette_values,                          facecolor=color, edgecolor=color, alpha=0.7)
        # Label the silhouette plots with their cluster numbers at the middle        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        # Compute the new y_lower for next plot        y_lower = y_upper + 10  # 10 for the 0 samples
    ax1.set_title("The silhouette plot for the various clusters.")    ax1.set_xlabel("The silhouette coefficient values")    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    # 2nd Plot showing the actual clusters formed    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)    ax2.scatter(descriptors_pca['PC1_normalized'], descriptors_pca['PC2_normalized'],                 marker='.', s=30, lw=0, alpha=0.7,c=colors, edgecolor='k')

    # Labeling the clusters    centers = kmeans.cluster_centers_    # Draw white circles at cluster centers    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',                c="white", alpha=1, s=200, edgecolor='k')
    for i, c in enumerate(centers):        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,                    s=50, edgecolor='k')
    ax2.set_title("The visualization of the clustered data.")    ax2.set_xlabel("PC1")    ax2.set_ylabel("PC2")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "                  "with n_clusters = %d" % n_clusters),                 fontsize=14, fontweight='bold')

plt.show()

Silhouette_score越高,群集分布越好。

n_clusters = 2平均silhouette_score是:0.4205650178373521

代码语言:javascript
复制
kmeans = KMeans(n_clusters=2, random_state=10) # We define the best number of clustersclusters = kmeans.fit(descriptors_pca[['PC1_normalized','PC2_normalized']]) #PC1 vs PC2 (normalized values)
代码语言:javascript
复制
descriptors_pca['Cluster_PC1_PC2'] = pd.Series(clusters.labels_, index=data.index)descriptors_pca.head(10)

绘制PC1与PC2数据, 每个群集将具有不同的颜色,将找到每个主要组件的主要特征。

代码语言:javascript
复制
plt.rcParams['axes.linewidth'] = 1.5plt.figure(figsize=(10,8))
fig, ax = plt.subplots(figsize=(7,7))
color_code={ 0:        'magenta',\             1.0:   'orange',\             2.0:      'cyan',\             3.0:           'c',\             4.0:        'm',\             5.0:        'y',\             6.0:        'darkorange',             7.0:       'k',             }
for i in descriptors_pca.index:         ax.plot(descriptors_pca.loc[i].at['PC1_normalized'],descriptors_pca.loc[i].at['PC2_normalized'],                    c=color_code[descriptors_pca.loc[i].at['Cluster_PC1_PC2']],                    marker='o',markersize=8,markeredgecolor='k',alpha=0.3)

plt.xlabel ('PC1',fontsize=14,fontweight='bold')ax.xaxis.set_label_coords(0.98, 0.45)plt.ylabel ('PC2',fontsize=14,fontweight='bold')ax.yaxis.set_label_coords(0.45, 0.98)plt.tick_params ('both',width=2,labelsize=12)ax.spines['left'].set_position(('data', 0))ax.spines['bottom'].set_position(('data', 0))ax.spines['top'].set_visible(False)ax.spines['right'].set_visible(False)
lab=['MolWt', 'TPSA', 'nRotB', 'HBD','HBA', 'LogP'] #Feature labels
l=np.transpose(pca.components_[0:2, :]) ## We will get the components eigenvectors (main features) for PC1 and PC2
n = l.shape[0]for i in range(n):    plt.arrow(0, 0, l[i,0], l[i,1],color= 'k',alpha=0.6,linewidth=1.2,head_width=0.025)    plt.text(l[i,0]*1.25, l[i,1]*1.25, lab[i], color = 'k',va = 'center', ha = 'center',fontsize=11)
circle = plt.Circle((0,0), 1, color='gray', fill=False,clip_on=True,linewidth=1.5,linestyle='--')ax.add_artist(circle)plt.xlim(-1.2,1.2)plt.ylim(-1.2,1.2)plt.tight_layout()plt.savefig("fig1.png", dpi=300)plt.show()

可以确定与PC1(MolWt、nRotB、LogP)和PC2(HBD,TPSA,HBA)正相关和负相关的特征。此外,由于向量长度,可以看到LogP是“最重要的”特征(描述符)。

合并保存数据

代码语言:javascript
复制
data=data.join(descriptors_pca)data.head(10)data.to_csv('moldscPCA.csv',index=None)

参考资料

http://www.rdkit.org

http://www.rdkit.org/docs/index.html

本文参与 腾讯云自媒体同步曝光计划,分享自微信公众号。
原始发表:2019-09-15,如有侵权请联系 cloudcommunity@tencent.com 删除

本文分享自 DrugAI 微信公众号,前往查看

如有侵权,请联系 cloudcommunity@tencent.com 删除。

本文参与 腾讯云自媒体同步曝光计划  ,欢迎热爱写作的你一起参与!

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
目录
  • 基于主成分分析(PCA,Principal Component Analysis)和聚类分析探索化学空间。
    • 导入库
      • 载入分子数据
        • 分子描述符计算
          • 计算分子描述符的PCA
            • K-means聚类和主要特征识别
              • K-means聚类
                • 合并保存数据
                领券
                问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档