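Today I'm sharing a purely bioinformatic paper on traditional Chinese medicine. It involved no sequencing and no wet-lab experiments; it analyzed only data from public databases, and it was published in Phytomedicine. Let's look at what the authors did.
In short, the authors combined single-cell analysis, network pharmacology, Mendelian randomization (MR), and molecular docking to work out the mechanism by which Guizhi Shaoyao Zhimu decoction (GZSYZM) acts in the treatment of rheumatoid arthritis (RA).
Original paper: https://doi.org/10.1016/j.phymed.2024.156332
Let's see whether an AI coder can reproduce this work. To match the paper's results as closely as possible, we distilled the Methods section of the paper into a summary and used it as the AI coder's memory file.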
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Project Overview
This is a bioinformatics project focused on rheumatoid arthritis (RA) gene expression analysis. The goal is to identify potential disease genes through differential expression analysis and weighted gene co-expression network analysis (WGCNA), ultimately generating five diagnostic figures as shown in Figure 1 of the accompanying research paper.
## Data Sources and Structure
### GEO Datasets
- **GSE205962**: Bulk RNA-seq data (GPL16043 platform)
- **GSE100191**: Bulk RNA-seq data (GPL13497 platform)
- **GSE17755**: Validation dataset using peripheral blood samples (GPL1291 platform)
- **GSE159117**: Single-cell RNA-seq data for scRNA-seq analysis
### Directory Structure
data/
├── GSE100191/ # Bulk RNA-seq expression data
├── GSE17755/ # Validation dataset (peripheral blood)
├── GSE159117/ # Single-cell RNA-seq data
└── GSE205962/ # Primary bulk RNA-seq dataset
workflow/ # Processing scripts (currently empty - needs population)
result/ # Output figures and analysis results (currently empty)
## Analysis Workflow
The project follows a multi-step bioinformatics pipeline:
1. **Data Preprocessing**
- Batch effect removal using ComBat() function from sva package
- Probe annotation mapping (GPL16043, GPL13497, GPL1291 platforms)
- Gene symbol ID assignment with maximum expression retention for multiple probes
2. **Differential Expression Analysis**
- Uses limma package for RA vs healthy control (HC) comparison
- Thresholds: p < 0.05, |log2 FC| ≥ 0.585
- Fold change calculation as ratio of disease to normal expression means
3. **WGCNA Analysis**
- Scale-free co-expression network construction
- Module identification and consolidation
- Gene significance (GS) and module membership (MM) calculation
- Thresholds: MM ≥ 0.6, GS ≥ 0.1
4. **Integration**
- Merge WGCNA and DEGs results to identify potential disease genes
## Expected Outputs
Generate five diagnostic figures (Figure 1):
- A: Volcano plot of differential expression analysis
- B: Heatmap of differential expression analysis
- C: Soft threshold selection and cluster dendrogram
- D: WGCNA correlation heatmap (modules vs HC/RA traits)
- E: Inter-module correlation heatmap with disease trait associations
## Technical Requirements
### R Environment
- R version 4.3.3
- Required packages:
- sva (ComBat function for batch effect removal)
- limma (differential expression analysis)
- WGCNA (weighted gene co-expression network analysis)
### Data Processing Notes
- Multiple microarray probes mapping to same gene: retain maximum expression value
- Statistical methods: Spearman's rank correlation, Mantel's test
- Expression filtering: genes with average FPKM > mean threshold
## Development Guidelines
- **Reference Material**: Use article.pdf as the authoritative source for methodology when implementation details are unclear
- **Script Location**: Place all final processing scripts in the `workflow/` directory
- **Output Location**: Save all generated figures and results in the `result/` directory
- **Language**: Comments and documentation should be in Chinese as specified in the original requirements
- **Intermediate Files**: Do not save intermediate processing files or script revisions - only keep final working versions
## Data File Formats
- **CEL files**: Affymetrix microarray data (.CEL.gz)
- **Series Matrix**: GEO series expression matrices (.txt.gz)
- **Platform Files**: Probe annotation files (.txt, .annot)
- **Single-cell**: 10X Genomics H5 format for scRNA-seq data
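After you enter the AI coder, a prompt appears: Run /init to create a CLAUDE.md file with instructions for Claude
/init: a built-in AI coder command that initializes the project with a CLAUDE.md guide. If you have written your own CLAUDE.md, running this command makes the AI coder review, verify, and supplement it against the files in the current directory. Check the revised content yourself to make sure nothing in it contradicts what you originally intended!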
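With the prerequisites in place, we can tell AI coder to start. The first task is to annotate the two GEO datasets, GSE100191 and GSE205962, separately, and then merge them and remove batch effects.
To save time, I had downloaded both datasets and the platform annotation files in advance. This was helped by a small download accelerator that AI coder wrote for me, which multiplied the download speed. It works on all operating systems; if you are interested, follow the public account and message customer service to get it.
AI coder creates a task list to track the whole workflow. It also corrects my spelling mistakes and checks the data on its own initiative.
After the initial processing, merging the two datasets turned up 0 common genes. AI coder then looked for the cause on its own and revised the script, and found that the probe IDs in the expression data did not match the IDs in the annotation file.
After searching the raw data and fixing the script, it finally completed the first step!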
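To make the "0 common genes" failure easier to picture, here is a minimal sketch of the probe-to-gene step in plain Python. It is not the script AI coder produced; the function names, probe IDs, and values are hypothetical, and "keep the probe with the highest mean expression" is only one reading of the "retain maximum expression value" rule in CLAUDE.md.

```python
from statistics import mean


def collapse_probes_to_genes(expr, probe_to_symbol):
    """Collapse probe-level rows to gene-level rows.

    expr: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_symbol: dict mapping probe ID -> gene symbol (from the platform file).
    When several probes map to one gene, the probe with the highest mean
    expression is kept (an assumed interpretation of the rule).
    """
    best = {}  # gene symbol -> (mean expression, values)
    for probe, values in expr.items():
        symbol = probe_to_symbol.get(probe)
        if not symbol:
            continue  # probe has no annotation: drop it
        avg = mean(values)
        if symbol not in best or avg > best[symbol][0]:
            best[symbol] = (avg, values)
    return {symbol: values for symbol, (_, values) in best.items()}


def annotation_hit_rate(expr, probe_to_symbol):
    """Fraction of expression probe IDs that appear in the annotation table."""
    if not expr:
        return 0.0
    hits = sum(1 for probe in expr if probe in probe_to_symbol)
    return hits / len(expr)


# Toy data: hypothetical probe IDs and values, for illustration only.
expr_a = {"p1": [5.0, 6.0], "p2": [7.0, 8.0], "p3": [1.0, 2.0]}
annot_a = {"p1": "TNF", "p2": "TNF", "p3": "IL6"}
expr_b = {"A_1": [3.0, 4.0], "A_2": [2.0, 2.5]}
annot_b = {"A_1": "IL6", "A_2": "STAT1"}

genes_a = collapse_probes_to_genes(expr_a, annot_a)
genes_b = collapse_probes_to_genes(expr_b, annot_b)
common_genes = sorted(set(genes_a) & set(genes_b))
print(common_genes)
print(annotation_hit_rate(expr_a, annot_a))
```

A hit rate near zero from `annotation_hit_rate` is the quick signal that the expression matrix and the annotation file use different ID schemes, which is the situation described above.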
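Tip: while AI coder is correcting errors, it is best to review the generated scripts and the approach it chose yourself. To cut down on run time, AI coder may create simulated data and analyze that instead, in which case the analysis finishes but the results are not real!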
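One rough way to support that manual review is to scan the generated R scripts for signs of simulated data. The sketch below is a heuristic of my own, not a feature of AI coder: the pattern list is an assumption, the sample script is hypothetical, and a clean scan does not prove that the data are real.

```python
import re

# Assumed heuristics: common R random-number calls plus words that often
# accompany placeholder data, in English or Chinese comments.
SUSPICIOUS = re.compile(
    r"\b(?:rnorm|runif|rbinom|rpois|set\.seed)\s*\(|simulat|mock|模拟",
    re.IGNORECASE,
)


def find_simulation_hints(script_text):
    """Return (line number, line) pairs that look like simulated data."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if SUSPICIOUS.search(line):
            hits.append((number, line.strip()))
    return hits


# Hypothetical script content, for illustration only.
sample_script = "\n".join([
    'expr <- read.table("data/GSE100191/matrix.txt")',
    "set.seed(42)",
    "expr <- matrix(rnorm(1000), nrow = 100)  # simulated expression",
    "fit <- lmFit(expr, design)",
])

hints = find_simulation_hints(sample_script)
for number, line in hints:
    print(number, line)
```

Any hit is only a prompt to read that part of the script; calls such as `set.seed` also appear in legitimate analyses.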
Next, I used the merged data for the differential expression analysis, and then drew the volcano plot and the heatmap!
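The thresholds in CLAUDE.md (p < 0.05, |log2 FC| ≥ 0.585, roughly a 1.5-fold change) decide which genes count as differentially expressed and how each point is colored in the volcano plot. Here is a minimal sketch of that thresholding alone, assuming the log2 fold changes and p-values have already been computed by limma; the gene names and numbers are hypothetical.

```python
import math


def classify_gene(log2_fc, p_value, fc_cutoff=0.585, p_cutoff=0.05):
    """Label a gene as up-regulated, down-regulated, or not significant."""
    if p_value < p_cutoff and log2_fc >= fc_cutoff:
        return "up"
    if p_value < p_cutoff and log2_fc <= -fc_cutoff:
        return "down"
    return "ns"


def volcano_point(log2_fc, p_value):
    """Coordinates of one gene in a volcano plot: (log2 FC, -log10 p)."""
    return (log2_fc, -math.log10(p_value))


# Hypothetical results table: gene -> (log2 FC, p-value).
results = {
    "geneA": (1.20, 0.001),
    "geneB": (-0.70, 0.010),
    "geneC": (2.00, 0.200),   # large change, but not significant
    "geneD": (0.30, 0.0001),  # significant, but the change is too small
}

labels = {gene: classify_gene(fc, p) for gene, (fc, p) in results.items()}
degs = sorted(gene for gene, label in labels.items() if label != "ns")
print(labels)
print(degs)
```

Only genes that pass both cutoffs are carried forward as DEGs; the other points still appear in the volcano plot as background.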
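Then we moved on to weighted gene co-expression network analysis (WGCNA). WGCNA identifies gene co-expression modules, so we can pick out the modules that correlate strongly with the RA phenotype, along with the hub genes inside them, and look for candidate functional groups of genes.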
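The last filtering step combines the WGCNA statistics with the DEG list. The sketch below assumes that module membership (MM) is the correlation of a gene with its module eigengene, that gene significance (GS) is the correlation of a gene with the HC/RA trait, and that the cutoffs from CLAUDE.md (MM ≥ 0.6, GS ≥ 0.1) apply to absolute values. All gene names, expression values, and the eigengene are hypothetical; in practice the WGCNA package computes these quantities.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def candidate_genes(expr, eigengene, trait, degs, mm_cutoff=0.6, gs_cutoff=0.1):
    """Genes passing both WGCNA cutoffs that are also in the DEG set."""
    keep = []
    for gene, values in expr.items():
        mm = abs(pearson(values, eigengene))  # module membership
        gs = abs(pearson(values, trait))      # gene significance
        if mm >= mm_cutoff and gs >= gs_cutoff and gene in degs:
            keep.append(gene)
    return sorted(keep)


# Toy data: four samples, two healthy controls (0) and two RA (1).
trait = [0, 0, 1, 1]
eigengene = [-0.5, -0.4, 0.4, 0.5]
expr = {
    "g1": [1.0, 1.2, 2.9, 3.1],  # tracks the module and the trait
    "g2": [2.0, 1.0, 2.0, 1.0],  # unrelated to the trait
    "g3": [5.0, 5.1, 6.8, 7.0],  # tracks the trait but is not a DEG here
}
deg_set = {"g1", "g2"}

candidates = candidate_genes(expr, eigengene, trait, deg_set)
print(candidates)
```

Taking the intersection keeps only genes supported by both lines of evidence, which is what the Integration step in CLAUDE.md describes.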
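Identification of potential rheumatoid arthritis genes. A: Volcano plot of the differential expression analysis; genes with |log2 FC| > 1 are labeled. B: Heatmap of the differential expression analysis. C: Soft-threshold selection and cluster dendrogram; genes are grouped into modules by hierarchical clustering. D: WGCNA heatmap based on the correlations between modules and the HC/RA traits. E: Heatmap of inter-module correlations and of module associations with the disease trait, based on Spearman's rank correlation test and Mantel's test.
This article walked through how AI coder used DEGs combined with WGCNA to run an initial screen for candidate RA genes; that stage of the analysis is now complete. For reasons of length, the remaining steps and a deeper discussion will follow in the next article.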