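I have a list of IDs, each with a long description made up of terms separated by semicolons. Below is an example of one ID and its description.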
ID Description
O95831 activation of cysteine-type endopeptidase activity involved in apoptotic process; apoptotic DNA fragmentation; apoptotic process; cell redox homeostasis; chromosome condensation; DNA catabolic process; intrinsic apoptotic signaling pathway in response to endoplasmic reticulum stress; mitochondrial respiratory chain complex I assembly; NAD(P)H oxidase activity; neuron apoptotic process; neuron differentiation; oxidoreductase activity, acting on NAD(P)H; positive regulation of apoptotic process; regulation of apoptotic DNA fragmentation
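Question: I want a text-mining approach that extracts the terms in the description that mention "mitochondria", "mitochondrial", or "mitochondrion". Would a regex be useful for this? If not, what other approaches might work?
Expected result: the terms containing the word "mitochondria" (or one of its variants), extracted along with the ID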
O95831 ;mitochondrial respiratory chain complex I assembly;
Thanks for your help,
Posted on 2014-11-24 16:54:42
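You can use a regular expression, for example
(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)
The ID is matched with \w+ because O95831 starts with the letter O; \d+ would capture only 95831. Capture groups 1 and 2 will contain
O95831 and ; mitochondrial respiratory chain complex I assembly;
Example: http://regex101.com/r/mR8xA7/1
In Python, the code would look like
>>> import re
>>> line = "O95831 apoptotic process; mitochondrial respiratory chain complex I assembly; neuron differentiation"
>>> re.findall(r"(\w+).*(.\s(?:mitochondria|mitochondrial|mitochondrion)[^;]+;)", line)
[('O95831', '; mitochondrial respiratory chain complex I assembly;')]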
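Because the greedy .* in the pattern above skips ahead to the last matching term, a line with several mitochondria-related terms yields only one of them. As a possible alternative (a minimal sketch, not part of the original answer), you could split the description on semicolons and keep every term that matches. The names MITO and extract_terms are hypothetical, and the sketch assumes each line is non-empty and that the ID is separated from the description by whitespace.

```python
import re

# Matches "mitochondria", "mitochondrial", or "mitochondrion" as whole words,
# ignoring case. (Hypothetical helper name.)
MITO = re.compile(r"\bmitochondri(?:a|al|on)\b", re.IGNORECASE)


def extract_terms(line, pattern=MITO):
    """Return (id, [matching terms]) for one 'ID term; term; ...' line.

    Assumes the line is non-empty and the ID is the first
    whitespace-delimited token.
    """
    parts = line.strip().split(None, 1)
    record_id = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    # Split the description into individual terms and trim whitespace.
    terms = [t.strip() for t in description.split(";")]
    # Keep only the terms that mention one of the target words.
    return record_id, [t for t in terms if pattern.search(t)]


line = ("O95831 apoptotic process; cell redox homeostasis; "
        "mitochondrial respiratory chain complex I assembly; "
        "neuron differentiation")
print(extract_terms(line))
```

For the sample line this prints the ID together with a one-element list holding "mitochondrial respiratory chain complex I assembly". A term at the very end of the description, with no trailing semicolon, is also picked up.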
https://stackoverflow.com/questions/27109864