I need help parsing a very long text file that looks like this:
NAME         IMP4
DESCRIPTION  small nucleolar ribonucleoprotein
CLASS        Genetic Information Processing
             Translation
             Ribosome biogenesis in eukaryotes
DBLINKS      NCBI-GI: 15529982
             NCBI-GeneID: 92856
             OMIM: 612981
///
NAME         COMMD9
DESCRIPTION  COMM domain containing 9
ORGANISM     H.sapiens
DBLINKS      NCBI-GI: 156416007
             NCBI-GeneID: 29099
             OMIM: 612299
///
.....
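I want to get a structured CSV file in which every row has the same number of columns, so that I can easily extract the information I need.
First, I tried this: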
for line in a:
    if '///' not in line:
        b.write(''.join(line.replace('\n', '\t')))
    else:
        b.write('\n')
which produces a CSV like this:
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tCLASS Genetic Information Processing\t Translation\t Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982\t NCBI-GeneID: 92856\t OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tDBLINKS NCBI-GI: 156416007\t NCBI-GeneID: 29099\t OMIM: 612299
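The main problem is that fields such as DBLINKS, which span several lines in the original file, end up split across several columns, whereas I need each of them in a single column. In addition, not every field is present in every record: in the example, the first record has no 'ORGANISM' field and the second has no 'CLASS' field.
The file I would like to obtain should look like this: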
NAME IMP4\tDESCRIPTION small nucleolar ribonucleoprotein\tNA\tCLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes\tDBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME COMMD9\tDESCRIPTION COMM domain containing 9\tORGANISM H.sapiens\tNA\tDBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299
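Can you help me?
Posted on 2011-11-16 18:56:13
You can use itertools.groupby twice: once to collect the lines into records, and a second time to collect the lines of each multi-line field into an iterator: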
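import itertools

def is_end_of_record(line):
    # A line starting with '///' terminates the current record.
    return line.startswith('///')

class FieldClassifier(object):
    """Stateful groupby key: returns the name of the field a line belongs to."""
    def __init__(self):
        self.field = ''

    def __call__(self, row):
        # An unindented line starts a new field; an indented line continues the last one.
        if not row[0].isspace():
            self.field = row.split(' ', 1)[0]
        return self.field

fields = 'NAME DESCRIPTION ORGANISM CLASS DBLINKS'.split()

with open('data', 'r') as f:
    # First groupby: split the file into records at the '///' lines.
    for end_of_record, lines in itertools.groupby(f, is_end_of_record):
        if not end_of_record:
            classifier = FieldClassifier()
            record = {}
            # Second groupby: gather the lines of each field, continuations included.
            for fieldname, row in itertools.groupby(lines, classifier):
                record[fieldname] = '; '.join(r.strip() for r in row)
            # Emit the fields in a fixed order, with 'NA' for any that are missing.
            print('\t'.join(record.get(fieldname, 'NA') for fieldname in fields))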
yields
NAME IMP4 DESCRIPTION small nucleolar ribonucleoprotein NA CLASS Genetic Information Processing; Translation; Ribosome biogenesis in eukaryotes DBLINKS NCBI-GI: 15529982; NCBI-GeneID: 92856; OMIM: 612981
NAME COMMD9 DESCRIPTION COMM domain containing 9 ORGANISM H.sapiens NA DBLINKS NCBI-GI: 156416007; NCBI-GeneID: 29099; OMIM: 612299
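The above is the output as you would see it printed. It matches the desired output you posted, assuming that what you showed was the repr of that output.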
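To make the repr remark concrete, here is a minimal sketch (the two-field sample line is made up for illustration) of how the same string looks when printed and when shown through repr:

```python
# Hypothetical sample line: two fields joined by a real tab character.
line = '\t'.join(['NAME IMP4', 'DBLINKS NCBI-GI: 15529982'])

print(line)        # the tab is rendered as whitespace
print(repr(line))  # the tab is shown as the escape sequence \t
```

So a file that displays literal `\t` markers when inspected with repr still contains ordinary tab characters.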
References for the tools used:
__call__ method
generator expression
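As a small illustration of how these tools fit together (a sketch on made-up in-memory lines, not the code from the answer): an object with a `__call__` method can act as a stateful key for `itertools.groupby`, so that indented continuation lines inherit the label of the last unindented line. The class name `LastLabel` and the sample lines are hypothetical.

```python
import itertools

class LastLabel:
    """Stateful key: remembers the label of the most recent unindented line."""
    def __init__(self):
        self.label = ''

    def __call__(self, line):
        # A line that starts with a non-space character introduces a new label.
        if not line[0].isspace():
            self.label = line.split(None, 1)[0]
        return self.label

# Hypothetical sample: continuation lines start with whitespace.
lines = [
    'DBLINKS     NCBI-GI: 15529982',
    '            NCBI-GeneID: 92856',
    'NAME        IMP4',
]

grouped = {
    label: '; '.join(item.strip() for item in group)  # generator expression
    for label, group in itertools.groupby(lines, LastLabel())
}
print(grouped)
```

Because `groupby` only merges consecutive items with equal keys, the key object has to carry state between calls; a plain function looking at one line at a time could not tell which field a continuation line belongs to.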
Posted on 2011-11-16 19:13:38
This script converts your text file into a valid CSV file (one that can be read with Excel, for example):
import sys

if len(sys.argv) < 3:
    print('Usage: %s <input-file> <output-file>' % sys.argv[0])
    sys.exit(1)

# Assumes the key occupies the first 13 characters of each line
KEY_WIDTH = 13

entries = []
entry = {}
key = None

# Read the input file
with open(sys.argv[1]) as infile:
    for line in infile:
        # Check for the end of the current entry
        if line.strip() == '///':
            if entry:
                entries.append(entry)
            entry = {}
            key = None
            continue
        # Check for presence of a key
        possible_key = line[:KEY_WIDTH].strip()
        if possible_key:
            key = possible_key
            entry[key] = []
        # Assemble the value
        if key:
            entry[key].append(line[KEY_WIDTH:].strip())

# Append the last entry
if entry:
    entries.append(entry)

# 'entries' is now a list of dicts, each mapping a key to a list of values

# Find out all possible keys
all_keys = set()
for entry in entries:
    all_keys.update(entry.keys())
columns = sorted(all_keys)

# Write all entries to the output file (newline='' keeps the explicit '\r\n' intact)
with open(sys.argv[2], 'w', newline='') as output:
    # The first line will contain the keys
    output.write(','.join('"%s"' % key for key in columns))
    output.write('\r\n')
    # Write each entry; a key missing from an entry becomes an empty field
    for entry in entries:
        output.write(','.join('"%s"' % ';'.join(entry[key]) if key in entry else ''
                              for key in columns))
        output.write('\r\n')
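One caveat of writing the quotes by hand is that a value which itself contains a double quote is not escaped. A possible alternative, sketched here under the assumption that `entries` has the list-of-dicts shape built above, is to let the standard `csv` module handle the quoting through `csv.DictWriter`. The function name `write_entries`, the sample data, and the choice of 'NA' for missing fields (taken from the question rather than from this answer) are illustrative only.

```python
import csv
import io

def write_entries(entries, out):
    """Write a list of {key: [values]} dicts as CSV, one row per entry."""
    # Collect every key seen in any entry so all rows share the same columns.
    fieldnames = sorted({key for entry in entries for key in entry})
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='NA',
                            quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for entry in entries:
        # Join multi-line values into one field; missing keys are filled with 'NA'.
        writer.writerow({key: '; '.join(values) for key, values in entry.items()})

# Hypothetical sample loosely mirroring the two records in the question.
entries = [
    {'NAME': ['IMP4'], 'DBLINKS': ['NCBI-GI: 15529982', 'OMIM: 612981']},
    {'NAME': ['COMMD9'], 'ORGANISM': ['H.sapiens']},
]
buffer = io.StringIO()
write_entries(entries, buffer)
print(buffer.getvalue())
```

When writing to a real file instead of a `StringIO`, the `csv` documentation recommends opening it with `newline=''` so that the writer controls the line endings.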
https://stackoverflow.com/questions/8150192