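I have a list of medicine names that I want to match in a .txt file, extracting all the data that lies between two medicines. (Examples of medicines in the csv file are 'Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate', etc.)
substancecopy.csv is the file containing the list of all medicines; I use it in the code below.
Below is sample data from the .txt file:
Quetiapine fumarate Drug substance This document
Povidone Binder USP
This line doesn't contain any medicine name.
Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv
Lactose monohydrate Diluent USNF
Magnesium stearate Lubricant USNF
I want to iterate over each line of the text file and create groups that run from one medicine to the next.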
Example output:
['Quetiapine fumarate Drug substance This document'],
['Povidone Binder USP'],
['Lactose monohydrate Diluent USNF'],
['Magnesium stearate Lubricant USNF']
given that 'Quetiapine fumarate', 'Povidone', 'Lactose monohydrate', and 'Magnesium stearate' are in my substance list.
Can anyone help me do this in Python?
What I have tried so far:
import re
import pandas as pd
import csv
import os
file = open(r'C:\Users\substancecopy.csv', 'r')
oo=csv.reader(file)
allsub = []
for line in oo:
    allsub.append(line)
flat_list = [item for sublist in allsub for item in sublist]
def extract(filename):
    file=open(filename,encoding='utf-8')
    file=file.read()
    n=[]
    for x in flat_list:
        my_regex = r"^\s?" + re.escape(x)
        #my_regex_new = r"\b" + re.escape(my_regex) + r"\b"
        if re.search(my_regex,file,flags=re.IGNORECASE|re.MULTILINE):
            n.append(x)
    n.sort()
    return n
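I need to capture all the text from one medicine to the next, as shown in the example output; this code does not do that.
Posted on 2020-01-28 10:34:05
The approach below seems to work well on a small dataset. For large datasets, however, I would assume it may be inefficient, and there is probably a better way to do this.
The approach I took is based on the idea in your question that all data between medicine names must be stored. If you only want to store the lines that match a medicine, you can do the following: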
result = [[row.strip()] for row in data for med in meds if med in row]
#[['Quetiapine fumarate Drug substance This document'], ['Povidone Binder USP'], ['Lactose monohydrate Diluent USNF'], ['Magnesium stearate Lubricant USNF']]
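One thing to watch with the one-liner above: it appends a row once per matching name, so a line that mentions two medicines would appear twice. A minimal sketch of a variant that keeps each row at most once, assuming that is the behaviour you want (the in-memory `data` list and the two-medicine line are made up for illustration):

```python
meds = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']

# Stand-in for file.readlines(); the last line is a hypothetical row naming two medicines.
data = [
    'Quetiapine fumarate Drug substance This document\n',
    'Povidone Binder USP\n',
    "This line doesn't contain any medicine name.\n",
    'Povidone and Magnesium stearate blend\n',
]

# any() stops at the first matching name, so each row is kept at most once.
matched_rows = [[row.strip()] for row in data if any(med in row for med in meds)]

for row in matched_rows:
    print(row)
```
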
I loaded the medicine names into a list; you may need to adjust this to match your csv.
meds = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']
with open('1.txt', 'r') as file:
    data = file.readlines()
result = []  # Empty list to store our findings
for idx, row in enumerate(data):  # Get index and value of each line in the text file
    previous = data[idx - 1] if idx > 0 else ''  # The first row has no previous row; data[-1] would wrap around to the last line
    count = 0  # Set a counter for each row, this is to determine if there are no matches
    for med in meds:
        if med in row and med not in previous:  # If medication is matched and the same medication is not found in the previous row
            result.append([row.strip()])
        else:  # No match found on this medication, increase counter
            count += 1
    if count == len(meds) and result:  # If count == total medication, declare no match and append to previous row (rows before the first match are skipped)
        result[-1].append(row.strip())
for i in result:
    print(i)
#['Quetiapine fumarate Drug substance This document']
#['Povidone Binder USP', 'Povidone new line', "This line doesn't contain any medicine name.", 'Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv']
#['Lactose monohydrate Diluent USNF']
#['Magnesium stearate Lubricant USNF']
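I added a line, Povidone new line, to the test file to demonstrate that if the same medicine name is found again on the following line, that line is appended to the last result.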
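On the efficiency concern for larger files, one possible alternative is to compile all the names into a single regular expression and make one pass over the lines. This is only a sketch under a few assumptions that are not in the question: the helper names `load_names` and `group_lines` are made up, a medicine name only counts when it appears at the start of a line (as in the asker's `^\s?` pattern, rather than anywhere in the line), and a line repeating the current group's medicine stays in that group even if other lines came in between.

```python
import csv
import re

def load_names(csv_path):
    """Read every non-empty cell of the CSV into one flat list of names."""
    with open(csv_path, newline='', encoding='utf-8') as f:
        return [cell.strip() for row in csv.reader(f) for cell in row if cell.strip()]

def group_lines(lines, names):
    """Start a new group at each line that begins with a known name."""
    if not names:
        return []
    # Longest names first, so a longer name wins over a shorter one that is its prefix.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile(r'\s*(' + '|'.join(map(re.escape, ordered)) + r')\b', re.IGNORECASE)
    groups, current = [], None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        match = pattern.match(line)
        name = match.group(1).lower() if match else None
        if name is not None and name != current:
            groups.append([text])  # a different medicine starts a new group
            current = name
        elif groups:
            groups[-1].append(text)  # no new medicine: attach to the current group
        # lines that come before the first medicine are skipped
    return groups

lines = [
    'Quetiapine fumarate Drug substance This document',
    'Povidone Binder USP',
    'Povidone new line',
    "This line doesn't contain any medicine name.",
    'Dibasic calcium phosphate dihydrate Diluent USP is not present in the csv',
    'Lactose monohydrate Diluent USNF',
    'Magnesium stearate Lubricant USNF',
]
names = ['Quetiapine fumarate', 'Povidone', 'Magnesium stearate', 'Lactose monohydrate']
groups = group_lines(lines, names)
for group in groups:
    print(group)
```

With a real file you would pass `load_names(r'C:\Users\substancecopy.csv')` and the open text file object instead of the inline lists. Each line is checked against one compiled pattern rather than looped over every name, which should scale better as the name list grows, though I have not measured it.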
https://stackoverflow.com/questions/59941196