I am trying to parse some data, called data, that has the following format:
data = '''(def-instance Adelphi
(expenses thous$:7-10)
(academic-emphasis biology))
(def-instance Arizona-State
(expenses thous$:4-7)
(academic-emphasis fine-arts))'''
I want to split the data into a list, with the first block as the first entry and the second block as the second entry, i.e.:
['''(def-instance Adelphi
(expenses thous$:7-10)
(academic-emphasis biology))''',
'''(def-instance Arizona-State
(expenses thous$:4-7)
(academic-emphasis fine-arts))''']
I tried re.split(r'\(*(\([^()]*\)*)*\)', data), but the result is not what I expect and I don't understand why. Any help would be appreciated.
Posted on 2020-03-19 07:48:01
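You can do this by iterating over the lines of the data, looking for )), and building the result list from the index at which each one is found.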
lines = data.split('\n')
result = []
prev = 0  # index of the first line of the current block
for idx, value in enumerate(lines):
    # A line containing '))' closes the current block
    if '))' in value:
        result.append('\n'.join(lines[prev:idx + 1]))
        prev = idx + 1
This outputs the following:
print(result)
#['(def-instance Adelphi\n (state newyork)\n (control private)\n (no-of-students thous:5-10)\n (male:female ratio:30:70)\n (student:faculty ratio:15:1)\n (sat verbal 500)\n (sat math 475)\n (expenses thous$:7-10)\n (percent-financial-aid 60)\n (no-applicants thous:4-7)\n (percent-admittance 70)\n (percent-enrolled 40)\n (academics scale:1-5 2)\n (social scale:1-5 2)\n (quality-of-life scale:1-5 2)\n (academic-emphasis business-administration)\n (academic-emphasis biology))', '(def-instance Arizona-State\n (state arizona)\n (control state)\n (no-of-students thous:20+)\n (male:female ratio:50:50)\n (student:faculty ratio:20:1)\n (sat verbal 450)\n (sat math 500)\n (expenses thous$:4-7)\n (percent-financial-aid 50)\n (no-applicants thous:17+)\n (percent-admittance 80)\n (percent-enrolled 60)\n (academics scale:1-5 3)\n (social scale:1-5 4)\n (quality-of-life scale:1-5 5)\n (academic-emphasis business-education)\n (academic-emphasis engineering)\n (academic-emphasis accounting)\n (academic-emphasis fine-arts))']
On the updated data set:
result
#['(def-instance Adelphi\n (expenses thous$:7-10)\n (academic-emphasis biology))', '(def-instance Arizona-State\n (expenses thous$:4-7)\n (academic-emphasis fine-arts))']
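The line-based search above relies on every block ending with )) on a single line. As a hedged alternative, here is a minimal sketch that tracks parenthesis depth instead and closes a block whenever the depth returns to zero. The function name split_top_level is hypothetical, and the sketch assumes the parentheses in the input are balanced.

```python
def split_top_level(text):
    """Return the top-level parenthesised blocks found in text.

    Assumes balanced parentheses; text outside any block is ignored.
    """
    blocks = []
    depth = 0      # current nesting depth
    start = None   # index where the current top-level block opened
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                # Back at the top level: the block is complete
                blocks.append(text[start:i + 1])
    return blocks


data = '''(def-instance Adelphi
(expenses thous$:7-10)
(academic-emphasis biology))
(def-instance Arizona-State
(expenses thous$:4-7)
(academic-emphasis fine-arts))'''

blocks = split_top_level(data)
for block in blocks:
    print(block)
    print()
```

Because it counts depth rather than looking for a fixed string, this version does not depend on how the closing parentheses are laid out across lines.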
Posted on 2020-03-19 07:52:42
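What the split positions have in common is that a )) ends the previous 'set', followed by a newline, and then a ( starts the next 'set'. This suggests splitting with a lookbehind and a lookahead: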
import re
data = '''(def-instance Adelphi
(expenses thous$:7-10)
(academic-emphasis biology))
(def-instance Arizona-State
(expenses thous$:4-7)
(academic-emphasis fine-arts))'''
# Split on whitespace that follows '))' and precedes '('
parts = re.split(r'(?<=\)\))\s+(?=\()', data)
for item in parts:
    print(item)
    print()
Output (each item printed separately for clarity):
(def-instance Adelphi
(expenses thous$:7-10)
(academic-emphasis biology))

(def-instance Arizona-State
(expenses thous$:4-7)
(academic-emphasis fine-arts))
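One likely reason the pattern in the question misbehaves is that it describes the blocks themselves rather than the separators between them: re.split discards whatever the pattern matches, and it also inserts the text of any capturing group into the result list. A minimal sketch of the opposite approach is to match each block with re.findall. The name block_pattern is only illustrative, and the pattern assumes nesting is never deeper than one level, as in the sample data.

```python
import re

data = '''(def-instance Adelphi
(expenses thous$:7-10)
(academic-emphasis biology))
(def-instance Arizona-State
(expenses thous$:4-7)
(academic-emphasis fine-arts))'''

# An outer '(' ... ')' containing either non-parenthesis characters
# or inner '(...)' groups that are not nested any further.
# The group is non-capturing so findall returns whole matches.
block_pattern = re.compile(r'\((?:[^()]|\([^()]*\))*\)')

blocks = block_pattern.findall(data)
for block in blocks:
    print(block)
    print()
```

With deeper nesting this pattern would stop matching whole blocks, so a depth-counting loop or a proper parser would be a safer choice in that case.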
https://stackoverflow.com/questions/60748759