I have two DataFrames in pandas.
The data they hold is as follows:
DF1
- DatabaseId Integer
- DatabaseName String
DF2
- CreateString String

I want to apply the DatabaseId column to every record in DF2 where DF1.DatabaseName occurs in the CreateString text.
Example:
DatabaseName = "UserDB" CreateString = "This create string would fail"
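DatabaseName = "UserDB" CreateString = "This create string has UserDB in it"
The first record would fail and would not be included in the result set. The second record would succeed and would appear in the resulting records.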
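I have looked at various options, including .isin and .contains, but none of them work for this. It looks like a "controlled" Cartesian join with a "succeed if a match is found" condition, but I have not been able to find a way to do it, let alone an efficient one.
The total list that needs to be evaluated has between 100K and 500K entries.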
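To make the intended semantics concrete, here is a brute-force sketch of the join I have in mind. The tiny frames are made up from the sample data, and `how="cross"` assumes pandas 1.2 or later. It compares every pair of rows, so it is only an illustration and would not scale to the sizes above.

```python
import pandas as pd

# Toy frames modelled on the sample data (illustrative values only).
df1 = pd.DataFrame({"DatabaseID": [0, 1, 2, 3],
                    "DatabaseName": ["DB1", "DB2", "DB3", "DB4"]})
df2 = pd.DataFrame({"CreateString": [None,
                                     "CREATE VIEW DB1.Table1 AS ...",
                                     "REPLACE VIEW DB3.Table3 ...",
                                     "CREATE VIEW DB55.Table999 AS ..."]})

# Cross join: pair every DF1 row with every non-null DF2 row.
pairs = df1.merge(df2.dropna(subset=["CreateString"]), how="cross")

# Keep only the pairs where the database name occurs in the create string.
mask = [name in text
        for name, text in zip(pairs["DatabaseName"], pairs["CreateString"])]
naive = pairs[mask].reset_index(drop=True)
print(naive)
```

A plain substring test also has a precision problem: a name such as `DB5` would match a string that mentions `DB55`.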
UPDATE: added more sample data:
>>> DF1.head(10)
DatabaseID  DatabaseName
0           DB1
1           DB2
2           DB3
3           DB4
...
>>> DF2.head(10)
CreateString
None
None
None
CREATE VIEW DB1.Table1 AS LOC…
None
REPLACE VIEW DB3.Table3...
CREATE VIEW DB3.Table10 AS SELE...
CREATE VIEW DB55.Table999 AS SELEC...
...
Desired Result
DatabaseID  DatabaseName  CreateText
0           DB1           CREATE VIEW DB1.Table1 AS LOC…
2           DB3           REPLACE VIEW DB3.Table3...
2           DB3           CREATE VIEW DB3.Table10 AS SELE...
...
etc...
...

Posted on 2016-07-07 19:10:29
Update: how to parse the table name:
In [100]: df2['TableName'] = df2.CreateString.str.extract(r'\s+(\w+\.\w+)\s+', expand=True)
In [101]: df2
Out[101]:
   CreateString                           DatabaseName  TableName
0  None                                   NaN           NaN
1  None                                   NaN           NaN
2  None                                   NaN           NaN
3  CREATE VIEW DB1.Table1 AS LOC…         DB1           DB1.Table1
4  None                                   NaN           NaN
5  REPLACE VIEW DB3.Table3 ...            DB3           DB3.Table3
6  CREATE VIEW DB3.Table10 AS SELE...     DB3           DB3.Table10
7  CREATE VIEW DB55.Table999 AS SELEC...  DB55          DB55.Table999

Original answer:
You can do it this way:
In [83]: df2['DatabaseName'] = df2.CreateString.str.extract(r'\s+(\w+)\.\w+\s+', expand=True)
In [84]: pd.merge(df2, df1, on='DatabaseName', how='left')
Out[84]:
   CreateString                           DatabaseName  DatabaseID
0  None                                   NaN           NaN
1  None                                   NaN           NaN
2  None                                   NaN           NaN
3  CREATE VIEW DB1.Table1 AS LOC…         DB1           0.0
4  None                                   NaN           NaN
5  REPLACE VIEW DB3.Table3 ...            DB3           2.0
6  CREATE VIEW DB3.Table10 AS SELE...     DB3           2.0
7  CREATE VIEW DB55.Table999 AS SELEC...  DB55          NaN

https://stackoverflow.com/questions/38252178
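The merge above uses how='left', so rows with no matching database (the None rows and DB55) stay in the output with NaN. If only matching rows are wanted, as in the desired result from the question, an inner join may be closer. The following is a minimal, self-contained sketch: the frames are toy data modelled on the question's sample, and it assumes every create string contains a `<database>.<table>` token surrounded by whitespace.

```python
import pandas as pd

# Toy frames modelled on the question's sample data (illustrative values only).
df1 = pd.DataFrame({"DatabaseID": [0, 1, 2, 3],
                    "DatabaseName": ["DB1", "DB2", "DB3", "DB4"]})
df2 = pd.DataFrame({"CreateString": [
    None,
    "CREATE VIEW DB1.Table1 AS ...",
    "REPLACE VIEW DB3.Table3 ...",
    "CREATE VIEW DB3.Table10 AS ...",
    "CREATE VIEW DB55.Table999 AS ...",
]})

# Pull the database qualifier out of "<db>.<table>"; rows without a match get NaN.
df2["DatabaseName"] = df2["CreateString"].str.extract(
    r"\s+(\w+)\.\w+\s+", expand=False)

# An inner join drops rows whose extracted name is missing or unknown to df1.
result = df1.merge(df2, on="DatabaseName", how="inner")
print(result)
```

Because the inner join never introduces missing values, DatabaseID keeps its integer dtype instead of being upcast to float as in the left merge.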