文章/答案/技术大牛

发布

社区首页 >问答首页 >基于多准则合并两个CSV文件

问基于多准则合并两个CSV文件
EN

Stack Overflow用户

提问于 2021-07-19 14:43:32

回答 1查看 124关注 0票数 0

我有两个CSV文件。它们有相同的列，但同一列中的每一行并不是唯一的，如下所示：

gpo_full.csv:
    Date           hearing_sub_type   topic      Specific_Date
    January,1997   Oversight          weather    January 12,1997
    June,2000      General            life       June 5,2000
    January,1997   General            forest     January 1,1997
    April,2001     Oversight          people     NaN 
    June,2000      Oversight          depressed  June 6,2000
    January,1997   General            weather    January 1,1997
    June,2000      Oversight          depressed  June 5,2000

CAP_cols.csv:
    majortopic   id     Chamber   topic           Date           Specific_Date
    21           79846  1         many forest     January,1997   January 1,1997
    4            79847  2         emotion         June,2000      June 6,2000
    13           79848  1         NaN             May,2001       NaN
    7            79849  2         good life       June,2000      June 5,2000
    21           79850  1         good weather    January,1997   January 1,1997
    25           79851  1         rain & cloudy   January,1997   January 12,1997
    6            79852  2         sad & depressed June,2000      June 5,2000

我想使用三个标准来匹配这些数据: Specific_Date、日期和主题。

首先，我想使用“日期”列对这些数据进行分组。接下来，我尝试使用"Specific_Date“列缩小范围，因为本专栏中丢失了一些数据。最后，我希望使用类似于word-embedding这样的类似词来使用"topic“列，以确保gpo_full中的哪些行可以与CAP_cols中的唯一行相对应。

我尝试使用“日期”列对数据进行分组，并将它们合并到JSON文件中。然而，我被困在实现的下一步，缩小范围的具体日期和主题。

我对这个输出的想法是：

{
"Date": "January,1997",
"Specific_Date": "January 12,1997"
"Topic": {"GPO": "weather", "CAP": "rain & cloudy"}
"GPO": {
    "hearing_sub_type": "Oversight",
    and other columns
}
"CAP": {
    "majortopic": "25",
    "id": "79851",
    "Chamber": "1"
}
},
{
"Date": "January,1997",
"Specific_Date": "January 1,1997"
"Topic": {"GPO": "forest", "CAP": "many forest"}
"GPO": {
    "hearing_sub_type": "General",
    and other columns
}
"CAP": {
    "majortopic": "21",
    "id": "79846",
    "Chamber": "1"
}
and similar for others}

我已经想了三天了，不知道。任何实现这一点的想法都是非常有帮助的！非常感谢！

python

json

csv

merge

word-embedding

回答 1

Stack Overflow用户

回答已采纳

发布于 2021-07-19 17:22:20

主题匹配有几个问题，所以您需要扩展我使用的match_topic()方法，但我添加了一些逻辑，以查看最终不匹配的内容。

results变量包含一个dict列表，您可以轻松地将其保存为JSON文件。

检查内联注释，以了解我所使用的逻辑的推理。

Sidenote:

如果我是你，我会稍微重组JSON。将topic作为一个键/值对放在GPO和CAP键下，对我来说比拥有一个单独的GPO和CAP键/值对的Topic密钥更有意义。

import csv
from pprint import pprint
import json


# load gpo_full.csv into a list of dict using
# csv.DictReader & list comprehension
with open("path/to/file/gpo_full.csv") as infile:
    gpo_full = [item for item in csv.DictReader(infile)]


# do the same for CAP_cols.csv
with open("path/to/file/CAP_cols.csv") as infile:
    cap_cols = [item for item in csv.DictReader(infile)]


def match_topic(gpo_topic: str, cap_topic: str) -> bool:
    """We need a function as some of the mapping is not simple

    Args:
        gpo_topic (str): gpo topic
        cap_topic (str): CAP topic

    Returns:
        bool: True if topics match
    """
    # this one is simple
    if gpo_topic in cap_topic:
        return True
    # you need to repeat the below conditional check
    # for each custom topic matching
    elif gpo_topic == "weather" and cap_topic == "rain & cloudy":
        return True 
    # example secondary topic matching
    elif gpo_topic == "foo" and cap_topic == "bar":
        return True 
    # finally return false for no matches
    return False


# we need this later
gpo_length = len(gpo_full)
results = []
cap_left_over = []
# do the actual mapping
# this could've been done above, but I separated it intentionally
for cap in cap_cols:
    found = False
    # first find the corresponding gpo
    for index, gpo in enumerate(gpo_full):
        if (
            gpo["Specific_Date"] == cap["Specific_Date"] # check by date
            and match_topic(gpo["topic"], cap["topic"]) # check if topics match
        ):
            results.append({
                "Date": gpo["Date"],
                "Specific_Date": gpo["Specific_Date"],
                "Topic": {
                    "GPO": gpo["topic"],
                    "CAP": cap["topic"]
                },
                "GPO": {
                    "hearing_sub_type": gpo["hearing_sub_type"]
                },
                "CAP": {
                    "majortopic": cap["majortopic"],
                    "id": cap["id"],
                    "Chamber": cap["Chamber"]
                }
            })
            # pop & break to remove the gpo item
            # this is so you're left over with a list of
            # gpo items that didn't match
            # it also speeds up further matches
            gpo_full.pop(index)
            found = True
            break
    # this is to check if there's stuff left over
    if not found:
        cap_left_over.append(cap)


with open('path/to/file/combined_json.json', 'w') as outfile:
    json.dump(results, outfile, indent=4)


pprint(results)
print(f'\nLength:\n  Results: {len(results)}\n  CAP: {len(cap)}\n  GPO: {gpo_length}')
print('\nLeftover GPO:')
pprint(gpo_full)
print('\nLeftover CAP:')
pprint(cap_left_over)

输出

我已经从输出中删除了pprint(results)，请参阅JSON。

Length:
  Results: 5
  CAP: 6
  GPO: 7

Leftover GPO:
[{'Date': 'April,2001',
  'Specific_Date': 'NaN ',
  'hearing_sub_type': 'Oversight',
  'topic': 'people'},
 {'Date': 'June,2000',
  'Specific_Date': 'June 6,2000',
  'hearing_sub_type': 'Oversight',
  'topic': 'depressed'}]

Leftover CAP:
[{'Chamber': '2',
  'Date': 'June,2000',
  'Specific_Date': 'June 6,2000',
  'id': '79847',
  'majortopic': '4',
  'topic': 'emotion'},
 {'Chamber': '1',
  'Date': 'May,2001',
  'Specific_Date': 'NaN',
  'id': '79848',
  'majortopic': '13',
  'topic': 'NaN'}]

path/to/file/gpo_full.csv

Date,hearing_sub_type,topic,Specific_Date
"January,1997",Oversight,weather,"January 12,1997"
"June,2000",General,life,"June 5,2000"
"January,1997",General,forest,"January 1,1997"
"April,2001",Oversight,people,NaN 
"June,2000",Oversight,depressed,"June 6,2000"
"January,1997",General,weather,"January 1,1997"
"June,2000",Oversight,depressed,"June 5,2000"

path/to/file/CAP_cols.csv

majortopic,id,Chamber,topic,Date,Specific_Date
21,79846,1,many forest,"January,1997","January 1,1997"
4,79847,2,emotion,"June,2000","June 6,2000"
13,79848,1,NaN,"May,2001","NaN"
7,79849,2,good life,"June,2000","June 5,2000"
21,79850,1,good weather,"January,1997","January 1,1997"
25,79851,1,rain & cloudy,"January,1997","January 12,1997"
6,79852,2,sad & depressed,"June,2000","June 5,2000"

path/to/file/combined_json.json

[
    {
        "Date": "January,1997",
        "Specific_Date": "January 1,1997",
        "Topic": {
            "GPO": "forest",
            "CAP": "many forest"
        },
        "GPO": {
            "hearing_sub_type": "General"
        },
        "CAP": {
            "majortopic": "21",
            "id": "79846",
            "Chamber": "1"
        }
    },
    {
        "Date": "June,2000",
        "Specific_Date": "June 5,2000",
        "Topic": {
            "GPO": "life",
            "CAP": "good life"
        },
        "GPO": {
            "hearing_sub_type": "General"
        },
        "CAP": {
            "majortopic": "7",
            "id": "79849",
            "Chamber": "2"
        }
    },
    {
        "Date": "January,1997",
        "Specific_Date": "January 1,1997",
        "Topic": {
            "GPO": "weather",
            "CAP": "good weather"
        },
        "GPO": {
            "hearing_sub_type": "General"
        },
        "CAP": {
            "majortopic": "21",
            "id": "79850",
            "Chamber": "1"
        }
    },
    {
        "Date": "January,1997",
        "Specific_Date": "January 12,1997",
        "Topic": {
            "GPO": "weather",
            "CAP": "rain & cloudy"
        },
        "GPO": {
            "hearing_sub_type": "Oversight"
        },
        "CAP": {
            "majortopic": "25",
            "id": "79851",
            "Chamber": "1"
        }
    },
    {
        "Date": "June,2000",
        "Specific_Date": "June 5,2000",
        "Topic": {
            "GPO": "depressed",
            "CAP": "sad & depressed"
        },
        "GPO": {
            "hearing_sub_type": "Oversight"
        },
        "CAP": {
            "majortopic": "6",
            "id": "79852",
            "Chamber": "2"
        }
    }
]

票数 1

页面原文内容由Stack Overflow提供。腾讯云小微IT领域专用引擎提供翻译支持

原文链接：

https://stackoverflow.com/questions/68442446

复制

相似问题

问基于多准则合并两个CSV文件
EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问基于多准则合并两个CSV文件EN

回答 1

Stack Overflow用户

社区

活动

圈层

关于

腾讯云开发者

热门产品

热门推荐

更多推荐

问基于多准则合并两个CSV文件
EN