
Dedupe: Deduplication and Entity Alignment


Introduction

Dedupe is a Python library that uses machine learning to quickly perform fuzzy matching, deduplication, and entity alignment on structured data.

Input data: a single CSV file.

Execution: following the console prompts, the user labels a small number of similar record pairs.

Output data: a single CSV file in which similar records are tagged with the same cluster label.

What Dedupe can do

  • Remove duplicate entries from a spreadsheet of names and addresses
  • Link a list of customer information to a list of order histories, even when there is no unique customer ID
  • Take a database of campaign contributions and figure out which donations were made by the same person, even when the name is entered slightly differently in each record

Python library

https://github.com/dedupeio/dedupe
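
Dedupe is published on PyPI, so it can typically be installed with pip:

pip install dedupe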

Example

Original CSV file:
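The screenshot of the original file is not reproduced here; as an illustration only, a hypothetical input with name as its third column might look like this:

id,type,name
1,university,Tsinghua University
2,university,Tsing Hua Univ.
3,company,Tencent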

The following code deduplicates records on the third column, name.

Code

# Site: www.omegaxyz.com
# *_*coding:utf-8 *_*

import os
import csv
import logging
import optparse
import dedupe

def readData(filename):
    """
    Read the input CSV into a dictionary of records keyed by row index,
    lower-casing every field value so that string comparisons are
    case-insensitive.
    """
    data_d = {}
    with open(filename, 'r', encoding='utf-8') as f:
        reader = csv.DictReader(f)
        for idx, row in enumerate(reader):
            row = dict((k, v.lower()) for k, v in row.items())
            data_d[idx] = row
    return data_d


# These generators supply the corpora that dedupe uses to fit the
# text distance metrics for each field
def names(data):
    for record in data.values():
        yield record['name']

def types(data):
    for record in data.values():
        yield record['type']


if __name__ == '__main__':
    # ## Logging
    # Dedupe uses Python logging to show or suppress verbose output. Added
    # for convenience.  To enable verbose logging, run this script with
    # `-v` (or `-vv` for debug output).

    optp = optparse.OptionParser()
    optp.add_option('-v', '--verbose', dest='verbose', action='count',
                    help='Increase verbosity (specify multiple times for more)'
                    )
    (opts, args) = optp.parse_args()
    log_level = logging.WARNING

    if opts.verbose:
        if opts.verbose == 1:
            log_level = logging.INFO
        elif opts.verbose > 1:
            log_level = logging.DEBUG
    logging.getLogger().setLevel(log_level)

    input_file = 'resource_all.csv'

    output_file = 'resource_all_output.csv'
    settings_file = 'resource_all_settings.json'
    training_file = 'resource_all_training.json'
    print('importing data ...')
    data_d = readData(input_file)

    if os.path.exists(settings_file):
        print('reading from', settings_file)
        with open(settings_file, 'rb') as sf:
            deduper = dedupe.StaticDedupe(sf, num_cores=2)
    else:
        # Define the fields dedupe will pay attention to. A second field,
        # such as 'type', could be added to the list in the same format:
        '''
            {'field': 'type',
             'variable name': 'type',
             'type': 'Text',
             'corpus': types(data_d),
             'has missing': False},
        '''
        fields = [
            {'field': 'name',
             'variable name': 'name Text',
             'type': 'Text',
             'corpus': names(data_d),
             'has missing': False},
        ]

        # Create a new deduper object and pass our data model to it.
        deduper = dedupe.Dedupe(fields, num_cores=2)
        # If we have training data saved from a previous run of dedupe,
        # look for it and load it in.
        if os.path.exists(training_file):
            print('reading labeled examples from ', training_file)
            with open(training_file) as tf:
                deduper.prepare_training(data_d, training_file=tf)
        else:
            deduper.prepare_training(data_d)
        # ## Active learning

        # Starts the training loop. Dedupe will find the next pair of records
        # it is least certain about and ask you to label them as duplicates
        # or not.

        # use 'y', 'n' and 'u' keys to flag duplicates
        # press 'f' when you are finished
        print('starting active labeling...')
        dedupe.console_label(deduper)

        deduper.train()

        # When finished, save our training away to disk
        with open(training_file, 'w') as tf:
            deduper.write_training(tf)

        # Save our weights and predicates to disk.  If the settings file
        # exists, we will skip all the training and learning next time we run
        # this file.
        with open(settings_file, 'wb') as sf:
            deduper.write_settings(sf)
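
    # ## Clustering

    # `partition` groups the records into clusters of duplicates; the
    # second argument (0.5 here) is the score threshold above which a
    # pair of records is considered a match.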
    clustered_dupes = deduper.partition(data_d, 0.5)

    print('# duplicate sets', len(clustered_dupes))

    # ## Writing Results

    # Write our original data back out to a CSV with a new column called
    # 'Cluster ID' which indicates which records refer to each other.

    cluster_membership = {}
    for cluster_id, (records, scores) in enumerate(clustered_dupes):
        for record_id, score in zip(records, scores):
            cluster_membership[record_id] = {
                "Cluster ID": cluster_id,
                "confidence_score": score
            }

    with open(output_file, 'w', encoding='utf-8') as f_output, open(input_file, encoding='utf-8') as f_input:
        reader = csv.DictReader(f_input)
        fieldnames = ['Cluster ID', 'confidence_score'] + reader.fieldnames

        writer = csv.DictWriter(f_output, fieldnames=fieldnames)
        writer.writeheader()

        for row_id, row in enumerate(reader):
            row.update(cluster_membership[row_id])
            writer.writerow(row)
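
Assuming the script is saved as, say, dedupe_example.py (a hypothetical filename), it is run directly from the command line; the -v option defined above controls logging verbosity:

python dedupe_example.py        # first run: interactive labeling
python dedupe_example.py -v     # INFO-level logging; -vv for DEBUG

On the first run the script writes resource_all_training.json and resource_all_settings.json; on later runs the settings file is loaded and both training and labeling are skipped.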

Labeling a small amount of data

The program automatically displays the name fields of two records; based on your own judgment, mark whether the two names refer to the same entity. The options are yes, no, unsure, and finish.
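A labeling round looks roughly like this (illustrative only; the exact console output depends on the dedupe version):

name : tsinghua university

name : tsing hua univ.

0/10 positive, 0/10 negative
Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished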

The generated CSV

Two new columns appear in the output: Cluster ID, where records sharing the same cluster ID are judged to be the same entity, and confidence_score, the confidence of that assignment.
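Continuing the hypothetical input above, the output might look like this (all values are illustrative):

Cluster ID,confidence_score,id,type,name
0,0.92,1,university,Tsinghua University
0,0.92,2,university,Tsing Hua Univ.
1,1.0,3,company,Tencent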
