前往小程序,Get更优阅读体验!
立即前往
首页
学习
活动
专区
工具
TVP
发布
社区首页 >专栏 >使用Python快速比较和替换键值对

使用Python快速比较和替换键值对

原创
作者头像
用户11021319
发布2024-07-22 17:34:56
930
发布2024-07-22 17:34:56
  1. 问题背景

您需要在多个文件中替换所有特定字符串的实例。例如,您有一个包含 60728 个键值对的映射词典,需要处理多达 50 个文件,每个文件大约有 250000 行,并且需要在每行中替换多个键。

  1. 解决方案

方法一:使用正则表达式

代码语言:javascript
复制
import sys, re, time, hashlib

class Regex:

    # Regex implementation of find/replace for a massive word list.

    def __init__(self, mappings):
        self._mappings = mappings

    def replace_func(self, matchObj):
        key = matchObj.group(0)
        if self._mappings.has_key(key):
            return self._mappings[key]
        else:
            return key

    def replace_all(self, filename):
        text = ''
        with open(filename, 'r+') as fp
            text = fp.read()
        text = re.sub("[a-zA-Z]+", self.replace_func, text)
        fp = with open(filename, "w") as fp:
            fp.write(text)

# mapping dictionary of (find, replace) tuples defined 
mappings = {'original-1': 'replace-1', 'original-2': 'replace-2'}

# initialize regex class with mapping tuple dictionary
r = Regex(mappings)

# replace file
r.replace_all( 'file' )

方法二:使用多进程

代码语言:javascript
复制
import multiprocessing as mp

def split_seq(seq, num_pieces):
    # Splits a list into pieces
    start = 0
    for i in xrange(num_pieces):
        stop = start + len(seq[i::num_pieces])
        yield seq[start:stop]
        start = stop   

def detect_active_keys(keys, data, queue):
    # This function MUST be at the top-level, or
    # it can't be pickled (multiprocessing using pickling)
    queue.put([k for k in keys if k in data])

def mass_replace(data, mappings):
    manager = mp.Manager()
    queue = mp.Queue()
    # Data will be SHARED (not duplicated for each process)
    d = manager.list(data) 

    # Split the MAPPINGS KEYS up into multiple LISTS, 
    # same number as CPUs
    key_batches = split_seq(mappings.keys(), mp.cpu_count())

    # Start the key detections
    processes = []
    for i, keys in enumerate(key_batches):
        p = mp.Process(target=detect_active_keys, args=(keys, d, queue))
        # This is non-blocking
        p.start()
        processes.append(p)

    # Consume the output from the queues
    active_keys = []
    for p in processes:
        # We expect one result per process exactly
        # (this is blocking)
        active_keys.append(queue.get())

    # Wait for the processes to finish
    for p in processes:
        # Note that you MUST only call join() after
        # calling queue.get()
        p.join()

    # Same as original submission, now with MUCH fewer keys
    for key in active_keys:
        data = data.replace(k, mappings[key])

    return data

if __name__ == '__main__':
    # You MUST call the mass_replace function from
    # here, due to how multiprocessing works
    filenames = <...obtain filenames...>
    mappings = <...obtain mappings...>
    for filename in filenames:
        with open(filename, 'r+') as f:
            data = mass_replace(f.read(), mappings)
            f.seek(0)
            f.truncate()
            f.write(data)

方法三:使用 Trie

代码语言:javascript
复制
tree = ahocorasick.KeywordTree()
for key in mappings:
    tree.add(key)
tree.make()    
for start, end in reversed(list(tree.findall(target))):
    target = target[:start] + mappings[target[start:end]] + target[end:]

这三个解决方案可以帮助您更快地比较和替换键值对。您可以根据自己的需求选择最合适的方法。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

原创声明:本文系作者授权腾讯云开发者社区发表,未经许可,不得转载。

如有侵权,请联系 cloudcommunity@tencent.com 删除。

评论
登录后参与评论
0 条评论
热度
最新
推荐阅读
领券
问题归档专栏文章快讯文章归档关键词归档开发者手册归档开发者手册 Section 归档