Using a regular expression to check whether a dataset exists without first reading the paths of all datasets

Stack Overflow user
Asked 2018-06-06 20:33:42
Answers: 2 · Views: 1.2K · Followers: 0 · Votes: 1

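How can I check whether a dataset exists, using something like a regular expression, without first reading the paths of all datasets?

For example, I want to check whether a dataset named 'completed' exists in a file that may or may not contain

/123/completed

(Assume I do not know the full path in advance and only want to check the dataset name, so a lookup by full path will not work in my case.)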

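2 Answers

Stack Overflow user

Accepted answer

Posted 2018-06-06 20:44:20

Custom recursion

No regular expression is needed. You can build a set of dataset names by recursively traversing the groups in the HDF5 file: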

import h5py

def traverse_datasets(hdf_file):

    """Traverse all datasets across all groups in HDF5 file."""

    def h5py_dataset_iterator(g, prefix=''):
        for key in g.keys():
            item = g[key]
            path = '{}/{}'.format(prefix, key)
            if isinstance(item, h5py.Dataset): # test for dataset
                yield (path, item)
            elif isinstance(item, h5py.Group): # test for group (go down)
                yield from h5py_dataset_iterator(item, path)

    with h5py.File(hdf_file, 'r') as f:
        for (path, dset) in h5py_dataset_iterator(f):
            yield path.split('/')[-1]

all_datasets = set(traverse_datasets('file.h5'))

Then simply test for membership: 'completed' in all_datasets
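
If only a yes/no answer is needed, building the full set can be skipped. The following is a minimal sketch, not part of the original answer, that feeds a generator to `any()` so the traversal stops at the first match. The names `iter_dataset_names` and `dataset_exists` are hypothetical, and the sketch assumes the file contains no circular hard links.

```python
import h5py


def iter_dataset_names(group):
    """Lazily yield the final path component of every dataset under group."""
    for key, item in group.items():
        if isinstance(item, h5py.Dataset):
            yield key
        elif isinstance(item, h5py.Group):
            # Assumption: no circular hard links, otherwise this never ends.
            yield from iter_dataset_names(item)


def dataset_exists(hdf_file, name):
    """Return True as soon as a dataset called name is found."""
    with h5py.File(hdf_file, 'r') as f:
        # any() stops consuming the generator at the first match.
        return any(key == name for key in iter_dataset_names(f))
```

A call such as `dataset_exists('file.h5', 'completed')` would then replace the set construction and membership test. In the worst case (no match) every dataset is still visited.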

Group.visit

Alternatively, you can use Group.visit. Note that the search function must return None for the traversal to continue through all groups.

res = []

def searcher(name, k='completed'):
    """Collect every object whose path contains k anywhere."""
    if k in name:
        res.append(name)
    # Returning None (on every call) tells visit() to keep going.
    return None

with h5py.File('file.h5', 'r') as f:
    group = f['/']
    group.visit(searcher)

print(res)  # paths of all visited objects (groups and datasets) containing k

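In both cases the complexity is O(n). You need to test the name of every dataset, but nothing more. If you need a lazy solution, the first option is probably preferable.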
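
Since the question asks about matching by pattern, here is a minimal sketch that combines `visititems` with a compiled regular expression and stops at the first hit. It relies on the h5py behaviour that a callback returning anything other than None ends the traversal, and that this value becomes the return value of `visititems`. The function name `find_first_dataset` and its parameters are hypothetical and not taken from the answer above.

```python
import re

import h5py


def find_first_dataset(hdf_file, pattern):
    """Return the path of the first dataset whose own name matches pattern.

    Returns None if no dataset matches.
    """
    regex = re.compile(pattern)

    def callback(name, obj):
        # name is relative to the visited group, e.g. '123/completed'.
        last_part = name.rsplit('/', 1)[-1]
        if isinstance(obj, h5py.Dataset) and regex.fullmatch(last_part):
            return name  # any non-None value stops the traversal
        return None  # keep visiting

    with h5py.File(hdf_file, 'r') as f:
        return f.visititems(callback)
```

For a file that contains /123/completed, `find_first_dataset('file.h5', r'completed')` would return '123/completed'. The worst case is still O(n), but a match ends the walk early.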

Votes: 1

Stack Overflow user

Posted 2019-04-23 03:03:57

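Recursion to find all valid paths to datasets

The code below uses recursion to find the valid data paths of all datasets. Once I have the valid paths (a likely circular reference is terminated after 3 repeats), I can apply a regular expression to the returned list (not shown).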

import numpy as np
import h5py
import collections
import warnings


def visit_data_sets(group, max_len_check=20, max_repeats=3):
    # print(group.name)
    # print(list(group.items()))

    if len(group.name) > max_len_check:
        # This section terminates a likely circular reference once the current
        # name has appeared max_repeats times at a regular spacing. However, it will
        # incorrectly terminate a tree if identical repetitive sequences of names are
        # actually used in the tree.
        name_list = group.name.split('/')
        current_name = name_list[-1]
        res_list = [i for i in range(len(name_list)) if name_list[i] == current_name]
        res_deq = collections.deque(res_list)
        res_deq.rotate(1)
        res_deq2 = collections.deque(res_list)
        diff = [res_deq2[i] - res_deq[i] for i in range(0, len(res_deq))]

        if len(diff) >= max_repeats:
            if diff[-1] == diff[-2]:
                message = 'Terminating likely circular reference "{}"'.format(group.name)
                warnings.warn(message, UserWarning)
                print()
                return []

    dataset_list = list()
    for key, value in group.items():
        if isinstance(value, h5py.Dataset):
            current_path = group.name + '/{}'.format(key)
            dataset_list.append(current_path)
        elif isinstance(value, h5py.Group):
            # pass the limits down so non-default values apply at every level
            dataset_list += visit_data_sets(value, max_len_check, max_repeats)

        else:
            print('Unhandled class name {}'.format(value.__class__.__name__))

    return dataset_list

def visit_callback(name, obj):
    print('Visiting name = "{}", object name = "{}"'.format(name, obj.name))
    return None

hdf_fptr = h5py.File('link_test.hdf5', mode='w')

group1 = hdf_fptr.require_group('/junk/group1')
group1a = hdf_fptr.require_group('/junk/group1/group1a')
# group1a1 = hdf_fptr.require_group('/junk/group1/group1a/group1ai')
group2 = hdf_fptr.require_group('/junk/group2')
group3 = hdf_fptr.require_group('/junk/group3')

# create a circular reference
group1ai = group1a['group1ai'] = group1


avect = np.arange(0,12.3, 1.0)

dset = group1.create_dataset('avect', data=avect)

group2['alias'] = dset
group3['alias3'] = h5py.SoftLink(dset.name)


print('\nThis demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"')
print('Visiting Root - {}'.format(hdf_fptr.name))
hdf_fptr.visititems(visit_callback)

print('\nThis demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"')
print('Visiting Group - {}'.format(group2.name))
group2.visititems(visit_callback)
print('\nThis demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"')
print('Visiting Group - {}'.format(group3.name))
group3.visititems(visit_callback)


print('\n\nNow demonstrate recursive visit of Root looking for datasets')
print('using the function "visit_data_sets" in this code snippet.\n')
data_paths = visit_data_sets(hdf_fptr)

for data_path in data_paths:
    print('Data Path = "{}"'.format(data_path))

hdf_fptr.close()

The output below shows how "visititems" works. For my purposes it fails to identify all valid paths, whereas the recursion meets my needs and may meet yours as well.

This demonstrates  "h5py visititems" visiting Root with subgroups containing a Hard Link and Soft Link to "avect"
Visiting Root - /
Visiting name = "junk", object name = "/junk"
Visiting name = "junk/group1", object name = "/junk/group1"
Visiting name = "junk/group1/avect", object name = "/junk/group1/avect"
Visiting name = "junk/group1/group1a", object name = "/junk/group1/group1a"
Visiting name = "junk/group2", object name = "/junk/group2"
Visiting name = "junk/group3", object name = "/junk/group3"

This demonstrates  "h5py visititems" visiting "group2" with a Hard Link to "avect"
Visiting Group - /junk/group2
Visiting name = "alias", object name = "/junk/group2/alias"

This demonstrates "h5py visititems" visiting "group3" with a Soft Link to "avect"
Visiting Group - /junk/group3


Now demonstrate recursive visit of Root looking for datasets
using the function "visit_data_sets" in this code snippet.

link_ref_test.py:26: UserWarning: Terminating likely circular reference "/junk/group1/group1a/group1ai/group1a/group1ai/group1a"

  warnings.warn(message, UserWarning)
Data Path = "/junk/group1/avect"
Data Path = "/junk/group1/group1a/group1ai/avect"
Data Path = "/junk/group1/group1a/group1ai/group1a/group1ai/avect"
Data Path = "/junk/group2/alias"
Data Path = "/junk/group3/alias3"

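The first "Data Path" result is the original dataset. The second and third are references to the original dataset caused by the circular reference. The fourth result is a hard link, and the fifth is a soft link to the original dataset.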
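
The regular-expression step that the answer mentions but does not show could look roughly like the sketch below. It uses only the standard library and a few of the paths from the output above as sample data. The helper name `match_dataset_names` and the pattern are illustrative assumptions.

```python
import re

# Sample paths copied from the output above.
data_paths = [
    '/junk/group1/avect',
    '/junk/group1/group1a/group1ai/avect',
    '/junk/group2/alias',
    '/junk/group3/alias3',
]


def match_dataset_names(paths, pattern):
    """Return the paths whose final component fully matches pattern."""
    regex = re.compile(pattern)
    return [p for p in paths if regex.fullmatch(p.rsplit('/', 1)[-1])]


matches = match_dataset_names(data_paths, r'alias\d*')
print(matches)  # ['/junk/group2/alias', '/junk/group3/alias3']
```

Matching only the final path component keeps group names such as group1a from producing false hits.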

Votes: 0

Original page content provided by Stack Overflow.
Original link:

https://stackoverflow.com/questions/50720523
