
PyTorch Data Samplers

狼啸风云
Last modified: 2022-09-02 22:07:24

Contents

class torch.utils.data.Sampler(data_source)[source]

class torch.utils.data.SequentialSampler(data_source)[source]

class torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None)[source]

class torch.utils.data.SubsetRandomSampler(indices)[source]

class torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)[source]

class torch.utils.data.BatchSampler(sampler, batch_size, drop_last)[source]

class torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0)[source]

Source code


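A sampler yields a sequence of indices that is used to look up samples in the dataset; the total number of indices is usually equal to the length of the dataset.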

class torch.utils.data.Sampler(data_source)[source]

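Base class for all samplers.

Every Sampler subclass must provide an __iter__() method, which gives a way to iterate over the indices of dataset elements, and a __len__() method, which returns the length of the returned iterator.

Note:

__len__() is not strictly required by DataLoader, but it is expected in any calculation involving the length of a DataLoader.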
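As a minimal sketch of the contract described above, the following defines a custom sampler and hands it to a DataLoader. `ReverseSampler` and the toy `dataset` list are hypothetical names introduced only for illustration; they are not part of PyTorch.

```python
from torch.utils.data import DataLoader, Sampler


class ReverseSampler(Sampler):
    """Hypothetical sampler that yields dataset indices from last to first."""

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        # A sampler yields indices, not the data items themselves.
        return iter(range(len(self.data_source) - 1, -1, -1))

    def __len__(self):
        return len(self.data_source)


dataset = [10, 11, 12, 13, 14]
sampler = ReverseSampler(dataset)
indices = list(sampler)

# DataLoader fetches dataset[i] for every sampled index i and collates batches.
loader = DataLoader(dataset, sampler=sampler, batch_size=2)
batches = [batch.tolist() for batch in loader]
```

Because `__len__` is defined, `len(loader)` can be computed; without it, iterating would still work but length-based calculations would fail.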

class torch.utils.data.SequentialSampler(data_source)[source]

Samples elements sequentially, always in the same order.

Parameters:

data_source (Dataset) – dataset to sample from
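A short sketch of what this looks like in practice, using a plain Python list as a stand-in dataset (an assumption made for brevity; any object with a length works the same way here):

```python
from torch.utils.data import SequentialSampler

data = ["a", "b", "c", "d"]
sampler = SequentialSampler(data)

# The sampler produces positions 0..len(data)-1, not the items.
indices = list(sampler)
items = [data[i] for i in indices]
```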

class torch.utils.data.RandomSampler(data_source, replacement=False, num_samples=None)[source]

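Samples elements randomly. Without replacement, the samples are drawn from a shuffled dataset. With replacement, the user can specify the number of samples to draw, num_samples.

Parameters:

  • data_source (Dataset) – dataset to sample from
  • replacement (bool) – if True, samples are drawn with replacement; default False
  • num_samples (int) – number of samples to draw; defaults to the length of the dataset. This argument is meant to be specified only when replacement is True.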
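The two modes can be compared with a small sketch. The seed is set only to make the run repeatable, and `range(10)` stands in for a real dataset; both are illustrative assumptions.

```python
import torch
from torch.utils.data import RandomSampler

torch.manual_seed(0)  # only so the sketch is repeatable
data = range(10)

# Without replacement: one random permutation of all indices.
perm = list(RandomSampler(data))

# With replacement: num_samples may exceed the dataset size and indices can repeat.
drawn = list(RandomSampler(data, replacement=True, num_samples=25))
```

Without replacement every index appears exactly once per pass; with replacement the only guarantee is the number of draws.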

class torch.utils.data.SubsetRandomSampler(indices)[source]

Samples elements randomly from a given list of indices, without replacement.

Parameters:

  • indices (sequence) – a sequence of indices
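One common use of this sampler is a train/validation split over a single dataset. The sketch below assumes a toy dataset whose items equal their indices and an 80/20 split; both choices are illustrative, not prescribed by the library.

```python
import torch
from torch.utils.data import DataLoader, SubsetRandomSampler

dataset = list(range(100))  # toy dataset: item i is stored at index i

# Hypothetical 80/20 split of a shuffled index list.
perm = torch.randperm(len(dataset)).tolist()
train_idx, val_idx = perm[:80], perm[80:]

train_loader = DataLoader(dataset, batch_size=16, sampler=SubsetRandomSampler(train_idx))
val_loader = DataLoader(dataset, batch_size=16, sampler=SubsetRandomSampler(val_idx))

# Each loader only ever visits the indices it was given, in a fresh random order.
seen_train = [x for batch in train_loader for x in batch.tolist()]
seen_val = [x for batch in val_loader for x in batch.tolist()]
```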

class torch.utils.data.WeightedRandomSampler(weights, num_samples, replacement=True)[source]

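Samples elements from [0,..,len(weights)-1] with the given probabilities (weights).

Parameters:

  • weights (sequence) – a sequence of weights; they need not sum to one.
  • num_samples (int) – number of samples to draw.
  • replacement (bool) – if True, samples are drawn with replacement. If False, they are drawn without replacement, which means that once a sample index has been drawn for a row, it cannot be drawn again for that row.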

Example

>>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
[4, 4, 1, 4, 5]
>>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))
[0, 1, 4, 3, 2]
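A typical application is rebalancing an imbalanced dataset. The sketch below assumes a hypothetical label list with a 90/10 class split and weights each sample by the inverse of its class frequency; that weighting scheme is one common choice, not something the sampler requires.

```python
from collections import Counter

import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)  # only so the sketch is repeatable

labels = [0] * 90 + [1] * 10  # hypothetical imbalanced labels
class_counts = Counter(labels)

# One weight per sample: rarer classes get proportionally larger weights.
weights = [1.0 / class_counts[y] for y in labels]

sampler = WeightedRandomSampler(weights, num_samples=1000, replacement=True)
drawn_labels = [labels[i] for i in sampler]
minority_fraction = sum(drawn_labels) / len(drawn_labels)
```

With these weights each class carries the same total weight, so the minority class should make up roughly half of the draws rather than a tenth.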

class torch.utils.data.BatchSampler(sampler, batch_size, drop_last)[source]

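Wraps another sampler to yield mini-batches of indices.

Parameters:

  • sampler (Sampler or Iterable) – base sampler; can be any iterable object with __len__() implemented.
  • batch_size (int) – size of a mini-batch.
  • drop_last (bool) – if True, the sampler drops the last batch when its size would be less than batch_size.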

Example:

>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
[[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
>>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
[[0, 1, 2], [3, 4, 5], [6, 7, 8]]
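The sketch below shows how drop_last affects the reported length and how a BatchSampler can be passed to a DataLoader through its batch_sampler argument. The toy dataset values are an assumption chosen so that items and indices are easy to tell apart.

```python
from torch.utils.data import BatchSampler, DataLoader, SequentialSampler

dataset = list(range(100, 110))  # 10 items with values 100..109

keep_last = BatchSampler(SequentialSampler(dataset), batch_size=3, drop_last=False)
drop_last = BatchSampler(SequentialSampler(dataset), batch_size=3, drop_last=True)

# len() is ceil(10 / 3) when the short batch is kept, floor(10 / 3) when dropped.
n_keep, n_drop = len(keep_last), len(drop_last)

# With batch_sampler=..., DataLoader receives whole batches of indices, so
# batch_size, shuffle, sampler and drop_last are left at their defaults.
loader = DataLoader(dataset, batch_sampler=keep_last)
batches = [batch.tolist() for batch in loader]
```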

class torch.utils.data.distributed.DistributedSampler(dataset, num_replicas=None, rank=None, shuffle=True, seed=0)[source]

Sampler that restricts data loading to a subset of the dataset.

It is especially useful in conjunction with torch.nn.parallel.DistributedDataParallel. In such a case, each process can pass a torch.utils.data.distributed.DistributedSampler instance as a DataLoader sampler and load a subset of the original dataset that is exclusive to it.

Note

The dataset is assumed to be of constant size.

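Parameters:

  • dataset – dataset used for sampling.
  • num_replicas (int, optional) – number of processes participating in distributed training. By default, the world size is retrieved from the current distributed group.
  • rank (int, optional) – rank of the current process within num_replicas. By default, rank is retrieved from the current distributed group.
  • shuffle (bool, optional) – if True, the sampler shuffles the indices.
  • seed (int, optional) – random seed used to shuffle the sampler when shuffle is True. This number should be identical across all processes in the distributed group. Default: 0.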

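Note:

In distributed mode, calling the set_epoch(epoch) method at the beginning of each epoch, before creating the DataLoader iterator, is necessary for shuffling to work properly across multiple epochs. Otherwise, the same ordering is always used.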

Example:

>>> sampler = DistributedSampler(dataset) if is_distributed else None
>>> loader = DataLoader(dataset, shuffle=(sampler is None),
...                     sampler=sampler)
>>> for epoch in range(start_epoch, n_epochs):
...     if is_distributed:
...         sampler.set_epoch(epoch)
...     train(loader)
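The partitioning can be inspected in a single process by passing num_replicas and rank explicitly, which avoids the need for an initialized process group. The sketch below assumes three replicas over a ten-element `range`; with shuffle=False the sampler pads the index list so it divides evenly and then hands each rank every third index.

```python
from torch.utils.data.distributed import DistributedSampler

dataset = range(10)

# One shard of indices per (simulated) rank; 10 indices are padded to 12.
shards = [
    list(DistributedSampler(dataset, num_replicas=3, rank=r, shuffle=False))
    for r in range(3)
]

# With shuffle=True the order depends on seed and epoch, so every process
# that uses the same seed and epoch agrees on the same permutation.
sampler = DistributedSampler(dataset, num_replicas=3, rank=0, shuffle=True, seed=0)
sampler.set_epoch(0)
first = list(sampler)
sampler.set_epoch(0)
again = list(sampler)
sampler.set_epoch(1)
other = list(sampler)  # generally a different order from epoch 0
```

Note that padding means a few indices (here 0 and 1) are seen by more than one rank, which is why the shards are only approximately exclusive when the dataset size is not divisible by the number of replicas.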

Source code

import torch
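# Originally `from torch._six import int_classes as _int_classes`; on Python 3
# that name is just `int`, and recent PyTorch releases no longer ship torch._six.
_int_classes = int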


class Sampler(object):
    r"""Base class for all Samplers.

    Every Sampler subclass has to provide an :meth:`__iter__` method, providing a
    way to iterate over indices of dataset elements, and a :meth:`__len__` method
    that returns the length of the returned iterators.

    .. note:: The :meth:`__len__` method isn't strictly required by
              :class:`~torch.utils.data.DataLoader`, but is expected in any
              calculation involving the length of a :class:`~torch.utils.data.DataLoader`.
    """

    def __init__(self, data_source):
        pass

    def __iter__(self):
        raise NotImplementedError


    # NOTE [ Lack of Default `__len__` in Python Abstract Base Classes ]
    #
    # Many times we have an abstract class representing a collection/iterable of
    # data, e.g., `torch.utils.data.Sampler`, with its subclasses optionally
    # implementing a `__len__` method. In such cases, we must make sure to not
    # provide a default implementation, because both straightforward default
    # implementations have their issues:
    #
    #   + `return NotImplemented`:
    #     Calling `len(subclass_instance)` raises:
    #       TypeError: 'NotImplementedType' object cannot be interpreted as an integer
    #
    #   + `raise NotImplementedError()`:
    #     This prevents triggering some fallback behavior. E.g., the built-in
    #     `list(X)` tries to call `len(X)` first, and executes a different code
    #     path if the method is not found or `NotImplemented` is returned, while
    #     raising a `NotImplementedError` will propagate and make the call
    #     fail where it could have used `__iter__` to complete the call.
    #
    # Thus, the only two sensible things to do are
    #
    #   + **not** provide a default `__len__`.
    #
    #   + raise a `TypeError` instead, which is what Python uses when users call
    #     a method that is not defined on an object.
    #     (@ssnl verifies that this works on at least Python 3.7.)


class SequentialSampler(Sampler):
    r"""Samples elements sequentially, always in the same order.

    Arguments:
        data_source (Dataset): dataset to sample from
    """

    def __init__(self, data_source):
        self.data_source = data_source

    def __iter__(self):
        return iter(range(len(self.data_source)))

    def __len__(self):
        return len(self.data_source)



class RandomSampler(Sampler):
    r"""Samples elements randomly. If without replacement, then sample from a shuffled dataset.
    If with replacement, then user can specify :attr:`num_samples` to draw.

    Arguments:
        data_source (Dataset): dataset to sample from
        replacement (bool): samples are drawn with replacement if ``True``, default=``False``
        num_samples (int): number of samples to draw, default=`len(dataset)`. This argument
            is supposed to be specified only when `replacement` is ``True``.
    """

    def __init__(self, data_source, replacement=False, num_samples=None):
        self.data_source = data_source
        self.replacement = replacement
        self._num_samples = num_samples

        if not isinstance(self.replacement, bool):
            raise TypeError("replacement should be a boolean value, but got "
                            "replacement={}".format(self.replacement))

        if self._num_samples is not None and not replacement:
            raise ValueError("With replacement=False, num_samples should not be specified, "
                             "since a random permute will be performed.")

        if not isinstance(self.num_samples, int) or self.num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(self.num_samples))

    @property
    def num_samples(self):
        # dataset size might change at runtime
        if self._num_samples is None:
            return len(self.data_source)
        return self._num_samples

    def __iter__(self):
        n = len(self.data_source)
        if self.replacement:
            return iter(torch.randint(high=n, size=(self.num_samples,), dtype=torch.int64).tolist())
        return iter(torch.randperm(n).tolist())

    def __len__(self):
        return self.num_samples



class SubsetRandomSampler(Sampler):
    r"""Samples elements randomly from a given list of indices, without replacement.

    Arguments:
        indices (sequence): a sequence of indices
    """

    def __init__(self, indices):
        self.indices = indices

    def __iter__(self):
        return (self.indices[i] for i in torch.randperm(len(self.indices)))

    def __len__(self):
        return len(self.indices)



class WeightedRandomSampler(Sampler):
    r"""Samples elements from ``[0,..,len(weights)-1]`` with given probabilities (weights).

    Args:
        weights (sequence)   : a sequence of weights, not necessary summing up to one
        num_samples (int): number of samples to draw
        replacement (bool): if ``True``, samples are drawn with replacement.
            If not, they are drawn without replacement, which means that when a
            sample index is drawn for a row, it cannot be drawn again for that row.

    Example:
        >>> list(WeightedRandomSampler([0.1, 0.9, 0.4, 0.7, 3.0, 0.6], 5, replacement=True))
        [4, 4, 1, 4, 5]
        >>> list(WeightedRandomSampler([0.9, 0.4, 0.05, 0.2, 0.3, 0.1], 5, replacement=False))
        [0, 1, 4, 3, 2]
    """

    def __init__(self, weights, num_samples, replacement=True):
        if not isinstance(num_samples, _int_classes) or isinstance(num_samples, bool) or \
                num_samples <= 0:
            raise ValueError("num_samples should be a positive integer "
                             "value, but got num_samples={}".format(num_samples))
        if not isinstance(replacement, bool):
            raise ValueError("replacement should be a boolean value, but got "
                             "replacement={}".format(replacement))
        self.weights = torch.as_tensor(weights, dtype=torch.double)
        self.num_samples = num_samples
        self.replacement = replacement

    def __iter__(self):
        return iter(torch.multinomial(self.weights, self.num_samples, self.replacement).tolist())

    def __len__(self):
        return self.num_samples



class BatchSampler(Sampler):
    r"""Wraps another sampler to yield a mini-batch of indices.

    Args:
        sampler (Sampler or Iterable): Base sampler. Can be any iterable object
            with ``__len__`` implemented.
        batch_size (int): Size of mini-batch.
        drop_last (bool): If ``True``, the sampler will drop the last batch if
            its size would be less than ``batch_size``

    Example:
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=False))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
        >>> list(BatchSampler(SequentialSampler(range(10)), batch_size=3, drop_last=True))
        [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    """

    def __init__(self, sampler, batch_size, drop_last):
        # Since collections.abc.Iterable does not check for `__getitem__`, which
        # is one way for an object to be an iterable, we don't do an `isinstance`
        # check here.
        if not isinstance(batch_size, _int_classes) or isinstance(batch_size, bool) or \
                batch_size <= 0:
            raise ValueError("batch_size should be a positive integer value, "
                             "but got batch_size={}".format(batch_size))
        if not isinstance(drop_last, bool):
            raise ValueError("drop_last should be a boolean value, but got "
                             "drop_last={}".format(drop_last))
        self.sampler = sampler
        self.batch_size = batch_size
        self.drop_last = drop_last

    def __iter__(self):
        batch = []
        for idx in self.sampler:
            batch.append(idx)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if len(batch) > 0 and not self.drop_last:
            yield batch

    def __len__(self):
        if self.drop_last:
            return len(self.sampler) // self.batch_size
        else:
            return (len(self.sampler) + self.batch_size - 1) // self.batch_size

Originally published: 2020-06-08, shared from the author's personal site/blog.
