How XGBoost Can Train on 100 GB of Data with 2 GB of Memory!

炼丹笔记
Published 2022-10-27 14:50:24
Collected in the column: 炼丹笔记

Author: Coggle

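Loading Datasets Iteratively in XGBoost

Introduction

When training on a large dataset, reading the data in batches is a natural choice, because the full dataset never has to sit in memory at once. PyTorch supports this style of loading. Below we look at how XGBoost supports it.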

Reading data from memory

import numpy as np
import xgboost as xgb


class IterLoadForDMatrix(xgb.DataIter):
    """Feed an in-memory DataFrame to XGBoost one batch of rows at a time."""

    def __init__(self, df=None, features=None, target=None, batch_size=256*1024):
        self.features = features
        self.target = target
        self.df = df
        self.batch_size = batch_size
        # Number of batches, rounded up so the final partial batch is included.
        self.batches = int(np.ceil(len(df) / self.batch_size))
        self.it = 0  # index of the next batch to serve
        super().__init__()

    def reset(self):
        '''Reset the iterator'''
        self.it = 0

    def next(self, input_data):
        '''Pass the next batch to XGBoost through the input_data callback.'''
        if self.it == self.batches:
            return 0  # Return 0 when there are no more batches.

        a = self.it * self.batch_size
        b = min((self.it + 1) * self.batch_size, len(self.df))
        dt = self.df.iloc[a:b]  # row slice for this batch
        input_data(data=dt[self.features], label=dt[self.target])  # , weight=dt['weight'])
        self.it += 1
        return 1
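
The slice bounds `a` and `b` computed in `next` can be checked in isolation. The sketch below is a hypothetical helper (`batch_bounds` is not part of XGBoost or of the code above) that reproduces the same arithmetic in plain Python. It shows that the last batch is simply shorter when the row count is not a multiple of the batch size.

```python
import math


def batch_bounds(n_rows: int, batch_size: int) -> list[tuple[int, int]]:
    """Return the (start, stop) row range of every batch.

    Hypothetical helper mirroring the arithmetic in IterLoadForDMatrix.next.
    """
    n_batches = math.ceil(n_rows / batch_size)
    bounds = []
    for it in range(n_batches):
        a = it * batch_size
        b = min((it + 1) * batch_size, n_rows)  # clamp the final batch
        bounds.append((a, b))
    return bounds


# 10 rows in batches of 4: two full batches and one short one.
print(batch_bounds(10, 4))  # [(0, 4), (4, 8), (8, 10)]
```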

Usage (this approach is better suited to GPU training):

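# `train`, `train_idx`, and `FEATURES` are assumed to be defined by the surrounding training script.
Xy_train = IterLoadForDMatrix(train.loc[train_idx], FEATURES, 'target')
# Note: DeviceQuantileDMatrix was deprecated in XGBoost 1.7 in favor of xgb.QuantileDMatrix.
dtrain = xgb.DeviceQuantileDMatrix(Xy_train, max_bin=256)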

Reference:

https://xgboost.readthedocs.io/en/latest/python/examples/quantile_data_iterator.html

Reading external data iteratively

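import os
from typing import Callable, List

import xgboost
from sklearn.datasets import load_svmlight_file


class Iterator(xgboost.DataIter):
  def __init__(self, svm_file_paths: List[str]):
    self._file_paths = svm_file_paths
    self._it = 0
    # cache_prefix tells XGBoost where to write its on-disk cache files
    super().__init__(cache_prefix=os.path.join(".", "cache"))

  def next(self, input_data: Callable):
    if self._it == len(self._file_paths):
      # return 0 to let XGBoost know this is the end of iteration
      return 0

    # load one file per call; only this batch is held in memory
    X, y = load_svmlight_file(self._file_paths[self._it])
    # input_data takes keyword arguments
    input_data(data=X, label=y)
    self._it += 1
    return 1

  def reset(self):
    """Reset the iterator to its beginning"""
    self._it = 0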
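
The contract between XGBoost and a `DataIter` subclass is small: `reset` rewinds, and `next` either hands one batch to the `input_data` callback and returns 1, or returns 0 at the end. The sketch below imitates that contract in plain Python so it can be run without XGBoost or any files. `ListIterator` and `consume` are hypothetical stand-ins, and `consume` in particular is only a simplified model of the calling side, not XGBoost's actual internals.

```python
from typing import Callable, List, Sequence, Tuple

Batch = Tuple[Sequence[float], Sequence[int]]


class ListIterator:
    """Hypothetical iterator with the same reset/next shape as DataIter."""

    def __init__(self, batches: List[Batch]):
        self._batches = batches
        self._it = 0

    def next(self, input_data: Callable) -> int:
        if self._it == len(self._batches):
            return 0  # signal the end of one pass
        X, y = self._batches[self._it]
        input_data(data=X, label=y)
        self._it += 1
        return 1

    def reset(self) -> None:
        self._it = 0


def consume(it: ListIterator, passes: int = 2) -> List[int]:
    """Simplified model of the caller: several full passes over the batches."""
    rows_per_pass = []
    for _ in range(passes):
        it.reset()
        seen = 0

        def input_data(*, data, label):
            nonlocal seen
            seen += len(label)

        while it.next(input_data) == 1:
            pass
        rows_per_pass.append(seen)
    return rows_per_pass


batches = [([0.1, 0.2], [0, 1]), ([0.3], [1])]
print(consume(ListIterator(batches)))  # [3, 3]
```

Because the data may be traversed more than once, `reset` has to rewind properly. An iterator that forgets to reset its position would yield nothing on the second pass.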

Usage (this approach is better suited to CPU training):

it = Iterator(["file_0.svm", "file_1.svm", "file_2.svm"])
Xy = xgboost.DMatrix(it)

# Other tree methods, including ``hist`` and ``gpu_hist``, also work but have some caveats;
# see the XGBoost external memory tutorial for details.
booster = xgboost.train({"tree_method": "approx"}, Xy)
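
The names `file_0.svm`, `file_1.svm`, and `file_2.svm` above are placeholders, and the article does not say how such files are produced. As one possible way to prepare them, the sketch below splits rows round-robin into several text files in svmlight format. `write_svmlight_shards` is a hypothetical helper written with the standard library only, and the rows and labels are made-up sample data.

```python
import os
import tempfile
from typing import List, Sequence


def write_svmlight_shards(
    rows: Sequence[Sequence[float]],
    labels: Sequence[float],
    n_shards: int,
    out_dir: str,
) -> List[str]:
    """Split rows round-robin into n_shards text files in svmlight format.

    Hypothetical helper: each line is "<label> <index>:<value> ...",
    with one-based feature indices and zero values omitted.
    """
    paths = [os.path.join(out_dir, f"file_{i}.svm") for i in range(n_shards)]
    handles = [open(p, "w") for p in paths]
    try:
        for n, (row, label) in enumerate(zip(rows, labels)):
            pairs = [f"{j + 1}:{v}" for j, v in enumerate(row) if v != 0]
            handles[n % n_shards].write(" ".join([str(label), *pairs]) + "\n")
    finally:
        for h in handles:
            h.close()
    return paths


rows = [[1.0, 0.0, 2.5], [0.0, 3.0, 0.0], [4.0, 0.0, 0.0], [0.0, 0.0, 1.5]]
labels = [1, 0, 1, 0]
with tempfile.TemporaryDirectory() as tmp:
    paths = write_svmlight_shards(rows, labels, n_shards=2, out_dir=tmp)
    with open(paths[0]) as f:
        first_shard = f.read()
print(first_shard)
```

When shards are loaded one at a time, `load_svmlight_file` infers the feature count from each file, so a shard that happens to lack the highest-index feature would come back narrower than the others. Passing its `n_features` argument explicitly is one way to keep the batches consistent.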

Reference:

https://xgboost.readthedocs.io/en/stable/tutorials/external_memory.html

Originally published on 2022-07-26 on the 炼丹笔记 WeChat official account.
