试图复制一个Kaggle笔记本工作
我分割了一个数据集
# Split data
raw_train_df, valid_df = train_test_split(image_df, test_size = 0.25, random_state = 12345, stratify =
image_df['class_name'])
# Print results
print(raw_train_df.shape, 'Training data')
print(valid_df.shape, 'Validation data')
(11250, 10) Training data
(3750, 10) Validation data
现在试着平衡一套训练:
fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 10))
raw_train_df.groupby('class_name').size().plot.bar(ax = ax1)
train_df = raw_train_df.groupby('class_name').\
apply(lambda x: x.sample(TRAIN_SAMPLES//15)).\ # Here I put 15 instead of 3, because I have 15
classes
reset_index(drop=True)
train_df.groupby('class_name').size().plot.bar(ax=ax2)
print(train_df.shape[0], 'new training size')
我收到一个错误:
ValueError Traceback (most recent call last)
<ipython-input-16-3b4d2b82246c> in <module>()
1 fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (20, 10))
2 raw_train_df.groupby('class_name').size().plot.bar(ax = ax1)
----> 3 train_df = raw_train_df.groupby('class_name'). apply(lambda x:
x.sample(TRAIN_SAMPLES//15)). reset_index(drop=True)
4 train_df.groupby('class_name').size().plot.bar(ax=ax2)
5 print(train_df.shape[0], 'new training size')
4 frames
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in sample(self, n, frac, replace,
weights, random_state, axis)
4993 )
4994
-> 4995 locs = rs.choice(axis_length, size=n, replace=replace, p=weights)
4996 return self.take(locs, axis=axis)
4997
mtrand.pyx in numpy.random.mtrand.RandomState.choice()
ValueError: Cannot take a larger sample than population when 'replace=False'
图像如下所示
当您需要将Replace = True
放在某个地方时,这是一个常见的错误,但我不知道确切的位置。
发布于 2021-02-15 04:44:26
错误在于在设置x.sample(TRAIN_SAMPLES//15)
的行中调用train_df
。
这可以根据以下几个方面进行追踪:
train_df
的行(如箭头----> 3 train_df
所示)。replace=True/False
。如果有必要,您可以参考Series.sample的熊猫API指南来获得更多的错误提示。
https://stackoverflow.com/questions/66200784
复制相似问题