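I am using Keras to classify images from the MNIST Sign Language dataset. The dataset has 24 different classes, but the problem is that the class distribution is very uneven.
I used sklearn.model_selection.train_test_split with stratify=df['label'], but some classes still make up 5% of the data while others make up only 3%. How can I select the data so that every class accounts for roughly 4% of it?
My test_df has 7172 rows and 785 columns: one is the label column and the remaining 784 are grayscale pixel values (28*28).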
import pandas as pd
from sklearn.model_selection import train_test_split

test_df = pd.read_csv(TEST_PATH)
# shuffle and split validation,test data
test_df = test_df.sample(frac=1.0,random_state=SEED).iloc[:2000,:] # shuffle the whole data, get first 2000 rows
val_df,test_df = train_test_split(test_df,test_size=0.5,random_state=SEED,stratify=test_df['label'])
# stratify the labels so that distribution of classes is almost same
# extract pixels and labels for both validation,test data
X_val = val_df.drop('label',axis=1).values.reshape((val_df.shape[0],28,28))/255.0 # validation images
y_val = val_df['label'].values # validation label
X_test = test_df.drop('label',axis=1).values.reshape((test_df.shape[0],28,28))/255.0 # test images
y_test = test_df['label'].values # test label (was assigned to y_val by mistake)
Posted on 2020-06-02 16:36:17
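For context on why the 3% and 5% classes survive the split: stratify keeps each class at the same proportion it has in the input, it does not equalize the classes. The sketch below illustrates this on made-up data. The three-class label array, the pixel1 column, and the 50/30/20 proportions are assumptions for illustration, not the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: class 0 is 50%, class 1 is 30%, class 2 is 20%.
labels = np.repeat([0, 1, 2], [500, 300, 200])
df = pd.DataFrame({"label": labels, "pixel1": np.arange(labels.size)})

val_df, test_df = train_test_split(
    df, test_size=0.5, random_state=42, stratify=df["label"]
)

# Share of each class in the two halves; stratify mirrors the input shares.
val_share = val_df["label"].value_counts(normalize=True).sort_index()
test_share = test_df["label"].value_counts(normalize=True).sort_index()
print(val_share)
print(test_share)
```

Both halves should show the same uneven shares as the input, which is why the classes have to be balanced before the split if an even distribution is wanted.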
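The list comprehension below samples the same number of rows from every class, so val and test end up evenly distributed across the classes. You can also experiment with the number of samples.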
SEED = 42
n_classes = 24
test_df = pd.read_csv(TEST_PATH)
test_df = [test_df.loc[test_df.label==i].sample(n=int(2000/n_classes),random_state=SEED) for i in test_df.label.unique()]
test_df = pd.concat(test_df, axis=0, ignore_index=True)
val_df,test_df = train_test_split(test_df,test_size=0.5,random_state=SEED,stratify=test_df['label'])
https://stackoverflow.com/questions/62145243
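The same per-class sampling can probably be written with groupby, and the result checked with value_counts. This is only a sketch on made-up data: toy_df, its three classes, and the cap of 100 rows per class are assumptions for illustration, and DataFrameGroupBy.sample requires pandas 1.1 or newer. Capping n_per_class at the size of the rarest class avoids the error that sample raises when a class has fewer rows than requested.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

SEED = 42

# Hypothetical stand-in for test_df: 3 imbalanced classes instead of 24.
rng = np.random.default_rng(SEED)
toy_df = pd.DataFrame({
    "label": np.repeat([0, 1, 2], [500, 300, 200]),
    "pixel1": rng.integers(0, 256, size=1000),
})

# Take the same number of rows from every class, capped by the rarest class.
n_per_class = min(100, toy_df["label"].value_counts().min())
balanced_df = toy_df.groupby("label").sample(n=n_per_class, random_state=SEED)

# Stratifying a balanced frame keeps both halves balanced as well.
val_df, test_df = train_test_split(
    balanced_df, test_size=0.5, random_state=SEED, stratify=balanced_df["label"]
)

print(val_df["label"].value_counts().sort_index())
print(test_df["label"].value_counts().sort_index())
```

On the toy data every class should appear the same number of times in val_df and in test_df. With the real 24 classes, only n_per_class and the input frame would need to change.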