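I have a very large folder of images, plus a CSV file containing the class label for each image. Because everything sits in one giant folder, I would like to split the images into training/test/validation sets, perhaps by creating three new folders and moving images into each one with some kind of Python script. I want to do stratified sampling so that the class percentages stay the same across all three sets.
What would be a way to write a script that does this?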
Posted on 2020-09-22 16:52:53
Use the Python library split-folders.
pip install split-folders
Suppose all the images are stored in the Data folder, with one subfolder per class. Then apply it as follows:
import split_folders  # newer releases of the package are imported as `splitfolders`
split_folders.ratio('Data', output="output", seed=1337, ratio=(.8, .1, .1))
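Running the snippet above creates three folders in the output directory: train, val, and test.
The values in the ratio parameter (train:val:test) control how many images go into each folder.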
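split-folders reads the class of each image from the subfolder it sits in, whereas the question describes a single flat folder plus a CSV of labels. Here is a minimal sketch of one way to bridge the two, assuming the CSV has columns named `filename` and `label` (hypothetical names, adjust them to your file). The helper `sort_into_class_folders` is illustrative and not part of split-folders:

```python
import csv
import shutil
from pathlib import Path


def sort_into_class_folders(image_dir, csv_path, out_dir,
                            name_col="filename", label_col="label"):
    """Copy each image listed in the CSV into out_dir/<label>/.

    The column names are assumptions, not something the question specifies.
    Returns the number of files copied; rows whose file is missing are skipped.
    """
    image_dir, out_dir = Path(image_dir), Path(out_dir)
    copied = 0
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            src = image_dir / row[name_col]
            if not src.is_file():
                continue  # label row without a matching image
            dest = out_dir / row[label_col]
            dest.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, dest / src.name)
            copied += 1
    return copied
```

Calling `sort_into_class_folders('images', 'labels.csv', 'Data')` (the paths are placeholders) would leave one subfolder per class under Data, which is the layout the snippet above is applied to.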
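Posted on 2018-12-03 07:37:16
I ran into a similar problem myself. All my images were stored in two folders, "Project/Data2/DPN+" and "Project/Data2/DPN-". It is a binary classification problem with the two classes "DPN+" and "DPN-". Both class folders contain .png files. My goal was to distribute the dataset into training, validation, and test folders. Each of these new folders has two subfolders, "DPN+" and "DPN-", indicating the class. For the partition I used a 70:15:15 split. I am a beginner in Python, so please let me know if I made any mistakes.
Here is my code: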
import os
import numpy as np
import shutil
# Creating Train / Val / Test folders (one-time use)
root_dir = 'Data2'
posCls = '/DPN+'
negCls = '/DPN-'
os.makedirs(root_dir +'/train' + posCls)
os.makedirs(root_dir +'/train' + negCls)
os.makedirs(root_dir +'/val' + posCls)
os.makedirs(root_dir +'/val' + negCls)
os.makedirs(root_dir +'/test' + posCls)
os.makedirs(root_dir +'/test' + negCls)
# Creating partitions of the data after shuffling
currentCls = posCls  # rerun with negCls to split the other class
src = "Data2"+currentCls # Folder to copy images from
allFileNames = os.listdir(src)
np.random.shuffle(allFileNames)
train_FileNames, val_FileNames, test_FileNames = np.split(np.array(allFileNames),
                                                          [int(len(allFileNames)*0.7), int(len(allFileNames)*0.85)])
train_FileNames = [src+'/'+ name for name in train_FileNames.tolist()]
val_FileNames = [src+'/' + name for name in val_FileNames.tolist()]
test_FileNames = [src+'/' + name for name in test_FileNames.tolist()]
print('Total images: ', len(allFileNames))
print('Training: ', len(train_FileNames))
print('Validation: ', len(val_FileNames))
print('Testing: ', len(test_FileNames))
# Copy-pasting images
for name in train_FileNames:
    shutil.copy(name, "Data2/train"+currentCls)
for name in val_FileNames:
    shutil.copy(name, "Data2/val"+currentCls)
for name in test_FileNames:
    shutil.copy(name, "Data2/test"+currentCls)
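The two indices passed to np.split are cut points, not sizes: the first 70% of the shuffled names go to training, the slice from 70% to 85% goes to validation, and the remainder goes to testing. Below is a small sketch of the same arithmetic using plain list slicing. The helper name `three_way_split`, the integer percentages, and the fixed seed are illustrative assumptions, not part of the answer:

```python
import random


def three_way_split(names, train_pct=70, val_pct=15, seed=0):
    """Shuffle names, then cut at the train and train+val boundaries."""
    names = list(names)
    random.Random(seed).shuffle(names)  # seeded so the split is repeatable
    # Integer percentages keep the cut points free of float rounding.
    first_cut = len(names) * train_pct // 100
    second_cut = len(names) * (train_pct + val_pct) // 100
    return names[:first_cut], names[first_cut:second_cut], names[second_cut:]


train, val, test = three_way_split([f"img_{i}.png" for i in range(20)])
print(len(train), len(val), len(test))  # 14 3 3
```

Whatever is left after the second cut becomes the test set, so the three parts always cover every file exactly once.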
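Posted on 2020-03-01 03:32:41
I took Steven White's answer above and modified it a little, because there was a small problem with the split. In addition, the files were being saved in the main folder rather than in the train/test/val folders.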
import os
import numpy as np
import shutil
import pandas as pd
def train_test_split():
    print("########### Train Test Val Script started ###########")
    # data_csv = pd.read_csv("DataSet_Final.csv")  # Use if you have classes saved in a .csv file
    root_dir = 'New_folder_to_be_created'
    classes_dir = ['class 1', 'class 2', 'class 3', 'class 4']
    # for name in data_csv['names'].unique()[:10]:
    #     classes_dir.append(name)
    processed_dir = 'Existing_folder_to_take_images_from'
    val_ratio = 0.20
    test_ratio = 0.20
    for cls in classes_dir:
        # Creating partitions of the data after shuffling
        print("$$$$$$$ Class Name " + cls + " $$$$$$$")
        src = os.path.join(processed_dir, cls)  # Folder to copy images from
        allFileNames = os.listdir(src)
        np.random.shuffle(allFileNames)
        # Cut points: train ends at 1 - (val + test), val ends at 1 - test
        train_FileNames, val_FileNames, test_FileNames = np.split(
            np.array(allFileNames),
            [int(len(allFileNames) * (1 - (val_ratio + test_ratio))),
             int(len(allFileNames) * (1 - test_ratio))])
        train_FileNames = [os.path.join(src, name) for name in train_FileNames.tolist()]
        val_FileNames = [os.path.join(src, name) for name in val_FileNames.tolist()]
        test_FileNames = [os.path.join(src, name) for name in test_FileNames.tolist()]
        print('Total images: ' + str(len(allFileNames)))
        print('Training: ' + str(len(train_FileNames)))
        print('Validation: ' + str(len(val_FileNames)))
        print('Testing: ' + str(len(test_FileNames)))
        # Creating Train / Val / Test folders (one-time use)
        os.makedirs(os.path.join(root_dir, 'train', cls))
        os.makedirs(os.path.join(root_dir, 'val', cls))
        os.makedirs(os.path.join(root_dir, 'test', cls))
        # Copy-pasting images
        for name in train_FileNames:
            shutil.copy(name, os.path.join(root_dir, 'train', cls))
        for name in val_FileNames:
            shutil.copy(name, os.path.join(root_dir, 'val', cls))
        for name in test_FileNames:
            shutil.copy(name, os.path.join(root_dir, 'test', cls))
    print("########### Train Test Val Script Ended ###########")


train_test_split()
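Because each class folder is shuffled and cut separately, every class contributes roughly the same fraction of its images to each split, which is the stratification the question asks for. When the labels live in a CSV rather than in folder names, the same idea can be applied by grouping file names by label first. Here is a minimal sketch, assuming the (filename, label) pairs have already been read from the CSV. `stratified_split` and its parameters are hypothetical names:

```python
import random
from collections import defaultdict


def stratified_split(pairs, val_pct=20, test_pct=20, seed=0):
    """Split (filename, label) pairs into train/val/test, class by class.

    Percentages are integers so the cut points are exact. Each class is
    shuffled and cut on its own, so class proportions carry over to every
    split (up to rounding for small classes).
    """
    by_label = defaultdict(list)
    for filename, label in pairs:
        by_label[label].append(filename)

    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):  # sorted for a repeatable order
        names = sorted(by_label[label])
        rng.shuffle(names)
        n = len(names)
        first_cut = n * (100 - val_pct - test_pct) // 100
        second_cut = n * (100 - test_pct) // 100
        splits["train"] += [(name, label) for name in names[:first_cut]]
        splits["val"] += [(name, label) for name in names[first_cut:second_cut]]
        splits["test"] += [(name, label) for name in names[second_cut:]]
    return splits
```

Each (filename, label) pair in the result can then be copied with shutil.copy into the folder for its split and label, as the loop above does.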
https://stackoverflow.com/questions/53074712