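Table 2 of http://arxiv.org/pdf/1409.0575v3.pdf states that the ILSVRC2012 training set contains 1,281,167 images, with 732 to 1300 images per class.
Ideally, I would like to avoid downloading 138 GB just for this purpose, since I don't need the data otherwise.
Does anyone know the exact number of images per class in the training set, i.e. the prior probability of each class in the training set?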
Posted on 2016-08-16 09:50:29
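I couldn't find a text file of URLs for the ILSVRC2012 training set, but for the full ImageNet you can download just the URLs as a text file: http://image-net.org/download
I wrote the following script to get a feeling for the data: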
#!/usr/bin/env python
"""Analyze the distribution of classes in ImageNet."""
import matplotlib.pyplot as plt

classes = {}
images = 0
with open("fall11_urls.txt") as f:
    for line in f:
        # Each line is "<wnid>_<image number>\t<url>"
        label, _ = line.split("\t", 1)
        wnid, _ = label.split("_")
        if wnid in classes:
            classes[wnid] += 1
        else:
            classes[wnid] = 1
        images += 1

# Output
print("Classes: %i" % len(classes))
print("Images: %i" % images)
class_counts = [count for _, count in classes.items()]
plt.hist(class_counts, bins=range(max(class_counts)))
plt.show()

This gives:
Classes: 21841
Images: 14197122
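Since the question asks for per-class probabilities rather than raw counts, the counts can be normalised into empirical priors. The following is only a minimal sketch: it assumes a dict mapping WordNet IDs to image counts, like the `classes` dict built by the script above. The helper name `class_priors` and the sample counts are hypothetical, and applied to `fall11_urls.txt` the result would describe the full ImageNet URL list, not the ILSVRC2012 training set.

```python
def class_priors(counts):
    """Return {wnid: count / total} for a {wnid: count} mapping."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no images counted")
    return {wnid: n / total for wnid, n in counts.items()}


# Made-up counts, for illustration only (not real ImageNet figures).
example_counts = {"n00000001": 732, "n00000002": 1300, "n00000003": 968}
priors = class_priors(example_counts)
for wnid, p in sorted(priors.items()):
    print("%s:\t%.4f" % (wnid, p))
```

With the real data, `class_priors(classes)` would give one probability per WordNet ID, and the probabilities sum to 1.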
Classes with fewer than 100 examples are almost useless. Let's remove them from the plot, and also increase the bin size to 25:
#!/usr/bin/env python
"""Analyze the distribution of classes in ImageNet."""
import matplotlib.pyplot as plt

classes = {}
images = 0
with open("fall11_urls.txt") as f:
    for line in f:
        # Each line is "<wnid>_<image number>\t<url>"
        label, _ = line.split("\t", 1)
        wnid, _ = label.split("_")
        if wnid in classes:
            classes[wnid] += 1
        else:
            classes[wnid] = 1
        images += 1

# Output
print("Classes: %i" % len(classes))
print("Images: %i" % images)
class_counts = [count for _, count in classes.items()]
plt.title('ImageNet class distribution')
plt.xlabel('Amount of available images')
plt.ylabel('Number of classes')
min_examples = 100  # classes below this count fall outside the bins
bin_size = 25
plt.hist(class_counts, bins=range(min_examples, max(class_counts), bin_size))
plt.show()
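Or with seaborn:
import matplotlib.pyplot as plt
import seaborn as sns
# distplot is deprecated since seaborn 0.11, and newer releases no longer expose sns.plt
sns.histplot(class_counts, kde=True)
plt.show()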
The 10 classes with the most data are:
top10 = sorted(classes.items(), key=lambda n: n[1], reverse=True)[:10]
for class_label, count in top10:
    print("%s:\t%i" % (class_label, count))

n02094433: 3047 (Yorkshire terrier)
n02086240: 2563 (Shih-Tzu)
n01882714: 2469 (koala bear, kangaroo bear, native bear, )
n02087394: 2449 (Rhodesian ridgeback)
n02100735: 2426 (English setter)
n00483313: 2410 (singles)
n02279972: 2386 (monarch butterfly, Danaus plexippus)
n09428293: 2382 (seashore)
n02138441: 2341 (meerkat)
n02100583: 2334 (vizsla, Hungarian pointer)

You can look up the names with http://www.image-net.org/api/text/wordnet.synset.getwords?wnid=n02094433
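Looking up several IDs by hand gets tedious, so the lookup could be scripted. This is only a sketch under assumptions: the endpoint is the one quoted in this answer and may no longer be served, the helper names `getwords_url` and `fetch_words` are hypothetical, and the response is assumed to be plain text with one name per line.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the answer; it may no longer be available.
GETWORDS_URL = "http://www.image-net.org/api/text/wordnet.synset.getwords"


def getwords_url(wnid):
    """Build the lookup URL for one WordNet ID."""
    return GETWORDS_URL + "?" + urlencode({"wnid": wnid})


def fetch_words(wnid, timeout=10):
    """Fetch the names for a WordNet ID (needs network access).

    Assumption: the endpoint answers with plain text, one name per line.
    """
    with urlopen(getwords_url(wnid), timeout=timeout) as response:
        text = response.read().decode("utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


# Only the URL is built here; no request is sent.
print(getwords_url("n02094433"))
```

Calling `fetch_words(class_label)` for each entry in `top10` would then return the names, provided the service still responds.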
https://datascience.stackexchange.com/questions/11777