Problem training a convolutional neural network on a TPU when using the ImageDataGenerator class for image data augmentation

Stack Overflow user
Asked 2021-05-29 12:39:43
2 answers · 370 views · 0 followers · 1 vote

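Recently I have been training a CNN, namely AlexNet, to classify brain MRI images into four classes, but training it on a CPU or GPU in Google Colab takes a long time, around 5 hours. I wanted to migrate the training process to a TPU, because that hardware is designed specifically for matrix computations, but I get the error below and cannot find any way to resolve it.

TensorFlow version: 2.5.0

Source code for checking and initializing the TPU (if one is allocated to the runtime):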

import os
import tensorflow as tf

print("OS Version & Details: ")
!lsb_release -a
print()

gpu_device_location = tpu_device_location = cpu_device_location = None

# .get() avoids a KeyError when COLAB_GPU is not set in the environment.
if os.environ.get('COLAB_GPU') == '1':
    print("Allocated GPU Runtime Details:")
    !nvidia-smi
    print()
    gpu_device_name = b'Unknown GPU'  # fallback so the final print still works if pynvml fails
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        gpu_device_name = pynvml.nvmlDeviceGetName(handle)

        if gpu_device_name not in {b'Tesla T4', b'Tesla P4', b'Tesla P100-PCIE-16GB'}:
            raise Exception("Unfortunately this instance does not have a T4, P4 or P100 GPU.\nSometimes Colab allocates a Tesla K80 instead of a T4, P4 or P100.\nIf you get a Tesla K80 then you can factory reset your runtime to get another GPU.")
    except Exception as hardware_exception:
        print(hardware_exception, end = '\n\n')
    gpu_device_location = tf.test.gpu_device_name()
    print(f"{gpu_device_name.decode('utf-8')} is allocated successfully at location: {gpu_device_location}")
elif 'COLAB_TPU_ADDR' in os.environ:
    tpu_device_location = f"grpc://{os.environ['COLAB_TPU_ADDR']}"
    print(f"TPU is allocated successfully at location: {tpu_device_location}.")
    # Connect to the TPU cluster, initialize it, and build the strategy from the same resolver.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu = tpu_device_location)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    tpu_strategy = tf.distribute.TPUStrategy(resolver)
else:
    cpu_device_location = "/cpu:0"
    print("GPUs and TPUs are not allocated successfully, hence the runtime fell back to CPU.")

Data augmentation using ImageDataGenerator:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_size = 224
batch_size = 16

image_datagen_kwargs = dict(rescale = 1 / 255,
                            rotation_range = 15, 
                            width_shift_range = 0.1, 
                            zoom_range = 0.01, 
                            shear_range = 0.01,
                            brightness_range = [0.3, 1.5],
                            horizontal_flip = True,
                            vertical_flip = True)

train_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
validation_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
test_image_datagen = ImageDataGenerator(**image_datagen_kwargs)

train_dataset = train_image_datagen.flow_from_dataframe(train_data, 
                                                        x_col = 'image_filepaths', 
                                                        y_col = 'tumor_class', 
                                                        seed = 42, 
                                                        batch_size = batch_size,
                                                        target_size = (image_size, image_size),
                                                        color_mode = 'grayscale')
validation_dataset = validation_image_datagen.flow_from_dataframe(validation_data, 
                                                                  x_col = 'image_filepaths', 
                                                                  y_col = 'tumor_class', 
                                                                  seed = 42,
                                                                  batch_size = batch_size, 
                                                                  target_size = (image_size, image_size),
                                                                  color_mode = 'grayscale')
test_dataset = test_image_datagen.flow_from_dataframe(test_data, 
                                                      x_col = 'image_filepaths', 
                                                      y_col = 'tumor_class', 
                                                      seed = 42, 
                                                      batch_size = batch_size,
                                                      target_size = (image_size, image_size),
                                                      color_mode = 'grayscale')

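Basically, once an instance of the ImageDataGenerator class has been created, you can call its flow_from_dataframe() method, which returns an instance of the DataFrameIterator class. You can use that instance to iterate over augmented variants of the images, generated according to the configured transformations.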

Architecture of the AlexNet CNN using Keras:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPool2D, Flatten, Dense, Dropout

# Note: when training on a TPU, the model should be built and compiled inside tpu_strategy.scope().
alexnet_cnn = Sequential()
alexnet_cnn.add(Conv2D(96, kernel_size = 11, strides = 4, activation = 'relu', input_shape = (image_size, image_size, 1), name = 'Conv2D-1'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-1'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-1'))
alexnet_cnn.add(Conv2D(256, kernel_size = 5, padding = 'same', activation = 'relu', name = 'Conv2D-2'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-2'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-2'))
alexnet_cnn.add(Conv2D(384, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-3'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-3'))
alexnet_cnn.add(Conv2D(384, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-4'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-4'))
alexnet_cnn.add(Conv2D(256, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-5'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-5'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-3'))
alexnet_cnn.add(Flatten(name = 'Flatten-Layer-1'))
alexnet_cnn.add(Dense(1024, activation = 'relu', name = 'Hidden-Layer-1'))
alexnet_cnn.add(Dropout(rate = 0.5, name = 'Dropout-Layer-1'))
alexnet_cnn.add(Dense(4, activation = 'softmax', name = 'Output-Layer'))
alexnet_cnn.compile(optimizer = 'Adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
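
With a distribution strategy, a Keras model and its variables are generally expected to be created inside the strategy's scope. The following is only a minimal sketch of that pattern: `build_small_cnn` is a hypothetical helper, the model is a deliberately tiny stand-in rather than the AlexNet above, and the default strategy is used so the snippet also runs without a TPU. On a TPU runtime the `strategy` line would presumably be replaced by the TPUStrategy object created during initialization.

```python
import tensorflow as tf


def build_small_cnn(image_size=224, num_classes=4):
    """Hypothetical stand-in for the AlexNet above, kept tiny for illustration."""
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(image_size, image_size, 1)),
        tf.keras.layers.Conv2D(8, kernel_size=3, activation='relu'),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model


# Assumption: the default (no-op) strategy stands in for the TPU strategy here.
strategy = tf.distribute.get_strategy()

# Variables created inside the scope are placed according to the strategy.
with strategy.scope():
    small_cnn = build_small_cnn()
```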

When I start training the CNN above with the following code:

alexnet_train_history = alexnet_cnn.fit(train_dataset, 
                                        validation_data = validation_dataset,
                                        epochs = cnn_epochs)

The error I get is as follows:

UnavailableError: 8 root error(s) found.
  (0) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_6849197215061331409/_5/_261]]
  (1) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[OptionalHasValue_6/_14]]
     [[OptionalHasValue_8/_17]]
  (2) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_109/_308]]
  (3) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_12/switch_pre ... [truncated]

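I searched for the error above and found the issue "ImageDataGenerator does not work with tpu #34346", which indicates that in earlier versions of TensorFlow, TPUs did not work with DataFrameIterators.

Is there any way to work around the problem above, or any way to convert a DataFrameIterator instance into something that TPUs support, such as TFRecord?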
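
For reference, one possible shape of the TFRecord route asked about here is sketched below. It is an illustration under assumptions, not a confirmed fix for the error above: `write_tfrecord`, `parse_example`, and `load_tfrecord` are hypothetical helper names, images are assumed to be available as uint8 arrays of shape (H, W, 1) with integer class labels, and on a Colab TPU the resulting file would typically need to live somewhere the TPU workers can read, such as a Cloud Storage bucket.

```python
import tensorflow as tf

IMAGE_SIZE = 224   # matches image_size in the question
NUM_CLASSES = 4    # four tumor classes


def write_tfrecord(path, images, labels):
    """Serialize uint8 grayscale images of shape (H, W, 1) and integer labels."""
    with tf.io.TFRecordWriter(path) as writer:
        for image, label in zip(images, labels):
            png_bytes = tf.io.encode_png(image).numpy()
            feature = {
                'image': tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[png_bytes])),
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def parse_example(serialized):
    """Decode one record into a float image in [0, 1] and a one-hot label."""
    spec = {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
    }
    parsed = tf.io.parse_single_example(serialized, spec)
    image = tf.io.decode_png(parsed['image'], channels=1)
    image = tf.image.resize(image, (IMAGE_SIZE, IMAGE_SIZE)) / 255.0
    return image, tf.one_hot(parsed['label'], NUM_CLASSES)


def load_tfrecord(path, batch_size=16):
    """Build a batched tf.data pipeline from a TFRecord file."""
    dataset = tf.data.TFRecordDataset(path)
    dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    # drop_remainder=True gives every batch the same static shape.
    dataset = dataset.batch(batch_size, drop_remainder=True)
    return dataset.prefetch(tf.data.AUTOTUNE)
```

The dataset returned by `load_tfrecord` could then be passed to `fit()` in place of the DataFrameIterator. Random augmentation would still have to be added, for example with `tf.image` operations inside `parse_example`.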

Stack Overflow user

Answered 2021-08-11 00:08:43

Try using tf.keras.preprocessing.image_dataset_from_directory, which returns a tf.data.Dataset, and combine it with Keras preprocessing layers.
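
A rough sketch of what this suggestion might look like follows. Several things here are assumptions rather than part of the answer: the helper name `make_training_dataset`, a directory layout with one subfolder of images per tumor class, and the choice of layers, which only approximates some of the ImageDataGenerator settings from the question. The layer names `tf.keras.layers.RandomFlip`, `RandomRotation`, `RandomZoom`, and `Rescaling` assume a TensorFlow release newer than the 2.5.0 used in the question, where they lived under `tf.keras.layers.experimental.preprocessing`.

```python
import tensorflow as tf

IMAGE_SIZE = 224
BATCH_SIZE = 16

# Approximate counterparts of some ImageDataGenerator settings in the question.
augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip('horizontal_and_vertical'),
    tf.keras.layers.RandomRotation(15 / 360),  # factor is a fraction of a full turn
    tf.keras.layers.RandomZoom(0.01),
])
rescale = tf.keras.layers.Rescaling(1 / 255)


def make_training_dataset(directory):
    """Assumes `directory` contains one subfolder of images per class."""
    dataset = tf.keras.preprocessing.image_dataset_from_directory(
        directory,
        label_mode='categorical',   # one-hot labels for categorical_crossentropy
        color_mode='grayscale',
        image_size=(IMAGE_SIZE, IMAGE_SIZE),
        batch_size=BATCH_SIZE,
        seed=42)

    def preprocess(images, labels):
        # training=True keeps the random layers active inside the pipeline.
        return augmentation(rescale(images), training=True), labels

    dataset = dataset.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    return dataset.prefetch(tf.data.AUTOTUNE)
```

Because the whole pipeline is expressed as `tf.data` operations, it no longer relies on a Python-side generator. On a Colab TPU the image directory would still typically need to be readable by the TPU workers, for example in a Cloud Storage bucket.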

Votes: 1

Original content provided by Stack Overflow.
Source: https://stackoverflow.com/questions/67751478
