Problem training a convolutional neural network on a TPU when using the ImageDataGenerator class for image data augmentation

Stack Overflow user

Asked on 2021-05-29 12:39:43


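Recently I have been training a CNN, namely AlexNet, to classify brain MRI images into four classes. Training it on Google Colab with a CPU or GPU takes a long time, around 5 hours. I want to move my training to a TPU, since that hardware is designed specifically for matrix computation, but I get the error below and cannot find any way to resolve it.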

TensorFlow version: 2.5.0

Source code that checks for and initializes the TPU (if one is allocated to the runtime):

import os
import tensorflow as tf

print("OS Version & Details: ")
!lsb_release -a
print()

gpu_device_location = tpu_device_location = cpu_device_location = None

if os.environ.get('COLAB_GPU') == '1':  # .get() avoids a KeyError if the variable is unset
    print("Allocated GPU Runtime Details:")
    !nvidia-smi
    print()
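    gpu_device_name = b'Unknown GPU'  # fallback so the print below still works if the pynvml lookup fails
    try: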
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        gpu_device_name = pynvml.nvmlDeviceGetName(handle)
 
        if gpu_device_name not in {b'Tesla T4', b'Tesla P4', b'Tesla P100-PCIE-16GB'}:
            raise Exception("Unfortunately this instance does not have a T4, P4 or P100 GPU.\nSometimes Colab allocates a Tesla K80 instead of a T4, P4 or P100.\nIf you get a Tesla K80, you can factory reset your runtime to get a different GPU.")
    except Exception as hardware_exception:
        print(hardware_exception, end = '\n\n')
    gpu_device_location = tf.test.gpu_device_name()
    print(f"{gpu_device_name.decode('utf-8')} is allocated successfully at location: {gpu_device_location}")
elif 'COLAB_TPU_ADDR' in os.environ:
    tpu_device_location = f"grpc://{os.environ['COLAB_TPU_ADDR']}"
    print(f"TPU is allocated successfully at location: {tpu_device_location}.")
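    # Use the address computed above (the original referenced an undefined name, tpu_location).
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_device_location)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # Pass the resolver so the strategy targets the TPU that was just initialized.
    tpu_strategy = tf.distribute.TPUStrategy(resolver)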
else:
    cpu_device_location = "/cpu:0"
    print("No GPU or TPU was allocated, so the runtime fell back to the CPU.")

Data augmentation using ImageDataGenerator:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_size = 224
batch_size = 16

image_datagen_kwargs = dict(rescale = 1 / 255,
                            rotation_range = 15, 
                            width_shift_range = 0.1, 
                            zoom_range = 0.01, 
                            shear_range = 0.01,
                            brightness_range = [0.3, 1.5],
                            horizontal_flip = True,
                            vertical_flip = True)

train_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
validation_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
test_image_datagen = ImageDataGenerator(**image_datagen_kwargs)

train_dataset = train_image_datagen.flow_from_dataframe(train_data, 
                                                        x_col = 'image_filepaths', 
                                                        y_col = 'tumor_class', 
                                                        seed = 42, 
                                                        batch_size = batch_size,
                                                        target_size = (image_size, image_size),
                                                        color_mode = 'grayscale')
validation_dataset = validation_image_datagen.flow_from_dataframe(validation_data, 
                                                                  x_col = 'image_filepaths', 
                                                                  y_col = 'tumor_class', 
                                                                  seed = 42,
                                                                  batch_size = batch_size, 
                                                                  target_size = (image_size, image_size),
                                                                  color_mode = 'grayscale')
test_dataset = test_image_datagen.flow_from_dataframe(test_data, 
                                                      x_col = 'image_filepaths', 
                                                      y_col = 'tumor_class', 
                                                      seed = 42, 
                                                      batch_size = batch_size,
                                                      target_size = (image_size, image_size),
                                                      color_mode = 'grayscale')

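Basically, once an instance of the ImageDataGenerator class is created, you can call its flow_from_dataframe() method, which returns an instance of the DataFrameIterator class. You can use that instance to iterate over batches of images augmented according to the requested transformations.

Architecture of the AlexNet CNN using Keras: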

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, BatchNormalization, MaxPool2D, Flatten, Dense, Dropout

# For TPU training, build and compile the model inside `with tpu_strategy.scope():`.
alexnet_cnn = Sequential()
alexnet_cnn.add(Conv2D(96, kernel_size = 11, strides = 4, activation = 'relu', input_shape = (image_size, image_size, 1), name = 'Conv2D-1'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-1'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-1'))
alexnet_cnn.add(Conv2D(256, kernel_size = 5, padding = 'same', activation = 'relu', name = 'Conv2D-2'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-2'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-2'))
alexnet_cnn.add(Conv2D(384, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-3'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-3'))
alexnet_cnn.add(Conv2D(384, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-4'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-4'))
alexnet_cnn.add(Conv2D(256, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-5'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-5'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-3'))
alexnet_cnn.add(Flatten(name = 'Flatten-Layer-1'))
alexnet_cnn.add(Dense(1024, activation = 'relu', name = 'Hidden-Layer-1'))
alexnet_cnn.add(Dropout(rate = 0.5, name = 'Dropout-Layer-1'))
alexnet_cnn.add(Dense(4, activation = 'softmax', name = 'Output-Layer'))
alexnet_cnn.compile(optimizer = 'Adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
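
As a quick sanity check on the architecture above, the spatial size after each layer can be traced with the usual output-size formulas, without TensorFlow. This is only a sketch: the helper names out_size and trace_alexnet are hypothetical, and it assumes Keras defaults (padding 'valid' unless 'same' is given, stride 1 for convolutions that do not set strides).

```python
import math

def out_size(size, kernel, stride=1, padding="valid"):
    """Spatial output size of a conv/pool layer, following Keras conventions."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1

def trace_alexnet(image_size=224):
    """Return (layer name, side length, channels) for each shape-changing layer."""
    # (name, kernel, stride, padding, filters); filters is None for pooling layers.
    layers = [
        ("Conv2D-1", 11, 4, "valid", 96),
        ("Max-Pooling-1", 3, 2, "valid", None),
        ("Conv2D-2", 5, 1, "same", 256),
        ("Max-Pooling-2", 3, 2, "valid", None),
        ("Conv2D-3", 3, 1, "same", 384),
        ("Conv2D-4", 3, 1, "same", 384),
        ("Conv2D-5", 3, 1, "same", 256),
        ("Max-Pooling-3", 3, 2, "valid", None),
    ]
    size, channels, trace = image_size, 1, []  # grayscale input has 1 channel
    for name, kernel, stride, padding, filters in layers:
        size = out_size(size, kernel, stride, padding)
        channels = filters or channels  # pooling keeps the channel count
        trace.append((name, size, channels))
    return trace

trace = trace_alexnet(224)
for name, side, channels in trace:
    print(f"{name}: {side} x {side} x {channels}")

_, last_side, last_channels = trace[-1]
flatten_units = last_side * last_side * last_channels
print("Flatten units:", flatten_units)
```

Under these assumptions, a 224 x 224 grayscale input shrinks to 5 x 5 x 256 before the Flatten layer, so the first Dense layer receives 6400 inputs.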

When I start training the CNN above with the following code:

alexnet_train_history = alexnet_cnn.fit(train_dataset, 
                                        validation_data = validation_dataset,
                                        epochs = cnn_epochs)

I get the following error:

UnavailableError: 8 root error(s) found.
  (0) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[tpu_compile_succeeded_assert/_6849197215061331409/_5/_261]]
  (1) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[OptionalHasValue_6/_14]]
     [[OptionalHasValue_8/_17]]
  (2) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[strided_slice_109/_308]]
  (3) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
     [[{{node MultiDeviceIteratorGetNextFromShard}}]]
     [[RemoteCall]]
     [[IteratorGetNextAsOptional]]
     [[cond_12/switch_pre ... [truncated]

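I searched for the error above and found the issue "ImageDataGenerator does not work with tpu #34346", which says that in earlier versions of TensorFlow, TPUs did not work with DataFrameIterators.

Is there any way to resolve this problem, or any way to convert a DataFrameIterator instance into something TPU-compatible, such as TFRecords?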
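
One direction that could be tried, sketched here under assumptions and not verified against this setup: pull one pass of batches out of the iterator into in-memory NumPy arrays, then feed those arrays through tf.data instead of the Python generator. The helper name iterator_to_arrays is hypothetical. It relies only on the iterator supporting len() and integer indexing, which Keras batch iterators do, and it assumes the whole dataset fits in memory.

```python
import numpy as np

def iterator_to_arrays(batch_iterator, num_batches=None):
    """Collect (images, labels) batches from an indexable batch iterator.

    batch_iterator must support len() and batch_iterator[i] -> (x, y),
    as Keras Sequence-style iterators do. num_batches defaults to one
    full pass over the data.
    """
    if num_batches is None:
        num_batches = len(batch_iterator)
    images, labels = [], []
    for i in range(num_batches):
        x, y = batch_iterator[i]
        images.append(np.asarray(x))
        labels.append(np.asarray(y))
    # Stack all batches along the sample axis.
    return np.concatenate(images, axis=0), np.concatenate(labels, axis=0)

# Stand-in for a real iterator: two batches of 4x4 grayscale images, 4 classes.
fake_batches = [
    (np.zeros((2, 4, 4, 1), dtype=np.float32), np.ones((2, 4), dtype=np.float32)),
    (np.zeros((1, 4, 4, 1), dtype=np.float32), np.ones((1, 4), dtype=np.float32)),
]
demo_images, demo_labels = iterator_to_arrays(fake_batches)
print(demo_images.shape, demo_labels.shape)
```

With the objects from the question, this might be called as train_images, train_labels = iterator_to_arrays(train_dataset), and the arrays wrapped with tf.data.Dataset.from_tensor_slices((train_images, train_labels)).batch(batch_size, drop_remainder=True) before being passed to fit(). Note that this freezes a single augmented pass instead of producing fresh augmentations each epoch, and whether it removes the gRPC error in a given Colab runtime is an assumption to be tested.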

tensorflow