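Recently I have been training a CNN, namely AlexNet, to classify brain MRI images into four classes, but training it on Google Colab with a CPU or GPU takes a long time, roughly 5 hours. I want to move the training to a TPU, since that hardware is designed specifically for matrix computation, but I get the error below and cannot find any way to resolve it.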
TensorFlow version: 2.5.0
Source code for checking and initializing the TPU (if one is allocated to the runtime):
import os
import tensorflow as tf

print("OS Version & Details: ")
!lsb_release -a
print()
gpu_device_location = tpu_device_location = cpu_device_location = None
# Colab sets COLAB_GPU to '1' on GPU runtimes; .get() avoids a KeyError elsewhere.
if os.environ.get('COLAB_GPU') == '1':
    print("Allocated GPU Runtime Details:")
    !nvidia-smi
    print()
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        gpu_device_name = pynvml.nvmlDeviceGetName(handle)
        if gpu_device_name not in {b'Tesla T4', b'Tesla P4', b'Tesla P100-PCIE-16GB'}:
            raise Exception("Unfortunately this instance does not have a T4, P4 or P100 GPU.\nSometimes Colab allocates a Tesla K80 instead of a T4, P4 or P100.\nIf you get a Tesla K80 then you can factory reset your runtime to get another GPU.")
    except Exception as hardware_exception:
        print(hardware_exception, end = '\n\n')
    gpu_device_location = tf.test.gpu_device_name()
    print(f"{gpu_device_name.decode('utf-8')} is allocated successfully at location: {gpu_device_location}")
elif 'COLAB_TPU_ADDR' in os.environ:
    tpu_device_location = f"grpc://{os.environ['COLAB_TPU_ADDR']}"
    print(f"TPU is allocated successfully at location: {tpu_device_location}.")
    # Connect to the TPU cluster at the address above and initialize the TPU system.
    resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu_device_location)
    tf.config.experimental_connect_to_cluster(resolver)
    tf.tpu.experimental.initialize_tpu_system(resolver)
    # Pass the resolver so the strategy targets this TPU.
    tpu_strategy = tf.distribute.TPUStrategy(resolver)
else:
    cpu_device_location = "/cpu:0"
    print("Neither a GPU nor a TPU was allocated, so the runtime fell back to the CPU.")
Data augmentation using ImageDataGenerator:
from tensorflow.keras.preprocessing.image import ImageDataGenerator

image_size = 224
batch_size = 16
# Augmentation settings shared by the train, validation and test generators.
image_datagen_kwargs = dict(rescale = 1 / 255,
                            rotation_range = 15,
                            width_shift_range = 0.1,
                            zoom_range = 0.01,
                            shear_range = 0.01,
                            brightness_range = [0.3, 1.5],
                            horizontal_flip = True,
                            vertical_flip = True)
train_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
validation_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
test_image_datagen = ImageDataGenerator(**image_datagen_kwargs)
train_dataset = train_image_datagen.flow_from_dataframe(train_data,
                                                        x_col = 'image_filepaths',
                                                        y_col = 'tumor_class',
                                                        seed = 42,
                                                        batch_size = batch_size,
                                                        target_size = (image_size, image_size),
                                                        color_mode = 'grayscale')
validation_dataset = validation_image_datagen.flow_from_dataframe(validation_data,
                                                                  x_col = 'image_filepaths',
                                                                  y_col = 'tumor_class',
                                                                  seed = 42,
                                                                  batch_size = batch_size,
                                                                  target_size = (image_size, image_size),
                                                                  color_mode = 'grayscale')
test_dataset = test_image_datagen.flow_from_dataframe(test_data,
                                                      x_col = 'image_filepaths',
                                                      y_col = 'tumor_class',
                                                      seed = 42,
                                                      batch_size = batch_size,
                                                      target_size = (image_size, image_size),
                                                      color_mode = 'grayscale')
Basically, once an instance of the ImageDataGenerator class has been created, you can call its flow_from_dataframe() method, which returns an instance of the DataFrameIterator class. You can use that instance to iterate over the augmented variants of the images.
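Architecture of the AlexNet CNN using Keras:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import BatchNormalization, Conv2D, Dense, Dropout, Flatten, MaxPool2D

alexnet_cnn = Sequential()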
alexnet_cnn.add(Conv2D(96, kernel_size = 11, strides = 4, activation = 'relu', input_shape = (image_size, image_size, 1), name = 'Conv2D-1'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-1'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-1'))
alexnet_cnn.add(Conv2D(256, kernel_size = 5, padding = 'same', activation = 'relu', name = 'Conv2D-2'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-2'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-2'))
alexnet_cnn.add(Conv2D(384, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-3'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-3'))
alexnet_cnn.add(Conv2D(384, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-4'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-4'))
alexnet_cnn.add(Conv2D(256, kernel_size = 3, padding = 'same', activation = 'relu', name = 'Conv2D-5'))
alexnet_cnn.add(BatchNormalization(name = 'Batch-Normalization-5'))
alexnet_cnn.add(MaxPool2D(pool_size = 3, strides = 2, name = 'Max-Pooling-3'))
alexnet_cnn.add(Flatten(name = 'Flatten-Layer-1'))
alexnet_cnn.add(Dense(1024, activation = 'relu', name = 'Hidden-Layer-1'))
alexnet_cnn.add(Dropout(rate = 0.5, name = 'Dropout-Layer-1'))
alexnet_cnn.add(Dense(4, activation = 'softmax', name = 'Output-Layer'))
alexnet_cnn.compile(optimizer = 'Adam', loss = 'categorical_crossentropy', metrics = ['accuracy'])
When I start training the CNN above with the following code:
alexnet_train_history = alexnet_cnn.fit(train_dataset,
                                        validation_data = validation_dataset,
                                        epochs = cnn_epochs)
The error I get is as follows:
UnavailableError: 8 root error(s) found.
(0) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[tpu_compile_succeeded_assert/_6849197215061331409/_5/_261]]
(1) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[OptionalHasValue_6/_14]]
[[OptionalHasValue_8/_17]]
(2) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
[[strided_slice_109/_308]]
(3) Unavailable: {{function_node __inference_train_function_38767}} failed to connect to all addresses
Additional GRPC error information from remote target /job:localhost/replica:0/task:0/device:CPU:0:
:{"created":"@1622146086.692146903","description":"Failed to pick subchannel","file":"third_party/grpc/src/core/ext/filters/client_channel/client_channel.cc","file_line":5420,"referenced_errors":[{"created":"@1622146086.692145579","description":"failed to connect to all addresses","file":"third_party/grpc/src/core/ext/filters/client_channel/lb_policy/pick_first/pick_first.cc","file_line":398,"grpc_status":14}]}
[[{{node MultiDeviceIteratorGetNextFromShard}}]]
[[RemoteCall]]
[[IteratorGetNextAsOptional]]
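[[cond_12/switch_pre ... [truncated]
I searched for the error above and found the issue "ImageDataGenerator does not work with tpu #34346", which indicates that in earlier versions of TensorFlow, TPUs did not work with DataFrameIterators.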
Is there any way to resolve the issue above, or any way to convert a DataFrameIterator instance into something a TPU supports, such as TFRecords?
Answered on 2021-08-11 00:08:43
Try using tf.keras.preprocessing.image_dataset_from_directory or tf.data.Dataset, and combine it with Keras preprocessing layers.
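To make the tf.data.Dataset suggestion concrete, here is a minimal sketch rather than a definitive fix. It assumes the images have already been loaded into memory as a uint8 NumPy array and the labels as integer class indices. As far as I know, the input pipeline on a Colab TPU cannot read files from the notebook's local disk, so in-memory arrays or a Cloud Storage bucket are the usual options. The names `make_dataset`, `augment` and `NUM_CLASSES` are hypothetical, and the `tf.image` ops are a swapped-in stand-in for the Keras preprocessing layers mentioned above, not an exact match for the ImageDataGenerator settings in the question.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 4  # four tumor classes, as in the question


def augment(image, label):
    # Light augmentation with tf.image ops (a stand-in for ImageDataGenerator).
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_flip_up_down(image)
    image = tf.image.random_brightness(image, max_delta=0.2)
    # Brightness shifts can leave [0, 1], so clip back into range.
    return tf.clip_by_value(image, 0.0, 1.0), label


def preprocess(image, label):
    # Scale pixels to [0, 1] and one-hot encode for categorical_crossentropy.
    image = tf.cast(image, tf.float32) / 255.0
    return image, tf.one_hot(label, NUM_CLASSES)


def make_dataset(images, labels, batch_size, training=False):
    """images: uint8 array of shape (N, H, W, 1); labels: int array of shape (N,)."""
    ds = tf.data.Dataset.from_tensor_slices((images, labels))
    if training:
        ds = ds.shuffle(len(images), seed=42)
    ds = ds.map(preprocess, num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    # drop_remainder=True keeps every batch the same shape, which TPUs expect.
    return ds.batch(batch_size, drop_remainder=True).prefetch(tf.data.AUTOTUNE)
```

With a dataset built this way, the model would also need to be created and compiled inside `with tpu_strategy.scope():` before calling `fit`, so that its variables are placed on the TPU. The batch size passed here is the global one, which is split across the TPU cores, so a value larger than 16 is probably worth trying.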
https://stackoverflow.com/questions/67751478