# TensorFlow from 1 to 2 (Part 6): Structured Data Preprocessing and Heart Disease Prediction

#### Heart Disease Prediction

| Field | Description | Type |
| --- | --- | --- |
| Age | | integer |
| Sex | (1 = male; 0 = female) | integer |
| CP | | integer |
| Trestbpd | | integer |
| Chol | | integer |
| FBS | (fasting blood sugar reaches 120 mg/dl) (1 = yes; 0 = no) | integer |
| RestECG | | integer |
| Thalach | | integer |
| Exang | | integer |
| Oldpeak | | integer |
| Slope | | float |
| CA | | integer |
| Thal | | string |
| Target | | integer |

```
$ python3
Python 3.7.3 (default, Mar 27 2019, 09:23:39)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
>>> import pandas as pd
>>> dataframe = pd.read_csv('heart.csv')
>>> dataframe.head()
   age  sex  cp  trestbps  chol  fbs  restecg  thalach  exang  oldpeak  slope  ca        thal  target
0   63    1   1       145   233    1        2      150      0      2.3      3   0       fixed       0
1   67    1   4       160   286    0        2      108      1      1.5      2   3      normal       1
2   67    1   4       120   229    0        2      129      1      2.6      2   2  reversible       0
3   37    1   3       130   250    0        0      187      0      3.5      3   0      normal       0
4   41    0   2       130   204    0        2      172      0      1.4      1   0      normal       0
>>>
```


#### Preprocessing Structured Data

Take the `age` field as an example. A batch of raw values is just a column of continuous numbers:

```
[[60.]
 [41.]
 [61.]
 [59.]
 [52.]]
```

For modeling, ages are better grouped into ranges; after bucketizing, each value becomes an 11-dimensional one-hot vector (10 boundaries produce 11 buckets):

```
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]]
```

TensorFlow already has a dedicated feature column for this situation; the following single statement does the job:

```
# Run this code within the complete program
age_buckets = feature_column.bucketized_column(
    age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
```
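Conceptually, `bucketized_column` just digitizes each value against the boundary list and one-hot encodes the resulting bucket index. A minimal NumPy sketch of that idea (not TensorFlow's actual implementation) reproduces the output above:

```python
import numpy as np

boundaries = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65]
ages = np.array([60., 41., 61., 59., 52.])

# np.digitize returns the bucket index of each value;
# 10 boundaries partition the axis into 11 buckets (indices 0..10)
bucket_idx = np.digitize(ages, boundaries)

# One-hot encode the bucket indices
one_hot = np.eye(len(boundaries) + 1)[bucket_idx]
print(bucket_idx)  # [9 5 9 8 7] -- e.g. age 60 falls in bucket [60, 65)
print(one_hot)
```

Note that the rows match the one-hot batch shown earlier: ages 60 and 61 land in the same bucket and therefore get identical encodings.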

```
# Run this code within the complete program
# Build a categorical column from the thal field's raw values
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
# Convert it to a one-hot encoding
thal_one_hot = feature_column.indicator_column(thal)
```

```
[[0. 0. 1.]
 [0. 1. 0.]
 [0. 1. 0.]
 [0. 0. 1.]
 [0. 1. 0.]]
```
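The vocabulary-list column is essentially a lookup of each string's position in the vocabulary, followed by one-hot encoding. A small sketch of the idea in plain Python (the batch values are chosen to match the output above):

```python
vocab = ['fixed', 'normal', 'reversible']

def one_hot(value, vocab):
    """Map a string to a one-hot vector over the vocabulary."""
    vec = [0.0] * len(vocab)
    vec[vocab.index(value)] = 1.0
    return vec

batch = ['reversible', 'normal', 'normal', 'reversible', 'normal']
encoded = [one_hot(v, vocab) for v in batch]
print(encoded[0])  # [0.0, 0.0, 1.0]
```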

```
# For illustration only; do not execute this separately
# Embed the thal field into an 8-dimensional space
thal_embedding = feature_column.embedding_column(thal, dimension=8)
```

```
[[ 0.15909313 -0.17830053 -0.01482905  0.26818395 -0.7063258   0.17809148
  -0.33043832  0.34121528]
 [ 0.2877485   0.20686264  0.2649153  -0.2827308   0.10686944 -0.12080232
  -0.28829345  0.43876123]
 [ 0.2877485   0.20686264  0.2649153  -0.2827308   0.10686944 -0.12080232
  -0.28829345  0.43876123]
 [ 0.15909313 -0.17830053 -0.01482905  0.26818395 -0.7063258   0.17809148
  -0.33043832  0.34121528]
 [ 0.2877485   0.20686264  0.2649153  -0.2827308   0.10686944 -0.12080232
  -0.28829345  0.43876123]]
```
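Notice that rows 2, 3, and 5 above are identical: an embedding column is just a trainable lookup table with one row per vocabulary entry, so identical categories share a vector. A NumPy sketch of the lookup (with a randomly initialized table standing in for the trainable weights):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ['fixed', 'normal', 'reversible']

# The embedding is a (vocab_size x dimension) table; before training,
# its rows are random, which is why the printed values look arbitrary.
embedding_table = rng.normal(size=(len(vocab), 8))

batch = ['reversible', 'normal', 'normal', 'reversible', 'normal']
indices = [vocab.index(v) for v in batch]
vectors = embedding_table[indices]

print(vectors.shape)  # (5, 8)
# Identical categories share the same row of the table:
print(np.array_equal(vectors[1], vectors[2]))  # True
```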

```
# For illustration only; do not execute this separately
thal_hashed = feature_column.categorical_column_with_hash_bucket(
    'thal', hash_bucket_size=1000)
```

```
# For illustration only; do not execute this separately
crossed_feature = feature_column.crossed_column(
    [age_buckets, thal], hash_bucket_size=1000)
```
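A crossed column combines two categorical values into a single joint feature and hashes the combination into a fixed number of buckets, so the model can learn interactions such as "age 60-65 with reversible thal". A rough sketch of the idea (the hash function here is illustrative; TensorFlow uses its own internal hashing):

```python
import hashlib

def cross_and_hash(age_bucket, thal, hash_bucket_size=1000):
    """Combine two categorical values and hash them into one bucket."""
    joint = f'{age_bucket}_x_{thal}'
    digest = hashlib.md5(joint.encode()).hexdigest()
    return int(digest, 16) % hash_bucket_size

b1 = cross_and_hash(9, 'reversible')  # age bucket [60, 65) crossed with thal
b2 = cross_and_hash(5, 'normal')
print(0 <= b1 < 1000 and 0 <= b2 < 1000)  # True
```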

#### Building the Model

```
# Define the input layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

# The input layer must be the first layer of the model
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
```

#### Complete Code

```
#!/usr/bin/env python3
from __future__ import absolute_import, division, print_function

# Required imports
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import feature_column
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split

# Open the sample data file
# URL = 'https://storage.googleapis.com/applied-dl/heart.csv'   # use this line to read directly from the web
URL = 'heart.csv'
dataframe = pd.read_csv(URL)
# Show the first few rows of the data
print(dataframe.head())

# Set aside 20% of the data as the test set
train, test = train_test_split(dataframe, test_size=0.2)
# Use 64% of the data for training and 16% for validation
train, val = train_test_split(train, test_size=0.2)
# Show the record counts of the training, validation, and test sets
print(len(train), 'train examples')
print(len(val), 'validation examples')
print(len(test), 'test examples')

# Convert a Pandas DataFrame into a TensorFlow Dataset
def df_to_dataset(dataframe, shuffle=True, batch_size=32):
    dataframe = dataframe.copy()
    # The target field is the heart-disease diagnosis; extract it as labels
    labels = dataframe.pop('target')
    # Build the Dataset
    ds = tf.data.Dataset.from_tensor_slices((dict(dataframe), labels))
    if shuffle:
        # Shuffle if requested
        ds = ds.shuffle(buffer_size=len(dataframe))
    # Set the number of records per batch
    ds = ds.batch(batch_size)
    return ds

# Convert all three sets to Datasets; only the training set is shuffled
train_ds = df_to_dataset(train)
val_ds = df_to_dataset(val, shuffle=False)
test_ds = df_to_dataset(test, shuffle=False)

# Holds the feature columns the model will use
feature_columns = []

# Add the numeric columns by field name
for header in ['age', 'trestbps', 'chol', 'thalach', 'oldpeak', 'slope', 'ca']:
    feature_columns.append(feature_column.numeric_column(header))

# Take the age data
age = feature_column.numeric_column("age")
# Bucketize ages into ranges 18-25/25-30/30-35/.../60-65, forming a one-hot encoding
age_buckets = feature_column.bucketized_column(age, boundaries=[18, 25, 30, 35, 40, 45, 50, 55, 60, 65])
# Add the bucketized ages as a new feature column
feature_columns.append(age_buckets)

# Build a categorical column from the thal field's raw values
thal = feature_column.categorical_column_with_vocabulary_list(
    'thal', ['fixed', 'normal', 'reversible'])
# One-hot encode it
thal_one_hot = feature_column.indicator_column(thal)
# Add it as a new feature column
feature_columns.append(thal_one_hot)

# Embed thal into an 8-dimensional vector space
thal_embedding = feature_column.embedding_column(thal, dimension=8)
feature_columns.append(thal_embedding)

# Cross the age buckets with thal and add the result as a new column
crossed_feature = feature_column.crossed_column([age_buckets, thal], hash_bucket_size=1000)
crossed_feature = feature_column.indicator_column(crossed_feature)
feature_columns.append(crossed_feature)

# Define the input layer
feature_layer = tf.keras.layers.DenseFeatures(feature_columns)

# Define the full model
model = tf.keras.Sequential([
    feature_layer,
    layers.Dense(128, activation='relu'),
    layers.Dense(128, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

# Train
model.fit(train_ds,
          validation_data=val_ds,
          epochs=5)
# Evaluate
test_loss, test_acc = model.evaluate(test_ds)
# Show the test accuracy
print('===================\nTest accuracy:', test_acc)
```

```
Epoch 1/5
7/7 [==============================] - 1s 110ms/step - loss: 1.2045 - accuracy: 0.5884 - val_loss: 1.1234 - val_accuracy: 0.7755
Epoch 2/5
7/7 [==============================] - 0s 46ms/step - loss: 1.0691 - accuracy: 0.6383 - val_loss: 0.5731 - val_accuracy: 0.7959
Epoch 3/5
7/7 [==============================] - 0s 43ms/step - loss: 0.9016 - accuracy: 0.7100 - val_loss: 0.5924 - val_accuracy: 0.7551
Epoch 4/5
7/7 [==============================] - 0s 44ms/step - loss: 0.5362 - accuracy: 0.7055 - val_loss: 0.6440 - val_accuracy: 0.7755
Epoch 5/5
7/7 [==============================] - 0s 43ms/step - loss: 0.7290 - accuracy: 0.6940 - val_loss: 0.5966 - val_accuracy: 0.7347
2/2 [==============================] - 0s 24ms/step - loss: 0.4600 - accuracy: 0.7705
===================
Test accuracy: 0.7704918
```
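The `accuracy` metric reported above is straightforward for a sigmoid output: each predicted probability is thresholded at 0.5 and compared with the true label. A small sketch of that computation, using hypothetical predictions rather than the model's actual outputs:

```python
# Hypothetical sigmoid outputs and true labels, for illustration only
probs  = [0.91, 0.23, 0.67, 0.45, 0.88, 0.12, 0.55, 0.30]
labels = [1,    0,    1,    1,    1,    0,    0,    0]

# Threshold at 0.5, then count matches against the labels
preds = [1 if p >= 0.5 else 0 for p in probs]
correct = sum(p == y for p, y in zip(preds, labels))
accuracy = correct / len(labels)
print(accuracy)  # 0.75
```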

(To be continued...)
