Data Preprocessing: Methods for Handling Categorical Data

caoqi95

Published 2019-03-27 17:28:42

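one hot

From Wikipedia: in digital circuits, "one-hot" is a group of bits in which the only legal values are those with a single high (1) bit and all other bits low (0). Conversely, a similar implementation in which all bits are "1" except for a single "0" is sometimes called "one-cold".

one-hot encoding

In machine learning and deep learning, one-hot encoding is commonly used to handle categorical data. It gets its name from the fact that only the single position set to "1" is active. The following example, taken from the sklearn documentation, illustrates the idea.

In practice, data is often not continuous but discrete, with values that are independent of one another. For example, a person might be described by these features: ["male", "female"], ["from Europe", "from US", "from Asia"], ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"]. Such features can be efficiently encoded as integers without affecting their independence. For example, ["male", "from US", "uses Internet Explorer"] can be represented as [0, 1, 3], and ["female", "from Asia", "uses Chrome"] as [1, 2, 1].

However, some machine learning and deep learning algorithms cannot use these integer codes directly. They expect continuous input and would treat the codes as ordered (as if "from Asia" = 2 were greater than "from US" = 1), which usually does not match reality.

To turn these categorical features into data that algorithms can use directly, and to remove this mismatch with reality, one-hot encoding converts each integer code into a binary vector. The method transforms each categorical feature with n possible values into n binary features, exactly one of which is active.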
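The encoding described above can be sketched in plain Python. This is only an illustration under stated assumptions: the helper names `integer_encode` and `one_hot_encode` and the `CATEGORIES` list are hypothetical and are not part of sklearn or any other library; the category order simply follows the example in the text.

```python
# Categories per feature, in the order used by the example in the text.
CATEGORIES = [
    ["male", "female"],
    ["from Europe", "from US", "from Asia"],
    ["uses Firefox", "uses Chrome", "uses Safari", "uses Internet Explorer"],
]


def integer_encode(sample):
    """Map each feature value to its position in that feature's category list."""
    return [cats.index(value) for cats, value in zip(CATEGORIES, sample)]


def one_hot_encode(sample):
    """Concatenate one 0/1 block per feature, with a single 1 in each block."""
    encoded = []
    for cats, code in zip(CATEGORIES, integer_encode(sample)):
        block = [0] * len(cats)  # n binary features for n possible values
        block[code] = 1          # only the observed category is active
        encoded.extend(block)
    return encoded


sample = ["male", "from US", "uses Internet Explorer"]
print(integer_encode(sample))  # [0, 1, 3]
print(one_hot_encode(sample))  # [1, 0, 0, 1, 0, 0, 0, 0, 1]
```

The three features with 2, 3, and 4 possible values become 2 + 3 + 4 = 9 binary features, and no ordering between categories is implied any more.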
one-hot encoding in TensorFlow tf.one_hot

import tensorflow as tf

# Signature of tf.one_hot (shown for reference, not a runnable call)
tf.one_hot(
    indices,
    depth,
    on_value=None,
    off_value=None,
    axis=None,
    dtype=None,
    name=None
)

Parameters:
- indices: A Tensor of indices.
- depth: A scalar defining the depth of the one hot dimension.
- on_value: A scalar defining the value to fill in output when indices[j] = i. (default: 1)
- off_value: A scalar defining the value to fill in output when indices[j] != i. (default: 0)
- axis: The axis to fill (default: -1, a new inner-most axis).
- dtype: The data type of the output tensor.

For example:

indices = [0, 1, 2]
depth = 3
tf.one_hot(indices, depth)  # output: [3 x 3]
# [[1., 0., 0.],
#  [0., 1., 0.],
#  [0., 0., 1.]]
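To make the roles of depth, on_value, and off_value more concrete, here is a minimal NumPy sketch that mimics the behavior for the default axis=-1 case. The function `one_hot_sketch` is a hypothetical helper written for illustration; it is not TensorFlow's implementation and ignores the axis and dtype arguments.

```python
import numpy as np


def one_hot_sketch(indices, depth, on_value=1.0, off_value=0.0):
    """Hypothetical NumPy stand-in for tf.one_hot with the default axis=-1."""
    indices = np.asarray(indices)
    # Compare every index with positions 0..depth-1 along a new last axis.
    hot = indices[..., np.newaxis] == np.arange(depth)
    # Fill matching positions with on_value and all others with off_value.
    return np.where(hot, on_value, off_value)


print(one_hot_sketch([0, 1, 2], depth=3))
# An index outside 0..depth-1 (here -1) matches no position,
# so its row contains only off_value.
print(one_hot_sketch([0, 2, -1], depth=3, on_value=5.0, off_value=-1.0))
```

The second call shows that the output values do not have to be 1 and 0, and that an out-of-range index produces a row with no active position.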

one-hot encoding in Keras to_categorical

keras.utils.to_categorical(y, num_classes=None)

Parameters:
- y: class vector to be converted into a matrix (integers from 0 to num_classes - 1).
- num_classes: total number of classes.
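The original gives no example for to_categorical, so the following NumPy sketch shows what the conversion does for a 1-D vector of class labels. `to_categorical_sketch` is a hypothetical helper, not the Keras implementation; it assumes integer labels starting at 0 and, when num_classes is omitted, infers it as the largest label plus one.

```python
import numpy as np


def to_categorical_sketch(y, num_classes=None):
    """Hypothetical NumPy stand-in for to_categorical on a 1-D label vector."""
    y = np.asarray(y, dtype=int).ravel()
    if num_classes is None:
        # Assumption: labels start at 0, so the class count is max label + 1.
        num_classes = int(y.max()) + 1
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes, dtype="float32")[y]


labels = [0, 2, 1, 2]
print(to_categorical_sketch(labels))
print(to_categorical_sketch(labels, num_classes=4).shape)  # (4, 4)
```

Passing num_classes explicitly is useful when a batch of labels does not happen to contain the highest class, since the inferred width would otherwise be too small.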

one-hot encoding in Pandas get_dummies

import pandas as pd
pd.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, 
                       columns=None, sparse=False, drop_first=False)

For example:

import pandas as pd

s = pd.Series(list('abcde'))
pd.get_dummies(s, dtype=int)  # pandas >= 2.0 returns booleans by default; dtype=int gives 0/1

>>>
    a   b   c   d   e
0   1   0   0   0   0
1   0   1   0   0   0
2   0   0   1   0   0
3   0   0   0   1   0
4   0   0   0   0   1
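get_dummies also accepts a DataFrame, where the columns, prefix, and drop_first parameters from the signature come into play. The sketch below uses a small made-up DataFrame (the column names and values are assumptions chosen to echo the earlier example) and passes dtype=int so the result holds 0/1 values regardless of the pandas version.

```python
import pandas as pd

# Made-up sample data for illustration only.
df = pd.DataFrame({
    "gender": ["male", "female", "female"],
    "browser": ["Firefox", "Chrome", "Safari"],
})

# Encode only the "browser" column; new column names are prefix + "_" + category.
encoded = pd.get_dummies(df, columns=["browser"], prefix="uses", dtype=int)
print(encoded)

# drop_first=True drops the first category (here "Chrome"), leaving n - 1 columns.
reduced = pd.get_dummies(
    df, columns=["browser"], prefix="uses", drop_first=True, dtype=int
)
print(reduced)
```

With drop_first=True the dropped category is represented implicitly by a row of all zeros, which avoids perfectly correlated columns in models such as linear regression.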

Originally published: 2018.01.19