
Data Preprocessing: Methods for Handling Text Data

caoqi95
Published 2019-03-27

"This post consolidates the text-preprocessing code from the Udacity deep learning exercises, both to aid my own understanding and to outline an approach to handling text data. Copyright belongs to Udacity; I will take this down on request."

Converting text data into a trainable form

Building a word-level vocab:

  • Replace punctuation with tokens, and drop low-frequency words that occur 5 times or fewer.
from collections import Counter

def preprocess(text):

    # Replace punctuation with tokens so we can use them in our model
    text = text.lower()  # convert the whole text to lowercase
    text = text.replace('.', ' <PERIOD> ')  # replace each punctuation mark with a token
    text = text.replace(',', ' <COMMA> ')
    text = text.replace('"', ' <QUOTATION_MARK> ')
    text = text.replace(';', ' <SEMICOLON> ')
    text = text.replace('!', ' <EXCLAMATION_MARK> ')
    text = text.replace('?', ' <QUESTION_MARK> ')
    text = text.replace('(', ' <LEFT_PAREN> ')
    text = text.replace(')', ' <RIGHT_PAREN> ')
    text = text.replace('--', ' <HYPHENS> ')
    # text = text.replace('\n', ' <NEW_LINE> ')
    text = text.replace(':', ' <COLON> ')
    words = text.split()
    
    # Remove all words with 5 or fewer occurrences
    word_counts = Counter(words)
    trimmed_words = [word for word in words if word_counts[word] > 5]

    return trimmed_words
with open('data/text8') as f:
    text = f.read()
words = preprocess(text)
  • Create dictionaries that map words to integers and, conversely, integers back to words.
def create_lookup_tables(words):
    """
    Create lookup tables for vocabulary
    :param words: Input list of words
    :return: A tuple of dicts (vocab_to_int, int_to_vocab)
    """
    word_counts = Counter(words)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
    vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}

    return vocab_to_int, int_to_vocab
vocab_to_int, int_to_vocab = create_lookup_tables(words)
# convert each word in the text to its integer id and store the result in a list
int_words = [vocab_to_int[word] for word in words]
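A quick sanity check (a minimal sketch; the exact output depends on the corpus, but ids are assigned in order of descending frequency, so id 0 is the most frequent word):

print(len(vocab_to_int))   # vocabulary size
print(int_to_vocab[0])     # the most frequent word in the corpus
print(int_words[:5])       # the first five words as integer ids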

Building a character-level vocab:

import numpy as np

with open('anna.txt', 'r') as f:
    text = f.read()
vocab = sorted(set(text))
## set() splits the text into unique elements (characters),
## and sorted() turns the set into a list, giving a character-level vocabulary
vocab_to_int = {c: i for i, c in enumerate(vocab)}
int_to_vocab = dict(enumerate(vocab))
# convert each character in the text to its integer id and store the result in an array
encoded = np.array([vocab_to_int[c] for c in text], dtype=np.int32)
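To verify the encoding, a minimal round-trip check (assuming encoded and int_to_vocab from above):

print(''.join(int_to_vocab[i] for i in encoded[:30]))  # reproduces the first 30 characters of the text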

Handling high-frequency, low-information words -- Subsampling

This method comes from Section 2.3 of the NIPS paper from Mikolov et al.:

In very large corpora, the most frequent words can easily occur hundreds of millions of times (e.g., “in”, “the”, and “a”). Such words usually provide less information value than the rare words. For example, while the Skip-gram model benefits from observing the co-occurrences of “France” and “Paris”, it benefits much less from observing the frequent co-occurrences of “France” and “the”, as nearly every word co-occurs frequently within a sentence with “the”. This idea can also be applied in the opposite direction; the vector representations of frequent words do not change significantly after training on several million examples.

In short: the most frequent words (“in”, “the”, “a”) carry far less information than rare words, so removing these frequent, uninformative words strips noise from the data, which makes training both faster and better.

Concretely, for each word $w_i$ in the training set, we compute its probability of being discarded, $P(w_i)$, with the following formula:

$$P(w_i) = 1 - \sqrt{\frac{t}{f(w_i)}}$$

where $t$ is a chosen threshold, typically around $10^{-5}$, and $f(w_i)$ is the frequency of the word in the whole training set.
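For example, with $t = 10^{-5}$, a word that makes up 1% of the corpus ($f(w_i) = 0.01$) gets $P(w_i) = 1 - \sqrt{10^{-5}/0.01} \approx 0.97$, so about 97% of its occurrences are discarded, while any word whose frequency is at or below the threshold is never discarded.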

We chose this subsampling formula because it aggressively subsamples words whose frequency is greater than t while preserving the ranking of the frequencies. Although this subsampling formula was chosen heuristically, we found it to work well in practice. It accelerates learning and even significantly improves the accuracy of the learned vectors of the rare words...

Code implementation:

from collections import Counter
import numpy as np
import random

# set the threshold parameter
t = 0.00001

# calculate the discard probability P(w) for each word
word_counts = Counter(int_words)
total_count = len(int_words)
Pw = {word: 1 - np.sqrt(t / (count / total_count)) for word, count in word_counts.items()}

# get the final subsampled word list: keep each occurrence with probability 1 - P(w)
train_words = []
for word in int_words:
    if random.random() < (1 - Pw[word]):
        train_words.append(word)
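A rough before-and-after check (the exact numbers vary between runs because the sampling is random):

print(len(int_words), len(train_words))  # the subsampled corpus should be noticeably smaller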

Batching the data

Processing character-level samples

First, we need to discard some of the text so that we are left with only complete batches. Each batch holds N×M characters, where N is the batch size (the number of sequences) and M is the number of steps. Dividing the length of the input array arr by N×M gives the total number of batches K: K = len(arr)/(N×M). Once we know K, we know how many characters to take from arr in total, namely N×M×K; we truncate the input to that length and discard the rest.

Next, we split arr into N sequences. The single line arr = arr.reshape((batch_size, -1)) handles this: batch_size is fixed as the first dimension, and the second dimension is inferred automatically (the -1 is a placeholder that NumPy fills in). The resulting array is N×(M·K).

With this reshaped array we can iterate over our batches. The idea is that each batch is an N×M window onto the N×(M·K) array. For example, when N is 2 and M is 3, each window on the array is of size 2×3. We also want target data, which is simply the input data shifted by one character. The implementation is as follows:

def get_batches(arr, batch_size, n_steps):
    '''Create a generator that returns batches of size
       batch_size x n_steps from arr.
       
       Arguments
       ---------
       arr: Array you want to make batches from
       batch_size: Batch size, the number of sequences per batch
       n_steps: Number of sequence steps per batch
    '''
    # Get the number of characters per batch and number of batches we can make
    characters_per_batch = batch_size*n_steps
    n_batches = len(arr)//characters_per_batch
    
    # Keep only enough characters to make full batches
    arr = arr[:n_batches*characters_per_batch]

    # Reshape into batch_size rows
    arr = arr.reshape((batch_size, -1))
    
    for n in range(0, arr.shape[1], n_steps):
        # The features
        x = arr[:, n:n+n_steps]
        # The targets, shifted by one character;
        # the last target wraps around to the first character of the window
        y = np.zeros_like(x)
        y[:, :-1], y[:, -1] = x[:, 1:], x[:, 0]

        yield x, y
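A minimal usage sketch (assuming encoded from the character-level vocab above; the batch size and step count here are arbitrary):

batches = get_batches(encoded, batch_size=8, n_steps=50)
x, y = next(batches)
print(x.shape, y.shape)  # both (8, 50)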

Processing word-level samples: this is essentially the same as the character-level case. The implementation is as follows:

def get_batches(int_text, batch_size, seq_length):
    """
    Return batches of input and target
    :param int_text: Text with the words replaced by their ids
    :param batch_size: The size of batch
    :param seq_length: The length of sequence
    :return: Batches as a Numpy array
    """
    # number of full batches we can make
    arr_int_text = np.array(int_text)
    n_batches = len(int_text) // (batch_size * seq_length)
    
    # inputs: keep only enough words for full batches, then reshape into batch_size rows
    arr = arr_int_text[:n_batches * batch_size * seq_length]
    arr_re = arr.reshape((batch_size, -1))
    
    x = list()
    for i in range(0, arr_re.shape[1], seq_length):
        inputs = arr_re[:, i:i + seq_length]
        x.append(inputs)
     
    # targets: the inputs shifted by one word;
    # np.roll wraps the final target around to the first word
    arr_shifted = np.roll(arr, -1)
    arr_shifted = arr_shifted.reshape((batch_size, -1))
    
    y = list()
    for n in range(0, arr_shifted.shape[1], seq_length):
        targets = arr_shifted[:, n:n + seq_length]
        y.append(targets)
    
    batches = np.array(list(zip(x, y)))
    return batches
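A small sanity check with a toy id list (hypothetical values, chosen just to make the shapes easy to see):

batches = get_batches(list(range(13)), batch_size=2, seq_length=3)
print(batches.shape)  # (2, 2, 2, 3): 2 batches, each an (input, target) pair of (2, 3) arrays
print(batches[0][0])  # first input:  [[0 1 2] [6 7 8]]
print(batches[0][1])  # first target: [[1 2 3] [7 8 9]]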