How do I tie char-tokenized data to Keras Embedding, Input, or Dense layers?
Stack Overflow user
Asked on 2018-06-01 03:16:12
1 answer · 0 views · 0 followers · 0 votes

This adapts the well-known word-tokenization example to char-level tokenization:

Code (Python):
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

texts = ['This is a text','This is NOT not a text','А это русский текст']
labels = array([1,1,0])

max_review_length = 30 #maximum length of the sentence
embedding_vector_length = 3
top_words = 10

# num_words is the number of unique tokens to keep; if there are more, only the most frequent are kept
# tokenizer = Tokenizer(top_words)
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
input_dim = len(word_index) + 1
print('word_index: ', word_index)
print('input_dim : ', input_dim )
print('Found %s unique tokens.' % len(word_index))

# max_review_length is the maximum length of the input text, so each text becomes a fixed-length vector such as [..., 0, 0, 1, 3, 50], where 1, 3, 50 are token indices
data = pad_sequences(sequences, max_review_length, padding='post')
# pad_sequences?
print('Shape of data tensor:', data.shape)
print(data)

word_index:  {' ': 1, 't': 2, 'i': 3, 's': 4, 'T': 5, 'т': 6, 'с': 7, 'h': 8, 'a': 9, 'e': 10, 'x': 11, 'к': 12, 'N': 13, 'O': 14, 'n': 15, 'o': 16, 'А': 17, 'э': 18, 'о': 19, 'р': 20, 'у': 21, 'и': 22, 'й': 23, 'е': 24}

input_dim :  25
Found 24 unique tokens.

Shape of data tensor: (3, 30)
[[ 5  8  3  4  1  3  4  1  9  1  2 10 11  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 5  8  3  4  1  3  4  1 13 14  5  1 15 16  2  1  9  1  2 10 11  2  0  0  0  0  0  0  0  0]
 [17  1 18  6 19  1 20 21  7  7 12 22 23  1  6 24 12  7  6  0  0  0  0  0  0  0  0  0  0  0]]
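To make the mapping above concrete, here is a minimal pure-Python sketch of what `Tokenizer(char_level=True)` does in essence: rank characters by corpus frequency and assign indices starting at 1, reserving 0 for padding. (This is a simplified illustration, not the actual Keras implementation; the real Tokenizer also applies filters, optional lowercasing, and OOV handling.)

```python
from collections import Counter

texts = ['This is a text', 'This is NOT not a text', 'А это русский текст']

# Count character frequencies over the whole corpus.
counts = Counter()
for t in texts:
    counts.update(t)

# Assign indices by descending frequency; index 0 is reserved for padding,
# which is why input_dim is len(char_index) + 1.
char_index = {c: i + 1 for i, (c, _) in enumerate(counts.most_common())}
input_dim = len(char_index) + 1  # 25 for these three texts (24 distinct chars)

sequences = [[char_index[c] for c in t] for t in texts]
```

The space character is the most frequent, so it gets index 1, matching the `word_index` output above.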

So how should the first Input/Dense/Embedding layers be set up for this? What values should input_dim, output_dim, and input_length take?

Code (Python):
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

model = Sequential()
#Embedding?
#model.add(Embedding(input_dim, output_dim, input_length=max_length))
#model.add(Embedding([2,2], 30, ))#, input_length=max_review_length))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(data, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(data, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
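For reference, the values the commented-out Embedding line asks about can be read straight off the printed output above: input_dim is the vocabulary size plus one (index 0 is the padding value), output_dim is the embedding_vector_length chosen earlier, and input_length is max_review_length. A quick sketch of the arithmetic, with values copied from the output above:

```python
# Values from the tokenizer output above.
vocab_size = 24                # len(word_index): 24 unique characters
input_dim = vocab_size + 1     # 25; +1 because index 0 is reserved for padding
output_dim = 3                 # embedding_vector_length
input_length = 30              # max_review_length

# The Embedding layer holds one output_dim-sized vector per index,
# so its trainable weight count is:
embedding_params = input_dim * output_dim  # 75
```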

1 Answer

Stack Overflow user

Answered on 2018-06-01 12:21:28

You have variable-length, padded input. If you naively flatten it, the network will process the padding as if it were signal and may become biased toward sentence length: if one class tends to have shorter sentences, for example, the network can latch onto length itself as a feature.

Instead, consider handling the input with a recurrent network and masking:

Code (Python):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# mask_zero=True tells downstream layers to skip the 0 padding steps
model.add(Embedding(input_dim, output_dim, input_length=max_l, mask_zero=True))
model.add(LSTM(hidden_units))
model.add(Dense(1, activation='sigmoid'))
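The effect of mask_zero can be illustrated without Keras. In this toy sketch (hypothetical embedding values, and mean-pooling standing in for the LSTM for brevity), pooling over the padded steps contaminates the result, while pooling restricted by a mask never sees the padding at all:

```python
# Toy 2-d embedding table; index 0 is the padding index, but it still
# maps to a real row, which is exactly the problem masking solves.
emb = {0: [0.5, -0.5], 5: [1.0, 0.0], 8: [0.0, 1.0], 3: [1.0, 1.0]}

padded = [5, 8, 3, 0, 0]  # a length-3 sentence padded to length 5

def mean_pool(seq, masked):
    # With masked=True, padding steps (index 0) are skipped,
    # mimicking what mask_zero=True tells downstream layers to do.
    steps = [emb[i] for i in seq if not (masked and i == 0)]
    return [sum(v[d] for v in steps) / len(steps) for d in range(2)]

naive = mean_pool(padded, masked=False)   # padding leaks into the features
clean = mean_pool(padded, masked=True)    # identical to pooling only [5, 8, 3]
```

Here `naive` differs from `clean` purely because of the two padding steps, which is the length bias the answer describes.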
Votes: 0
Original content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/-100004686
