How do I tie char-tokenized data to Keras Embedding, Input, or Dense layers?
Stack Overflow user
Asked on 2018-06-01 03:16:12
1 answer · 0 views · 0 followers · 0 votes

This adapts the well-known word-tokenization example to char-level tokenization:

Code (Python):
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical

texts = ['This is a text','This is NOT not a text','А это русский текст']
labels = array([1,1,0])

max_review_length = 30 #maximum length of the sentence
embedding_vector_length = 3
top_words = 10

# num_words is the number of unique tokens to keep; if there are more, only the most frequent are kept
# tokenizer = Tokenizer(top_words)
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)
word_index = tokenizer.word_index
input_dim = len(word_index) + 1
print('word_index: ', word_index)
print('input_dim : ', input_dim )
print('Found %s unique tokens.' % len(word_index))

# max_review_length is the maximum length of the input text, so each text becomes a fixed-length vector such as [..., 0, 0, 1, 3, 50], where 1, 3, 50 are token indices
data = pad_sequences(sequences, max_review_length, padding='post')
# pad_sequences?
print('Shape of data tensor:', data.shape)
print(data)

word_index:  {' ': 1, 't': 2, 'i': 3, 's': 4, 'T': 5, 'т': 6, 'с': 7, 'h': 8, 'a': 9, 'e': 10, 'x': 11, 'к': 12, 'N': 13, 'O': 14, 'n': 15, 'o': 16, 'А': 17, 'э': 18, 'о': 19, 'р': 20, 'у': 21, 'и': 22, 'й': 23, 'е': 24}

input_dim :  25
Found 24 unique tokens.

Shape of data tensor: (3, 30)
[[ 5  8  3  4  1  3  4  1  9  1  2 10 11  2  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0]
 [ 5  8  3  4  1  3  4  1 13 14  5  1 15 16  2  1  9  1  2 10 11  2  0  0  0  0  0  0  0  0]
 [17  1 18  6 19  1 20 21  7  7 12 22 23  1  6 24 12  7  6  0  0  0  0  0  0  0  0  0  0  0]]
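To make the mapping above concrete, here is a minimal pure-Python sketch of what `Tokenizer(char_level=True)` does in essence: rank characters by corpus frequency and assign indices starting at 1, reserving 0 for padding. (This is a simplified illustration, not the actual Keras implementation; the real Tokenizer also applies filters, optional lowercasing, and OOV handling.)

```python
from collections import Counter

texts = ['This is a text', 'This is NOT not a text', 'А это русский текст']

# Count character frequencies over the whole corpus.
counts = Counter()
for t in texts:
    counts.update(t)

# Assign indices by descending frequency; index 0 is reserved for padding,
# which is why input_dim is len(char_index) + 1.
char_index = {c: i + 1 for i, (c, _) in enumerate(counts.most_common())}
input_dim = len(char_index) + 1  # 25 for these three texts (24 distinct chars)

sequences = [[char_index[c] for c in t] for t in texts]
```

The space character is the most frequent, so it gets index 1, matching the `word_index` output above.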

So how should the first Input/Dense/Embedding layers be set up for this? What values should input_dim, output_dim, and input_length take?

Code (Python):
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

model = Sequential()
#Embedding?
#model.add(Embedding(input_dim, output_dim, input_length=max_length))
#model.add(Embedding([2,2], 30, ))#, input_length=max_review_length))

model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(data, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(data, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
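For reference, the values the commented-out Embedding line asks about can be read straight off the printed output above: input_dim is the vocabulary size plus one (index 0 is the padding value), output_dim is the embedding_vector_length chosen earlier, and input_length is max_review_length. A quick sketch of the arithmetic, with values copied from the output above:

```python
# Values from the tokenizer output above.
vocab_size = 24                # len(word_index): 24 unique characters
input_dim = vocab_size + 1     # 25; +1 because index 0 is reserved for padding
output_dim = 3                 # embedding_vector_length
input_length = 30              # max_review_length

# The Embedding layer holds one output_dim-sized vector per index,
# so its trainable weight count is:
embedding_params = input_dim * output_dim  # 75
```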

1 Answer

Stack Overflow user

Answered on 2018-06-01 12:21:28

You have variable-length, padded input. If you naively flatten it, the network will process the padding as if it were signal and may become biased toward sentence length: if one class tends to have shorter sentences, for example, the network can latch onto length itself as a feature.

Instead, consider handling the input with a recurrent network and masking:

Code (Python):
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
# mask_zero=True tells downstream layers to skip the 0 padding steps
model.add(Embedding(input_dim, output_dim, input_length=max_l, mask_zero=True))
model.add(LSTM(hidden_units))
model.add(Dense(1, activation='sigmoid'))
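The effect of mask_zero can be illustrated without Keras. In this toy sketch (hypothetical embedding values, and mean-pooling standing in for the LSTM for brevity), pooling over the padded steps contaminates the result, while pooling restricted by a mask never sees the padding at all:

```python
# Toy 2-d embedding table; index 0 is the padding index, but it still
# maps to a real row, which is exactly the problem masking solves.
emb = {0: [0.5, -0.5], 5: [1.0, 0.0], 8: [0.0, 1.0], 3: [1.0, 1.0]}

padded = [5, 8, 3, 0, 0]  # a length-3 sentence padded to length 5

def mean_pool(seq, masked):
    # With masked=True, padding steps (index 0) are skipped,
    # mimicking what mask_zero=True tells downstream layers to do.
    steps = [emb[i] for i in seq if not (masked and i == 0)]
    return [sum(v[d] for v in steps) / len(steps) for d in range(2)]

naive = mean_pool(padded, masked=False)   # padding leaks into the features
clean = mean_pool(padded, masked=True)    # identical to pooling only [5, 8, 3]
```

Here `naive` differs from `clean` purely because of the two padding steps, which is the length bias the answer describes.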
Votes: 0
Original content provided by Stack Overflow; translation supported by Tencent Cloud's IT-domain engine.
Original link:

https://stackoverflow.com/questions/-100004686
