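fastTextR reads some Spanish words incorrectly in R (for example, "participaciÃ³n" instead of "participación"). I downloaded the pretrained model "cc.es.300.bin" from their website (https://fasttext.cc/docs/en/crawl-vectors.html).
I think the problem is that, when I load the model, I cannot tell R that the encoding should be "UTF-8" rather than "Latin1" or something else. That is, I can load the Spanish model and get the wrong words like this:
model <- ft_load("cc.es.300.bin")
But I cannot do this:
model <- ft_load("cc.es.300.bin", encoding="UTF-8")
as is possible with xlsx files, for example:
model <- xlsx::read.xlsx("file.xlsx", sheetIndex = 1, encoding="UTF-8")
I have tried: changing the language and encoding in Windows; reopening and saving the .R file with UTF-8 encoding; changing the locale to Spanish with Sys.setlocale("LC_ALL", "Spanish"). Nothing works.
Any help would be greatly appreciated. Best regards,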
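For readers unfamiliar with this kind of garbling, here is a minimal base R sketch of how UTF-8 bytes turn into text like "participaciÃ³n" when they are treated as Latin-1. It does not use fastTextR at all. The variable names are illustrative, and it is only an assumption that the same mechanism is at work inside the package.

```r
# The correct word, written with a Unicode escape so the script is encoding-safe.
word <- "participaci\u00f3n"

# Its UTF-8 bytes; the accented "o" becomes the two bytes c3 b3.
utf8_bytes <- charToRaw(enc2utf8(word))

# Reinterpret the same bytes as Latin-1, which is what a wrong
# encoding assumption effectively does.
misread <- rawToChar(utf8_bytes)
Encoding(misread) <- "latin1"

# Converting the mislabelled string to UTF-8 makes the damage visible.
garbled <- enc2utf8(misread)
print(garbled)

# Relabelling the original bytes as UTF-8 recovers the word
# without changing a single byte.
repaired <- rawToChar(utf8_bytes)
Encoding(repaired) <- "UTF-8"
print(repaired)
```

The point of the sketch is that the bytes are never wrong, only the label attached to them, which is why changing the locale or re-saving the script may not help.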
Posted on 2021-01-29 05:07:35
The "readr" library helped me.
install.packages("readr")
library(readr)
guess_encoding(ft_words(model))
| | 0%
# A tibble: 2 x 2
encoding confidence
<chr> <dbl>
1 UTF-8 1
2 Shift_JIS 0.31
parse_character(ft_words(model), locale=locale(encoding="UTF-8"))
[1] "de" "," "." "la" "y"
[6] "en" "que" "el" "</s>" "a"
[11] "los" ":" "\"" "del" "un"
[16] ")" "se" "con" "por" "las"
[21] "(" "para" "una" "es" "no"
[26] "su" "al" "como" "lo" "/"
[31] "más" "El" "o" "'" "La"
[36] "!" "|" "?" "me" "En"
[41] "..." "-" "sus" "este" "pero"
[46] "ha" "esta" ";" "“" "_"
[51] "”" "si" "sobre" "¿" "fue"
[56] "son" "le" "muy" "ser" "ya"
[61] "tu" "todo" "1" "entre" "te"
[66] "mi" "Los" "%" "sin" "también"
... instead of
[1] "de" "," "." "la"
[5] "y" "en" "que" "el"
[9] "</s>" "a" "los" ":"
[13] "\"" "del" "un" ")"
[17] "se" "con" "por" "las"
[21] "(" "para" "una" "es"
[25] "no" "su" "al" "como"
[29] "lo" "/" "más" "El"
[33] "o" "'" "La" "!"
[37] "|" "?" "me" "En"
[41] "..." "-" "sus" "este"
[45] "pero" "ha" "esta" ";"
[49] "“" "_" "â€\u009d" "si"
[53] "sobre" "¿" "fue" "son"
[57] "le" "muy" "ser" "ya"
However, when I use the function to get the nearest neighbors, it does not seem to help:
parse_character(ft_nearest_neighbors(model, "pera", k = 10L), locale=locale(encoding="UTF-8"))
Error in parse_vector(x, col_character(), na = na, locale = locale, trim_ws = trim_ws) :
is.character(x) is not TRUE
But (note "piÃ±a" instead of "piña"):
ft_nearest_neighbors(model, "pera", k = 10L)
limonera ciruela manzana mandarina piÃ±a fruta sandÃa compota sandia fresa
0.6326169 0.6112964 0.6079050 0.5713655 0.5707002 0.5576053 0.5557024 0.5526152 0.5485740 0.5437940
Now, what does help is enc2utf8 (although the characters in the output still look odd):
ft_nearest_neighbors(model,enc2utf8("piña"), k = 10L)
sandÃa papaya sandia ananá plátano ananás fruta limón mandarina maracuyá
0.6763531 0.6571828 0.6365163 0.6341625 0.6205474 0.6205293 0.6137358 0.6037553 0.6032383 0.5941805
enc2utf8 also helps if you want to get individual word vectors:
piña <- as.vector(ft_word_vectors(model, enc2utf8("piña")))
https://stackoverflow.com/questions/65940490
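A possible follow-up for the odd-looking neighbor names: the result of ft_nearest_neighbors() appears to be a named numeric vector, which would explain why parse_character() rejects it, since the words sit in names() rather than in the values. The sketch below relabels those names as UTF-8 in base R. Here `nn` is a hand-built stand-in for the real output and `fix_names` is a hypothetical helper. It also assumes the package hands back UTF-8 bytes without an encoding mark, which has not been verified against fastTextR.

```r
# Stand-in for ft_nearest_neighbors() output: a named numeric vector whose
# names hold UTF-8 bytes with no declared encoding (an assumption).
nn <- c(0.5707002, 0.5557024)
names(nn) <- c(
  rawToChar(as.raw(c(0x70, 0x69, 0xc3, 0xb1, 0x61))),             # bytes of "piña"
  rawToChar(as.raw(c(0x73, 0x61, 0x6e, 0x64, 0xc3, 0xad, 0x61)))  # bytes of "sandía"
)

# Hypothetical helper: declare the names to be UTF-8 without touching
# the bytes or the similarity scores.
fix_names <- function(x) {
  nms <- names(x)
  Encoding(nms) <- "UTF-8"
  names(x) <- nms
  x
}

fixed <- fix_names(nn)
print(fixed)
```

With a loaded model, the same helper could in principle be applied as fix_names(ft_nearest_neighbors(model, enc2utf8("piña"), k = 10L)), though that call is untested here.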