spaCy word embeddings for sentences

Data Science user
Asked 2019-09-15 10:41:32
Answers: 2 · Views: 1.9K · Followers: 0 · Votes: 3

spaCy provides pretrained word vectors. However, I noticed that you can also get a vector for a whole sentence:

spacy_nlp('hello I').has_vector == True

However, I don't know how it computes that vector from the word2vec vectors of the words in the sentence. I tried:

spacy_nlp('hello I').vector == spacy_nlp('hello').vector + spacy_nlp('I').vector

False

spacy_nlp('hello I').vector/spacy_nlp('hello I').vector_norm == spacy_nlp('hello').vector/spacy_nlp('hello').vector_norm + spacy_nlp('I').vector/spacy_nlp('I').vector_norm

False

I can't seem to find, or work out, how spaCy computes the w2v vector for a sentence.

a =spacy_nlp('hello').vector
a

array([ 2.1919045 , -1.3554063 , -2.0530818 , -1.4123821 ,  0.73116064,
       -0.24243775, -1.238019  , -1.038872  , -3.8119905 ,  0.3023836 ,
        2.0082908 , -0.4146578 ,  0.52871764, -4.171281  , -4.014127  ,
        3.5551465 ,  3.5740273 ,  0.5369273 , -0.92361224,  1.4550962 ,
        2.1736908 , -0.05514041,  0.02151388, -2.1722403 ,  0.81322104,
        3.5877275 , -1.0136521 ,  4.6003613 , -0.19145766,  5.403145  ,
       -1.9958102 ,  0.80248785, -2.3566568 ,  2.15387   ,  0.26684093,
        1.8178961 ,  3.594517  , -2.9950802 ,  2.5587099 , -5.6746616 ,
       -3.7259517 ,  4.0144114 , -1.4814405 ,  1.5888698 , -0.2371515 ,
        0.5498152 ,  0.9527153 , -4.1197095 , -4.252441  , -0.36907774,
       -4.510469  ,  1.2669985 , -0.91693896, -3.0032263 , -4.037157  ,
       -1.986922  ,  1.8322158 , -0.9520336 , -2.6739838 ,  0.368276  ,
        0.5881702 ,  1.4819605 ,  2.1771026 ,  0.20011072, -0.20952749,
       -1.7966032 ,  4.412916  , -0.8781664 ,  3.0670204 ,  3.92986   ,
       -0.7381511 , -0.07432494, -3.6973615 , -3.546731  ,  1.6010978 ,
       -4.0834403 ,  1.7816883 ,  0.8037724 ,  0.40344352, -1.2090104 ,
       -3.3253288 ,  4.6769385 ,  1.3193885 , -1.1775286 , -1.2436512 ,
       -0.29471165,  1.9998071 ,  1.1338542 ,  5.747326  , -0.10331005,
        1.6050186 ,  2.6961374 , -1.9422164 , -3.0807574 , -1.1481779 ,
        7.1367517 ], dtype=float32)

b =spacy_nlp('I').vector
b

array([ 1.9940598e+00, -2.7776110e+00,  8.4717870e-01, -2.1956882e+00,
       -1.6103275e+00,  1.2993972e-01,  8.3826280e-01,  8.7950850e-01,
       -3.5490465e+00,  4.4254961e+00, -1.4894485e+00,  4.4692218e-01,
       -6.0040636e+00,  3.4809113e-01,  7.5852954e-01, -5.0149399e-01,
       -1.9669157e+00,  8.8114321e-01,  5.3964740e-01,  1.6436796e+00,
       -4.3819084e+00,  7.1328688e-01, -8.9688343e-01, -1.2563754e+00,
       -2.6987386e-01,  3.3273227e+00,  7.1929336e-01,  1.2008041e-01,
        2.8758078e+00, -8.6590099e-01,  5.6435466e-01, -5.4331255e-01,
       -3.3853512e+00, -2.0917976e+00, -1.1649452e+00,  8.6632729e+00,
        9.1355121e-01, -3.9117950e-01, -6.3341379e-01, -3.4170332e+00,
        3.2871642e+00,  4.5229197e-03, -4.0161700e+00,  2.6399128e+00,
       -2.4242992e+00, -1.2012237e-01, -1.1977488e-01, -1.6422987e-01,
        7.7170479e-01, -1.5015860e+00, -3.0203837e-01,  1.9385589e+00,
       -2.9229348e+00, -2.8134599e+00, -6.1340892e-01, -2.5029099e+00,
       -6.6817325e-01, -8.4735197e-01,  4.2243872e+00,  2.8358276e+00,
       -2.7096636e+00,  6.3791027e+00,  1.3461562e+00, -3.9387980e+00,
        1.0648534e+00,  5.3636909e-01,  4.1285772e+00, -2.8879738e+00,
        1.3546917e+00, -1.9005369e+00, -3.7411542e+00, -4.8598945e-02,
       -1.4411114e+00,  1.3436056e+00,  1.1946709e+00,  2.3972931e+00,
        2.1032238e+00,  1.8248746e+00, -2.1880054e+00, -1.4601905e+00,
       -1.9771397e+00,  9.3115008e-01, -3.7088573e+00, -4.9041757e-01,
        1.0846795e+00,  2.2863836e+00,  3.5038524e+00,  1.0964345e+00,
        3.6875091e+00, -1.6266774e+00,  1.4012933e-02,  2.7396250e+00,
        3.9477596e+00, -3.5737205e+00,  3.1862993e+00,  2.2955155e+00],
      dtype=float32)

c =spacy_nlp('hello I').vector
c

array([ 2.4846857 , -1.9697192 , -0.09456831, -1.5198507 , -1.6889997 ,
       -0.7867774 , -1.1812011 ,  0.01011622, -2.9120972 ,  3.59254   ,
        1.3454058 , -0.305678  , -2.1474035 , -3.110804  , -0.6446719 ,
        1.9236953 ,  0.88007987,  0.4077559 ,  0.27990723,  0.36027157,
        1.214731  , -0.27636862,  0.33037317, -1.4009418 , -1.7570219 ,
        2.0057924 ,  0.1711272 ,  0.65295005, -0.6732832 ,  1.5165039 ,
       -1.8387947 , -0.49002886, -2.529176  ,  1.0543746 ,  0.13975173,
        6.3513803 ,  3.1074045 , -1.8838222 ,  1.707653  , -3.5569887 ,
        0.02888358,  1.4662569 , -1.4711913 ,  1.6238092 , -0.996526  ,
        0.29157495,  0.7459268 , -2.6089895 , -1.4595604 , -1.6607146 ,
       -1.9626031 ,  0.0429309 , -2.2927856 , -2.7657444 , -2.2093186 ,
       -1.8635755 ,  1.1076405 , -0.87808686, -0.8882728 , -0.20140225,
       -0.14074779,  1.5494955 ,  2.2195954 , -0.8879056 ,  0.16175044,
       -0.47926584,  6.069929  , -2.2804523 ,  1.389133  ,  2.3614829 ,
       -1.6746982 , -0.65907   , -0.88322634, -0.35415757,  1.2424103 ,
       -1.3832704 ,  1.74179   ,  2.0219522 , -0.3940425 , -1.076731  ,
       -3.0649443 ,  2.6106696 , -0.03948617,  0.03465301,  0.6218431 ,
        0.8250919 ,  1.7428303 ,  0.8449378 ,  3.0572054 ,  0.29650444,
        0.4229828 ,  0.38575757,  0.20896101, -0.91772854,  0.3865456 ,
        4.248111  ], dtype=float32)

2 Answers

Data Science user

Answered 2019-09-15 15:29:23

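To build the sentence embedding, spaCy simply averages the word embeddings. I don't have access to spaCy right now, otherwise I would demonstrate it, but you can try: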

spacy_nlp('hello I').vector == (spacy_nlp('hello').vector + spacy_nlp('I').vector) / 2

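If this also gives False, that is because floating-point values may not be exactly equal after the computation. Print the two vectors separately and you will see that they are very close.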
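
The comparison issue can be sketched with NumPy alone. The vectors below are made-up 4-dimensional stand-ins, not real spaCy output, and `doc_vec` only imitates the averaging that the answer describes. The point is that `==` on arrays compares element by element, so a tolerance-based check such as `np.allclose` is usually the safer test:

```python
import numpy as np

# Made-up 4-dimensional "word vectors", for illustration only.
hello = np.array([0.1, 0.2, 0.3, 0.4], dtype=np.float32)
i_vec = np.array([0.7, -0.2, 0.1, 0.3], dtype=np.float32)

# Stand-in for the sentence vector: the mean over the token vectors.
doc_vec = np.stack([hello, i_vec]).mean(axis=0)

# The same average written the way the answer suggests.
manual = (hello + i_vec) / 2

# `==` on NumPy arrays is elementwise: the result is an array of booleans,
# one per dimension, and exact float equality can be brittle.
exact = (doc_vec == manual)
print(exact.shape)

# A tolerance-based comparison gives a single, more robust answer.
print(np.allclose(doc_vec, manual, atol=1e-6))

# The plain sum tried in the question is not the mean.
print(np.allclose(doc_vec, hello + i_vec))
```

With real spaCy vectors the same check would be `np.allclose(doc.vector, manual)`, with a tolerance chosen to suit the vector scale.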

Votes: 1

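Data Science user

Answered 2023-02-16 19:21:43

The answer: it is the mean of the embeddings of the parsed tokens. Note that the tokenizer may be customized, or may split the text differently from what you expect.

Here is a complete example: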

import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")
txt= 'ChatGPT could automatically compose comments submitted in regulatory processes. It could write letters to the editor for publication in local newspapers. It could comment on news articles, blog entries and social media posts millions of times every day." https://t.co/rXL8WgJ4hV'

vec1 = nlp(txt).vector
vec2 = np.array([t.vector for t in nlp(txt)]).mean(0)

np.testing.assert_almost_equal(vec1,vec2)
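
The same check can be tried without downloading a model. This is a minimal sketch that assumes a blank English pipeline (`spacy.blank("en")`) and two made-up 3-dimensional vectors registered with `Vocab.set_vector`; neither comes from the original posts. It also prints the tokens, since the average is taken over whatever the tokenizer produces:

```python
import numpy as np
import spacy

# Blank English pipeline: tokenizer only, no trained components or vectors.
nlp = spacy.blank("en")

# Made-up 3-dimensional vectors, for illustration only.
nlp.vocab.set_vector("hello", np.array([1.0, 2.0, 3.0], dtype="float32"))
nlp.vocab.set_vector("I", np.array([3.0, 0.0, -1.0], dtype="float32"))

doc = nlp("hello I")

# Inspect how the text was actually split before averaging.
tokens = [t.text for t in doc]
print(tokens)

# Average the vectors of the tokens of this same Doc.
token_mean = np.array([t.vector for t in doc]).mean(axis=0)

# Compare with a tolerance rather than with `==`.
print(np.allclose(doc.vector, token_mean))
```

One caveat, offered as a guess: the arrays printed in the question have 96 values, which hints at a small pipeline without static word vectors. In such pipelines token vectors may depend on context, so averaging the vectors of separately parsed texts might not reproduce the sentence vector, while averaging over the tokens of the same `Doc` should.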
Votes: 0
Original page content provided by Data Science. Translation support provided by the Tencent Cloud Xiaowei IT-domain engine.
Original link:

https://datascience.stackexchange.com/questions/60222
