spaCy provides pre-trained word vectors. However, I noticed that you can also get a vector for a whole sentence:
spacy_nlp('hello I').has_vector == True
However, I don't know how it computes the sentence vector from the individual word vectors. I tried:
spacy_nlp('hello I').vector == spacy_nlp('hello').vector + spacy_nlp('I').vector
which is False, and
spacy_nlp('hello I').vector/spacy_nlp('hello I').vector_norm == spacy_nlp('hello').vector/spacy_nlp('hello').vector_norm + spacy_nlp('I').vector/spacy_nlp('I').vector_norm
which is also False.
I can't seem to find out, or work out myself, how spaCy computes the word2vec of a sentence.
a = spacy_nlp('hello').vector
a
array([ 2.1919045 , -1.3554063 , -2.0530818 , -1.4123821 , 0.73116064,
-0.24243775, -1.238019 , -1.038872 , -3.8119905 , 0.3023836 ,
2.0082908 , -0.4146578 , 0.52871764, -4.171281 , -4.014127 ,
3.5551465 , 3.5740273 , 0.5369273 , -0.92361224, 1.4550962 ,
2.1736908 , -0.05514041, 0.02151388, -2.1722403 , 0.81322104,
3.5877275 , -1.0136521 , 4.6003613 , -0.19145766, 5.403145 ,
-1.9958102 , 0.80248785, -2.3566568 , 2.15387 , 0.26684093,
1.8178961 , 3.594517 , -2.9950802 , 2.5587099 , -5.6746616 ,
-3.7259517 , 4.0144114 , -1.4814405 , 1.5888698 , -0.2371515 ,
0.5498152 , 0.9527153 , -4.1197095 , -4.252441 , -0.36907774,
-4.510469 , 1.2669985 , -0.91693896, -3.0032263 , -4.037157 ,
-1.986922 , 1.8322158 , -0.9520336 , -2.6739838 , 0.368276 ,
0.5881702 , 1.4819605 , 2.1771026 , 0.20011072, -0.20952749,
-1.7966032 , 4.412916 , -0.8781664 , 3.0670204 , 3.92986 ,
-0.7381511 , -0.07432494, -3.6973615 , -3.546731 , 1.6010978 ,
-4.0834403 , 1.7816883 , 0.8037724 , 0.40344352, -1.2090104 ,
-3.3253288 , 4.6769385 , 1.3193885 , -1.1775286 , -1.2436512 ,
-0.29471165, 1.9998071 , 1.1338542 , 5.747326 , -0.10331005,
1.6050186 , 2.6961374 , -1.9422164 , -3.0807574 , -1.1481779 ,
7.1367517 ], dtype=float32)
b = spacy_nlp('I').vector
b
array([ 1.9940598e+00, -2.7776110e+00, 8.4717870e-01, -2.1956882e+00,
-1.6103275e+00, 1.2993972e-01, 8.3826280e-01, 8.7950850e-01,
-3.5490465e+00, 4.4254961e+00, -1.4894485e+00, 4.4692218e-01,
-6.0040636e+00, 3.4809113e-01, 7.5852954e-01, -5.0149399e-01,
-1.9669157e+00, 8.8114321e-01, 5.3964740e-01, 1.6436796e+00,
-4.3819084e+00, 7.1328688e-01, -8.9688343e-01, -1.2563754e+00,
-2.6987386e-01, 3.3273227e+00, 7.1929336e-01, 1.2008041e-01,
2.8758078e+00, -8.6590099e-01, 5.6435466e-01, -5.4331255e-01,
-3.3853512e+00, -2.0917976e+00, -1.1649452e+00, 8.6632729e+00,
9.1355121e-01, -3.9117950e-01, -6.3341379e-01, -3.4170332e+00,
3.2871642e+00, 4.5229197e-03, -4.0161700e+00, 2.6399128e+00,
-2.4242992e+00, -1.2012237e-01, -1.1977488e-01, -1.6422987e-01,
7.7170479e-01, -1.5015860e+00, -3.0203837e-01, 1.9385589e+00,
-2.9229348e+00, -2.8134599e+00, -6.1340892e-01, -2.5029099e+00,
-6.6817325e-01, -8.4735197e-01, 4.2243872e+00, 2.8358276e+00,
-2.7096636e+00, 6.3791027e+00, 1.3461562e+00, -3.9387980e+00,
1.0648534e+00, 5.3636909e-01, 4.1285772e+00, -2.8879738e+00,
1.3546917e+00, -1.9005369e+00, -3.7411542e+00, -4.8598945e-02,
-1.4411114e+00, 1.3436056e+00, 1.1946709e+00, 2.3972931e+00,
2.1032238e+00, 1.8248746e+00, -2.1880054e+00, -1.4601905e+00,
-1.9771397e+00, 9.3115008e-01, -3.7088573e+00, -4.9041757e-01,
1.0846795e+00, 2.2863836e+00, 3.5038524e+00, 1.0964345e+00,
3.6875091e+00, -1.6266774e+00, 1.4012933e-02, 2.7396250e+00,
3.9477596e+00, -3.5737205e+00, 3.1862993e+00, 2.2955155e+00],
dtype=float32)
c = spacy_nlp('hello I').vector
c
array([ 2.4846857 , -1.9697192 , -0.09456831, -1.5198507 , -1.6889997 ,
-0.7867774 , -1.1812011 , 0.01011622, -2.9120972 , 3.59254 ,
1.3454058 , -0.305678 , -2.1474035 , -3.110804 , -0.6446719 ,
1.9236953 , 0.88007987, 0.4077559 , 0.27990723, 0.36027157,
1.214731 , -0.27636862, 0.33037317, -1.4009418 , -1.7570219 ,
2.0057924 , 0.1711272 , 0.65295005, -0.6732832 , 1.5165039 ,
-1.8387947 , -0.49002886, -2.529176 , 1.0543746 , 0.13975173,
6.3513803 , 3.1074045 , -1.8838222 , 1.707653 , -3.5569887 ,
0.02888358, 1.4662569 , -1.4711913 , 1.6238092 , -0.996526 ,
0.29157495, 0.7459268 , -2.6089895 , -1.4595604 , -1.6607146 ,
-1.9626031 , 0.0429309 , -2.2927856 , -2.7657444 , -2.2093186 ,
-1.8635755 , 1.1076405 , -0.87808686, -0.8882728 , -0.20140225,
-0.14074779, 1.5494955 , 2.2195954 , -0.8879056 , 0.16175044,
-0.47926584, 6.069929 , -2.2804523 , 1.389133 , 2.3614829 ,
-1.6746982 , -0.65907 , -0.88322634, -0.35415757, 1.2424103 ,
-1.3832704 , 1.74179 , 2.0219522 , -0.3940425 , -1.076731 ,
-3.0649443 , 2.6106696 , -0.03948617, 0.03465301, 0.6218431 ,
0.8250919 , 1.7428303 , 0.8449378 , 3.0572054 , 0.29650444,
0.4229828 , 0.38575757, 0.20896101, -0.91772854, 0.3865456 ,
4.248111 ], dtype=float32)
Answered on 2019-09-15 15:29:23
To construct a sentence embedding, spaCy simply averages the word embeddings. I don't have access to spaCy right now, otherwise I would demonstrate it, but you can try:
spacy_nlp('hello I').vector == (spacy_nlp('hello').vector + spacy_nlp('I').vector) / 2
If this also gives False, it is because the floating-point values may not come out exactly equal after the computation. Print the two vectors separately and you will see that they are very close.
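As a concrete check (not from the original answer), the comparison can be done with np.allclose, which tolerates small floating-point differences; this sketch assumes a model that ships static word vectors, such as en_core_web_md:
import numpy as np
import spacy

spacy_nlp = spacy.load("en_core_web_md")  # assumption: a model with static word vectors

# An exact == comparison may fail because of floating-point rounding;
# np.allclose checks element-wise closeness within a tolerance instead.
sentence_vec = spacy_nlp('hello I').vector
word_mean = (spacy_nlp('hello').vector + spacy_nlp('I').vector) / 2
print(np.allclose(sentence_vec, word_mean))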
Answered on 2023-02-16 19:21:43
The answer is: the average of the embeddings of the parsed tokens. Note that the tokenizer may be customized, or may split the text differently from what you expect.
Here you can find a complete example:
import spacy
import numpy as np

nlp = spacy.load("en_core_web_md")
txt = 'ChatGPT could automatically compose comments submitted in regulatory processes. It could write letters to the editor for publication in local newspapers. It could comment on news articles, blog entries and social media posts millions of times every day." https://t.co/rXL8WgJ4hV'

# Document-level vector as computed by spaCy
vec1 = nlp(txt).vector
# Mean of the individual token vectors
vec2 = np.array([t.vector for t in nlp(txt)]).mean(0)
# The two agree up to floating-point precision
np.testing.assert_almost_equal(vec1, vec2)
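As a small follow-up (not part of the original answer), printing the tokens shows exactly which pieces the tokenizer produced, and therefore which vectors enter the average; this reuses the nlp and txt objects from the example above:
# Inspect the tokens that actually contribute to the document vector
print([t.text for t in nlp(txt)])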
https://datascience.stackexchange.com/questions/60222