Q: Text feature input format for classification algorithms in scikit-learn

Stack Overflow user

Asked 2012-08-24 02:00:51

Answers: 2, Views: 7.7K, Followers: 0, Votes: 5

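I'm starting to use scikit-learn to do some NLP. I have already used some classifiers from NLTK, and now I want to try the ones implemented in scikit-learn.

My data are basically sentences, and I extract features from some words of those sentences to do a classification task. Most of my features are nominal: part-of-speech (POS) tag of the word, word to the left, POS tag of the word to the left, word to the right, POS tag of the word to the right, syntactic relation path from one word to another, and so on.

When I ran some experiments with the NLTK classifiers (decision tree, naive Bayes), the feature set was just a dictionary mapping each feature to its nominal value. For example: {"postag": "noun", "path": "house", "path": "VPNPNP", ...}, ... I just had to pass this to the classifiers and they did their job.

This is part of the code used: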

# NLTK's decision tree classifier (the one exposing a .train() classmethod)
from nltk.classify import DecisionTreeClassifier

def train_classifier(self):
    if self.reader is None:
        raise ValueError("No reader was provided for accessing training instances.")

    # Get the argument candidates
    argcands = self.get_argcands(self.reader)

    # Extract the necessary features from the argument candidates,
    # collapsing every non-NULL label into a single "ARG" class
    training_argcands = []
    for argcand in argcands:
        label = "NULL" if argcand["info"]["label"] == "NULL" else "ARG"
        training_argcands.append((self.extract_features(argcand), label))

    # Train the appropriate supervised model
    self.classifier = DecisionTreeClassifier.train(training_argcands)

Here is an example of one extracted feature set:

[({'phrase': u'np', 'punct_right': 'NULL', 'phrase_left-sibling': 'NULL', 'subcat': 'fcl=np np vp np pu', 'pred_lemma': u'revelar', 'phrase_right-sibling': u'np', 'partial_path': 'vp fcl', 'first_word-postag': 'Bras\xc3\xadlia PROP', 'last_word-postag': 'Bras\xc3\xadlia PROP', 'phrase_parent': u'fcl', 'pred_context_right': u'um', 'pred_form': u'revela', 'punct_left': 'NULL', 'path': 'vp\xc2\xa1fcl!np', 'position': 0, 'pred_context_left_postag': u'ADV', 'voice': 0, 'pred_context_right_postag': u'ART', 'pred_context_left': u'hoje'}, 'NULL')]

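As I mentioned before, most of the features are nominal (string values).

Now I want to try out the classifiers in the scikit-learn package. As far as I understand, this kind of feature set is not acceptable to the algorithms implemented in sklearn, since all feature values must be numeric and they must be in an array or matrix. I therefore converted the "original" feature sets using the DictVectorizer class. However, when I pass the transformed vectors I get the following errors: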

# With DecisionTreeClass
Traceback (most recent call last): 
.....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/tree/tree.py", line 458, in fit
    X = np.asarray(X, dtype=DTYPE, order='F')
  File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
    return array(a, dtype, copy=False, order=order)
TypeError: float() argument must be a string or a number


# With GaussianNB

Traceback (most recent call last):
....
self.classifier.fit(train_argcands_feats,new_train_argcands_target)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 156, in fit
    n_samples, n_features = X.shape
ValueError: need more than 0 values to unpack

I get these errors when I use plain DictVectorizer(). However, if I use DictVectorizer(sparse=False), I get an error even before the code reaches the training part:

Traceback (most recent call last):
train_argcands_feats = self.feat_vectorizer.fit_transform(train_argcands_feats)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 123, in fit_transform
    return self.transform(X)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/dict_vectorizer.py", line 212, in transform
    Xa = np.zeros((len(X), len(vocab)), dtype=dtype)
ValueError: array is too big.

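Given this error, it is clear that a sparse representation has to be used.

So the question is: how can I convert my nominal features so that I can use the classification algorithms provided by scikit-learn?

Thanks in advance for any help you can give me.

UPDATE

As suggested in one of the answers below, I tried the NLTK wrapper for scikit-learn. I only changed the line of code that creates the classifier: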

self.classifier = SklearnClassifier(DecisionTreeClassifier())

Then, when I call the "train" method, I get the following:

File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 100, in train
    X = self._convert(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 109, in _convert
    return self._featuresets_to_coo(featuresets)
  File "/usr/local/lib/python2.7/dist-packages/nltk/classify/scikitlearn.py", line 126, in _featuresets_to_coo
    values.append(self._dtype(v))
ValueError: could not convert string to float: np

So, apparently, the wrapper cannot create the sparse matrix because the features are nominal. That puts me back at the original problem.

feature-engineering

python

scikit-learn

classification

text-processing

Stack Overflow user

Accepted answer

Answered 2012-08-24 08:38:08

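"ValueError: array is too big." is quite explicit: you cannot allocate a dense array data structure of shape (n_samples, n_features) in memory. Storing that many zeros in a contiguous chunk of memory is useless (and, in your case, impossible). Use a sparse data structure instead, as in the DictVectorizer documentation.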
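As a rough illustration of the sparse route, here is a minimal sketch. The toy feature sets are made up (loosely modelled on the example in the question), and MultinomialNB is only one arbitrary choice of estimator that accepts sparse input; neither comes from the original code. Note that scikit-learn wants the feature dicts and the labels as two separate sequences, not as NLTK-style (dict, label) pairs.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical NLTK-style training data: (feature dict, label) pairs.
train_argcands = [
    ({"phrase": "np", "pred_lemma": "revelar", "position": 0}, "NULL"),
    ({"phrase": "vp", "pred_lemma": "revelar", "position": 1}, "ARG"),
    ({"phrase": "np", "pred_lemma": "dizer", "position": 1}, "ARG"),
    ({"phrase": "pp", "pred_lemma": "dizer", "position": 0}, "NULL"),
]

# Split the pairs: scikit-learn takes features and labels separately.
feature_dicts = [feats for feats, _ in train_argcands]
labels = [label for _, label in train_argcands]

# String values are one-hot encoded as "name=value" columns;
# numeric values are kept as a single column. Output is sparse by default.
vectorizer = DictVectorizer()
X = vectorizer.fit_transform(feature_dicts)

# Assumption: an estimator that accepts scipy.sparse input.
clf = MultinomialNB()
clf.fit(X, labels)

# New instances must go through the same fitted vectorizer.
new_X = vectorizer.transform(
    [{"phrase": "vp", "pred_lemma": "dizer", "position": 1}]
)
prediction = clf.predict(new_X)[0]
print(X.shape, prediction)
```

Only the entries that are actually present in each dict are stored, so the matrix stays small even when the number of distinct "name=value" columns is very large.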

Also, if you like NLTK, you can use its scikit-learn integration instead of using scikit-learn's DictVectorizer directly.

modules/nltk/classify/sckitlearn.html

Have a look at the end of the file.
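For reference, using the wrapper could look roughly like the sketch below. It assumes a recent NLTK release, in which SklearnClassifier vectorizes the feature dicts internally and so accepts string values; the version in the question's traceback converted every value with float() instead. The toy data and the choice of BernoulliNB are illustrative assumptions, not taken from the original code.

```python
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.naive_bayes import BernoulliNB

# Hypothetical NLTK-style training data: (feature dict, label) pairs.
train_data = [
    ({"phrase": "np", "pred_lemma": "revelar"}, "NULL"),
    ({"phrase": "vp", "pred_lemma": "revelar"}, "ARG"),
    ({"phrase": "np", "pred_lemma": "dizer"}, "ARG"),
    ({"phrase": "pp", "pred_lemma": "dizer"}, "NULL"),
]

# Wrap a scikit-learn estimator so it exposes the NLTK classifier interface.
classifier = SklearnClassifier(BernoulliNB())
classifier.train(train_data)

# classify() takes a single feature dict, as with other NLTK classifiers.
predicted = classifier.classify({"phrase": "vp", "pred_lemma": "dizer"})
print(predicted)
```

The rest of an NLTK-based pipeline can stay as it is, since the wrapper is trained on the same (dict, label) pairs as the native NLTK classifiers.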

Votes: 4


Original page content provided by Stack Overflow.

Original link:

https://stackoverflow.com/questions/12102320
