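I have a feature union that uses some custom transformers to select text and parts of a dataframe. I would like to know which features it is using.

The pipeline selects and transforms columns, then selects the k best. I can pull the selected features out of the k-best step with the following code:

mask = union.named_steps['select_features'].get_support()

However, I cannot apply this mask to the feature union output, because I am struggling to get back the names of the final transformed columns. I think I need to define a 'get_feature_names' function in the custom transformers (see related post).

The pipeline looks like this: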

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_selection import SelectKBest
from sklearn.neural_network import MLPClassifier

union = Pipeline([
    ('feature_union', FeatureUnion([
        ('pipeline_1', Pipeline([
            ('selector', TextSelector(key='notes_1')),
            ('vectorise', CountVectorizer())
        ])),
        ('pipeline_2', Pipeline([
            ('selector', TextSelector(key='notes_2')),
            ('vectorise', CountVectorizer())
        ])),
        ('pipeline_3', Pipeline([
            ('selector', TextSelector(key='notes_3')),
            ('vectorise', CountVectorizer())
        ])),
        ('pipeline_4', Pipeline([
            ('selector', TextSelector(key='notes_4')),
            ('vectorise', CountVectorizer())
        ])),
        ('tf-idf_pipeline', Pipeline([
            ('selector', TextSelector(key='notes_5')),
            ('Tf-idf', TfidfVectorizer())
        ])),
        ('categorical_pipeline', Pipeline([
            ('selector', DataFrameSelector(['area', 'type', 'age'], True)),
            ('one_hot_encoding', OneHotEncoder(handle_unknown='ignore'))
        ]))
    ], n_jobs=-1)),
    ('select_features', SelectKBest(k='all')),
    ('classifier', MLPClassifier())
])

The custom transformers are shown below. NB: I have tried including a 'get_feature_names' function in each transformer, but it does not work correctly:

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class TextSelector(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]

    def get_feature_names(self):
        return X[self.key].columns.tolist()

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names, factorize=False):
        self.attribute_names = attribute_names
        self.factorize = factorize

    def transform(self, X):
        selection = X[self.attribute_names]
        if self.factorize:
            selection = selection.apply(lambda p: pd.factorize(p)[0] + 1)
        return selection.values

    def fit(self, X, y=None):
        return self

    def get_feature_names(self):
        return X.columns.tolist()

Thanks for your help.
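
One likely reason the 'get_feature_names' methods above fail is that they refer to X, which is not a parameter of the method and is not stored on the instance, so Python raises a NameError when the method is called. A minimal sketch of the usual workaround, assuming the names can be recorded on self at fit time, is shown below. ColumnSelector is a hypothetical, dependency-free stand-in for DataFrameSelector (no scikit-learn base classes, and the input is a plain dict of column lists rather than a DataFrame); it is meant only to illustrate the scoping point.

```python
class ColumnSelector:
    """Hypothetical stand-in for DataFrameSelector, without scikit-learn."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        # Record the selected names now, while the configuration is known,
        # so they can be returned later without access to X.
        self.feature_names_ = list(self.attribute_names)
        return self

    def transform(self, X):
        # Assumption: X maps column name -> list of values.
        # Build one row per sample from the selected columns.
        columns = (X[name] for name in self.attribute_names)
        return [list(row) for row in zip(*columns)]

    def get_feature_names(self):
        # Only `self` is in scope here; there is no X to inspect.
        return self.feature_names_


# Hypothetical sample data reusing the column names from the question.
data = {"area": ["north", "south"], "type": ["a", "b"], "age": [31, 47]}
selector = ColumnSelector(["area", "age"]).fit(data)
rows = selector.transform(data)
names = selector.get_feature_names()
```

Note that in the pipeline above each selector is followed by a vectorizer or encoder, so the names of the final columns would come from that later step rather than from the selector itself.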
Answered 2018-10-03 14:27:58

This worked for me. It is very simple, as suggested:
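
class MyPipeline(Pipeline):
    def get_feature_names(self):
        # Return the feature names of the first vectorizer step found.
        for name, step in self.steps:
            # CountVectorizer is the base class of TfidfVectorizer,
            # so this check matches both kinds of vectorizer.
            if isinstance(step, CountVectorizer):
                return step.get_feature_names()

union = Pipeline([
    ('feature_union', FeatureUnion([
        ('pipeline_1', MyPipeline([
            ('selector', TextSelector(key='notes_1')),
            ('vectorise', CountVectorizer())
        ])),
        # ... remaining sub-pipelines as in the question
    ])),
    # ... remaining steps as in the question
])

Answered 2018-10-08 13:26:59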
If you know the name of the step (e.g. pipeline_1) and the name of the sub-step you want to inspect (e.g. vectorise), you can reference them directly by step and sub-step name:

fnames = (
    dict(union.named_steps['feature_union'].transformer_list)
    .get('pipeline_1')
    .named_steps['vectorise']
    .get_feature_names()
)

Answered 2019-03-12 21:28:58
So far, the best way I have found to get at nested features (thanks to edesz):

pipeline = Pipeline(steps=[
    ("union", FeatureUnion(
        transformer_list=[
            ("descriptor", Pipeline(steps=[
                ("selector", ItemSelector(column="Description")),
                ("tfidf", TfidfVectorizer(min_df=5, analyzer=u'word'))
            ]))
        ],...

pvect = dict(pipeline.named_steps['union'].transformer_list).get('descriptor').named_steps['tfidf']

Then pass the TfidfVectorizer() instance to another function:

Show_most_informative_features(pvect,
                               pipeline.named_steps['classifier'], n=MostIF)

https://stackoverflow.com/questions/48005889
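
To tie the answers back to the original question, here is a minimal sketch of applying the get_support() mask once the per-pipeline feature names have been collected as described in the answers above. The helper names union_feature_names and selected_feature_names and the sample vocabulary are hypothetical. The 'pipeline__feature' prefix mirrors how FeatureUnion labels its outputs, and the sketch assumes the names are listed in the same order as transformer_list so that they line up with the columns SelectKBest received.

```python
from itertools import compress


def union_feature_names(names_by_pipeline):
    """Concatenate per-pipeline names in order, prefixed as 'pipeline__feature'."""
    return [
        f"{pipe_name}__{feature}"
        for pipe_name, features in names_by_pipeline
        for feature in features
    ]


def selected_feature_names(all_names, mask):
    """Keep only the names whose mask entry is truthy."""
    if len(all_names) != len(mask):
        raise ValueError("mask and feature names must have the same length")
    return list(compress(all_names, mask))


# Hypothetical vocabularies, standing in for what each vectorizer reports.
names_by_pipeline = [
    ("pipeline_1", ["cough", "fever"]),
    ("pipeline_2", ["rash"]),
]
# Hypothetical mask, standing in for the output of get_support().
mask = [True, False, True]

all_names = union_feature_names(names_by_pipeline)
selected = selected_feature_names(all_names, mask)
```

With the real objects, mask would be the array returned by union.named_steps['select_features'].get_support(); it is a NumPy boolean array, which compress accepts directly.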