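This is the code I use to run my NLTK program:
import re
import nltk  # Import the Natural Language Toolkit
from nltk.corpus import PlaintextCorpusReader, stopwords  # Corpus reader and stop word lists
from nltk import word_tokenize, pos_tag, FreqDist
from prettytable import PrettyTable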
# PARTS of SPEECH Lookup
POSTAGS = {
    'CC': 'conjunction',
    'CD': 'CardinalNumber',
    'DT': 'Determiner',
    'EX': 'ExistentialThere',
    'FW': 'ForeignWord',
    'IN': 'Preposition',
    'JJ': 'Adjective',
    'JJR': 'AdjectiveComparative',
    'JJS': 'AdjectiveSuperlative',
    'LS': 'ListItem',
    'MD': 'Modal',
    'NN': 'Noun',
    'NNS': 'NounPlural',
    'NNP': 'ProperNounSingular',
    'NNPS': 'ProperNounPlural',
    'PDT': 'Predeterminer',
    'POS': 'PossessiveEnding',
    'PRP': 'PersonalPronoun',
    'PRP$': 'PossessivePronoun',
    'RB': 'Adverb',
    'RBR': 'AdverbComparative',
    'RBS': 'AdverbSuperlative',
    'RP': 'Particle',
    'SYM': 'Symbol',
    'TO': 'to',
    'UH': 'Interjection',
    'VB': 'Verb',
    'VBD': 'VerbPastTense',
    'VBG': 'VerbPresentParticiple',
    'VBN': 'VerbPastParticiple',
    'VBP': 'VerbNon3rdPersonSingularPresent',
    'VBZ': 'Verb3rdPersonSingularPresent',
    'WDT': 'WhDeterminer',
    'WP': 'WhPronoun',
    'WP$': 'PossessiveWhPronoun',
    'WRB': 'WhAdverb'
}
# Read all contents of the corpus
stopWords = set(stopwords.words('english'))
Corpus = PlaintextCorpusReader('./CORPUS', '.*')
rawText = Corpus.raw()
rawText = re.sub("[^a-zA-Z' ]", ' ', rawText)
# Extract tokens from the raw text
tokens = nltk.word_tokenize(rawText)
filteredTokens = [w for w in tokens if w not in stopWords]
TextCorpus = nltk.Text(filteredTokens)
print("Compiling Vocabulary Frequencies")
print(TextCorpus.vocab())
# Take sampling of the parts of speech found
posTagged = pos_tag(filteredTokens[0:1000])
tblTags = PrettyTable(['Token', 'Part-of-Speech'])
for taggedToken in posTagged:
    tblTags.add_row([taggedToken[0], taggedToken[1]])
print(tblTags.get_string())
This code produces the following:
+-----------------+----------------+
|      Token      | Part-of-Speech |
+-----------------+----------------+
|       LOS       |      NNP       |
|     ANGELES     |      NNP       |
|    CALIFORNIA   |      NNP       |
|    WEDNESDAY    |      NNP       |
|     JANUARY     |      NNP       |
|        A        |      NNP       |
|        M        |      NNP       |
|    DEPARTMENT   |      NNP       |
|        NO       |      NNP       |
|       HON       |      NNP       |
|      LANCE      |      NNP       |
|        A        |      NNP       |
But I want it to look like this, with each word under the correct column. When I call .add_row, I can't get each word to land in the correct column.
+------+-----------+------+------------+--------------------+------------------+------+
| Word | Adjective | Noun | NounPlural | ProperNounSingular | ProperNounPlural | Verb |
+------+-----------+------+------------+--------------------+------------------+------+
Posted on 2021-05-08 14:59:21
You can try something like this, which builds a DataFrame with one column per part of speech.
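import pandas as pd

# Build one dict per token: the word goes under the column named after its tag.
# (DataFrame.append was removed in pandas 2.0, so the rows are collected first.)
rows = [{POSTAGS.get(tag): word} for word, tag in posTagged]
my_df = pd.DataFrame(rows, columns=list(POSTAGS.values()))
# Replace the empty cells (NaN) with a blank so the table prints cleanly
my_df = my_df.fillna(" ")
my_df.head(2)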
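If you would rather stay with PrettyTable, the same idea can be sketched without pandas: build every row with one cell per column and leave the non-matching cells blank. The snippet below is only an illustration under a few assumptions: `build_rows` is a hypothetical helper (not part of NLTK or PrettyTable), the shortened `POSTAGS` and the `sample` list are made-up stand-ins for the real lookup and the `pos_tag` output, and putting the word itself under its tag column mirrors the DataFrame approach above.

```python
# Shortened stand-in for the POSTAGS lookup from the question (assumption).
POSTAGS = {'JJ': 'Adjective', 'NN': 'Noun', 'NNS': 'NounPlural',
           'NNP': 'ProperNounSingular', 'NNPS': 'ProperNounPlural',
           'VB': 'Verb'}


def build_rows(tagged_tokens, tag_names):
    """Hypothetical helper: return (header, rows), one row per token."""
    # Unique column labels, kept in the order they appear in the lookup
    columns = list(dict.fromkeys(tag_names.values()))
    header = ['Word'] + columns
    rows = []
    for word, tag in tagged_tokens:
        label = tag_names.get(tag)  # None if the tag is not in the lookup
        # Blank cell everywhere except under the matching column
        rows.append([word] + [word if col == label else '' for col in columns])
    return header, rows


# Made-up sample shaped like pos_tag() output: a list of (word, tag) pairs.
sample = [('LOS', 'NNP'), ('quick', 'JJ'), ('dogs', 'NNS'), ('run', 'VB')]
header, rows = build_rows(sample, POSTAGS)

# Plain-text preview; every row has exactly len(header) cells.
for row in [header] + rows:
    print(' | '.join(cell.ljust(18) for cell in row))
```

With PrettyTable, `header` would go to `PrettyTable(header)` and each row to `tblTags.add_row(row)`. `add_row` expects exactly one value per column, which is why the blank cells are filled in rather than omitted.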
https://stackoverflow.com/questions/67448297