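I created the Pig script below to filter sentences from a set of web documents (Common Crawl) that mention a movie title (taken from a predefined data file of movie titles), apply sentiment analysis to those sentences, and group the resulting sentiments by movie.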
register ../commoncrawl-examples/lib/*.jar;
set mapred.task.timeout= 1000;
register ../commoncrawl-examples/dist/lib/commoncrawl-examples-1.0.1-HM.jar;
register ../dist/lib/movierankings-1.jar
register ../lib/piggybank.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1.jar;
register ../lib/stanford-corenlp-full-2014-01-04/stanford-corenlp-3.3.1-models.jar;
register ../lib/stanford-corenlp-full-2014-01-04/ejml-0.23.jar;
register ../lib/stanford-corenlp-full-2014-01-04/joda-time.jar;
register ../lib/stanford-corenlp-full-2014-01-04/jollyday.jar;
register ../lib/stanford-corenlp-full-2014-01-04/xom.jar;
DEFINE IsNotWord com.moviereviewsentimentrankings.IsNotWord;
DEFINE IsMovieDocument com.moviereviewsentimentrankings.IsMovieDocument;
DEFINE ToSentenceMoviePairs com.moviereviewsentimentrankings.ToSentenceMoviePairs;
DEFINE ToSentiment com.moviereviewsentimentrankings.ToSentiment;
DEFINE MoviesInDocument com.moviereviewsentimentrankings.MoviesInDocument;
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
-- LOAD pages, movies and words
pages = LOAD '../data/textData-*' USING SequenceFileLoader as (url:chararray, content:chararray);
movies_fltr_grp = LOAD '../data/movie_fltr_grp_2/part-*' as (group: chararray,movies_fltr: {(movie: chararray)});
-- FILTER pages containing movie
movie_pages = FILTER pages BY IsMovieDocument(content, movies_fltr_grp.movies_fltr);
-- SPLIT pages containing movie in sentences and create movie-sentence pairs
movie_sentences = FOREACH movie_pages GENERATE flatten(ToSentenceMoviePairs(content, movies_fltr_grp.movies_fltr)) as (content:chararray, movie:chararray);
-- Calculate sentiment for each movie-sentence pair
movie_sentiment = FOREACH movie_sentences GENERATE flatten(ToSentiment(movie, content)) as (movie:chararray, sentiment:int);
-- GROUP movie-sentiment pairs by movie
movie_sentiment_grp_tups = GROUP movie_sentiment BY movie;
-- Reformat and print movie-sentiment pairs
movie_sentiment_grp = FOREACH movie_sentiment_grp_tups GENERATE group, movie_sentiment.sentiment AS sentiments:{(sentiment: int)};
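describe movie_sentiment_grp;
A test run on a small subset of the web crawl showed that the script successfully gives me pairs of a movie title and a bag of integers (from 1 to 5, standing for very negative, negative, neutral, positive and very positive). As a final step, I would like to transform this data into pairs of a movie title and a bag of tuples, where each tuple holds one of the distinct integers that occur for that movie title together with its count. The describe movie_sentiment_grp at the end of the script returns:
movie_sentiment_grp: {group: chararray,sentiments: {(sentiment: int)}}
So basically, I probably need to FOREACH over each element of movie_sentiment_grp, GROUP the sentiments bag into groups of identical values, and then use the COUNT() function to get the number of elements in each group. However, I have not been able to find anything on how to group a bag of integers into groups of identical values. Does anyone know how to do this?
Dummy solution: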
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp{
    sentiments_grp = GROUP sentiments BY ?;
}
Posted on 2014-01-31 08:10:30
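Check out the CountEach UDF from Apache DataFu. Given a bag, it produces a new bag containing the distinct tuples, with the count appended to each corresponding tuple.
The example from the DataFu documentation shows the UDF in use: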
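To make that behaviour concrete, here is a minimal Python sketch of what a count-per-distinct-tuple operation computes. It only illustrates the semantics described above and is not DataFu's implementation; the function name `count_each` and the sample bag are made up for this example.

```python
from collections import Counter


def count_each(bag):
    """Return one tuple per distinct input tuple, with its count appended.

    Hypothetical helper for illustration only; not part of DataFu or Pig.
    """
    counts = Counter(bag)  # tuples are hashable, so they can be counted directly
    # Append the count to each distinct tuple, e.g. (3,) -> (3, 2).
    return [tup + (n,) for tup, n in counts.items()]


# Hypothetical sentiments bag for one movie: each element is a one-field tuple.
sentiments = [(3,), (5,), (3,), (1,), (5,), (3,)]
result = count_each(sentiments)
print(sorted(result))
```

Each distinct sentiment value ends up in exactly one output tuple, which is the shape the question asks for: a bag of (sentiment, count) pairs per movie.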
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
-- input:
-- ({(A),(A),(C),(B)})
input = LOAD 'input' AS (B: bag {T: tuple(alpha:CHARARRAY, numeric:INT)});
-- output_flatten:
-- ({(A,2),(C,1),(B,1)})
output_flatten = FOREACH input GENERATE CountEachFlatten(B);
In your case:
-- assumes the DataFu jar has already been registered with REGISTER
DEFINE CountEachFlatten datafu.pig.bags.CountEach('flatten');
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp GENERATE
    group,
    CountEachFlatten(sentiments);
Posted on 2014-01-29 18:59:09
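You are on the right track. movie_sentiment_grp is in the right format, and a nested FOREACH would be the right approach, except that you cannot use GROUP inside it. The solution is to use a UDF. Something like this: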
myudfs.py
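#!/usr/bin/python
@outputSchema('sentiments: {(sentiment:int, count:int)}')
def count_sentiments(BAG):
    # Count how often each sentiment value occurs in the bag.
    res = {}
    for s in BAG:
        value = s[0]  # each bag element arrives as a one-field tuple
        if value in res:
            res[value] += 1
        else:
            res[value] = 1
    # Return a list of (sentiment, count) tuples, which Pig reads as a bag.
    return res.items()
This UDF is used like this: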
Register 'myudfs.py' using jython as myfuncs;
movie_sentiment_grp_cnt = FOREACH movie_sentiment_grp
    GENERATE group, myfuncs.count_sentiments(sentiments);
https://stackoverflow.com/questions/21219704