A big-data course assignment: run a dataset through Hadoop. It starts with a few basic MapReduce exercises (Hive works for these too), and then moves on to computing TF-IDF. You have to download the dataset and set up the Hadoop environment yourself.
Tasks:
The TF-IDF algorithm measures the relative frequency of a word in a document compared to the overall frequency of that word across a collection of documents. This lets you discover the distinctive words for a particular user or document. The formulas are:

TF(t) = (number of times t appears in the document) / (total number of words in the document)
IDF(t) = log_e(total number of documents / number of documents containing t)

The TFIDF(t) score of a term t is the product of the two.
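As a sanity check before writing the Hadoop job, the two formulas can be sketched directly in plain Python (a minimal illustration only; the toy corpus and tokenized documents below are assumptions, not part of the assignment's dataset):

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by document length.
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency, using the natural logarithm as in the formula above.
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus for illustration.
corpus = [
    ["hadoop", "mapreduce", "hadoop"],
    ["spark", "mapreduce"],
    ["hive", "sql"],
]
score = tf_idf("hadoop", corpus[0], corpus)
```

Here "hadoop" occurs in 2 of 3 positions of the first document but in only one document of the corpus, so its score is (2/3) * log(3).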
> select top 50000 * from posts where posts.ViewCount > 1000000 ORDER BY posts.ViewCount
> select count(*) from posts where posts.ViewCount>15000 and posts.ViewCount < 20000
> select * from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000
Computing TF-IDF on Hadoop has a fairly high time cost: a lot of intermediate data has to be written to disk, and the computation cannot be expressed as a single MapReduce job. Switching to Spark should be considerably more efficient.
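The "more than one job" point can be illustrated by sketching the classic multi-pass MapReduce decomposition of TF-IDF in plain Python. This is a simulation of the job chain, not actual Hadoop code; the document IDs and tokenized input are hypothetical:

```python
import math
from collections import defaultdict

# Hypothetical pre-tokenized input: doc_id -> list of terms.
docs = {
    "d1": ["hadoop", "mapreduce", "hadoop"],
    "d2": ["spark", "mapreduce"],
    "d3": ["hive", "sql"],
}

# Job 1: count occurrences of each (term, doc) pair -- the word-count step.
term_doc_count = defaultdict(int)
for doc_id, tokens in docs.items():
    for t in tokens:
        term_doc_count[(t, doc_id)] += 1

# Job 2: total terms per document, which gives TF.
doc_len = defaultdict(int)
for (t, doc_id), n in term_doc_count.items():
    doc_len[doc_id] += n
tf = {(t, d): n / doc_len[d] for (t, d), n in term_doc_count.items()}

# Job 3: number of documents containing each term, which gives IDF, then TF-IDF.
docs_with_term = defaultdict(int)
for (t, d) in term_doc_count:
    docs_with_term[t] += 1
n_docs = len(docs)
tfidf = {(t, d): w * math.log(n_docs / docs_with_term[t])
         for (t, d), w in tf.items()}
```

Each "job" consumes the full output of the previous one, which is exactly the intermediate data that Hadoop would materialize to HDFS between jobs and that Spark could instead keep in memory.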
Statement of originality: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, please contact cloudcommunity@tencent.com for removal.