Hadoop Assignment Help and Programming Tutoring: CA675 TF-IDF

Original article
Author: 拓端
Published 2022-10-27 17:53:38
Included in the column: 拓端tecdat

Full text: tecdat.cn/?p=29680

Introduction

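This is a big data assignment: run a dataset through Hadoop. It starts with a few basic MapReduce questions, which can also be answered with Hive, and then moves on to computing TF-IDF. You have to download the dataset yourself and set up the Hadoop platform yourself as well.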

Requirement

Tasks:

Using MapReduce, carry out the following tasks:

  1. Acquire the top 250,000 posts by viewcount (see notes)
  2. Using Pig or MapReduce, extract, transform and load the data as applicable
  3. Using MapReduce, calculate the per-user TF-IDF (just submit the top 10 terms for each user)
  4. Bonus: use Elastic MapReduce to execute one or more of these tasks (if so, provide logs / screenshots)
  5. Using Hive and/or MapReduce, get:
    • The top 10 posts by score
    • The top 10 users by post score
    • The number of distinct users who used the word ‘java’ in one of their posts
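
As a rough illustration of what the three Hive/MapReduce queries are asked to return, the sketch below computes them in plain Python over a few in-memory rows. The field names `Id`, `OwnerUserId`, `Score` and `Body` are assumed column names for the downloaded posts, and the sample rows are made up. "Post score" for a user is read here as the sum of that user's post scores, which is only one possible interpretation of the task, and the word match is a simple case-insensitive whole-word match.

```python
import heapq
import re
from collections import defaultdict

# Hypothetical sample rows; real rows would come from the downloaded posts.
POSTS = [
    {"Id": 1, "OwnerUserId": 10, "Score": 50, "Body": "How do I sort a list in Java?"},
    {"Id": 2, "OwnerUserId": 11, "Score": 70, "Body": "Python list comprehension"},
    {"Id": 3, "OwnerUserId": 10, "Score": 30, "Body": "JavaScript closures explained"},
    {"Id": 4, "OwnerUserId": 12, "Score": 90, "Body": "java streams vs loops"},
]

def top_posts_by_score(posts, n=10):
    """Ids of the n highest-scoring posts, best first."""
    return [p["Id"] for p in heapq.nlargest(n, posts, key=lambda p: p["Score"])]

def top_users_by_score(posts, n=10):
    """(user, summed score) pairs for the n users with the highest total."""
    totals = defaultdict(int)
    for p in posts:
        totals[p["OwnerUserId"]] += p["Score"]
    return heapq.nlargest(n, totals.items(), key=lambda kv: kv[1])

def users_mentioning(posts, word="java"):
    """Number of distinct users with at least one post containing the word."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return len({p["OwnerUserId"] for p in posts if pattern.search(p["Body"])})
```

With a whole-word match, "JavaScript" does not count as a use of "java"; whether it should is a decision the assignment leaves open.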

Notes

TF-IDF

The TF-IDF algorithm is used to calculate the relative frequency of a word in a document, as compared to the overall frequency of that word in a collection of documents. This allows you to discover the distinctive words for a particular user or document. The formula is:

  TF(t) = Number of times t appears in the document / Number of words in the document
  IDF(t) = log_e(Total number of documents / Number of documents containing t)

The TFIDF(t) score of the term t is the product of those two.
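
The formula can be checked on a toy corpus with a few lines of Python. This is a minimal sketch, not the MapReduce solution: it assumes that one "document" is the list of all words written by one user, that the text is already tokenized, and that the term occurs in at least one document (otherwise the IDF division fails). The function names and the sample data are hypothetical.

```python
import math

def tf(term, doc_tokens):
    """Term frequency: occurrences of term / number of words in the document."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, docs):
    """Inverse document frequency, using the natural logarithm."""
    containing = sum(1 for tokens in docs if term in tokens)
    return math.log(len(docs) / containing)

def tf_idf(term, doc_tokens, docs):
    """TF-IDF score: the product of TF and IDF."""
    return tf(term, doc_tokens) * idf(term, docs)

# Hypothetical toy corpus: one token list per user.
docs = [
    ["java", "hadoop", "java", "hive"],
    ["python", "hadoop"],
    ["hadoop", "spark"],
]

java_score = tf_idf("java", docs[0], docs)      # (2/4) * ln(3/1)
hadoop_score = tf_idf("hadoop", docs[0], docs)  # (1/4) * ln(3/3)
```

Here "hadoop" appears in every document, so its IDF is ln(1) = 0 and its score is 0, while "java" appears only in the first document and gets a positive score there.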

Downloading from Stack Overflow

  • You can only download 50000 rows in one query. Here is a query to get the most popular posts:
    select top 50000 * from posts where posts.ViewCount > 1000000 ORDER BY posts.ViewCount
  • To count the number of records in a range:
    select count(*) from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000
  • To retrieve records from a particular range:
    select * from posts where posts.ViewCount > 15000 and posts.ViewCount < 20000
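
Because of the 50000-row limit, the 250,000 posts have to be fetched in several ViewCount ranges. The sketch below generates one query per range from a list of boundaries. The boundaries shown are hypothetical; in practice each range would first be sized with the count(*) query so that it stays under the limit. Note one deliberate deviation from the queries above: with strict `>` and `<` on both ends, adjacent ranges would skip posts whose ViewCount equals a boundary, so this sketch uses `>=` on the lower bound.

```python
def range_queries(boundaries):
    """Yield one SELECT per half-open ViewCount range [low, high)."""
    for low, high in zip(boundaries, boundaries[1:]):
        yield (
            "select * from posts "
            f"where posts.ViewCount >= {low} and posts.ViewCount < {high}"
        )

# Hypothetical boundaries, chosen only for illustration.
BOUNDARIES = [15000, 20000, 30000]
QUERIES = list(range_queries(BOUNDARIES))

for query in QUERIES:
    print(query)
```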

Summary

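Computing TF-IDF on Hadoop is fairly expensive in running time: a lot of intermediate data has to be written to disk, and a single Hadoop program is not enough to solve the problem, so several jobs have to be chained. Switching to Spark would probably be considerably more efficient.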
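
The need for more than one job can be illustrated with a small in-memory simulation. This is a sketch under assumptions, not the assignment's reference solution: the three-job split (term counts, per-user totals, document frequency) is one common decomposition, `run_job` and all other names are hypothetical, tokenization is a plain whitespace split, and the final top-N selection is done in ordinary Python instead of a fourth job.

```python
import math
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output

# Job 1: (user, text) -> ((user, term), count)
def map_terms(record):
    user, text = record
    for term in text.lower().split():
        yield (user, term), 1

def reduce_sum(key, values):
    yield key, sum(values)

# Job 2: attach each user's total word count -> ((user, term), (count, total))
def map_by_user(record):
    (user, term), count = record
    yield user, (term, count)

def reduce_totals(user, values):
    # values is a list here, so it can be scanned twice.
    total = sum(count for _, count in values)
    for term, count in values:
        yield (user, term), (count, total)

# Job 3: count the users per term and emit ((user, term), tf-idf)
def map_by_term(record):
    (user, term), (count, total) = record
    yield term, (user, count, total)

def make_reduce_tfidf(num_users):
    def reduce_tfidf(term, values):
        idf = math.log(num_users / len(values))
        for user, count, total in values:
            yield (user, term), (count / total) * idf
    return reduce_tfidf

def per_user_tfidf(posts, top_n=10):
    """Chain the three jobs, then keep the top_n terms for each user."""
    num_users = len({user for user, _ in posts})
    counts = run_job(posts, map_terms, reduce_sum)
    with_totals = run_job(counts, map_by_user, reduce_totals)
    scores = run_job(with_totals, map_by_term, make_reduce_tfidf(num_users))
    by_user = defaultdict(list)
    for (user, term), score in scores:
        by_user[user].append((term, score))
    return {
        user: sorted(terms, key=lambda ts: (-ts[1], ts[0]))[:top_n]
        for user, terms in by_user.items()
    }

# Hypothetical input: (user id, post text) pairs.
POSTS = [
    ("u1", "java hadoop java"),
    ("u1", "hive"),
    ("u2", "python hadoop"),
]
RESULT = per_user_tfidf(POSTS)
```

Each `run_job` call stands for a full pass in which Hadoop would write the intermediate output to disk before the next job reads it, which is where much of the cost mentioned above comes from.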

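Original-content statement: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.

In case of infringement, please contact cloudcommunity@tencent.com for removal.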