# 使用 Apache PIG 统计积累型数据的差值

```PID		GrpID	TIMESTAMP				VALUE
a.1		a		2017-08-04 20:01:00		5
a.1		a		2017-08-04 20:02:00		7
a.2		a		2017-08-04 20:00:00		13
a.1		a		2017-08-04 20:03:00		0
a.1		a		2017-08-04 20:04:00		10
b.3		b		2017-08-04 20:10:00		12
b.3		b		2017-08-04 20:11:00		27
b.3		b		2017-08-04 20:12:00		33
c.2		c		2017-08-04 20:01:00		27
c.3		c		2017-08-04 20:02:00		30
......```

1. 对数据集按照PID聚合
2. 对于每个聚合子集，按照TIMESTAMP进行ASC排序
3. 对于排序过后的VALUE序列 V1, V_2, V_3 …… ,V(n-1), V_n 计算：
`SUM_Diff = SUM((V_t – V_(t-1)) >= 0 ? (V_t – V_(t-1)) : 0)`

1. 对GrpID聚合，求出一个分组下所有进程SUM_Diff的求和值。

https://pig.apache.org/docs/r0.9.1/udf.html

```REGISTER 'pycalc/calc_lost_pkg.py' using jython as myudf;
REGISTER /data/gdata/pig-0.15.0/thirdparty/mysql-connector-java-5.1.38-bin.jar;
REGISTER /data/gdata/pig-0.15.0/thirdparty/piggybank-0.15.0.jar;

B = FOREACH A GENERATE \$0 AS pid, \$1 AS grpid, \$2 as ts, (int)\$3 as value;
C = FILTER B BY pid is not null AND grpid is not null AND ts is not null AND value is not null;
D = FOREACH C GENERATE
pid as pid,
grpid as grpid,
ts as ts,
value as value;

uniq_D = DISTINCT D;
E = GROUP uniq_D BY (pid, grpid);
F = FOREACH E {
sorted = ORDER uniq_D by ts DESC;
diff_sum = myudf.calc_lost_pkg_cnt(sorted);
GENERATE FLATTEN(group) AS (pid, grpid),
diff_sum AS diff_sum;
};
G = FOREACH (GROUP F BY grpid) GENERATE
\${MACRO_LOG_DAY}\${MACRO_LOG_HOUR} as logtime,
group as grpid,
SUM(F.diff_sum) as lost_pkg_cnt;

H = FILTER G BY lost_pkg_cnt is not null;
STORE H INTO '/pigtest/test.result.7' USING org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver',
'jdbc:mysql://\${MACRO_DBHOST}:\${MACRO_DBPORT}/\${MACRO_DATABASE}',
'REPLACE INTO  \${MACRO_TABLENAME} (logtime, grpid, lost_pkg_cnt) VALUES (?, ?, ?)');```

```@outputSchema("sum:long")
def calc_lost_pkg_cnt(sorted_data):
sum = 0
for idx in xrange(len(sorted_data)):
if idx < len(sorted_data) - 1:
delta = sorted_data[idx][3] - sorted_data[idx+1][3]
if delta >= 0:
sum += delta
return sum```

5 篇文章17 人订阅

0 条评论

## 相关文章

37360

29050

68050

### 网上书店管理系统数据库 sql sever

1.数据库各数据对象的设计与实现：表、约束、完整性体现、查询、视图，要求用合理的数据体现。

15930

26630

17530

29270

32180

24180

35680