Hive自定义函数UDAF开发


Hive自定义函数UDAF开发
Hive支持自定义函数,UDAF是接受多行,输出一行。通常是group by时用到这种函数。
其实最好的学习资料就是官方自带的examples了。我这里用的是0.10版本hive,所以对于的examples在https://github.com/apache/hive/tree/branch-0.10/contrib/src/java/org/apache/hadoop/hive/contrib/udaf/example

我这里的功能需求是:actionCount(act_code,act_times,'1')
如果act_code=='1',则将在一个group里面的act_times加起来。
package hive.udaf;


import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

/**
*
* It should be very easy to follow and can be used as an example for writing
* new UDAFs.
*
* Note that Hive internally uses a different mechanism (called GenericUDAF) to
* implement built-in aggregation functions, which are harder to program but
* more efficient.
*
*/
public final class ActionCount extends UDAF {

/**
* The internal state of an aggregation for average.
*
* Note that this is only needed if the internal state cannot be represented
* by a primitive.
*
* The internal state can also contains fields with types like
* ArrayList<String> and HashMap<String,Double> if needed.
*/
public static class UDAFState {
private long mCount;
private long mSum;
}

/**
* The actual class for doing the aggregation. Hive will automatically look
* for all internal classes of the UDAF that implements UDAFEvaluator.
*/
public static class UDAFExampleAvgEvaluator implements UDAFEvaluator {

UDAFState state;

public UDAFExampleAvgEvaluator() {
super();
state = new UDAFState();
init();
}

/**
* Reset the state of the aggregation.
*/
public void init() {
state.mSum = 0;
state.mCount = 0;
}

/**
* Iterate through one row of original data.
*
* The number and type of arguments need to the same as we call this UDAF
* from Hive command line.
*
* This function should always return true.
*/
public boolean iterate(String act_code,long act_times,String act_type) // 来了一行
{
if (act_code .equals(act_type))
{
state.mSum += act_times;
state.mCount++;
}
return true;
}

/**
* Terminate a partial aggregation and return the state. If the state is a
* primitive, just return primitive Java classes like Integer or String.
*/
public UDAFState terminatePartial() {//状态传递
// This is SQL standard - average of zero items should be null.
return state.mCount == 0 ? null : state;
}

/**
* Merge with a partial aggregation.
*
* This function should always have a single argument which has the same
* type as the return value of terminatePartial().
*/
public boolean merge(UDAFState o) {//子任务合并
if (o != null) {
state.mSum += o.mSum;
state.mCount += o.mCount;
}
return true;
}

/**
* Terminates the aggregation and return the final result.
*/
public long terminate() {//返回最终结果
// This is SQL standard - average of zero items should be null.
return state.mCount == 0 ? 0 : state.mSum;
}
}

private ActionCount() {
// prevent instantiation
}

}

关键还是要深刻理解map-reduce工作模型,才能更好驾驭hive。

本文作者:linger
本文链接:http://blog.csdn.net/lingerlanlan/article/details/41920151



智能推荐

注意!

本站转载的文章为个人学习借鉴使用,本站对版权不负任何法律责任。如果侵犯了您的隐私权益,请联系我们删除。



 
© 2014-2019 ITdaan.com 粤ICP备14056181号  

赞助商广告