A generic, re-usable histogram class that supports partial aggregations.
The algorithm is a heuristic adapted from the following paper:
Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm",
J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation
guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number
of histogram bins.
Adapted from Hive's NumericHistogram, with the following differences:
1. Declares [[Coord]] and its variables as public for
easy access in the HistogramNumeric class.
2. Adds method [[getNumBins()]] for serializing [[NumericHistogram]].
3. Adds method [[addBin()]] for deserializing [[NumericHistogram]].
4. In Hive's code, the method [[merge()]] takes a serialized histogram;
in Spark, this method takes a deserialized histogram,
so the bin-merging code is changed accordingly.
Takes a histogram and merges it with the current histogram object.
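The following is a minimal sketch of how two bin lists can be merged in the style of the Ben-Haim and Tom-Tov paper: concatenate the bins, sort them by centroid, then repeatedly fuse the two closest neighbours until at most maxBins remain. The class HistogramMergeSketch, its Bin type, and the static merge signature are hypothetical and chosen for illustration; the actual NumericHistogram method may handle uninitialized histograms and internal storage differently.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration only; not the NumericHistogram implementation.
final class HistogramMergeSketch {
    // One bin: x is the centroid, y is the number of points it summarizes.
    static final class Bin {
        final double x;
        final double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    // Returns a new bin list with at most maxBins entries, sorted by centroid.
    static List<Bin> merge(List<Bin> current, List<Bin> other, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        // Step 1: pool the bins of both histograms and sort by centroid.
        List<Bin> merged = new ArrayList<>(current);
        merged.addAll(other);
        merged.sort(Comparator.comparingDouble((Bin b) -> b.x));

        // Step 2: while over budget, fuse the adjacent pair with the smallest gap.
        while (merged.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < merged.size(); i++) {
                double gap = merged.get(i + 1).x - merged.get(i).x;
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            Bin a = merged.get(best);
            Bin b = merged.remove(best + 1);
            double total = a.y + b.y;
            // The fused centroid is the count-weighted mean; counts add up.
            merged.set(best, new Bin((a.x * a.y + b.x * b.y) / total, total));
        }
        return merged;
    }
}
```

Because each fusion only replaces two bins with their weighted mean, the total count across all bins is preserved; only the resolution of the approximation is reduced.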
public void add(double v)
Adds a new data point to the histogram approximation. Make sure you have
called either allocate() or merge() first. This method implements Algorithm #1
from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.
v - The data point to add to the histogram approximation.
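As a rough illustration of the update step described above, the sketch below inserts a point as a new unit-count bin (or increments an existing bin with the same centroid) and, if the bin budget is exceeded, fuses the two closest adjacent bins into their count-weighted mean. The class StreamingHistogramSketch and its members are hypothetical names; the constructor stands in for the allocate() call mentioned above, and the linear scans are used for clarity rather than speed.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in; not the NumericHistogram implementation.
final class StreamingHistogramSketch {
    // One bin: x is the centroid, y is the number of points it summarizes.
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    // Adds one data point to the approximation.
    void add(double v) {
        // Locate the first bin whose centroid is >= v.
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) {
            i++;
        }
        // An exact centroid match just increments that bin's count.
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1;
            return;
        }
        // Otherwise insert a new unit bin in sorted position, then trim.
        bins.add(i, new Bin(v, 1));
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    // Fuses the two adjacent bins with the smallest centroid gap.
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin a = bins.get(best);
        Bin b = bins.remove(best + 1);
        double total = a.y + b.y;
        a.x = (a.x * a.y + b.x * b.y) / total; // count-weighted mean
        a.y = total;
    }

    int size() { return bins.size(); }

    Bin bin(int i) { return bins.get(i); }
}
```

For example, with a budget of two bins, adding 1.0, 2.0, and 10.0 leaves the bins (1.5, count 2) and (10.0, count 1), since 1.0 and 2.0 are the closest pair when the third point arrives.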