org.apache.spark.sql.util.NumericHistogram

public class NumericHistogram extends Object

A generic, reusable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.

Adapted from Hive's NumericHistogram. See https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumericHistogram.java

Differences from the Hive version:
1. Declares [[Coord]] and its variables as public for easy access in the HistogramNumeric class.
2. Adds the method [[getNumBins()]] for serializing a [[NumericHistogram]] in [[NumericHistogramSerializer]].
3. Adds the method [[addBin()]] for deserializing a [[NumericHistogram]] in [[NumericHistogramSerializer]].
4. In Hive's code, the method [[merge()]] is passed a serialized histogram; in Spark, it is passed a deserialized histogram, so the bin-merging code is changed accordingly.
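To make the heuristic concrete, the following is a minimal, self-contained sketch of the streaming update described in the paper: each new point becomes its own bin, and when the bin budget is exceeded the two closest adjacent bins are replaced by their weighted average. The class `StreamingHistogramSketch` and all of its members are hypothetical names chosen for illustration; this is not Spark's or Hive's implementation, and details such as the linear search are simplifications.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the Spark NumericHistogram class. */
class StreamingHistogramSketch {
    /** One bin: x is the centroid, y is the number of points it represents. */
    static final class Bin {
        double x;
        double y;
        Bin(double x, double y) { this.x = x; this.y = y; }
    }

    private final int maxBins;
    private final List<Bin> bins = new ArrayList<>(); // always kept sorted by x

    StreamingHistogramSketch(int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        this.maxBins = maxBins;
    }

    /** Adds one data point, merging the closest pair of bins if over budget. */
    void add(double v) {
        // Linear scan for the first bin whose centroid is >= v (kept simple on purpose).
        int i = 0;
        while (i < bins.size() && bins.get(i).x < v) {
            i++;
        }
        if (i < bins.size() && bins.get(i).x == v) {
            bins.get(i).y += 1; // exact match: just bump the count
            return;
        }
        bins.add(i, new Bin(v, 1)); // otherwise insert a new unit-weight bin
        if (bins.size() > maxBins) {
            mergeClosestPair();
        }
    }

    /** Replaces the two adjacent bins with the smallest gap by their weighted mean. */
    private void mergeClosestPair() {
        int best = 0;
        double bestGap = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < bins.size(); i++) {
            double gap = bins.get(i + 1).x - bins.get(i).x;
            if (gap < bestGap) {
                bestGap = gap;
                best = i;
            }
        }
        Bin lo = bins.get(best);
        Bin hi = bins.remove(best + 1);
        double total = lo.y + hi.y;
        lo.x = (lo.x * lo.y + hi.x * hi.y) / total;
        lo.y = total;
    }

    int usedBins() { return bins.size(); }

    Bin bin(int i) { return bins.get(i); }
}
```

In this sketch the total of all bin counts always equals the number of points added, which is the property that lets the bins stand in for the original data.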

Since: 3.3.0

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

NumericHistogram.Coord

The Coord class defines a histogram bin, which is just an (x,y) pair.
Constructor Summary

Constructors

Constructor

Description

NumericHistogram()

Creates a new histogram object.
Method Summary

Modifier and Type

Method

Description

void

add(double v)

Adds a new data point to the histogram approximation.

void

addBin(double x, double y, int b)

Sets the histogram bin at the given index.

void

allocate(int num_bins)

Sets the number of histogram bins to use for approximating data.

NumericHistogram.Coord

getBin(int b)

Returns a particular histogram bin.

int

getNumBins()

Returns the number of bins.

int

getUsedBins()

Returns the number of bins currently being used by the histogram.

boolean

isReady()

Returns true if this histogram object has been initialized by calling merge() or allocate().

void

merge(NumericHistogram other)

Takes a histogram and merges it with the current histogram object.

void

reset()

Resets a histogram object to its initial state.

void

setUsedBins(int nusedBins)

Sets the number of bins currently being used by the histogram.

Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Constructor Details
- NumericHistogram
  
  public NumericHistogram()
  
  Creates a new histogram object. Note that the allocate() or merge() method must be called before the histogram can be used.
Method Details
- reset
  
  public void reset()
  
  Resets a histogram object to its initial state. allocate() or merge() must be called again before use.
- getNumBins
  
  public int getNumBins()
  
  Returns the number of bins.
- getUsedBins
  
  public int getUsedBins()
  
  Returns the number of bins currently being used by the histogram.
- setUsedBins
  
  public void setUsedBins(int nusedBins)
  
  Sets the number of bins currently being used by the histogram.
- isReady
  
  public boolean isReady()
  
  Returns true if this histogram object has been initialized by calling merge() or allocate().
- getBin
  
  public NumericHistogram.Coord getBin(int b)
  
  Returns a particular histogram bin.
- addBin
  
  public void addBin(double x, double y, int b)
  
  Sets the histogram bin at the given index.
- allocate
  
  public void allocate(int num_bins)
  
  Sets the number of histogram bins to use for approximating data.
  
  Parameters:
  
  num_bins - Number of non-uniform-width histogram bins to use
- merge
  
  public void merge(NumericHistogram other)
  
  Takes a histogram and merges it with the current histogram object.
- add
  
  public void add(double v)
  
  Adds a new data point to the histogram approximation. Make sure you have called either allocate() or merge() first. This method implements Algorithm #1 from Ben-Haim and Tom-Tov, "A Streaming Parallel Decision Tree Algorithm", JMLR 2010.
  
  Parameters:
  
  v - The data point to add to the histogram approximation.
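
The merge() entry above combines two partial histograms. As a rough illustration of how such a merge can work in the style of the Ben-Haim and Tom-Tov paper, the sketch below concatenates the bins of both inputs, sorts them by centroid, and then repeatedly merges the closest adjacent pair until the bin budget is met. The class `HistogramMergeSketch`, the `double[]{x, y}` bin representation, and the `maxBins` parameter are assumptions made for this example; Spark's actual merge() may differ in its details.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration only; not the Spark NumericHistogram class. */
class HistogramMergeSketch {
    /**
     * Merges two bin lists into at most maxBins bins.
     * Each bin is a double[]{x, y}: centroid x and count y (assumed y > 0).
     * The inputs are not modified.
     */
    static List<double[]> merge(List<double[]> left, List<double[]> right, int maxBins) {
        if (maxBins < 1) {
            throw new IllegalArgumentException("maxBins must be >= 1");
        }
        // Copy every bin so the callers' histograms stay untouched.
        List<double[]> all = new ArrayList<>();
        for (double[] bin : left) {
            all.add(bin.clone());
        }
        for (double[] bin : right) {
            all.add(bin.clone());
        }
        // Sort by centroid so that "closest" bins are always adjacent.
        all.sort(Comparator.comparingDouble((double[] bin) -> bin[0]));

        // Trim: merge the closest adjacent pair until the budget is met.
        while (all.size() > maxBins) {
            int best = 0;
            double bestGap = Double.POSITIVE_INFINITY;
            for (int i = 0; i + 1 < all.size(); i++) {
                double gap = all.get(i + 1)[0] - all.get(i)[0];
                if (gap < bestGap) {
                    bestGap = gap;
                    best = i;
                }
            }
            double[] lo = all.get(best);
            double[] hi = all.remove(best + 1);
            double total = lo[1] + hi[1];
            lo[0] = (lo[0] * lo[1] + hi[0] * hi[1]) / total; // weighted centroid
            lo[1] = total;                                   // combined count
        }
        return all;
    }
}
```

Because merging only ever replaces two bins with their weighted average, the sum of counts in the result equals the sum of counts in both inputs, which is what makes this kind of histogram suitable for partial aggregation.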
