Class CountMinSketch

Object
org.apache.spark.util.sketch.CountMinSketch

public abstract class CountMinSketch extends Object
A Count-min sketch is a probabilistic data structure used for cardinality estimation using sub-linear space. Currently, supported data types include: A CountMinSketch is initialized with a random seed, and a pair of parameters:
  1. relative error (or eps), and
  2. confidence (or delta)
Suppose you want to estimate the number of times an element x has appeared in a data stream so far. With probability delta, the estimate of this frequency is within the range true frequency <= estimate <= true frequency + eps * N, where N is the total count of items have appeared the data stream so far. Under the cover, a CountMinSketch is essentially a two-dimensional long array with depth d and width w, where
  • d = ceil(2 / eps)
  • w = ceil(-log(1 - confidence) / log(2))
This implementation is largely based on the CountMinSketch class from stream-lib.
  • Constructor Details

    • CountMinSketch

      public CountMinSketch()
  • Method Details

    • relativeError

      public abstract double relativeError()
      Returns the relative error (or eps) of this CountMinSketch.
    • confidence

      public abstract double confidence()
      Returns the confidence (or delta) of this CountMinSketch.
    • depth

      public abstract int depth()
      Depth of this CountMinSketch.
    • width

      public abstract int width()
      Width of this CountMinSketch.
    • totalCount

      public abstract long totalCount()
      Total count of items added to this CountMinSketch so far.
    • add

      public abstract void add(Object item)
      Increments item's count by one.
    • add

      public abstract void add(Object item, long count)
      Increments item's count by count.
    • addLong

      public abstract void addLong(long item)
      Increments item's count by one.
    • addLong

      public abstract void addLong(long item, long count)
      Increments item's count by count.
    • addString

      public abstract void addString(String item)
      Increments item's count by one.
    • addString

      public abstract void addString(String item, long count)
      Increments item's count by count.
    • addBinary

      public abstract void addBinary(byte[] item)
      Increments item's count by one.
    • addBinary

      public abstract void addBinary(byte[] item, long count)
      Increments item's count by count.
    • estimateCount

      public abstract long estimateCount(Object item)
      Returns the estimated frequency of item.
    • mergeInPlace

      public abstract CountMinSketch mergeInPlace(CountMinSketch other) throws IncompatibleMergeException
      Merges another CountMinSketch with this one in place. Note that only Count-Min sketches with the same depth, width, and random seed can be merged.
      Throws:
      IncompatibleMergeException - if the other CountMinSketch has incompatible depth, width, relative-error, confidence, or random seed.
    • writeTo

      public abstract void writeTo(OutputStream out) throws IOException
      Writes out this CountMinSketch to an output stream in binary format. It is the caller's responsibility to close the stream.
      Throws:
      IOException
    • toByteArray

      public abstract byte[] toByteArray() throws IOException
      Serializes this CountMinSketch and returns the serialized form.
      Throws:
      IOException
    • readFrom

      public static CountMinSketch readFrom(InputStream in) throws IOException
      Reads in a CountMinSketch from an input stream. It is the caller's responsibility to close the stream.
      Throws:
      IOException
    • readFrom

      public static CountMinSketch readFrom(byte[] bytes) throws IOException
      Reads in a CountMinSketch from a byte array.
      Throws:
      IOException
    • create

      public static CountMinSketch create(int depth, int width, int seed)
      Creates a CountMinSketch with given depth, width, and random seed.
      Parameters:
      depth - depth of the Count-min Sketch, must be positive
      width - width of the Count-min Sketch, must be positive
      seed - random seed
    • create

      public static CountMinSketch create(double eps, double confidence, int seed)
      Creates a CountMinSketch with given relative error (eps), confidence, and random seed.
      Parameters:
      eps - relative error, must be positive
      confidence - confidence, must be positive and less than 1.0
      seed - random seed