Frequent Pattern Mining

Mining frequent items, itemsets, subsequences, or other substructures is usually among the first steps in analyzing a large-scale dataset, and it has been an active research topic in data mining for years. We refer users to Wikipedia’s article on association rule learning for more information.

FP-Growth

The FP-growth algorithm is described in the paper Han et al., Mining frequent patterns without candidate generation, where “FP” stands for frequent pattern. Given a dataset of transactions, the first step of FP-growth is to calculate item frequencies and identify frequent items. Unlike Apriori-like algorithms designed for the same purpose, the second step of FP-growth uses a suffix-tree (FP-tree) structure to encode transactions without explicitly generating candidate sets, which are usually expensive to generate. After the second step, the frequent itemsets can be extracted from the FP-tree. In spark.mllib, we implemented a parallel version of FP-growth called PFP, as described in Li et al., PFP: Parallel FP-growth for query recommendation. PFP distributes the work of growing FP-trees based on the suffixes of transactions, and hence is more scalable than a single-machine implementation. We refer users to the papers for more details.
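
To make the first step concrete, here is a minimal sketch in plain Scala (an illustration of the idea only, not Spark’s distributed implementation) that counts item frequencies over the toy transactions used in the examples below and keeps only the items meeting a minimum support threshold:

val transactions = Seq(
  Seq("1", "2", "5"),
  Seq("1", "2", "3", "5"),
  Seq("1", "2"))
val minSupport = 0.5

// Count the number of transactions in which each item appears.
val itemCounts = transactions.flatMap(_.distinct)
  .groupBy(identity).map { case (item, occurrences) => item -> occurrences.size }

// An item is frequent if it occurs in at least minSupport * |transactions| transactions.
val minCount = math.ceil(minSupport * transactions.size)
val frequentItems = itemCounts.filter { case (_, count) => count >= minCount }
// frequentItems: Map(1 -> 3, 2 -> 3, 5 -> 2); item 3 occurs only once and is dropped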

spark.ml’s FP-growth implementation takes the following (hyper-)parameters:

- minSupport: the minimum support for an itemset to be identified as frequent. For example, if an item appears in 3 out of 5 transactions, it has a support of 3/5 = 0.6.
- minConfidence: the minimum confidence for generating an association rule. Confidence indicates how often a rule has been found to be true. For example, if itemset X appears in 4 transactions and X and Y co-occur in 2 of them, the confidence of the rule X => Y is 2/4 = 0.5. This parameter does not affect the mining of frequent itemsets, only the generation of association rules from them.
- numPartitions: the number of partitions used to distribute the work. By default this parameter is not set, and the number of partitions of the input dataset is used.
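
As a quick sanity check of these definitions (a hand computation in plain Scala, no Spark involved), consider the three transactions used in the examples below:

val transactions = Seq(Set("1", "2", "5"), Set("1", "2", "3", "5"), Set("1", "2"))

// support(X) = (#transactions containing X) / (#transactions)
def support(itemset: Set[String]): Double =
  transactions.count(t => itemset.subsetOf(t)).toDouble / transactions.size

// confidence(X => Y) = support(X union Y) / support(X)
def confidence(antecedent: Set[String], consequent: Set[String]): Double =
  support(antecedent ++ consequent) / support(antecedent)

support(Set("1", "2"))          // 3/3 = 1.0
support(Set("2", "5"))          // 2/3 ~= 0.67
confidence(Set("2"), Set("5"))  // (2/3) / 1.0 ~= 0.67, kept at minConfidence = 0.6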

The FPGrowthModel provides:

- freqItemsets: the frequent itemsets, in the format of a DataFrame("items"[Array], "freq"[Long]).
- associationRules: the association rules generated with confidence above minConfidence, in the format of a DataFrame("antecedent"[Array], "consequent"[Array], "confidence"[Double]).
- transform: for each transaction in itemsCol, the transform method compares its items against the antecedents of each association rule. If the record contains all the antecedents of a rule, the rule is considered applicable and its consequents are added to the prediction result. The method summarizes the consequents of all applicable rules as the prediction; the prediction column has the same data type as itemsCol and does not contain items already present in itemsCol.
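
The transform semantics can be pictured with a small plain-Scala sketch (illustrative only; the rule set below is hypothetical, and the real implementation operates on DataFrames):

// Hypothetical mined rules: antecedent -> consequent pairs.
val rules = Seq((Set("1", "2"), Set("5")), (Set("5"), Set("2")))

// Collect the consequents of every applicable rule (all antecedent items
// present in the transaction), excluding items the transaction already has.
def predict(items: Set[String]): Set[String] =
  rules.collect { case (ante, cons) if ante.subsetOf(items) => cons }
    .flatten.toSet -- items

predict(Set("1", "2"))  // Set(5): rule {1,2} => {5} applies
predict(Set("1", "5"))  // Set(2): rule {5} => {2} applies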

Examples

Refer to the Scala API docs for more details.

import org.apache.spark.ml.fpm.FPGrowth
import spark.implicits._

val dataset = spark.createDataset(Seq(
  "1 2 5",
  "1 2 3 5",
  "1 2")
).map(t => t.split(" ")).toDF("items")

val fpgrowth = new FPGrowth().setItemsCol("items").setMinSupport(0.5).setMinConfidence(0.6)
val model = fpgrowth.fit(dataset)

// Display frequent itemsets.
model.freqItemsets.show()

// Display generated association rules.
model.associationRules.show()

// transform examines the input items against all the association rules and
// summarizes the consequents as prediction.
model.transform(dataset).show()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/FPGrowthExample.scala" in the Spark repo.

Refer to the Java API docs for more details.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.fpm.FPGrowth;
import org.apache.spark.ml.fpm.FPGrowthModel;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

List<Row> data = Arrays.asList(
  RowFactory.create(Arrays.asList("1 2 5".split(" "))),
  RowFactory.create(Arrays.asList("1 2 3 5".split(" "))),
  RowFactory.create(Arrays.asList("1 2".split(" ")))
);
StructType schema = new StructType(new StructField[]{ new StructField(
  "items", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
});
Dataset<Row> itemsDF = spark.createDataFrame(data, schema);

FPGrowthModel model = new FPGrowth()
  .setItemsCol("items")
  .setMinSupport(0.5)
  .setMinConfidence(0.6)
  .fit(itemsDF);

// Display frequent itemsets.
model.freqItemsets().show();

// Display generated association rules.
model.associationRules().show();

// transform examines the input items against all the association rules and
// summarizes the consequents as prediction.
model.transform(itemsDF).show();

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaFPGrowthExample.java" in the Spark repo.

Refer to the Python API docs for more details.

from pyspark.ml.fpm import FPGrowth

df = spark.createDataFrame([
    (0, [1, 2, 5]),
    (1, [1, 2, 3, 5]),
    (2, [1, 2])
], ["id", "items"])

fpGrowth = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fpGrowth.fit(df)

# Display frequent itemsets.
model.freqItemsets.show()

# Display generated association rules.
model.associationRules.show()

# transform examines the input items against all the association rules and
# summarizes the consequents as prediction.
model.transform(df).show()

Find full example code at "examples/src/main/python/ml/fpgrowth_example.py" in the Spark repo.

Refer to the R API docs for more details.

# Load training data
df <- selectExpr(createDataFrame(data.frame(rawItems = c(
  "1,2,5", "1,2,3,5", "1,2"
))), "split(rawItems, ',') AS items")

fpm <- spark.fpGrowth(df, itemsCol = "items", minSupport = 0.5, minConfidence = 0.6)

# Extract frequent itemsets
spark.freqItemsets(fpm)

# Extract association rules
spark.associationRules(fpm)

# predict uses the association rules and combines possible consequents
predict(fpm, df)

Find full example code at "examples/src/main/r/ml/fpm.R" in the Spark repo.

PrefixSpan

PrefixSpan is a sequential pattern mining algorithm described in Pei et al., Mining Sequential Patterns by Pattern-Growth: The PrefixSpan Approach. We refer the reader to the referenced paper for the formal definition of the sequential pattern mining problem.

spark.ml’s PrefixSpan implementation takes the following parameters:

- minSupport: the minimum support required to be considered a frequent sequential pattern.
- maxPatternLength: the maximum length of a frequent sequential pattern. Any frequent pattern exceeding this length is not included in the results.
- maxLocalProjDBSize: the maximum number of items allowed in a prefix-projected database before local iterative processing of the projected database begins. This parameter should be tuned with respect to the size of your executors.
- sequenceCol: the name of the sequence column in the dataset (default "sequence"); rows with nulls in this column are ignored.
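
To clarify what it means for a sequence to contain a sequential pattern, here is a small plain-Scala sketch (illustrative only, not the PrefixSpan algorithm itself): each itemset of the pattern must be a subset of some itemset of the sequence, and the matches must occur in order.

// Greedy left-to-right check that `pattern` is a subsequence of `sequence`,
// where each pattern itemset must be contained in a matching sequence itemset.
def contains(pattern: List[Set[Int]], sequence: List[Set[Int]]): Boolean =
  (pattern, sequence) match {
    case (Nil, _) => true                 // every pattern itemset was matched
    case (_, Nil) => false                // sequence exhausted before the pattern
    case (p :: ps, s :: ss) =>
      if (p.subsetOf(s)) contains(ps, ss) // match this itemset, advance both
      else contains(pattern, ss)          // skip this sequence itemset
  }

val sequences = Seq(
  List(Set(1, 2), Set(3)),
  List(Set(1), Set(3, 2), Set(1, 2)),
  List(Set(1, 2), Set(5)),
  List(Set(6)))

// Pattern <(1)(3)> occurs in 2 of 4 sequences: support 0.5, frequent at minSupport = 0.5.
sequences.count(s => contains(List(Set(1), Set(3)), s))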

Examples

Refer to the Scala API docs for more details.

import org.apache.spark.ml.fpm.PrefixSpan
import spark.implicits._

val smallTestData = Seq(
  Seq(Seq(1, 2), Seq(3)),
  Seq(Seq(1), Seq(3, 2), Seq(1, 2)),
  Seq(Seq(1, 2), Seq(5)),
  Seq(Seq(6)))

val df = smallTestData.toDF("sequence")
val result = new PrefixSpan()
  .setMinSupport(0.5)
  .setMaxPatternLength(5)
  .setMaxLocalProjDBSize(32000000)
  .findFrequentSequentialPatterns(df)
result.show()

Find full example code at "examples/src/main/scala/org/apache/spark/examples/ml/PrefixSpanExample.scala" in the Spark repo.

Refer to the Java API docs for more details.

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.fpm.PrefixSpan;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

List<Row> data = Arrays.asList(
  RowFactory.create(Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3))),
  RowFactory.create(Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2))),
  RowFactory.create(Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5))),
  RowFactory.create(Arrays.asList(Arrays.asList(6)))
);
StructType schema = new StructType(new StructField[]{ new StructField(
  "sequence", new ArrayType(new ArrayType(DataTypes.IntegerType, true), true),
  false, Metadata.empty())
});
Dataset<Row> sequenceDF = spark.createDataFrame(data, schema);

PrefixSpan prefixSpan = new PrefixSpan().setMinSupport(0.5).setMaxPatternLength(5);

// Finding frequent sequential patterns
prefixSpan.findFrequentSequentialPatterns(sequenceDF).show();

Find full example code at "examples/src/main/java/org/apache/spark/examples/ml/JavaPrefixSpanExample.java" in the Spark repo.

Refer to the Python API docs for more details.

from pyspark.ml.fpm import PrefixSpan
from pyspark.sql import Row

sc = spark.sparkContext
df = sc.parallelize([Row(sequence=[[1, 2], [3]]),
                     Row(sequence=[[1], [3, 2], [1, 2]]),
                     Row(sequence=[[1, 2], [5]]),
                     Row(sequence=[[6]])]).toDF()

prefixSpan = PrefixSpan(minSupport=0.5, maxPatternLength=5,
                        maxLocalProjDBSize=32000000)

# Find frequent sequential patterns.
prefixSpan.findFrequentSequentialPatterns(df).show()

Find full example code at "examples/src/main/python/ml/prefixspan_example.py" in the Spark repo.

Refer to the R API docs for more details.

# Load training data
df <- createDataFrame(list(list(list(list(1L, 2L), list(3L))),
                           list(list(list(1L), list(3L, 2L), list(1L, 2L))),
                           list(list(list(1L, 2L), list(5L))),
                           list(list(list(6L)))),
                      schema = c("sequence"))

# Finding frequent sequential patterns
frequency <- spark.findFrequentSequentialPatterns(df, minSupport = 0.5, maxPatternLength = 5L,
                                                  maxLocalProjDBSize = 32000000L)
showDF(frequency)

Find full example code at "examples/src/main/r/ml/prefixSpan.R" in the Spark repo.