RowMatrix

class pyspark.mllib.linalg.distributed.RowMatrix(rows: Union[pyspark.rdd.RDD[pyspark.mllib.linalg.Vector], pyspark.sql.dataframe.DataFrame], numRows: int = 0, numCols: int = 0)
Represents a row-oriented distributed Matrix with no meaningful row indices.

Parameters

rows : pyspark.RDD or pyspark.sql.DataFrame
    An RDD or DataFrame of vectors. If a DataFrame is provided, it must have a single vector-typed column.
numRows : int, optional
    Number of rows in the matrix. A non-positive value means unknown, in which case the number of rows will be determined by the number of records in the rows RDD.
numCols : int, optional
    Number of columns in the matrix. A non-positive value means unknown, in which case the number of columns will be determined by the size of the first row.
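The method examples below all build a RowMatrix from an RDD; a DataFrame with a single vector-typed column works as well. A minimal sketch, assuming an active SparkSession named spark (the column name features is arbitrary):

>>> from pyspark.mllib.linalg import Vectors
>>> df = spark.createDataFrame(
...     [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),)],
...     ["features"])
>>> mat = RowMatrix(df)
>>> mat.numRows()
2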
 
Methods

columnSimilarities([threshold])
    Compute similarities between columns of this matrix.
computeColumnSummaryStatistics()
    Computes column-wise summary statistics.
computeCovariance()
    Computes the covariance matrix, treating each row as an observation.
computeGramianMatrix()
    Computes the Gramian matrix A^T A.
computePrincipalComponents(k)
    Computes the k principal components of the given row matrix.
computeSVD(k[, computeU, rCond])
    Computes the singular value decomposition of the RowMatrix.
multiply(matrix)
    Multiply this matrix by a local dense matrix on the right.
numCols()
    Get or compute the number of cols.
numRows()
    Get or compute the number of rows.
tallSkinnyQR([computeQ])
    Compute the QR decomposition of this RowMatrix.

Attributes

rows
    Rows of the RowMatrix stored as an RDD of vectors.

Methods Documentation

columnSimilarities(threshold: float = 0.0) → pyspark.mllib.linalg.distributed.CoordinateMatrix
Compute similarities between columns of this matrix.

The threshold parameter is a trade-off knob between estimate quality and computational cost.

The default threshold setting of 0 guarantees deterministically correct results, but uses the brute-force approach of computing normalized dot products.

Setting the threshold to positive values uses a sampling approach and incurs strictly less computational cost than the brute-force approach. However, the similarities computed will be estimates.

The sampling guarantees relative-error correctness for those pairs of columns that have similarity greater than the given similarity threshold.

To describe the guarantee, we set some notation:

- Let A be the smallest in magnitude non-zero element of this matrix.
- Let B be the largest in magnitude non-zero element of this matrix.
- Let L be the maximum number of non-zeros per row.

For example, for {0,1} matrices: A=B=1. Another example, for the Netflix matrix: A=1, B=5.

For those column pairs that are above the threshold, the computed similarity is correct to within 20% relative error with probability at least 1 - (0.981)^(10/B).

The shuffle size is bounded by the smaller of the following two expressions:

- O(n log(n) L / (threshold * A))
- O(m L^2)

The latter is the cost of the brute-force approach, so for non-zero thresholds, the cost is always cheaper than the brute-force approach.

New in version 2.0.0.

Parameters
threshold : float, optional
    Set to 0 for deterministic guaranteed correctness. Similarities above this threshold are estimated with the cost vs. estimate quality trade-off described above.
 
Returns

CoordinateMatrix
    An n x n sparse upper-triangular CoordinateMatrix of cosine similarities between columns of this matrix.
 
Examples

>>> rows = sc.parallelize([[1, 2], [1, 5]])
>>> mat = RowMatrix(rows)

>>> sims = mat.columnSimilarities()
>>> sims.entries.first().value
0.91914503...
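With a positive threshold the similarities are computed by the sampling approach described above, so the returned values are estimates and may vary between runs; a hedged sketch continuing the example:

>>> approx = mat.columnSimilarities(0.5)
>>> approx.entries.first().value  # an estimate; exact output may vary  # doctest: +SKIP
0.91914503...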
computeColumnSummaryStatistics() → pyspark.mllib.stat._statistics.MultivariateStatisticalSummary

Computes column-wise summary statistics.

New in version 2.0.0.

Returns

MultivariateStatisticalSummary
    object containing column-wise summary statistics.
 
Examples

>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6]])
>>> mat = RowMatrix(rows)

>>> colStats = mat.computeColumnSummaryStatistics()
>>> colStats.mean()
array([ 2.5, 3.5, 4.5])
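The returned summary also exposes other column-wise statistics, for example variance(), count() and numNonzeros(); an illustrative continuation of the example above:

>>> colStats.count()
2
>>> colStats.variance()
array([ 4.5, 4.5, 4.5])
>>> colStats.numNonzeros()
array([ 2., 2., 2.])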
computeCovariance() → pyspark.mllib.linalg.Matrix

Computes the covariance matrix, treating each row as an observation.

New in version 2.0.0.

Notes

This cannot be computed on matrices with more than 65535 columns.

Examples

>>> rows = sc.parallelize([[1, 2], [2, 1]])
>>> mat = RowMatrix(rows)

>>> mat.computeCovariance()
DenseMatrix(2, 2, [0.5, -0.5, -0.5, 0.5], 0)
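As a local sanity check (illustrative only, not part of the API), NumPy's np.cov with rowvar=False likewise treats each row as an observation and produces the same matrix:

>>> import numpy as np
>>> np.cov(np.array([[1.0, 2.0], [2.0, 1.0]]), rowvar=False)
array([[ 0.5, -0.5],
       [-0.5,  0.5]])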
computeGramianMatrix() → pyspark.mllib.linalg.Matrix

Computes the Gramian matrix A^T A.

New in version 2.0.0.

Notes

This cannot be computed on matrices with more than 65535 columns.

Examples

>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6]])
>>> mat = RowMatrix(rows)

>>> mat.computeGramianMatrix()
DenseMatrix(3, 3, [17.0, 22.0, 27.0, 22.0, 29.0, 36.0, 27.0, 36.0, 45.0], 0)
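The same Gramian can be verified locally (illustrative only), here for the matrix A with rows [1, 2, 3] and [4, 5, 6]:

>>> import numpy as np
>>> A = np.array([[1, 2, 3], [4, 5, 6]])
>>> A.T @ A
array([[17, 22, 27],
       [22, 29, 36],
       [27, 36, 45]])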
computePrincipalComponents(k: int) → pyspark.mllib.linalg.Matrix

Computes the k principal components of the given row matrix.

New in version 2.2.0.

Parameters

k : int
    Number of principal components to keep.

Returns

pyspark.mllib.linalg.Matrix
    A local matrix whose columns are the k principal components.

Notes

This cannot be computed on matrices with more than 65535 columns.

Examples

>>> rows = sc.parallelize([[1, 2, 3], [2, 4, 5], [3, 6, 1]])
>>> rm = RowMatrix(rows)

>>> # Returns the two principal components of rm
>>> pca = rm.computePrincipalComponents(2)
>>> pca
DenseMatrix(3, 2, [-0.349, -0.6981, 0.6252, -0.2796, -0.5592, -0.7805], 0)

>>> # Transform into new dimensions with the greatest variance.
>>> rm.multiply(pca).rows.collect()
[DenseVector([0.1305, -3.7394]), DenseVector([-0.3642, -6.6983]), DenseVector([-4.6102, -4.9745])]
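Continuing the example, the columns of the returned matrix are orthonormal, which can be checked locally (an illustrative check, not part of the API):

>>> import numpy as np
>>> P = pca.toArray()
>>> np.allclose(P.T @ P, np.eye(2))
True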
computeSVD(k: int, computeU: bool = False, rCond: float = 1e-09) → pyspark.mllib.linalg.distributed.SingularValueDecomposition[pyspark.mllib.linalg.distributed.RowMatrix, pyspark.mllib.linalg.Matrix]

Computes the singular value decomposition of the RowMatrix.

The given row matrix A of dimension (m x n) is decomposed into U * s * V^T, where

- U: (m x k) (left singular vectors) is a RowMatrix whose columns are the eigenvectors of (A * A^T)
- s: DenseVector consisting of the square roots of the eigenvalues (singular values) in descending order.
- V: (n x k) (right singular vectors) is a Matrix whose columns are the eigenvectors of (A^T * A)

For more specific details on the implementation, please refer to the Scala documentation.

New in version 2.2.0.

Parameters

k : int
    Number of leading singular values to keep (0 < k <= n). It might return fewer than k singular values if there are numerically zero singular values, or if not enough Ritz values converge before the maximum number of Arnoldi update iterations is reached (in case the matrix A is ill-conditioned).
computeU : bool, optional
    Whether or not to compute U. If set to True, then U is computed by A * V * s^-1.
rCond : float, optional
    Reciprocal condition number. All singular values smaller than rCond * s[0] are treated as zero, where s[0] is the largest singular value.

Returns

SingularValueDecomposition
    Object that wraps the resulting U, s and V factors.
Examples

>>> rows = sc.parallelize([[3, 1, 1], [-1, 3, 1]])
>>> rm = RowMatrix(rows)

>>> svd_model = rm.computeSVD(2, True)
>>> svd_model.U.rows.collect()
[DenseVector([-0.7071, 0.7071]), DenseVector([-0.7071, -0.7071])]
>>> svd_model.s
DenseVector([3.4641, 3.1623])
>>> svd_model.V
DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, ...0.0], 0)
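Continuing the example, multiplying the three factors back together recovers A up to floating-point error (an illustrative local check, not part of the API):

>>> import numpy as np
>>> U = np.array([row.toArray() for row in svd_model.U.rows.collect()])
>>> np.round(U @ np.diag(svd_model.s.toArray()) @ svd_model.V.toArray().T)
array([[ 3.,  1.,  1.],
       [-1.,  3.,  1.]])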
multiply(matrix: pyspark.mllib.linalg.Matrix) → pyspark.mllib.linalg.distributed.RowMatrix

Multiply this matrix by a local dense matrix on the right.

New in version 2.2.0.

Parameters

matrix : pyspark.mllib.linalg.Matrix
    a local dense matrix whose number of rows must match the number of columns of this matrix

Returns

RowMatrix
    The resulting distributed row matrix.
Examples

>>> rm = RowMatrix(sc.parallelize([[0, 1], [2, 3]]))
>>> rm.multiply(DenseMatrix(2, 2, [0, 2, 1, 3])).rows.collect()
[DenseVector([2.0, 3.0]), DenseVector([6.0, 11.0])]
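Note that DenseMatrix stores its values in column-major order, so DenseMatrix(2, 2, [0, 2, 1, 3]) above is the matrix [[0, 1], [2, 3]]; the same product computed locally with NumPy (illustrative only):

>>> import numpy as np
>>> np.array([[0, 1], [2, 3]]) @ np.array([[0, 1], [2, 3]])
array([[ 2,  3],
       [ 6, 11]])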
numCols() → int

Get or compute the number of cols.

Examples

>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6],
...                        [7, 8, 9], [10, 11, 12]])

>>> mat = RowMatrix(rows)
>>> print(mat.numCols())
3

>>> mat = RowMatrix(rows, 7, 6)
>>> print(mat.numCols())
6
numRows() → int

Get or compute the number of rows.

Examples

>>> rows = sc.parallelize([[1, 2, 3], [4, 5, 6],
...                        [7, 8, 9], [10, 11, 12]])

>>> mat = RowMatrix(rows)
>>> print(mat.numRows())
4

>>> mat = RowMatrix(rows, 7, 6)
>>> print(mat.numRows())
7
tallSkinnyQR(computeQ: bool = False) → pyspark.mllib.linalg.QRDecomposition[Optional[pyspark.mllib.linalg.distributed.RowMatrix], pyspark.mllib.linalg.Matrix]

Compute the QR decomposition of this RowMatrix.

The implementation is designed to optimize the QR decomposition (factorization) for the RowMatrix of a tall and skinny shape [1].

[1] Paul G. Constantine, David F. Gleich. "Tall and skinny QR factorizations in MapReduce architectures". https://doi.org/10.1145/1996092.1996103

New in version 2.0.0.

Parameters

computeQ : bool, optional
    whether to compute Q

Returns

pyspark.mllib.linalg.QRDecomposition
    QRDecomposition(Q: RowMatrix, R: Matrix), where Q = None if computeQ = false.
 
Examples

>>> rows = sc.parallelize([[3, -6], [4, -8], [0, 1]])
>>> mat = RowMatrix(rows)
>>> decomp = mat.tallSkinnyQR(True)
>>> Q = decomp.Q
>>> R = decomp.R

>>> # Test with absolute values
>>> absQRows = Q.rows.map(lambda row: abs(row.toArray()).tolist())
>>> absQRows.collect()
[[0.6..., 0.0], [0.8..., 0.0], [0.0, 1.0]]

>>> # Test with absolute values
>>> abs(R.toArray()).tolist()
[[5.0, 10.0], [0.0, 1.0]]
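The doctest compares absolute values because the signs of Q and R are not unique; the product Q * R, however, always reconstructs the original matrix, which can be checked locally (illustrative only):

>>> import numpy as np
>>> Qmat = np.array([row.toArray() for row in Q.rows.collect()])
>>> np.allclose(Qmat @ R.toArray(), [[3, -6], [4, -8], [0, 1]])
True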
Attributes Documentation

rows

Rows of the RowMatrix stored as an RDD of vectors.

Examples

>>> mat = RowMatrix(sc.parallelize([[1, 2, 3], [4, 5, 6]]))
>>> rows = mat.rows
>>> rows.first()
DenseVector([1.0, 2.0, 3.0])