Package org.apache.spark.util
Class HadoopFSUtils

Object
org.apache.spark.util.HadoopFSUtils

Utility functions to simplify and speed up file listing.
Constructor Summary

HadoopFSUtils()

Method Summary

static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
    Lists a collection of paths recursively.

static boolean shouldFilterOutPathName(String pathName)
    Checks if we should filter out this path name.
Constructor Details

HadoopFSUtils

public HadoopFSUtils()

Method Details
parallelListLeafFiles

public static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)

Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list. This may only be called on the driver.

Parameters:
sc - Spark context used to run parallel listing.
paths - Input paths to list.
hadoopConf - Hadoop configuration.
filter - Path filter used to exclude leaf files from the result.
ignoreMissingFiles - Ignore missing files that occur during recursive listing (e.g., due to race conditions).
ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this will return FileStatus without BlockLocation info.
parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
Returns:
for each input path, the set of discovered files for the path
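The interplay of parallelismThreshold and parallelismMax described above can be sketched as a simple decision rule. This is an illustrative restatement of the documented behavior, not Spark's actual implementation; the ListingStrategy class and chooseParallelism method are hypothetical names introduced here for the example.

```java
// Illustrative sketch of the adaptive listing strategy described above.
// The class and method are hypothetical, not part of Spark's API.
public class ListingStrategy {

    /**
     * Decides how many parallel listing tasks to use for the given number of
     * input paths. Returns 1 when sequential listing should be used instead.
     */
    public static int chooseParallelism(int numPaths,
                                        int parallelismThreshold,
                                        int parallelismMax) {
        if (numPaths < parallelismThreshold) {
            // Fewer paths than the threshold: fall back to sequential listing.
            return 1;
        }
        // Throttle to parallelismMax to avoid generating too many tasks.
        return Math.min(numPaths, parallelismMax);
    }
}
```

For example, with parallelismThreshold = 10 and parallelismMax = 100, listing 3 paths runs sequentially, 50 paths use 50 tasks, and 500 paths are throttled to 100 tasks.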
shouldFilterOutPathName

public static boolean shouldFilterOutPathName(String pathName)

Checks if we should filter out this path name.
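Spark uses this check during listing to skip bookkeeping files, typically names beginning with "_" or "." (such as _SUCCESS markers or .crc checksum files). A simplified illustrative predicate is sketched below; it is an assumption based on that typical behavior, not Spark's actual implementation, and the PathNameFilter class is a hypothetical name.

```java
// Simplified illustrative predicate (an assumption, not Spark's actual code):
// treat names beginning with '_' or '.' as bookkeeping files to filter out,
// while keeping Parquet's _metadata and _common_metadata files, which Spark
// special-cases so they remain discoverable among leaf files.
public class PathNameFilter {

    public static boolean shouldFilterOutPathName(String pathName) {
        boolean looksInternal =
                pathName.startsWith("_") || pathName.startsWith(".");
        boolean isParquetMetadata =
                pathName.startsWith("_metadata")
                || pathName.startsWith("_common_metadata");
        return looksInternal && !isParquetMetadata;
    }
}
```

Under this rule, "_SUCCESS" and ".part-0000.crc" are filtered out, while ordinary data files like "part-00000.parquet" are kept.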
org$apache$spark$internal$Logging$$log_

public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

org$apache$spark$internal$Logging$$log__$eq

public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)