Package org.apache.spark.util
Class HadoopFSUtils

Object
org.apache.spark.util.HadoopFSUtils

Utility functions to simplify and speed up file listing.
Constructor Summary

HadoopFSUtils()

Method Summary

static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)

static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)
    Lists a collection of paths recursively.

static boolean shouldFilterOutPathName(String pathName)
    Checks if we should filter out this path name.
Constructor Details

HadoopFSUtils

public HadoopFSUtils()

Method Details
parallelListLeafFiles

public static scala.collection.Seq<scala.Tuple2<org.apache.hadoop.fs.Path,scala.collection.Seq<org.apache.hadoop.fs.FileStatus>>> parallelListLeafFiles(SparkContext sc, scala.collection.Seq<org.apache.hadoop.fs.Path> paths, org.apache.hadoop.conf.Configuration hadoopConf, org.apache.hadoop.fs.PathFilter filter, boolean ignoreMissingFiles, boolean ignoreLocality, int parallelismThreshold, int parallelismMax)

Lists a collection of paths recursively. Picks the listing strategy adaptively depending on the number of paths to list. This may only be called on the driver.

Parameters:
sc - Spark context used to run parallel listing.
paths - Input paths to list.
hadoopConf - Hadoop configuration.
filter - Path filter used to exclude leaf files from the result.
ignoreMissingFiles - Ignore missing files that occur during recursive listing (e.g., due to race conditions).
ignoreLocality - Whether to fetch data locality info when listing leaf files. If false, this will return FileStatus without BlockLocation info.
parallelismThreshold - The threshold to enable parallelism. If the number of input paths is smaller than this value, this will fall back to sequential listing.
parallelismMax - The maximum parallelism for listing. If the number of input paths is larger than this value, parallelism will be throttled to this value to avoid generating too many tasks.
Returns:
for each input path, the set of discovered files for the path
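The interplay of parallelismThreshold and parallelismMax described above can be sketched as a simple decision rule. This is an illustrative restatement of the documented behavior, not Spark's actual implementation; the ListingStrategy class and chooseParallelism method are hypothetical names introduced here for the example.

```java
// Illustrative sketch of the adaptive listing strategy described above.
// The class and method are hypothetical, not part of Spark's API.
public class ListingStrategy {

    /**
     * Decides how many parallel listing tasks to use for the given number of
     * input paths. Returns 1 when sequential listing should be used instead.
     */
    public static int chooseParallelism(int numPaths,
                                        int parallelismThreshold,
                                        int parallelismMax) {
        if (numPaths < parallelismThreshold) {
            // Fewer paths than the threshold: fall back to sequential listing.
            return 1;
        }
        // Throttle to parallelismMax to avoid generating too many tasks.
        return Math.min(numPaths, parallelismMax);
    }
}
```

For example, with parallelismThreshold = 10 and parallelismMax = 100, listing 3 paths runs sequentially, 50 paths use 50 tasks, and 500 paths are throttled to 100 tasks.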
shouldFilterOutPathName

public static boolean shouldFilterOutPathName(String pathName)

Checks if we should filter out this path name.
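Spark uses this check during listing to skip bookkeeping files, typically names beginning with "_" or "." (such as _SUCCESS markers or .crc checksum files). A simplified illustrative predicate is sketched below; it is an assumption based on that typical behavior, not Spark's actual implementation, and the PathNameFilter class is a hypothetical name.

```java
// Simplified illustrative predicate (an assumption, not Spark's actual code):
// treat names beginning with '_' or '.' as bookkeeping files to filter out,
// while keeping Parquet's _metadata and _common_metadata files, which Spark
// special-cases so they remain discoverable among leaf files.
public class PathNameFilter {

    public static boolean shouldFilterOutPathName(String pathName) {
        boolean looksInternal =
                pathName.startsWith("_") || pathName.startsWith(".");
        boolean isParquetMetadata =
                pathName.startsWith("_metadata")
                || pathName.startsWith("_common_metadata");
        return looksInternal && !isParquetMetadata;
    }
}
```

Under this rule, "_SUCCESS" and ".part-0000.crc" are filtered out, while ordinary data files like "part-00000.parquet" are kept.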
org$apache$spark$internal$Logging$$log_

public static org.slf4j.Logger org$apache$spark$internal$Logging$$log_()

org$apache$spark$internal$Logging$$log__$eq

public static void org$apache$spark$internal$Logging$$log__$eq(org.slf4j.Logger x$1)