Class RDDSampleUtils


  • public class RDDSampleUtils
    extends Object
    Utility class for choosing how many records to sample when partitioning an RDD.
    • Constructor Detail

      • RDDSampleUtils

        public RDDSampleUtils()
    • Method Detail

      • getSampleNumbers

        public static int getSampleNumbers(int numPartitions,
                                           long totalNumberOfRecords,
                                           int givenSampleNumbers)
        Returns the number of samples to take in order to partition the RDD into the specified number of partitions.

        The number of partitions cannot exceed half the number of records in the RDD.

        If givenSampleNumbers is not -1, returns that number. Otherwise, returns the total number of records if it is less than 1000; for larger RDDs, returns 1% of the total number of records or twice the number of partitions, whichever is larger. Never returns a number greater than Integer.MAX_VALUE.

        Parameters:
        numPartitions - the desired number of partitions
        totalNumberOfRecords - the total number of records in the RDD
        givenSampleNumbers - the requested number of samples, or -1 to have the number computed
        Returns:
        the number of samples to take
        Throws:
        IllegalArgumentException - if the requested number of samples exceeds the total number of records, or if the requested number of partitions exceeds half of the total number of records
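
        The rules described for getSampleNumbers can be restated as a small standalone sketch. This is not the library's source: the class name SampleNumbersSketch, the method name sketchSampleNumbers, the exception messages, and the order in which the checks run are all assumptions made for illustration, based only on the description above.

```java
// Hypothetical restatement of the documented rules; NOT the actual RDDSampleUtils source.
class SampleNumbersSketch {

    // Assumed order of checks: explicit request first, then the partition limit,
    // then the size-based defaults.
    static int sketchSampleNumbers(int numPartitions, long totalNumberOfRecords, int givenSampleNumbers) {
        // An explicit request (anything other than -1) is returned as-is,
        // unless it exceeds the number of records.
        if (givenSampleNumbers != -1) {
            if (givenSampleNumbers > totalNumberOfRecords) {
                throw new IllegalArgumentException("requested samples exceed total records");
            }
            return givenSampleNumbers;
        }

        // The number of partitions cannot exceed half the number of records.
        // 2L forces long arithmetic so the doubling cannot overflow an int.
        if (2L * numPartitions > totalNumberOfRecords) {
            throw new IllegalArgumentException("partitions exceed half of total records");
        }

        // Small inputs: sample every record.
        if (totalNumberOfRecords < 1000) {
            return (int) totalNumberOfRecords;
        }

        // Larger inputs: the larger of 1% of the records and twice the partitions,
        // capped at Integer.MAX_VALUE.
        long onePercent = totalNumberOfRecords / 100;
        long candidate = Math.max(onePercent, 2L * numPartitions);
        return (int) Math.min(candidate, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        System.out.println(sketchSampleNumbers(10, 500L, -1));        // 500
        System.out.println(sketchSampleNumbers(10, 100_000L, -1));    // 1000
        System.out.println(sketchSampleNumbers(2_000, 100_000L, -1)); // 4000
        System.out.println(sketchSampleNumbers(10, 100_000L, 250));   // 250
    }
}
```

        In the third call, twice the number of partitions (4000) is larger than 1% of the records (1000), so the partition-based minimum is returned.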