Enhance Spark RandomForestModel objects with methods for Random Forest Clustering
Enhance a Spark DecisionTreeModel object with methods for Random Forest clustering
An object for training a K-Medoids clustering model on Seq or RDD data.
Represents a K-Medoids clustering model
An object for training a Random Forest clustering model on unsupervised data.
Data is required to have a mapping into a feature space of type Seq[Double].
A feature extraction function for data objects
A map from feature indexes into numbers of categories. Feature indexes that do not have an entry in the map are assumed to be numeric, not categorical. Defaults to category-info from Extractor, if the feature extraction function is of this type. Otherwise defaults to empty, i.e. all numeric features.
The size of synthetic (margin-sampled) data to be constructed. Defaults to the size of the input data.
The number of decision trees to train in the Random Forest. Defaults to 10.
Maximum decision tree depth. Defaults to 5.
Maximum number of histogram bins to use for numeric data. Defaults to 5.
The number of clusters to use when clustering leaf-id vectors. Defaults to an automatic estimation of a "good" number of clusters.
Maximum clustering refinement iterations to compute. Defaults to 25.
Halt clustering if the clustering metric-cost changes by less than this value. Defaults to 0.
Halt clustering if the clustering metric-cost changes by less than this fraction of the previous iteration's cost. Defaults to 0.0001.
If the data is larger than this size, cluster on a random sample of this size. Defaults to 1000.
Use this number of threads to accelerate clustering. Defaults to 1.
A seed to use for RNG. Defaults to using a randomized seed value.
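The synthetic (margin-sampled) data mentioned above can be illustrated with a small sketch. It assumes that "margin-sampled" means each feature of a synthetic row is drawn independently from the observed values of that feature, which keeps the per-feature marginal distributions but breaks correlations between features. The object `MarginSampling` and its method `synthesize` are hypothetical names used for illustration only; they are not part of the documented API.

```scala
import scala.util.Random

// Hypothetical helper for illustration; not part of the documented API.
object MarginSampling {
  // Draw `n` synthetic rows. Each feature value is sampled independently
  // from that feature's observed column (assumed meaning of "margin-sampled").
  def synthesize(data: Seq[Seq[Double]], n: Int, rng: Random): Seq[Seq[Double]] = {
    require(data.nonEmpty, "data must be non-empty")
    val rows = data.toIndexedSeq.map(_.toIndexedSeq)
    val numFeatures = rows.head.length
    require(rows.forall(_.length == numFeatures), "rows must have equal length")
    Vector.fill(n) {
      // For feature j, pick a random row and take its j-th value.
      Vector.tabulate(numFeatures) { j => rows(rng.nextInt(rows.length))(j) }
    }
  }
}
```

Calling `MarginSampling.synthesize(data, data.length, new Random(seed))` would mirror the documented default, where the synthetic data has the same size as the input data.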
Represents a Random Forest clustering model of some data objects
Class definitions for ClusteringTreeModel methods
Utilities used by K-Medoids clustering
Utility functions for KMedoidsModel
Factory functions and implicits for RandomForestCluster
Factory functions and implicits for RandomForestClusterModel
An object for training a K-Medoids clustering model on Seq or RDD data.
Data is required to have a metric function defined on it, but it does not require an algebra over data elements, as K-Means clustering does.
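To illustrate why a metric alone is enough, the sketch below finds a medoid and assigns elements to their nearest medoid for strings, which have an edit distance but no meaningful average. The object `MedoidSketch` and its methods are hypothetical and written for this example; they do not reproduce the library's training procedure.

```scala
// Hypothetical helpers for illustration; not the library's implementation.
object MedoidSketch {
  // Levenshtein edit distance: a metric on strings, which have no mean.
  def editDistance(a: String, b: String): Double = {
    var prev = Array.tabulate(b.length + 1)(j => j)
    for (i <- 1 to a.length) {
      val cur = new Array[Int](b.length + 1)
      cur(0) = i
      for (j <- 1 to b.length) {
        val sub = prev(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        cur(j) = math.min(sub, math.min(prev(j) + 1, cur(j - 1) + 1))
      }
      prev = cur
    }
    prev(b.length).toDouble
  }

  // The medoid is the element of `xs` with the smallest total distance
  // to all elements of `xs`. It is always an actual data element.
  def medoid[T](xs: Seq[T])(metric: (T, T) => Double): T = {
    require(xs.nonEmpty, "xs must be non-empty")
    xs.minBy(x => xs.map(y => metric(x, y)).sum)
  }

  // Assign each element to the index of its nearest medoid.
  def assign[T](xs: Seq[T], medoids: Seq[T])(metric: (T, T) => Double): Seq[Int] =
    xs.map(x => medoids.indices.minBy(i => metric(x, medoids(i))))
}
```

Because the cluster center is chosen from the data rather than computed by averaging, no addition or scaling of elements is needed, only pairwise distances.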
The distance metric imposed on data elements
The number of clusters to use. If k is zero, the clustering will attempt to identify a number of clusters that is "good" w.r.t. Minimum Description Length.
The maximum number of model refinement iterations to run
The epsilon threshold to use. Must be >= 0. If c1 is the current clustering model cost, and c0 is the cost of the previous model, then refinement halts when (c0 - c1) <= epsilon (Lower cost is better).
The fractionEpsilon threshold to use. Must be >= 0. If c1 is the current clustering model cost, and c0 is the cost of the previous model, then refinement halts when (c0 - c1) / c0 <= fractionEpsilon (Lower cost is better).
The target size of the random sample. Must be > 0.
The number of threads to use while clustering
The random seed to use for RNG. Training runs that start from the same random seed produce the same result. By default, training runs will vary randomly.
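The two halting conditions described for epsilon and fractionEpsilon can be written as a single predicate. This is a minimal sketch: the name `shouldHalt` is hypothetical, combining the two conditions with a logical OR is an assumption based on each condition being described as sufficient on its own, and the guard against a zero previous cost is an added assumption to avoid division by zero.

```scala
// Hypothetical helper for illustration; not part of the documented API.
object HaltingSketch {
  // c0: cost of the previous model, c1: cost of the current model.
  // Lower cost is better, so (c0 - c1) is the improvement this iteration.
  def shouldHalt(c0: Double, c1: Double, epsilon: Double, fractionEpsilon: Double): Boolean = {
    require(epsilon >= 0.0, "epsilon must be >= 0")
    require(fractionEpsilon >= 0.0, "fractionEpsilon must be >= 0")
    val delta = c0 - c1
    // Halt on a small absolute improvement, or on a small relative one.
    // The c0 > 0 guard is an assumption added to avoid dividing by zero.
    delta <= epsilon || (c0 > 0.0 && delta / c0 <= fractionEpsilon)
  }
}
```

For example, with epsilon = 0 and fractionEpsilon = 0.0001, a cost drop from 100 to 99.995 is a relative improvement of 0.00005 and would halt refinement, while a drop from 100 to 90 would not.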