Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: CC-BY-SA-4.0
In the CreateTrainingJob request, you specify the training algorithm that you want to use. You can also specify algorithm-specific hyperparameters as string-to-string maps. The following table lists the hyperparameters for the k-means training algorithm provided by Amazon SageMaker. For more information about how k-means clustering works, see How K-Means Clustering Works.
Parameter Name | Description |
---|---|
feature_dim | The number of features in the input data. Required. Valid values: Positive integer. |
k | The number of required clusters. Required. Valid values: Positive integer. |
epochs | The number of passes done over the training data. Optional. Valid values: Positive integer. Default value: 1. |
eval_metrics | A JSON list of metric types used to report a score for the model. Allowed values are msd for mean squared distance and ssd for sum of squared distances. If test data is provided, the score is reported for each of the requested metrics. Optional. Valid values: Either ["msd"] or ["ssd"] or ["msd","ssd"]. Default value: ["msd"]. |
extra_center_factor | The algorithm creates K = k * extra_center_factor centers as it runs and reduces the number of centers from K to k when finalizing the model. Optional. Valid values: Either a positive integer or auto. Default value: auto. |
half_life_time_size | Used to determine the weight given to an observation when computing a cluster mean. This weight decays exponentially as more points are observed. When a point is first observed, it is assigned a weight of 1 when computing the cluster mean. The decay constant for the exponential decay function is chosen so that after observing half_life_time_size points, its weight is 1/2. If set to 0, there is no decay. Optional. Valid values: Non-negative integer. Default value: 0. |
init_method | Method by which the algorithm chooses the initial cluster centers. The standard k-means approach chooses them at random. An alternative k-means++ method chooses the first cluster center at random. Then it spreads out the position of the remaining initial clusters by weighting the selection of centers with a probability distribution that is proportional to the square of the distance of the remaining data points from existing centers. Optional. Valid values: Either random or kmeans++. Default value: random. |
local_lloyd_init_method | The initialization method for Lloyd’s expectation-maximization (EM) procedure used to build the final model containing k centers. Optional. Valid values: Either random or kmeans++. Default value: kmeans++. |
local_lloyd_max_iter | The maximum number of iterations for Lloyd’s expectation-maximization (EM) procedure used to build the final model containing k centers. Optional. Valid values: Positive integer. Default value: 300. |
local_lloyd_num_trials | The number of times Lloyd’s expectation-maximization (EM) procedure is run when building the final model containing k centers. The run with the least loss is kept. Optional. Valid values: Either a positive integer or auto. Default value: auto. |
local_lloyd_tol | The tolerance for change in loss for early stopping of Lloyd’s expectation-maximization (EM) procedure used to build the final model containing k centers. Optional. Valid values: Float in the range [0, 1]. Default value: 0.0001. |
mini_batch_size | The number of observations per mini-batch for the data iterator. Optional. Valid values: Positive integer. Default value: 5000. |
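
Because hyperparameters are passed as string-to-string maps, every value in the table has to be serialized to a string before it goes into the request. The following is a minimal sketch of that conversion, not code from the SageMaker SDK: the helper name `to_string_map` and the sample values (such as a `feature_dim` of 784) are assumptions chosen for illustration.

```python
import json


def to_string_map(params):
    """Convert hyperparameter values to strings (hypothetical helper).

    Lists are encoded as JSON text so that a value such as eval_metrics
    stays a valid JSON list; everything else goes through str().
    """
    out = {}
    for name, value in params.items():
        if isinstance(value, (list, dict)):
            out[name] = json.dumps(value)
        else:
            out[name] = str(value)
    return out


# Illustrative values only; feature_dim and k depend on your data.
hyperparameters = to_string_map({
    "feature_dim": 784,
    "k": 10,
    "epochs": 1,
    "eval_metrics": ["msd", "ssd"],
    "init_method": "kmeans++",
    "mini_batch_size": 5000,
})
```

The resulting dictionary contains only string keys and string values, which is the shape the request expects for algorithm-specific hyperparameters.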
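
The half_life_time_size description can be read as a simple half-life formula. The sketch below assumes the weight of an observation is 0.5 raised to the number of points seen since it was observed, divided by half_life_time_size. The table states only that the decay is exponential with that half-life, so the exact expression and the function name `observation_weight` are assumptions.

```python
def observation_weight(points_since_observed, half_life_time_size):
    """Weight of an observation after more points have been seen.

    Assumed form: weight = 0.5 ** (n / half_life_time_size).
    A half_life_time_size of 0 means no decay, so the weight stays 1.
    """
    if half_life_time_size == 0:
        return 1.0
    return 0.5 ** (points_since_observed / half_life_time_size)


# A newly observed point has weight 1; after one half-life it has weight 1/2.
w_new = observation_weight(0, 100)
w_half = observation_weight(100, 100)
w_quarter = observation_weight(200, 100)
```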
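
The kmeans++ option for init_method can be illustrated with a small, standard-library sketch of the seeding step: the first center is picked uniformly at random, and each later center is sampled with probability proportional to the squared distance from a point to its nearest existing center. This is a textbook version written for illustration. It is not the SageMaker implementation, and the function name `kmeans_pp_init` and the `seed` parameter are assumptions.

```python
import random


def kmeans_pp_init(points, k, seed=0):
    """Pick k initial centers from points using k-means++ style seeding."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        if sum(d2) == 0:
            # Every point coincides with a center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            # Sample the next center proportionally to squared distance.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


sample_points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1)]
initial_centers = kmeans_pp_init(sample_points, k=2)
```

Points that are already centers have a squared distance of 0 and therefore are not drawn again, which is what spreads the initial centers apart.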