Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
SPDX-License-Identifier: CC-BY-SA-4.0
To preprocess an entire dataset or to get inferences for it, use batch transform. Use batch transform when you need to work with large datasets or process datasets quickly, and when you don't need sub-second latency. Use preprocessing to remove noise or bias that interferes with training or inference from your dataset. Use batch transform for inference when you don't need a persistent endpoint. You can use batch transform, for example, to compare production variants that deploy different models.
To filter input data before performing inferences, or to associate input records with inferences about those records, see Associate Prediction Results with their Corresponding Input Records. This is useful, for example, for providing context when creating and interpreting reports about the output data.
For more information about batch transform, see Get Inferences for an Entire Dataset with Batch Transform.
Topics
+ Use Batch Transform with Large Datasets
+ Speed Up a Batch Transform Job
+ Use Batch Transform to Test Production Variants
+ Batch Transform Errors
+ Batch Transform Sample Notebooks
+ Associate Prediction Results with their Corresponding Input Records
Batch transform automatically manages the processing of large datasets within the limits of specified parameters. For example, suppose that you have a dataset file, input1.csv, stored in an S3 bucket. The content of the input file might look like this:
Record1-Attribute1, Record1-Attribute2, Record1-Attribute3, ..., Record1-AttributeM
Record2-Attribute1, Record2-Attribute2, Record2-Attribute3, ..., Record2-AttributeM
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
...
RecordN-Attribute1, RecordN-Attribute2, RecordN-Attribute3, ..., RecordN-AttributeM
When a batch transform job starts, Amazon SageMaker initializes compute instances and distributes the inference or preprocessing workload between them. When you have multiple files, one instance might process input1.csv, and another instance might process a file named input2.csv. To keep large payloads within the MaxPayloadInMB limit, you might split an input file into several mini-batches. For example, you might create a mini-batch from input1.csv, as follows.
Record3-Attribute1, Record3-Attribute2, Record3-Attribute3, ..., Record3-AttributeM
Record4-Attribute1, Record4-Attribute2, Record4-Attribute3, ..., Record4-AttributeM
Note
Amazon SageMaker processes each input file separately. It doesn’t combine mini-batches from different input files to comply with the MaxPayloadInMB limit.
To split input files into mini-batches, when you create a batch transform job, set the SplitType parameter value to Line. If SplitType is set to None, or if an input file can't be split into mini-batches, Amazon SageMaker uses the entire input file in a single request.
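The following is a minimal sketch of creating such a job with the AWS SDK for Python (Boto3). The job name, model name, bucket paths, and instance type are placeholders; substitute your own values.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names and paths; replace with your own model, bucket, and instance type.
sm.create_transform_job(
    TransformJobName="csv-batch-transform-example",
    ModelName="my-trained-model",
    TransformInput={
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://awsexamplebucket/input/",
            }
        },
        "ContentType": "text/csv",
        "SplitType": "Line",  # split each input file into mini-batches by line
    },
    TransformOutput={"S3OutputPath": "s3://awsexamplebucket/output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```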
If the batch transform job successfully processes all of the records in an input file, it creates an output file with the same name and an .out file extension. For multiple input files, such as input1.csv and input2.csv, the output files are named input1.csv.out and input2.csv.out. The batch transform job stores the output files in the specified location in Amazon S3, such as s3://awsexamplebucket/output/. The predictions in an output file are listed in the same order as the corresponding records in the input file. The following would be the contents of the output file input1.csv.out, based on the input file shown earlier.
Inference1-Attribute1, Inference1-Attribute2, Inference1-Attribute3, ..., Inference1-AttributeM
Inference2-Attribute1, Inference2-Attribute2, Inference2-Attribute3, ..., Inference2-AttributeM
Inference3-Attribute1, Inference3-Attribute2, Inference3-Attribute3, ..., Inference3-AttributeM
...
InferenceN-Attribute1, InferenceN-Attribute2, InferenceN-Attribute3, ..., InferenceN-AttributeM
To combine the results of multiple output files into a single output file, set the AssembleWith parameter to Line.
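For example, in the TransformOutput section of the same CreateTransformJob request, you might pair line splitting with line assembly. This is a minimal sketch; the bucket path is a placeholder.

```python
# Pair SplitType="Line" in TransformInput with AssembleWith="Line" here so that
# per-mini-batch results are reassembled into one line-delimited output file.
transform_output = {
    "S3OutputPath": "s3://awsexamplebucket/output/",  # placeholder output location
    "Accept": "text/csv",
    "AssembleWith": "Line",
}
```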
When the input data is very large and is transmitted using HTTP chunked encoding, set MaxPayloadInMB to 0 to stream the data to the algorithm. Currently, Amazon SageMaker built-in algorithms don't support this feature.
For information about using the API to create a batch transform job, see the CreateTransformJob API. For more information about the correlation between batch transform input and output objects, see OutputDataConfig. For an example of how to use batch transform, see Step 6.2: Deploy the Model with Batch Transform.
If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using different parameter values, such as MaxPayloadInMB, MaxConcurrentTransforms, and BatchStrategy. Amazon SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
If you are using the Amazon SageMaker console, you can reduce the time it takes to complete batch transform jobs by using different parameter values, such as Max payload size (MB), Max concurrent transforms, and Batch strategy, in the Additional configuration section of the Batch transform job configuration page. Amazon SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
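For a custom algorithm, the execution-parameters endpoint is an HTTP route that your inference container exposes; SageMaker calls it when the transform job starts and uses the returned values as defaults. The following is a minimal sketch using Flask (an implementation choice, not a requirement), and the returned values are arbitrary placeholders rather than recommendations.

```python
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/execution-parameters", methods=["GET"])
def execution_parameters():
    # Values returned here become the job's defaults; tune them for your algorithm.
    return jsonify({
        "MaxConcurrentTransforms": 4,
        "BatchStrategy": "MULTI_RECORD",
        "MaxPayloadInMB": 6,
    })

@app.route("/ping", methods=["GET"])
def ping():
    # Health check that SageMaker containers must answer.
    return "", 200

if __name__ == "__main__":
    # SageMaker inference containers listen on port 8080.
    app.run(host="0.0.0.0", port=8080)
```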
To test different models or various hyperparameter settings, create a separate transform job for each new model variant and use a validation dataset. For each transform job, specify a unique model name and location in Amazon S3 for the output file. To analyze the results, use Inference Pipeline Logs and Metrics.
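As a sketch of this workflow with the AWS SDK for Python (Boto3), assuming two hypothetical model variants and a placeholder validation dataset location:

```python
import boto3

sm = boto3.client("sagemaker")

# Hypothetical model names; each was registered separately with CreateModel.
for model_name in ["variant-a-model", "variant-b-model"]:
    sm.create_transform_job(
        TransformJobName=f"validate-{model_name}",
        ModelName=model_name,
        TransformInput={
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://awsexamplebucket/validation/",  # placeholder validation set
            }},
            "ContentType": "text/csv",
            "SplitType": "Line",
        },
        # A unique output prefix per variant keeps the results separate for comparison.
        TransformOutput={"S3OutputPath": f"s3://awsexamplebucket/output/{model_name}/"},
        TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    )
```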
Amazon SageMaker uses the Amazon S3 Multipart Upload API to upload results from a batch transform job to Amazon S3. If an error occurs, the uploaded results are removed from Amazon S3. In some cases, such as when a network outage occurs, an incomplete multipart upload might remain in Amazon S3. To avoid incurring storage charges, we recommend that you add a rule to the S3 bucket's lifecycle configuration that deletes incomplete multipart uploads that might be stored in the bucket. For more information, see Object Lifecycle Management.
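For example, with Boto3 you might configure a lifecycle rule that aborts incomplete multipart uploads after a set number of days. The bucket name and the seven-day window below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Placeholder bucket name. The rule aborts multipart uploads that remain
# incomplete after 7 days, so leftover parts don't accrue storage charges.
s3.put_bucket_lifecycle_configuration(
    Bucket="awsexamplebucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-multipart-uploads",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)
```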
If a batch transform job fails to process an input file because of a problem with the dataset, Amazon SageMaker marks the job as “failed” to alert you. If an input file contains a bad record, the transform job doesn’t create an output file for that input file because it can’t maintain the same order in the transformed data. When your dataset has multiple input files, a transform job continues to process input files even if it fails to process one. The processed files still generate useable results.
Exceeding the MaxPayloadInMB limit causes an error. This might happen with a large dataset if it can't be split, the SplitType parameter is set to None, or individual records within the dataset exceed the limit.
If you are using your own algorithms, you can use placeholder text, such as ERROR, when the algorithm finds a bad record in an input file. For example, if the last record in a dataset is bad, the algorithm should place the error placeholder for that record in the output file.
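The following sketch illustrates the idea in Python; the CSV parsing and the predict_fn callback are hypothetical stand-ins for your algorithm's own logic, not part of any SageMaker API.

```python
def transform_records(lines, predict_fn):
    """Emit one output line per input record, writing a placeholder for records
    that can't be parsed so the output stays aligned with the input order."""
    outputs = []
    for line in lines:
        try:
            features = [float(value) for value in line.split(",")]
            outputs.append(str(predict_fn(features)))
        except ValueError:
            # Bad record: keep its position with a placeholder instead of failing.
            outputs.append("ERROR")
    return "\n".join(outputs)
```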
For a sample notebook that uses batch transform with a principal component analysis (PCA) model to reduce data in a user-item review matrix, followed by DBSCAN to cluster movies, see https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.ipynb. For instructions on creating and accessing Jupyter notebook instances that you can use to run the example in Amazon SageMaker, see Use Notebook Instances. After creating and opening a notebook instance, choose the SageMaker Examples tab to see a list of all the Amazon SageMaker examples. The batch transform example notebooks are located in the Advanced functionality section. To open a notebook, choose its Use tab, then choose Create copy.