The table below lists the configuration options that were supplied to this benchmark.
{{ context.benchmark_configuration }}

The table below summarizes the invocation latency metrics measured from the client side. The metrics include the minimum, mean, median, and maximum latencies, as well as the interquartile range (IQR), which is the difference between the 75th and 25th percentile latencies. Only the successful memory configurations are included.
{{ context.stability_benchmark_summary }}

The distribution of latencies is summarized in the chart below. Longer latencies caused by cold starts are not included.
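For reference, the statistics reported in the summary table above can be reproduced from the raw client-side measurements. A minimal sketch, assuming `latencies_ms` is a list of per-invocation latencies in milliseconds collected by the benchmark client (the variable name and values are illustrative):

```python
import statistics


def summarize_latencies(latencies_ms):
    """Compute the summary statistics used in the latency table."""
    ordered = sorted(latencies_ms)
    # quantiles() with n=4 returns the 25th, 50th, and 75th percentiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "mean": statistics.mean(ordered),
        "median": median,
        "max": ordered[-1],
        "iqr": q3 - q1,  # spread between the 75th and 25th percentiles
    }


print(summarize_latencies([110.2, 98.7, 102.4, 95.1, 130.9, 101.3]))
```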
The average values of the metrics monitored by CloudWatch are captured below. The ModelSetupTime metric represents the time it takes to launch new compute resources for a serverless endpoint and indicates the impact of a cold start. This metric may not appear if endpoints are launched in a warm state. You can trigger a cold start by increasing the cold_start_delay parameter when configuring the benchmark. Alternatively, the CloudWatch metrics for the concurrency benchmark below are more likely to capture this metric due to the larger number of compute resources involved. Refer to the documentation for an explanation of each metric.
{{ context.stability_endpoint_metrics }}

This section analyzes the cost and performance of each memory configuration. It also provides an overview of the expected cost savings compared to a real-time endpoint running on a comparable SageMaker hosting instance.
The graph below visualizes the performance and cost trade-off of each memory configuration.
The table below estimates the savings relative to a real-time hosting instance as a function of the number of monthly invocations.
Optimal memory configuration: {{ context.optimal_memory_config }}
Comparable SageMaker Hosting Instance: {{ context.comparable_instance }}
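The savings estimate that follows is essentially a break-even comparison between serverless and instance-based pricing. A minimal sketch of the arithmetic, with purely illustrative prices and durations (actual rates vary by region and over time):

```python
def monthly_costs(invocations_per_month,
                  memory_gb=3.0,                   # serverless memory configuration
                  avg_duration_sec=0.25,           # average billed duration per invocation
                  price_per_gb_second=0.0000200,   # illustrative serverless rate
                  instance_price_per_hour=0.23):   # illustrative real-time instance rate
    """Compare estimated monthly cost of a serverless endpoint vs. an always-on instance."""
    serverless = invocations_per_month * memory_gb * avg_duration_sec * price_per_gb_second
    real_time = instance_price_per_hour * 24 * 30  # the instance is billed while the endpoint is up
    return {"serverless": serverless,
            "real_time": real_time,
            "savings": real_time - serverless}


for volume in (100_000, 1_000_000, 10_000_000):
    print(volume, monthly_costs(volume))
```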
{{ context.cost_savings_table }}

This benchmark tests the performance of the specified MaxConcurrency configurations. It helps determine the right setting to support the expected invocation volumes.
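MaxConcurrency is set on the endpoint's serverless configuration. A minimal sketch of how one configuration under test could be created with boto3 (the model and configuration names are illustrative; the benchmark tooling creates these for you):

```python
import boto3

sm = boto3.client("sagemaker")

# Each MaxConcurrency value under test gets its own endpoint configuration.
sm.create_endpoint_config(
    EndpointConfigName="concurrency-bench-mc-5",     # illustrative name
    ProductionVariants=[
        {
            "ModelName": "my-benchmark-model",       # illustrative; an existing SageMaker model
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 3072,   # memory configuration being benchmarked
                "MaxConcurrency": 5,      # maximum concurrent invocations for the endpoint
            },
        }
    ],
)
```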
Latency, error, and throughput (TPS) metrics are captured in the table below. These should help identify the minimum MaxConcurrency setting that can support the expected traffic.
{{ context.concurrency_benchmark_summary }}

The charts below summarize the latency distributions under different load patterns (number of concurrent clients) and MaxConcurrency settings.
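A load pattern of N concurrent clients can be approximated with a simple thread pool. A minimal sketch, assuming an existing serverless endpoint and a JSON payload (the endpoint name, payload, and volumes are illustrative):

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")


def invoke_once(endpoint_name, payload):
    """Invoke the endpoint once and return (latency_seconds, success_flag)."""
    start = time.perf_counter()
    try:
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        return time.perf_counter() - start, True
    except Exception:
        return time.perf_counter() - start, False


def run_load(endpoint_name, payload, concurrent_clients=10, total_invocations=200):
    """Drive the endpoint with a fixed number of concurrent clients."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_clients) as pool:
        results = list(pool.map(lambda _: invoke_once(endpoint_name, payload),
                                range(total_invocations)))
    elapsed = time.perf_counter() - started
    errors = sum(1 for _, ok in results if not ok)
    return {"tps": total_invocations / elapsed,
            "error_rate": errors / total_invocations,
            "latencies": [latency for latency, _ in results]}
```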
The average values of the metrics monitored by CloudWatch are captured below. The ModelSetupTime metric represents the time it takes to launch new compute resources for a serverless endpoint and indicates the impact of a cold start. Refer to the documentation for an explanation of each metric.
{{ context.concurrency_cloudwatch_metrics }}
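For deeper analysis, the same CloudWatch metrics, including ModelSetupTime, can be pulled programmatically. A minimal sketch using boto3 (the endpoint and variant names are illustrative):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Average ModelSetupTime over the last hour for one endpoint variant.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelSetupTime",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-serverless-endpoint"},  # illustrative
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,            # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```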