The table below lists the configuration options that were supplied to this benchmark.
{{ context.benchmark_configuration }}

The table below summarizes the invocation latency metrics measured from the client side. The metrics include the minimum, mean, median, and maximum latencies, as well as the interquartile range (IQR), which is the difference between the 75th and 25th percentile latencies. Only the successful memory configurations are included.
{{ context.stability_benchmark_summary }}

The distribution of latencies is summarized in the chart below. Longer latencies caused by cold starts are not included.
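For reference, the statistics reported in the summary table above can be reproduced from the raw client-side measurements. A minimal sketch, assuming `latencies_ms` is a list of per-invocation latencies in milliseconds collected by the benchmark client (the variable name and values are illustrative):

```python
import statistics


def summarize_latencies(latencies_ms):
    """Compute the summary statistics used in the latency table."""
    ordered = sorted(latencies_ms)
    # quantiles() with n=4 returns the 25th, 50th, and 75th percentiles.
    q1, median, q3 = statistics.quantiles(ordered, n=4)
    return {
        "min": ordered[0],
        "mean": statistics.mean(ordered),
        "median": median,
        "max": ordered[-1],
        "iqr": q3 - q1,  # spread between the 75th and 25th percentiles
    }


print(summarize_latencies([110.2, 98.7, 102.4, 95.1, 130.9, 101.3]))
```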
The average values of the metrics monitored by CloudWatch are captured below. The ModelSetupTime metric represents the time it takes to launch new compute resources for a serverless endpoint and indicates the impact of a cold start. This metric may not appear if endpoints are launched in a warm state. You can trigger a cold start by increasing the cold_start_delay parameter when configuring the benchmark. Alternatively, the CloudWatch metrics for the concurrency benchmark below are more likely to capture this metric due to the larger number of compute resources involved. Refer to the documentation for an explanation of each metric.
{{ context.stability_endpoint_metrics }}

This section analyzes the cost and performance of each memory configuration. It also provides an overview of the expected cost savings compared to a real-time endpoint running on a comparable SageMaker hosting instance.
The graph below visualizes the performance and cost trade-off of each memory configuration.
The table below estimates the savings relative to a real-time hosting instance as a function of the number of monthly invocations.
Optimal memory configuration: {{ context.optimal_memory_config }}
Comparable SageMaker Hosting Instance: {{ context.comparable_instance }}
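The savings estimate that follows is essentially a break-even comparison between serverless and instance-based pricing. A minimal sketch of the arithmetic, with purely illustrative prices and durations (actual rates vary by region and over time):

```python
def monthly_costs(invocations_per_month,
                  memory_gb=3.0,                   # serverless memory configuration
                  avg_duration_sec=0.25,           # average billed duration per invocation
                  price_per_gb_second=0.0000200,   # illustrative serverless rate
                  instance_price_per_hour=0.23):   # illustrative real-time instance rate
    """Compare estimated monthly cost of a serverless endpoint vs. an always-on instance."""
    serverless = invocations_per_month * memory_gb * avg_duration_sec * price_per_gb_second
    real_time = instance_price_per_hour * 24 * 30  # the instance is billed while the endpoint is up
    return {"serverless": serverless,
            "real_time": real_time,
            "savings": real_time - serverless}


for volume in (100_000, 1_000_000, 10_000_000):
    print(volume, monthly_costs(volume))
```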
{{ context.cost_savings_table }}

This benchmark tests the performance of the specified MaxConcurrency configurations. It helps determine the right setting to support the expected invocation volumes.
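MaxConcurrency is set on the endpoint's serverless configuration. A minimal sketch of how one configuration under test could be created with boto3 (the model and configuration names are illustrative; the benchmark tooling creates these for you):

```python
import boto3

sm = boto3.client("sagemaker")

# Each MaxConcurrency value under test gets its own endpoint configuration.
sm.create_endpoint_config(
    EndpointConfigName="concurrency-bench-mc-5",     # illustrative name
    ProductionVariants=[
        {
            "ModelName": "my-benchmark-model",       # illustrative; an existing SageMaker model
            "VariantName": "AllTraffic",
            "ServerlessConfig": {
                "MemorySizeInMB": 3072,   # memory configuration being benchmarked
                "MaxConcurrency": 5,      # maximum concurrent invocations for the endpoint
            },
        }
    ],
)
```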
Latency, error, and throughput (TPS) metrics are captured in the table below. These should help identify the minimum MaxConcurrency setting that can support the expected traffic.
{{ context.concurrency_benchmark_summary }}

The charts below summarize the latency distributions under different load patterns (number of concurrent clients) and MaxConcurrency settings.
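A load pattern of N concurrent clients can be approximated with a simple thread pool. A minimal sketch, assuming an existing serverless endpoint and a JSON payload (the endpoint name, payload, and volumes are illustrative):

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor

import boto3

runtime = boto3.client("sagemaker-runtime")


def invoke_once(endpoint_name, payload):
    """Invoke the endpoint once and return (latency_seconds, success_flag)."""
    start = time.perf_counter()
    try:
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/json",
            Body=json.dumps(payload),
        )
        return time.perf_counter() - start, True
    except Exception:
        return time.perf_counter() - start, False


def run_load(endpoint_name, payload, concurrent_clients=10, total_invocations=200):
    """Drive the endpoint with a fixed number of concurrent clients."""
    started = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrent_clients) as pool:
        results = list(pool.map(lambda _: invoke_once(endpoint_name, payload),
                                range(total_invocations)))
    elapsed = time.perf_counter() - started
    errors = sum(1 for _, ok in results if not ok)
    return {"tps": total_invocations / elapsed,
            "error_rate": errors / total_invocations,
            "latencies": [latency for latency, _ in results]}
```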
The average values of the metrics monitored by CloudWatch are captured below. The ModelSetupTime metric represents the time it takes to launch new compute resources for a serverless endpoint and indicates the impact of a cold start. Refer to the documentation for an explanation of each metric.
{{ context.concurrency_cloudwatch_metrics }}
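For deeper analysis, the same CloudWatch metrics, including ModelSetupTime, can be pulled programmatically. A minimal sketch using boto3 (the endpoint and variant names are illustrative):

```python
from datetime import datetime, timedelta

import boto3

cloudwatch = boto3.client("cloudwatch")

# Average ModelSetupTime over the last hour for one endpoint variant.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/SageMaker",
    MetricName="ModelSetupTime",
    Dimensions=[
        {"Name": "EndpointName", "Value": "my-serverless-endpoint"},  # illustrative
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,            # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Average"])
```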