# Application level metrics

The AWS IoT FleetWise Edge agent includes a [TraceModule](../src/platform/linux/logmanagement/src/TraceModule.cpp). The TraceModule provides a set of metrics that serve as an entry point for diagnosing issues efficiently, saving you time because you no longer need to review the full logs of every running edge agent instance.

- **`RFrames0` - `RFrames19`** are monotonic counters of the number of raw CAN frames read on each bus. If these counters stay at zero or stop increasing over a longer runtime, either there is no CAN bus traffic or no data collection campaign is bound to CAN data (e.g. an OBD2-only campaign).
- **`ConInt`** and **`ConRes`** enable you to monitor the number of MQTT connection interruptions and connection resumptions. Whether and how quickly a connection loss is detected depends on the kernel configuration parameters `/proc/sys/net/ipv4/tcp_keepalive*` and the compile-time constants of AWS IoT FleetWise Edge: `MQTT_CONNECT_KEEP_ALIVE_SECONDS` and `MQTT_PING_TIMEOUT_MS`. If the values of the metric `ConInt` are not zero, the internet coverage in the tested environment might be unreliable, or `MQTT_PING_TIMEOUT_MS`, which defaults to 3 seconds, needs to be increased because of high latency to the AWS IoT Core endpoint. Changing the AWS Region can help to decrease latency.
- **`CeTrgCnt`** is a monotonic counter that monitors the number of triggers (inspection rules) detected since the AWS IoT FleetWise Edge process started. Triggers are detected if one or more data collection campaign conditions are true. If this counter is larger than zero but no data appears in the cloud, either no actual data was collected (for example, a time-based data collection campaign with no bus activity), or the data was ingested to the cloud but there was an error processing it. To debug this, [enable cloud logs in AWS IoT FleetWise settings](https://docs.aws.amazon.com/iot-fleetwise/latest/developerguide/logging-cw.html).
- **`QUEUE_CONSUMER_TO_INSPECTION_SIGNALS`** monitors the current number of signals queued for the signal history buffer. If this value is close to the value defined in the static config `decodedSignalsBufferSize`, increase that static config value, decrease `inspectionThreadIdleTimeMs`, reduce the bus load, or reduce the number of decoded signals in the decoder manifest in the cloud.
- **`ConRej`** monitors the number of MQTT connection rejects. If this is not zero, check the certificates and make sure you use a unique client ID for each vehicle.
- **`ConFail`** monitors the number of MQTT connection failures. This can have multiple root causes. If this is not zero, check the logs and search for `Connection failed with error`.
- **`FWE_STARTUP`** and **`FWE_SHUTDOWN`** provide the amount of time it takes to start and stop the AWS IoT FleetWise Edge process. If any value is more than 5 seconds, review the logs and make sure all required resources, such as internet connectivity and buses, are available before starting the process.
- **`ObdE0`** to **`ObdE3`** monitor errors related to the OBD session. If you see non-zero values, make sure you're connected to a compatible OBD vehicle that is powered on. Otherwise, turn off OBD signal collection in the cloud.
- **`PmE3`** provides hints on whether the data persistency framework (a mechanism used to store and forward vehicle data when no connectivity is available) has an error. If this error counter is not zero, make sure that the directory defined in `persistencyPath` is writeable and that there is space available in the filesystem.
- **`SysKerTimeDiff`** shows the difference between the CAN frame RX timestamp from the kernel and the system time. If this is significantly higher than `socketCANThreadIdleTimeMs`, which is 50 milliseconds in the default configuration, the timestamps from the kernel are out of sync. Make sure an updated SocketCAN driver for your CAN device is used. Alternatively, consider switching `timestampType` in the static config to `Polling`. This will affect timestamp precision; consider reducing the polling time `socketCANThreadIdleTimeMs` to mitigate this.
- **`CeSCnt`** is a monotonic counter of the signals decoded and processed since startup. This can be used for performance evaluations.
- **`CpuPercentageSum`** and **`CpuThread_*`** track the CPU usage for the complete process and per thread. On multi-core systems this can be above 100%. AWS IoT FleetWise Edge uses the Linux `/proc/` directory to calculate this information.
- **`MemoryMaxResidentRam`** gives the maximum bytes of resident RAM used by the process. If this is above 50 MB, consider switching from a CMake Debug to a Release build. The queue sizes in the static config can also be reduced.
- **`CampaignFailures`** monitors errors related to campaign activation. If you see non-zero values, check the logs. Make sure not to deploy more campaigns in parallel than defined in `MAX_NUMBER_OF_ACTIVE_CONDITION`, which defaults to 256. Also check that the `maxSampleCount` of all collected signals fits into the memory used for the signal history buffer, defined in `MAX_SAMPLE_MEMORY`, which defaults to 20 MB.
- **`CampaignRxToDataTx`** provides the amount of time it takes from changing the set of active campaigns to the first signal data being published. If at least one time-based collection scheme is active, this should be at most the time period of that collection scheme.

# How to collect metrics from a FWE

There are multiple ways to collect metrics, depending on how AWS IoT FleetWise Edge (FWE) is integrated. We describe two methods: using the RemoteProfiler, and collecting and processing logs to extract the metrics (for example through AWS Systems Manager). Each method incurs charges for different AWS services such as [AWS IoT Core](https://aws.amazon.com/iot-core/pricing/), [Amazon CloudWatch](https://aws.amazon.com/cloudwatch/pricing/), [AWS Systems Manager](https://aws.amazon.com/systems-manager/pricing/) and more. For example, using the [RemoteProfiler](#method-1-use-the-remoteprofiler-module) method, AWS IoT FleetWise Edge uploads all metrics, currently ~300, at your configured interval. Per 10 metric data points uploaded, at least one message is published to AWS IoT Core and one AWS IoT Rules Engine action is executed. If `profilerPrefix` is different for every vehicle, ~300 new Amazon CloudWatch metrics will be used per vehicle.
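As a rough estimate, with ~300 metrics and `metricsUploadIntervalMs` set to 60000 (as in the Method 1 example below), each vehicle publishes at least 300 / 10 = 30 messages per upload interval, which adds up to roughly 30 × 60 × 24 = 43,200 MQTT messages and AWS IoT Rules Engine actions per vehicle per day.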
## Method 1: Use the RemoteProfiler module

The RemoteProfiler module is provided as part of the AWS IoT FleetWise Edge C++ code base. If activated, it regularly ingests the metrics and logs to AWS IoT Core topics, which have underlying AWS IoT Core rules and actions to route the data to Amazon CloudWatch. The existing MQTT connection used for ingesting data collection campaign data is reused for this purpose. To activate the RemoteProfiler, add the following parameters to your config file:

```json
{
  ...
  "staticConfig": {
    ...
    "mqttConnection": {
      ...
"metricsUploadTopic": "aws-iot-fleetwise-metrics-upload", "loggingUploadTopic": "aws-iot-fleetwise-logging-upload" }, "remoteProfilerDefaultValues": { "loggingUploadLevelThreshold": "Warning", "metricsUploadIntervalMs": 60000, "loggingUploadMaxWaitBeforeUploadMs": 60000, "profilerPrefix": "TestVehicle1" }, } } ``` In the above example configuration, a plain text json file with metrics will be uploaded to the AWS IoT Core topic: `aws-iot-fleetwise-metrics-upload` and log messages of level Warning and Error to the topic `aws-iot-fleetwise-logging-upload`. If `profilerPrefix` is unique for every vehicle, such as if it's the same as `clientId`, there will be separate Amazon CloudWatch metrics for each vehicle. If all vehicles have the same `profilerPrefix`, Amazon CloudWatch metrics are aggregated. Two AWS IoT Core rule actions are needed for these topics to forward the data to Amazon CloudWatch metrics and logs. They can be created by using the following AWS CloudFormation stack template: [fwremoteprofiler.yml](../tools/cfn-templates/fwremoteprofiler.yml) Click here to [**Launch CloudFormation Template**](https://us-east-1.console.aws.amazon.com/cloudformation/home?region=us-east-1#/stacks/quickcreate?templateUrl=https%3A%2F%2Faws-iot-fleetwise.s3.us-west-2.amazonaws.com%2Flatest%2Fcfn-templates%2Ffwremoteprofiler.yml&stackName=fwremoteprofiler). After the first vehicle uploads metrics, they can be found under the namespace **AWSIotFleetWiseEdge**. The format is `{profilerPrefix}_(variableMaxSinceStartup|variableMaxSinceLast|)_{name}` for variables and `{profilerPrefix}_(sectionAvgSinceStartup|sectionCountSinceStartup|sectionMaxSinceLast|sectionMaxSinceStartup)_{name}` for measuring the time in seconds needed for certain code sections. After running a vehicle with the above config the metrics `TestVehicle1_variableMaxSinceStartup_RFrames0` and ~ 300 more will appear in Amazon CloudWatch. Every minute new values will appear as `metricsUploadIntervalMs` is set to 60000. For the direct upload of every log message above the specified threshold (`loggingUploadLevelThreshold`), log messages are cached at edge for a maximum of 60 seconds (`loggingUploadMaxWaitBeforeUploadMs`) before being uploaded over MQTT. The RemoteProfile module will not cache any metrics or logs during the loss of connectivity. The local system log file can be used in that case, see the following section. ## Method 2: Retrieving metrics from logs e.g. over SSH This method uses remote access, such as over SSH leveraging AWS Systems Manager or AWS IoT secure tunneling to access the logs/metrics. In our examples, we use journald to manage the FWE logs. This has the benefits of log rotation which might be necessary as FWE logs on TRACE level under high load can produce multiple gigabytes of logs per day. These logs can be collected fully or aggregated like over ssh from single vehicles in case of need for debugging or cyclically from the whole fleet. To manage easy remote connections to multiple vehicles AWS Systems Manager or AWS IoT secure tunneling could be used. For aggregation, custom scripts can be used to filter certain log levels. The log levels in AWS IoT Fleetwise Edge logs go from `[ERROR]` to `[TRACE]`. To make the metrics easier to parse, you can set the parameter `.staticConfig.internalParameters.metricsCyclicPrintIntervalMs` in the static config an interval like 60000. This will cause the metrics to print in an easy parsable format to the log every minute. 
The following regular expressions can be used by any log/metrics aggregator or uploader that supports Python. For lines that start with `TraceModule-ConsoleLogging-TraceAtomicVariable` or `TraceModule-ConsoleLogging-Variable`:

```python
import re

# NOTE: the group names below are placeholders; map them to the fields that your
# TraceModule version prints (based on the metric naming above, these include
# values such as the maximum since the last print and the maximum since startup).
regex_variable = re.compile(
    r".*\'(?P<name>.*?)\'"
    r" \[(?P<field_1>.*?)\]"
    r" .*\[(?P<field_2>.*?)\]"
    r" .*\[(?P<field_3>.*?)\]"
    r" .*\[(?P<field_4>.*?)\]"
)
```

For lines starting with `TraceModule-ConsoleLogging-Section`:

```python
# As above, the group names are placeholders for the section statistics
# (e.g. count, average and maximum durations since the last print and since startup).
regex_section = re.compile(
    r".*\'(?P<name>.*?)\'"
    r" \[(?P<field_1>.*?)\]"
    r" .*\[(?P<field_2>.*?)\]"
    r" .*\[(?P<field_3>.*?)\]"
    r" .*\[(?P<field_4>.*?)\]"
    r" .*\[(?P<field_5>.*?)\]"
    r" .*\[(?P<field_6>.*?)\]"
    r" .*\[(?P<field_7>.*?)\]"
    r" .*\[(?P<field_8>.*?)\]"
)
```

After the metrics are parsed from the local log files, a local health monitoring program can decide if and how to upload them to the cloud.

# Adding new metrics

Adding new metrics requires changing the C++ code and recompiling AWS IoT FleetWise Edge. Add the metric to the `TraceVariable` enum in [TraceModule.h](../src/platform/linux/logmanagement/include/TraceModule.h) and assign a short name in the function `getVariableName` of [TraceModule.cpp](../src/platform/linux/logmanagement/src/TraceModule.cpp). Then you can set the metric anywhere by using:

```cpp
TraceModule::get().setVariable( TraceVariable::MAX_SYSTEMTIME_KERNELTIME_DIFF, observedNewValue );
```

The metric is automatically included in both methods described above. No changes are needed in the cloud, and the new metric will just show up in the same namespace.