## Concepts ##

This package endeavors to provide Root Cause Analysis (RCA) of the myriad problems that may cripple a system. The RCA is done using the metrics that the observability code (in the same package) gathers. The RCA execution and evaluation is done as a data-flow connectedComponent execution. The data enters at the top through the leaf nodes (the metrics); each node makes sense of the data it gets from its upstream nodes and synthesizes the evaluation in terms of a FlowUnit that is passed to its downstream nodes, which in turn do their own evaluation, until finally one or more RCAs come out at the bottom. So, to re-iterate, this system consumes metrics and provides RCAs.

The output, or RCA, is essentially a set of resources that are behaving anomalously when the system is observed to be in an unhealthy state. It will be an empty set if nothing is found to be wrong and the system is in perfect health based on the sensors available. All components are Nodes in this data-flow connectedComponent. Nodes play different roles based on how they are defined. Before we go into the details of the RCA framework and its evaluation, we will go through some concepts that will run through the entirety of this discussion.

### Metrics ###

This is the input to the system. Metrics are essentially key-value pairs. For details on what metrics are available, please take a look at the online reference [here](https://opensearch.org/docs/monitoring-plugins/pa/reference/).

### Symptoms ###

A symptom is a boolean question about the state of a resource. When we notice that the system is running slow, we might be interested in questions such as "is the CPU contended?". In this question, CPU is the resource and the state we are interested in is contended. So, a symptom is a 2-tuple of a resource and a state. Similar questions can be whether the disk has a low service-rate, and so forth. The answer to such a question is a value from the set {True, False, InsufficientData}. True and False are self-explanatory, but in what cases can this return InsufficientData? We might want to know the write-throughput, but not enough data may have been written in the last few minutes for us to conclude the state with high probability. In such cases the Symptom would output InsufficientData. Symptoms consume one or more metrics and may or may not consume other symptoms to arrive at a conclusion.

An astute reader might notice that when we say a CPU is high or something of that sort, we are comparing the current value against a threshold. Worry not, we do support quite an elaborate threshold definition that can be used to evaluate a symptom. It is discussed further down in the doc. A threshold is not a node of the data-flow, so we will defer that discussion.

### RCA ###

This is what we have been waiting for. A root cause is a non-empty set of resources in the cluster that are diagnosed to be in an unhealthy state at a given point in time. One thing to keep in mind is that this state changes over time; some problems are transitory and others are not. So, for this system, the output of an RCA will be a set of one or more 2-tuples containing a resource and the state it is diagnosed to be in. In terms of data-flow, the input to an RCA node can be:

- one or more metric(s)
- one or more symptom(s)
- one or more rca(s)

## Framework ##

Now that we are past the basic concepts, let's get into the details of the framework. At the top of the data-flow node type hierarchy is the node. A node can be further categorized into `Leaf` and `NonLeaf` nodes. Recall that the metrics are the leaves and they have no dependency on other nodes, so `Metrics` extend the ***Leaf*** node. `RCA`s and `Symptom`s cannot be at the top as they need data to evaluate, so `RCA` and `Symptom` extend from the `NonLeaf` node. Because ***RCA***s and ***Symptom***s need to be evaluated, they implement `Evaluable`, by virtue of which, anytime you create a class that extends from `Symptom` or `Rca`, the responsibility falls on you to implement the `evaluate` method.
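The sketch below illustrates this hierarchy. The class names and method signatures are simplified for exposition and may not match the actual types in the package.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and signatures are simplified, not the package's exact API.
class FlowUnit {}                            // data a node passes to its downstream nodes
class BooleanFlowUnit extends FlowUnit {}    // a FlowUnit carrying a True/False/InsufficientData answer

abstract class Node {}                       // every vertex in the data-flow connectedComponent

abstract class LeafNode extends Node {}      // no upstream dependencies
abstract class Metric extends LeafNode {}    // metrics are the leaves

interface Evaluable<T extends FlowUnit> {
  // Evaluate this node from the FlowUnits produced by its upstream nodes.
  T evaluate(Map<Class<?>, List<FlowUnit>> dependencies);
}

abstract class NonLeafNode extends Node {}   // needs upstream data to evaluate

// Symptoms answer boolean questions; RCAs produce the final diagnosis.
abstract class Symptom extends NonLeafNode implements Evaluable<BooleanFlowUnit> {}
abstract class Rca extends NonLeafNode implements Evaluable<FlowUnit> {}
```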
### Analysis Graph ###

This is the blueprint of how the RCAs will be evaluated. The way to think about an RCA is as a data-flow graph. The data enters through the top and RCAs come out at the bottom. It is an inverted DAG where the metrics are the leaf nodes. Every downstream node depends on one or more upstream nodes. The Metric nodes, being the leaves of the inverted connectedComponent, have no dependencies themselves.

### Node Tagging ###

Whether a graph node is executed at a location or not is determined by a combination of two things: the tags the node has and the tags present in the rca.conf. Before execution of a node, all the node tags (keys:values) are matched against the tags in the rca.conf. A node will usually have fewer tags than the rca.conf, or else it will not execute. The way tag matching works is that for each key:value pair in the tag list of the node, the key is looked up in the rca.conf and the corresponding values are matched. If a match happens, we move on to the next tag of the node, and so on. Only if all the tags of the node match (the rca.conf can have extra tags; that does not affect the tag matching) does the runtime go on to execute the node. If the node has some extra tags that are not present in the rca.conf, it is considered a no-match. A node that has no tags associated with it will execute in all locations.

Note: if a node does not run at a location, then none of its dependents can run at that location either, because the data they require will not be available. So if a node does not run at a location, the runtime also turns off all of its dependents at that location.

Node tagging is also how the runtime decides whether a node's data is required by a remote location, and how a remote location knows that the data it needs to evaluate a local node is supposed to be obtained from a remote upstream node. This is a good segue into the intent-based message passing.

#### Intent Based Message Passing between Network-separated Graph Nodes

This section describes the interfacing of the runtime and the network thread for message passing between graph nodes that reside on different physical machines. The underlying invariant is that data always flows downstream and control, or intent, always flows upstream. The RCA runtime at every location has the entire Analysis Graph. So when a runtime at a downstream location figures out that it needs data from an upstream location, it initiates an intent to get the data from the upstream node. This intent message contains the graph node name that is requesting the data and the tags associated with that graph node. The message is passed to the local network thread, which broadcasts it to all the upstream peers. The receivers of this message add an entry to a table with this information and then initiate a long-lived connection with the downstream node that expressed the intent.

When the runtime at the data generation side, based on tag-matching, figures out that a given node cannot run locally, it sends the node's data the way of the network thread. This message carries the data along with the graph node name and the tags of the node that is expected to receive it. Given this information, the network thread looks up its local table and checks whether a downstream node has expressed an intent to receive such a message; if so, the network thread sends the message to the interested parties. If there are no takers, the network thread drops it on the floor. On the receiver side, when the network thread receives such a message, it stores it locally and provides a handle to the runtime. In the next execution of the tasks, the runtime uses this data for evaluation.
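As a rough illustration of this bookkeeping, the sketch below shows how a network thread might record intents and later route data to the interested downstream hosts. The class names, fields, and table layout (`IntentMsg`, `subscriptions`, and so on) are hypothetical and are not the actual API of this package.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArraySet;

// Hypothetical sketch: names and table layout are illustrative, not this package's API.
class IntentMsg {
  final String requestingGraphNode;     // downstream graph node that needs the data
  final Map<String, String> nodeTags;   // tags of that node, used for matching
  final String downstreamHost;          // where the data should be shipped

  IntentMsg(String requestingGraphNode, Map<String, String> nodeTags, String downstreamHost) {
    this.requestingGraphNode = requestingGraphNode;
    this.nodeTags = nodeTags;
    this.downstreamHost = downstreamHost;
  }
}

class NetworkThreadSketch {
  // receiving graph node name -> downstream hosts that expressed an intent for its data
  private final Map<String, Set<String>> subscriptions = new ConcurrentHashMap<>();

  // Upstream side: remember which downstream hosts want data destined for this graph node.
  void onIntentReceived(IntentMsg msg) {
    subscriptions
        .computeIfAbsent(msg.requestingGraphNode, k -> new CopyOnWriteArraySet<>())
        .add(msg.downstreamHost);
  }

  // Upstream side: the runtime hands over data for a node that, per tag matching, cannot run locally.
  void onDataGenerated(String receivingGraphNode, Object flowUnit) {
    Set<String> interested = subscriptions.get(receivingGraphNode);
    if (interested == null || interested.isEmpty()) {
      return;                           // no takers: drop the message on the floor
    }
    interested.forEach(host -> send(host, flowUnit));
  }

  private void send(String host, Object flowUnit) { /* network I/O elided in this sketch */ }
}
```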
## Specification ##

This is the way of specifying the Analysis Flow Field, or constructing it, by creating the Metric, Symptom and RCA objects and defining their upstreams. This defines what will run to calculate the RCAs, and it is provided by the user of the RCA system. Specifying an RCA is a four-step process:

1. Create all the Symptom and RCA classes.
2. Define the thresholds and their environment-based overrides using the .json file.
3. Extend the AnalysisFlowField class and override the construct() method. In this method, instantiate the Symptom and RCA classes and define their upstream dependencies. An upstream dependency can be a Metric, a Symptom or an RCA. This is essentially a full connectedComponent specification.
4. For each class that extends the Symptom or RCA class, fill in the evaluate method. An evaluate method can be arbitrary Java code making use of the dependencies defined in step 3 and the thresholds.

```java
// 1. Create your classes by extending the Symptom and Rca classes.
class MySymptom extends Symptom {
    @Override
    public BooleanFlowUnit evaluate(Map<Class<?>, List<FlowUnit>> dependencies) {
        // Operate with some meaningful calculations using the static methods
        // from NumericAggregator and/or BooleanAggregator.
        return null;
    }
}

class MyRca extends Rca {
    @Override
    public FlowUnit evaluate(Map<Class<?>, List<FlowUnit>> dependencies) {
        // Operate with some meaningful calculations using the static methods
        // from NumericAggregator and/or BooleanAggregator.
        return null;
    }
}

// 2. Extend the AnalysisFlowField and fill in the construct() method.
class MyAnalysisFlowField extends AnalysisFlowField {
    @Override
    public void construct() {
        // 1. Instantiate all the Metrics, Symptoms and RCAs here.
        // 2. Add the metrics to the flow field by calling addLeaf().
        // 3. Connect nodes to other nodes by stating their dependencies by
        //    calling .addAllUpstreams().
    }
}
```
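Filled in, the `construct()` method of the skeleton above might look like the sketch below. The wiring calls (`addLeaf()`, `addAllUpstreams()`) follow the comments above, while the metric class (`CpuUtilizationMetric`), the variable names and the exact method signatures are hypothetical and only meant to show the shape of the specification.

```java
import java.util.Arrays;

// Hypothetical wiring; concrete class names and constructor arguments are illustrative.
class MyAnalysisFlowField extends AnalysisFlowField {
    @Override
    public void construct() {
        // 1. Instantiate the graph nodes.
        Metric cpuMetric = new CpuUtilizationMetric();   // a leaf node fed directly by the metrics
        Symptom cpuContended = new MySymptom();          // answers "is the CPU contended?"
        Rca hotNodeRca = new MyRca();                    // final diagnosis built on the symptom

        // 2. Metrics are leaves, so they are added with addLeaf().
        addLeaf(cpuMetric);

        // 3. Non-leaf nodes declare their upstream dependencies.
        cpuContended.addAllUpstreams(Arrays.asList(cpuMetric));
        hotNodeRca.addAllUpstreams(Arrays.asList(cpuContended));
    }
}
```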
## Runtime Evaluation of RCAs ##

The RCAs defined in the specification are executed by the RCA agent on the nodes of the cluster. This is how the runtime instantiates the AnalysisFlowField and evaluates it:

1. Call the construct() method on the class that overrides the AnalysisFlowField.
2. Call validateAndProcess() on the same object to verify the connectedComponent creation and to instantiate some connectedComponent data structures.
3. Call Stats.getInstance().getGraphs() to get all the graphs in the flowField.
4. For each connectedComponent returned in step 3, call getAllNodesByDependencyOrder() to get a hierarchical list of nodes to evaluate.
5. For each node, call getUpstreams() and create a list of samples for each DependencySpec thus returned.
6. With all such dependencies, call the evaluate() method of the node.

In pseudocode, the evaluation loop looks roughly like this:

```java
class Runtime {
    void run() {
        AnalysisFlowField flowField = new MyAnalysisFlowField();
        flowField.construct();
        flowField.validateAndProcess();

        for (Graph connectedComponent : Stats.getInstance().getGraphs()) {
            for (List<Node> list : connectedComponent.getAllNodesByDependencyOrder()) {
                for (Node node : list) {
                    Map<Class<?>, List<FlowUnit>> map = new HashMap<>();
                    for (DependencySpec dep : node.getUpstreams()) {
                        // Gather the dependencies: create a list with as many samples
                        // as mentioned in DependencySpec.windowLength, e.g.
                        // map.put(dep.dependencyClass, new ArrayList<>());
                    }
                    node.evaluate(map);
                }
            }
        }
    }
}
```
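In this sketch, getAllNodesByDependencyOrder() hands back the nodes level by level, so a node's upstream dependencies are always evaluated before the node itself; the FlowUnits produced at one level become the dependency samples passed to evaluate() at the next. When tag matching determines that an upstream node runs at a remote location, the corresponding samples are the ones the network thread received through the intent mechanism described earlier.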