=============
ml
=============

.. rubric:: Table of contents

.. contents::
   :local:
   :depth: 2


Description
===========
| The ``ml`` command trains a model, runs predictions, or does both in one step (``train``, ``predict``, or ``trainandpredict``) with any algorithm in the ml-commons plugin, using the search result returned by a PPL command as input.


List of algorithms supported
============================
* AD (RCF)
* KMEANS


AD - Fixed In Time RCF For Time-series Data Command Syntax
==========================================================
ml action='train' algorithm='rcf'

* number_of_trees(integer): optional. Number of trees in the forest. The default value is 30.
* shingle_size(integer): optional. A shingle is a consecutive sequence of the most recent records. The default value is 8.
* sample_size(integer): optional. The sample size used by stream samplers in this forest. The default value is 256.
* output_after(integer): optional. The number of points required by stream samplers before results are returned. The default value is 32.
* time_decay(double): optional. The decay factor used by stream samplers in this forest. The default value is 0.0001.
* anomaly_rate(double): optional. The anomaly rate. The default value is 0.005.
* time_field(string): mandatory. The time field that RCF uses as time-series data.
* date_format(string): optional. The format of the values in the time_field field. The default format is "yyyy-MM-dd HH:mm:ss".
* time_zone(string): optional. The time zone of the time_field field. The default time zone is UTC.
* category_field(string): optional. The category field used to group inputs. Each category is predicted independently.


AD - Batch RCF for Non-time-series Data Command Syntax
======================================================
ml action='train' algorithm='rcf'

* number_of_trees(integer): optional. Number of trees in the forest. The default value is 30.
* sample_size(integer): optional. Number of random samples given to each tree from the training data set. The default value is 256.
* output_after(integer): optional. The number of points required by stream samplers before results are returned. The default value is 32.
* training_data_size(integer): optional. The default value is the size of your training data set.
* anomaly_score_threshold(double): optional. The threshold of anomaly score. The default value is 1.0.
* category_field(string): optional. It specifies the category field used to group inputs. Each category will be independently predicted.


Example 1: Detecting events in New York City from taxi ridership data with time-series data
===========================================================================================

The example trains an RCF model and uses the model to detect anomalies in the time-series ridership data.

PPL query::

    os> source=nyc_taxi | fields value, timestamp | ml action='train' algorithm='rcf' time_field='timestamp' | where value=10844.0
    fetched rows / total rows = 1/1
    +---------+---------------------+---------+-----------------+
    | value   | timestamp           | score   | anomaly_grade   |
    |---------+---------------------+---------+-----------------|
    | 10844.0 | 2014-07-01 00:00:00 | 0.0     | 0.0             |
    +---------+---------------------+---------+-----------------+


Example 2: Detecting events in New York City from taxi ridership data with time-series data independently with each category
============================================================================================================================

The example trains an RCF model and uses the model to detect anomalies in the time-series ridership data with multiple category values.

PPL query::

    os> source=nyc_taxi | fields category, value, timestamp | ml action='train' algorithm='rcf' time_field='timestamp' category_field='category' | where value=10844.0 or value=6526.0
    fetched rows / total rows = 2/2
    +------------+---------+---------------------+---------+-----------------+
    | category   | value   | timestamp           | score   | anomaly_grade   |
    |------------+---------+---------------------+---------+-----------------|
    | night      | 10844.0 | 2014-07-01 00:00:00 | 0.0     | 0.0             |
    | day        | 6526.0  | 2014-07-01 06:00:00 | 0.0     | 0.0             |
    +------------+---------+---------------------+---------+-----------------+


Example 3: Detecting events in New York City from taxi ridership data with non-time-series data
===============================================================================================

The example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data.

PPL query::

    os> source=nyc_taxi | fields value | ml action='train' algorithm='rcf' | where value=10844.0
    fetched rows / total rows = 1/1
    +---------+---------+-------------+
    | value   | score   | anomalous   |
    |---------+---------+-------------|
    | 10844.0 | 0.0     | False       |
    +---------+---------+-------------+


Example 4: Detecting events in New York City from taxi ridership data with non-time-series data independently with each category
================================================================================================================================

The example trains an RCF model and uses the model to detect anomalies in the non-time-series ridership data with multiple category values.

PPL query::

    os> source=nyc_taxi | fields category, value | ml action='train' algorithm='rcf' category_field='category' | where value=10844.0 or value=6526.0
    fetched rows / total rows = 2/2
    +------------+---------+---------+-------------+
    | category   | value   | score   | anomalous   |
    |------------+---------+---------+-------------|
    | night      | 10844.0 | 0.0     | False       |
    | day        | 6526.0  | 0.0     | False       |
    +------------+---------+---------+-------------+


KMEANS
======
ml action='train' algorithm='kmeans'

* centroids: optional. The number of clusters you want to group your data points into. The default value is 2.
* iterations: optional. Number of iterations. The default value is 10.
* distance_type: optional. The distance type can be COSINE, L1, or EUCLIDEAN. The default type is EUCLIDEAN.


Example: Clustering of Iris Dataset
===================================

The example shows how to group samples of three Iris species (Iris setosa, Iris virginica and Iris versicolor) into clusters based on four features measured from each sample: the length and the width of the sepals and petals.

PPL query::

    os> source=iris_data | fields sepal_length_in_cm, sepal_width_in_cm, petal_length_in_cm, petal_width_in_cm | ml action='train' algorithm='kmeans' centroids=3
    +--------------------+-------------------+--------------------+-------------------+-----------+
    | sepal_length_in_cm | sepal_width_in_cm | petal_length_in_cm | petal_width_in_cm | ClusterID |
    |--------------------+-------------------+--------------------+-------------------+-----------|
    | 5.1                | 3.5               | 1.4                | 0.2               | 1         |
    | 5.6                | 3.0               | 4.1                | 1.3               | 0         |
    | 6.7                | 2.5               | 5.8                | 1.8               | 2         |
    +--------------------+-------------------+--------------------+-------------------+-----------+
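
To make the roles of ``centroids``, ``iterations`` and ``distance_type`` more concrete, the following is a minimal Python sketch of a k-means loop. It is an illustration only and not the ml-commons implementation: the function names ``distance`` and ``kmeans`` are hypothetical, and the initialisation (first points as starting centers) and tie-breaking are assumptions of this sketch. Cluster IDs produced this way are arbitrary labels, so they need not match the ``ClusterID`` values in the table above.

.. code-block:: python

    import math


    def distance(a, b, distance_type="EUCLIDEAN"):
        """Distance between two equal-length numeric vectors."""
        if distance_type == "EUCLIDEAN":
            return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
        if distance_type == "L1":
            return sum(abs(x - y) for x, y in zip(a, b))
        if distance_type == "COSINE":
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
            return 1.0 - dot / norm  # assumes neither vector is all zeros
        raise ValueError(f"unknown distance_type: {distance_type}")


    def kmeans(points, centroids=2, iterations=10, distance_type="EUCLIDEAN"):
        """Return one cluster ID per point, taken from the last assignment step."""
        # Assumption of this sketch: the first `centroids` points are the starting centers.
        centers = [list(p) for p in points[:centroids]]
        labels = [0] * len(points)
        for _ in range(iterations):
            # Assignment step: each point joins its nearest center.
            labels = [
                min(range(len(centers)),
                    key=lambda c: distance(p, centers[c], distance_type))
                for p in points
            ]
            # Update step: each center moves to the mean of its members.
            for c in range(len(centers)):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        return labels


    # The three sample rows from the table above, one per expected cluster.
    rows = [(5.1, 3.5, 1.4, 0.2), (5.6, 3.0, 4.1, 1.3), (6.7, 2.5, 5.8, 1.8)]
    print(kmeans(rows, centroids=3, iterations=10, distance_type="EUCLIDEAN"))

A larger ``iterations`` value gives the centers more rounds to settle, and ``distance_type`` changes which center counts as nearest, so the same data can be grouped differently under COSINE than under EUCLIDEAN.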