## Example: Ingesting car auction data

For this example, we will continue to use the car auction data from `sample_data_generator/USAGE.md`. As a primer, the data contains car makes and models, the date of the auction, and a final winning bid price between $20,000 and $40,000. However, to simulate a data trend, we modify `auction_date` to represent auctions that happened on a particular day, so this field becomes a unix time.

```
car_template = {
    "make": "vehicle_make",
    "model": "vehicle_model",
    "year": "vehicle_year",
    "auction_date": ["date_between", "-10y"],
    "final_winning_bid": ["float", {"right_digits": 2, "min_value": 20000, "max_value": 40000}]
}
```

This is the template we will use. For the purposes of this example, the `index_name` might be `"car_auction_data"`. Since `car_template` is a JSON "short-hand" (again, see `sample_data_generator/README.md` for more information), `mapping = False`. The `car_template` is data we want to generate, not read from a file, so `file_provided = False`. `chunk` can be any number, but because we are only generating a day's worth of data, a good chunk size is something easily divisible, like `chunk = 10`. The `timestamp = "auction_date"`. For the purposes of this example, we can assume `minutes = 30`, as that could be the average amount of time an auction takes. Since `minutes` is defined, `number` should be the total number of minutes per day divided by `minutes`, or 1440/30, so `number = 48` (in my startup and refresh jobs, this is configured automatically). As for `current_date`, let's set the date to June 8, 2020 (which was the real launch date of a car auction website). As for `max_bulk_size`, which is the maximum request body size in bytes, let's leave it unchanged at 100000 bytes since we already defined `chunk`.

Finally, `anomaly_detection_trend` should be a list of payloads, as we want to simulate a trend using `AverageTrend`. `AverageTrend` requires the following arguments:

- `timestamp`: We defined this earlier, so it's still `"auction_date"`.
- `feature_trend`: These are the configurations specific to `AverageTrend`:
  - `feature`: The field for which we want to generate a data trend. In our case, `feature = "final_winning_bid"`.
  - `avg_min`: This should be $20,000 (`avg_min = 20000`), as that is the average minimum winning bid price.
  - `avg_max`: Same logic as for `avg_min`; `avg_max = 40000`.
  - `abs_min`: This is the absolute lowest a bid can go, which in this case is $0 (`abs_min = 0`).
  - `abs_max`: This is the absolute highest a bid can go, which we cap at, say, $150,000 (`abs_max = 150000`).
  - `anomaly_percentage`: Let's assume the probability of an abnormal auction happening is 0.01. Thus, `anomaly_percentage = 0.01`.
  - `other_args`: Because anomalies are generated with the default arguments and we want a price tag with two decimal places, let's set `other_args = {"right_digits": 2}`.
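Putting the template and trend together, a single generated document might look something like the following sketch. The values are illustrative only, and the unix timestamp assumes the `auction_date` conversion described above:

```
# A hypothetical generated document (all values illustrative)
sample_doc = {
    "make": "Toyota",              # random vehicle make
    "model": "Camry",              # random vehicle model
    "year": 2015,                  # random vehicle year
    "auction_date": 1591574400,    # unix time within June 8, 2020
    "final_winning_bid": 27350.25  # usually within [avg_min, avg_max]
}
```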
`ingest()` with all these arguments looks as follows:

```
from sample_data_ingestor import ingest
from opensearchpy import OpenSearch
from datetime import datetime
from json import loads

# See the OpenSearch Python client documentation for reference on
# using the OpenSearch Python client object

# Dummy serializer to make mock API calls
class DummySerializer:
    # Accept the (optional) body argument and return an empty string
    def dumps(self, data=None):
        return ""

# Transport class that makes a mock API call
class DummyTransport:
    def __init__(self, hosts, responses=None, **kwargs):
        # Use a serializer instance, not the class itself
        self.serializer = DummySerializer()

    def perform_request(self, method, url, params=None, headers=None, body=None):
        return None

# OpenSearch client object that will mock API calls
client = OpenSearch(transport_class=DummyTransport)

feature_trend = {
    "data_trend": "AverageTrend",
    "feature": "final_winning_bid",
    "anomaly_percentage": 0.01,
    "avg_min": 20000,
    "avg_max": 40000,
    "abs_min": 0,
    "abs_max": 150000,
    "other_args": {"right_digits": 2}
}

car_template = {
    "make": "vehicle_make",
    "model": "vehicle_model",
    "year": "vehicle_year",
    "auction_date": ["date_between", "-10y"],
    "final_winning_bid": ["float", {"right_digits": 2, "min_value": 20000, "max_value": 40000}]
}

# This variable is a list as documents can have multiple features
anomaly_detection_trend = [feature_trend]

result = ingest(
    client = client,
    data_template = car_template,
    index_name = "car_auction_data",
    file_provided = False,
    mapping = False,
    number = 48,
    chunk = 10,
    timestamp = "auction_date",
    minutes = 30,
    current_date = datetime(2020, 6, 8),
    anomaly_detection_trend = anomaly_detection_trend
)

# The entire list
print(result)

# Each price tag generated
for doc in result:
    doc_as_dict = loads(doc)
    print(doc_as_dict["final_winning_bid"])
```

To spare you the trouble of scanning many lines of output: `result` should contain 48 document entries as JSON strings, with most `final_winning_bid` values falling between 20000 and 40000.
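Rather than eyeballing each price tag, a quick check like the following (a sketch that reuses `result` from the call above) can confirm that only a small fraction of bids fall outside the average band:

```
from json import loads

# Count documents whose winning bid falls outside the configured
# average band [avg_min, avg_max]; with anomaly_percentage = 0.01,
# most of the 48 documents should land inside it
bids = [loads(doc)["final_winning_bid"] for doc in result]
outliers = [bid for bid in bids if not 20000 <= bid <= 40000]
print(f"{len(outliers)} of {len(bids)} bids fall outside [20000, 40000]")
```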