{ "metadata": { "version": 1, "disable_limits": false, "instance_type": "ml.m5.4xlarge" }, "nodes": [ { "node_id": "1f7fb244-77d9-4d90-9dba-8d9956ba5f79", "type": "SOURCE", "operator": "sagemaker.s3_source_0.1", "parameters": { "dataset_definition": { "__typename": "S3CreateDatasetDefinitionOutput", "datasetSourceType": "S3", "name": "hotel_bookings.csv", "description": null, "s3ExecutionContext": { "__typename": "S3ExecutionContext", "s3Uri": "s3://hotelbookings/hotel_bookings.csv", "s3ContentType": "csv", "s3HasHeader": true, "s3FieldDelimiter": ",", "s3DirIncludesNested": false, "s3AddsFilenameColumn": false } } }, "inputs": [], "outputs": [ { "name": "default" } ] }, { "node_id": "9ed99534-7891-44ba-972c-849614c96e20", "type": "TRANSFORM", "operator": "sagemaker.spark.infer_and_cast_type_0.1", "parameters": {}, "trained_parameters": { "schema": { "hotel": "string", "is_canceled": "long", "lead_time": "long", "arrival_date_year": "long", "arrival_date_month": "string", "arrival_date_week_number": "long", "arrival_date_day_of_month": "long", "stays_in_weekend_nights": "long", "stays_in_week_nights": "long", "adults": "long", "children": "long", "babies": "long", "meal": "string", "country": "string", "market_segment": "string", "distribution_channel": "string", "is_repeated_guest": "long", "previous_cancellations": "long", "previous_bookings_not_canceled": "long", "reserved_room_type": "string", "assigned_room_type": "string", "booking_changes": "long", "deposit_type": "string", "agent": "long", "company": "string", "days_in_waiting_list": "long", "customer_type": "string", "adr": "float", "required_car_parking_spaces": "long", "total_of_special_requests": "long", "reservation_status": "string", "reservation_status_date": "date" } }, "inputs": [ { "name": "default", "node_id": "1f7fb244-77d9-4d90-9dba-8d9956ba5f79", "output_name": "default" } ], "outputs": [ { "name": "default" } ] }, { "node_id": "fd325a95-b061-46bb-a9bc-f0d4b1b10b85", "type": "VISUALIZATION", 
"operator": "sagemaker.visualizations.data_insights_report_0.1", "parameters": { "report_content": "Warning summary", "insights_report_parameters": { "target_column": "is_canceled", "problem_type": "Classification" } }, "trained_parameters": { "insights_report_parameters": { "duplicate_ratio": 0.3364184605075802, "high warnings": [ { "title": "Duplicate rows", "warningText": "
We found that 33.6% of the data are duplicates. Some data sources could include valid duplicates, and in other cases these duplicates could point to problems in data collection. Duplicate samples resulting from faulty data collection could derail machine learning processes that rely on splitting the data into independent training and validation folds, for example quick model scores, prediction power estimation, and automatic hyperparameter tuning. Duplicate samples could be removed from the dataset using the Drop duplicates transform under Manage rows.
", "severity": "high_sev" }, { "title": "Target leakage", "warningText": "The feature reservation_status predicts the target extremely well on its own. A feature this predictive often indicates an error called target leakage. The cause is typically data that is not available at time of prediction. For example, a duplicate of the target column in the dataset can result in target leakage.
Alternatively, if the machine learning task is \"easy\", then a single feature can have legitimately high prediction power. If you think that the feature is legitimately this predictive, you don't need to do anything further. However, if you think there's target leakage, we recommend that you remove the highly predictive column from the dataset using the Drop column transform under Manage columns.