Metadata-Version: 2.1
Name: awswrangler
Version: 2.10.0
Summary: Pandas on AWS.
Home-page:
Author: Igor Tavares
License: Apache License 2.0
Platform: UNKNOWN
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Requires-Python: >=3.6, <3.10
Description-Content-Type: text/markdown
License-File: LICENSE.txt
License-File: NOTICE.txt
License-File: THIRD_PARTY.txt
Requires-Dist: boto3 (<2.1.0,>=1.16.8)
Requires-Dist: botocore (<2.1.0,>=1.19.8)
Requires-Dist: numpy (<2.1.0,>=1.18.0)
Requires-Dist: pandas (<2.1.0,>=1.1.0)
Requires-Dist: pyarrow (<4.1.0,>=2.0.0)
Requires-Dist: redshift-connector (~=2.0.882)
Requires-Dist: pymysql (<1.1.0,>=0.9.0)
Requires-Dist: pg8000 (<1.21.0,>=1.16.0)
Requires-Dist: openpyxl (~=3.0.0)
Provides-Extra: excel-py3.6
Requires-Dist: xlrd (>=2.0.1) ; extra == 'excel-py3.6'
Requires-Dist: xlwt (>=1.3.0) ; extra == 'excel-py3.6'
Provides-Extra: sqlserver
Requires-Dist: pyodbc (~=4.0.30) ; extra == 'sqlserver'

# AWS Data Wrangler

*Pandas on AWS*

Easy integration with Athena, Glue, Redshift, Timestream, QuickSight, Chime, CloudWatchLogs, DynamoDB, EMR, SecretManager, PostgreSQL, MySQL, SQLServer and S3 (Parquet, CSV, JSON and EXCEL).

![AWS Data Wrangler](docs/source/_static/logo2.png?raw=true "AWS Data Wrangler")

> An [AWS Professional Service]( open source initiative |

[![Release](](
[![Python Version](](
[![Code style: black](](
[![License](](
[![Checked with mypy](](
[![Coverage](](
![Static Checking](
[![Documentation Status](](

| Source | Downloads | Installation Command |
|--------|-----------|----------------------|
| **[PyPi](** | [![PyPI Downloads](]( | `pip install awswrangler` |
| **[Conda](** | [![Conda Downloads](]( | `conda install -c conda-forge awswrangler` |

> ⚠️ **For platforms without PyArrow 3 support (e.g. [EMR](, [Glue PySpark Job](, MWAA):**
➡️ `pip install pyarrow==2 awswrangler`

Powered By [](

## Table of contents

- [Quick Start](#quick-start)
- [Read The Docs](#read-the-docs)
- [Getting Help](#getting-help)
- [Community Resources](#community-resources)
- [Logging](#logging)
- [Who uses AWS Data Wrangler?](#who-uses-aws-data-wrangler)
- [What is Amazon SageMaker Data Wrangler?](#what-is-amazon-sagemaker-data-wrangler)

## Quick Start

Installation command: `pip install awswrangler`

> ⚠️ **For platforms without PyArrow 3 support (e.g. [EMR](, [Glue PySpark Job](, MWAA):**
➡️ `pip install pyarrow==2 awswrangler`

```py3
import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table"
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],  # one timestamp per record
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")
```

## [Read The Docs](

- [**What is AWS Data Wrangler?**](
- [**Install**](
  - [PyPi (pip)](
  - [Conda](
  - [AWS Lambda Layer](
  - [AWS Glue Python Shell Jobs](
  - [AWS Glue PySpark Jobs](
  - [Amazon SageMaker Notebook](
  - [Amazon SageMaker Notebook Lifecycle](
  - [EMR](
  - [From source](
- [**Tutorials**](
  - [001 - Introduction](
  - [002 - Sessions](
  - [003 - Amazon S3](
  - [004 - Parquet Datasets](
  - [005 - Glue Catalog](
  - [006 - Amazon Athena](
  - [007 - Databases (Redshift, MySQL, PostgreSQL and SQL Server)](
  - [008 - Redshift - Copy & Unload.ipynb](
  - [009 - Redshift - Append, Overwrite and Upsert](
  - [010 - Parquet Crawler](
  - [011 - CSV Datasets](
  - [012 - CSV Crawler](
  - [013 - Merging Datasets on S3](
  - [014 - Schema Evolution](
  - [015 - EMR](
  - [016 - EMR & Docker](
  - [017 - 
Partition Projection](
  - [018 - QuickSight](
  - [019 - Athena Cache](
  - [020 - Spark Table Interoperability](
  - [021 - Global Configurations](
  - [022 - Writing Partitions Concurrently](
  - [023 - Flexible Partitions Filter](
  - [024 - Athena Query Metadata](
  - [025 - Redshift - Loading Parquet files with Spectrum](
  - [026 - Amazon Timestream](
  - [027 - Amazon Timestream 2](
  - [028 - Amazon DynamoDB](
- [**API Reference**](
  - [Amazon S3](
  - [AWS Glue Catalog](
  - [Amazon Athena](
  - [Amazon Redshift](
  - [PostgreSQL](
  - [MySQL](
  - [SQL Server](
  - [DynamoDB](
  - [Amazon Timestream](
  - [Amazon EMR](
  - [Amazon CloudWatch Logs](
  - [Amazon Chime](
  - [Amazon QuickSight](
  - [AWS STS](
  - [AWS Secrets Manager](
- [**License**](
- [**Contributing**](
- [**Legacy Docs** (pre-1.0.0)](

## Getting Help

The best way to interact with our team is through GitHub. You can open an [issue]( and choose from one of our templates for bug reports, feature requests...
You may also find help on these community resources:

* The #aws-data-wrangler Slack [channel](
* Ask a question on [Stack Overflow]( and tag it with `awswrangler`

## Community Resources

Please [send a Pull Request]( with your resource reference and @githubhandle.

- [Optimize Python ETL by extending Pandas with AWS Data Wrangler]( [[@igorborgest](]
- [Reading Parquet Files With AWS Lambda]( [[@anand086](]
- [Transform AWS CloudTrail data using AWS Data Wrangler]( [[@anand086](]
- [Rename Glue Tables using AWS Data Wrangler]( [[@anand086](]
- [Getting started on AWS Data Wrangler and Athena]( [[@dheerajsharma21](]
- [Simplifying Pandas integration with AWS data related services]( [[@bvsubhash](]
- [Build an ETL pipeline using AWS S3, Glue and Athena]( [[@taupirho](]

## Logging

Examples of enabling internal logging:

```py3
import logging
logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
logging.getLogger("botocore.credentials").setLevel(logging.CRITICAL)
```

In AWS Lambda:

```py3
import logging
logging.getLogger("awswrangler").setLevel(logging.DEBUG)
```

## Who uses AWS Data Wrangler?

Knowing which companies are using this library is important to help prioritize the project internally.

Please [send a Pull Request]( with your company name and @githubhandle.

- [Amazon](
- [AWS](
- [Cepsa]( [[@alvaropc](]
- [Cognitivo]( [[@msantino](]
- [DNX]( [[@DNXLabs](]
- [Funcional Health Tech]( [[@webysther](]
- [Informa Markets]( [[@mateusmorato]](
- [LINE TV]( [[@bryanyang0528](]
- [Magnataur]( [[@brianmingus2](]
- [M4U]( [[@Thiago-Dantas](]
- [NBCUniversal]( [[@vibe](]
- []( [[@mrtns](]
- [OKRA Technologies]( [[@JPFrancoia](, [@schot](]
- [Pier]( [[@flaviomax](]
- [Pismo]( [[@msantino](]
- [ringDNA]( [[@msropp](]
- [Serasa Experian]( [[@andre-marcos-perez](]
- [Shipwell]( [[@zacharycarter](]
- [strongDM]( [[@mrtns](]
- [Thinkbumblebee]( [[@dheerajsharma21]](
- [Zillow]( [[@nicholas-miles]](

## What is Amazon SageMaker Data Wrangler?

**Amazon SageMaker Data Wrangler** is a new SageMaker Studio feature that has a similar name but a different purpose than the **AWS Data Wrangler** open source project.

- **AWS Data Wrangler** is open source, runs anywhere, and is focused on code.
- **Amazon SageMaker Data Wrangler** is specific to the SageMaker Studio environment and is focused on a visual interface.