"""Dataset is currently unstable. APIs subject to change without notice."""

import pyarrow as pa
from pyarrow.util import _stringify_path, _is_path_like

from pyarrow._dataset import (  # noqa
    CsvFileFormat,
    Expression,
    Dataset,
    DatasetFactory,
    DirectoryPartitioning,
    FileFormat,
    FileFragment,
    FileSystemDataset,
    FileSystemDatasetFactory,
    FileSystemFactoryOptions,
    FileWriteOptions,
    Fragment,
    HivePartitioning,
    IpcFileFormat,
    IpcFileWriteOptions,
    ParquetDatasetFactory,
    ParquetFactoryOptions,
    ParquetFileFormat,
    ParquetFileFragment,
    ParquetFileWriteOptions,
    ParquetReadOptions,
    Partitioning,
    PartitioningFactory,
    RowGroupInfo,
    Scanner,
    ScanTask,
    UnionDataset,
    UnionDatasetFactory,
    _get_partition_keys,
    _filesystemdataset_write,
)


def field(name):
    """Reference a named column of the dataset.

    Stores only the field's name. Type and other information is known only
    when the expression is bound to a dataset having an explicit scheme.

    Parameters
    ----------
    name : string
        The name of the field the expression references.

    Returns
    -------
    field_expr : Expression
    """
    return Expression._field(name)


def scalar(value):
    """Expression representing a scalar value.

    Parameters
    ----------
    value : bool, int, float or string
        Python value of the scalar. Note that only a subset of types are
        currently supported.

    Returns
    -------
    scalar_expr : Expression
    """
    return Expression._scalar(value)


def partitioning(schema=None, field_names=None, flavor=None,
                 dictionaries=None):
    """
    Specify a partitioning scheme.

    The supported schemes include:

    - "DirectoryPartitioning": this scheme expects one segment in the file
      path for each field in the specified schema (all fields are required
      to be present). For example given schema<year:int16, month:int8> the
      path "/2009/11" would be parsed to ("year"_ == 2009 and
      "month"_ == 11).
    - "HivePartitioning": a scheme for "/$key=$value/" nested directories as
      found in Apache Hive. This is a multi-level, directory based
      partitioning scheme. Data is partitioned by static values of a
      particular column in the schema. Partition keys are represented in the
      form $key=$value in directory names. Field order is ignored, as are
      missing or unrecognized field names. For example, given
      schema<year:int16, month:int8, day:int8>, a possible path would be
      "/year=2009/month=11/day=15" (but the field order does not need to
      match).

    Parameters
    ----------
    schema : pyarrow.Schema, default None
        The schema that describes the partitions present in the file path.
        If not specified, and `field_names` and/or `flavor` are specified,
        the schema will be inferred from the file path (and a
        PartitioningFactory is returned).
    field_names :  list of str, default None
        A list of strings (field names). If specified, the schema's types
        are inferred from the file paths (only valid for
        DirectoryPartitioning).
    flavor : str, default None
        The default is DirectoryPartitioning. Specify ``flavor="hive"`` for
        a HivePartitioning.
    dictionaries : List[Array]
        If the type of any field of `schema` is a dictionary type, the
        corresponding entry of `dictionaries` must be an array containing
        every value which may be taken by the corresponding column or an
        error will be raised in parsing.

    Returns
    -------
    Partitioning or PartitioningFactory

    Examples
    --------

    Specify the Schema for paths like "/2009/June":

    >>> partitioning(pa.schema([("year", pa.int16()), ("month", pa.string())]))

    or let the types be inferred by only specifying the field names:

    >>> partitioning(field_names=["year", "month"])

    For paths like "/2009/June", the year will be inferred as int32 while
    month will be inferred as string.

    Create a Hive scheme for a path like "/year=2009/month=11":

    >>> partitioning(
    ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]),
    ...     flavor="hive")

    A Hive scheme can also be discovered from the directory structure (and
    types will be inferred):

    >>> partitioning(flavor="hive")
    """
    if flavor is None:
        # default flavor
        if schema is not None:
            if field_names is not None:
                raise ValueError(
                    "Cannot specify both 'schema' and 'field_names'")
            return DirectoryPartitioning(schema, dictionaries)
        elif field_names is not None:
            if isinstance(field_names, list):
                return DirectoryPartitioning.discover(field_names)
            raise ValueError(
                "Expected list of field names, got {}".format(
                    type(field_names)))
        else:
            raise ValueError(
                "For the default directory flavor, need to specify a "
                "Schema or a list of field names")
    elif flavor == 'hive':
        if field_names is not None:
            raise ValueError(
                "Cannot specify 'field_names' for flavor 'hive'")
        elif schema is not None:
            if isinstance(schema, pa.Schema):
                return HivePartitioning(schema, dictionaries)
            raise ValueError(
                "Expected Schema for 'schema', got {}".format(type(schema)))
        else:
            return HivePartitioning.discover()
    else:
        raise ValueError("Unsupported flavor")
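
# Illustrative sketch (not part of the upstream module): how ``field()``,
# ``scalar()`` and ``partitioning()`` combine in practice. The schema and
# the comparison values below are placeholders chosen for the example.
#
# >>> import pyarrow as pa
# >>> import pyarrow.dataset as ds
# >>> part = ds.partitioning(
# ...     pa.schema([("year", pa.int16()), ("month", pa.int8())]),
# ...     flavor="hive")
# >>> expr = (ds.field("year") == ds.scalar(2009)) & (ds.field("month") > 6)
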
def _ensure_partitioning(scheme):
    """
    Validate input and return a Partitioning(Factory).

    It passes None through if no partitioning scheme is defined.
    """
    if scheme is None:
        pass
    elif isinstance(scheme, str):
        scheme = partitioning(flavor=scheme)
    elif isinstance(scheme, list):
        scheme = partitioning(field_names=scheme)
    elif isinstance(scheme, (Partitioning, PartitioningFactory)):
        pass
    else:
        raise ValueError(
            "Expected Partitioning or PartitioningFactory, got {}".format(
                type(scheme)))
    return scheme


def _ensure_format(obj):
    if isinstance(obj, FileFormat):
        return obj
    elif obj == "parquet":
        return ParquetFileFormat()
    elif obj in {"ipc", "arrow", "feather"}:
        return IpcFileFormat()
    elif obj == "csv":
        return CsvFileFormat()
    else:
        raise ValueError("format '{}' is not supported".format(obj))


def _ensure_multiple_sources(paths, filesystem=None):
    """
    Treat a list of paths as files belonging to a single file system.

    If the file system is local then also validates that all paths are
    referencing existing *files*, otherwise any non-file paths will be
    silently skipped (for example on a remote filesystem).

    Parameters
    ----------
    paths : list of path-like
        Note that URIs are not allowed.
    filesystem : FileSystem or str, optional
        If a URI is passed, then its path component will act as a prefix
        for the file paths.

    Returns
    -------
    (FileSystem, list of str)
        File system object and a list of normalized paths.

    Raises
    ------
    TypeError
        If the passed filesystem has the wrong type.
    IOError
        If the file system is local and a referenced path is not available
        or not a file.
    """
    from pyarrow.fs import (
        LocalFileSystem, SubTreeFileSystem, _MockFileSystem, FileType,
        _ensure_filesystem
    )

    if filesystem is None:
        # fall back to local file system as the default
        filesystem = LocalFileSystem()
    else:
        # construct a filesystem if possible
        filesystem = _ensure_filesystem(filesystem)

    is_local = (
        isinstance(filesystem, (LocalFileSystem, _MockFileSystem)) or
        (isinstance(filesystem, SubTreeFileSystem) and
         isinstance(filesystem.base_fs, LocalFileSystem))
    )

    # allow normalizing irregular paths such as Windows local paths
    paths = [filesystem.normalize_path(_stringify_path(p)) for p in paths]

    # validate that all of the paths are pointing to existing *files*
    if is_local:
        for info in filesystem.get_file_info(paths):
            file_type = info.type
            if file_type == FileType.File:
                continue
            elif file_type == FileType.NotFound:
                raise FileNotFoundError(info.path)
            elif file_type == FileType.Directory:
                raise IsADirectoryError(
                    'Path {} points to a directory, but only file paths are '
                    'supported. To construct a nested or union dataset pass '
                    'a list of dataset objects instead.'.format(info.path)
                )
            else:
                raise IOError(
                    'Path {} exists but its type is unknown (could be a '
                    'special file such as a Unix socket or character '
                    'device, or Windows NUL / CON / ...)'.format(info.path)
                )

    return filesystem, paths


def _ensure_single_source(path, filesystem=None):
    """
    Treat path as either a recursively traversable directory or a single
    file.

    Parameters
    ----------
    path : path-like
    filesystem : FileSystem or str, optional
        If a URI is passed, then its path component will act as a prefix
        for the file paths.

    Returns
    -------
    (FileSystem, list of str or fs.Selector)
        File system object and either a single item list pointing to a file
        or an fs.Selector object pointing to a directory.

    Raises
    ------
    TypeError
        If the passed filesystem has the wrong type.
    FileNotFoundError
        If the referenced file or directory doesn't exist.
    """
    from pyarrow.fs import (
        FileType, FileSelector, _resolve_filesystem_and_path
    )

    # at this point we already checked that `path` is a path-like
    filesystem, path = _resolve_filesystem_and_path(path, filesystem)

    # ensure that the path is normalized before passing to dataset discovery
    path = filesystem.normalize_path(path)

    # retrieve the file descriptor
    file_info = filesystem.get_file_info(path)

    # depending on the path type either return with a recursive directory
    # selector or as a list containing a single file
    if file_info.type == FileType.Directory:
        paths_or_selector = FileSelector(path, recursive=True)
    elif file_info.type == FileType.File:
        paths_or_selector = [path]
    else:
        raise FileNotFoundError(path)

    return filesystem, paths_or_selector
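
# Illustrative sketch (not part of the upstream module): the private helpers
# above normalize user input before dataset discovery. For instance,
# ``_ensure_format`` maps shorthand strings onto FileFormat instances and
# passes existing instances through unchanged:
#
# >>> _ensure_format("parquet")        # -> ParquetFileFormat()
# >>> _ensure_format("feather")        # -> IpcFileFormat()
# >>> _ensure_format(CsvFileFormat())  # -> the same CsvFileFormat object
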
def _filesystem_dataset(source, schema=None, filesystem=None,
                        partitioning=None, format=None,
                        partition_base_dir=None, exclude_invalid_files=None,
                        selector_ignore_prefixes=None):
    """
    Create a FileSystemDataset which can be used to build a Dataset.

    Parameters are documented in the dataset function.

    Returns
    -------
    FileSystemDataset
    """
    format = _ensure_format(format or 'parquet')
    partitioning = _ensure_partitioning(partitioning)

    if isinstance(source, (list, tuple)):
        fs, paths_or_selector = _ensure_multiple_sources(source, filesystem)
    else:
        fs, paths_or_selector = _ensure_single_source(source, filesystem)

    options = FileSystemFactoryOptions(
        partitioning=partitioning,
        partition_base_dir=partition_base_dir,
        exclude_invalid_files=exclude_invalid_files,
        selector_ignore_prefixes=selector_ignore_prefixes
    )
    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)

    return factory.finish(schema)


def _union_dataset(children, schema=None, **kwargs):
    if any(v is not None for v in kwargs.values()):
        raise ValueError(
            "When passing a list of Datasets, you cannot pass any "
            "additional arguments"
        )

    if schema is None:
        # unify the children datasets' schemas
        schema = pa.unify_schemas([child.schema for child in children])

    # create datasets with the requested schema
    children = [child.replace_schema(schema) for child in children]

    return UnionDataset(schema, children)


def parquet_dataset(metadata_path, schema=None, filesystem=None, format=None,
                    partitioning=None, partition_base_dir=None):
    """
    Create a FileSystemDataset from a `_metadata` file created via
    `pyarrow.parquet.write_metadata`.

    Parameters
    ----------
    metadata_path : path
        Path pointing to a single Parquet metadata file.
    schema : Schema, optional
        Optionally provide the Schema for the Dataset, in which case it will
        not be inferred from the source.
    filesystem : FileSystem or URI string, default None
        If a single path is given as source and filesystem is None, then the
        filesystem will be inferred from the path.
        If a URI string is passed, then a filesystem object is constructed
        using the URI's optional path component as a directory prefix. See
        the examples below.
        Note that the URIs on Windows must follow 'file:///C:...' or
        'file:/C:...' patterns.
    format : ParquetFileFormat
        An instance of a ParquetFileFormat if special options need to be
        passed.
    partitioning : Partitioning, PartitioningFactory, str, list of str
        The partitioning scheme specified with the ``partitioning()``
        function. A flavor string can be used as a shortcut, and with a list
        of field names a DirectoryPartitioning will be inferred.
    partition_base_dir : str, optional
        For the purposes of applying the partitioning, paths will be
        stripped of the partition_base_dir. Files not matching the
        partition_base_dir prefix will be skipped for partitioning
        discovery. The ignored files will still be part of the Dataset, but
        will not have partition information.

    Returns
    -------
    FileSystemDataset
    """
    from pyarrow.fs import LocalFileSystem, _ensure_filesystem

    if format is None:
        format = ParquetFileFormat()
    elif not isinstance(format, ParquetFileFormat):
        raise ValueError("format argument must be a ParquetFileFormat")

    if filesystem is None:
        filesystem = LocalFileSystem()
    else:
        filesystem = _ensure_filesystem(filesystem)

    metadata_path = filesystem.normalize_path(_stringify_path(metadata_path))
    options = ParquetFactoryOptions(
        partition_base_dir=partition_base_dir,
        partitioning=_ensure_partitioning(partitioning)
    )

    factory = ParquetDatasetFactory(
        metadata_path, filesystem, format, options=options)
    return factory.finish(schema)
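
# Illustrative sketch (not part of the upstream module): opening a dataset
# through a ``_metadata`` file written by ``pyarrow.parquet.write_metadata``.
# The path and column name below are placeholders.
#
# >>> import pyarrow.dataset as ds
# >>> dset = ds.parquet_dataset("/data/taxi/_metadata", partitioning="hive")
# >>> table = dset.to_table(columns=["fare_amount"])
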
def dataset(source, schema=None, format=None, filesystem=None,
            partitioning=None, partition_base_dir=None,
            exclude_invalid_files=None, ignore_prefixes=None):
    """
    Open a dataset.

    Datasets provide functionality to efficiently work with tabular,
    potentially larger than memory and multi-file datasets:

    - A unified interface for different sources, like Parquet and Feather
    - Discovery of sources (crawling directories, handling directory-based
      partitioned datasets, basic schema normalization)
    - Optimized reading with predicate pushdown (filtering rows), projection
      (selecting columns), parallel reading or fine-grained managing of
      tasks.

    Note that this is the high-level API; to have more control over the
    dataset construction, use the low-level API classes (FileSystemDataset,
    FileSystemDatasetFactory, etc.)

    Parameters
    ----------
    source : path, list of paths, dataset, list of datasets or URI
        Path pointing to a single file:
            Open a FileSystemDataset from a single file.
        Path pointing to a directory:
            The directory gets discovered recursively according to a
            partitioning scheme if given.
        List of file paths:
            Create a FileSystemDataset from explicitly given files. The
            files must be located on the same filesystem given by the
            filesystem parameter.
            Note that, in contrast to constructing a dataset from a single
            file, passing URIs as paths is not allowed.
        List of datasets:
            A nested UnionDataset gets constructed, which allows arbitrary
            composition of other datasets.
            Note that additional keyword arguments are not allowed.
    schema : Schema, optional
        Optionally provide the Schema for the Dataset, in which case it will
        not be inferred from the source.
    format : FileFormat or str
        Currently "parquet" and "ipc"/"arrow"/"feather" are supported. For
        Feather, only version 2 files are supported.
    filesystem : FileSystem or URI string, default None
        If a single path is given as source and filesystem is None, then the
        filesystem will be inferred from the path.
        If a URI string is passed, then a filesystem object is constructed
        using the URI's optional path component as a directory prefix. See
        the examples below.
        Note that the URIs on Windows must follow 'file:///C:...' or
        'file:/C:...' patterns.
    partitioning : Partitioning, PartitioningFactory, str, list of str
        The partitioning scheme specified with the ``partitioning()``
        function. A flavor string can be used as a shortcut, and with a list
        of field names a DirectoryPartitioning will be inferred.
    partition_base_dir : str, optional
        For the purposes of applying the partitioning, paths will be
        stripped of the partition_base_dir. Files not matching the
        partition_base_dir prefix will be skipped for partitioning
        discovery. The ignored files will still be part of the Dataset, but
        will not have partition information.
    exclude_invalid_files : bool, optional (default True)
        If True, invalid files will be excluded (file format specific
        check). This will incur IO for each file in a serial and
        single-threaded fashion. Disabling this feature will skip the IO,
        but unsupported files may be present in the Dataset (resulting in an
        error at scan time).
    ignore_prefixes : list, optional
        Files matching any of these prefixes will be ignored by the
        discovery process. This is matched to the basename of a path.
        By default this is ['.', '_'].
        Note that discovery happens only if a directory is passed as source.

    Returns
    -------
    dataset : Dataset
        Either a FileSystemDataset or a UnionDataset depending on the source
        parameter.

    Examples
    --------
    Opening a single file:

    >>> dataset("path/to/file.parquet", format="parquet")

    Opening a single file with an explicit schema:

    >>> dataset("path/to/file.parquet", schema=myschema, format="parquet")

    Opening a dataset for a single directory:

    >>> dataset("path/to/nyc-taxi/", format="parquet")
    >>> dataset("s3://mybucket/nyc-taxi/", format="parquet")

    Opening a dataset from a list of relative local paths:

    >>> dataset([
    ...     "part0/data.parquet",
    ...     "part1/data.parquet",
    ...     "part3/data.parquet",
    ... ], format='parquet')

    With a filesystem provided:

    >>> paths = [
    ...     'part0/data.parquet',
    ...     'part1/data.parquet',
    ...     'part3/data.parquet',
    ... ]
    >>> dataset(paths, filesystem='file:///directory/prefix', format='parquet')

    Which is equivalent to:

    >>> fs = SubTreeFileSystem("/directory/prefix", LocalFileSystem())
    >>> dataset(paths, filesystem=fs, format='parquet')

    With a remote filesystem URI:

    >>> paths = [
    ...     'nested/directory/part0/data.parquet',
    ...     'nested/directory/part1/data.parquet',
    ...     'nested/directory/part3/data.parquet',
    ... ]
    >>> dataset(paths, filesystem='s3://bucket/', format='parquet')

    Similarly to the local example, the directory prefix may be included in
    the filesystem URI:

    >>> dataset(paths, filesystem='s3://bucket/nested/directory',
    ...         format='parquet')

    Construction of a nested dataset:

    >>> dataset([
    ...     dataset("s3://old-taxi-data", format="parquet"),
    ...     dataset("local/path/to/data", format="ipc")
    ... ])
    """
    # collect the keyword arguments for later reuse
    kwargs = dict(
        schema=schema,
        filesystem=filesystem,
        partitioning=partitioning,
        format=format,
        partition_base_dir=partition_base_dir,
        exclude_invalid_files=exclude_invalid_files,
        selector_ignore_prefixes=ignore_prefixes
    )

    if _is_path_like(source):
        return _filesystem_dataset(source, **kwargs)
    elif isinstance(source, (tuple, list)):
        if all(_is_path_like(elem) for elem in source):
            return _filesystem_dataset(source, **kwargs)
        elif all(isinstance(elem, Dataset) for elem in source):
            return _union_dataset(source, **kwargs)
        else:
            unique_types = set(type(elem).__name__ for elem in source)
            type_names = ', '.join('{}'.format(t) for t in unique_types)
            raise TypeError(
                'Expected a list of path-like or dataset objects. The given '
                'list contains the following types: {}'.format(type_names)
            )
    else:
        raise TypeError(
            'Expected a path-like, list of path-likes or a list of Datasets '
            'instead of the given type: {}'.format(type(source).__name__)
        )
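
# Illustrative sketch (not part of the upstream module): once opened, a
# Dataset can be filtered and projected lazily, so only matching files and
# columns are read. The path and column names below are placeholders.
#
# >>> import pyarrow.dataset as ds
# >>> dset = ds.dataset("path/to/nyc-taxi/", format="parquet",
# ...                   partitioning="hive")
# >>> table = dset.to_table(
# ...     columns=["total_amount"],
# ...     filter=ds.field("year") == 2019)
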
def _ensure_write_partitioning(scheme):
    if scheme is None:
        scheme = partitioning(pa.schema([]))
    if not isinstance(scheme, Partitioning):
        raise ValueError(
            "partitioning needs to be an actual Partitioning object")
    return scheme


def write_dataset(data, base_dir, basename_template=None, format=None,
                  partitioning=None, schema=None, filesystem=None,
                  file_options=None, use_threads=True, max_partitions=None):
    """
    Write a dataset to a given format and partitioning.

    Parameters
    ----------
    data : Dataset, Table/RecordBatch, or list of Table/RecordBatch
        The data to write. This can be a Dataset instance or in-memory Arrow
        data.
    base_dir : str
        The root directory to write the dataset to.
    basename_template : str, optional
        A template string used to generate basenames of written data files.
        The token '{i}' will be replaced with an automatically incremented
        integer. If not specified, it defaults to
        "part-{i}." + format.default_extname
    format : FileFormat or str
        The format in which to write the dataset. Currently supported:
        "parquet", "ipc"/"feather". If a FileSystemDataset is being written
        and `format` is not specified, it defaults to the same format as the
        specified FileSystemDataset. When writing a Table or RecordBatch,
        this keyword is required.
    partitioning : Partitioning, optional
        The partitioning scheme specified with the ``partitioning()``
        function.
    schema : Schema, optional
    filesystem : FileSystem, optional
    file_options : FileWriteOptions, optional
        FileFormat specific write options, created using the
        ``FileFormat.make_write_options()`` function.
    use_threads : bool, default True
        Write files in parallel. If enabled, then maximum parallelism,
        determined by the number of available CPU cores, will be used.
    max_partitions : int, default 1024
        Maximum number of partitions any batch may be written into.
    """
    from pyarrow.fs import LocalFileSystem, _ensure_filesystem

    if isinstance(data, Dataset):
        schema = schema or data.schema
    elif isinstance(data, (pa.Table, pa.RecordBatch)):
        schema = schema or data.schema
        data = [data]
    elif isinstance(data, list):
        schema = schema or data[0].schema
    else:
        raise ValueError(
            "Only Dataset, Table/RecordBatch or a list of Table/RecordBatch "
            "objects are supported."
        )

    if format is None and isinstance(data, FileSystemDataset):
        format = data.format
    else:
        format = _ensure_format(format)

    if file_options is None:
        file_options = format.make_write_options()

    if format != file_options.format:
        raise TypeError(
            "Supplied FileWriteOptions have format {}, which doesn't match "
            "supplied FileFormat {}".format(format, file_options)
        )

    if basename_template is None:
        basename_template = "part-{i}." + format.default_extname

    if max_partitions is None:
        max_partitions = 1024

    partitioning = _ensure_write_partitioning(partitioning)

    if filesystem is None:
        # fall back to local file system as the default
        filesystem = LocalFileSystem()
    else:
        filesystem = _ensure_filesystem(filesystem)

    _filesystemdataset_write(
        data, base_dir, basename_template, schema,
        filesystem, partitioning, file_options, use_threads,
        max_partitions
    )
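
# Illustrative sketch (not part of the upstream module): writing an
# in-memory table out as a hive-partitioned Parquet dataset. The schema,
# data and output directory below are placeholders.
#
# >>> import pyarrow as pa
# >>> import pyarrow.dataset as ds
# >>> table = pa.table({"year": [2019, 2020], "n": [1, 2]})
# >>> ds.write_dataset(
# ...     table, "/tmp/out", format="parquet",
# ...     partitioning=ds.partitioning(
# ...         pa.schema([("year", pa.int64())]), flavor="hive"))
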