U Dx`@sdZddlmZddlZddlmZddlZddlZddlZddl m Z m Z m Z m Z mZmZmZmZmZddlZddlmZddlZddlmZddlmZdd lmZmZmZm Z m!Z!dd l"m#Z#m$Z$dd l%m&Z&m'Z'm(Z(dd l)m*Z*m+Z+m,Z,m-Z-m.Z.m/Z/m0Z0m1Z1dd l2m3Z3ddl4m5Z5ddl6m7Z7ddl8m9Z9ddl:m;Z;dZdZ?dZ@dZAde=de>de?de@deAd ZBde=de>dZCde=de>de?deAd ZDd d!d"d#d$d%d&d'd(g ZEed)d*d*ZFe9d+d,d-ZGe9eHe9d.d/d0ZId1ZJGd2d3d3eKZLd4ZMGd5d6d6eKZNd7ZOGd8d9d9eKZPd:ZQGd;d<dd?d@ZTGdAdBdBZUGdCdDdDZVGdEdFdFZWGdGdHdHeWejXZYe#eBdpeeZeZeeHeZeZeeeHeZee[eZe!ee5eYfdK dLdMZ\eHeHdNdOdPZ]e e[e dQdRdSZ^eHej_dTdUdVZ`e e e e dWdXdYZaej_e9e[dZd[d\Zbdqe9e[eZeHd^d_d`Zce$e3jddadbGdcddddeWZeej_e9eZe[dedfdgZfeeHegfe[egdQdhdiZhGdjdkdkZiGdldmdmeeZjGdndodoejZkdS)ra Module contains tools for processing Stata files into DataFrames The StataReader below was originally written by Joe Presbrey as part of PyDTA. It has been extended and improved by Skipper Seabold from the Statsmodels project who also developed the StataWriter and was finally added to pandas in a once again improved version. You can find more information on http://presbrey.mit.edu/PyDTA and https://www.statsmodels.org/devel/ )abcN)BytesIO) AnyAnyStrDictListOptionalSequenceTupleUnioncast) relativedelta) infer_dtype)max_len_string_array)BufferCompressionOptionsFilePathOrBufferLabelStorageOptions)Appenderdoc) ensure_objectis_categorical_dtypeis_datetime64_dtype) Categorical DatetimeIndexNaT Timestampconcatisna to_datetime to_timedelta)generic) DataFrame)Index)Series) get_handlezVersion of given Stata file is {version}. pandas supports importing versions 105, 108, 111 (Stata 7SE), 113 (Stata 8/9), 114 (Stata 10/11), 115 (Stata 12), 117 (Stata 13), 118 (Stata 14/15/16),and 119 (Stata 15/16, over 32,767 variables).zconvert_dates : bool, default True Convert date variables to DataFrame time values. convert_categoricals : bool, default True Read value labels and convert columns to Categorical/Factor variables.aindex_col : str, optional Column to set as index. convert_missing : bool, default False Flag indicating whether to convert missing values to their Stata representations. If False, missing values are replaced with nan. If True, columns containing missing values are returned with object data types and missing values are represented by StataMissingValue objects. preserve_dtypes : bool, default True Preserve Stata datatypes. If False, numeric data are upcast to pandas default types for foreign data (float64 or int64). columns : list or None Columns to retain. Columns will be returned in the given order. None returns all columns. order_categoricals : bool, default True Flag indicating whether converted categorical data are ordered.zzchunksize : int, default None Return StataReader object for iterations, returns chunks with given number of lines.z=iterator : bool, default False Return StataReader object.zNotes ----- Categorical variables read through an iterator may not have the same categories and dtype. This occurs when a variable stored in a DTA file is associated to an incomplete set of value labels that only label a strict subset of the values.a> Read Stata file into DataFrame. Parameters ---------- filepath_or_buffer : str, path object or file-like object Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: ``file://localhost/path/to/table.dta``. If you want to pass in a path object, pandas accepts any ``os.PathLike``. By file-like object, we refer to objects with a ``read()`` method, such as a file handle (e.g. via builtin ``open`` function) or ``StringIO``.  z Returns ------- DataFrame or StataReader See Also -------- io.stata.StataReader : Low-level reader for Stata data files. DataFrame.to_stata: Export Stata data files. z Examples -------- Read a Stata dta file: >>> df = pd.read_stata('filename.dta') Read a Stata dta file in 10,000 line chunks: >>> itr = pd.read_stata('filename.dta', chunksize=10000) >>> for chunk in itr: ... do_something(chunk) zReads observations from Stata file, converting them into a dataframe Parameters ---------- nrows : int Number of lines to read from data file, if None read whole file. z Returns ------- DataFrame zClass for reading Stata dta files. Parameters ---------- path_or_buf : path (string), buffer or path object string, path object (pathlib.Path or py._path.local.LocalPath) or object implementing a binary read() functions. z %tc%tC%td%d%tw%tm%tq%th%tyreturncsftjjtjjtjtdddjtjtdddjddddddtdfdd }tdfd d }tdfd d }t|}d }| rd}t|}d||<| tj }| drt }|} ||| d} n^| dr*tdt|td} |r&t| |<| S| drNt }|} ||| d} n| drt j|d} |dd} || | } n| drt j|d} |dd} || | } n| drt j|d} |ddd}|| |} nl| drt j|d } |d d!d} || | } n6| d"rD|} t|}|| |} ntd#|d$|rbt| |<| S)%a Convert from SIF to datetime. https://www.stata.com/help.cgi?datetime Parameters ---------- dates : Series The Stata Internal Format date to convert to datetime according to fmt fmt : str The format to convert to. Can be, tc, td, tw, tm, tq, th, ty Returns Returns ------- converted : Series The converted dates Examples -------- >>> dates = pd.Series([52]) >>> _stata_elapsed_date_to_datetime_vec(dates , "%tw") 0 1961-01-01 dtype: datetime64[ns] Notes ----- datetime/c - tc milliseconds since 01jan1960 00:00:00.000, assuming 86,400 s/day datetime/C - tC - NOT IMPLEMENTED milliseconds since 01jan1960 00:00:00.000, adjusted for leap seconds date - td days since 01jan1960 (01jan1960 = 0) weekly date - tw weeks since 1960w1 This assumes 52 weeks in a year, then adds 7 * remainder of the weeks. The datetime value is the start of the week in terms of days in the year, not ISO calendar weeks. monthly date - tm months since 1960m1 quarterly date - tq quarters since 1960q1 half-yearly date - th half-years since 1960h1 yearly date - ty years since 0000 r1r2ir3csX|kr,|kr,td||ddSt|dd}tddt||D|dSdS) z Convert year and month to datetimes, using pandas vectorized versions when the date range falls within the range supported by pandas. Otherwise it falls back to a slower but more robust method using datetime. dz%Y%mformatindexNcSsg|]\}}t||dqSr2)datetime).0ymr@6/tmp/pip-target-zr53vnty/lib/python/pandas/io/stata.py szX_stata_elapsed_date_to_datetime_vec..convert_year_month_safe..r:)maxminr getattrr%zip)yearmonthr:MAX_YEARMIN_YEARr@rAconvert_year_month_safes zD_stata_elapsed_date_to_datetime_vec..convert_year_month_safecsd|dkr4|kr4t|ddt|ddSt|dd}dd t||D}t||d SdS) z{ Converts year (e.g. 1999) and days since the start of the year to a datetime or datetime64 Series r2%Yr8dunitr:NcSs,g|]$\}}t|ddtt|dqS)r2days)r<r int)r=r>rOr@r@rArB szW_stata_elapsed_date_to_datetime_vec..convert_year_days_safe..rC)rDrEr r!rFrGr%)rHrSr:valuerJr@rAconvert_year_days_safes zC_stata_elapsed_date_to_datetime_vec..convert_year_days_safecst|dd}|dkrL|ks,|krfdd|D}t||dSnH|dkr|ksl|krfdd|D}t||dSntd tt||d }|S) z Convert base dates and deltas to datetimes, using pandas vectorized versions if the deltas satisfy restrictions required to be expressed as dates in pandas. r:NrOcsg|]}tt|dqS)rRr rTr=rObaser@rArBszS_stata_elapsed_date_to_datetime_vec..convert_delta_safe..rCmscs"g|]}tt|ddqS)r6) microsecondsrWrXrYr@rArBszformat not understoodrP)rFrDrEr% ValueErrorr r!)rZZdeltasrQr:values) MAX_DAY_DELTA MAX_MS_DELTA MIN_DAY_DELTA MIN_MS_DELTArYrAconvert_delta_safes   z?_stata_elapsed_date_to_datetime_vec..convert_delta_safeFTg?r(tcr[r)ZtCz9Encountered %tC format. Leaving in Stata Internal Format.dtype)r*tdr+rOrOr,tw4r-tm r.tqr/thr0tyz Date fmt  not understood)rrErHrDr<rSr%npisnananyastypeint64 startswith stata_epochwarningswarnobjectrZ ones_liker])datesfmtrMrVrcZbad_locsZhas_bad_valuesZdata_colrZr[ conv_datesrSrHrIZ quarter_monthZ first_monthr@)r_r`rKrarbrLrA#_stata_elapsed_date_to_datetime_vecsj.                    r)rrr4cs|jddd"fdd }t|}|j|r`t|}t|rXtt||<nt||<|dkr||dd}|jd}n<|d krt d |}n"|d kr||dd}|j}n|d kr||ddd }d|j tj |j d}n|dkr"||dd}d|j tj |j d}n|dkrX||dd}d|j tj |j dd}nf|dkr||dd}d|j tj |j dk t}n.|dkr||dd}|j }ntd|dt|tjd}tddd }|||<t|d!S)#aX Convert from datetime to SIF. https://www.stata.com/help.cgi?datetime Parameters ---------- dates : Series Series or array containing datetime.datetime or datetime64[ns] to convert to the Stata Internal Format given by fmt fmt : str The format to convert to. Can be, tc, td, tw, tm, tq, th, ty l"R:r6Fc sVi}t|jr|r0|t}|jtjd|d<|s8|rXt|}|jj |d<|jj |d<|r|tjt |dddtj}||d<nt |dd d krB|r|jt}t jtd fd d }t|} | ||d<|r|dd} | jd|d<| j|dd|d<|rJt j td dd} t| } | ||d<ntdt|dS)Nr6deltarHrIrNr8rSFZskipnar<xr4cs|jd|j|jS)Ni@B)rSsecondsr\r) US_PER_DAYr@rAfszC_datetime_to_stata_elapsed_vec..parse_dates_safe..fcSsd|j|jS)Nr7)rHrIrr@r@rAzJ_datetime_to_stata_elapsed_vec..parse_dates_safe..r7cSs|t|jddjS)Nr2)r<rHrSrr@r@rAgszC_datetime_to_stata_elapsed_vec..parse_dates_safe..gzZColumns containing dates must contain either datetime64, datetime.datetime or null values.rC)rrhr_valuesrr|rr_datarHrIr rr< timedeltafloatZ vectorizeapplyrTr]r#) rrrHrSrOZ time_deltaZ date_indexZ days_in_nsrvZ year_monthrZ NS_PER_DAYrr:r@rAparse_dates_safepsF        z8_datetime_to_stata_elapsed_vec..parse_dates_saferdT)rrfz'Stata Internal Format tC not supported.)r*rirj)rHrSrlrmrn)rHrpr2rqrsrtrurwrxryFormat z! is not a known Stata date formatrg.)keyrr2i}zaStata value labels for a single variable must have a combined length less than 32,000 characters.rgrs)r]namelabname _encodingcat categorieslistrGr|arangelen value_labelssorttext_lentxtn isinstancestrrrvalue_label_mismatch_docr9rencodeappendarrayroffval)selfrrroffsetsr^vlcategoryr@r@rA__init__gsB       zStataValueLabel.__init__) byteorderr4c Cs*|j}t}d}|t|d|jt|jdd|}|dkrLdnd}t ||d}||t dD]}|td |qp|t|d|j |t|d|j |j D]}|t|d|q|jD]} |t|d| q|jD]} || |q|d |S) a! Generate the binary representation of the value labels. Parameters ---------- byteorder : str Byte order of the output Returns ------- value_label : bytes Bytes containing the formatted value label iN )rutf8r2rtcr)rrwriterpackrrrr _pad_bytesrangerrrrrseekread) rrrbio null_byterZlab_lenroffsetrUtextr@r@rAgenerate_value_labels(      z$StataValueLabel.generate_value_labelN)r) rrr__doc__r%rrbytesrr@r@r@rAr[s ,rc@seZdZUdZiZeeefed<dZ e D]4Z dee <e ddD]Z de de ee e <q@q*dZed d d Ze dD]dZ ed ed Zdee<e d kreee de 7<ed ed ed eZed eZq|d Zeddd Ze dD]fZ eded Zdee<e d kr Integer missing values make the code '.', '.a', ..., '.z' to the ranges 101 ... 127 (for int8), 32741 ... 32767 (for int16) and 2147483621 ... 2147483647 (for int32). Missing values for floating point data types are more complex but the pattern is simple to discern from the following table. np.float32 missing values (float in Stata) 0000007f . 0008007f .a 0010007f .b ... 00c0007f .x 00c8007f .y 00d0007f .z np.float64 missing values (double in Stata) 000000000000e07f . 000000000001e07f .a 000000000002e07f .b ... 000000000018e07f .x 000000000019e07f .y 00000000001ae07f .z MISSING_VALUES)e.r2`zs          z(StataMissingValue.get_base_missing_valueN)&rrrrrrrr__annotations__basesbrrchrZ float32_baserr incrementrrZ int_valueZ float64_baserr rTrpropertyr rUr rrboolr classmethodr|rhrr@r@r@rArsR %     rc@seZdZddZdS) StataParserc CsttttddddtddDdttjfdttjfdttjfdttj fd ttj fg|_ ttj ttj ttj ttjttjttjd |_ ttdtd |_d d ddddd |_d}d}d}d}dddt td|dt td|dft td|dt td|dfd|_ddddd d|_ddd t tdd!dt tdd"dd|_d#d$d%d&d'd(d)|_d*|_dS)+Nr2cSsg|]}tdt|qS)a)r|rhr)r=rr@r@rArBbsz(StataParser.__init__..)ZbhlfdQrOrlhrsrsr)rr7)rr)rrrrr)rr,r+rrO)bilfr7rrrrri1i2i4Zf4Zf8u8)rr,r+rrOr*)d?d@Z.e(e(dAdBdCZ/e(e ee(dDdEdFZ0e(e1ee1ee2e feffe eee(dGdHdIZ3e4eddJdKZ5e1eefddLdMZ6e1ee1ee2e feffddNdOZ7Z8S)S StataReaderTNF) path_or_buf convert_datesconvert_categoricals index_colconvert_missingpreserve_dtypescolumnsorder_categoricals chunksizestorage_optionsc stg|_||_||_||_||_||_||_||_ d|_ | |_ d|_ |j dkr^d|_ nt | trp| dkrxtdd|_d|_d|_d|_d|_d|_d|_d|_ttj|_t|d| dd} | j} W5QRXt| |_ |!|"dS)NrFr2rz.chunksize must be a positive integer when set.rb)rgis_text)#superr col_sizes_convert_dates_convert_categoricals _index_col_convert_missing_preserve_dtypes_columns_order_categoricalsr _chunksize_using_iteratorrrTr]Z_has_string_dataZ_missing_values_can_read_value_labels_column_selector_set_value_labels_read _data_read_dtype _lines_read_set_endiannesssysr_native_byteorderr&handlerrr^ _read_header _setup_dtype) rr^r_r`rarbrcrdrerfrghandlescontents __class__r@rArsH    zStataReader.__init__r3cCs|S)z enter context manager r@rr@r@rA __enter__-szStataReader.__enter__cCs |dS)z exit context manager N)close)rexc_type exc_value tracebackr@r@rA__exit__1szStataReader.__exit__cCs|jdS)z close the handle if its open N)r^rrr@r@rAr5szStataReader.closecCs|jdkrd|_nd|_dS)zC Set string encoding which depends on file version vrrN)format_versionrrr@r@rA _set_encoding9s zStataReader._set_encodingcshjd}td|ddkr*n |tddjDdk_fddjD_ dS)Nr2rr.csg|]}|qSr@) _calcsizer=typrr@rArBLs) r^rrr_read_new_header_read_old_headerrtyplistZhas_string_datark)r first_charr@rrArBs    zStataReader._read_headercCs|jdt|jd|_|jdkr:ttj|jd||jd|jddkrbdpdd|_|jd |jd krd nd }|jd krd nd}t |j||j|d|_ |jd| |_ |jd||_|jd||_|jd|jd|jdt |jd|jddd|_t |jd|jddd|_t |jd|jddd|_t |jd|jddd|_t |jd|jddd|_||_|jdt |jd|jddd|_t |jd|jddd|_t |jd|jddd|_||j\|_|_|j|j| |_!|j|jt |jd|j d|jd |j ddd|_"|j|j|#|_$|j|j|%|_&|j|j|'|_(dS)NrrturwversionsMSF><rHIrwrsrrm rr rxr,r2))r^rrTrr]_version_errorr9rrrrnvar _get_nobsnobs_get_data_label _data_label_get_time_stamp time_stampZ_seek_vartypesZ_seek_varnamesZ_seek_sortlistZ _seek_formats_seek_value_label_names_get_seek_variable_labelsZ_seek_variable_labels data_location seek_strlsseek_value_labels _get_dtypesrdtyplistr _get_varlistvarlistsrtlist _get_fmtlistfmtlist _get_lbllistlbllist_get_variable_labels_variable_labels)r nvar_typeZ nvar_sizer@r@rArNsv                           zStataReader._read_new_header) seek_vartypesr4csj|fddtjD}ttttfdfdd fdd|D}ttttjfdfdd fd d|D}||fS) Ncs*g|]"}tjdjddqS)rrwr)rrrr^rr=_rr@rArBsz+StataReader._get_dtypes..)rr4c sR|dkr |Sz j|WStk rL}ztd|d|W5d}~XYnXdS)Ncannot convert stata types [])rXKeyErrorr]rerrrr@rArs  z"StataReader._get_dtypes..fcsg|] }|qSr@r@r)rr@rArBsc sV|dkrt|Sz j|WStk rP}ztd|d|W5d}~XYnXdS)Nrzcannot convert stata dtype [r)rrVrr]rrr@rArs  z"StataReader._get_dtypes..gcsg|] }|qSr@r@r)rr@rArBs) r^rrrrTr rr|rh)rrZ raw_typlistrrr@)rrrrArs  zStataReader._get_dtypescs,jdkrdndfddtjDS)Nr!csg|]}jqSr@_decoder^rrrrr@rArBsz,StataReader._get_varlist..rrrrr@rrArszStataReader._get_varlistcsNjdkrdn$jdkr dnjdkr0dndfdd tjDS) Nr9q1hrprmcsg|]}jqSr@rrrr@rArBsz,StataReader._get_fmtlist..rrr@rrArs   zStataReader._get_fmtlistcs>jdkrdnjdkr dndfddtjDS)Nrrr/rrcsg|]}jqSr@rrrr@rArBsz,StataReader._get_lbllist..rrr@rrArs   zStataReader._get_lbllistcsdjdkr$fddtjD}n<jdkrHfddtjD}nfddtjD}|S)Nrcsg|]}jdqS)iArrrr@rArBsz4StataReader._get_variable_labels..r.csg|]}jdqS)Qrrrr@rArBscsg|]}jdqS)rrrrr@rArBsr)rZvlblistr@rrArs     z StataReader._get_variable_labelscCsJ|jdkr(t|jd|jddSt|jd|jddSdS)Nrr*rrrrs)rrrrr^rrr@r@rArs zStataReader._get_nobscCs|jdkr:t|jd|jdd}||j|S|jdkrntd|jdd}||j|S|jdkr||jd S||jd SdS) Nrrrwrrrr2r.rr)rrrrr^rrrZstrlenr@r@rArs   zStataReader._get_data_labelcCs|jdkr4td|jdd}|j|dS|jdkrhtd|jdd}||j|S|jdkr||jdStdS) Nrrr2rrrr)rrrr^rdecoderr]rr@r@rArs   zStataReader._get_time_stampcCsd|jdkr.|jd|jd|jddS|jdkrZt|jd|jdddStdS) Nrrrrrr) rr^rrrrrrr]rr@r@rArs    "z%StataReader._get_seek_variable_labels)rr4c s@td|d_jdkr.ttjjdtdjdddkrVdpXd_ tdjdd_ jdtj djd d_ _ __jd krfd d tj D}nZjj }tj|tjd }g}|D]2}|jkr,|j|n||dq zfdd |D_WnJtk r}z*ddd|D}td|d|W5d}~XYnXzfdd |D_WnJtk r}z*ddd|D}td|d|W5d}~XYnXjd kr.fdd tj D_nfdd tj D_tj dj djd j ddd___ !_"jdkr0tj djdd} jd krtj djdd} ntj djd d} | dkr q0j| qj#_$dS)Nrr)rr.r/orrsrr2rrrrwr/csg|]}tjdqSr;)ordr^rrrr@rArB$sz0StataReader._read_old_header..rgcsg|]}j|qSr@)rWrrr@rArB0s,css|]}t|VqdSr rrr@r@rA 2sz/StataReader._read_old_header..rrcsg|]}j|qSr@)rUrrr@rArB5scss|]}t|VqdSr rrr@r@rAr7szcannot convert stata dtypes [csg|]}jdqS)rrrrr@rArB;scsg|]}jdqS)rrrrr@rArB?sr,rrrrs)%rrrr]rr9rr^rrZfiletyperrrrrrrrr| frombufferrrZrrjoinrrrrrrrrrtellr) rrrbufZtyplistbtprZ invalid_typesZinvalid_dtypesZ data_typeZdata_lenr@rrArs "       $$             zStataReader._read_old_headercCs|jdk r|jSg}t|jD]^\}}||jkr^tt|}|dt||j|j|fq|dt|dt|fqt ||_|jS)z"Map between numpy and state dtypesNsS) ry enumeraterr[r rrrr|rh)rdtypesrrr@r@rArfs   $  zStataReader._setup_dtyperr4cCst|tr|St|j|Sr )rrTrcalcsizerrrr@r@rArvs zStataReader._calcsizerr4cCs^|dd}z||jWStk rX|j}d|d}t|t|dYSXdS)Nrrz@ One or more strings in the dta file could not be decoded using z, and so the fallback encoding of latin-1 is being used. This can happen when a file has been incorrectly encoded by Stata or some other software. You should verify the string values returned are correct.r) partitionrrUnicodeDecodeErrorrrUnicodeWarning)rrrmsgr@r@rAr{s zStataReader._decodec Cs |jr dS|jdkr$d|_i|_dS|jdkr>|j|jn.|jdk sLt|j|jj }|j|j |d|_i|_|jdkr|j ddkrq|j d}|sq|jdkr| |j d}n| |j d}|j d t |jd |j dd }t |jd |j dd }tj|j d||jd |d }tj|j d||jd |d }t|}||}||}|j |} i|j|<t|D]H} | |dkr|| dn|} | | || | |j||| <q|jdkrx|j dqxd|_dS)Nr/Trs|j?|}|sg}d }|D]}||j2} | t2tj@t2tjAfkr`t2tjB} d}n8| t2tjCt2tjDt2tjEfkrt2tjF} d}|4|||G| fq"|rt6t7|}|dk r|%|H|}|S)NrTrdrr)Z convert_dtypecSsg|] }|dk qSr r@)r=Zdtypr@r@rArB[sz$StataReader.read..FrcstfddtDS)Nc3s|]}|VqdSr )r)r=rrr@rArtsz;StataReader.read..any_startswith..)r~ _date_formatsrr@rrAany_startswithssz(StataReader.read..any_startswithcsg|] }|qSr@r@rr r@rArBvsr/)Irrurxrr#rrlrmrorprqrrrnrrwrryrrzrrEr StopIterationr^rrr|rrrr}Zbyteswap newbyteorderrZ from_recordsrdrZ set_index_do_select_columnsr]rGrr rTrr _insert_strlswhererr:rhrrr% from_dictrT_do_convert_missingrrrr_do_convert_categoricalsrrfloat16rrrrrrrpop)rrr_r`rarbrcrdrerhZ max_read_lenread_lenr read_linesrixrrZcols_Zrequires_type_conversiondata_formattedrcolsZ retyped_dataconvertr@r rArs                          zStataReader.read)rrbr4cCsJi}t|D]\}}|j|}||jkr,q tt|}|j|\}}||} t| |k| |k} | sjq |rtt | d} tj | | dd\} } t | t d}t| D]&\}}t |}| | |k}||j|<qn2| j}|tjtjfkrtj}t | |d}tj|| <|||<q |rF|j}t|}t||jd|gd}||}|S)NrT)Zreturn_inversergr2)rrrYr rr| logical_orr~ZnonzeroZasarrayuniquer%rrilocrhrrnanrdr#rZdrop)rrrb replacementsrZcolnamerZnminZnmaxZseriesmissingZ missing_locZumissingZ umissing_loc replacementjZumrlocrhrdZreplacement_dfZreplacedr@r@rArs>        zStataReader._do_convert_missingrcsltdrtjdkr|StjD]@\}}|dkr8q&fdd|jdd|fD|jdd|f<q&|S)Nrrr*csg|]}jt|qSr@)rr)r=krr@rArBsz-StataReader._insert_strls..)hasattrrrrrr)rrrrr@rrArs0zStataReader._insert_strls)rrdr4c Cs|jst|}t|t|kr&td||j}|rRdt|}td|g}g}g}g} |D]P} |j| } | |j | | |j | | |j | | |j | qf||_ ||_ ||_ | |_ d|_||S)Nz"columns contains duplicate entriesz, z "%DDs" where DD is the length of the string. If not a string, raise ValueError float64 -> "%10.0g" float32 -> "%9.0g" int64 -> "%9.0g" int32 -> "%12.0g" int16 -> "%8.0g" int8 -> "%8.0g" strl -> "%9s" rrz%9srCr2rz%10.0gz%9.0gz%12.0gz%8.0grFrGN)r r|rIrrrr]excessive_string_length_errorr9rrrDrrrrrrA)rhr1rKrL max_str_lenrr@r@rA_dtype_to_default_stata_fmts*    rPrg)rgc s<eZdZdZdZdZdOeeee e e fe ee ee j ee ee e e feed fdd Ze dd d d Zedd ddZeedddZeedddZddddZe e dddZeedddZeddddZeddd d!Zddd"d#Zddd$d%Zddd&d'Zddd(d)Zddd*d+Z ddd,d-Z!ddd.d/Z"ddd0d1Z#ddd2d3Z$dPee ee j dd4d5d6Z%ddd7d8Z&ddd9d:Z'ddd;d<Z(ddd=d>Z)ddd?d@Z*dddAdBZ+eeddCdDZ,e-j.ddEdFZ/e-j.ddGdHdIZ0e1e e dJdKdLZ2e edJdMdNZ3Z4S)Q StataWritera A class for writing Stata binary dta files Parameters ---------- fname : path (string), buffer or path object string, path object (pathlib.Path or py._path.local.LocalPath) or object implementing a binary write() functions. If using a buffer then the buffer will not be automatically closed after the file is written. data : DataFrame Input to save convert_dates : dict Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are 'tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to 'tc'. Raises NotImplementedError if a datetime column has timezone information write_index : bool Write the index to Stata dataset. byteorder : str Can be ">", "<", "little", or "big". default is `sys.byteorder` time_stamp : datetime A datetime to use as file creation date. Default is the current time data_label : str A label for the data set. Must be 80 characters or smaller. variable_labels : dict Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller. compression : str or dict, default 'infer' For on-the-fly compression of the output dta. If string, specifies compression mode. If dict, value at key 'method' specifies compression mode. Compression mode must be one of {{'infer', 'gzip', 'bz2', 'zip', 'xz', None}}. If compression mode is 'infer' and `fname` is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no compression). If dict and compression mode is one of {{'zip', 'gzip', 'bz2'}}, or inferred as one of the above, other entries passed as additional compression options. .. versionadded:: 1.1.0 {storage_options} .. versionadded:: 1.2.0 Returns ------- writer : StataWriter instance The StataWriter instance has a write_file method, which will write the file to the given `fname`. Raises ------ NotImplementedError * If datetimes contain timezone information ValueError * Columns listed in convert_dates are neither datetime64[ns] or datetime.datetime * Column dtype is not representable in Stata * Column listed in convert_dates is not in DataFrame * Categorical label contains more than 32,000 characters Examples -------- >>> data = pd.DataFrame([[1.0, 1]], columns=['a', 'b']) >>> writer = StataWriter('./data_file.dta', data) >>> writer.write_file() Directly write a zip file >>> compression = {{"method": "zip", "archive_name": "data_file.dta"}} >>> writer = StataWriter('./data_file.zip', data, compression=compression) >>> writer.write_file() Save a DataFrame with dates >>> from datetime import datetime >>> data = pd.DataFrame([[datetime(2000,1,1)]], columns=['date']) >>> writer = StataWriter('./date_data_file.dta', data, {{'date' : 'tw'}}) >>> writer.write_file() rMrNTinfer) fnamerr_ write_indexrrr2r3 compressionrgc st|dkrin||_||_||_||_||_| |_d|_| || |_ |dkr^t j }t ||_||_tjtjtjd|_i|_dS)N)r!r r)rjrrl _write_index _time_stamprr _compression _output_file_prepare_pandasrgr|rr{ _byteorder_fnamer|rrrZtype_converters_converted_names) rrSrr_rTrrr2r3rUrgrr@rAras   zStataWriter.__init__)to_writer4cCs|jj||jdS)zS Helper to call encode before writing to file for Python 3 compat. N)rr~rrr)rr^r@r@rA_writes zStataWriter._write)rUr4cCs|jj|dS)z? Helper to assert file is open before writing. N)rr~rrr@r@rA _write_bytesszStataWriter._write_bytesrc s&fddD}||_g|_t|s*Stj}g}t|D]\}}|rt||jd}|j||j j j }|t j krtd|j j j} | ||kr|t jkrt j}n|t jkrt j}nt j}t j| |d} ||| | dk<||| fq>|||fq>tt|S)z Check for categorical columns, retain categorical information for Stata file and convert categorical data to int csg|]}t|jqSr@)rrhr=rrr@rArBsz5StataWriter._prepare_categoricals..)rzCIt is not possible to export int64-based categorical data to Stata.rgr) _is_col_cat _value_labelsr~rrrGrrrrcodesrhr|rr]rcopyrDrrrrrr#rrT) rrZis_catrrrZ col_is_catZsvlrhr^r@rbrA_prepare_categoricalss8    z!StataWriter._prepare_categoricalscCsZ|D]P}||j}|tjtjfkr|tjkr8|jd}n |jd}|||||<q|S)z Checks floating point data columns for nans, and replaces these with the generic Stata for missing value (.) rrO)rhr|rrrfillna)rrrrhr#r@r@rA _replace_nanss    zStataWriter._replace_nansr3cCsdS)zNo-op, forward compatibilityNr@rr@r@rA_update_strl_namesszStataWriter._update_strl_namesrr4cCsR|D]H}|dks|dkr|dks(|dkr|dks8|dkr|dkr||d}q|S)a Validate variable names for Stata export. Parameters ---------- name : str Variable name Returns ------- str The validated name with invalid characters replaced with underscores. Notes ----- Stata 114 and 117 support ascii characters in a-z, A-Z, 0-9 and _. AZrzr9r)replacerrrr@r@rA_validate_variable_names"z#StataWriter._validate_variable_namecCsi}t|j}|dd}d}t|D]\}}|}t|tsDt|}||}||jkr`d|}d|dkrxdkrnnd|}|dtt|d}||ks| |dkrdt||}|dtt|d}|d7}q|||<|||<q&t ||_|j r z )rrdrrrrrr\rErrr$rlrGitemsrinvalid_name_docr9rrrrr]rj)rrZconverted_namesrdZoriginal_columnsZduplicate_var_idr$r orig_nameroZconversion_warningrrr@r@rA_check_column_namessJ            zStataWriter._check_column_namesrr4cCsRg|_g|_|D]8\}}|jt||j||jt||j|qdSr )rrrsrrPrrJ)rrrrhr@r@rA_set_formats_and_types1 s z"StataWriter._set_formats_and_typescCs|}|jr$|}t|tr$|}||}t|}||}||}|j \|_ |_ ||_ |j |_|j}|D]&}||jkrqtt||rtd|j|<qtt|j|j|_|jD] }t|j|}t|||<q||||jdk r|jD]}t|tr|j||j|<qdS)Nre)rfrVZ reset_indexrr#rwrrirgshaperrrrdtolistrrrlrrErBr|rh_encode_stringsryrTr)rrtemprrrnew_typer@r@rArZ8 s>             zStataWriter._prepare_pandasc Cs|j}t|dg}t|jD]\}}||ks||kr6q|j|}|j}|jtjkrt|dd}|dkst |dks|j }t d|d|j|j |j}tt|j|jkr||j|<qdS) z Encode strings in dta-specific encoding Do not encode columns marked for date conversion or for strL conversion. The strL converter independently handles conversion and also accepts empty string arrays. _convert_strlTrr rzColumn `a` cannot be exported. Only string-like object arrays containing all strings or a mix of strings and None can be exported. Object arrays containing only null values are prohibited. Other object types cannot be exported and must first be converted to one of the supported types.N)rlrFrrrhr r|rIrrrr]rrrrrr_max_string_length) rr_ convert_strlrrr1rhZinferred_dtypeencodedr@r@rAr|o s,     zStataWriter._encode_stringsc Cst|jd|jd|jdn|_|jjddk rV|jjt|_|j_|jj |jjz|j |j |j d||||||||||}|||||||Wntk r}zt|jt|jt t!j"frnt!j#$|jrnzt!%|jWn,t&k rlt'(d|jdt)YnX|W5d}~XYnXW5QRXdS)NwbF)rUrirgmethod)r2rz!This save was not successful but z. could not be deleted. This file is not valid.)*r&r\rXrgrrUr~rrYZcreated_handlesr _write_headerrrW _write_map_write_variable_types_write_varnames_write_sortlist_write_formats_write_value_label_names_write_variable_labels_write_expansion_fields_write_characteristics _prepare_data _write_data _write_strls_write_value_labels_write_file_close_tag_close ExceptionrrrosPathLikepathisfileunlinkOSErrorrrResourceWarning)rrecordsexcr@r@rA write_file sZ     zStataWriter.write_filecCsF|jdk rBt|jjtst|jj|j}|j_|jj|dS)z Close the file if it was created by the writer. If a buffer or file-like object was passed in, for example a GzipFile, then leave this file open for the caller to close. N)rYrrr~rrrgetvalue)rrr@r@rAr s zStataWriter._closecCsdSNo-op, future compatibilityNr@rr@r@rAr szStataWriter._write_mapcCsdSrr@rr@r@rAr sz!StataWriter._write_file_close_tagcCsdSrr@rr@r@rAr sz"StataWriter._write_characteristicscCsdSrr@rr@r@rAr szStataWriter._write_strlscCs|tdddS)z"Write 5 zeros for expansion fieldsrrN)r_rrr@r@rAr sz#StataWriter._write_expansion_fieldscCs"|jD]}|||jqdSr )rdr`rr[)rrr@r@rAr s zStataWriter._write_value_labelsr2rr4c CsH|j}|tdd||dkr(dp*d|d|d|t|d|jdd|t|d |jdd |dkr||td d n||t|dd d |dkrt j }nt |t j st d ddddddddddddg }ddt |D}|d||j|d}|||dS)Nrrrr?r,rwrrsrP"time_stamp should be datetime typeJanFebMarAprMayJunJulAugSepOctNovDeccSsi|]\}}|d|qSr;r@r=rrIr@r@rA  sz-StataWriter._write_header..%d %Y %H:%M)r[r`rrr_rr_null_terminate_bytesrr<nowrr]rstrftimerI)rr2rrmonths month_lookuptsr@r@rAr sJ      zStataWriter._write_headercCs"|jD]}|td|qdS)Nr)rr`rr)rrr@r@rAr" s z!StataWriter._write_variable_typescCs6|jD]*}||}t|ddd}||qdS)Nrr)r_null_terminate_strrr_)rrr@r@rAr& s  zStataWriter._write_varnamescCs"tdd|jd}||dS)Nrrwr2)rrr_)rrr@r@rAr. szStataWriter._write_sortlistcCs |jD]}|t|dqdS)Nr)rr_rrr@r@rAr3 s zStataWriter._write_formatscCs`t|jD]P}|j|rJ|j|}||}t|ddd}||q |tddq dS)Nrrr)rrrcrrrr_)rrrr@r@rAr8 s    z$StataWriter._write_value_label_namescCstdd}|jdkr2t|jD]}||qdS|jD]f}||jkr|j|}t|dkrdtdtdd|D}|std|t|dq8||q8dS)Nrrr.Variable labels must be 80 characters or fewercss|]}t|dkVqdS)N)r)r=rr@r@rArR sz5StataWriter._write_variable_labels..zKVariable labels must contain only characters that can be encoded in Latin-1) rrrrr_rrr]r.)rblankrrr0Z is_latin1r@r@rArD s"       z"StataWriter._write_variable_labelscCs|S)rr@)rrr@r@rA_convert_strls\ szStataWriter._convert_strlsc Cs|j}|j}|j}|jdk rNt|D](\}}||kr$t|||j|||<q$||}i}|jtt j k}t|D]\}}||}||j kr|| dj t|fd||<d|} | ||<||| ||<qt||j} |s| |j} | ||<qt|jd|dS)Nr)argsrF)r:Z column_dtypes)rrrlrrrrr[r{r|rrrhrrrrhrZ to_records) rrrr_rrrZnative_byteorderrstyperhr@r@rAr` s2        zStataWriter._prepare_data)rr4cCs||dSr )r`tobytesrrr@r@rAr szStataWriter._write_datarcCs |d7}|S)Nr?r@)rr@r@rAr szStataWriter._null_terminate_strcCs|||jSr )rrr)rrr@r@rAr sz!StataWriter._null_terminate_bytes)NTNNNNrRN)NN)5rrrrrrrr#rrrrrr<rrrr_rr`rgrirjrrrwr%ryrZr|rrrrrrrrrrrrrrrrr|Zrecarrayrr staticmethodrrr6r@r@rrArQ sxQ *E7%2  7  rQ)rhr1rLr4cCs|rdS|jtjkr", "<", "little", or "big". default is `sys.byteorder` Notes ----- Supports creation of the StrL block of a dta file for dta versions 117, 118 and 119. These differ in how the GSO is stored. 118 and 119 store the GSO lookup value as a uint32 and a uint64, while 117 uses two uint32s. 118 and 119 also encode all strings as unicode which is required by the format. 117 uses 'latin-1' a fixed width encoding that extends the 7-bit ascii table with an additional 128 characters. rN)dfrdrrcCs|dkrtd||_||_||_ddi|_|dkr:tj}t||_d}d}d|_ |dkrjd }d}d |_ n|d krxd }nd }ddd||_ ||_ ||_ dS)Nrz,Only dta versions 117, 118 and 119 supportedrrrrr*rrrsrrrxrrwr) r]Z_dta_verrrd _gso_tabler|rr{r[r_o_offet _gso_o_type _gso_v_type)rrrdrrZ gso_v_typeZ gso_o_typeZo_sizer@r@rAr s,  zStataStrLWriter.__init__)rr4cCs|\}}||j|Sr )r)rrrrvr@r@rA _convert_key szStataStrLWriter._convert_keyr3cs|j}|j}t|j||j}fdd|jD}tj|jtjd}t| D]x\}\}}t|D]b\} \} } || } | dkrdn| } | | d} | dkr| d|df} | || <| | ||| f<qfqRt|jD]\}} |dd|f|| <q||fS)a Generates the GSO lookup table for the DataFrame Returns ------- gso_table : dict Ordered dictionary using the string found as keys and their lookup position (v,o) as values gso_df : DataFrame DataFrame where strl columns have been converted to (v,o) values Notes ----- Modifies the DataFrame in-place. The DataFrame returned encodes the (v,o) values as uint64s. The encoding depends on the dta version, and can be expressed as enc = v + o * 2 ** (o_size * 8) so that v is stored in the lower bits and o is in the upper bits. o_size is * 117: 4 * 118: 6 * 119: 5 csg|]}||fqSr@rCrar r@rArB" sz2StataStrLWriter.generate_table..rgNrr2) rrrrdr|emptyrzrrZiterrowsgetr)r gso_tableZgso_dfselectedZ col_indexr,rvidxrowr$rrrrrr@r rAgenerate_table s$   zStataStrLWriter.generate_table)rr4cCst}tdd}t|jdd}t|jdd}|j|j}|j|j}|jd}|D]\} } | dkrpq^| \} } |||t|| |t|| ||t| d} |t|t | d || ||q^| d| S) a Generates the binary blob of GSOs that is written to the dta file. Parameters ---------- gso_table : dict Ordered dictionary (str, vo) Returns ------- gso : bytes Binary content of dta file to be placed between strl tags Notes ----- Output format depends on dta version. 117 uses two uint32s to express v and o while 118+ uses a uint32 for v and a uint64 for o. rasciirrrrrrr2) rrrrr[rrrsrrrr)rrrZgsoZgso_typenullZv_typeZo_typeZlen_typeZstrlZvorrvZ utf8_stringr@r@rA generate_blob4 s*          zStataStrLWriter.generate_blob)rN)rrrrr#r rrTrrr rrrrrr@r@r@rAr s  !&3rc seZdZdZdZdZd6eeee e e fe ee ee j ee ee e e feee eed fdd Zeee efe ed d d Ze dd ddZd7ee ee j ddddZddddZddddZddddZddddZddddZddddZddd d!Zddd"d#Zddd$d%Z ddd&d'Z!ddd(d)Z"ddd*d+Z#ddd,d-Z$ddd.d/Z%eed0d1d2Z&e'dd3d4d5Z(Z)S)8StataWriter117a A class for writing Stata binary dta files in Stata 13 format (117) Parameters ---------- fname : path (string), buffer or path object string, path object (pathlib.Path or py._path.local.LocalPath) or object implementing a binary write() functions. If using a buffer then the buffer will not be automatically closed after the file is written. data : DataFrame Input to save convert_dates : dict Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are 'tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to 'tc'. Raises NotImplementedError if a datetime column has timezone information write_index : bool Write the index to Stata dataset. byteorder : str Can be ">", "<", "little", or "big". default is `sys.byteorder` time_stamp : datetime A datetime to use as file creation date. Default is the current time data_label : str A label for the data set. Must be 80 characters or smaller. variable_labels : dict Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller. convert_strl : list List of columns names to convert to Stata StrL format. Columns with more than 2045 characters are automatically written as StrL. Smaller columns can be converted by including the column name. Using StrLs can reduce output file size when strings are longer than 8 characters, and either frequently repeated or sparse. compression : str or dict, default 'infer' For on-the-fly compression of the output dta. If string, specifies compression mode. If dict, value at key 'method' specifies compression mode. Compression mode must be one of {'infer', 'gzip', 'bz2', 'zip', 'xz', None}. If compression mode is 'infer' and `fname` is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no compression). If dict and compression mode is one of {'zip', 'gzip', 'bz2'}, or inferred as one of the above, other entries passed as additional compression options. .. versionadded:: 1.1.0 Returns ------- writer : StataWriter117 instance The StataWriter117 instance has a write_file method, which will write the file to the given `fname`. Raises ------ NotImplementedError * If datetimes contain timezone information ValueError * Columns listed in convert_dates are neither datetime64[ns] or datetime.datetime * Column dtype is not representable in Stata * Column listed in convert_dates is not in DataFrame * Categorical label contains more than 32,000 characters Examples -------- >>> from pandas.io.stata import StataWriter117 >>> data = pd.DataFrame([[1.0, 1, 'a']], columns=['a', 'b', 'c']) >>> writer = StataWriter117('./data_file.dta', data) >>> writer.write_file() Directly write a zip file >>> compression = {"method": "zip", "archive_name": "data_file.dta"} >>> writer = StataWriter117('./data_file.zip', data, compression=compression) >>> writer.write_file() Or with long strings stored in strl format >>> data = pd.DataFrame([['A relatively long string'], [''], ['']], ... columns=['strls']) >>> writer = StataWriter117('./data_file_with_long_strings.dta', data, ... convert_strl=['strls']) >>> writer.write_file() rrNTrR) rSrr_rTrrr2r3rrUrgc sJg|_| dk r|j| tj||||||||| | d i|_d|_dS)N)rrr2r3rUrgr)rextendrjr_map _strl_blob) rrSrr_rTrrr2r3rrUrgrr@rAr s"  zStataWriter117.__init__)rtagr4cCs<t|trt|d}td|dd|td|ddS)zSurround val with rrrzrreleaserZMSFZLSFrrrrKrr*NNrrrr0rrrrrrrrrrrrrcSsi|]\}}|d|qSr;r@rr@r@rAr+ sz0StataWriter117._write_header..rr timestamprheader)r[r`rrrrr _dta_versionrrrrrrrr<rrr]rrrIrr)rr2rrrrZ nobs_sizer0Z encoded_labelZ label_sizeZ label_lenrrrZstata_tsr@r@rAr sV      zStataWriter117._write_headerr3cCs|js2d|jjddddddddddddd|_|jj|jdt}|jD]}|t |j d|qV|d| | | ddS)z Called twice during file write. The first populates the values in the map with 0s. The second call writes the final map locations when all blocks have been written. r)Z stata_datamapvariable_typesvarnamessortlistformatsvalue_label_namesr3characteristicsrstrlsrstata_data_close end-of-filerr*N)rrr~rrrr^rrrr[r`rr)rrrr@r@rAr7 s,  zStataWriter117._write_mapcCsX|dt}|jD]}|t|jd|q|d|| | ddS)Nrrr) rrrrrrr[rr`rr)rrrr@r@rArV s    z$StataWriter117._write_variable_typescCs|dt}|jdkrdnd}|jD]6}||}t|dd|j|d}||q(| d| | | ddS)Nrrrrr2r) rrrrrrrrrrr`rr)rrZvn_lenrr@r@rAr^ s     zStataWriter117._write_varnamescCs@|d|jdkrdnd}||d||jdddS)Nrrrwrsrr2)rrr`rr)rZ sort_sizer@r@rArj s zStataWriter117._write_sortlistcCsj|dt}|jdkrdnd}|jD]}|t||j|q(|d| | | ddS)Nrrrrr) rrrrrrrrrr`rr)rrZfmt_lenrr@r@rAro s   zStataWriter117._write_formatscCs|dt}|jdkrdnd}t|jD]N}d}|j|rH|j|}||}t|dd |j |d}| |q,| d| ||ddS)Nrrrrrr2r)rrrrrrcrrrrrrrr`rr)rrvl_lenrr encoded_namer@r@rArx s      z'StataWriter117._write_value_label_namesc Cs8|dt}|jdkrdnd}td|d}|jdkrxt|jD]}||qD|d| | | ddS|j D]}||jkr|j|}t |dkrtdz||j}Wn4tk r}ztd |j|W5d}~XYnX|t||dq~||q~|d| | | ddS) Nr3rri@rr2rrzDVariable labels must contain only characters that can be encoded in )rrrrrrrrrr`rrrrr]rrUnicodeEncodeError) rrrrrrr0rrr@r@rAr s6           z%StataWriter117._write_variable_labelscCs |d||dddS)Nrr)rr`rrr@r@rAr s z%StataWriter117._write_characteristicscCs0|d|d|||ddS)Nrss)rr`rrr@r@rAr s  zStataWriter117._write_datacCs"|d|||jddS)Nr)rr`rrrr@r@rAr s zStataWriter117._write_strlscCsdS)zNo-op in dta 117+Nr@rr@r@rAr sz&StataWriter117._write_expansion_fieldscCsb|dt}|jD]&}||j}||d}||q|d||| ddS)NrZlblr) rrrdrr[rrrr`r)rrrZlabr@r@rAr s      z"StataWriter117._write_value_labelscCs(|d|tdd|ddS)Nrz rr)rr`rrr@r@rAr s z$StataWriter117._write_file_close_tagcCs8|jD](\}}||jkr |j|}||j|<q dS)z Update column names for conversion to strl if they might have been changed to comply with Stata naming rules N)r]rsrr:)rorignewrr@r@rArj s  z!StataWriter117._update_strl_namesrcsJfddt|D}|rFt||jd}|\}}|}||_|S)zg Convert columns to StrLs if either very large or in the convert_strl variable cs,g|]$\}}j|dks$|jkr|qS)r$)rr)r=rrrr@rArB s z1StataWriter117._convert_strls..r)rrrrrr)rrZ convert_colsZsswtabZnew_datar@rrAr s   zStataWriter117._convert_strlsrxcCsjg|_g|_|D]P\}}||jk}t||j||j|d}|j||jt||j||qdS)N)rKrL) rrrsrrPrrrr)rrrrhrLrr@r@rAry s  z%StataWriter117._set_formats_and_types) NTNNNNNrRN)NN)*rrrrrrrr#rrrrrr<r rrrrr rrrrrrrrrrrrrrrrrrjrr%ryr6r@r@rrAru sfU " ;  "  rcseZdZdZdZd eeeee e fe ee ee j ee eee e fee e eeeed fdd Ze e d d d ZZS) StataWriterUTF8ub Stata binary dta file writing in Stata 15 (118) and 16 (119) formats DTA 118 and 119 format files support unicode string data (both fixed and strL) format. Unicode is also supported in value labels, variable labels and the dataset label. Format 119 is automatically used if the file contains more than 32,767 variables. .. versionadded:: 1.0.0 Parameters ---------- fname : path (string), buffer or path object string, path object (pathlib.Path or py._path.local.LocalPath) or object implementing a binary write() functions. If using a buffer then the buffer will not be automatically closed after the file is written. data : DataFrame Input to save convert_dates : dict, default None Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are 'tc', 'td', 'tm', 'tw', 'th', 'tq', 'ty'. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to 'tc'. Raises NotImplementedError if a datetime column has timezone information write_index : bool, default True Write the index to Stata dataset. byteorder : str, default None Can be ">", "<", "little", or "big". default is `sys.byteorder` time_stamp : datetime, default None A datetime to use as file creation date. Default is the current time data_label : str, default None A label for the data set. Must be 80 characters or smaller. variable_labels : dict, default None Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller. convert_strl : list, default None List of columns names to convert to Stata StrL format. Columns with more than 2045 characters are automatically written as StrL. Smaller columns can be converted by including the column name. Using StrLs can reduce output file size when strings are longer than 8 characters, and either frequently repeated or sparse. version : int, default None The dta version to use. By default, uses the size of data to determine the version. 118 is used if data.shape[1] <= 32767, and 119 is used for storing larger DataFrames. compression : str or dict, default 'infer' For on-the-fly compression of the output dta. If string, specifies compression mode. If dict, value at key 'method' specifies compression mode. Compression mode must be one of {'infer', 'gzip', 'bz2', 'zip', 'xz', None}. If compression mode is 'infer' and `fname` is path-like, then detect compression from the following extensions: '.gz', '.bz2', '.zip', or '.xz' (otherwise no compression). If dict and compression mode is one of {'zip', 'gzip', 'bz2'}, or inferred as one of the above, other entries passed as additional compression options. .. versionadded:: 1.1.0 Returns ------- StataWriterUTF8 The instance has a write_file method, which will write the file to the given `fname`. Raises ------ NotImplementedError * If datetimes contain timezone information ValueError * Columns listed in convert_dates are neither datetime64[ns] or datetime.datetime * Column dtype is not representable in Stata * Column listed in convert_dates is not in DataFrame * Categorical label contains more than 32,000 characters Examples -------- Using Unicode data and column names >>> from pandas.io.stata import StataWriterUTF8 >>> data = pd.DataFrame([[1.0, 1, 'ᴬ']], columns=['a', 'β', 'ĉ']) >>> writer = StataWriterUTF8('./data_file.dta', data) >>> writer.write_file() Directly write a zip file >>> compression = {"method": "zip", "archive_name": "data_file.dta"} >>> writer = StataWriterUTF8('./data_file.zip', data, compression=compression) >>> writer.write_file() Or with long strings stored in strl format >>> data = pd.DataFrame([['ᴀ relatively long ŝtring'], [''], ['']], ... columns=['strls']) >>> writer = StataWriterUTF8('./data_file_with_long_strings.dta', data, ... convert_strl=['strls']) >>> writer.write_file() rNTrR) rSrr_rTrrr2r3rrrUrgc s|| dkr |jddkrdnd} n0| dkr2tdn| dkrP|jddkrPtdtj||||||||| | | d | |_dS) Nr2irr)rrz"version must be either 118 or 119.zKYou must use version 119 for data sets containing more than32,767 variables) r_rTrrr2r3rrUrg)rzr]rjrr) rrSrr_rTrrr2r3rrrUrgrr@rAr_ s, zStataWriterUTF8.__init__rkcCsz|D]p}t|dkrL|dks$|dkrL|dks4|dkrL|dksD|dkrL|dkshdt|krdd krnq||d}q|S) a Validate variable names for Stata export. Parameters ---------- name : str Variable name Returns ------- str The validated name with invalid characters replaced with underscores. Notes ----- Stata 118+ support most unicode characters. The only limitation is in the ascii range where the characters supported are a-z, A-Z, 0-9 and _. rrlrmrrnrrorr)rrprqr@r@rArr s0  z'StataWriterUTF8._validate_variable_name) NTNNNNNNrRN)rrrrrrr#rrrrrr<r rTrrrrrr6r@r@rrAr s6c )r) TTNFTNTNFN)rF)lr collectionsrr<iorrrr|typingrrrrrr r r r rZdateutil.relativedeltar numpyr|Zpandas._libs.librZpandas._libs.writersrZpandas._typingrrrrrZpandas.util._decoratorsrrZpandas.core.dtypes.commonrrrZpandasrrrrrrr r!Z pandas.corer"Zpandas.core.framer#Zpandas.core.indexes.baser$Zpandas.core.seriesr%Zpandas.io.commonr&rZ_statafile_processing_params1Z_statafile_processing_params2Z_chunksize_paramsZ_iterator_paramsZ _reader_notesZ_read_stata_docr5r4r rrrrrNWarningrrrrrrtrr/rrrrIteratorr]rrTr:r{rrhrBrErJrPZ _shared_docsrQrrrrrrr@r@r@rAs   ,   (     ,   #g  Yn#p  %  % -* 7