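# coding: utf8
# spacy/bin/load_reddit.py
from __future__ import unicode_literals

import bz2
import re
import srsly
import sys
import random
import datetime
import plac
from pathlib import Path

_unset = object()


class Reddit(object):
    """Stream cleaned comments from Reddit."""

    pre_format_re = re.compile(r"^[`*~]")
    post_format_re = re.compile(r"[`*~]$")
    url_re = re.compile(r"\[([^]]+)\]\(%%URL\)")
    link_re = re.compile(r"\[([^]]+)\]\(https?://[^\)]+\)")

    def __init__(self, file_path, meta_keys={"subreddit": "section"}):
        """
        file_path (unicode / Path): Path to archive or directory of archives.
        meta_keys (dict): Meta data key included in the Reddit corpus, mapped
            to display name in Prodigy meta.
        RETURNS (Reddit): The Reddit loader.
        """
        self.meta = meta_keys
        file_path = Path(file_path)
        if not file_path.exists():
            raise IOError("Can't find file path: {}".format(file_path))
        if not file_path.is_dir():
            self.files = [file_path]
        else:
            self.files = list(file_path.iterdir())

    def __iter__(self):
        for file_path in self.iter_files():
            with bz2.open(str(file_path)) as f:
                for line in f:
                    line = line.strip()
                    if not line:
                        continue
                    comment = srsly.json_loads(line)
                    if self.is_valid(comment):
                        text = self.strip_tags(comment["body"])
                        yield {"text": text}

    def get_meta(self, item):
        return {name: item.get(key, "n/a") for key, name in self.meta.items()}

    def iter_files(self):
        for file_path in self.files:
            yield file_path

    def strip_tags(self, text):
        # Replace Markdown links with their link text
        text = self.link_re.sub(r"\1", text)
        # Unescape the HTML entities used for angle brackets
        text = text.replace("&gt;", ">").replace("&lt;", "<")
        # Drop a leading or trailing Markdown formatting character
        text = self.pre_format_re.sub("", text)
        text = self.post_format_re.sub("", text)
        # Collapse runs of whitespace into a single space
        text = re.sub(r"\s+", " ", text)
        return text.strip()

    def is_valid(self, comment):
        return (
            comment["body"] is not None
            and comment["body"] != "[deleted]"
            and comment["body"] != "[removed]"
        )


def main(path):
    reddit = Reddit(path)
    for comment in reddit:
        print(srsly.json_dumps(comment))


if __name__ == "__main__":
    import socket

    try:
        BrokenPipeError
    except NameError:
        # Python 2 has no BrokenPipeError; fall back to socket.error
        BrokenPipeError = socket.error
    try:
        plac.call(main)
    except BrokenPipeError:
        import os, sys

        # Python flushes standard streams on exit; redirect remaining output
        # to devnull to avoid another BrokenPipeError at shutdown
        devnull = os.open(os.devnull, os.O_WRONLY)
        os.dup2(devnull, sys.stdout.fileno())
        sys.exit(1)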