ó <żCVc@sČdZddlmZmZddlmZddlmZerUddlm Z nddlm Z yddl Z Wne k rŽdZ nXdefd „ƒYZd „Zed krÄeƒndS( u˜ A module for language identification using the TextCat algorithm. An implementation of the text categorization algorithm presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text Categorization". The algorithm takes advantage of Zipf's law and uses n-gram frequencies to profile languages and text-yet to be identified-then compares using a distance measure. Language n-grams are provided by the "An Crubadan" project. A corpus reader was created seperately to read those files. For details regarding the algorithm, see: http://www.let.rug.nl/~vannoord/TextCat/textcat.pdf For details about An Crubadan, see: http://borel.slu.edu/crubadan/index.html i˙˙˙˙(tprint_functiontunicode_literals(tPY3(ttrigrams(tmaxsize(tmaxintNtTextCatcBs\eZdZiZdZdZiZd„Zd„Z d„Z d„Z d„Z d„Z RS( ucCs\tstdƒ‚nddlm}||_x'|jjƒD]}|jj|ƒq>WdS(Nu›classify.textcat requires the regex module that supports unicode. Try '$ pip install regex' and see https://pypi.python.org/pypi/regex for further details.i˙˙˙˙(tcrubadan(tretEnvironmentErrort nltk.corpusRt_corpustlangst lang_freq(tselfRtlang((sg/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/classify/textcat.pyt__init__?s  cCstjdd|ƒS(u+ Get rid of punctuation except apostrophes u [^\P{P}\']+u(Rtsub(Rttext((sg/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/classify/textcat.pytremove_punctuationLsc CsĹddlm}m}|j|ƒ}||ƒ}|ƒ}x„|D]|}t|j||jƒ}g|D]} dj| ƒ^qh} x7| D]/} | |krŻ|| cd7s  ` *