ó <¿CVc@s½ddlmZmZddlZddlZddlZyddlZWnek r]nXddlm Z ddl m Z e de fd„ƒYƒZ d„Z edkr¹e ƒndS( iÿÿÿÿ(tprint_functiontunicode_literalsN(tVectorSpaceClusterer(tpython_2_unicode_compatibletKMeansClusterercBszeZdZddd ed d ed„Zed„Zed„Zd„Zd„Z d„Z d „Z d „Z d „Z RS( uü The K-means clusterer starts with k arbitrary chosen means then allocates each vector to the cluster with the closest mean. It then recalculates the means of each cluster as the centroid of the vectors in the cluster. This process repeats until the cluster memberships stabilise. This is a hill-climbing algorithm which may converge to a local maximum. Hence the clustering is often repeated with random initial means and the most commonly occurring output means are chosen. igíµ ÷Æ°>c Cs²tj|||ƒ||_||_||_| sMt|ƒ|ksMt‚||_|dksht‚|ow|dk st‚||_|r–|n t j ƒ|_ | |_ dS(uè :param num_means: the number of means to use (may use fewer) :type num_means: int :param distance: measure of distance between two vectors :type distance: function taking two vectors and returing a float :param repeats: number of randomised clustering trials to use :type repeats: int :param conv_test: maximum variation in mean differences before deemed convergent :type conv_test: number :param initial_means: set of k initial means :type initial_means: sequence of vectors :param normalise: should vectors be normalised to length 1 :type normalise: boolean :param svd_dimensions: number of dimensions to use in reducing vector dimensionsionality with SVD :type svd_dimensions: int :param rng: random number generator (or None) :type rng: Random :param avoid_empty_clusters: include current centroid in computation of next one; avoids undefined behavior when clusters become empty :type avoid_empty_clusters: boolean iN( Rt__init__t _num_meanst _distancet_max_differencetlentAssertionErrort_meanst_repeatstrandomtRandomt_rngt_avoid_empty_clusters( tselft num_meanstdistancetrepeatst conv_testt initial_meanst normalisetsvd_dimensionstrngtavoid_empty_clusters((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyR#s     c Cs›|jr%|jdkr%tdƒng}x‡t|jƒD]v}|rWtd|ƒn|j sm|dkr‘|jjt|ƒ|jƒ|_n|j||ƒ|j |jƒq;Wt |ƒdkr—x|D]}|j dt ƒqÎWd}}x–tt |ƒƒD]‚}d} xGtt |ƒƒD]3} || kr$| |j|||| ƒ7} q$q$W|dkss| |kr| ||}}qqW||_ndS(Niu6Warning: means will be discarded for subsequent trialsu k-means trialtkeyi(R R tprinttrangeRtsampletlistRt_cluster_vectorspacetappendR tsorttsumtNonet_sum_distances( Rtvectorsttracetmeanssttrialtmeanstmin_differencet min_meanstitdtj((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pytcluster_vectorspaceLs, $   %c Csé|jt|ƒkråt}xÇ|ságt|jƒD] }g^q4}x.|D]&}|j|ƒ}||j|ƒqMW|rŠtdƒntt|j ||j ƒƒ}|j |j |ƒ} | |j krÕt }n||_ qWndS(Nu iteration(RR tFalseRtclassify_vectorspaceR!RRtmapt _centroidR R%RtTrue( RR&R't convergedtmtclusterstvectortindext new_meanst difference((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyR ks "   cCsud}}xdtt|jƒƒD]M}|j|}|j||ƒ}|dks]||kr ||}}q q W|S(N(R$RR R R(RR9t best_distancet best_indexR:tmeantdist((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyR2†s  cCs!|jrt|jƒS|jSdS(N(R R R(R((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyt num_clusters‘s  cCs|jS(u0 The means used for clustering. (R (R((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyR*—scCs@d}x3t||ƒD]"\}}||j||ƒ7}qW|S(Ng(tzipR(Rtvectors1tvectors2R<tutv((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyR%scCsÐ|jrKtj|ƒ}x|D]}||7}qW|dtt|ƒƒSt|ƒs†tjjdƒtjjdƒts†t‚ntj|dƒ}x|dD]}||7}q¤W|tt|ƒƒSdS(Niu.Error: no centroid defined for empty cluster. u4Try setting argument 'avoid_empty_clusters' to True i( RtcopytfloatR tsyststderrtwriteR1R (RtclusterR?tcentroidR9((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyR4£s   cCsd|j|jfS(Nu%(R R (R((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyt__repr__³sN(t__name__t __module__t__doc__R$R1RR0R R2RAR*R%R4RN(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyRs &      cCsÏddlm}m}gddgddgddgddggD]}tj|ƒ^qA}ddgd d gg}|d|d |ƒ}|j|td tƒ}td |ƒtd |ƒtd|jƒƒtƒgddgddgddgddgddgddggD]}tj|ƒ^q}|d|ddƒ}|j|tƒ}td |ƒtd |ƒtd|jƒƒtƒtjddgƒ}td|ddƒt|j |ƒƒtƒdS(Niÿÿÿÿ(Rteuclidean_distanceiiiiiiiRR'u Clustered:uAs:uMeans:iRi u classify(%s):tendu ( t nltk.clusterRRRtnumpytarrayRLR5RR*tclassify(RRRtfR&R*t clustererR8R9((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pytdemo¹s(F  X  u__main__(t __future__RRRGR RIRUt ImportErrortnltk.cluster.utilRt nltk.compatRRRZRO(((se/private/var/folders/cc/xm4nqn811x9b50x1q_zpkmvdjlphkp/T/pip-build-FUwmDn/nltk/nltk/cluster/kmeans.pyts    ¡ "