3 f_%@sddlZddljZddlZddlZddlZGdddZ dCddZ dDdd Z ddlZd d Z d d Z ddlmZddlZdEddZddZddZddZddddddddd d!d"d#d$d%d&d'd(d)d*d+d,d-d.d/d0d1d2d3d4d5d6d7d8d9d:d;dgfd?d@Zed=gfdAdBZdS)FNc@s4eZdZdZdZdZdZdZdZdZ dZ d Z d Z d S) colorzzzzzzzzzzN) __name__ __module__ __qualname__ZPURPLECYANZDARKCYANBLUEGREENYELLOWREDZBOLDZ UNDERLINEENDr r F/home/ec2-user/SageMaker/aws-ml-healthcare-workshop/util/preprocess.pyrsr?cCsj|djj|d<|jdddgd}|jdddd gd d d d g}||j|kjjjd|j}||fS)NMEDICAL_CONDITIONZ hemostaticZhematomaZ Hemostasis hemostasisZwoundsZmassesZlesionsZpolypswoundmasslesionpolyp)strlowerreplaceScorer value_countsindexto_list)dfZnFeature thresholdZmcListr r r retrieve_mcLists "rcCs|j|jdkj}|d|f}tjddtj|j|jdd}|j |j ddtj d |d tj d d d tj ddd tj|S)Ng?r)figsizeg?)alpha-)rotationztop z# medical conditions in the patientszNumber of Occurrences )fontsizeZ occurance)rr )rrrpltfiguresnsZbarplotrvaluesset_xticklabelsget_xticklabelstitleZylabelZxlabelshow)rZthreshold_scoreZtopNZ df_mcs_vcZ df_mcs_topZchartr r r mc_barplot!s r0cCsg}g}g}g}x|dD]v}|ddkrl|j|d|j|dd}|drb|ddd }|j|tjtj|tj|tj|d }qW|S) NZEntitiesZCategoryrTextrNaNZTraitsrName)rrZTrait)appendpd DataFrameSeries)Z json_fileZ list_icd10Zmedical_conditionsZscorestraitsrowtraitZdf_mcr r r extractMC_v23s  (r;cCsxtj}t|t|krdSx8t||D]*\}}t|}||d<|d}|j|}q(W|jddgdjddgdd}|S)NzError! different length!IDr)bylast)keep)r5r6lenzipr;r4 sort_valuesdrop_duplicates)ZtranscriptionListZ patientIDListdf_finalitemZ patient_idZdf_indr r r extractMCbatchGsrG)tqdmc Cs||j|kj}|j|dd}td|j||jjdk}td|jtjdddd}|d j }g}x*t |d D]}|j |d } |j | qxW||fS) N{)n random_statezoriginal data shape is Tz-data shape after removing missing entries is Zcomprehendmedicalz us-east-1) service_nameuse_ssl region_nameid transcription)r1) medical_specialty reset_indexsampleprintshaperQnotnaboto3clientrrHZdetect_entities_v2r4) rrRZ sampleSizedf_subZ df_sub_subcmZ patient_idsZtranscrption_listtextZcomprehend_resultr r r subpopulation_comprehendcs    r]c Cs`tjdd|jddddfj}tj|dddtjddddd d }|j|jd d d dS)N)r!r=rrrI)rKT)ZvminZvmaxcenterZcmapsquarer#right)r$Zhorizontalalignment)r^r^) r(r)iloccorrr*ZheatmapZdiverging_paletter,r-)rreaxr r r corrPlots  rgcCsD||j|k}x0|jD]$\}}|j}|j|j|j|k|f<qW|S)N)riterrowsr<rloc)Zdf_rawrE conditionrZrr9Zsidr r r dataframe_converts rkcCs*d|_tjj|_tj|j|jdddS)Nztext/csvzutf-8,)sep) content_type sagemaker predictorcsv_serializer serializernp fromstringpredictdecode)rpdatar r r predict_from_numpys rxZ nontenderz foreign bodyZedemaalertZmurmurz chest painZvomitingz hiatal herniaZdistressrzcarpal tunnel syndromeZ endometriosisZweaknessZpainrZ inflammationrZbleedingZ hypertensionZsuppleZfeverZstenosisrZcyanosisZ infectionZerythemaZ normocephalicZfracturerZ ulcerationZnauseaZcoughZtumorZsoftzshortness of breathZinjuryZdiabetesr<LabelcCsl|jdgdj}||}tj|d}||||<x|D]}t|||}q8W|jd}|djt|d<|S)Nr<)subset)columnsrrz)rDcopyr5r6rkfillnaastypeint)df_mcs colname_mc colname_otherdf_1 column_names df_combinedconr r r df_mc_generators    rcCsZ|jdgdj}||}tj|d}||||<x|D]}t|||}q8W|jd}|S)Nr<)r{)r|r)rDr}r5r6rkr~)rrrrrrrr r r df_mc_generator_slims    r)rr)rr)rI)Zseabornr*matplotlib.pyplotpyplotr(pandasr5rXbotocorerrr0r;rGrHr]rgrkrxrrrr r r r s0    #