## Summary
This notebook evaluates an ML model design for its capacity to learn an embedding that can distinguish between different "mechanisms of action" (MOA) in the bbbc021 dataset. It does this by considering N trained models, where N is the number of chemical compounds with known MOA. Each of the N models differs from the others in that one particular compound was left out of its training set. Each model can then be tested against its "left out" compound to evaluate how accurately it classifies that compound's MOA using knowledge learned from other compounds sharing the same MOA. The bbbc021 dataset has 12 MOA and 38 compounds with known MOA (there are several representative compounds per MOA, and up to 8 different concentrations per compound). There are a total of 103 "treatments" in the bbbc021 dataset with known MOA, where a treatment is the application of a particular compound at a particular concentration.
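The leave-one-compound-out scheme described above can be sketched as follows. This is a minimal illustration and not the training pipeline used for this notebook: the function name `leave_one_compound_out` and the toy concentrations are assumptions, while the compound and MOA names are taken from the dataset.

```python
def leave_one_compound_out(treatments):
    """Yield (left_out_compound, train, test) splits.

    `treatments` is a list of (compound, concentration, moa) tuples.
    Every concentration of the left-out compound goes to the test set.
    """
    compounds = sorted({compound for compound, _, _ in treatments})
    for left_out in compounds:
        train = [t for t in treatments if t[0] != left_out]
        test = [t for t in treatments if t[0] == left_out]
        yield left_out, train, test

# Toy treatments; the concentrations are hypothetical.
toy_treatments = [
    ("taxol", 0.3, "Microtubule stabilizers"),
    ("taxol", 1.0, "Microtubule stabilizers"),
    ("docetaxel", 0.3, "Microtubule stabilizers"),
    ("ALLN", 3.0, "Protein degradation"),
    ("MG-132", 0.1, "Protein degradation"),
]

splits = list(leave_one_compound_out(toy_treatments))
for left_out, train, test in splits:
    print(left_out, len(train), len(test))
```

One split is produced per compound, so N compounds give N models, each of which never sees any concentration of its own left-out compound.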

During training, the network model learns to compute an embedding (vector space) that tries to position compounds with the same MOA close together, while keeping compounds with differing MOA farther apart. Once trained, a model can be used to predict the MOA of an unknown (or untrained) compound by finding its nearest labeled neighbors in the embedding space.
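As a rough illustration of nearest-neighbor prediction in an embedding space, the sketch below picks the label of the most similar labeled vector under cosine similarity. The function `predict_moa` and the two-dimensional toy vectors are assumptions made for this example; real embeddings come from the trained models.

```python
import numpy as np

def predict_moa(query, labeled_embeddings, labeled_moas):
    """Return the MOA of the labeled embedding closest to `query` (cosine)."""
    q = query / np.linalg.norm(query)
    e = labeled_embeddings / np.linalg.norm(labeled_embeddings, axis=1, keepdims=True)
    similarities = e @ q  # cosine similarity to every labeled vector
    return labeled_moas[int(np.argmax(similarities))]

# Hypothetical 2-D embeddings, for illustration only.
labeled = np.array([[1.0, 0.0], [0.0, 1.0]])
moas = ["Actin disruptors", "DNA damage"]
prediction = predict_moa(np.array([0.9, 0.1]), labeled, moas)
print(prediction)
```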

This notebook assumes each of the N models is trained and available for evaluation, and that each image in the dataset has a computed embedding corresponding to each model.

For each of the N "one compound left out" models, the mean embedding for each of M treatments is computed. Then an MOA is assigned to each treatment of the left-out compound (i.e., for each concentration separately) based on its nearest neighbor among the treatments of other compounds. This is called NSC, or "Not Same Compound", analysis.
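The NSC procedure can be sketched as below, assuming a treatment is keyed by a `(compound, concentration)` tuple and that cosine similarity is the distance measure. The helper names `treatment_means` and `nsc_predict` and all embedding values are hypothetical and serve only to illustrate the idea.

```python
from collections import defaultdict
import numpy as np

def treatment_means(image_embeddings, image_treatments):
    """Average the per-image embeddings of each (compound, concentration)."""
    groups = defaultdict(list)
    for embedding, treatment in zip(image_embeddings, image_treatments):
        groups[treatment].append(embedding)
    return {t: np.mean(vectors, axis=0) for t, vectors in groups.items()}

def nsc_predict(means, treatment_moa, query_treatment):
    """Assign the MOA of the nearest treatment that is Not the Same Compound."""
    query_compound = query_treatment[0]
    q = means[query_treatment]
    best_treatment, best_similarity = None, -np.inf
    for treatment, vector in means.items():
        if treatment[0] == query_compound:
            continue  # skip every concentration of the query compound
        similarity = float(q @ vector / (np.linalg.norm(q) * np.linalg.norm(vector)))
        if similarity > best_similarity:
            best_treatment, best_similarity = treatment, similarity
    return treatment_moa[best_treatment]

# Hypothetical image embeddings and their treatments.
embeddings = [np.array(v) for v in
              [[1.0, 0.1], [0.9, 0.0], [1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]]
treatments = [("taxol", 1.0), ("taxol", 1.0), ("taxol", 0.3),
              ("docetaxel", 0.3), ("ALLN", 3.0)]
moa_of = {("taxol", 1.0): "Microtubule stabilizers",
          ("taxol", 0.3): "Microtubule stabilizers",
          ("docetaxel", 0.3): "Microtubule stabilizers",
          ("ALLN", 3.0): "Protein degradation"}

means = treatment_means(embeddings, treatments)
predicted = nsc_predict(means, moa_of, ("taxol", 1.0))
print(predicted)
```

Although the other taxol treatment is the closest vector here, it is skipped, so the prediction has to come from a different compound.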

[ TODO: Another analysis is done, NSCB, called "Not Same Compound or Batch", in which, in addition to the compound being left out (at all concentrations) of nearest-neighbor consideration, all compounds prepared in the same batch are also left out, to prevent batch-related characteristics from biasing the results. This is only possible for 10 of the 12 MOAs, because 2 only have representatives in a single batch. ]
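A sketch of the NSCB candidate filter is given here under stated assumptions: each treatment is a dict with `compound`, `batch`, and `moa` keys, and the batch numbers are invented for the example. The notebook itself does not implement this analysis yet.

```python
def nscb_candidates(query, treatments):
    """Keep treatments that share neither compound nor batch with the query."""
    return [
        t for t in treatments
        if t["compound"] != query["compound"] and t["batch"] != query["batch"]
    ]

# Hypothetical batch assignments, for illustration only.
toy = [
    {"compound": "taxol", "batch": 1, "moa": "Microtubule stabilizers"},
    {"compound": "docetaxel", "batch": 1, "moa": "Microtubule stabilizers"},
    {"compound": "docetaxel", "batch": 2, "moa": "Microtubule stabilizers"},
    {"compound": "ALLN", "batch": 2, "moa": "Protein degradation"},
    {"compound": "MG-132", "batch": 2, "moa": "Protein degradation"},
]

candidates = nscb_candidates(toy[0], toy)
print([(t["compound"], t["batch"]) for t in candidates])
```

If an MOA has representatives in only one batch, this filter removes every same-MOA candidate for its compounds, which is why such MOAs cannot be evaluated this way.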

In [1]:
!pip install shortuuid



In [2]:
import sys
import os
import math
import base64
import boto3
import sagemaker
import matplotlib.pyplot as plt
import numpy as np
import collections
from collections import defaultdict
from PIL import Image
import sklearn
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib.ticker import NullFormatter
from sklearn import manifold, datasets
from time import time
from time import sleep

In [3]:
EMBEDDING_NAME = 'bbbc021'
BASELINE_TRAIN_ID = 'bneoLZG9npVDBeLCwx6qoE'

In [4]:
s3c = boto3.client('s3')

In [5]:
%pwd

'/root/bioimage-search/datasets/bbbc-021/notebooks'

In [6]:
bioimsArtifactBucket='bioimage-search-output'
bbbc021Bucket='bioimagesearchbbbc021stack-bbbc021bucket544c3e64-10ecnwo51127'

In [7]:
# assumes cwd=/root/bioimage-search/datasets/bbbc-021/notebooks
sys.path.insert(0, "../../../cli/bioims/src")
import bioims as bi

In [8]:
sys.path.insert(0, "../scripts")
import bbbc021common as bb

In [9]:
sagemaker_session = sagemaker.Session()
bucket = sagemaker_session.default_bucket()

In [10]:
bucket

'sagemaker-us-east-1-580829821648'

Get ImageID->(compound, concentration) maps

In [11]:
image_df, moa_df = bb.Bbbc021PlateInfoByDF.getDataFrames(bbbc021Bucket)
compound_moa_map = bb.Bbbc021PlateInfoByDF.getCompoundMoaMapFromDf(moa_df)

sourceCompoundMap={}
sourceConcentrationMap={}
compoundCountMap={}
moaCountMap={}
for i in range(len(image_df.index)):
    r = image_df.iloc[i]
    imageSourceId = r['Image_FileName_DAPI'][:-4]
    imageCompound=r['Image_Metadata_Compound']
    sourceCompoundMap[imageSourceId]=imageCompound
    sourceConcentrationMap[imageSourceId]=r['Image_Metadata_Concentration']
    if imageCompound not in compoundCountMap:
        compoundCountMap[imageCompound]=1
    else:
        compoundCountMap[imageCompound] = compoundCountMap[imageCompound] + 1
    if imageCompound in compound_moa_map:
        imageMoa=compound_moa_map[imageCompound]
        if imageMoa not in moaCountMap:
            moaCountMap[imageMoa]=1
        else:
            moaCountMap[imageMoa] = moaCountMap[imageMoa] + 1

In [12]:
compoundCountMap

{'5-fluorouracil': 96,
 'acyclovir': 96,
 'AG-1478': 192,
 'ALLN': 96,
 'aloisine A': 96,
 'alsterpaullone': 64,
 'anisomycin': 96,
 'aphidicolin': 96,
 'arabinofuranosylcytosine': 96,
 'atropine': 96,
 'bleomycin': 96,
 'bohemine': 64,
 'brefeldin A': 96,
 'bryostatin': 64,
 'calpain inhibitor 2 (ALLM)': 96,
 'calpeptin': 64,
 'camptothecin': 96,
 'carboplatin': 96,
 'caspase inhibitor 1 (ZVAD)': 96,
 'cathepsin inhibitor I': 96,
 'Cdk1 inhibitor III': 96,
 'Cdk1/2 inhibitor (NU6102)': 96,
 'chlorambucil': 96,
 'chloramphenicol': 64,
 'cisplatin': 96,
 'colchicine': 96,
 'cyclohexamide': 96,
 'cyclophosphamide': 64,
 'cytochalasin B': 96,
 'cytochalasin D': 96,
 'demecolcine': 96,
 'deoxymannojirimycin': 64,
 'deoxynojirimycin': 96,
 "3,3'-diaminobenzidine": 96,
 'docetaxel': 96,
 'doxorubicin': 96,
 'emetine': 96,
 'epothilone B': 96,
 'etoposide': 96,
 'filipin': 64,
 'floxuridine': 96,
 'forskolin': 96,
 'genistein': 96,
 'H-7': 96,
 'herbimycin A': 96,
 'hydroxyurea': 96,
 'ICI-18

In [13]:
moaCountMap

{'Protein degradation': 384,
 'Kinase inhibitors': 192,
 'Protein synthesis': 288,
 'DNA replication': 384,
 'DNA damage': 384,
 'Microtubule destabilizers': 384,
 'Actin disruptors': 288,
 'Microtubule stabilizers': 1608,
 'Cholesterol-lowering': 192,
 'Epithelial': 256,
 'Eg5 inhibitors': 192,
 'Aurora kinase inhibitors': 288,
 'DMSO': 1320}

In [14]:
embeddingClient = bi.client('embedding')

In [15]:
imageClient = bi.client('image-management')

In [16]:
trainingConfigurationClient = bi.client('training-configuration')

In [17]:
embeddingInfo = trainingConfigurationClient.getEmbeddingInfo(EMBEDDING_NAME)

In [18]:
plateList = imageClient.listCompatiblePlates(embeddingInfo['inputWidth'], embeddingInfo['inputHeight'], embeddingInfo['inputDepth'], embeddingInfo['inputChannels'])

In [19]:
trainList = trainingConfigurationClient.getEmbeddingTrainings(EMBEDDING_NAME)

In [20]:
trainList

[{'filterBucket': 'bioimage-search-input',
 'sagemakerJobName': 'bioims-2KrMFC136oXVYJ7YCpNfr6-mxcscf4W7NBPJkUjDWUpaM',
 'messageId': '4d3807c7-64ed-4fc6-b7ba-005a47a2d285',
 'filterKey': 'train-filter/bbbc021/ALLN-filter.txt',
 'trainId': '2KrMFC136oXVYJ7YCpNfr6',
 'embeddingName': 'bbbc021',
 'executeProcessPlate': 'false'},
 {'filterBucket': 'bioimage-search-input',
 'sagemakerJobName': 'bioims-42s7EYRjYWfW3Ly9gxPUzk-WhNXafEfArRztsJ55pZGVC',
 'messageId': 'fd36b765-65d9-4b6b-bb04-966a2cf8b9cf',
 'filterKey': 'train-filter/bbbc021/AZ-J-filter.txt',
 'trainId': '42s7EYRjYWfW3Ly9gxPUzk',
 'embeddingName': 'bbbc021',
 'executeProcessPlate': 'false'},
 {'filterBucket': 'bioimage-search-input',
 'sagemakerJobName': 'bioims-4CiT9BNcMV7ZftmY7YWArf-XfrEvWqzj42HSozCaJWwKM',
 'messageId': 'e75a176f-0590-4fde-b8ed-d3f769b0b809',
 'filterKey': 'train-filter/bbbc021/PP-2-filter.txt',
 'trainId': '4CiT9BNcMV7ZftmY7YWArf',
 'embeddingName': 'bbbc021',
 'executeProcessPlate': 'false'},
 {'filterBuck

In [21]:
compound_moa_map

{'PP-2': 'Epithelial',
 'emetine': 'Protein synthesis',
 'AZ258': 'Aurora kinase inhibitors',
 'cytochalasin B': 'Actin disruptors',
 'ALLN': 'Protein degradation',
 'mitoxantrone': 'DNA replication',
 'AZ-C': 'Eg5 inhibitors',
 'MG-132': 'Protein degradation',
 'AZ841': 'Aurora kinase inhibitors',
 'docetaxel': 'Microtubule stabilizers',
 'mitomycin C': 'DNA damage',
 'PD-169316': 'Kinase inhibitors',
 'proteasome inhibitor I': 'Protein degradation',
 'vincristine': 'Microtubule destabilizers',
 'AZ138': 'Eg5 inhibitors',
 'demecolcine': 'Microtubule destabilizers',
 'mevinolin/lovastatin': 'Cholesterol-lowering',
 'AZ-A': 'Aurora kinase inhibitors',
 'alsterpaullone': 'Kinase inhibitors',
 'etoposide': 'DNA damage',
 'floxuridine': 'DNA replication',
 'AZ-U': 'Epithelial',
 'simvastatin': 'Cholesterol-lowering',
 'anisomycin': 'Protein synthesis',
 'nocodazole': 'Microtubule destabilizers',
 'AZ-J': 'Epithelial',
 'taxol': 'Microtubule stabilizers',
 'camptothecin': 'DNA replication'

In [22]:
def getCompoundLabel(compound):
    # Remove all whitespace, then replace '/' so the label is safe in keys and paths
    cnws ="".join(compound.split())
    return cnws.replace('/','-')
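The label normalization can be checked against names that appear in this notebook's own outputs. The function is repeated here only to keep the example self-contained.

```python
def getCompoundLabel(compound):
    # Remove all whitespace, then replace '/' with '-'
    cnws = "".join(compound.split())
    return cnws.replace('/', '-')

examples = ["cytochalasin B", "mevinolin/lovastatin", "proteasome inhibitor I", "PP-2"]
labels = [getCompoundLabel(name) for name in examples]
print(labels)
```

Names without whitespace or slashes, such as `PP-2`, pass through unchanged.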

In [23]:
label_moa_map = {}
labelCountMap = {}
for c, m in compound_moa_map.items():
    label = getCompoundLabel(c)
    label_moa_map[label] = m
    labelCountMap[label]=compoundCountMap[c]

In [24]:
label_moa_map

{'PP-2': 'Epithelial',
 'emetine': 'Protein synthesis',
 'AZ258': 'Aurora kinase inhibitors',
 'cytochalasinB': 'Actin disruptors',
 'ALLN': 'Protein degradation',
 'mitoxantrone': 'DNA replication',
 'AZ-C': 'Eg5 inhibitors',
 'MG-132': 'Protein degradation',
 'AZ841': 'Aurora kinase inhibitors',
 'docetaxel': 'Microtubule stabilizers',
 'mitomycinC': 'DNA damage',
 'PD-169316': 'Kinase inhibitors',
 'proteasomeinhibitorI': 'Protein degradation',
 'vincristine': 'Microtubule destabilizers',
 'AZ138': 'Eg5 inhibitors',
 'demecolcine': 'Microtubule destabilizers',
 'mevinolin-lovastatin': 'Cholesterol-lowering',
 'AZ-A': 'Aurora kinase inhibitors',
 'alsterpaullone': 'Kinase inhibitors',
 'etoposide': 'DNA damage',
 'floxuridine': 'DNA replication',
 'AZ-U': 'Epithelial',
 'simvastatin': 'Cholesterol-lowering',
 'anisomycin': 'Protein synthesis',
 'nocodazole': 'Microtubule destabilizers',
 'AZ-J': 'Epithelial',
 'taxol': 'Microtubule stabilizers',
 'camptothecin': 'DNA replication',
 '

In [25]:
train_compoundLabel_map = {}

In [26]:
for trainInfo in trainList:
    if 'filterKey' in trainInfo and len(trainInfo['filterKey'])>0:
        filterKey = trainInfo['filterKey']
        print(filterKey)
        a1=filterKey.split('/')
        print(a1)
        a2=a1[2].split("-filter")
        print(a2)
        trainId = trainInfo['trainId']
        print(trainId)
        train_compoundLabel_map[trainId]=a2[0]

train-filter/bbbc021/ALLN-filter.txt
['train-filter', 'bbbc021', 'ALLN-filter.txt']
['ALLN', '.txt']
2KrMFC136oXVYJ7YCpNfr6
train-filter/bbbc021/AZ-J-filter.txt
['train-filter', 'bbbc021', 'AZ-J-filter.txt']
['AZ-J', '.txt']
42s7EYRjYWfW3Ly9gxPUzk
train-filter/bbbc021/PP-2-filter.txt
['train-filter', 'bbbc021', 'PP-2-filter.txt']
['PP-2', '.txt']
4CiT9BNcMV7ZftmY7YWArf
train-filter/bbbc021/cytochalasinD-filter.txt
['train-filter', 'bbbc021', 'cytochalasinD-filter.txt']
['cytochalasinD', '.txt']
6he8YfdaT4eLrspDsnJyFe
train-filter/bbbc021/alsterpaullone-filter.txt
['train-filter', 'bbbc021', 'alsterpaullone-filter.txt']
['alsterpaullone', '.txt']
7HRSPABX4n2rAoMs5LVwD8
train-filter/bbbc021/simvastatin-filter.txt
['train-filter', 'bbbc021', 'simvastatin-filter.txt']
['simvastatin', '.txt']
7btuhRyHiQFqhp27Hyh5EW
train-filter/bbbc021/cytochalasinB-filter.txt
['train-filter', 'bbbc021', 'cytochalasinB-filter.txt']
['cytochalasinB', '.txt']
7fTJ1Qjq5RJmk2kPZZ3t8R
train-filter/bbbc021/doceta

In [27]:
train_compoundLabel_map

{'2KrMFC136oXVYJ7YCpNfr6': 'ALLN',
 '42s7EYRjYWfW3Ly9gxPUzk': 'AZ-J',
 '4CiT9BNcMV7ZftmY7YWArf': 'PP-2',
 '6he8YfdaT4eLrspDsnJyFe': 'cytochalasinD',
 '7HRSPABX4n2rAoMs5LVwD8': 'alsterpaullone',
 '7btuhRyHiQFqhp27Hyh5EW': 'simvastatin',
 '7fTJ1Qjq5RJmk2kPZZ3t8R': 'cytochalasinB',
 '8CDun7CBUmC4gAz6MjvPCp': 'docetaxel',
 '9JyEVkSyapQuPfttU3Zuy6': 'AZ-A',
 'a5TDDad6FBHpcz7uWw8nhx': 'epothiloneB',
 'c9aVZXAoCg74i8QsAMPEQP': 'cyclohexamide',
 'dcAeCFupHshz3dbeZiCCQk': 'mitomycinC',
 'g8DMAt72M4VUkoJTgF83n8': 'colchicine',
 'gR5FE1YpCTKRZUU9zRyRNK': 'chlorambucil',
 'gy4pEscXEMRA8ySocm2qY9': 'mitoxantrone',
 'htHgfiEJXvwq4p5SKzfqcX': 'floxuridine',
 'jknhGNHribQdQfmcXzSCTH': 'camptothecin',
 'kGLL5LP2RF2rvN1BgqVGYC': 'PD-169316',
 'n57tLbFxJoJuAkuMmiTeJt': 'etoposide',
 'odEByheKEgLzd7pxxwJuhK': 'demecolcine',
 'p5tzajaTAAJU4XkCFanwKX': 'AZ-C',
 'pbuH8k6n7X1f85wbZig2b1': 'MG-132',
 'pmKd7eDdjgfSgPHqm66cFM': 'lactacystin',
 'qZU9GJ77LRqA2TdEnr6nTz': 'mevinolin-lovastatin',
 'rE3D2myZgJ8hKnSCb

Check that the counts match. The DMSO control has no "left out" model, so it is excluded from the count:

In [28]:
len(train_compoundLabel_map)==len(compound_moa_map)-1

True

In [29]:
tagClient = bi.client("tag")

In [30]:
tagList = tagClient.getAllTags()

In [31]:
compoundLabel_tag_map = {}
for tag in tagList:
    # Avoid shadowing the builtins `id` and `type`
    tagId = tag['id']
    value = tag['tagValue']
    if (value.startswith('compound:')):
        a1 = value.split(":")
        compoundLabel_tag_map[a1[1]]=tagId

In [32]:
compoundLabel_tag_map

{'AZ-U': 18,
 'taxol': 51,
 'alsterpaullone': 26,
 'cyclohexamide': 33,
 'PP-2': 25,
 'camptothecin': 29,
 'floxuridine': 41,
 'PD-169316': 24,
 'demecolcine': 36,
 'anisomycin': 27,
 'mitoxantrone': 47,
 'cytochalasinB': 34,
 'simvastatin': 50,
 'AZ138': 19,
 'AZ258': 20,
 'bryostatin': 28,
 'latrunculinB': 43,
 'proteasomeinhibitorI': 49,
 'methotrexate': 44,
 'AZ-C': 16,
 'nocodazole': 48,
 'vincristine': 52,
 'docetaxel': 37,
 'colchicine': 32,
 'AZ841': 21,
 'MG-132': 23,
 'etoposide': 40,
 'lactacystin': 42,
 'AZ-A': 15,
 'DMSO': 22,
 'cytochalasinD': 35,
 'chlorambucil': 30,
 'epothiloneB': 39,
 'ALLN': 14,
 'emetine': 38,
 'mevinolin-lovastatin': 45,
 'mitomycinC': 46,
 'cisplatin': 31,
 'AZ-J': 17}

In [33]:
searchClient = bi.client("search")

We use the search service to build a histogram of how the search hits are distributed across MOAs, pooling the results for all images of a "left out" compound. The number of top hits counted per query image (the pick value, `testCount` in the code) can be surveyed across a range; in practice the outcome is remarkably insensitive to it.
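The pooling step can be sketched independently of the search service. In this illustration each query contributes a ranked list of hit compounds, and the top `k` of each list are binned by MOA. The function name `pooled_moa_histogram` and the toy hit lists are assumptions.

```python
from collections import Counter

def pooled_moa_histogram(ranked_hits_per_query, compound_to_moa, k=10):
    """Count the MOAs of the top-k hits, pooled over all query images."""
    counts = Counter()
    for hits in ranked_hits_per_query:
        for compound in hits[:k]:
            counts[compound_to_moa.get(compound, "unknown")] += 1
    return counts

compound_to_moa = {
    "MG-132": "Protein degradation",
    "etoposide": "DNA damage",
    "taxol": "Microtubule stabilizers",
}
# Hypothetical ranked hits for two query images.
ranked_hits = [
    ["MG-132", "MG-132", "etoposide"],
    ["MG-132", "taxol", "taxol"],
]
histogram = pooled_moa_histogram(ranked_hits, compound_to_moa, k=2)
print(histogram)
```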

In [34]:
def getMoaHistogram(trainId, leftOutCompoundLabel=''):
    testSequence = []
    # Uncomment to observe the invariance of this parameter
    # for j in range(1,31):
    #     testSequence.append(j)
    testSequence.append(10)
    print("***")
    print(trainId)
    if leftOutCompoundLabel == '':
        leftOutCompoundLabel=train_compoundLabel_map[trainId]
    print(leftOutCompoundLabel)
    leftOutMoa = label_moa_map[leftOutCompoundLabel]
    print(leftOutMoa)
    print("===")
    imageInfoMap={}
    dmsoTag = compoundLabel_tag_map['DMSO']
    searchPlateMap = {}
    searchCount=0
    imageListPlateMap={}
    for plate in plateList:
        plateId = plate['plateId']
        #print("plate {}".format(plateId))
        images = imageClient.getImagesByPlateId(plateId)
        imageListPlateMap[plateId] = images
    print("Start search")
    for plate in plateList:
        plateId = plate['plateId']
        images = imageListPlateMap[plateId]
        searchResponses = []
        for image in images:
            imageSourceId = image['Item']['imageSourceId']
            imageId = image['Item']['imageId']
            compound = sourceCompoundMap[imageSourceId]
            compoundLabel = getCompoundLabel(compound)
            concentration = sourceConcentrationMap[imageSourceId]
            if compoundLabel==leftOutCompoundLabel:
                #print("{} {} {} {}".format(imageId, compound, compoundLabel, concentration))
                exclusionTags = []
                tag = compoundLabel_tag_map[compoundLabel]
                exclusionTags.append(tag)
                exclusionTags.append(dmsoTag)
                search = {
                    "trainId" : trainId,
                    "queryImageId" : imageId,
                    "exclusionTags" : exclusionTags,
                    "requireMoa" : "true",
                    "metric" : "Cosine"
                }
                #print(search)
                searchResponse = searchClient.submitSearch(search)
                searchCount += 1
                searchResponses.append(searchResponse)
        searchPlateMap[plateId] = searchResponses
    searchResultsMap={}
    resultCount=0
    for plate in plateList:
        plateId = plate['plateId']
        searchResponses = searchPlateMap[plateId]
        for searchResponse in searchResponses:
            searchId = searchResponse['searchId']
            statusValue = 'submitted'
            while statusValue != 'completed' and statusValue != 'error':
                sleep(1)
                searchStatus = searchClient.getSearchStatus(searchId)
                statusValue = searchStatus['Item']['status']
            if statusValue == 'completed':
                searchResults = searchClient.getSearchResults(searchId)
                if plateId not in searchResultsMap:
                    searchResultsMap[plateId] = []
                searchResultsMap[plateId].append(searchResults)
                resultCount += 1
    # Note, these values will not always match because not all images have
    # qualified ROIs from which an embedding can be calculated to serve
    # as a query.
    print("searchCount={} resultCount={}".format(searchCount, resultCount))
    for testCount in testSequence:
        moaBinCounts = {}
        hitCount=0
        binCount=0
        for plate in plateList:
            plateId = plate['plateId']
            if plateId in searchResultsMap:
                searchResultsList = searchResultsMap[plateId]
                for searchResults in searchResultsList:
                    for i in range(testCount):
                        hitCount += 1
                        searchResult = searchResults[i]
                        hitImageId = searchResult['imageId']
                        if hitImageId not in imageInfoMap:
                            imageInfo = imageClient.getImageInfo(hitImageId, 'origin')
                            imageInfoMap[hitImageId]=imageInfo
                        imageInfo=imageInfoMap[hitImageId]
                        imageSourceId = imageInfo['Item']['imageSourceId']
                        hitCompound = sourceCompoundMap[imageSourceId]
                        if hitCompound in compound_moa_map:
                            moa = compound_moa_map[hitCompound]
                        else:
                            moa = "unknown"
                        if moa in moaBinCounts:
                            c = moaBinCounts[moa]
                            c += 1
                            binCount += 1
                            moaBinCounts[moa] = c
                        else:
                            binCount += 1
                            moaBinCounts[moa] = 1
        print("hitCount={} binCount={}".format(hitCount, binCount))
        labelCount = labelCountMap[leftOutCompoundLabel]
        labelMoaCount = moaCountMap[leftOutMoa]
        adjustedLabelMoaCount = labelMoaCount - labelCount
        bestMoa=''
        bestScore=0.0
        for moa in moaBinCounts:
            c = moaBinCounts[moa]
            m = moaCountMap[moa]
            if moa == leftOutMoa:
                n = c / adjustedLabelMoaCount
            else:
                n = c / m
            if n > bestScore:
                bestMoa=moa
                bestScore=n
            elif n == bestScore and moa==leftOutMoa:
                bestMoa=moa
                bestScore=n
        for moa in moaBinCounts:
            c = moaBinCounts[moa]
            m = moaCountMap[moa]
            if moa == leftOutMoa:
                n = c / adjustedLabelMoaCount
            else:
                n = c / m
            if moa==bestMoa:
                print("{}> {} {} {}".format(testCount, moa, c, n))
            else:
                print("{} {} {} {}".format(testCount, moa, c, n))
        # Comment out below if observing multiple parameter values
        if bestMoa==leftOutMoa:
            return 1
        else:
            return 0
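The scoring at the end of `getMoaHistogram` divides each MOA's hit count by the number of images available for that MOA, after subtracting the left-out compound's own images from its MOA. The sketch below reproduces that arithmetic with figures reported in this notebook for the ALLN run (96 ALLN images, 384 images per listed MOA). The helper name `normalized_scores` is an assumption, and the tie-break in favor of the left-out MOA is omitted for brevity.

```python
def normalized_scores(bin_counts, moa_image_counts, left_out_moa, left_out_image_count):
    """Divide each MOA's hit count by the number of images eligible as hits."""
    scores = {}
    for moa, hits in bin_counts.items():
        denominator = moa_image_counts[moa]
        if moa == left_out_moa:
            # The left-out compound's own images are excluded from the search
            denominator -= left_out_image_count
        scores[moa] = hits / denominator
    return scores

scores = normalized_scores(
    bin_counts={"Protein degradation": 826, "DNA damage": 24, "DNA replication": 60},
    moa_image_counts={"Protein degradation": 384, "DNA damage": 384, "DNA replication": 384},
    left_out_moa="Protein degradation",
    left_out_image_count=96,
)
best_moa = max(scores, key=scores.get)
print(best_moa, scores)
```

For ALLN this gives 826 / (384 - 96) for "Protein degradation", which matches the score printed for that run in the output below.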

In [35]:
trainIdList = []
for trainInfo in trainList:
    trainId = trainInfo['trainId']
    if trainId!='origin' and trainId!=BASELINE_TRAIN_ID:
        trainIdList.append(trainInfo['trainId'])
trainIdList.sort()

In [36]:
trainIdList

['2KrMFC136oXVYJ7YCpNfr6',
 '42s7EYRjYWfW3Ly9gxPUzk',
 '4CiT9BNcMV7ZftmY7YWArf',
 '6he8YfdaT4eLrspDsnJyFe',
 '7HRSPABX4n2rAoMs5LVwD8',
 '7btuhRyHiQFqhp27Hyh5EW',
 '7fTJ1Qjq5RJmk2kPZZ3t8R',
 '8CDun7CBUmC4gAz6MjvPCp',
 '9JyEVkSyapQuPfttU3Zuy6',
 'a5TDDad6FBHpcz7uWw8nhx',
 'c9aVZXAoCg74i8QsAMPEQP',
 'dcAeCFupHshz3dbeZiCCQk',
 'g8DMAt72M4VUkoJTgF83n8',
 'gR5FE1YpCTKRZUU9zRyRNK',
 'gy4pEscXEMRA8ySocm2qY9',
 'htHgfiEJXvwq4p5SKzfqcX',
 'jknhGNHribQdQfmcXzSCTH',
 'kGLL5LP2RF2rvN1BgqVGYC',
 'n57tLbFxJoJuAkuMmiTeJt',
 'odEByheKEgLzd7pxxwJuhK',
 'p5tzajaTAAJU4XkCFanwKX',
 'pbuH8k6n7X1f85wbZig2b1',
 'pmKd7eDdjgfSgPHqm66cFM',
 'qZU9GJ77LRqA2TdEnr6nTz',
 'rE3D2myZgJ8hKnSCb8aRMT',
 'rViqQ833enfkBrZESAAADn',
 'rwd7367RPqAr1LDWsnSE6b',
 'skLDWd41qhB5Wt4vYabdjB',
 't3cn8mXr3VouBDNh8zYHJb',
 't6LfEkNbinGSRvFThhiVwc',
 't7gCc4W89ZzqE1Fik6JobC',
 'tBZsAwLBr5tvowghSa2YYk',
 'taJrCuSEAJm5CSKLKNjjsk',
 'ttWJoVUuu4GNmSJu6g91Vh',
 'vUiyUAgFjpXgPMiCQ5pHia',
 'wyudxQaKEvpFzeF8VF5wCF',
 'xpvG7GXwz7iddSf6TteQr9',
 

In [37]:
j=0
correct=0
for trainId in trainIdList:
    print(j+1)
    correct += getMoaHistogram(trainId)
    j += 1
pc = correct/j
print("==")
print("Fraction of compounds with correct predicted MOA={}".format(pc))

1
***
2KrMFC136oXVYJ7YCpNfr6
ALLN
Protein degradation
===
Start search
searchCount=96 resultCount=96
hitCount=960 binCount=960
10> Protein degradation 826 2.8680555555555554
10 DNA damage 24 0.0625
10 Protein synthesis 7 0.024305555555555556
10 Microtubule stabilizers 38 0.0236318407960199
10 DNA replication 60 0.15625
10 Actin disruptors 1 0.003472222222222222
10 Epithelial 1 0.00390625
10 Cholesterol-lowering 3 0.015625
2
***
42s7EYRjYWfW3Ly9gxPUzk
AZ-J
Epithelial
===
Start search
searchCount=96 resultCount=96
hitCount=960 binCount=960
10> Epithelial 327 2.04375
10 Protein synthesis 86 0.2986111111111111
10 Microtubule stabilizers 37 0.023009950248756218
10 DNA damage 450 1.171875
10 Aurora kinase inhibitors 7 0.024305555555555556
10 Protein degradation 34 0.08854166666666667
10 Kinase inhibitors 4 0.020833333333333332
10 Actin disruptors 15 0.052083333333333336
3
***
4CiT9BNcMV7ZftmY7YWArf
PP-2
Epithelial
===
Start search
searchCount=64 resultCount=64
hitCount=640 binCount=640
10 DN