INFO|2019-02-13T22:59:36|/opt/launcher.py|48| Launcher started. INFO|2019-02-13T22:59:36|/opt/launcher.py|73| Command to run: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --local_parameter_device=gpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-cnn-training-ps-0.kubeflow.svc:2222 --worker_hosts=tf-cnn-training-worker-0.kubeflow.svc:2222 --task_index=0 INFO|2019-02-13T22:59:36|/opt/launcher.py|15| Running python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --local_parameter_device=gpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-cnn-training-ps-0.kubeflow.svc:2222 --worker_hosts=tf-cnn-training-worker-0.kubeflow.svc:2222 --task_index=0 INFO|2019-02-13T22:59:37|/opt/launcher.py|27| 2019-02-13 22:59:37.489630: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.177179: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.177921: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 0 with properties: INFO|2019-02-13T22:59:38|/opt/launcher.py|27| name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 INFO|2019-02-13T22:59:38|/opt/launcher.py|27| pciBusID: 0000:00:17.0 INFO|2019-02-13T22:59:38|/opt/launcher.py|27| totalMemory: 15.78GiB freeMemory: 15.35GiB INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.485360: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:900] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486253: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1064] Found device 1 with properties: INFO|2019-02-13T22:59:38|/opt/launcher.py|27| name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53 INFO|2019-02-13T22:59:38|/opt/launcher.py|27| pciBusID: 0000:00:19.0 INFO|2019-02-13T22:59:38|/opt/launcher.py|27| totalMemory: 15.78GiB freeMemory: 15.37GiB INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486294: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1079] Device peer to peer matrix INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486317: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1085] DMA: 0 1 INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 0: Y Y INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486331: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1095] 1: Y Y INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486352: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:17.0, compute capability: 7.0) INFO|2019-02-13T22:59:38|/opt/launcher.py|27| 2019-02-13 22:59:38.486360: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1154] Creating TensorFlow device (/device:GPU:1) -> (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:19.0, compute capability: 7.0) INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 2019-02-13 23:02:11.314241: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> tf-cnn-training-ps-0.kubeflow.svc:2222} INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 2019-02-13 23:02:11.314280: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222} INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 2019-02-13 23:02:11.317209: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:2222 INFO|2019-02-13T23:02:11|/opt/launcher.py|27| TensorFlow: 1.5 INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Model: resnet50 INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Mode: training INFO|2019-02-13T23:02:11|/opt/launcher.py|27| SingleSess: False INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Batch size: 64 global INFO|2019-02-13T23:02:11|/opt/launcher.py|27| 32 per device INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Devices: ['/job:worker/task:0/gpu:0', '/job:worker/task:0/gpu:1'] INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Data format: NHWC INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Optimizer: sgd INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Variables: parameter_server INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Sync: True INFO|2019-02-13T23:02:11|/opt/launcher.py|27| ========== INFO|2019-02-13T23:02:11|/opt/launcher.py|27| Generating model INFO|2019-02-13T23:02:12|/opt/launcher.py|27| WARNING:tensorflow:From /opt/tf-benchmarks/scripts/tf_cnn_benchmarks/convnet_builder.py:372: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version. INFO|2019-02-13T23:02:12|/opt/launcher.py|27| Instructions for updating: INFO|2019-02-13T23:02:12|/opt/launcher.py|27| keep_dims is deprecated, use keepdims instead INFO|2019-02-13T23:02:18|/opt/launcher.py|27| 2019-02-13 23:02:18.452487: I tensorflow/core/distributed_runtime/master_session.cc:1011] Start master session 5a288b32a61167f7 with config: intra_op_parallelism_threads: 1 gpu_options { force_gpu_compatible: true } allow_soft_placement: true INFO|2019-02-13T23:02:20|/opt/launcher.py|27| Running warm up INFO|2019-02-13T23:06:30|/opt/launcher.py|27| Done warm up INFO|2019-02-13T23:06:30|/opt/launcher.py|27| Step Img/sec loss INFO|2019-02-13T23:06:31|/opt/launcher.py|27| 1 images/sec: 217.5 +/- 0.0 (jitter = 0.0) 10.611 INFO|2019-02-13T23:06:33|/opt/launcher.py|27| 10 images/sec: 215.8 +/- 3.1 (jitter = 2.6) 8.493 INFO|2019-02-13T23:06:36|/opt/launcher.py|27| 20 images/sec: 217.1 +/- 1.6 (jitter = 3.0) 8.229 INFO|2019-02-13T23:06:39|/opt/launcher.py|27| 30 images/sec: 216.8 +/- 1.1 (jitter = 3.1) 8.105 INFO|2019-02-13T23:06:42|/opt/launcher.py|27| 40 images/sec: 216.8 +/- 0.9 (jitter = 3.1) 7.947 INFO|2019-02-13T23:06:45|/opt/launcher.py|27| 50 images/sec: 216.9 +/- 0.7 (jitter = 3.3) 7.916 INFO|2019-02-13T23:06:48|/opt/launcher.py|27| 60 images/sec: 217.8 +/- 0.7 (jitter = 3.5) 8.252 INFO|2019-02-13T23:06:51|/opt/launcher.py|27| 70 images/sec: 218.0 +/- 0.6 (jitter = 3.3) 7.880 INFO|2019-02-13T23:06:54|/opt/launcher.py|27| 80 images/sec: 217.8 +/- 0.6 (jitter = 3.2) 7.885 INFO|2019-02-13T23:06:57|/opt/launcher.py|27| 90 images/sec: 217.6 +/- 0.5 (jitter = 3.5) 7.823 INFO|2019-02-13T23:07:00|/opt/launcher.py|27| 100 images/sec: 217.7 +/- 0.5 (jitter = 3.3) 7.926 INFO|2019-02-13T23:07:00|/opt/launcher.py|27| ---------------------------------------------------------------- INFO|2019-02-13T23:07:00|/opt/launcher.py|27| total images/sec: 217.95 INFO|2019-02-13T23:07:00|/opt/launcher.py|27| ---------------------------------------------------------------- INFO|2019-02-13T23:07:01|/opt/launcher.py|80| Finished: python tf_cnn_benchmarks.py --batch_size=32 --model=resnet50 --variable_update=parameter_server --flush_stdout=true --num_gpus=2 --local_parameter_device=gpu --device=gpu --data_format=NHWC --job_name=worker --ps_hosts=tf-cnn-training-ps-0.kubeflow.svc:2222 --worker_hosts=tf-cnn-training-worker-0.kubeflow.svc:2222 --task_index=0 INFO|2019-02-13T23:07:01|/opt/launcher.py|84| Command ran successfully sleep for ever.