Use infiniband #291
Description
My current configuration runs into a problem. The executor log below ends with a fatal error in rdma.cpp ("Failed to register memory region"):
I1205 17:02:12.401198 7160 layer_factory.hpp:77] Creating layer data
I1205 17:02:12.401211 7160 net.cpp:99] Creating Layer data
I1205 17:02:12.401216 7160 net.cpp:407] data -> data
I1205 17:02:12.401224 7160 net.cpp:407] data -> label
I1205 17:02:12.401321 7160 net.cpp:149] Setting up data
I1205 17:02:12.401330 7160 net.cpp:156] Top shape: 100 1 28 28 (78400)
I1205 17:02:12.401335 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401337 7160 net.cpp:164] Memory required for data: 314000
I1205 17:02:12.401341 7160 layer_factory.hpp:77] Creating layer label_data_1_split
I1205 17:02:12.401347 7160 net.cpp:99] Creating Layer label_data_1_split
I1205 17:02:12.401351 7160 net.cpp:433] label_data_1_split <- label
I1205 17:02:12.401356 7160 net.cpp:407] label_data_1_split -> label_data_1_split_0
I1205 17:02:12.401362 7160 net.cpp:407] label_data_1_split -> label_data_1_split_1
I1205 17:02:12.401396 7160 net.cpp:149] Setting up label_data_1_split
I1205 17:02:12.401402 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401407 7160 net.cpp:156] Top shape: 100 (100)
I1205 17:02:12.401409 7160 net.cpp:164] Memory required for data: 314800
I1205 17:02:12.401412 7160 layer_factory.hpp:77] Creating layer conv1
I1205 17:02:12.401422 7160 net.cpp:99] Creating Layer conv1
I1205 17:02:12.401425 7160 net.cpp:433] conv1 <- data
I1205 17:02:12.401430 7160 net.cpp:407] conv1 -> conv1
I1205 17:02:12.402066 7160 net.cpp:149] Setting up conv1
I1205 17:02:12.402081 7160 net.cpp:156] Top shape: 100 20 24 24 (1152000)
I1205 17:02:12.402084 7160 net.cpp:164] Memory required for data: 4922800
I1205 17:02:12.402097 7160 layer_factory.hpp:77] Creating layer pool1
I1205 17:02:12.402107 7160 net.cpp:99] Creating Layer pool1
I1205 17:02:12.402110 7160 net.cpp:433] pool1 <- conv1
I1205 17:02:12.402115 7160 net.cpp:407] pool1 -> pool1
I1205 17:02:12.402153 7160 net.cpp:149] Setting up pool1
I1205 17:02:12.402161 7160 net.cpp:156] Top shape: 100 20 12 12 (288000)
I1205 17:02:12.402164 7160 net.cpp:164] Memory required for data: 6074800
I1205 17:02:12.402168 7160 layer_factory.hpp:77] Creating layer conv2
I1205 17:02:12.402176 7160 net.cpp:99] Creating Layer conv2
I1205 17:02:12.402180 7160 net.cpp:433] conv2 <- pool1
I1205 17:02:12.402186 7160 net.cpp:407] conv2 -> conv2
I1205 17:02:12.403599 7160 net.cpp:149] Setting up conv2
I1205 17:02:12.403615 7160 net.cpp:156] Top shape: 100 50 8 8 (320000)
I1205 17:02:12.403620 7160 net.cpp:164] Memory required for data: 7354800
I1205 17:02:12.403630 7160 layer_factory.hpp:77] Creating layer pool2
I1205 17:02:12.403637 7160 net.cpp:99] Creating Layer pool2
I1205 17:02:12.403641 7160 net.cpp:433] pool2 <- conv2
I1205 17:02:12.403647 7160 net.cpp:407] pool2 -> pool2
I1205 17:02:12.403690 7160 net.cpp:149] Setting up pool2
I1205 17:02:12.403698 7160 net.cpp:156] Top shape: 100 50 4 4 (80000)
I1205 17:02:12.403702 7160 net.cpp:164] Memory required for data: 7674800
I1205 17:02:12.403705 7160 layer_factory.hpp:77] Creating layer ip1
I1205 17:02:12.403713 7160 net.cpp:99] Creating Layer ip1
I1205 17:02:12.403717 7160 net.cpp:433] ip1 <- pool2
I1205 17:02:12.403723 7160 net.cpp:407] ip1 -> ip1
I1205 17:02:12.406860 7160 net.cpp:149] Setting up ip1
I1205 17:02:12.406877 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.406879 7160 net.cpp:164] Memory required for data: 7874800
I1205 17:02:12.406890 7160 layer_factory.hpp:77] Creating layer relu1
I1205 17:02:12.406898 7160 net.cpp:99] Creating Layer relu1
I1205 17:02:12.406901 7160 net.cpp:433] relu1 <- ip1
I1205 17:02:12.406909 7160 net.cpp:394] relu1 -> ip1 (in-place)
I1205 17:02:12.407634 7160 net.cpp:149] Setting up relu1
I1205 17:02:12.407649 7160 net.cpp:156] Top shape: 100 500 (50000)
I1205 17:02:12.407654 7160 net.cpp:164] Memory required for data: 8074800
I1205 17:02:12.407657 7160 layer_factory.hpp:77] Creating layer ip2
I1205 17:02:12.407667 7160 net.cpp:99] Creating Layer ip2
I1205 17:02:12.407672 7160 net.cpp:433] ip2 <- ip1
I1205 17:02:12.407680 7160 net.cpp:407] ip2 -> ip2
I1205 17:02:12.407815 7160 net.cpp:149] Setting up ip2
I1205 17:02:12.407825 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407829 7160 net.cpp:164] Memory required for data: 8078800
I1205 17:02:12.407835 7160 layer_factory.hpp:77] Creating layer ip2_ip2_0_split
I1205 17:02:12.407840 7160 net.cpp:99] Creating Layer ip2_ip2_0_split
I1205 17:02:12.407843 7160 net.cpp:433] ip2_ip2_0_split <- ip2
I1205 17:02:12.407848 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_0
I1205 17:02:12.407856 7160 net.cpp:407] ip2_ip2_0_split -> ip2_ip2_0_split_1
I1205 17:02:12.407891 7160 net.cpp:149] Setting up ip2_ip2_0_split
I1205 17:02:12.407898 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407902 7160 net.cpp:156] Top shape: 100 10 (1000)
I1205 17:02:12.407904 7160 net.cpp:164] Memory required for data: 8086800
I1205 17:02:12.407908 7160 layer_factory.hpp:77] Creating layer accuracy
I1205 17:02:12.407917 7160 net.cpp:99] Creating Layer accuracy
I1205 17:02:12.407920 7160 net.cpp:433] accuracy <- ip2_ip2_0_split_0
I1205 17:02:12.407924 7160 net.cpp:433] accuracy <- label_data_1_split_0
I1205 17:02:12.407930 7160 net.cpp:407] accuracy -> accuracy
I1205 17:02:12.407939 7160 net.cpp:149] Setting up accuracy
I1205 17:02:12.407944 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.407948 7160 net.cpp:164] Memory required for data: 8086804
I1205 17:02:12.407950 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.407954 7160 net.cpp:99] Creating Layer loss
I1205 17:02:12.407958 7160 net.cpp:433] loss <- ip2_ip2_0_split_1
I1205 17:02:12.407963 7160 net.cpp:433] loss <- label_data_1_split_1
I1205 17:02:12.407966 7160 net.cpp:407] loss -> loss
I1205 17:02:12.407972 7160 layer_factory.hpp:77] Creating layer loss
I1205 17:02:12.408217 7160 net.cpp:149] Setting up loss
I1205 17:02:12.408229 7160 net.cpp:156] Top shape: (1)
I1205 17:02:12.408233 7160 net.cpp:159] with loss weight 1
I1205 17:02:12.408239 7160 net.cpp:164] Memory required for data: 8086808
I1205 17:02:12.408243 7160 net.cpp:225] loss needs backward computation.
I1205 17:02:12.408248 7160 net.cpp:227] accuracy does not need backward computation.
I1205 17:02:12.408252 7160 net.cpp:225] ip2_ip2_0_split needs backward computation.
I1205 17:02:12.408255 7160 net.cpp:225] ip2 needs backward computation.
I1205 17:02:12.408258 7160 net.cpp:225] relu1 needs backward computation.
I1205 17:02:12.408262 7160 net.cpp:225] ip1 needs backward computation.
I1205 17:02:12.408263 7160 net.cpp:225] pool2 needs backward computation.
I1205 17:02:12.408267 7160 net.cpp:225] conv2 needs backward computation.
I1205 17:02:12.408270 7160 net.cpp:225] pool1 needs backward computation.
I1205 17:02:12.408272 7160 net.cpp:225] conv1 needs backward computation.
I1205 17:02:12.408277 7160 net.cpp:227] label_data_1_split does not need backward computation.
I1205 17:02:12.408279 7160 net.cpp:227] data does not need backward computation.
I1205 17:02:12.408282 7160 net.cpp:269] This network produces output accuracy
I1205 17:02:12.408288 7160 net.cpp:269] This network produces output loss
I1205 17:02:12.408299 7160 net.cpp:282] Network initialization done.
I1205 17:02:12.408339 7160 solver.cpp:60] Solver scaffolding done.
I1205 17:02:12.411540 7160 CaffeNet.cpp:240] RDMA adapter: mlx5_0
I1205 17:02:12.414819 7160 CaffeNet.cpp:388] 0-th RDMA addr: 01000000360100000899f800
I1205 17:02:12.414834 7160 CaffeNet.cpp:388] 1-th RDMA addr:
I1205 17:02:12.414849 7160 JniCaffeNet.cpp:145] 0-th local addr: 01000000360100000899f800
I1205 17:02:12.414856 7160 JniCaffeNet.cpp:145] 1-th local addr:
17/12/05 17:02:12 INFO executor.Executor: Finished task 1.0 in stage 2.0 (TID 5). 931 bytes result sent to driver
17/12/05 17:02:12 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 7
17/12/05 17:02:12 INFO executor.Executor: Running task 1.0 in stage 3.0 (TID 7)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 4
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 1565.0 B, free 18.9 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 4 took 14 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_4 stored as values in memory (estimated size 2.6 KB, free 21.4 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 105.0 B, free 21.5 KB)
17/12/05 17:02:12 INFO broadcast.TorrentBroadcast: Reading broadcast variable 3 took 11 ms
17/12/05 17:02:12 INFO storage.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 392.0 B, free 21.9 KB)
I1205 17:02:12.636529 7160 common.cpp:61] 1-th string is NULL
F1205 17:02:12.639581 7160 rdma.cpp:327] Check failed: self_ Failed to register memory region.
The InfiniBand information is as follows:
omnisky@slave1:~/zzh/mnist$ ibstat
CA 'mlx5_0'
    CA type: MT4115
    Number of ports: 1
    Firmware version: 12.21.1000
    Hardware version: 0
    Node GUID: 0xec0d9a0300397dc2
    System image GUID: 0xec0d9a0300397dc2
    Port 1:
        State: Down
        Physical state: Polling
        Rate: 10
        Base lid: 2
        LMC: 0
        SM lid: 2
        Capability mask: 0x2651e84a
        Port GUID: 0xec0d9a0300397dc2
        Link layer: InfiniBand
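Note that the port above reports State: Down and Physical state: Polling. As a rough sketch, the port state can be pulled out of this output with a few lines of Python. The function name `parse_ibstat_ports` and the `SAMPLE` string are hypothetical and only for illustration. The check assumes a usable port reports `State: Active` and `Physical state: LinkUp`. Whether the down port is related to the rdma.cpp failure is not established here.

```python
import re

# Hypothetical helper: collect the "key: value" lines under each "Port N:" block.
def parse_ibstat_ports(text):
    ports = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = re.match(r"^Port (\d+):$", line)
        if header:
            # Start of a new port block, e.g. "Port 1:"
            current = {"port": int(header.group(1))}
            ports.append(current)
        elif current is not None and ":" in line:
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    return ports

# Sample copied from the ibstat output in this issue (abbreviated).
SAMPLE = """CA 'mlx5_0'
CA type: MT4115
Number of ports: 1
Port 1:
State: Down
Physical state: Polling
Rate: 10
Link layer: InfiniBand
"""

ports = parse_ibstat_ports(SAMPLE)
for p in ports:
    # Assumption: an up port shows State "Active" and Physical state "LinkUp".
    usable = p.get("State") == "Active" and p.get("Physical state") == "LinkUp"
    print(f"port {p['port']}: state={p.get('State')}, "
          f"physical={p.get('Physical state')}, usable={usable}")
```

On a live node the same function could be fed the real output of `ibstat` instead of the pasted sample.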
I want to know how to use InfiniBand with Spark. Which configuration files do I need to modify, or do I need to change the InfiniBand configuration itself? Please help me.