DeepLearning12 the DGX-1.5 Build
We are going to have more on the actual build later, but we are calling DeepLearning12 the DGX-1.5. That is because the system has a number of updates to the original DGX-1 including using a newer architecture.
- Barebones: Gigabyte G481-S80
- CPUs: Intel Xeon Gold 6136
- RAM: 12x 32GB Micron DDR4-2666
- Networking: 4x Mellanox ConnectX-4 EDR/100GbE adapters, 1x Broadcom 25GbE OCP, 1x Mellanox ConnectX-4 40GbE
- SSDs: 4x 960GB SATA SSDs
- NVMe SSDs: 4x 2TB NVMe SSDs
We are calling this a DGX-1.5 because DeepLearning12 is based on the Gigabyte G481-S80, an Intel Skylake-SP platform. Although Skylake was a step backward for low-cost single root PCIe servers, Skylake-SP has a number of benefits for the larger NVLink systems. These benefits include an increase in memory bandwidth of over 50%, a better IIO structure for PCIe access, and more PCIe lanes for additional NVMe and networking capabilities.
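For a rough sense of where the memory bandwidth figure comes from, here is a minimal sketch of the theoretical peak calculation. The baseline is our assumption, since it is not spelled out above: a Broadwell-EP socket with four channels of DDR4-2400, compared against a Skylake-SP socket with six channels of DDR4-2666 (one DIMM per channel, assuming the 12 DIMMs are split evenly across two sockets). The function name is hypothetical, and these are theoretical peaks, not measured results.

```python
# Theoretical peak DRAM bandwidth per CPU socket:
#   channels x transfer rate (MT/s) x 8 bytes per transfer (64-bit channel)

def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Return theoretical peak bandwidth in GB/s (decimal, 1 GB = 1000 MB)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000

# Assumed baseline (not stated in the article): Broadwell-EP, 4x DDR4-2400.
broadwell_gbs = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2400)

# Skylake-SP with six memory channels of DDR4-2666 per socket.
skylake_gbs = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)

# Relative improvement of Skylake-SP over the assumed baseline.
increase_pct = (skylake_gbs / broadwell_gbs - 1) * 100

print(f"Broadwell-EP: {broadwell_gbs:.1f} GB/s per socket")
print(f"Skylake-SP:   {skylake_gbs:.1f} GB/s per socket")
print(f"Increase:     {increase_pct:.0f}%")
```

Under these assumptions the theoretical gain works out to roughly two-thirds per socket, which is consistent with the "over 50%" figure. Real-world gains depend on DIMM population and workload.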
We started DeepLearning12 from a barebones box. Here is the (heavy) Gigabyte G481-S80 as it arrived in a small pallet box in the data center.
From here, we had to build the system ourselves.
As you can see, we had eight SXM2 NVIDIA Tesla P100s ready for installation along with a lot of extra gear. This is ~$65,000 worth of gear sitting on the table at Element Critical in Sunnyvale, California, waiting for installation.
Just as a note here: while the installation was successful, I later spoke to Rob Ober, Tesla Chief Platform Architect at NVIDIA, at Hot Chips 30. He was surprised that we had successfully installed SXM2 GPUs, saying that he had only heard of 1-2 other people who had done it themselves. Apparently, big OEMs have jigs that they use to ensure that they do not break pins. We did not have a jig.
NVIDIA Tesla SXM2 GPU Installation in DeepLearning12
We made a video of the GPU installation process. Everything else was easy to install. The GPU installation took several days.
Since Element Critical is home to a number of deep learning/AI clusters in Silicon Valley, we had an interesting experience. Someone who works on the large clusters walked by and asked us, “you aren’t using that screwdriver are you?” The ensuing discussion was immensely valuable. Over-torquing the screws, especially the heatsink screws, can damage or crack the GPU. We have heard stories even from major OEMs that the threshold is somewhere around 15%.
As part of the process, we bought an expensive digital torque screwdriver from Amazon, the CheckLine TSD-50, which had the right tolerances, calibration, and accuracy. It allowed us to successfully install all eight GPUs, and all of them worked the first time.
We are going to have more on DeepLearning12 and the Gigabyte G481-S80 in the coming weeks. Thus far the system has been performing flawlessly. We are still working on what our exact CPU recommendation will be. Stay tuned for more.