In this post, we'll build a GPU rig inspired by Lambda's deep learning workstation. Deep learning requires a ton of compute, so to train models effectively most people either rent GPUs in the cloud, on a service like AWS, or train on hardware they've purchased themselves.
I decided to build my own machine because I was spending a lot of money in the cloud and wanted to cut costs. Introducing... Warmachine! Warmachine is a very capable machine built to take on advanced A.I. tasks, from deep learning to reinforcement learning. It was built using consumer off-the-shelf hardware.
Check out Warmachine's parts on PCPartPicker with the link below.
In this article, I'm going to give tips on how to properly build an A.I. training machine. Towards the end, I’ll talk about the advantages and disadvantages of building your own rig vs using the cloud.
If you'd rather watch the video version, check this out...
Warmachine was built primarily to tackle deep learning and reinforcement learning problems. I wanted a machine with a healthy number of CPU cores and 4 GPUs so I can iterate quickly when training my machine learning models. My end goal is to eventually have something similar to a Lambda Quad machine, but without paying Lambda Quad prices.
Deep learning rigs require particular components so it was harder than usual to find reliable resources online on how to build one of these things. Let’s walk through all you need to know to build your own deep learning machine.
At the heart of training deep learning models is the GPU. GPUs are super fast at deep learning computations because, unlike CPUs, which have a small number of complex cores, GPUs have hundreds or thousands of simple cores that are super efficient at matrix multiplication. The most reliable brand of GPUs for deep learning is Nvidia. Most deep learning frameworks fully support Nvidia's CUDA SDK, which is a software library for interfacing with their GPUs.
When picking out a GPU, to get the most bang for your buck, you want something with tensor cores. Tensor cores are a type of processing core that performs specialized matrix math, enabling you to train models using half precision or mixed precision.
This allows more efficient use of GPU memory, which opens the door to bigger batch sizes, faster training, and bigger models. Tensor cores can be found in the Nvidia RTX GPU models. The memory your GPU needs depends on the type of models you plan on training.
If you only plan on training super small models for embedded devices, you can get away with a GPU with less memory. If you plan on training bigger models like GPT from the NLP domain, I would get as much memory as possible.
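To get a feel for how model size translates into GPU memory, here is a rough back-of-the-envelope sketch. It assumes an Adam-style optimizer, so each parameter is stored four times (weight, gradient, and two optimizer buffers), and it ignores activations entirely. The function name, the parameter counts, and the 11 GB card size are illustrative assumptions, and the half-precision figure is a best case, since real mixed-precision training keeps some values in 32 bits.

```python
def training_memory_gb(num_params, bytes_per_value=4, copies=4):
    """Rough memory for weights, gradients, and two Adam moment buffers.

    Ignores activations, which grow with batch size and are often the
    larger share, so treat the result as a lower bound.
    """
    return num_params * bytes_per_value * copies / 1e9

VRAM_GB = 11  # assumed card size, matching the RTX 2080 Ti in this build

# Approximate published parameter counts, used only for illustration.
models = [("GPT-2 small", 124e6), ("GPT-2 XL", 1.5e9)]

for name, params in models:
    fp32 = training_memory_gb(params, bytes_per_value=4)
    fp16 = training_memory_gb(params, bytes_per_value=2)  # best case only
    verdict = "fits" if fp32 < VRAM_GB else "does not fit"
    print(f"{name}: ~{fp32:.1f} GB in fp32 (~{fp16:.1f} GB if everything "
          f"were fp16), {verdict} in {VRAM_GB} GB at fp32")
```

Even this simplified estimate shows why bigger language models push you toward as much GPU memory as you can afford.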
Having more GPU memory opens the door to... you guessed it... bigger batch sizes, faster training, and bigger models. If you plan on doing a multi-GPU setup, you need to go with either blower-style fans or the more expensive option, liquid cooling.
You need blower-style fans because they are built to expel heat out of the case, which is necessary when you have multiple GPUs running. Without blower-style fans, your system can overheat and potentially damage your hardware.
For Warmachine, I went with an Nvidia RTX 2080 Ti Turbo from ASUS. It has 11 GB of VRAM and a blower-style fan for better heat management in a multi-GPU setup. I plan on buying 3 more GPUs down the road to complete my setup.
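If you want to keep an eye on heat once the rig is running, one option is to poll the nvidia-smi tool that ships with the Nvidia driver. The sketch below is a minimal example, not a full monitoring setup: the helper names, the sample output, and the 80 C warning threshold are assumptions made up for illustration.

```python
import subprocess

def parse_temperatures(csv_text):
    """Turn 'index, temperature' CSV lines into a {gpu_index: celsius} dict."""
    temps = {}
    for line in csv_text.strip().splitlines():
        index, temp = (field.strip() for field in line.split(","))
        temps[int(index)] = int(temp)
    return temps

def read_gpu_temperatures():
    """Query nvidia-smi; requires an Nvidia driver to be installed."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,temperature.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_temperatures(result.stdout)

# Example with made-up output from a two-GPU system.
sample = "0, 71\n1, 84\n"
WARN_AT_C = 80  # arbitrary threshold chosen for this sketch
for gpu, temp in parse_temperatures(sample).items():
    status = "hot" if temp >= WARN_AT_C else "ok"
    print(f"GPU {gpu}: {temp} C ({status})")
```

On a real machine you would call read_gpu_temperatures() instead of parsing the sample string.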
CPUs are mainly used for data loading in deep learning. More threads on a CPU means you can load more data in parallel to feed into your models for training. This is useful if you train with big batch sizes, so the GPU doesn't have to wait too long for the CPU to load data.
CPUs are important if you plan on doing reinforcement learning, because most of the computation happens in your learning environment, which most likely runs on the CPU. If you use large neural networks with reinforcement learning, then a GPU will definitely help speed up training.
If you only plan on doing deep learning, then make sure your CPU is compatible with however many GPUs you plan on having.
When choosing a CPU, ask yourself these questions:
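Here is a minimal sketch of what loading data in parallel looks like, using only Python's standard library. The load_sample function is a hypothetical stand-in for reading and decoding one example from disk, and using one worker per hardware thread is just a starting assumption. Deep learning frameworks typically expose the same idea as a worker-count setting on their data loaders.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def load_sample(index):
    """Stand-in for reading and decoding one training example from disk."""
    return index * index

def load_batch(indices, num_workers):
    """Load one batch with a pool of workers; results keep the input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(load_sample, indices))

# One worker per hardware thread is a starting point, not a tuned value.
num_workers = os.cpu_count() or 1
batch = load_batch(range(8), num_workers)
print(f"{num_workers} workers loaded batch: {batch}")
```

With more CPU threads available, more samples can be in flight at once, which is what keeps the GPU from sitting idle.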
- Do you plan on doing reinforcement learning? Then choose a high-end CPU that performs well on benchmarks if you want faster training.
- Do you only want to do deep learning? Then you can get away with a cheaper CPU, but more threads would help with data loading.
- Do you plan on having a multi-GPU setup? Then make sure your CPU supports the number of GPUs you want.
Warmachine will be used for both deep learning and reinforcement learning, so it's equipped with an Intel i9-10920X, which has 12 cores and 24 threads. It has a clock speed of up to 4.8 GHz and supports 4 GPUs. I went with Intel, but I've read a lot of success stories from people using AMD chips, which offer more bang for your buck. The i9-10920X is a very capable CPU for both deep learning and reinforcement learning, so it's perfect for my needs.
A big mistake is thinking you need the fastest RAM with a high clock rate. High RAM clock speed is a marketing gimmick, best explained by Linus Tech Tips. A higher clock rate yields negligible improvements during training, so your money is better spent elsewhere.
What's actually important is the amount of RAM. You should strive to have at least as much RAM as you have GPU memory. I went with 32 GB of Corsair RAM with a clock speed of 2666 MHz. When Warmachine is complete, I plan on maxing it out at 128 GB of RAM because I'm extra like that.
When choosing a motherboard, make sure it has enough PCIe slots for the number of GPUs you want. Also make sure the PCIe slots are spaced far enough apart to fit the GPUs. Generally, a GPU takes up the space of 2 PCIe slots. Finally, make sure the motherboard is compatible with your CPU and RAM.
Warmachine is equipped with an Asus WS X299 SAGE. This motherboard supports 4 GPUs. The only thing I wish it had is onboard Wi-Fi, but I connect with an Ethernet cable anyway, so it's not too big of a deal.
If you want to optimize your data loading speed, you'll need faster storage like a solid-state drive (SSD). SSDs are more expensive than standard hard drives, so it's useful to buy a smaller SSD for the OS and a standard hard disk drive as a second drive for long-term storage of your data and models.
When training, you can transfer the data of interest to your SSD for faster data loading. Warmachine is equipped with an NVMe Samsung 970 Evo with 1 TB of storage space for the operating system. For a second drive, it has a Seagate Exos Enterprise hard disk with 8 TB of storage.
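Moving a dataset onto the SSD before training can be as simple as a folder copy. The sketch below shows one way to do it with Python's standard library. The stage_dataset helper and the folder names are hypothetical, and the demo uses temporary folders in place of a real hard drive and SSD.

```python
import shutil
import tempfile
from pathlib import Path

def stage_dataset(source_dir, fast_dir):
    """Copy a dataset folder to faster storage unless it is already there."""
    source_dir, fast_dir = Path(source_dir), Path(fast_dir)
    target = fast_dir / source_dir.name
    if not target.exists():
        shutil.copytree(source_dir, target)
    return target

# Demo with temporary folders standing in for the hard drive and the SSD.
with tempfile.TemporaryDirectory() as hdd, tempfile.TemporaryDirectory() as ssd:
    dataset = Path(hdd) / "my_dataset"
    dataset.mkdir()
    (dataset / "sample.txt").write_text("example data")

    staged = stage_dataset(dataset, ssd)
    print("Training would read from:", staged)
    print("Files staged:", [p.name for p in staged.iterdir()])
```

Skipping the copy when the target already exists means repeated training runs don't pay the transfer cost again.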
For the power supply unit (PSU), you'll want something with enough wattage to handle your entire system. A good rule of thumb is to add up the watts required by the CPU and the GPUs, then multiply by 110%.
Make sure your PSU has enough PCIe connectors for your system. Warmachine is equipped with a Rosewill 1600-watt PSU. Even though I don't need all of that with my current setup, I will when it's complete, so I went ahead and purchased it for future-proofing.
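The wattage rule of thumb is easy to turn into a quick calculation. In the sketch below, the 165 W and 250 W figures are assumed power ratings for the i9-10920X and a single RTX 2080 Ti, so check the spec sheets for your exact parts. The function name is made up for illustration.

```python
def recommended_psu_watts(cpu_watts, gpu_watts, num_gpus, headroom=1.10):
    """Apply the rule of thumb: (CPU + all GPUs) * 110%."""
    return (cpu_watts + gpu_watts * num_gpus) * headroom

# Assumed power ratings; check the spec sheets for your exact parts.
CPU_WATTS = 165  # assumed TDP for the i9-10920X
GPU_WATTS = 250  # assumed board power for one RTX 2080 Ti

for gpus in (1, 4):
    watts = recommended_psu_watts(CPU_WATTS, GPU_WATTS, gpus)
    print(f"{gpus} GPU(s): at least {watts:.0f} W")
```

Under these assumptions, the full 4-GPU build stays below 1600 W, with room left over for drives, fans, and the motherboard, which the rule of thumb doesn't count.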
You'll for sure need a CPU cooler. To reduce fan noise, go with a water cooling setup. If you have the budget, you can also look at liquid cooling your GPUs. This would make for a super quiet system. If you want to stick to air cooling your GPUs, make sure you have blower-style cooling if you plan on having a multi-GPU setup. Warmachine is currently equipped with a Corsair H115i Pro, which is a CPU water cooler. For the GPU, it's just the stock blower-style cooler.
When you select a case, you can probably choose anything you want as long as your parts fit. I would recommend getting a case with good airflow, just because I'm extra cautious. I went with a Corsair Air 540 ATX mid-tower case for Warmachine. It has ample airflow, looks pretty cool, and it's the same case Lambda Labs uses, so why not.
Making Sure Your Parts Fit
Use PCPartPicker when building your rig. It has a feature that checks for part compatibility so you don't totally screw up your build. It's not perfect, though: it warned me that my parts wouldn't fit in my case, but everything fit perfectly. So use it as a guide while building, not as the final word.
Cloud vs Your Own Hardware
Now many of you may be curious, why build my own machine? Can’t I just use the cloud? Well yes, yes you can. But there are benefits to building your own machine, including long term cost savings.
Here are 3 reasons why building your own deep learning rig might be worth it.
1. Cost savings. If you're frequently using your GPU for training, building a machine can actually save you money long term! If you're renting a V100, that's around $3 an hour, or about $2,100 a month if it runs around the clock! You can build your own machine for that price... and keep it forever!
2. Your own hardware can actually be faster than the cloud. That's because the cloud suffers from slow I/O between instances and GPUs due to virtualization. Bizon-Tech ran an experiment comparing the cloud to your own hardware, and they found that cheaper consumer hardware came out ahead.
3. Having your own rig doubles as a productivity machine for whatever else you want to do with it. I can use it to play games at max settings, create cool videos for you all, and have more than 10 tabs open in Chrome!
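To put a number on the cost argument, here is a rough break-even sketch. It uses the $3 an hour V100 figure from the first point and the roughly $7k finished cost of Warmachine mentioned at the end of this post. It is a simplification: it compares a single rented GPU against the whole build and ignores electricity, so treat the result as a ballpark. The function name is made up for illustration.

```python
def breakeven_hours(build_cost, cloud_rate_per_hour):
    """Hours of cloud GPU rental that would cost as much as the build."""
    return build_cost / cloud_rate_per_hour

CLOUD_RATE = 3.0    # dollars per hour, the V100 figure quoted above
BUILD_COST = 7000   # dollars, the finished cost of the build in this post

hours = breakeven_hours(BUILD_COST, CLOUD_RATE)
print(f"Monthly cloud bill at 24/7 use: ${CLOUD_RATE * 24 * 30:,.0f}")
print(f"Break-even after {hours:,.0f} GPU-hours "
      f"(~{hours / 24:.0f} days of nonstop use)")
```

If you train less often, the break-even point moves further out, which is why this only pays off when you use your GPU frequently.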
Bonus point: I like not having to worry about how much money I'm going to spend on each training run. When I had to pay per training run, I was always hesitant to experiment because I knew it would cost me money. Having my own machine encouraged me to keep experimenting, which helped me perfect the art of deep learning faster than if I had used the cloud. There are free GPU-enabled options, like Google Colab or Kaggle kernels, but they limit how long you can train, which narrows the models and problems you can tackle. I still highly recommend them for most people starting out with deep learning, though.
So that's it! This is a short guide on how to build your own deep learning machine. If you want a more detailed guide, I recommend checking out Tim Dettmers' guide. When Warmachine is done, it will cost me about $7k, which is still cheaper than Lambda Labs' comparable machine, which costs $11k! For me, that's a big money saver and makes building my own machine totally worth it.