<div class="bbWrapper"><blockquote data-attributes="member: 64387" data-quote="Martin.G" data-source="post: 893617"
class="bbCodeBlock bbCodeBlock--expandable bbCodeBlock--quote js-expandWatch">
<div class="bbCodeBlock-title">
<a href="/community/goto/post?id=893617"
class="bbCodeBlock-sourceJump"
rel="nofollow"
data-xf-click="attribution"
data-content-selector="#post-893617">Martin.G said:</a>
</div>
<div class="bbCodeBlock-content">
<div class="bbCodeBlock-expandContent js-expandContent ">
Maybe someone can help me with this issue.<br />
<br />
I have been playing with a DQN (Rainbow) and NES games on an old computer (i5 4670, 8 GB of RAM and a GTX 960 with 2 GB) because it's the only one I have with a graphics card. I don't expect much, but at least it's faster than my Apple computer for ML tasks.<br />
<br />
Lately I've been thinking about making an upgrade, and I'm trying to figure out what to upgrade first: CPU, GPU, or RAM. I noticed that my CPU is at 100% while my GPU sits around 50-60%. I did some research and found that a DQN mainly uses the CPU, because the only part of the code that uses the GPU is the NN, and since it's small it's not going to benefit from a better GPU.<br />
<br />
While the replay memory is filling, the CPU is at around 50%, but once it starts sending batches to the NN it jumps to 100%. Also, the FPS drops from 180 to 30.<br />
<br />
So if I buy a new CPU (I've been thinking of a 3800X with 16 GB) I am going to see some improvement, but later, when I upgrade the GPU, it's going to be the same story. I mean, I can spend money, but in the end the improvement is not going to be that great.<br />
<br />
Am I missing something? At first I thought that maybe the problem was a lack of RAM, because the OS has to page memory, but the same thing happens with a smaller replay memory. Another thing I thought it could be is VRAM, but at least TensorFlow should not be using system RAM (the code is in PyTorch, but I think it's much the same).<br />
<br />
I wonder if, with better PC parts, I will see a significant improvement (several orders of magnitude) because something (like RAM or VRAM) is making the CPU work harder than it should, or whether the CPU is always going to be the real bottleneck and, unless I buy a Threadripper, I am not going to see a real improvement.
</div>
<div class="bbCodeBlock-expandLink js-expandLink"><a role="button" tabindex="0">Click to expand...</a></div>
</div>
</blockquote><br />
I'm not an expert, but I do run a fairly beefy system: 2x RTX 2080 Ti, an i7 7820X, and 96 GB of RAM. I run a YouTube channel and teach some Udemy courses on reinforcement learning. Here's what I've found.<br />
<br />
You have a couple of things going on: the environment simulation runs on the CPU, and the data is then shuttled to the GPU for the network calculations. The slowest piece of hardware is generally going to be the limiting factor, and I'll tell you how to figure out which it is in a minute.<br />
<br />
Memory:<br />
<br />
The replay memory lives in your RAM, while the neural network is going to live in your VRAM.<br />
<br />
If you're doing convolutions, memory use is going to scale with the number of filters (and the size of the screen images you're trying to process), so VRAM usage goes up quickly as the number of CNN layers, the size of the screen images, or the number of filters increases.<br />
<br />
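To give a rough sense of the scaling, here is a quick back-of-the-envelope estimate of the activation memory for a single conv layer; the shapes are assumptions (roughly Atari-sized, stacked frames), not anything from your code:<br />
<br />
# rough sketch with assumed shapes: a batch of stacked 84x84 frames through one<br />
# conv layer with 32 filters, stride 1 and 'same' padding, activations in float32<br />
batch, channels, height, width, filters = 32, 4, 84, 84, 32<br />
bytes_per_float32 = 4<br />
input_mb = batch * channels * height * width * bytes_per_float32 / 1e6    # ~3.6 MB<br />
output_mb = batch * filters * height * width * bytes_per_float32 / 1e6    # ~28.9 MB<br />
print(input_mb, output_mb)<br />
<br />
Double the filters or the frame size and those numbers grow right along with them, before you even count the weights, gradients and optimizer state.<br />
<br />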
You can mitigate the RAM usage, to some extent, by using numpy arrays and carefully choosing the data types you use for your replay memory. You don't need int64 to store actions, for instance. You probably don't need fp64 to store your screen images, either (assuming you're normalizing between 0 and 1). Your terminal flags can be stored as boolean, and this makes setting the Q for the terminal states trivial:<br />
q_values_for_next_states[dones] = 0.0<br />
<br />
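For example, here is a minimal sketch of how the replay memory arrays could be allocated; the capacity, frame size, and variable names are all assumptions, not your actual code:<br />
<br />
import numpy as np<br />
<br />
# minimal sketch: assumed capacity of 100k transitions and 84x84 grayscale frames;<br />
# store raw uint8 pixels and only convert to float when you sample a batch<br />
mem_size = 100_000<br />
states     = np.zeros((mem_size, 84, 84), dtype=np.uint8)   # not fp64<br />
new_states = np.zeros((mem_size, 84, 84), dtype=np.uint8)<br />
actions    = np.zeros(mem_size, dtype=np.int8)              # not int64<br />
rewards    = np.zeros(mem_size, dtype=np.float32)<br />
dones      = np.zeros(mem_size, dtype=np.bool_)             # boolean terminal flags<br />
<br />
# later, for a sampled batch of indices `batch`:<br />
# q_values_for_next_states[dones[batch]] = 0.0<br />
<br />
Stored like that, 100k transitions of 84x84 frames come to roughly 1.4 GB; the same buffer in float64 would be around 11 GB.<br />
<br />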
Processor:<br />
I haven't implemented Rainbow, but if it is like regular DQN in the sense that it runs a single environment at a time, then increasing the number of threads isn't going to help. Your performance will be dominated by the single-threaded performance of your processor. Intel is generally ahead in that regard, though I would argue that the better multithreaded performance of AMD (in nearly any other task) is going to be an improvement in general system usage. I wouldn't go with a Threadripper. I would go with the fastest Ryzen I could afford, as 12-16 threads is usually going to be enough and you'll have some thermal headroom, so your system won't heat up your entire room (my office is easily 5 degrees hotter than anywhere else in the house).<br />
<br />
GPU:<br />
You're correct that the neural network sizes are pretty small and don't generally max out your GPU. This changes if you're using convolutions on large batches of buffered screen images, but for a simple neural net it's going to be pretty easy on the hardware.<br />
<br />
I didn't see significant changes in run times (maybe a 15-20% improvement) going from a 1080 Ti to a 2080 Ti, and run time is a better metric than GPU utilization. I saw a much bigger improvement going from a GTX 780 to my 1080 Ti. In hindsight, I wish I had bought two, given the run-up in prices after the crypto mania.<br />
<br />
The biggest benefit you can get is to run two reasonably fast GPUs, each running a different set of agent hyperparameters. This lets you double the speed of your search of the parameter space, and will ultimately save you more time than a single top-tier GPU. You can even run different algorithms (e.g. Rainbow and D3QN) to compare performance.<br />
<br />
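A minimal sketch of what that looks like in PyTorch; the network here is a stand-in, not your actual model, and the second run would simply use 'cuda:1':<br />
<br />
import torch<br />
<br />
# pin everything in this run to its own GPU; a second copy of the script with<br />
# different hyperparameters does the same on 'cuda:1'<br />
device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')<br />
q_network = torch.nn.Linear(4, 2).to(device)    # stand-in for your real network<br />
states = torch.zeros(32, 4, device=device)      # sampled batches go to the same device<br />
q_values = q_network(states)<br />
<br />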
Power supply: not something you mentioned, but this is the single most critical component of the system. If you don't supply adequate current on the 12V rails you risk frying your hardware. Buy Corsair. They make high end power supplies at a reasonable price point. I would also advise paying extra for the fully modular ones, as they make cable management a breeze (important for thermal management).<br />
<br />
Cooling: also quite critical. Modern CPUs and GPUs are set up to run at the maximum frequency their thermal management will allow, so better cooling means better performance. Pay the extra money for an all-in-one liquid CPU cooler and set up a regular schedule for cleaning out the dust bunnies. Make sure to set the CPU cooler fans up to exhaust the hot air out of the case.<br />
<br />
So how do you know what the limiting factor in your rig is?<br />
<br />
You're setting up your code something like this:<br />
<br />
done = False<br />
while not done:<br />
    action = choose_action(state)<br />
    new_state, reward, done, info = env.step(action)<br />
    store_transition_in_memory(state, new_state, reward, action, done)<br />
    state = new_state<br />
    call_learning_function_for_agent()<br />
print_episode_debug_info_to_terminal()<br />
<br />
You can figure out where the bottleneck is by adding in:<br />
<br />
import time<br />
inference_time = 0<br />
train_time = 0<br />
env_time = 0<br />
<br />
done = False<br />
while not done:<br />
    inference_start = time.time()<br />
    action = choose_action(state)<br />
    inference_time += time.time() - inference_start<br />
    env_start = time.time()<br />
    new_state, reward, done, info = env.step(action)<br />
    env_time += time.time() - env_start<br />
    store_transition_in_memory(state, new_state, reward, action, done)<br />
    state = new_state<br />
    train_start = time.time()<br />
    call_learning_function_for_agent()<br />
    train_time += time.time() - train_start<br />
print_episode_debug_info_to_terminal()<br />
print_times_to_terminal_after_1000_episodes()<br />
<br />
This will give you some clue as to where the bottleneck is. If the sum of the inference and train times is greater than the env time, then your GPU is taking up most of the execution time and a GPU upgrade could benefit you (up to a point).<br />
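<br />
One caveat, assuming choose_action() and the learning step run on the GPU through PyTorch: CUDA kernels launch asynchronously, so time.time() taken right after the call can under-count the GPU work. A small sketch of a sync-aware timer you could wrap around those two calls:<br />
<br />
import time<br />
import torch<br />
<br />
def timed(fn, *args):<br />
    # run fn(*args) and return (result, elapsed seconds); synchronizing before and<br />
    # after makes sure pending CUDA work is actually counted in the measurement<br />
    if torch.cuda.is_available():<br />
        torch.cuda.synchronize()<br />
    start = time.perf_counter()<br />
    result = fn(*args)<br />
    if torch.cuda.is_available():<br />
        torch.cuda.synchronize()<br />
    return result, time.perf_counter() - start<br />
<br />
Usage would be something like action, dt = timed(choose_action, state) followed by inference_time += dt.<br />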
<br />
Hope that helps!</div>