
[Advance/NeuralMD Pro] Benchmarks of GPU Acceleration#

We benchmarked the calculation speed of Advance/NeuralMD Pro, which supports GPUs. In this article, we run neural network (NN) training and a molecular dynamics (MD) calculation on a GPU and compare them with CPU-only runs. Please use the results as a reference for estimating calculation time and choosing the degree of parallelism.

Computer Environment#

  • CPU: Intel Xeon Silver 4310, 12 cores 2.1 GHz x 2 (24 cores in total)
  • Memory: 8 GB DDR4-3200 ECC RDIMM x 8 (64 GB in total)
  • GPU: NVIDIA A30 x 1
  • OS: AlmaLinux 8.6

Benchmark 1 : Training of Neural Network#

We trained an NN using Advance/NeuralMD Pro. As training data, we used 4681 structures of the lithium-ion conductor Li7La3Zr2O12 (LLZO) primitive cell (96 atoms). The conditions of the NN potential (NNP) were as follows:

  • Symmetry Function: Chebyshev polynomial
  • Radial Components: 50
  • Angular Components: 30
  • Cut-off Radius: 6.0 Å
  • Neural Network: 2 layers x 40 nodes with twisted tanh activation
  • Optimization Algorithm: L-BFGS
  • Number of Epochs: 100

On the CPU side, hybrid MPI/OpenMP parallelism is used, and the calculation was run with various MPI parallel counts. The OpenMP thread count was adjusted so that MPI parallel count x OpenMP thread count = 12 or 24. On the GPU side, the CUDA thread count per block was varied in the range 128 – 1024.
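As a concrete illustration of this setup, the sketch below generates the kind of hybrid launch commands used here; the executable name `neuralmd` and the exact launcher flags are placeholders, not the actual Advance/NeuralMD Pro command line.

```python
# Sketch: enumerate hybrid MPI/OpenMP configurations so that
# MPI processes x OpenMP threads equals the total core count (12 or 24).
# The executable name "neuralmd" is a placeholder, not the real command.

def launch_commands(total_cores, mpi_counts):
    cmds = []
    for n_mpi in mpi_counts:
        if total_cores % n_mpi != 0:
            continue  # skip combinations that do not divide the cores evenly
        n_omp = total_cores // n_mpi
        cmds.append(f"OMP_NUM_THREADS={n_omp} mpirun -np {n_mpi} neuralmd")
    return cmds

if __name__ == "__main__":
    for cmd in launch_commands(24, [1, 2, 3, 4, 6, 12, 24]):
        print(cmd)
```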

The evaluation indices used in the results below are listed in the following table:

sf Time required to calculate the symmetry functions
nn Average time per epoch required to train the NN
walltime Total time required for the whole calculation

Results using CPU only#

When only the CPU is used (no GPU), flat MPI parallelism is applied without OpenMP. The results for 1 CPU (MPI parallel: 12) and 2 CPUs (MPI parallel: 24) are shown in the following table.

Total CPU Core MPI Parallel OpenMP Parallel sf / s nn / s walltime / s
12 12 1 29.7 3.94 506
24 24 1 17.2 3.05 379

Results using GPU#

The calculation times were measured while varying the CUDA thread count and the MPI parallel count. The results using 1 CPU (MPI parallel x OpenMP parallel = 12) are shown in the left figure, and the results using 2 CPUs (MPI parallel x OpenMP parallel = 24) in the right figure.

The calculation time is shortest when the MPI parallel count is 2, and tends to increase when it is 3 or more. Advance/NeuralMD Pro is designed to make full use of the CPU while keeping GPU utilization high by letting multiple processes access the same GPU device. A certain number of MPI processes is therefore needed, but too many processes overload the GPU and slow the calculation down instead. In this benchmark, an MPI parallel count of 2 per GPU turned out to be appropriate.
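The underlying idea can be pictured with the common round-robin mapping of MPI ranks to GPU devices; the following is a generic sketch assuming mpi4py is available, not the actual internal code of Advance/NeuralMD Pro.

```python
# Sketch: let several MPI ranks share one GPU by mapping ranks to
# devices round-robin. Generic pattern only, not the product's internals.
import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.Get_rank()
n_gpus = 1  # a single NVIDIA A30 in this benchmark

# Pin the rank to a device before any CUDA context is created;
# with 2 ranks and 1 GPU, both ranks feed the same device.
os.environ["CUDA_VISIBLE_DEVICES"] = str(rank % n_gpus)
print(f"rank {rank} -> GPU {rank % n_gpus}")
```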

In addition, the calculation time with 2 CPUs is shorter than with 1 CPU. Some load remains in the parts that are not GPU-accelerated, so CPU performance should also be taken into account when choosing a machine with GPUs.

When the MPI parallel count is 2 (the appropriate number of processes per GPU), a CUDA thread count of 256 is sufficient; performance does not improve when it is set above 256. The calculation times for an MPI parallel count of 2 and a CUDA thread count of 256 are shown in the table below.

Total CPU Core MPI Parallel OpenMP Parallel CUDA Thread sf / s nn / s walltime / s
12 2 6 256 6.23 1.11 189
24 2 12 256 5.72 1.11 171

Acceleration by GPU#

The table below shows the calculation speed with the GPU relative to the CPU-only speed. Using the GPU makes the calculation roughly 2 – 3 times faster.

Total CPU Core sf (relative*) nn (relative*) walltime (relative*)
12 4.8 3.5 2.7
24 3.0 2.8 2.2

* The calculation time using only the CPU divided by that using the GPU
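For reference, these relative values follow directly from the tables above; the snippet below reproduces them for the 12-core case.

```python
# Speedup = (time with CPU only) / (time with GPU), taken from the
# tables above for the 12-core case.
cpu_only = {"sf": 29.7, "nn": 3.94, "walltime": 506}   # MPI 12, no GPU
with_gpu = {"sf": 6.23, "nn": 1.11, "walltime": 189}   # MPI 2, 256 CUDA threads

for key in cpu_only:
    print(f"{key}: {cpu_only[key] / with_gpu[key]:.1f}x")
# -> sf: 4.8x, nn: 3.5x, walltime: 2.7x
```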

Benchmark 2 : Molecular Dynamics Calculation#

We ran MD calculations using LAMMPS 2Jun2022 (the version modified by AdvanceSoft, included in Advance/NanoLabo Tool) with the NN trained in the previous section. The calculation target is an LLZO supercell model containing 12960 atoms. As before, we used hybrid MPI/OpenMP parallelism with MPI parallel count x OpenMP thread count = 12 or 24.

The bottleneck in LAMMPS is the symmetry function calculation. Since it runs on the GPU, tuning the CUDA thread and block counts is important. A thread count of 256 is sufficient, so we measured the calculation time while varying the block size. The MPI parallel count was also varied, as in the previous section.

Results using CPU only#

The CPU-only results are shown in the table below.

Total CPU Core MPI Parallel OpenMP Parallel walltime / s
12 12 1 156
12 6 2 152
24 24 1 91.4
24 12 2 92.1

Results using GPU#

The calculation times were measured while varying the CUDA block size (atomBlock) and the MPI parallel count. Since the symmetry function calculation is independent for each atom, the per-atom degrees of freedom are assigned to the CUDA blocks. The results using 1 CPU (MPI parallel x OpenMP parallel = 12) are shown in the left figure below, and those using 2 CPUs (MPI parallel x OpenMP parallel = 24) in the right figure below.

The calculation time decreases as atomBlock increases. However, the gain in speed levels off when atomBlock is set too large, so keeping atomBlock = 4096 seems better. We also confirmed that the effect of increasing atomBlock is small when the MPI parallel count is 1 (no parallelization), and becomes large when it is 2 or more. We infer that both the GPU and the CPUs are kept at a sufficiently high utilization rate when multiple processes access a single GPU device. Sufficient acceleration can be expected with about 2 – 4 MPI processes per device.
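As a rough illustration of why a larger atomBlock helps, the sketch below assumes that atomBlock is the number of atoms handed to the CUDA grid per kernel launch (our reading of the parameter, not a documented definition); fewer launches mean less launch overhead per step.

```python
# Sketch: assuming atomBlock is the number of atoms covered per kernel
# launch (an interpretation, not a documented definition), a larger
# atomBlock means fewer launches and more blocks per launch.
import math

n_atoms = 12960          # LLZO supercell used in this benchmark
threads_per_block = 256  # thread count fixed at 256, as above

for atom_block in (256, 1024, 4096, 12960):
    launches = math.ceil(n_atoms / atom_block)
    blocks = math.ceil(atom_block / threads_per_block)
    print(f"atomBlock={atom_block}: {launches} launches, "
          f"{blocks} blocks x {threads_per_block} threads per launch")
```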

The calculation times for an MPI parallel count of 4 are shown below.

Total CPU Core MPI Parallel OpenMP Parallel atomBlock walltime / s
12 4 3 4096 43.3
24 4 6 4096 42.3

When the GPU is used, the calculation time depends little on the number of CPU cores. In NN training, the time spent in the parts that are not GPU-accelerated is too large to ignore, but in the MD calculation with LAMMPS the processing that remains on the CPU appears to be relatively cheap.

Acceleration by GPU#

The table below shows the calculation speed with the GPU relative to the CPU-only speed. Using the GPU makes the calculation roughly 2 – 4 times faster.

Total CPU Core walltime (relative*)
12 3.5
24 2.2

* The calculation time using only the CPU divided by that using the GPU

Conclusion#

From these results, we found that GPU acceleration performs as expected in both NN training and MD calculation. In addition, although only a single NVIDIA A30 GPU was used this time, Advance/NeuralMD Pro also supports calculations in environments with multiple GPU devices per machine and/or multiple GPU nodes.
