
[Advance/NeuralMD Pro] GPU Acceleration of Neural Network Potential#

In September 2022, the GPU-accelerated version of Advance/NeuralMD, Advance/NeuralMD Pro, will be released. Both the training of the neural network and the molecular dynamics calculations with LAMMPS are accelerated, and both support multi-CPU and multi-node environments when used in conjunction with MPI. In this article, we introduce the mechanism of the GPU acceleration. The results of GPU benchmarks are presented in other articles.


Calculation procedure of Neural Network Potential#

In a Neural Network Potential (NNP) calculation, the energies and forces are obtained with a multi-layer perceptron, following the procedure shown below.
First, ① calculate the symmetry functions and their derivatives with respect to the neighboring atomic coordinates. Using these, ② calculate the atomic energies by feed-forward propagation, and then ③ calculate the derivatives of the energies with respect to the symmetry functions by backpropagation. Finally, ④ take the sum of the products of these derivatives to obtain the forces that each atom exerts on its neighbors.
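In the usual NNP formulation (the notation $E_i$, $G_{i\mu}$, $\vec{r}_j$ below is introduced here only for illustration), the total energy is a sum of atomic energies, each a function of that atom's symmetry functions, and the forces follow from the chain rule:

$$E = \sum_i E_i\!\left(G_{i1}, \dots, G_{iM}\right), \qquad \vec{F}_j = -\frac{\partial E}{\partial \vec{r}_j} = -\sum_i \sum_{\mu=1}^{M} \frac{\partial E_i}{\partial G_{i\mu}}\,\frac{\partial G_{i\mu}}{\partial \vec{r}_j}.$$

Process ① supplies $\partial G_{i\mu}/\partial \vec{r}_j$, ② evaluates $E_i$, ③ supplies $\partial E_i/\partial G_{i\mu}$, and ④ performs the summation.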


The most expensive part is the calculation of the symmetry functions and their derivatives in process ①. The propagation through the neural network in processes ② and ③ is executed as matrix-matrix products (Level-3 BLAS), but it is extremely inexpensive compared with the other processes.
Process ④ is the second most expensive after the calculation of the symmetry functions. By fixing atom $i$, the sum is evaluated with Level-2 BLAS, using the symmetry-function derivatives as the matrix and the backpropagated energy derivatives as the vector. The force acting on atom $i$ itself is then obtained by applying Newton's third law.
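With the same illustrative notation as above, the per-pair force delivered by the matrix-vector product and the self term recovered from Newton's third law (equivalently, from the translational invariance of the symmetry functions) are

$$\vec{f}_{i\to j} = -\sum_{\mu}\frac{\partial E_i}{\partial G_{i\mu}}\,\frac{\partial G_{i\mu}}{\partial \vec{r}_j} \quad (j \ne i), \qquad -\sum_{\mu}\frac{\partial E_i}{\partial G_{i\mu}}\,\frac{\partial G_{i\mu}}{\partial \vec{r}_i} = -\sum_{j \ne i}\vec{f}_{i\to j}.$$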

As the rightmost side shows, the self-term derivative of the symmetry functions does not need to be calculated explicitly. In addition, the per-pair force contributions are also used to calculate the virial stress.

Moreover, the training of the neural network also requires ⑤ evaluating the error of the forces. Since the forces are first-order derivatives of the neural network, second-order derivatives are needed to compute their error gradients. Because this differentiation involves a large number of Level-2 and Level-3 BLAS operations, it is even more expensive than ④ and takes more than 90% of the training time.
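To see why second-order derivatives appear, consider a force term of the loss function (a generic sketch; the exact loss used in the program may differ), where $w$ denotes the network weights:

$$L_F = \frac{1}{3N}\sum_{j}\left\|\vec{F}_j^{\,\mathrm{NN}} - \vec{F}_j^{\,\mathrm{ref}}\right\|^2, \qquad \frac{\partial \vec{F}_j^{\,\mathrm{NN}}}{\partial w} = -\sum_{i,\mu}\frac{\partial^2 E_i}{\partial w\,\partial G_{i\mu}}\,\frac{\partial G_{i\mu}}{\partial \vec{r}_j},$$

so every weight gradient of the force loss requires mixed second derivatives of the network output.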

In the training process, the symmetry functions and their coordinate derivatives are calculated only once at the beginning (①). After that, processes ② – ⑤ are executed to calculate the energies and the force errors at every epoch. Process ① is already GPU-accelerated, but because it runs only once, it does not contribute significantly to reducing the total training time; what matters is accelerating processes ② – ⑤. Since processes ② and ③ are inexpensive to begin with, they are executed on the host side (CPU) without GPU acceleration. Process ④ is not the largest bottleneck, but it still takes a non-negligible amount of time; however, the derivative data of the symmetry functions is too large (sometimes over 100 GB) to fit into the global memory of the GPU. For these reasons, only the second-order derivative calculation (⑤) is GPU-accelerated. Since ⑤ consists of Level-2 and Level-3 BLAS operations, it can be GPU-accelerated easily using cuBLAS, as sketched below.
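As a rough illustration of this kind of offload, the following shows a single Level-3 BLAS operation executed through cuBLAS (the matrix names and dimensions are purely illustrative and are not the actual internal data structures of Advance/NeuralMD):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Illustrative sizes only: one Level-3 BLAS call C = A * B on the GPU.
int main() {
    const int m = 512, n = 256, k = 128;
    std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(double) * m * k);
    cudaMalloc(&dB, sizeof(double) * k * n);
    cudaMalloc(&dC, sizeof(double) * m * n);
    cudaMemcpy(dA, A.data(), sizeof(double) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), sizeof(double) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    // Column-major GEMM: C(m x n) = alpha * A(m x k) * B(k x n) + beta * C
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(C.data(), dC, sizeof(double) * m * n, cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```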

GPU acceleration of the training processes#

Processes ② – ④ are executed on the CPU, and ⑤ is executed on the GPU. Once ⑤ has been accelerated with the GPU, the time spent in processes ② – ④ is no longer negligible, so a sufficient speed-up cannot be expected if the calculation is simply run as is. Therefore, further performance improvement is achieved by combining this with process parallelism using MPI. For example, if 4 processes are launched with MPI, each process behaves as shown in the figure below during one epoch. Since there is no inter-process communication from the start of an epoch until the end of process ⑤, the processes can run asynchronously. This creates a situation in which some processes are running on the GPU while others are running on the CPU. By handing work to the GPU from more processes (threads) than the number of physical cores, the utilization of the GPU increases and the CPU processing time is relatively reduced. As a result, the training as a whole is accelerated sufficiently. In our experience, 2 – 4 MPI processes per GPU device are enough. In addition, the CPU-side processes are thread-parallelized with OpenMP (in other words, they run in MPI + OpenMP hybrid parallelism on the CPU); a minimal sketch of binding MPI ranks to GPU devices follows.
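The following is a minimal sketch (our own illustration, not the actual Advance/NeuralMD source) of how several MPI ranks can share one or more GPU devices, which is the usage pattern described above:

```cuda
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Map each MPI rank to a device round-robin, so that 2-4 ranks
    // end up sharing one GPU when nprocs exceeds the device count.
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    if (ndev > 0) cudaSetDevice(rank % ndev);

    std::printf("rank %d of %d uses GPU %d\n", rank, nprocs, ndev > 0 ? rank % ndev : -1);

    // ... CPU work (processes 2-4, OpenMP-threaded) and GPU work (process 5) ...

    MPI_Finalize();
    return 0;
}
```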

Acceleration of MD calculation via LAMMPS#

Since MD calculations using the trained NNP are also expensive, GPU acceleration is needed there as well. Advance/NeuralMD uses LAMMPS for the MD calculations; the NNP calculation part is our own implementation. In an MD calculation, processes ② – ④ are needed at every MD step. As in the training process, ② and ③ are calculated on the CPU, while the calculations of the symmetry functions (①) and the forces (④) are GPU-accelerated. Because the derivative data of the symmetry functions is smaller than in the training process, it is kept in the global memory of the GPU after it is generated in process ① and reused in process ④. This data is handled only inside the GPU, so it never has to be transferred to the host. Process ① is the largest bottleneck (over 95% of the calculation time), but the symmetry-function calculation maps well onto the GPU, so a large speed-up can be expected. In addition, as in the training process, running 2 – 4 MPI processes per GPU device improves the efficiency of the calculation. A minimal sketch of the device-resident reuse in process ④ follows.
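This "compute once on the device, reuse on the device" pattern might look roughly as follows (buffer names and layout are illustrative; the real code organizes the data per atom and per symmetry-function component):

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>

// dGdr  : symmetry-function derivatives produced by the process-1 kernel,
//         kept in GPU global memory (3*nNeigh rows x nSym columns).
// dEdG  : energy derivatives from backpropagation (length nSym), copied in.
// force : per-neighbor force contributions (length 3*nNeigh), stays on device.
void forces_from_device_resident_data(cublasHandle_t handle,
                                      const double* dGdr, const double* dEdG,
                                      double* force, int nNeigh, int nSym) {
    const double alpha = -1.0, beta = 0.0;
    // Level-2 BLAS (process 4): force = -dGdr * dEdG, no host transfer needed.
    cublasDgemv(handle, CUBLAS_OP_N, 3 * nNeigh, nSym,
                &alpha, dGdr, 3 * nNeigh, dEdG, 1, &beta, force, 1);
}
```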

GPU acceleration of the symmetry function calculation#

Here we introduce how the calculation of the symmetry functions is GPU-accelerated. The symmetry functions have radial and angular components; we explain the radial component first. Generally, the radial component is calculated as a sum of pair contributions over the neighboring atoms.
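A typical radial symmetry function (the Behler-Parrinello G2-type form, shown here only as an illustrative example; the precise functional form is implementation-dependent) is

$$G_i^{\mathrm{rad}} = \sum_{j \ne i} e^{-\eta\,(r_{ij}-r_s)^2}\, f_c(r_{ij}),$$

where $r_{ij} = |\vec{r}_j - \vec{r}_i|$, $f_c$ is a cutoff function, and $\eta$, $r_s$ are parameters.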

The right-hand side is a summation over all the neighboring atoms $j$. Next we need its derivative with respect to the atomic coordinates.
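With the pair term of the illustrative form above written as $g(r_{ij}) = e^{-\eta(r_{ij}-r_s)^2} f_c(r_{ij})$, this derivative is

$$\frac{\partial G_i^{\mathrm{rad}}}{\partial \vec{r}_j} = g'(r_{ij})\,\frac{\vec{r}_j-\vec{r}_i}{r_{ij}} \quad (j \ne i), \qquad \frac{\partial G_i^{\mathrm{rad}}}{\partial \vec{r}_i} = -\sum_{j \ne i} g'(r_{ij})\,\frac{\vec{r}_j-\vec{r}_i}{r_{ij}},$$

where the second relation is the self term discussed below.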

For $j \ne i$, this can be GPU-accelerated easily by assigning $i$ to the CUDA blocks and $j$ to the CUDA threads. The number of symmetry-function components per atom is factorized so as to obtain an appropriate number of threads per block and is combined as a direct product with the $i$ and $j$ indices. An inter-thread dependence would occur for the $j = i$ term, so this term is deliberately left uncalculated to improve performance. Even without calculating it explicitly, its contribution can be taken in afterwards, because Newton's third law holds between the pair contributions.
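A much-simplified kernel sketch of this block/thread mapping (one symmetry-function component, the illustrative G2 form above, fixed-size neighbor lists; the production kernel handles multiple components, species, and cutoff schemes):

```cuda
#include <cuda_runtime.h>

// One block per central atom i, one thread per neighbor slot jj.
// pos: atomic positions; neigh: neighbor indices (maxNeigh per atom).
// G accumulates the symmetry function of atom i; dGdr stores dG_i/dr_j.
__global__ void radial_g2_kernel(const double3* pos, const int* neigh,
                                 const int* numNeigh, int maxNeigh,
                                 double eta, double rs, double rc,
                                 double* G, double3* dGdr) {
    const double PI = 3.14159265358979323846;
    const int i  = blockIdx.x;    // central atom
    const int jj = threadIdx.x;   // neighbor slot
    if (jj >= numNeigh[i]) return;

    const int j = neigh[i * maxNeigh + jj];
    const double3 ri = pos[i], rj = pos[j];
    const double dx = rj.x - ri.x, dy = rj.y - ri.y, dz = rj.z - ri.z;
    const double r = sqrt(dx * dx + dy * dy + dz * dz);
    if (r >= rc) return;

    // Pair term g(r) = exp(-eta*(r - rs)^2) * fc(r) with a cosine cutoff.
    const double fc  = 0.5 * (cos(PI * r / rc) + 1.0);
    const double dfc = -0.5 * PI / rc * sin(PI * r / rc);
    const double e   = exp(-eta * (r - rs) * (r - rs));
    const double g   = e * fc;
    const double dg  = e * (dfc - 2.0 * eta * (r - rs) * fc);   // dg/dr

    // The j = i self term is deliberately not computed here; it is
    // recovered later via Newton's third law, as described in the text.
    atomicAdd(&G[i], g);                      // double atomicAdd: compute capability >= 6.0
    dGdr[i * maxNeigh + jj] = make_double3(dg * dx / r, dg * dy / r, dg * dz / r);
}
```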

Next, we explain the angular component. Its calculation needs some additional devices. Generally, the angular component is calculated as a sum over pairs of neighboring atoms.
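A typical angular symmetry function (the Behler-Parrinello G4-type form, again used only as an illustrative example) is

$$G_i^{\mathrm{ang}} = 2^{1-\zeta}\sum_{j \ne i}\,\sum_{k > j} \left(1 + \lambda\cos\theta_{ijk}\right)^{\zeta} e^{-\eta\left(r_{ij}^2 + r_{ik}^2 + r_{jk}^2\right)}\, f_c(r_{ij})\, f_c(r_{ik})\, f_c(r_{jk}),$$

where $\theta_{ijk}$ is the angle between $\vec{r}_{ij}$ and $\vec{r}_{ik}$, and $\zeta$, $\lambda$, $\eta$ are parameters.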

The summation on the right-hand side couples the neighbor indices $j$ and $k$, which makes it difficult to GPU-accelerate. Therefore, using the relation $\cos\theta_{ijk} = \vec{r}_{ij}\cdot\vec{r}_{ik}\,/\,(r_{ij}\,r_{ik})$, the expression is transformed as follows.
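Under one natural reading of this transformation, the restricted pair sum is replaced by an unrestricted double sum in which every unordered pair is counted twice:

$$G_i^{\mathrm{ang}} = \frac{2^{1-\zeta}}{2}\sum_{j \ne i}\,\sum_{\substack{k \ne i \\ k \ne j}} \left(1 + \lambda\,\frac{\vec{r}_{ij}\cdot\vec{r}_{ik}}{r_{ij}\,r_{ik}}\right)^{\zeta} e^{-\eta\left(r_{ij}^2 + r_{ik}^2 + r_{jk}^2\right)}\, f_c(r_{ij})\, f_c(r_{ik})\, f_c(r_{jk}),$$

so that, for a fixed $j$, the inner sum over $k$ can be evaluated independently of every other $j$.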

Although the amount of calculation is doubled, the coupling between $j$ and $k$ disappears. The derivative of this expression with respect to the atomic coordinates is then taken in the same manner.

Then the situation becomes similar to that of the radial component: in short, we assign $i$ to the CUDA blocks and $j$ to the CUDA threads, and each thread takes charge of the sum over $k$ for its own $j$. There is a loop over $k$ inside each thread, and every thread accesses all of atom $i$'s neighbors in that loop. Since the set of $k$ is identical to the set of $j$, a speed-up can be expected by loading the neighbor data into shared memory, which reduces the frequency of accesses to global memory. In addition, for the $j = i$ case, Newton's third law can be used just as for the radial component.
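A simplified sketch of this block/thread/shared-memory layout (one angular component, the illustrative G4-type term above without the cutoff and $r_{jk}$ factors, fixed-size neighbor lists; not the production kernel):

```cuda
#include <cuda_runtime.h>

#define MAX_NEIGH 128   // illustrative upper bound on neighbors per atom

// One block per central atom i, one thread per neighbor j;
// each thread loops over all neighbors k of the same atom i.
__global__ void angular_kernel(const double3* pos, const int* neigh,
                               const int* numNeigh, double eta, double lambda,
                               double zeta, double rc, double* G) {
    const int i  = blockIdx.x;
    const int jj = threadIdx.x;
    const int nn = numNeigh[i];

    // Stage the neighbor positions of atom i in shared memory: the k loop
    // below touches every neighbor, so this avoids repeated global loads.
    __shared__ double3 sPos[MAX_NEIGH];
    if (jj < nn) sPos[jj] = pos[neigh[i * MAX_NEIGH + jj]];
    __syncthreads();
    if (jj >= nn) return;

    const double3 ri = pos[i];
    const double3 rj = sPos[jj];
    const double xij = rj.x - ri.x, yij = rj.y - ri.y, zij = rj.z - ri.z;
    const double rij = sqrt(xij * xij + yij * yij + zij * zij);
    if (rij >= rc) return;

    double acc = 0.0;
    for (int kk = 0; kk < nn; ++kk) {          // unrestricted k loop (k != j)
        if (kk == jj) continue;
        const double3 rk = sPos[kk];
        const double xik = rk.x - ri.x, yik = rk.y - ri.y, zik = rk.z - ri.z;
        const double rik = sqrt(xik * xik + yik * yik + zik * zik);
        if (rik >= rc) continue;
        const double cosT = (xij * xik + yij * yik + zij * zik) / (rij * rik);
        // Illustrative G4-type pair term (cutoff and r_jk factors omitted).
        acc += pow(fmax(1.0 + lambda * cosT, 0.0), zeta)
             * exp(-eta * (rij * rij + rik * rik));
    }
    // Each unordered (j,k) pair is counted twice, hence the factor 1/2.
    atomicAdd(&G[i], 0.5 * acc);               // double atomicAdd: compute capability >= 6.0
}
```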

If Chebyshev polynomials are used as the symmetry functions, evaluating them through the recurrence relation introduces a dependence between successive orders, so another device is needed: they are deliberately calculated less efficiently, through cosines, which makes them easier to handle on the GPU.
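Concretely, the recurrence and the equivalent closed form through cosines are

$$T_{n+1}(x) = 2x\,T_n(x) - T_{n-1}(x), \qquad T_n(x) = \cos\!\left(n\arccos x\right) \quad (|x| \le 1),$$

so each order $n$ can be evaluated by a thread independently, at the cost of a trigonometric function call, instead of sequentially through the recurrence.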
