GPU Benchmark of General-purpose GNN Force Fields with H200 and A100#
General-purpose graph neural network (GNN) force fields use neural networks to achieve higher versatility and accuracy than conventional force fields. Many have been developed and released by universities, research institutions, and companies, and AdvanceSoft has modified LAMMPS to support a variety of them, making them available through Advance/NanoLabo.
In this case study, we performed molecular dynamics calculations on the sulfide lithium-ion conductor Li10GeP2S12 using various GNN force fields available in NanoLabo with NVIDIA H200 and A100 GPUs. We investigated the differences in calculation time and the maximum size of the calculable system (number of atoms) depending on the force field type and model. Additionally, for the multi-GPU compatible force field SevenNet, we performed calculations using up to eight H200 GPUs to examine the scaling with the number of GPUs.
Computational Environment#
The specifications of the machines used in this case study are shown below.
- H200-equipped machine
  - CPU: Intel Xeon Platinum 8480+ (56 cores) ×2
  - GPU: NVIDIA H200 ×8
  - CUDA: 12.4
- A100-equipped machine
  - CPU: Intel Xeon Silver 4310 (12 cores)
  - GPU: NVIDIA A100 80GB
  - CUDA: 12.6
The computational environment was created using the GPU cloud service "GPUSOROBAN" with the cooperation of HIGHRESO Co., Ltd.
Calculation Conditions#
Based on the structure file of Li10GeP2S12 (mp-696128) obtained from the Materials Project, supercell models were created with varying scales to achieve the desired number of atoms. However, for MACE-OFF, which specializes in organic molecules and supports a limited set of elements, we used a system configured to the target number of atoms based on the structure file of serotonin C10H12N2O (pc-5202) obtained from PubChem.
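As an illustration of how the supercell scales relate to the target atom counts, the replication factors follow from the atoms per unit cell (50 for the conventional cell of Li10GeP2S12, consistent with the system sizes reported below, e.g. 86,400 = 50 × 12³). The helper below is our own minimal sketch, not part of the actual workflow:

```python
# Sketch: find the supercell replication (na, nb, nc) whose atom count is
# closest to a target, given n_cell atoms in the unit cell.
# n_cell = 50 is assumed for the conventional cell of Li10GeP2S12 (mp-696128).
from itertools import product

def best_supercell(n_cell, target, max_rep=12):
    """Return ((na, nb, nc), atom_count) closest to the target atom count."""
    best = min(
        product(range(1, max_rep + 1), repeat=3),
        key=lambda reps: abs(n_cell * reps[0] * reps[1] * reps[2] - target),
    )
    return best, n_cell * best[0] * best[1] * best[2]

reps, natoms = best_supercell(50, 86400)
print(reps, natoms)  # -> (12, 12, 12) 86400
```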
Molecular dynamics calculations were performed for 100 steps under an NVT ensemble (T = 500 K) with a time step of 0.5 fs. Care was taken not to include the download time for pre-trained models, which many GNN force fields download on their first run, in the calculation time.
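The MD conditions above translate into a short LAMMPS input along these lines. This is only a sketch: the data file name is hypothetical, and the `pair_style`/`pair_coeff` lines are placeholders whose exact form depends on the GNN force-field interface built into the modified LAMMPS.

```
# Sketch of the MD settings described above (LAMMPS, metal units).
units       metal
atom_style  atomic
read_data   Li10GeP2S12_supercell.data   # hypothetical data file

pair_style  <gnn-pair-style>             # placeholder: depends on the GNN interface
pair_coeff  * * <model-file> Li Ge P S   # placeholder model file

velocity    all create 500.0 12345
timestep    0.0005                       # 0.5 fs (metal units use ps)
fix         1 all nvt temp 500.0 500.0 0.05
run         100
```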
Comparison of GNN Force Fields and Models#
The execution time (Looptime) versus the number of atoms is plotted below for each force field. The force fields shown here do not support MPI parallelization, so the calculations were run as a single process.
- Figure: Looptime vs. number of atoms (H200)
- Figure: Looptime vs. number of atoms (A100 80GB)
In all cases, the calculation time increases almost linearly with the number of atoms, indicating that the force fields scale to larger systems without a loss of efficiency.
We prepared systems of up to 86,400 atoms, but GPU memory (VRAM) capacity limits the maximum calculable system size. Both the calculation time and this maximum clearly vary with the type of force field and, even within the same force field, with the pre-trained model used.
Additionally, to compare the calculation times of the different GNN force fields, the table below summarizes the times for a 3,200-atom system, along with the maximum calculable number of atoms.
| Force Field | Model | H200 Calculation Time (Looptime/s) | A100 Calculation Time (Looptime/s) | H200 Max. Atoms | A100 Max. Atoms |
| --- | --- | --- | --- | --- | --- |
| MatGL | M3GNet-MatPES-PBE-v2025.1-PES | 14.3 | 19.6 | 32,400 | 17,150 |
| MatGL | M3GNet-MP-2021.2.8-PES | 14.1 | 18.5 | 50,000 | 28,800 |
| CHGNet | 0.3.0 | 89.4 | 107.3 | 17,150 | 13,500 |
| MACE | small-0b2 | 10.5 | 15.5 | >86,400 | 50,000 |
| MACE | medium-0b3 | 21.6 | 37.2 | 28,800 | 14,700 |
| MACE | large-0b2 | 26.3 | 50.9 | 17,150 | 12,600 |
| MACE | mace-osaka24-small | 9.2 | 13.0 | >86,400 | 66,550 |
| MACE | mace-osaka24-medium | 14.0 | 24.2 | 66,550 | 36,450 |
| MACE | mace-osaka24-large | 23.4 | 45.3 | 25,600 | 14,700 |
| Orb | orb-v2 | 3.2 | 6.5 | >86,400 | >86,400 |
| MatterSim | MatterSim-v1.0.0-1M | 16.7 | 20.0 | 66,550 | 36,450 |
| MatterSim | MatterSim-v1.0.0-5M | 19.1 | 29.6 | 32,400 | 17,150 |
| MACE-OFF | small | 7.8 | 11.0 | 66,550 | 36,450 |
| MACE-OFF | medium | 14.8 | 28.2 | 28,800 | 14,700 |
| MACE-OFF | large | 42.5 | 93.9 | 6,250 | 3,200 |
SevenNet Multi-GPU Benchmark#
We ran SevenNet calculations on the H200-equipped machine and plotted the calculation time (Looptime) and the relative calculation speed against the number of GPUs for each model. For some models, runs on a small number of GPUs failed due to insufficient GPU memory; in those cases, the calculation time was extrapolated from successful runs by assuming that the time is proportional to the number of atoms and inversely proportional to the number of GPUs. Note that relative calculation speeds computed against different baselines therefore cannot be compared directly between models.
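The extrapolation rule described above can be written down directly. The function name and the reference values in the example are illustrative, not measured results:

```python
# Sketch of the extrapolation used when a run failed on few GPUs:
# assume looptime scales linearly with atom count and inversely with GPU count.
def extrapolate_looptime(t_ref, natoms_ref, ngpu_ref, natoms, ngpu):
    """Estimate the looptime at (natoms, ngpu) from one measured reference run."""
    return t_ref * (natoms / natoms_ref) * (ngpu_ref / ngpu)

# Illustrative numbers: a run measured at 100 s on 4 GPUs for 49,000 atoms,
# extrapolated to 98,000 atoms on 2 GPUs.
print(extrapolate_looptime(100.0, 49000, 4, 98000, 2))  # -> 400.0
```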
Since the A100-equipped machine has only one GPU, a benchmark against the number of GPUs could not be performed on it; where a calculation fit on the single GPU, only its calculation time is included in the charts.
In all cases, increasing the number of GPUs improves the speed up to 8 GPUs, but the effect is larger for the 98,000-atom system, suggesting that larger systems make better use of multiple GPUs.
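One way to read the scaling charts is through speedup and parallel efficiency relative to the single-GPU time. The function below and its looptime values are illustrative, not measured results:

```python
# Sketch: speedup S(N) = t(1)/t(N) and parallel efficiency E(N) = S(N)/N,
# relative to the single-GPU looptime. The input values are made up.
def scaling(looptimes):
    """looptimes: {n_gpus: looptime_s}; returns {n_gpus: (speedup, efficiency)}."""
    t1 = looptimes[1]
    return {n: (t1 / t, t1 / (t * n)) for n, t in sorted(looptimes.items())}

example = {1: 80.0, 2: 42.0, 4: 23.0, 8: 14.0}  # illustrative looptimes
for n, (s, e) in scaling(example).items():
    print(f"{n} GPU(s): speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency well below 1.0 at 8 GPUs is the signature of the communication overhead discussed below.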
Furthermore, the tendency for calculation speed to drop with an odd number of GPUs, observed in previous benchmarks, appeared again here. Comparing 6 and 7 GPUs, some models show almost no change in speed, likely because the shape of the spatial decomposition increases the communication volume between GPUs. We therefore recommend using an even number of GPUs, especially when running on many GPUs.