TauREx + GPU with CuPy

Ahmed F. Al-Refaie

Introduction

TauREx actually has GPU acceleration capabilities through the TauREx-CuPy plugin. If radiative transfer is the bottleneck of your model, using TauREx-CuPy can significantly speed up your simulations and retrievals.

Of course before we start we need to have TauREx3 installed:

```shell
pip install taurex
```

Installing CuPy

Note: On Colab, cupy is already installed. You can skip this section!

The most important requirement is the cupy library. This library requires installing the correct version for your CUDA installation. Generally you can find this by checking the output of nvcc --version in your terminal:

```shell
nvcc --version
```

You might get some output like:

```text
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0
```

The release number here is 12.5, so you need to install the cupy version compatible with CUDA 12. cupy has a precompiled binary for this version, so you can install it with pip:

```shell
pip install cupy-cuda12x
```

You can find the correct cupy version for your CUDA installation in the Cupy installation guide.
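If you're unsure whether the install worked, a quick check from Python can save a debugging round-trip. A minimal sketch (the `check_cupy` helper here is ours, not part of CuPy or TauREx):

```python
def check_cupy() -> str:
    """Report whether cupy is installed and can see a CUDA device."""
    try:
        import cupy as cp
    except ImportError:
        return "cupy not installed"
    try:
        # Real CuPy runtime API: counts visible CUDA devices
        ndev = cp.cuda.runtime.getDeviceCount()
    except Exception as exc:  # no driver or no GPU visible
        return f"cupy installed but no usable GPU: {exc}"
    return f"cupy OK, {ndev} CUDA device(s) found"

print(check_cupy())
```

On a machine without a GPU this degrades gracefully instead of crashing at import time.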

Installing TauREx-CuPy

Once cupy is installed, you can install the TauREx-CuPy plugin:

```shell
pip install taurex-cupy --no-deps
```

The --no-deps flag is important here: it prevents pip from re-downloading and recompiling cupy. To quickly check that everything is working, run:

```shell
taurex --plugin
```

Should output something like:

```text
Successfully loaded plugins
---------------------------
cupy


Failed plugins
---------------------------
```
If cupy appears under Failed plugins instead, refer to the CuPy installation guide and make sure you installed the version matching your CUDA installation.
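You can also check for the plugin programmatically, e.g. at the top of a setup script. A small sketch (`plugin_available` is a hypothetical helper, not a TauREx API):

```python
import importlib.util

def plugin_available(name: str = "taurex_cupy") -> bool:
    """True if the given package can be imported in this environment."""
    return importlib.util.find_spec(name) is not None

print(plugin_available("taurex_cupy"))
```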

TauREx Input file

TauREx-CuPy works out of the box with any TauREx3 input file: you simply replace the standard radiative transfer model with its cupy counterpart.

Models

Under [Model] you can replace the model_type with one of the following cupy models:

  • transmission -> transmission_cuda
  • transit -> transit_cuda
  • emission -> emission_cuda
  • eclipse -> eclipse_cuda

That's it! TauREx-CuPy will now be used to run your model and should work with all contributions to the radiative transfer model.

For example, the quickstart transmission input file:

```ini
[Model]
model_type = transmission_cuda

    [[Absorption]]

    [[CIA]]
    cia_pairs = H2-H2, H2-He,

    [[Rayleigh]]

    [[SimpleClouds]]
    clouds_pressure = 5e2
```

will now leverage CUDA to integrate the transmission model.

Running a small benchmark with 4 opacities on 100 layers (excluding the time to load opacities) on a Colab T4 GPU, we get:

  • CPU (Numba): 3.68 s
  • GPU (CuPy): 2.48 s

Pretty nice already! Roughly a 33% reduction in runtime… However, we would expect a bit more considering this is GPU acceleration.

Absorption, CIA, Rayleigh and Clouds are still being computed on the CPU. We can change that by using CUDA contributions.

Contributions

For each of the contributions to the radiative transfer model, you can also specify a CUDA version. This fully leverages the GPU to compute these contributions, which are often the bottleneck of the model.

The replacements are:

  • Absorption -> AbsorptionCuda
  • CIA -> CIACuda
  • Rayleigh -> RayleighCuda
  • SimpleClouds -> SimpleCloudsCuda
  • FlatMie -> FlatMieCuda
  • LeeMie -> LeeMieCuda

With the same parameters as the CPU versions.

So taking our last example, we can now have a fully GPU accelerated transmission model with:

```ini
[Model]
model_type = transmission_cuda

    [[AbsorptionCuda]]

    [[CIACuda]]
    cia_pairs = H2-H2, H2-He,

    [[RayleighCuda]]

    [[SimpleCloudsCuda]]
    clouds_pressure = 5e2
```

Now let's run the model again:

  • CPU (Numba): 3.68 s
  • GPU (CuPy): 101 ms

Now *that* is more like it! A fully GPU-accelerated model is about 36x faster than the CPU version. This will of course depend on your GPU, CPU, number of layers and opacities, but you should expect a significant speedup when using the cupy contributions. This can potentially reduce your retrievals from days to hours!

For fun lets test it on a real powerhouse GPU like the A100:

  • CPU (Numba): 3.68 s
  • GPU (CuPy): 18 ms

Oh my… A full model in 18 milliseconds. This is about 200x faster than the CPU version. If a retrieval with the CPU version was taking 48 hours, it would now take only about 15 minutes with the A100.
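As a sanity check, the quoted speedups and the retrieval estimate follow directly from these timings:

```python
# Back-of-the-envelope arithmetic for the benchmark numbers quoted above.
cpu_s = 3.68    # CPU (Numba) runtime per model, seconds
t4_s = 0.101    # T4 GPU runtime, seconds
a100_s = 0.018  # A100 GPU runtime, seconds

print(f"T4 speedup:   {cpu_s / t4_s:.0f}x")    # ~36x
print(f"A100 speedup: {cpu_s / a100_s:.0f}x")  # ~204x

# Scaling a hypothetical 48-hour CPU retrieval by the A100 speedup:
hours = 48 * (a100_s / cpu_s)
print(f"48 h retrieval -> ~{hours * 60:.0f} minutes")  # ~14 minutes
```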

TauREx-CuPy benchmark

Moving on…

Using it in Python

Like the input file, using TauREx-CuPy in Python is very straightforward: you simply import the cupy models and contributions and swap them into your own models.

For example if we have a model defined like so:

```python
from taurex.model import TransmissionModel
from taurex.contributions import (AbsorptionContribution,
                                  CIAContribution,
                                  RayleighContribution)

tm_cpu = TransmissionModel(planet=planet,
                           temperature_profile=isothermal,
                           chemistry=chemistry,
                           star=star,
                           atm_min_pressure=1e-0,
                           atm_max_pressure=1e6,
                           nlayers=100)
tm_cpu.add_contribution(AbsorptionContribution())
tm_cpu.add_contribution(CIAContribution(cia_pairs=['H2-H2', 'H2-He']))
tm_cpu.add_contribution(RayleighContribution())
tm_cpu.build()
```

We can easily convert it to a GPU accelerated model by replacing the model and contributions with their cupy versions:

```python
from taurex_cupy import TransmissionCudaModel
from taurex_cupy import AbsorptionCuda, CIACuda, RayleighCuda

tm_gpu = TransmissionCudaModel(planet=planet,
                               temperature_profile=isothermal,
                               chemistry=chemistry,
                               star=star,
                               atm_min_pressure=1e-0,
                               atm_max_pressure=1e6,
                               nlayers=100)
tm_gpu.add_contribution(AbsorptionCuda())
tm_gpu.add_contribution(CIACuda(cia_pairs=['H2-H2', 'H2-He']))
tm_gpu.add_contribution(RayleighCuda())
tm_gpu.build()
```

Of course running it on the T4:

```python
>>> %timeit tm_cpu.model()
4.78 s ± 364 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> %timeit tm_gpu.model()
98.6 ms ± 919 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Beautiful!
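Speed is only half the story: before switching a retrieval over, it's worth checking that the GPU model reproduces the CPU spectrum. A small sketch (`spectra_match` is our own helper, not part of TauREx; it assumes each model's `model()` call returns a tuple whose second element is the spectrum on the native grid, as TauREx forward models do):

```python
import numpy as np

def spectra_match(model_a, model_b, rtol=1e-5, atol=0.0):
    """True if two forward models produce the same spectrum.

    Assumes .model() returns a tuple whose second element is the
    spectrum array on the native wavenumber grid.
    """
    spec_a = np.asarray(model_a.model()[1])
    spec_b = np.asarray(model_b.model()[1])
    return bool(np.allclose(spec_a, spec_b, rtol=rtol, atol=atol))

# e.g. spectra_match(tm_cpu, tm_gpu) should hold after both .build() calls
```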

PS you can find the benchmark scripts on Colab here

High resolution spectra

To really see the power of TauREx-CuPy, let’s try to run a high resolution spectrum, specifically for opacities at \(R=10^6\).

To keep things simple, we will use only a single opacity (H2O), 100 layers, and the A100 GPU on Colab.

We will use the H2O opacity from the POKAZATEL line list at \(R=10^6\) in the petitRADTRANS format (also supported by TauREx, wink wink), which you can download from here.

Since the opacity goes from \(\lambda=0.3 \mu m\) to \(28 \mu m\), we will focus on a smaller wavelength range from \(0.5 \mu m\) to \(5 \mu m\), since this is where we usually operate for instruments like HST and JWST (NIRSpec/NIRISS/MIRI). This gives us about 2.5 million wavelength points. We will include CIA and Rayleigh contributions as well.
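As an aside, the point count for a constant-resolution grid can be estimated as \(N \approx R \ln(\lambda_{max}/\lambda_{min})\); the exact number depends on how the grid is constructed, but it lands in the same ballpark:

```python
import math

R = 1e6
lam_min, lam_max = 0.5, 5.0  # microns

# Constant-R grid: lambda[i+1] = lambda[i] * (1 + 1/R),
# so N ≈ R * ln(lam_max / lam_min)
n_points = R * math.log(lam_max / lam_min)
print(f"~{n_points / 1e6:.1f} million points")  # ~2.3 million
```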

Testing the model on the CPU and GPU we get:

  • CPU (Numba): 62 s
  • GPU (CuPy, Full): 680 ms

Wow! So we really can run high resolution spectra in no time with TauREx-CuPy. This is a roughly 90x speedup, which is pretty good. The problem is heavily memory-bound, so we don't get quite the same speedups as before.

Conclusion

Hope this was a fun time! GPU acceleration is a powerful tool to speed up your simulations and retrievals. With TauREx-CuPy, leveraging the power of GPUs has never been easier.

I would like to discuss more things, like how only 10% of the A100 GPU was really used during the first benchmark, and how we can pack a bunch of TauREx processes onto a single GPU to run multiple retrievals in parallel by sharing GPU memory and leveraging MPS, but I'm late for my train…

Another time then!

Resources

Where to go next? Have a look at:

TauREx 3 Platform

The most advanced exoplanet retrieval platform.