GPU Access
Covalent Cloud provides access to a variety of GPUs. GPUs are used in Covalent by assigning GPU-equipped Cloud Executors to tasks.
Cloud Executors specify a modular set of compute resources together with a software environment (i.e., a Python version, Python packages, and any other libraries). Here's an example of a Cloud Executor that specifies 4x H100 GPUs and 4x CPUs:
```python
import covalent as ct
import covalent_cloud as cc

gpu_executor = cc.CloudExecutor(
    gpu_type="h100",
    num_gpus=4,
    num_cpus=4,
    memory="16GB",
    env="huggingface-training",
)

@ct.electron(executor=gpu_executor)
def train_model(model_id, data, parameters):
    # Your model training code here
    ...
```
GPU types
GPU types are specified using a Cloud Executor's `gpu_type` parameter. This parameter accepts either a member of the `GPU_TYPE` enum or a GPU name as a lowercase string. For example, `executor_1` and `executor_2` are equivalent in the following:
```python
import covalent_cloud as cc
from covalent_cloud.cloud_executor import GPU_TYPE

# using a name string
executor_1 = cc.CloudExecutor(gpu_type="h100", num_gpus=4)

# using the GPU_TYPE enum
executor_2 = cc.CloudExecutor(gpu_type=GPU_TYPE.H100, num_gpus=4)
```
A list of available GPU types is provided below.
GPU type | GPU name | vRAM per GPU | Details |
---|---|---|---|
H100 | 'h100' | 80 GB | details |
L40 | 'l40' | 48 GB | details |
A100 | 'a100-80g' | 80 GB | details |
A10 | 'a10' | 24 GB | details |
T4 | 't4' | 16 GB | details |
A6000 | 'a6000' | 48 GB | details |
See here for up-to-date pricing for each GPU type.
Cloud executor parameters
Each `CloudExecutor` parameter specifies a compute resource, except `gpu_type` and `env`.
parameter | type | default value | default value meaning |
---|---|---|---|
`num_cpus` | `int` | `1` | task execution uses 1 vCPU |
`memory` | `int` or `str` | `1024` | task execution uses 1024 MB of RAM |
`num_gpus` | `int` | `0` | task execution uses no GPUs |
`gpu_type` | `str` or `GPU_TYPE` | `''` | GPU type not specified (required when `num_gpus > 0`) |
`env` | `str` | `'default'` | task executes in the user's default software environment |
`time_limit` | `int`, `str`, or `timedelta` | `1800` | task execution will be cancelled after 30 minutes |
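For reference, the parameters above can be combined in a single executor. This is a sketch rather than a prescription; the environment name `my-env` is a placeholder for a software environment that exists in your own account.

```python
from datetime import timedelta

import covalent_cloud as cc

# Example executor setting every parameter explicitly.
# "my-env" is a placeholder environment name.
executor = cc.CloudExecutor(
    num_cpus=8,
    memory="32GB",          # also accepts an int, interpreted as MB
    num_gpus=1,
    gpu_type="l40",         # required whenever num_gpus > 0
    env="my-env",
    time_limit=timedelta(hours=1),
)
```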
Number of CPUs
The `num_cpus` parameter must be a positive `int` that indicates the number of vCPUs to make available to a task.
Memory
The `memory` parameter indicates the amount of RAM that a task can use. Integer values for this parameter are always interpreted as megabytes (MB). Memory can also be specified as a string value in units of MB, GB, or GiB, e.g. `memory="32GB"`. Note that maximum limits on `memory` vary for each GPU type.
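To make the unit handling concrete, here is a small stand-alone helper that interprets memory values the way described above. This is purely illustrative, not Covalent's internal parser, and it assumes 1 GB = 1000 MB and 1 GiB = 1024 MB; the platform's exact conversion may differ.

```python
import re

def memory_to_mb(memory):
    """Interpret a memory value: bare ints are MB; strings may carry
    an MB, GB, or GiB suffix.

    NOTE: illustrative only -- not Covalent's internal parser, and it
    assumes 1 GB = 1000 MB and 1 GiB = 1024 MB.
    """
    if isinstance(memory, int):
        return memory
    match = re.fullmatch(r"\s*(\d+)\s*(MB|GB|GiB)\s*", memory)
    if not match:
        raise ValueError(f"unrecognized memory value: {memory!r}")
    value, unit = int(match.group(1)), match.group(2)
    factor = {"MB": 1, "GB": 1000, "GiB": 1024}[unit]
    return value * factor

memory_to_mb("32GB")   # 32000
memory_to_mb(2048)     # 2048
```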
Number of GPUs
The `num_gpus` parameter indicates the desired number of GPUs. The number of GPUs can be (and is by default) `0`. (Note that the number of vCPUs must be at least `1`.) When an executor specifies one or more GPUs, `gpu_type` must also be specified to indicate the type of GPU to use.
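As a sketch of the rule above (executor definitions only; running them requires a valid account):

```python
import covalent_cloud as cc

# OK: CPU-only executor; num_gpus defaults to 0.
cpu_executor = cc.CloudExecutor(num_cpus=2, memory="4GB")

# OK: one or more GPUs together with an explicit gpu_type.
gpu_executor = cc.CloudExecutor(num_gpus=2, gpu_type="a100-80g")

# Not OK: num_gpus > 0 without a gpu_type.
# cc.CloudExecutor(num_gpus=2)
```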
Environment
Software environments can be created in the Covalent Cloud UI or programmatically with `cc.create_env()`. See this guide for more on creating software environments. An executor's `env` parameter must refer to an existing software environment in the user's account. By default, executors initialized with an invalid `env` parameter will immediately raise an error.
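As a sketch of programmatic environment creation, the snippet below creates an environment and references it by name in an executor. The package list is illustrative, and the exact set of `cc.create_env()` options is covered in the environments guide.

```python
import covalent_cloud as cc

# Create an environment named "huggingface-training" with the
# packages a training task might need (illustrative list).
cc.create_env(
    name="huggingface-training",
    pip=["torch", "transformers", "datasets"],
)

# The environment can then be referenced by name in an executor.
executor = cc.CloudExecutor(env="huggingface-training", num_cpus=2)
```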
Time limits
Specifying a `time_limit` on a Cloud Executor defines the maximum run time of a task. Overrunning the time limit generally results in exiting with an error. Time limits are intended to be used as a "safety mechanism" to prevent idle or hanging tasks from accruing costs.
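Per the parameter table, an integer `time_limit` of `1800` corresponds to 30 minutes, i.e. integers are interpreted in seconds. A `timedelta` expresses the same limit more readably (executor definitions only, as a sketch):

```python
from datetime import timedelta

import covalent_cloud as cc

# Equivalent 30-minute limits: an int is interpreted in seconds
# (1800 s = 30 min, matching the default), or pass a timedelta.
executor_seconds = cc.CloudExecutor(num_cpus=1, time_limit=1800)
executor_delta = cc.CloudExecutor(num_cpus=1, time_limit=timedelta(minutes=30))
```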
GPU details
This section tabulates valid ranges of executor parameters for each available GPU type.
NVIDIA H100 Tensor Core GPU
num_gpus | max num_cpus | max memory |
---|---|---|
1 | 28 | 180 GB |
2 | 60 | 360 GB |
4 | 124 | 720 GB |
8 | 252 | 1440 GB |
NVIDIA L40 GPU
num_gpus | max num_cpus | max memory |
---|---|---|
1 | 28 | 58 GB |
2 | 60 | 116 GB |
4 | 124 | 232 GB |
8 | 252 | 464 GB |
NVIDIA A100 Tensor Core GPU
num_gpus | max num_cpus | max memory |
---|---|---|
1 | 28 | 120 GB |
2 | 60 | 240 GB |
4 | 124 | 480 GB |
8 | 252 | 960 GB |
NVIDIA A10G Tensor Core GPU
num_gpus | max num_cpus | max memory |
---|---|---|
1 | 48 | 103 GB |
4 | 192 | 412 GB |
8 | 192 | 768 GB |
NVIDIA T4 Tensor Core GPU
num_gpus | max num_cpus | max memory |
---|---|---|
1 | 4 | 16 GB |
4 | 48 | 192 GB |
8 | 192 | 768 GB |
NVIDIA RTX A6000 Graphics Card
num_gpus | max num_cpus | max memory |
---|---|---|
1 | 28 | 58 GB |
2 | 60 | 116 GB |
4 | 124 | 232 GB |
8 | 252 | 464 GB |
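The tables above can be transcribed into code for quick client-side sanity checks before submitting a workflow. The helper below is hypothetical, not part of the `covalent_cloud` API; only the H100 and L40 rows are shown.

```python
# Per-GPU-count limits (max vCPUs, max memory in GB) transcribed from
# the tables above. Local sanity-check data, not a covalent_cloud API.
LIMITS = {
    "h100": {1: (28, 180), 2: (60, 360), 4: (124, 720), 8: (252, 1440)},
    "l40":  {1: (28, 58),  2: (60, 116), 4: (124, 232), 8: (252, 464)},
}

def check_request(gpu_type, num_gpus, num_cpus, memory_gb):
    """Return True if the request fits within the tabulated limits."""
    max_cpus, max_mem_gb = LIMITS[gpu_type][num_gpus]
    return num_cpus <= max_cpus and memory_gb <= max_mem_gb

check_request("h100", 4, 64, 256)   # True: within 124 vCPUs / 720 GB
check_request("l40", 1, 32, 16)     # False: 32 vCPUs exceeds the 28 max
```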