
Deploying Llama 3 inference for text generation

In this example, we will guide you through setting up and deploying a text generation service using the Llama 3 model on the Covalent Cloud platform. The deployed service exposes an endpoint that generates text from provided prompts.

Ensure you have the latest version of the Covalent Cloud SDK installed on your machine. If you don't, install or upgrade it with the following command: pip install covalent-cloud --upgrade

import covalent_cloud as cc

We need to create an environment specifically tailored for serving our model. This is done using the create_env method, which allows us to specify the required packages and wait for the environment to be ready before proceeding. This is a one-time setup, and the environment can be reused for multiple deployments. The initial container build time varies with the size of the environment, but is typically around 10 minutes. You can check the status of the environment build at https://app.covalent.xyz/environments.

cc.create_env("vllm serve",pip=["vllm==0.5.1", "torch==2.3.0"],wait=True)
Environment Already Exists.

The CloudExecutor is a crucial component in deploying our model. It specifies the computational resources that our service will use; here we choose an A100 GPU for faster inference. The CloudExecutor also lets us specify the environment we created earlier.

executor = cc.CloudExecutor(
    env="vllm serve",
    num_cpus=10,
    num_gpus=1,
    memory="100GB",
    gpu_type=cc.cloud_executor.GPU_TYPE.A100,
    time_limit="30 minutes",
)

We define a service that initializes the Llama model and exposes endpoints for generating text.

Note 1:

We use the vllm module to download the model from Hugging Face into the VM automatically, which adds a few minutes to the deployment time. Alternatively, you can add a workflow step that downloads the model to a volume and point the deployment at the volume path, which greatly reduces the deployment time (see the sketch below).
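For illustration, here is a rough sketch of the volume-based approach. It assumes Covalent Cloud's cc.volume helper, that cc.deploy accepts a volume argument, and that volumes are mounted under /volumes/<name> inside the service; the service name and model directory path are hypothetical, so check the Covalent Cloud volume documentation for the exact usage.

# Hypothetical sketch: serve the model from a pre-populated volume instead of
# downloading it from Hugging Face at deployment time.
model_volume = cc.volume("llama-weights")  # assumes the weights were copied here in an earlier workflow step

@cc.service(executor=executor, name="llama 3 from volume", auth=False)
def vllm_serve_from_volume(model_dir="/volumes/llama-weights/llama-3-8b-Instruct"):
    from vllm import LLM
    # Pointing vLLM at a local directory skips the Hugging Face download.
    llm = LLM(model=model_dir, trust_remote_code=True, enforce_eager=True)
    return {"llm": llm}

deployment_info = cc.deploy(vllm_serve_from_volume, volume=model_volume)()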

Note 2:

We have set auth=False, which makes the endpoint accessible to anyone with the link. You can set auth=True and provide an API key to secure access to the endpoint; API keys can be added and managed in the UI (see the sketch below).
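A minimal sketch of the secured variant: the only change to this example's service definition is the decorator argument, and clients then authenticate with an API key created and managed in the UI.

# Sketch: require an API key for all endpoints of this service.
@cc.service(executor=executor, name="llama 3", auth=True)
def vllm_serve(model):
    ...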

@cc.service(executor=executor, name="llama 3", auth=False)  # One can also define compute shares, etc.
def vllm_serve(model):
    from vllm import LLM
    llm = LLM(model=model, trust_remote_code=True, enforce_eager=True)
    return {"llm": llm}

@vllm_serve.endpoint("/generate")
def generate(llm, prompt, num_tokens=1500,temperature=0.7, top_p=0.8):
"""Accepts either a single prompt or a list of prompts and returns the generated text for each prompt."""

from vllm import SamplingParams

sampling_params = SamplingParams(temperature=temperature, top_p=top_p, max_tokens=num_tokens)

is_single_prompt = isinstance(prompt, str)
prompts = [prompt] if is_single_prompt else prompt

outputs = llm.generate(prompts, sampling_params)
generated_texts = [output.outputs[0].text for output in outputs]

return generated_texts[0] if is_single_prompt else generated_texts

Finally, we deploy our service using the deploy method, specifying the model to be used. Once the deploy command is executed, you can check the deployment status in the UI at https://app.covalent.xyz/functions. The deploy call returns a deployment info object containing the URL of the deployed service, the auth details, and any other information needed, and it can be used to interact with the deployed service directly via function calls.

MODEL_NAME = "unsloth/llama-3-8b-Instruct"

deployment_info = cc.deploy(vllm_serve)(model=MODEL_NAME)

Once the deployment is active in the UI, you can either reload the deployment info object to get the URL of the deployed service or directly access the service using the URL provided in the UI.

deployment_info.reload()
print(deployment_info.generate(prompt="Once upon a time, there was a", num_tokens=10))
small town called Willow Creek. It was a peaceful
prompts=[f"Story {i}: Once upon a time, there was a" for i in range(5)]

for i, replies in enumerate(deployment_info.generate(prompt=prompts, num_tokens=15)):
    print(f"{prompts[i]}..{replies}..")
Story 0: Once upon a time, there was a.. young girl named Sophia who lived in a small village surrounded by a dense forest..
Story 1: Once upon a time, there was a.. young boy named Jack who lived in a small village surrounded by a dense forest..
Story 2: Once upon a time, there was a.. small town called Willowdale. It was a quiet and peaceful place, where..
Story 3: Once upon a time, there was a.. young boy named Jack who lived in a small village surrounded by a dense forest..
Story 4: Once upon a time, there was a.. young girl named Sophia who lived in a small village surrounded by a dense forest..

Or via the command line:

! curl -s -X POST \
    "https://fn.prod.covalent.xyz/166380ef3f7d37dbf2a468613/generate" \
    -d '{"prompt": "Once upon a time, there was a", "num_tokens": 10}'
" young girl named Maria. She was a kind and"

Once you're done with the deployment, you can tear it down using the UI or the teardown method:

deployment_info.teardown()