
Batch Processing LLM Inference

Notes from running an LLM operation over 1M documents.

I experimented with a lot of models, and as advertised, Claude 3 Haiku gave the best results in terms of quality and cost. At the time of writing, 1K input tokens cost $0.00025, which is mind-blowingly cheap: the next best model was about 8 times costlier, not to mention GPT-4o, which was about 20 times costlier.

AWS was the obvious choice for my experiments. Apart from the ChatGPT family of models, they have a pretty wide range of models, and it's fairly easy to experiment with them in the playground. But the SDK is still in beta, and the documentation is strewn across multiple places.

As of the writing of this article, the stable channel of SDKs does not ship with the Bedrock APIs. The best way is to install them manually.

Uninstall any botocore or boto3 that you might have installed previously:

pip uninstall botocore boto3

Install the Bedrock SDK by running:

wget https://d2eo22ngex1n9g.cloudfront.net/Documentation/SDK/bedrock-python-sdk-reinvent.zip
unzip bedrock-python-sdk-reinvent.zip

cd bedrock-python-sdk-reinvent

python3 -m pip install botocore-1.32.4-py3-none-any.whl
python3 -m pip install boto3-1.29.4-py3-none-any.whl

Batch inference works like this: you upload a JSONL-formatted file to S3 and point the job at an output location.
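
To make that concrete, here is a rough sketch of preparing and uploading an input file. The bucket name, key, and prompt are made up, and the recordId/modelInput line format with the Claude Messages body is my reading of the Bedrock docs, so double-check it against the current documentation.

import json

import boto3

documents = ["First document text ...", "Second document text ..."]  # stand-in for the real corpus

# One JSON object per line: a recordId you choose plus the model-specific request body
records = []
for i, doc in enumerate(documents):
    records.append({
        "recordId": f"DOC{i:07d}",
        "modelInput": {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 1024,
            "messages": [
                {"role": "user", "content": [{"type": "text", "text": f"Summarize this document:\n{doc}"}]},
            ],
        },
    })

body = "\n".join(json.dumps(r) for r in records)
boto3.client("s3").put_object(
    Bucket="my-batch-bucket",      # made-up bucket name
    Key="input/batch-00.jsonl",
    Body=body.encode("utf-8"),
)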

The SDK supports these operations:

import boto3

bedrock = boto3.client(service_name="bedrock")

# S3 locations of the JSONL input and the output prefix
inputDataConfig = {"s3InputDataConfig": {"s3Uri": "s3://<bucket>/input/batch-00.jsonl"}}
outputDataConfig = {"s3OutputDataConfig": {"s3Uri": "s3://<bucket>/output/"}}

# Create a model invocation job
bedrock.create_model_invocation_job(
    roleArn="<AWS roleArn>",  # Role with permissions to access the model and the S3 buckets
    modelId="<ModelId>",  # ModelId from the model registry
    jobName="<jobname>",  # Could be anything
    inputDataConfig=inputDataConfig,
    outputDataConfig=outputDataConfig
)

# Get the status of a specific model invocation job
bedrock.get_model_invocation_job(
    jobIdentifier="<job ARN from create_model_invocation_job>"
)

# List all model invocation jobs with a specific filter; the API is also paginated
bedrock.list_model_invocation_jobs(
    status="ALL"  # Status can be "ALL", "FAILED", "SUCCEEDED", "RUNNING"
)

# Stop a specific model invocation job
bedrock.stop_model_invocation_job(
    jobIdentifier="<job ARN from create_model_invocation_job>"
)
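
A small polling loop ties these together. This is a rough sketch: the jobArn key in the create response and the status field with the values above are assumptions to verify against the SDK you installed, and it reuses the inputDataConfig/outputDataConfig from the block above.

import time

import boto3

bedrock = boto3.client(service_name="bedrock")

response = bedrock.create_model_invocation_job(
    roleArn="<AWS roleArn>",
    modelId="<ModelId>",
    jobName="docs-batch-00",
    inputDataConfig=inputDataConfig,    # defined as in the snippet above
    outputDataConfig=outputDataConfig,
)
job_arn = response["jobArn"]  # assumed response key

# Poll every five minutes until the job finishes one way or the other
while True:
    status = bedrock.get_model_invocation_job(jobIdentifier=job_arn)["status"]
    print(job_arn, status)
    if status in ("SUCCEEDED", "FAILED"):
        break
    time.sleep(300)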

The annoying part of setting up batch inference was the S3 permission errors.

Debugging

There were multiple "unable to access bucket with role"-type errors, which could be solved by changing the trust policy of the role to include bedrock.amazonaws.com.
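
For reference, a minimal sketch of what that trust policy change could look like, using a hypothetical role name. Depending on your setup, the role may also need S3 read/write permissions on the buckets themselves.

import json

import boto3

iam = boto3.client("iam")

# Let the Bedrock service assume the batch-inference role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "bedrock.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

iam.update_assume_role_policy(
    RoleName="bedrock-batch-inference-role",  # hypothetical role name
    PolicyDocument=json.dumps(trust_policy),
)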

The max file size allowed is 512 MB. But at my output token count of ~1K, a 250 MB input failed with a "Max time reached" error at around 60% completion. So the practical file size limit is more like 150 MB.

I fragmented my dataset into 150 MB chunks and made multiple jobs out of them. This worked well.
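
The splitting itself is simple as long as you cut on line boundaries so no JSONL record is broken in half. A rough sketch; the file paths are made up, and the 150 MB threshold is just the practical limit above.

CHUNK_BYTES = 150 * 1024 * 1024  # stay under the practical ~150 MB limit

def split_jsonl(path: str, out_prefix: str) -> list[str]:
    """Split a large JSONL file into chunks, cutting only on line boundaries."""
    chunks, lines, size = [], [], 0
    with open(path, "rb") as f:
        for line in f:
            if size + len(line) > CHUNK_BYTES and lines:
                chunks.append(_write_chunk(out_prefix, len(chunks), lines))
                lines, size = [], 0
            lines.append(line)
            size += len(line)
    if lines:
        chunks.append(_write_chunk(out_prefix, len(chunks), lines))
    return chunks

def _write_chunk(out_prefix: str, index: int, lines: list[bytes]) -> str:
    out_path = f"{out_prefix}-{index:02d}.jsonl"
    with open(out_path, "wb") as out:
        out.writelines(lines)
    return out_path

# Upload each chunk to S3 and create one model invocation job per chunk
chunk_paths = split_jsonl("documents.jsonl", "batch")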