---
title: "Run GLM-OCR on RunPod Serverless: 17-line Dockerfile"
slug: "run-glm-ocr-on-runpod-serverless-dockerfile"
published: "2026-02-09"
updated: "2026-04-06"
validated: "2026-02-15"
categories:
  - "Docker"
tags:
  - "Run GLM-OCR on RunPod"
  - "RunPod Serverless"
  - "GLM-OCR"
  - "Transformers v5"
  - "vllm"
  - "Dockerfile for serverless"
  - "pre-download model weights"
  - "huggingface snapshot_download"
  - "huggingface-hub"
  - "vllm/vllm-openai:nightly"
  - "serverless cold starts"
  - "baked model weights"
llm-intent: "reference"
audience-level: "intermediate"
framework-versions:
  - "transformers@5-dev (install from git+https://github.com/huggingface/transformers.git)"
  - "vllm@nightly (base image vllm/vllm-openai:nightly)"
  - "huggingface_hub@latest (used for snapshot_download)"
  - "python@3.x"
status: "stable"
llm-purpose: "Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly."
llm-prereqs:
  - "Access to RunPod Serverless"
  - "Access to Docker"
  - "Access to vLLM"
  - "Access to Transformers (dev branch)"
  - "Access to HuggingFace Hub"
llm-outputs:
  - "Completed outcome: Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly."
---

**Summary Triples**
- (Dockerfile, installs, Transformers v5/dev branch during image build via pip install -U git+https://github.com/huggingface/transformers.git)
- (Dockerfile, uses, vllm base image vllm/vllm-openai:nightly as the runtime layer)
- (Dockerfile, bakes, GLM-OCR model weights into the image using huggingface_hub.snapshot_download at build time)
- (RunPod Serverless endpoint, serves, vllm on port 8080 using the command 'vllm serve zai-org/GLM-OCR' (or equivalent path))
- (Pre-baked weights, reduce, serverless cold-start latency by preventing multi-gigabyte runtime downloads)
- (Building image, may require, HUGGINGFACE_TOKEN as a build-arg or env var when snapshot_downloading private or rate-limited models)
- (Trade-off, is, longer image build time and larger image size in exchange for faster cold starts)
- (Recommended validation, is, build/run the image locally and confirm vllm serves and loads GLM-OCR before publishing to RunPod)

### {GOAL}
Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly.

### {PREREQS}
- Access to RunPod Serverless
- Access to Docker
- Access to vLLM
- Access to Transformers (dev branch)
- Access to HuggingFace Hub

### {STEPS}
1. Select vLLM nightly base image
2. Install git in image
3. Upgrade to Transformers v5 dev
4. Pre-download GLM-OCR model weights
5. Expose port and set CMD
6. Push Dockerfile to GitHub
7. Create RunPod Serverless endpoint
8. Test endpoint with OpenAI call

<!-- llm:goal="Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly." -->
<!-- llm:prereq="Access to RunPod Serverless" -->
<!-- llm:prereq="Access to Docker" -->
<!-- llm:prereq="Access to vLLM" -->
<!-- llm:prereq="Access to Transformers (dev branch)" -->
<!-- llm:prereq="Access to HuggingFace Hub" -->
<!-- llm:output="Completed outcome: Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly." -->

# Run GLM-OCR on RunPod Serverless: 17-line Dockerfile
> Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly.
Matija Žiberna · 2026-02-09

I wanted to run [GLM-OCR](https://huggingface.co/zai-org/GLM-OCR) as a serverless endpoint. Not a dedicated pod sitting idle burning credits, but a proper serverless setup that scales to zero when nobody is calling it and spins up on demand. RunPod Serverless seemed like the right fit, but getting there was not as straightforward as I expected.

The problem is that GLM-OCR requires a bleeding-edge version of Transformers (v5+ dev branch) that no public Docker image ships with. You cannot just pick `vllm/vllm-openai:nightly` from the registry, set a start command, and go. The model will not load. You need a custom image, and if you are running serverless, you want that image to be as self-contained as possible so cold starts do not punish you with multi-gigabyte downloads every time a new worker spins up.

This guide walks through the exact Dockerfile I built, the gotchas I hit along the way, and how to deploy it on RunPod Serverless. The full source is available at [github.com/matija2209/ocr-docker](https://github.com/matija2209/ocr-docker).

## Why not just use a public image?

RunPod Serverless gives you two options: use a public Docker image or build from a public GitHub repo. The public image route sounds appealing — point it at `vllm/vllm-openai:nightly`, add a start command that installs Transformers from source, and let it rip.

The start command would look something like this:

```bash
bash -lc '
set -e
pip uninstall -y transformers || true
pip install -U git+https://github.com/huggingface/transformers.git
exec vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
'
```

This technically works, but it has two serious downsides for serverless. First, every cold start pays the cost of installing Transformers from source. That is a `git clone` plus a wheel build on every new worker. Second, the model weights get downloaded from HuggingFace on every cold start too. For a model that is roughly 2GB, that adds meaningful latency before your endpoint can serve its first request.

The better approach is to bake everything into the image at build time: the Transformers upgrade, the model weights, all of it. One build, and every cold start after that just loads from disk.

## The Dockerfile

Here is the complete Dockerfile. It is short, but every line is there for a reason.

```dockerfile
# File: Dockerfile
FROM vllm/vllm-openai:nightly

# git is needed for pip install from GitHub
RUN apt-get update && apt-get install -y --no-install-recommends git \
 && rm -rf /var/lib/apt/lists/*

# Install newer Transformers so GLM-OCR is recognized
RUN pip uninstall -y transformers || true \
 && pip install -U git+https://github.com/huggingface/transformers.git

# Pre-download model weights into the image so cold starts don't hit HuggingFace
ENV HF_HOME=/root/.cache/huggingface
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('zai-org/GLM-OCR')"

EXPOSE 8080

CMD ["vllm", "serve", "zai-org/GLM-OCR", "--allowed-local-media-path", "/", "--port", "8080"]
```

Let me walk through what each piece does and why it is necessary.

## Installing git

The `vllm/vllm-openai:nightly` base image does not include `git`. That might seem like a minor detail, but `pip install git+https://...` literally needs git to clone the repository. Without it, the build fails immediately with `executable file not found in $PATH`. A quick `apt-get install` solves this.

```dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends git \
 && rm -rf /var/lib/apt/lists/*
```

The `--no-install-recommends` flag keeps the layer small by skipping suggested packages, and cleaning up `/var/lib/apt/lists/*` removes the package index cache.

## Upgrading Transformers

GLM-OCR requires Transformers v5+, which at the time of writing has not been released to PyPI yet. The only way to get it is to install directly from the main branch on GitHub.

```dockerfile
RUN pip uninstall -y transformers || true \
 && pip install -U git+https://github.com/huggingface/transformers.git
```

The uninstall-then-install pattern ensures a clean replacement. You will see a pip warning that vLLM requires `transformers<5`, but this is safe to ignore. vLLM works fine with the dev branch, and GLM-OCR will not load without it.
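
If you want to confirm that the dev build actually replaced the pinned release inside the built image, you can check the installed version after building. This is just a sanity check, not part of the deploy flow; the `glm-ocr-vllm` tag is an illustrative name for whatever you tagged the image as:

```bash
# Print the Transformers version baked into the image (tag is illustrative)
# --entrypoint makes this work even if the base image defines its own entrypoint
docker run --rm --entrypoint python3 glm-ocr-vllm -c "import transformers; print(transformers.__version__)"
```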

## Baking in the model weights

This is the most important step for serverless performance. Without it, every new worker downloads roughly 2GB of model weights from HuggingFace before it can serve a single request.

```dockerfile
ENV HF_HOME=/root/.cache/huggingface
RUN python3 -c "from huggingface_hub import snapshot_download; snapshot_download('zai-org/GLM-OCR')"
```

A few things to note here. The command uses `python3`, not `python`. The base image is Debian-based and does not alias `python` to `python3`, so using `python` will fail with a "not found" error. I also use the Python API directly (`snapshot_download`) rather than the `huggingface-cli` command because the CLI binary can end up outside of `$PATH` after upgrading `huggingface-hub` during the Transformers install.
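
A quick way to double-check that the snapshot really is inside the image, rather than being re-downloaded at runtime, is to repeat the same call with `local_files_only=True` against the built image. Again a sketch with an illustrative tag:

```bash
# Resolves the model from the image's local HF cache only; fails if the weights were not baked in
docker run --rm --entrypoint python3 glm-ocr-vllm \
  -c "from huggingface_hub import snapshot_download; print(snapshot_download('zai-org/GLM-OCR', local_files_only=True))"
```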

GLM-OCR is a public model under the MIT license, so no authentication token is needed. You will see a warning about unauthenticated requests having lower rate limits, but the download completes fine. If you want to avoid that rate limit (and the warning) during builds, you can set `HF_TOKEN` as an environment variable in RunPod's build settings.

The resulting image is around 10.5GB. That is large by general Docker standards, but completely normal for ML serving images and well within RunPod's limits.
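
Before handing the repository to RunPod, it is worth one local round-trip to confirm that vLLM actually loads GLM-OCR from the baked weights. A rough sketch, assuming a machine with an NVIDIA GPU and the NVIDIA Container Toolkit installed, and an illustrative image tag:

```bash
# Build the image from the Dockerfile in the current directory (tag is illustrative)
docker build -t glm-ocr-vllm .

# Start the server; requires a local NVIDIA GPU and the NVIDIA Container Toolkit
# --entrypoint makes the serve command explicit in case the base image defines its own entrypoint
docker run --rm --gpus all -p 8080:8080 --entrypoint vllm glm-ocr-vllm \
  serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080

# In a second terminal, confirm the model is registered and ready to serve
curl http://localhost:8080/v1/models
```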

## Deploying on RunPod Serverless

Push the Dockerfile to a public GitHub repository. I use [github.com/matija2209/ocr-docker](https://github.com/matija2209/ocr-docker) for this.

In RunPod, create a new Serverless endpoint and select **Build from GitHub repo**. Point it to your repository. You do not need to set a container start command because the `CMD` in the Dockerfile already handles it.

No environment variables are required for the basic setup. The model is public, the port is configured, and the weights are in the image. Just deploy and wait for the build to complete.

Once the endpoint is live, it serves an OpenAI-compatible API. You can call it like this:

```bash
curl https://<your-runpod-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "zai-org/GLM-OCR",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "image_url", "image_url": {"url": "https://example.com/document.png"}},
          {"type": "text", "text": "Text Recognition:"}
        ]
      }
    ]
  }'
```
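
The response follows the standard OpenAI chat-completions shape, so the recognized text comes back in `choices[0].message.content`. If you have `jq` installed, you can pull it out directly; here the JSON body from the request above is assumed to be saved in a file named `request.json` (a name chosen only for this example):

```bash
# Same request as above, with the body saved to request.json; prints only the recognized text
curl -s https://<your-runpod-endpoint>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @request.json | jq -r '.choices[0].message.content'
```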

## GLM-OCR prompt reference

GLM-OCR is not a general-purpose vision model. It supports a specific set of prompts, and using anything outside of these will give you unreliable results.

For document parsing, use one of these exact prompt strings:

- `Text Recognition:` for extracting raw text
- `Formula Recognition:` for mathematical formulas
- `Table Recognition:` for table structures

For information extraction, provide a JSON schema that defines exactly what fields you want. The model will fill in the values from the document:

```json
{
  "role": "user",
  "content": [
    {"type": "image_url", "image_url": {"url": "path/to/id-card.png"}},
    {"type": "text", "text": "Please output the information in the image in the following JSON format:\n{\"name\": \"\", \"date_of_birth\": \"\", \"id_number\": \"\"}"}
  ]
}
```

The model's output strictly follows the JSON schema you provide. This is by design and is what makes the model useful for structured document processing pipelines.
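
Because the content field itself is a JSON string in the extraction case, a pipeline step can parse it straight into fields. A minimal sketch with `jq`, assuming the model returns bare JSON as described and the raw API response has been saved to a file (called `response.json` purely for illustration):

```bash
# message.content holds a JSON string; fromjson parses it so individual fields can be selected
jq -r '.choices[0].message.content | fromjson | .name' response.json
```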

## Gotchas I hit along the way

For reference, here are the build failures I ran into so you can avoid them:

**No git in base image.** The `vllm/vllm-openai:nightly` image does not ship with git. Any `pip install git+https://...` will fail with exit code 127. Install git first.

**`python` vs `python3`.** The base image only has `python3` on PATH. Using `python` gives you "not found". Always use `python3` explicitly.

**`huggingface-cli` not on PATH.** After upgrading `huggingface-hub` as a dependency of Transformers, the CLI binary can land somewhere outside of `$PATH`. Using the Python API directly (`from huggingface_hub import snapshot_download`) bypasses this entirely.

**Transformers version conflict warning.** vLLM pins `transformers<5` in its dependencies. Installing the v5 dev branch triggers a pip warning. It is safe to ignore — vLLM runs fine and GLM-OCR requires it.

## Wrapping up

Running GLM-OCR on serverless comes down to one key decision: bake everything into the image. The model needs a version of Transformers that no public image ships, and serverless cold starts punish you for anything that needs to be installed or downloaded at runtime. A 17-line Dockerfile solves both problems.

The full Dockerfile is at [github.com/matija2209/ocr-docker](https://github.com/matija2209/ocr-docker). Fork it, point RunPod at it, and you have a serverless OCR endpoint that scales to zero and starts fast.

Let me know in the comments if you have questions, and subscribe for more practical development guides.

Thanks, Matija

## LLM Response Snippet
```json
{
  "goal": "Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly.",
  "responses": [
    {
      "question": "What does the article \"Run GLM-OCR on RunPod Serverless: 17-line Dockerfile\" cover?",
      "answer": "Run GLM-OCR on RunPod Serverless: step-by-step Dockerfile to pre-install Transformers v5 and bake GLM-OCR weights so serverless OCR starts instantly."
    }
  ]
}
```