Run GPT-OSS-120B on RTX PRO 6000: Performance Guide
How to run GPT-OSS-120B on a single RTX PRO 6000. Setup steps, VRAM tips, and install options for local high-performance LLMs.
Overview
This guide explains how to run GPT-OSS-120B on a single NVIDIA RTX PRO 6000 workstation GPU. You will learn hardware needs, install options, VRAM optimizations, and expected performance. The RTX PRO 6000 has 96GB of GDDR7 memory and substantial compute power. See the official specs on the NVIDIA product page.
Why run GPT-OSS-120B locally?
Running GPT-OSS-120B locally provides privacy, lower latency, and avoids cloud fees for heavy use. The model weights are distributed in quantized formats to reduce memory needs. Local deployment enables custom workflows and research without external dependencies.
Model and hardware snapshot
GPT-OSS-120B basics
- 120 billion parameters with a mixture-of-experts (MoE) design.
- 36 layers, 128 experts per layer; typically 4 experts active per token.
- Model weights are available and commonly provided in quantized formats like MXFP4 to lower memory use.
RTX PRO 6000 key specs
- 96GB GDDR7 VRAM.
- 24,064 CUDA cores for heavy inference workloads.
- High VRAM makes it suitable for single-GPU runs of large models.
For performance and memory test references, see independent hardware guides such as the Hardware Corner notes.
Real-world performance
On a single RTX PRO 6000 you can expect around 120 tokens per second for GPT-OSS-120B in many common setups. Benchmarks vary with software stack, attention kernels, and context size.
- A 131k-token context can consume about 83.17 GB of VRAM when fully loaded.
- Enabling Flash Attention can cut peak VRAM at max context from roughly 91 GB to under 67 GB (a rough estimate of the weight footprint behind these numbers follows this list).
- Other GPUs: some setups report higher tokens/sec on newer consumer GPUs; AMD system performance varies.
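To see why this fits on a 96 GB card at all, a rough back-of-envelope estimate of the weight footprint helps. This is only a sketch: it assumes roughly 120B parameters and MXFP4's nominal ~4.25 bits per weight, and it ignores the layers that typically stay in higher precision.
# Rough back-of-envelope estimate of the quantized weight footprint.
# Assumptions: ~120B parameters, MXFP4 at ~4.25 bits per weight; layers kept
# in higher precision and runtime overhead (KV cache, activations) are ignored.
params_total = 120e9
bits_per_weight = 4.25

weight_bytes = params_total * bits_per_weight / 8
print(f"Quantized weights: ~{weight_bytes / 1e9:.0f} GB")  # roughly 64 GB

# KV cache and activations come on top of this and grow with context length,
# which is why a 131k-token context pushes total usage into the 80-90 GB range.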
Installation and run options
Pick a runtime that fits your comfort and OS. Below are common install paths with example commands; adjust for your environment.
1. Ollama (easiest)
Ollama is simple for desktop users. Install Ollama first, then pull and run the model.
ollama pull gpt-oss:120b
ollama run gpt-oss:120b
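After the pull completes you can also call the model from Python. A minimal sketch using the ollama package (pip install ollama), assuming the Ollama service is running locally with default settings and a recent version of the package:
# Minimal sketch: chat with the locally pulled model via the ollama package.
# Assumes `ollama pull gpt-oss:120b` has finished and the Ollama daemon is running.
import ollama

response = ollama.chat(
    model="gpt-oss:120b",
    messages=[{"role": "user", "content": "Summarize what a mixture-of-experts model is."}],
)
print(response.message.content)  # attribute access works on recent ollama package versions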
2. LM Studio
LM Studio works on Windows, macOS, and Linux and provides a GUI for quick testing. For details, follow the LM Studio documentation or the OpenAI Cookbook article on running GPT-OSS with LM Studio.
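If you enable LM Studio's local server, the loaded model can also be queried with any OpenAI-compatible client. A minimal sketch, assuming the server listens on LM Studio's default address (http://localhost:1234/v1) and that the model name matches the identifier shown in the LM Studio UI:
# Minimal sketch: query LM Studio's local OpenAI-compatible server.
# Assumes the server is enabled and on the default port 1234; the model name
# below is an example and must match what LM Studio shows for your download.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Give me one VRAM-saving tip."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)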
3. llama.cpp / llama-server
Use llama.cpp or llama-server for a lightweight C++ path. Example commands:
# Install llama.cpp (example)
brew install llama.cpp # macOS
# On Windows use winget or follow repo docs
llama-server -hf ggml-org/gpt-oss-120b-GGUF -c 0 -fa --jinja --reasoning-format none
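Once llama-server is running, it exposes an HTTP API you can script against. A minimal sketch that posts a chat request to its OpenAI-compatible endpoint, assuming the default port 8080 (adjust host and port if you changed them):
# Minimal sketch: send a chat request to a running llama-server instance.
# Assumes the server started above is listening on the default port 8080.
import requests

payload = {
    "messages": [{"role": "user", "content": "Explain MXFP4 in one sentence."}],
    "max_tokens": 128,
}
r = requests.post("http://localhost:8080/v1/chat/completions", json=payload, timeout=600)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])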
4. Transformers (Hugging Face)
Use Transformers and PyTorch when you want full Python control and integration.
pip install transformers accelerate torch
from transformers import pipeline

model_id = "openai/gpt-oss-120b"
# torch_dtype="auto" keeps the checkpoint's native precision;
# device_map="auto" (which needs accelerate) places the weights on the GPU.
pipe = pipeline("text-generation", model=model_id, torch_dtype="auto", device_map="auto")
outputs = pipe("Your prompt here", max_new_tokens=256)
print(outputs[0]["generated_text"])
5. vLLM
vLLM is optimized for high throughput. Use the pre-release wheel if available for GPT-OSS builds.
pip install --pre vllm==0.10.1+gptoss --extra-index-url https://wheels.vllm.ai/gpt-oss/
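Beyond serving an OpenAI-compatible endpoint, vLLM also has an offline Python API. A minimal sketch, assuming the wheel above is installed and the openai/gpt-oss-120b weights can be downloaded or are already cached locally:
# Minimal sketch of vLLM's offline inference API on a single GPU.
from vllm import LLM, SamplingParams

llm = LLM(model="openai/gpt-oss-120b")
params = SamplingParams(max_tokens=256, temperature=0.7)

outputs = llm.generate(["Explain why MoE models still need a lot of VRAM."], params)
print(outputs[0].outputs[0].text)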
Memory and VRAM optimizations
VRAM is the primary constraint. Use these tactics to reduce footprint and improve runtime stability.
- Use quantized weights: Quantized formats like MXFP4 help fit large models within available VRAM.
- Enable Flash Attention: Flash Attention can significantly lower peak VRAM usage.
- Reduce context when possible: Large context windows rapidly increase memory use—keep context size tuned to the task.
- Use device_map="auto" or CUDA memory offloading: Let the runtime place tensors or offload if supported.
- Batch and token settings: Lower batch size, set max_new_tokens, and stream tokens to avoid large allocations (see the sketch after this list).
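As a concrete illustration of several of these tactics together, here is a minimal Transformers sketch that relies on device_map="auto", caps max_new_tokens, and streams output as it is generated; the prompt and limits are placeholders, not tuned recommendations.
# Minimal sketch: memory-conscious generation with Transformers.
# device_map="auto" (requires accelerate) places weights on the GPU,
# max_new_tokens caps generation length, and TextStreamer prints tokens
# to stdout as they are produced so you can watch progress.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Summarize the following paragraph:", return_tensors="pt").to(model.device)
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, max_new_tokens=256, streamer=streamer)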
Recommended software and drivers
- Use the latest NVIDIA drivers and CUDA toolkit compatible with your frameworks. See the NVIDIA product page for driver references.
- Use recent PyTorch builds with GPU support when using Transformers (a quick environment check follows this list).
- Install Flash Attention per the runtime documentation when supported.
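Before downloading 60+ GB of weights, it is worth confirming that PyTorch can actually see the GPU and the CUDA build. A small check along these lines:
# Quick environment check before loading large weights.
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("CUDA build:", torch.version.cuda)
    print("GPU:", torch.cuda.get_device_name(0))
    free, total = torch.cuda.mem_get_info()
    print(f"Free VRAM: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")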
Troubleshooting
Out of memory errors
- Enable quantization or lower context size.
- Use Flash Attention or memory-efficient kernels.
- Try offloading to host memory if your runtime supports it (a Transformers offload sketch follows).
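With Transformers, offloading can be done by capping per-device memory and letting Accelerate spill the remainder to host RAM or disk. A minimal sketch; the limits below are illustrative, and offloaded layers run far slower than layers kept in VRAM:
# Minimal sketch: cap GPU memory and offload the overflow to CPU RAM or disk.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-120b"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    max_memory={0: "90GiB", "cpu": "64GiB"},  # illustrative per-device caps
    offload_folder="offload",                 # anything beyond the caps goes to disk
)
tokenizer = AutoTokenizer.from_pretrained(model_id)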
Slow generation
- Ensure correct torch_dtype (for example, float16) and device_map to utilize GPU compute.
- Verify CUDA, cuDNN, and driver compatibility with your framework versions.
- Consider vLLM for higher throughput workloads.
Use cases and limitations
Running GPT-OSS-120B locally on an RTX PRO 6000 is practical when memory and compute match the task. Typical use cases include:
- Private document processing with very large context windows.
- Prototyping models and custom prompts without ongoing cloud costs.
- High-throughput local services for privacy-sensitive teams.
Limitations: single GPUs have finite memory—very large contexts or multi-model setups may require multi-GPU or server-class hardware. MoE models may need prompt tuning compared to dense models.
Benchmarks and comparisons
Summary of observed speeds (software- and configuration-dependent):
- RTX PRO 6000: ~120 tokens/sec for GPT-OSS-120B in common configurations.
- RTX 5090: higher tokens/sec reported in some setups.
- AMD systems: performance varies with stack and drivers.
For detailed tests and community reports, consult hardware benchmarking articles such as Hardware Corner.
Checklist: getting started
- Update NVIDIA drivers and CUDA.
- Choose a runtime: Ollama, LM Studio, llama.cpp, Transformers, or vLLM.
- Download GPT-OSS-120B weights from a trusted source (for example, the openai/gpt-oss-120b repository on Hugging Face).
- Enable Flash Attention and quantized weights when available.
- Run a small prompt to confirm setup and measure tokens/sec.
Further reading and resources
- NVIDIA RTX PRO 6000 product page
- Hardware Corner RTX PRO 6000 performance notes
- OpenAI GPT-OSS announcement
- OpenAI Cookbook: run GPT-OSS locally with LM Studio
Final notes
Running GPT-OSS-120B on a single RTX PRO 6000 is practical with the right optimizations. Use quantized weights, Flash Attention, and an appropriate runtime to manage VRAM. Start with small prompts, measure tokens/sec, and tune one setting at a time. If you hit limits, consider multi-GPU or server setups.
Quick next step: Pick one runtime and run a 256-token prompt. Measure tokens/sec and adjust settings incrementally.
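If your runtime exposes an OpenAI-compatible server (LM Studio, llama-server, or vLLM above), a rough throughput check can look like the sketch below. The base_url, port, and model name are placeholders to match to your setup, and the timing includes prompt processing, so treat the number as a comparison between your own runs rather than a benchmark.
# Rough tokens/sec check against a local OpenAI-compatible server (illustrative).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")  # adjust to your runtime
start = time.time()
resp = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[{"role": "user", "content": "Write a 200-word summary of mixture-of-experts models."}],
    max_tokens=256,
)
elapsed = time.time() - start

generated = resp.usage.completion_tokens
print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/sec")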
