RTX 5080 and RTX 3090 Setup: 80 Tok/s on Qwen 3.6 27B Q8

TL;DR

A user configured an RTX 5080 and an RTX 3090 together to run Qwen 3.6 27B Q8 at over 80 tokens/sec, showing what a mixed dual-GPU setup can do for local AI inference.

A user has demonstrated that combining an RTX 5080 with an RTX 3090 enables running Qwen 3.6 27B Q8 at over 80 tokens per second, marking a notable performance milestone for local AI inference setups.

The user built the dual-GPU setup on an Asus Prime X570-Pro motherboard, enabling the PCIe 4.0 split and adjusting BIOS settings such as Above 4G Decoding and ReSize BAR Support. The setup also involved installing custom drivers compatible with both GPU models and setting specific kernel parameters so that both cards are recognized and used.
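
After BIOS and driver changes like these, a quick sanity check is to confirm that the driver actually lists both cards. The following is a minimal sketch in Python, not the user's own procedure: it parses the CSV output of `nvidia-smi --query-gpu=index,name,memory.total --format=csv,noheader,nounits`. The helper names and the sample output (including the memory figures) are illustrative assumptions.

```python
import csv
import io
import subprocess


def parse_gpu_csv(text):
    """Parse CSV rows of the form 'index, name, memory.total (MiB)'."""
    gpus = []
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        index, name, mem_mib = (field.strip() for field in row)
        gpus.append({"index": int(index), "name": name, "memory_mib": int(mem_mib)})
    return gpus


def query_gpus():
    """Ask the installed NVIDIA driver which GPUs it can see (needs nvidia-smi)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=index,name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Hypothetical sample output, used here so the sketch runs without a GPU.
SAMPLE = "0, NVIDIA GeForce RTX 5080, 16303\n1, NVIDIA GeForce RTX 3090, 24576\n"

gpus = parse_gpu_csv(SAMPLE)
total_gib = sum(g["memory_mib"] for g in gpus) / 1024
for g in gpus:
    print(f'GPU {g["index"]}: {g["name"]} ({g["memory_mib"]} MiB)')
print(f"Combined VRAM: {total_gib:.1f} GiB")
```

On a real machine, calling `query_gpus()` in place of `parse_gpu_csv(SAMPLE)` would show whether both cards are visible to the driver before any inference software is involved.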

With this configuration, the user achieved an inference speed of over 80 tokens/sec on the Qwen 3.6 27B model, using llama.cpp built with flags that support both the Ampere (RTX 3090) and Blackwell (RTX 5080) architectures. The model was quantized to Q8 and fit within 39GB of VRAM across the two cards (16GB on the RTX 5080 plus 24GB on the RTX 3090), with the system tuned for multi-GPU use so that nearly all available VRAM on both cards is put to work.
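
The article does not list the exact build flags or launch options, so the following Python sketch only illustrates what such a configuration could look like. It assumes llama.cpp's CMake option `GGML_CUDA`, the standard `CMAKE_CUDA_ARCHITECTURES` variable with compute capability 8.6 for the RTX 3090 and 12.0 for the RTX 5080, and the `-m`, `-ngl` and `--tensor-split` options of `llama-server`. The helper names, the model filename and the VRAM-proportional split are hypothetical.

```python
# Compute capabilities: RTX 3090 (Ampere) is 8.6, RTX 5080 (Blackwell) is 12.0.
CUDA_ARCHS = {"RTX 3090": "86", "RTX 5080": "120"}


def cmake_configure_args(gpu_names):
    """Build a CMake configure command that targets every listed GPU."""
    archs = sorted({CUDA_ARCHS[name] for name in gpu_names}, key=int)
    return [
        "cmake", "-B", "build",
        "-DGGML_CUDA=ON",
        "-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(archs),
    ]


def tensor_split(vram_gib):
    """Split the model across GPUs in proportion to each card's VRAM."""
    total = sum(vram_gib)
    return [round(v / total, 2) for v in vram_gib]


def server_args(model_path, split):
    """Assemble a llama-server command line that offloads all layers."""
    return [
        "llama-server", "-m", model_path,
        "-ngl", "99",  # offload as many layers as the model has
        "--tensor-split", ",".join(str(s) for s in split),
    ]


# Assumed device order: RTX 5080 (16 GB) first, RTX 3090 (24 GB) second.
split = tensor_split([16, 24])
print(" ".join(cmake_configure_args(["RTX 5080", "RTX 3090"])))
print(" ".join(server_args("qwen-27b-q8_0.gguf", split)))  # hypothetical filename
```

A VRAM-proportional split is only a starting point. The split values must follow the order in which CUDA enumerates the devices, and in practice the ratio is usually adjusted by hand to leave room for the KV cache and anything else using the cards.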

Implications for Local AI Performance Enhancement

This development highlights the potential for high-performance local AI inference using consumer-grade GPUs, reducing reliance on cloud services. The demonstrated speed of over 80 tokens/sec with a dual-GPU setup can accelerate AI experimentation and deployment for developers and researchers, especially those working with large language models.

It also underscores the importance of proper hardware configuration and driver management when combining different GPU architectures, which can significantly impact performance and stability in AI workloads.

Recent Advances in Multi-GPU AI Setups

Over the past year, AI enthusiasts have increasingly experimented with combining consumer GPUs for local inference, often facing challenges related to driver compatibility and BIOS configuration. The release of high-end GPUs like the RTX 5080 has spurred interest in multi-GPU setups aimed at boosting inference speeds.

This user's experience builds on earlier efforts with single GPUs, showing that with careful configuration, combining an RTX 5080 and an RTX 3090 can yield significant performance gains, exceeding 80 tokens/sec on a large model like Qwen 3.6 27B.

“Achieving over 80 tokens/sec on Qwen 3.6 27B Q8 with this dual-GPU configuration is a major step forward for local AI inference.”

— the user who shared the setup

Unresolved Questions About Compatibility and Stability

It remains unclear how stable the setup is over extended periods or under different workloads. Compatibility issues may arise with other GPU models or driver versions, and the long-term reliability of such configurations has not been tested extensively.

Details about performance consistency across different models or in multi-user environments are still emerging, and further testing is needed to confirm scalability and stability.

Next Steps for Multi-GPU AI Optimization

Further experiments are expected to explore optimizing power management, driver updates, and BIOS settings for even higher speeds or stability. Developers may also attempt to replicate this setup with identical GPUs or newer models to assess performance gains and compatibility.

Community efforts could lead to standardized configurations or driver patches that simplify multi-GPU AI deployment, expanding accessibility for hobbyists and professionals alike.

Key Questions

Can this setup be used for commercial AI applications?

While promising, this setup is primarily experimental. Stability and long-term reliability need further testing before it can be recommended for commercial use.

What are the main challenges in configuring dual GPUs like this?

Driver compatibility, BIOS configuration, and PCIe lane management are the key challenges. Mixing GPU models from different generations can complicate driver setup and affect stability.

Is this performance achievable with other GPU combinations?

Performance depends on specific hardware and configuration. Similar results may require careful tuning, and identical GPU models often simplify setup.

How does quantization at Q8 impact inference speed?

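Quantizing to Q8 stores each weight in roughly 8 bits instead of 16, which about halves the model's memory footprint. That lets the 27B model fit in the two cards' combined VRAM and reduces the amount of data read per generated token, which contributes to the 80+ tokens/sec reported here.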
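
The memory side of this can be checked with simple arithmetic. The sketch below is a rough estimate under stated assumptions, not a measurement from the user's setup: it assumes exactly 27 billion parameters, all stored in llama.cpp's Q8_0 format, where each block of 32 weights takes 34 bytes (32 one-byte values plus a 2-byte scale), or 8.5 bits per weight. Real model files differ somewhat, and the KV cache needs additional VRAM on top of the weights.

```python
Q8_0_BLOCK_WEIGHTS = 32  # weights per Q8_0 block
Q8_0_BLOCK_BYTES = 34    # 32 int8 values + one 2-byte fp16 scale


def q8_0_bits_per_weight():
    """Effective storage cost of one weight in Q8_0, in bits."""
    return Q8_0_BLOCK_BYTES * 8 / Q8_0_BLOCK_WEIGHTS


def weights_size_gb(n_params, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9  # assumed round figure for a "27B" model

q8_gb = weights_size_gb(N_PARAMS, q8_0_bits_per_weight())
fp16_gb = weights_size_gb(N_PARAMS, 16)

print(f"Q8_0: {q8_0_bits_per_weight()} bits/weight, about {q8_gb:.1f} GB")
print(f"FP16: 16 bits/weight, about {fp16_gb:.1f} GB")
```

Under these assumptions the Q8 weights come to roughly 28.7 GB, against about 54 GB at 16-bit precision. Only the former fits in the 40 GB offered by a 16 GB and a 24 GB card together.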

Source: Hacker News

Author: GadgetFee
