TL;DR

A user got an RTX 5080 and an RTX 3090 working together to run Qwen 3.6 27B Q8 at over 80 tokens/sec, a strong result for a mixed-generation, dual-GPU local inference setup.

By pairing an RTX 5080 with an RTX 3090, a user ran Qwen 3.6 27B Q8 at more than 80 tokens per second, a notable performance milestone for local AI inference.

The user built the dual-GPU system on an Asus Prime X570-Pro motherboard, splitting the PCIe 4.0 lanes between the two cards and enabling the Above 4G Decoding and Re-Size BAR Support options in the BIOS. On the software side, they installed custom drivers that support both GPU models and set specific kernel parameters so that both cards are recognized and used.
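
A quick way to confirm that the driver sees both cards is to query nvidia-smi. The following is a minimal sketch, not the user's own procedure: the function names are hypothetical, and it assumes nvidia-smi is on the PATH and prints one CSV row per GPU.

```python
import shutil
import subprocess

# nvidia-smi query that prints one "index, name, memory" CSV row per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,name,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_list(csv_text):
    """Parse nvidia-smi CSV rows into (index, name, memory_mib) tuples."""
    gpus = []
    for line in csv_text.strip().splitlines():
        index, name, memory_mib = [field.strip() for field in line.split(",")]
        gpus.append((int(index), name, int(memory_mib)))
    return gpus


def detected_gpus():
    """Return the GPUs the driver reports, or an empty list if nvidia-smi is missing."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_list(result.stdout)
```

Calling detected_gpus() on a working system should return two entries, one per card. Fewer than two suggests a BIOS, kernel parameter, or driver problem rather than a llama.cpp issue.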

With this configuration, the user reported more than 80 tokens/sec on Qwen 3.6 27B using llama.cpp, built with flags that target both the Ampere (RTX 3090) and Blackwell (RTX 5080) architectures. The Q8-quantized model fit within 39GB of VRAM, close to the 40GB the two cards provide together (16GB plus 24GB), with the load spread across both GPUs so that almost no memory was left unused.
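
The source does not list the exact build flags or launch options. As a rough sketch only, the helper below composes a llama.cpp CMake configure command for both architectures and a VRAM-proportional tensor split. It assumes compute capability 8.6 for the RTX 3090 and 12.0 for the RTX 5080, and that the 5080 is CUDA device 0. The function names and the model filename are hypothetical.

```python
# Assumed CUDA compute capabilities: 8.6 (RTX 3090, Ampere), 12.0 (RTX 5080, Blackwell).
CUDA_ARCHS = ("86", "120")


def cmake_configure_args(build_dir="build"):
    """Compose a CMake configure command that builds CUDA kernels for both GPUs."""
    return [
        "cmake", "-B", build_dir,
        "-DGGML_CUDA=ON",
        "-DCMAKE_CUDA_ARCHITECTURES=" + ";".join(CUDA_ARCHS),
    ]


def tensor_split(vram_gb):
    """Return per-GPU proportions matching each card's share of total VRAM."""
    total = sum(vram_gb)
    return ",".join(f"{size / total:.2f}" for size in vram_gb)


def server_args(model_path, vram_gb):
    """Compose a llama-server command that offloads all layers across the GPUs."""
    return [
        "llama-server",
        "-m", model_path,          # hypothetical path to the GGUF file
        "-ngl", "99",              # offload every layer to the GPUs
        "--tensor-split", tensor_split(vram_gb),
    ]


if __name__ == "__main__":
    print(" ".join(cmake_configure_args()))
    # Order follows CUDA device order; 16 GB (RTX 5080) first is an assumption.
    print(" ".join(server_args("model-q8_0.gguf", [16, 24])))
```

A proportional split is only a starting point. The KV cache and compute buffers also consume VRAM, so the ratio usually needs manual adjustment before the model fits cleanly on both cards.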

Implications for Local AI Performance Enhancement

This development highlights the potential for high-performance local AI inference using consumer-grade GPUs, reducing reliance on cloud services. The demonstrated speed of over 80 tokens/sec with a dual-GPU setup can accelerate AI experimentation and deployment for developers and researchers, especially those working with large language models.

It also underscores the importance of proper hardware configuration and driver management when combining different GPU architectures, which can significantly impact performance and stability in AI workloads.

As an affiliate, we earn on qualifying purchases.

Recent Advances in Multi-GPU AI Setups

Over the past year, AI enthusiasts have increasingly experimented with combining consumer GPUs for local inference, often facing challenges related to driver compatibility and BIOS configuration. The release of high-end GPUs like the RTX 5080 has spurred interest in multi-GPU setups aimed at boosting inference speeds.

This user's experience builds on earlier single-GPU efforts, showing that with careful configuration, an RTX 5080 paired with an RTX 3090 can deliver significant gains, exceeding 80 tokens/sec on a demanding model like Qwen 3.6 27B.

“Achieving over 80 tokens/sec on Qwen 3.6 27B Q8 with this dual-GPU configuration is a major step forward for local AI inference.”

— the user who shared the setup

Unresolved Questions About Compatibility and Stability

It remains unclear how stable the setup is over extended periods or under different workloads. Compatibility issues may arise with other GPU models or driver versions, and the long-term reliability of such configurations has not been tested extensively.

Details about performance consistency across different models or in multi-user environments are still emerging, and further testing is needed to confirm scalability and stability.

Next Steps for Multi-GPU AI Optimization

Further experiments are expected to explore optimizing power management, driver updates, and BIOS settings for even higher speeds or stability. Developers may also attempt to replicate this setup with identical GPUs or newer models to assess performance gains and compatibility.

Community efforts could lead to standardized configurations or driver patches that simplify multi-GPU AI deployment, expanding accessibility for hobbyists and professionals alike.

Key Questions

Can this setup be used for commercial AI applications?

While promising, this setup is primarily experimental. Stability and long-term reliability need further testing before it can be recommended for commercial use.

What are the main challenges in configuring dual GPUs like this?

Driver compatibility, BIOS configuration, and PCIe lane management are the key challenges. Mixing GPU models from different generations can complicate driver setup and affect stability.

Is this performance achievable with other GPU combinations?

Performance depends on specific hardware and configuration. Similar results may require careful tuning, and identical GPU models often simplify setup.

How does quantization at Q8 impact inference speed?

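Quantization shrinks the model. At Q8, each weight takes roughly 8 bits instead of 16, so the weights occupy about half the memory of a 16-bit version. Token generation is largely limited by how fast the weights can be read from VRAM, so fewer bytes per weight supports faster inference such as the 80+ tokens/sec reported here, and the smaller footprint lets the whole model stay in GPU memory.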
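
The size reduction can be estimated with simple arithmetic. This sketch assumes that "Q8" refers to llama.cpp's Q8_0 format (blocks of 32 int8 weights plus one 16-bit scale) and that the model has exactly 27 billion parameters. Both are assumptions, and real files differ slightly because not every tensor uses the same format.

```python
def q8_0_bits_per_weight(block_size=32, scale_bytes=2):
    """Bits per weight for a block of int8 values sharing one fp16 scale."""
    block_bytes = block_size * 1 + scale_bytes  # 32 int8 weights + 2-byte scale
    return block_bytes * 8 / block_size


def weight_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    n_params = 27e9  # assumed parameter count for a "27B" model
    print(f"Q8_0: {weight_size_gb(n_params, q8_0_bits_per_weight()):.1f} GB")
    print(f"FP16: {weight_size_gb(n_params, 16):.1f} GB")
```

Under these assumptions the weights alone come to a little under 29GB, which leaves the remainder of the reported 39GB for the KV cache and compute buffers.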

Source: Hacker News

