
Local models, major hurdles: How affordable is AI on your own hardware?

Written by Sebastian Macke | May 08, 2026

On March 16, the GPU Technology Conference, or GTC for short, opened its doors once again. As always, the real highlight was the keynote by Jensen Huang, CEO of NVIDIA. With around 35 million views on YouTube, these presentations are now among the most-watched tech keynotes ever; only some Apple events with major product announcements reach a similar scale.

This year, much of the focus was on Vera Rubin, NVIDIA's latest supercomputer. The system consists of 40 racks, i.e. a wall of cabinets around 30 meters long, which is to be sold as a complete solution. What is new is that special hardware for inference is now also being integrated: so-called LPUs, or Language Processing Units, from the acquired company Groq. Their main purpose is to speed up the execution of language models.

Vera Rubin supercomputer: very few people will ever see a system like this. But we will all use Vera Rubin at some point - every time we ask the chatbot a question.

As is so often the case with NVIDIA, however, the benchmarks shown in this keynote do not always compare like with like. The actual performance gain is therefore difficult to estimate. A tenfold acceleration of inference, i.e. the actual chat, seems quite plausible.

The system is slated to be available in the second quarter of this year. However, it will probably be some time before such machines are actually in productive use. They will likely be used internally for training at first, and later offered externally as particularly expensive fast modes.

NVIDIA did not mention any specific prices. However, it is doubtful that such a system will be available for less than 100 million dollars. And this does not even take into account the costs for the necessary data center, water cooling and its own power supply.

For the vast majority of visitors to the conference, this hardware was in a completely different league anyway. Anyone who can afford such a system is hardly part of the typical GTC target group. The two-hour keynote therefore offered little for normal users.

Instead, another question arises that I am now encountering more and more frequently: Can't language models also be run locally?

There are many reasons for this. Some are simply attracted by technical curiosity, others want more control, stability, adaptability or offline capability. Then there are data protection, confidentiality and, last but not least, the desire for more independence from American providers.

The hope of lower costs also plays a role. Many suspect that local models could end up being cheaper than the offerings of hyperscalers.

Without RAM, things get tight

The answer, however, is sobering. If you want to run local models, you very quickly end up at the question of suitable hardware - and that hardware is, to put it mildly, expensive.

Basically, with sufficient RAM, practically any model can be run on a CPU. However, without hardware acceleration, especially for matrix multiplications, this is often so slow that it is hardly practical.

The question remains: What does sufficient RAM actually mean? The smallest reasonable models for free download today are around 20-30 billion parameters, such as the Gemma 4 model. At the other end of the spectrum, for example, is the newly released coding model GLM-5.1 with 754 billion parameters. Assuming 4-bit quantization, the required memory can be roughly estimated using a simple rule of thumb:

Working memory in gigabytes = number of parameters in billions × 0.7

For GLM-5.1, this means that even a single user would need around half a terabyte of RAM. At today's market prices, that alone costs around 7000 euros - and that is for normal, slow RAM such as DDR5. GLM-5.1 would run here at a snail's pace of 1-3 tokens per second.
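
If you want to plug in your own numbers, the rule of thumb is easy to script. Here is a minimal sketch in Python; the 0.7 factor is the same rough 4-bit estimate as above, and the ~30B value for a Gemma-class model is simply an illustrative figure.

    # Rule of thumb from above: RAM in GB ≈ parameters in billions × 0.7
    # (assumes 4-bit quantization plus some overhead; a rough estimate, not a measurement)
    def estimated_ram_gb(params_billions: float) -> float:
        return params_billions * 0.7

    for name, params in [("Gemma-class model (~30B)", 30), ("GLM-5.1 (754B)", 754)]:
        print(f"{name}: ~{estimated_ram_gb(params):.0f} GB")
    # GLM-5.1 lands at roughly 530 GB - about half a terabyte, as stated above.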

The significantly faster memory typically used in such systems costs five to ten times as much. There is a good reason why these models are hardly ever found in local use.

And memory capacity and speed are not everything. If you want not only to run language models but perhaps also to train or fine-tune your own, you need a software ecosystem that is as broad and reliable as possible. This is precisely where NVIDIA is currently the obvious choice: it is the best way to ensure that the countless libraries and tools created in recent years actually work. CUDA and its ecosystem continue to clearly dominate this area. Things can work well with other providers, but you often have to reckon with limitations.

NVIDIA

If you look at the NVIDIA line-up beyond Vera Rubin, it doesn't really become any more suitable for everyday use. A GB300 rack costs around 3 million euros and is therefore clearly aimed at hyperscalers - i.e. large AI data centers or cloud providers.

The first systems that are suitable for research centers or larger companies for which technological sovereignty plays an important role are the DGX B200 and DGX B300. These systems cost around 300,000 to 500,000 dollars each and are likewise intended only for use in data centers; appropriate cooling and a 17 kW power connection are required. With 2 TB of GPU RAM, this is also likely the last expansion stage that is theoretically still capable of running even the largest language models currently available - including the large Chinese models, but possibly also systems from Google, OpenAI or Anthropic, if one had access to them. These machines are also fast enough to serve several users in parallel.

Not entirely serious. But the DGX B200 is the smallest system in the data center league that you could buy for roughly 300,000 dollars.

The first system that is actually aimed at individual users is the DGX Station. It comes in the form of a larger desktop computer, costs around 95,000 dollars and requires only a normal power connection. Even so, it's best not to run the machine under your desk, as its waste heat is probably on par with that of a tumble dryer. With just 252 GB of GPU RAM, you have to make significant compromises. Speed, however, is not an issue: the memory achieves a bandwidth of 7.1 TB per second, well over ten times faster than the RAM of a typical PC.
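
A rough back-of-envelope calculation (my own, not a figure from NVIDIA) shows why this bandwidth matters so much: when generating text for a single user, essentially all of the model's weights have to be read from memory for every token, so memory bandwidth puts a hard ceiling on tokens per second. The sketch below assumes a dense 4-bit model and a typical PC bandwidth of around 100 GB per second - both assumptions, not measured values.

    # Upper-bound estimate: tokens per second ≈ memory bandwidth / model size,
    # assuming a dense model whose weights are read once per generated token.
    def max_tokens_per_second(model_size_gb: float, bandwidth_gb_s: float) -> float:
        return bandwidth_gb_s / model_size_gb

    model_gb = 30 * 0.7  # ~30B parameters at 4-bit, per the rule of thumb above

    print(max_tokens_per_second(model_gb, 100))    # typical PC DDR5 (~100 GB/s): roughly 5 tokens/s
    print(max_tokens_per_second(model_gb, 7100))   # DGX Station at 7.1 TB/s: several hundred tokens/s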

Below this, the consumer market for graphics cards slowly begins. Of course, there is initially no way around the classic gaming RTX cards, which are currently available with up to 32 GB of memory. That is actually too little for many models. In practice, many users therefore make do by also using the computer's normal RAM and loading only the parts of the model that are currently needed into GPU memory. This works in principle, but it costs a lot of performance. Still, running slowly is often better than not running at all.
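
This kind of partial offload is usually just a runtime setting. As one illustration - my choice of tool, not one mentioned above - the llama.cpp Python bindings let you pick how many layers are kept in VRAM, with the rest staying in system RAM; the model file and layer count below are placeholders.

    # Sketch using llama-cpp-python (one of several runtimes that support partial GPU offload).
    # Model path and layer count are placeholders, not recommendations.
    from llama_cpp import Llama

    llm = Llama(
        model_path="./model-q4.gguf",  # a 4-bit quantized model file
        n_gpu_layers=24,               # layers held in the card's 32 GB of VRAM; the rest stays in system RAM
        n_ctx=4096,                    # context window
    )

    out = llm("Why does partial GPU offloading cost performance?", max_tokens=64)
    print(out["choices"][0]["text"])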

The RTX 6000, which is equipped with 96 GB of RAM, also occupies a special position here. This series was originally designed more for professional graphics and design applications, but it is just as suitable for AI workloads. However, the price is also correspondingly high: you have to budget for around 10,000 euros.

The RTX 6000 with 96 GB of fast RAM.

NVIDIA presented an interesting experiment a few months ago with the DGX Spark, or in a slightly modified form with Jetson Thor. This is a mini PC with an integrated GPU whose performance is more in the range of inexpensive graphics cards, but which is equipped with 128 GB of RAM. The system is not particularly fast and runs under Linux, but offers the option of running medium-sized language models locally.

The DGX Spark mini PC is available from many OEMs. Perhaps the best compromise.

It is precisely these small systems, starting at around 3200 euros, that could prove to be the sweet spot between price, memory and speed for many users. They are more affordable than professional workstations, far more practical for everyday use than large data center solutions, and powerful enough to seriously experiment with models locally. I myself use a DGX Spark.

Apple and AMD

After around three years of rapid development, there is now far more choice than just NVIDIA. Apple and AMD, for example, also offer systems with AI acceleration, albeit with very different focuses.

Apple plays virtually no role in the server market, but began integrating AI accelerators into its own chip line for the consumer sector at an early stage. The M chips (M1-M5) contain three such accelerators: in the CPU, the GPU and the NPU.

Which Apple device you actually buy is of secondary importance; the decisive factor, here too, is the amount of memory. A Mac Studio with 256 GB of RAM currently costs around 7300 euros. The version with 512 GB of RAM was unfortunately discontinued just a few weeks ago. In terms of speed, such a system is likely to be in a similar range to a DGX Spark, as it also uses LPDDR5 memory, i.e. memory with a relatively low bandwidth.

AMD also wants to play in the data center market and, with the Instinct series, offers its own accelerators, which are now available with up to 256 GB of RAM. There are also interesting options in the consumer and workstation segment. The most powerful card here is currently the Radeon PRO W7900 with 48 GB of memory, priced at around 4000 euros.

Systems with CPUs from the Ryzen AI Max series are now also particularly interesting in terms of price. Although they are not really cheap, they are often even cheaper than specialized AI workstations. In terms of speed, they are also likely to be in a similar league to the DGX Spark, as many systems here are also equipped with 128 GB LPDDR5 RAM.

Local models are catching up

No matter what you decide, one thing should be clear: none of the models that run on reasonably affordable hardware today can match the quality of the systems from the major providers.

They are good for spelling corrections and can solve small logic tasks. When it comes to agentic work, however, they are more at school level. As coding agents, they can cope with the exercises of the first few semesters of a computer science course, but not much more.

Companies that can train such models are of course pursuing economic interests. As soon as a model is really competitive, the probability that it will be published for local execution decreases. At the same time, however, the threshold of what is considered competitive is moving ever higher.

As a result, local models have become much more viable than they were a year ago. And this is perhaps the best news: time is currently on the buyer's side. It is therefore quite possible that specialized small models will soon be on a par with the large models in their respective niches.