Nvidia Nemotron 3 Family of Models

wcallahan · 2025-12-15T23:48:52 1765842532

I don’t do ‘evals’, but I do process billions of tokens every month, and I’ve found these small Nvidia models to be the best by far for their size currently.

As someone else mentioned, the GPT-OSS models are also quite good (though I haven’t found how to make them great yet, though I think they might age well like the Llama 3 models did and get better with time!).

But for a defined task, I’ve found task compliance, understanding, and tool call success rates to be some of the highest on these Nvidia models.

For example, I have a continuous job that evaluates if the data for a startup company on aVenture.vc could have overlapping/conflated two similar but unrelated companies for news articles, research details, investment rounds, etc… which is a token hungry ETL task! And I recently retested this workflow on the top 15 or so models today with <125b parameters, and the Nvidia models were among the best performing for this type of work, particularly around non-hallucination if given adequate grounding.

Also, re: cost - I run local inference on several machines that run continuously, in addition to routing through OpenRouter and the frontier providers, and was pleasantly surprised to find that if I’m a paying customer of OpenRouter otherwise, the free variant there from Nvidia is quite generous for limits, too.

kgeist · 2025-12-17T00:19:59 1765930799

>the GPT-OSS models are also quite good

I recently pitted gpt-oss 120b against Qwen3-Next 80b on a lot of internal benchmarks (for production use), and for me, gpt-oss was slightly slower (vLLM, both fit in VRAM), much worse at multilingual tasks (33 languages evaluated), and had worse instruction following (e.g., Qwen3-Next was able to reuse the same prompts I used for Gemma3 perfectly, while gpt-oss struggled and RAG benchmarks suddenly went from 90% to 60% without additional prompt engineering).

And that's with Qwen3-Next being a random unofficial 4-bit quant (compared to gpt-oss having native support) + I had to disable multi-token prediction in Qwen3-Next because vLLM crashed with it.

Has someone here tried both gpt-oss 120b and Qwen3-Next 80b? Maybe I was doing something wrong because I've seen a lot of people praise gpt-oss.

andy99 · 2025-12-16T23:44:24 1765928664

What do you mean about not doing evals? Just literally that you don’t run any benchmarks or do you have something against them?

woodson · 2025-12-17T00:46:26 1765932386

Not OP, but perhaps they mean not putting too much faith in common benchmarks (thanks to benchmaxxing).

btown · 2025-12-16T22:02:34 1765922554

Would you mind sharing what hardware/card(s) you're using? And is https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B... one of the ones you've tested?

red2awn · 2025-12-15T22:23:06 1765837386

Very interesting release:

* Hybrid MoE: 2-3x faster than pure MoE transformers

* 1M context length

* Trained on NVFP4

* Open Source! Pretraining, mid-training, SFT and RL dataset released (SFT HF link is 404...)

* Open model training recipe (coming soon)

Really appreciate Nvidia being the most open lab but they really should make sure all the links/data are available on day 0.

Also interesting that the model is trained in NVFP4 but the inference weights are FP8.

bcatanzaro · 2025-12-16T15:06:19 1765897579

The Nano model isn’t pretrained in FP4, only Super and Ultra are. And posttraining is not in FP4, so the posttrained weights of these models are not native FP4.

sosodev · 2025-12-16T22:30:17 1765924217

I love how detailed and transparent the data set statistics are on the huggingface pages. https://huggingface.co/nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B...

I've noticed that open models have made huge efficiency gains in the past several months. Some amount of that is explainable as architectural improvements but it seems quite obvious that a huge portion of the gains come from the heavy use of synthetic training data.

In this case roughly 33% of the training tokens are synthetically generated by a mix of other open weight models. I wonder if this trend is sustainable or if it might lead to model collapse as some have predicted. I suspect that the proliferation of synthetic data throughout open weight models has lead to a lot of the ChatGPT writing style replication (many bullet points, em dashes, it's not X but actually Y, etc).

pants2 · 2025-12-15T17:56:47 1765821407

If it's intelligence + speed you want, nothing comes close to GPT-OSS-120B on Cerebras or Groq.

However, this looks like it has great potential for cost-effectiveness. As of today it's free to use over API on OpenRouter, so a bit unclear what it'll cost when it's not free, but free is free!

https://openrouter.ai/nvidia/nemotron-3-nano-30b-a3b:free

viraptor · 2025-12-15T19:05:53 1765825553

> nothing comes close to GPT-OSS-120B on Cerebras

That's temporary. Cerebras speeds up everything, so if Nemotron is good quality, it's just a matter of time until they add it.

credit_guy · 2025-12-16T00:23:23 1765844603

That's unlikely. Cerebras doesn't speed up everything. Can it speed up everything? I don't know, I'm not an insider. But does it speed up everything? That is evidently not the case. Their page [1] lists only 4 production models and 2 preview models.

[1] https://inference-docs.cerebras.ai/models/overview

agentastic · 2025-12-17T00:49:35 1765932575

They need to compile the model for their chips. Standard transformers are easier, so GPT-OSS, Qwen, GLM, etc if there is demand, they will deploy it.

Nemotron on the other hand is a hybrid (Transformer + Mamba-2) so it will be more challenging to compile it on Cerebras/Groq chips.

(Me thinks Nvidia is purposefully picking architecture+FP4 that is easy to ship on Nvidia chips, but harder for TPU or Cerebras/Groq to deploy)

max002 · 2025-12-16T08:55:37 1765875337

Im upvoting, im happy to finally see open source model with commercial use from Nvidia as most of the models ive been checking from you guys couldnt be used in commercial settings. Bravo Nvidia!

kristopolous · 2025-12-16T22:40:36 1765924836

I was just using the embeddings model last night. Boy is it slow. Nice results but this 5090 isn't cutting it.

I'm guessing there's some sophistication in the instrumentation I'm just not up to date with.

kristianp · 2025-12-16T21:19:51 1765919991

The article seem to focus on the nano model. Where are the details of the larger ones?

shikon7 · 2025-12-16T21:22:30 1765920150

> We are releasing the Nemotron 3 Nano model and technical report. Super and Ultra releases will follow in the coming months.

jtbayly · 2025-12-16T21:33:02 1765920782

Any chance of running this nano model on my Mac?

mark_l_watson · 2025-12-16T23:30:07 1765927807

I used Nemotron 3 nana on LM Studio yesterday on my 32G M2-Pro mac mini. It is fast and passed all of my personal tool use tests, and did a good job analyzing code. Love it.

Today I ran a few simple cases on Ollama, but not much real testing.

axoltl · 2025-12-16T22:49:43 1765925383

There's MLX versions of the model, so yes. LM Studio hasn't updated their mlx-lm runtime yet though, you'll get an exception.

But if you're OK running it without a UI wrapper, mlx_lm==0.30.0 will serve you fine.

netghost · 2025-12-16T22:34:15 1765924455

Kind of depends on your mac, but if it's a relatively recent apple silicon model… maybe, probably?

> Nemotron 3 Nano is a 3.2B active (3.6B with embeddings) 31.6B total parameter model.

So I don't know the exact math once you have a MoE, but 3.2b will run on most anything, 31.6b and you're looking at needing a pretty large amount of ram.

vessenes · 2025-12-16T23:13:05 1765926785

Given Mac bandwidth, you'll generally want to load the whole thing in RAM. You get speed benefits based on smaller-size active experts, since the Mac compute is slow compared to Nvidia hardware. This should be relatively snappy on a Mac, if you can load the entire thing.

jonrosner · 2025-12-17T00:26:11 1765931171

running it on my M4 @ 90tps, takes 18GB of RAM.

sosodev · 2025-12-16T22:41:17 1765924877

The claim that a small, fast, and decently accurate model makes a good foundation for agentic workloads seems like a reasonable claim.

However, is cost the biggest limiting factor for agent adoption at this point? I would suspect that the much harder part is just creating an agent that yields meaningful results.

all2 · 2025-12-16T23:37:03 1765928223

This has been my major concern, so much do that I'm going to be launching a tool to handle this specific task: agent conception and testing. There is so little visibility in the tools I've used that debug is just a game of whackamole.

Y_Y · 2025-12-15T17:39:31 1765820371

Wow, Nvidia keepson pushing the frontier of misleading benchmarks