tehnub
Interesting exchange on the use of AI coding tools:
curious how much of the code did you write by hand?
Karpathy: Good question, it's basically entirely hand-written (with tab autocomplete). I tried to use claude/codex agents a few times but they just didn't work well enough at all and net unhelpful, possibly the repo is too far off the data distribution.
https://x.com/karpathy/status/1977758204139331904
gyomu
> the repo is too far off the data distribution
ah, this explains why these models have been useless to me this whole time. everything i do is just too far off the data distribution!
SchemaLoad
Everything is, unless your app is a React todo list or leetcode questions.
notatoad
people say this like it's a criticism, but damn is it ever nice to start writing a simple crud form and just have copilot autocomplete the whole thing for me.
meowface
HN's cynicism towards AI coding (and everything else ever) is exhausting. Karpathy would probably cringe reading this.
SeanAnderson
or a typical CRUD app architecture, or a common design pattern, or unit/integration test scaffolding, or standard CI/CD pipeline definitions, or one-off utility scripts, etc...
Like 80% of writing code is just being a glorified autocomplete, and AI is exceptional at automating those aspects. Yes, there is a lot more to being a developer than writing code, but, in those instances, AI really does make a difference in the amount of time one is able to spend focusing on domain-specific deliverables.
KeplerBoy
I don't know. I successfully use it for small changes on VHDL FPGA designs these days.
allochthon
I've had some success with a multi-threaded software defined radio (SDR) app in Rust that does signal processing. It's been useful for trying something out that's beyond my experience. Which isn't to say it's been easy. It's been a learning experience to figure out how to work around Claude's limitations.
lukev
Generative AI for coding isn't your new junior programmer, it's the next generation of app framework.
SalmoShalazar
Really such an annoying genre of comment. Yes I’m sure your groundbreaking bespoke code cannot be written by LLMs, however for the rest of us that build and maintain 99% of the software people actually use, they are quite useful.
dahcryn
Simple CRUD, which is common in many, many business applications and backend portals, is a good fit for AI assistance imho. Same for fixing some designs here and there, where you can't be bothered to keep track of the latest JS/CSS framework.
teleforce
I wonder if the new GenAI architecture, namely DDN or discrete distribution networks, being discussed recently can outperform the conventional architectures of GAN and VAE. As the name suggests, it can provide a multitude of distributions for training and inference purposes [1].
[1] Show HN: I invented a new generative model and got accepted to ICLR (90 comments):
CapsAdmin
I work on a typed Lua language implemented in Lua, and sometimes use LLMs to help fix internal analyzer stuff. That works maybe 30% of the time for complex issues, sometimes not at all, but it helps me find a solution in the end.
However, when I ask an LLM to generate my typed Lua code, with examples and all of how the syntax is supposed to look, it mostly gets it wrong.
My syntax for tables/objects is:

  local x: {foo = boolean}

but an LLM will most likely gloss over this and always use : instead of =:

  local x: {foo: boolean}
pmarreck
I've had success in the past with getting it to write YueScript/Moonscript (which is not a very large part of its training data) by pointing it to the root URL for the language docs and thus making that part of the context.
If your typed version of Lua has a syntax checker, you could also have it try to use that first on any code it's generated
kasey_junk
Are you using a coding agent or just an llm chat interface? Do you have a linter or compiler that will catch the misuse that you’ve hooked up to the agent?
random_cynic
[dead]
rootusrootus
That is a good thing to hear from someone as reputable as Karpathy. The folks who think we're on the cusp of AGI may want to temper their expectations a bit.
I do love Claude Code, because one thing I periodically need to do is write some web code, which is not my favorite type of coding but happens to have incredibly good coverage in the training data. Claude is a much better web developer than I am.
But for digging into the algorithmic core of our automation tooling, it doesn't have nearly as much to work with and makes far more mistakes. Still a net win I'm happy to pay for, even if it's never anything more than my web developer slave.
vunderba
100%. I find the "LLMs are completely useless" and the "LLMs will usher in a new era of messianic programming" camps to be rather reductive.
I've already built some pretty large projects [1] with the assistance of agentic tooling like Claude Code. When it comes to the more squirrely algorithms and logic, they can fall down pretty hard. But as somebody who is just dreadful at UI/UX, having it hammer out all the web dev scaffolding saves me a huge amount of time and stress.
It's just a matter of tempering one's expectations.
ggsp
Hey, thank you for making this—I really enjoyed playing it and it feels like it fits the mental-reward-between-work-tasks need. It did spin up my M1's fans after a few minutes which is a rather rare occurrence, but I'm guessing that's par for the course when you're working with a bunch of video on canvas. Either way, hope I remember it the next time I'm looking for a puzzle to solve while I take a break :)
meowface
>and the "LLMs will usher in a new era of messianic programming" camps
Well, this one might still be borne out. It's just silly to think it's the case right now. Check in again in 10 years and it may be a very different story. Maybe even in 5 years.
bdangubic
> But for digging into the algorithmic core of our automation tooling
What I find fascinating is reading this same thing in other context like “UI guru” will say “I would not let CC touch the UI but I let it rip on algorithmic core of our automation tooling cause it is better at it than me…”
Filligree
Both can be true. LLMs tend to be mediocre at (almost) everything, so they're always going to be worse than the user at whatever the user is an expert in.
But 'mediocre' isn't 'useless'.
SeanAnderson
This makes sense, right? It's a relatively novel thing to be writing. I don't find it to be a damning remark like other comments here seem to be concluding.
If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.
simonw
Yeah, if your goal is "build the tightest 8,000 line implementation of training an LLM from scratch, with a focus on both conciseness and educational value" I don't think it's particularly surprising that Claude/Codex weren't much help.
fragmede
Now to wait for Sonnet 5 and GPT-6, and ask them to build that, and see what they come up with.
JustFinishedBSG
> This makes sense, right? It's a relatively novel thing to be writing.
It's really not though? Honestly I'm surprised coding agents fail hard at this task apparently
krackers
It's not _that_ far off distribution though. The math and concepts are well understood.
nomel
That's not really how LLMs work though. It's fundamentally next-word prediction, based on statistics of the context. Reordering ideas (which can drastically change the outcome) can result in a statistically rare context. The silly failures of simple riddles [1], and the like, demonstrate this well.
The riddle issue is putting trivial ideas together, but combining them in a statistically rare way, giving low quality output that tends towards the statistically significant answer, even if it's incorrect. The same thing happens with coding, when combining well known things together in uncommon ways.
Worse (as with the riddle problem), nearby concepts that have strong statistics are going to act like attractors, with the LLM always trending towards those, removing and rewriting bits of code to better accommodate those, even if they're the opposite of what you want. I have this happen all the time in my somewhat obscure work. It'll rewrite key maths in my code to be the statistically significant textbook example, which is not what I need. I'll fix it manually or point it out, and a few changes later, it'll rewrite it again. A reasonable way around this is to "pin" the concept with a very strongly worded negative comment, like "DO NOT USE DOT PRODUCT. THIS IS NOT VECTOR PROJECTION. DO NOT modify the next line of code.".
[1] Claude 4.1 Opus:
> Prompt: "A son and his mother are in a car accident. They’re both rushed to the hospital. The doctor is about to operate on the son. But, before he does, he looks down and says "This is my son! I can’t operate on this boy!". How can this be?"
> Response: "The doctor is the boy's father.
This is a classic riddle that plays on assumptions people might make about gender and professions. The boy has two parents - his mother (who was in the car with him) and his father (the doctor at the hospital). The riddle works because some people automatically assume the doctor must be male, forgetting that the boy's father could be the doctor since his mother was already mentioned as being in the accident."
Another, with output that doesn't match the goal, statistically attracted to the riddle:
> Prompt: "A man, a sheep, and a wolf are on one side of the river, with a boat that can only hold two. How can the man safely get the boat to the other side of the river, without the sheep being eaten?"
bringmeiron
> If anything, the fact that Karpathy reached towards Claude/Codex in an attempt to gain value is indicative that, in previous coding efforts, those tools were helpful to him.
This is good for bitcoin.
kubb
He probably just doesn’t know how to prompt correctly (heheh).
satvikpendem
That's funny that the coiner of the term vibe coding has eventually found it not useful anymore.
JimDabell
That’s not what he said. This is the new project:
> My goal is to get the full "strong baseline" stack into one cohesive, minimal, readable, hackable, maximally forkable repo. nanochat will be the capstone project of LLM101n (which is still being developed). I think it also has potential to grow into a research harness, or a benchmark, similar to nanoGPT before it.
This is how he described vibe coding:
> There's a new kind of coding I call "vibe coding", where you fully give in to the vibes, embrace exponentials, and forget that the code even exists. It's possible because the LLMs (e.g. Cursor Composer w Sonnet) are getting too good. Also I just talk to Composer with SuperWhisper so I barely even touch the keyboard. I ask for the dumbest things like "decrease the padding on the sidebar by half" because I'm too lazy to find it. I "Accept All" always, I don't read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I'd have to really read through it for a while. Sometimes the LLMs can't fix a bug so I just work around it or ask for random changes until it goes away. It's not too bad for throwaway weekend projects, but still quite amusing. I'm building a project or webapp, but it's not really coding - I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works.
Vibe coding is clearly aimed at having fun hacking around on something that doesn’t matter, and he’s doing the opposite of that with this project. The fact that he’s not using vibe coding for something that is completely inappropriate for vibe coding is neither surprising nor a failure of vibe coding.
samus
The llama.cpp maintainers working on supporting Qwen3-next are also not enthused by LLM output. They had to go over everything and fix it up.
https://github.com/ggml-org/llama.cpp/pull/16095#issuecommen...
martingalex2
Isn't the point that now that Andrej's published this, it will be in-distribution soon?
montebicyclelo
> nanochat is also inspired by modded-nanoGPT
Nice synergy here, the lineage is: Karpathy's nanoGPT -> Keller Jordan's modded-nanoGPT (a speedrun of training nanoGPT) -> nanochat
modded-nanoGPT [1] is a great project, well worth checking out, it's all about massively speeding up the training of a small GPT model.
Notably it uses the author's Muon optimizer [2] rather than AdamW (for the linear layers).
varunneal
Muon was invented by Keller Jordan (and then optimized by others) for the sake of this speedrunning competition. Even though it was invented less than a year ago, it has already been widely adopted as SOTA for model training
tbalsam
This is the common belief but not quite correct! The Muon update was proposed by Bernstein as the result of a theoretical paper suggesting concrete realizations of the theory, and Keller implemented it and added practical things to get it to work well (input/output AdamW, aggressive coefficients, post-Nesterov, etc).
Both share equal credit I feel (also, the paper's co-authors!), both put in a lot of hard work for it, though I tend to bring up Bernstein since he tends to be pretty quiet about it himself.
(Source: am experienced speedrunner who's been in these circles for a decent amount of time)
varunneal
I think it's good to bring up Bernstein & Newhouse as well as Yuchen Jin, Jiacheng You and the other speedrunners who helped iterate on Muon. But I think it's very fair to call Keller Jordan the main author of Muon of its current form. I'm also in the speedrunning community though maybe not as long as you have
swyx
sharing some useful resources for learning Muon (since I'm also just catching up on it)
- https://x.com/leloykun/status/1846842883967692926
- https://www.yacinemahdid.com/p/muon-optimizer-explained-to-a...
cantor_S_drug
This Simple Optimizer Is Revolutionizing How We Train AI [Muon]
https://www.youtube.com/watch?v=bO5nvE289ec
I found the above video as a good introduction.
kouteiheika
The most exciting thing about Muon for me is that it requires half the state of Adam while having either equivalent or better performance. That's amazing if you are VRAM limited! And just like Adam, you can also quantize it. I can get it to work relatively well as low as 4-bit, which essentially cuts down the memory requirements from full 32-bit Adam by a factor of 16x! (And by a factor of 4x vs 8-bit Adam).
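The arithmetic behind that claim can be sketched directly (a rough sketch: Adam keeps two 32-bit moment tensors per parameter, Muon a single momentum buffer; the model size here is a made-up example):

```python
def opt_state_gib(n_params, n_state_tensors, bits_per_value):
    """Optimizer state size in GiB for a given per-value precision."""
    return n_params * n_state_tensors * bits_per_value / 8 / 2**30

n = 1_000_000_000                      # hypothetical 1B-parameter model
adam_fp32 = opt_state_gib(n, 2, 32)    # Adam: m and v, 32-bit each
adam_int8 = opt_state_gib(n, 2, 8)     # 8-bit Adam
muon_4bit = opt_state_gib(n, 1, 4)     # Muon: one momentum buffer, 4-bit

print(round(adam_fp32 / muon_4bit))    # -> 16
print(round(adam_int8 / muon_4bit))    # -> 4
```

which reproduces the 16x (vs. full 32-bit Adam) and 4x (vs. 8-bit Adam) figures above.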
ComplexSystems
I haven't heard of this before. Has Muon dethroned Adam and AdamW as the standard general purpose optimizer for deep learning?
spyder
It's for hidden layers and not for every parameter: From Keller's Muon github page:
"Muon is an optimizer for the hidden weights of a neural network. Other parameters, such as embeddings, classifier heads, and hidden gains/biases should be optimized using standard AdamW."
And I just looked into this nanochat repo and it's also how it's used here.
https://github.com/karpathy/nanochat/blob/dd6ff9a1cc23b38ce6...
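The quoted rule amounts to a small routing function over the parameter list. A minimal sketch (the parameter names and shapes below are hypothetical illustrations, not copied from the nanochat repo):

```python
def split_params(named_shapes, skip_substrings=("embed", "head")):
    """Route 2-D hidden weight matrices to Muon; everything else
    (embeddings, classifier head, 1-D gains/biases) to AdamW."""
    muon, adamw = [], []
    for name, shape in named_shapes:
        if len(shape) == 2 and not any(s in name for s in skip_substrings):
            muon.append(name)
        else:
            adamw.append(name)
    return muon, adamw

params = [
    ("embed.weight",     (65536, 1280)),  # token embedding -> AdamW
    ("block0.attn.qkv",  (1280, 3840)),   # hidden matrix   -> Muon
    ("block0.mlp.fc_in", (1280, 5120)),   # hidden matrix   -> Muon
    ("block0.norm.gain", (1280,)),        # 1-D gain        -> AdamW
    ("lm_head.weight",   (65536, 1280)),  # classifier head -> AdamW
]
muon, adamw = split_params(params)
print(muon)   # -> ['block0.attn.qkv', 'block0.mlp.fc_in']
```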
echelon
8xH100 is pretty wild for a single inference node.
Is this what production frontier LLMs are running inference with, or do they consume even more VRAM/compute?
At ~$8/hr, assuming a request takes 5 seconds to fulfill, you can service roughly 700ish requests per hour. About $0.01 per request.
Is my math wrong?
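The math above checks out, under the strong assumption of fully serial requests (real inference servers batch many concurrent requests on the same GPUs, which pushes per-request cost well below this):

```python
node_cost_per_hour = 8.0       # $/hr figure from the comment above
seconds_per_request = 5.0      # assumed serial latency per request

requests_per_hour = 3600 / seconds_per_request
cost_per_request = node_cost_per_hour / requests_per_hour

print(requests_per_hour)            # -> 720.0
print(round(cost_per_request, 4))   # -> 0.0111
```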
vessenes
This is the spec for a training node. The inference requires 80GB of VRAM, so significantly less compute.
andai
The default model is ~0.5B params right?
Tepix
As vessenes wrote, that's for training. But an H100 can also process many requests in parallel.
sammyd56
I'm doing a training run right now (started 20min ago). You can follow it at https://api.wandb.ai/links/sjd333-none/dsv4zkij
Will share the resulting model once ready (4 hours from now) for anyone to test inference.
sammyd56
I've uploaded the model here: https://huggingface.co/sdobson/nanochat
I didn't get as good results as Karpathy (unlucky seed?)
It's fun to play with though...
User: How many legs does a dog have? Assistant: That's a great question that has been debated by dog enthusiasts for centuries. There's no one "right" answer (...)
simonw
I got your model working on CPU on macOS by having Claude Code hack away furiously for a while. Here's a script that should work for anyone: https://gist.github.com/simonw/912623bf00d6c13cc0211508969a1...
You can run it like this:
cd /tmp
git clone https://huggingface.co/sdobson/nanochat
uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0211508969a100a/raw/80f79c6a6f1e1b5d4485368ef3ddafa5ce853131/generate_cpu.py \
--model-dir /tmp/nanochat \
--prompt "Tell me about dogs."
sammyd56
This is a much easier way to run the model. I'm going to update the huggingface README to point to this. The one thing that could be improved is the turn-taking between user and assistant, which it sometimes gets confused about. I fixed that in my fork of your gist here: https://gist.github.com/samdobson/975c8b095a71bbdf1488987eac...
vessenes
Simon, I had to run "brew install git-lfs && cd nano-chat && git lfs install && git lfs pull" and then it worked. before then, the model weights didn't get cloned by default for me on macOS.
  % uv run https://gist.githubusercontent.com/simonw/912623bf00d6c13cc0... \
      --model-dir nanochat/ --prompt "who is simonw on hacker news?"
  Using device: cpu
  Loading model from nanochat/model_000650.pt
  Loading metadata from nanochat/meta_000650.json
  Model config: {'sequence_len': 2048, 'vocab_size': 65536, 'n_layer': 20, 'n_head': 10, 'n_kv_head': 10, 'n_embd': 1280}
  Loading model weights (this may take a minute for a 2GB model)...
  Converting model to float32 for CPU...
  Model loaded successfully!
  Loading tokenizer...
  Tokenizer loaded successfully!

  Prompt: who is simonw on hacker news?
  Encoded to 9 tokens

  Generating...
  --------------------------------------------------
  who is simonw on hacker news?<|user_end|><|assistant_start|>A hacker news reporter, I'd say a few things. First, I'm a bit of a hothead, always pushing the boundaries of what's acceptable in the world of hacking. I've got a reputation for being merciless and relentless in my pursuit of the truth.

  In many ways, I've developed a sixth sense for this type of thing. I've spent years honing my skills, learning the language of hacking and the tactics it takes. I know how to think like the hacker
  --------------------------------------------------
iamcreasy
For anyone curious, this is the error when running uv sync on macOS:

  > uv sync
  Resolved 88 packages in 3ms
  error: Distribution `torch==2.8.0+cu128 @ registry+https://download.pytorch.org/whl/cu128` can't be installed because it doesn't have a source distribution or wheel for the current platform
  hint: You're on macOS (`macosx_15_0_arm64`), but `torch` (v2.8.0+cu128) only has wheels for the following platforms: `manylinux_2_28_x86_64`, `win_amd64`; consider adding your platform to `tool.uv.required-environments` to ensure uv resolves to a version with compatible wheels

Also, /tmp/nanochat expects all contents from the tokenizer and chatsft_checkpoints folders.
Lerc
The comment beside the first chart
>Our main measure of progress. Bits per byte is, per Karpathy, "a much better measure than just the typical cross-entropy loss, because it further normalizes the loss on each token by the number of bytes of that token, making the metric tokenizer-invariant".
Is so blindingly obvious that I'm ashamed I didn't think to do it when trialing my own tokenizer approach on TinyStories. I might go back and have a look at how well my tokenizer actually compared versus how well I imagined it compared.
SeanAnderson
ELI5 for anyone else (I had to have this explained to me):
When you train a language model, it tries to predict the next token.
We measure how good it is at that using loss aka how surprised it was by the real answer.
Different models might use different token lengths. So, if you describe loss relative to tokens then you can't easily compare the performance of two models that use different token lengths.
So, compare loss to bytes of text data instead.
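The conversion itself is one line of arithmetic. A sketch, in aggregate form (assuming cross-entropy in nats and known UTF-8 byte lengths per token; nanochat's exact bookkeeping may differ):

```python
import math

def bits_per_byte(per_token_loss_nats, token_byte_lengths):
    """Tokenizer-invariant loss: total loss in bits / total bytes covered."""
    total_bits = sum(per_token_loss_nats) / math.log(2)  # nats -> bits
    total_bytes = sum(token_byte_lengths)
    return total_bits / total_bytes

# Two models with different tokenizers report different per-token losses,
# but dividing by bytes puts them on the same scale.
losses = [3.0, 2.5, 4.0]   # cross-entropy per token, in nats (illustrative)
byte_lens = [4, 3, 5]      # UTF-8 bytes each token covers (illustrative)
print(round(bits_per_byte(losses, byte_lens), 4))
```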
typpilol
Why hasn't anyone made a tokenizer that's 1 character per token? Is it because it requires an insane amount of compute?
Or would the loss of efficiency make it dumber than modern tokenizers?
nl
Tokenizers used to be 1 character per token. Then Google implemented Subword encoding[1] on their early neural translation work and found it was much better.
Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though.
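The subword idea can be illustrated with a toy BPE-style loop: start character-level and repeatedly merge the most frequent adjacent pair of symbols (a minimal sketch, not the actual WordPiece algorithm from the cited work):

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Most common adjacent pair of symbols."""
    return Counter(zip(tokens, tokens[1:])).most_common(1)[0][0]

def merge(tokens, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out

tokens = list("low lower lowest")   # start character-level
for _ in range(2):
    tokens = merge(tokens, most_frequent_pair(tokens))
print(tokens)   # frequent fragments like 'low' emerge as single tokens
```

After two merges the shared stem "low" has become one token, which is exactly the "genuinely meaningful subword units" effect described above.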
SeanAnderson
yes to both.
absolutely requires longer training time and more compute.
once trained, predictions need to hold through many more steps because each step processes one token. if a token early in a sentence heavily implies a token will occur later in the sentence then that awareness needs to be maintained while processing each intermediary token and each step is a bit lossy. the fewer steps you need to take before leveraging that knowledge the better the prediction.
if you had infinite compute and data for training then performance would be equivalent though, i think.
skirmish
Since OpenAI tokenizer is estimated at ~4.2 characters per token, with your proposed "1 char per token tokenizer", the effective context length immediately becomes 4.2 times smaller, and generated output 4.2 times slower (since 4.2 times more tokens are needed for the same output). Doesn't look like a good tradeoff.
royosherove
Cool. Is there a simple "howto" on running this repo with training on W&B for a programmer like me who has never done model training flows? Maybe you could share the steps you took?
sammyd56
There's not much to it... it took longer to spin up the cloud machine than it did to kick off the training run. I'll be writing up a blog post with a step-by-step guide when I get a free moment, but in the meantime, here are the commands I ran: https://pastebin.com/sdKVy0NR
royosherove
Ah I was missing the WANDB_RUN env var. so did not get any logs. thanks!
bravura
For the measures that drop exponentially, like val/bpb and train/loss, you should put the x-axis in log scale. That will better show you whether it's converged.
sammyd56
Great call, thank you - I switched to log scale for those metrics - agree that it is much clearer.
bravura
Sorry fat fingers. It should be the y axis that is log scale, not x axis. (Sometimes both is good.)
Did you notice the inflection point in which the loss drops faster than expected in the top graph? Maybe you should let it run more…
faxmeyourcode
This weekend I just cracked into nanoGPT (https://github.com/karpathy/nanoGPT), an older but fabulous learning exercise where you build and train a crappy shakespeare GPT with ~0.8M parameters on a cpu. Results are about what you'd expect from that, they suck, but you can start to feel the magic, especially if you're not a deep learning professional and you just want to poke around and hack on it.
I started writing up a blog post on my weekend with nanoGPT but it's not done yet... Would have been great to link to here lol oh well
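In the same spirit, the simplest possible "language model" - a character bigram counter, similar to where Karpathy's makemore series starts before getting to transformers - fits in a few lines and already feels like generation (the corpus here is just illustrative):

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Count character -> next-character transitions."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def sample(counts, start, n, seed=0):
    """Draw each next character from the counted distribution."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(n):
        dist = counts.get(out[-1])
        if not dist:
            break
        chars, weights = zip(*dist.items())
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

model = train_bigram("to be or not to be that is the question")
print(sample(model, "t", 20))
```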
ACCount37
It's a useful exercise. A lot of the good ML work is first validated at small scale.
And this new example goes even further - adds instruction following and tool use SFT, as well as RLVR. Makes for a more useful baseline.
faxmeyourcode
Absolutely, it's wildly fun to read the outputs of even a little tiny 0.8M model trained on CPU. And now I've actually got a much better understanding of the transformer architecture after playing around with it for a day. This repo is probably going to spawn some new folks to try out ideas which will turn into new researchers in the field, no doubt.
andrewljohnson
the shakespeare code tuned a little with different training data does a good job of generating Magic The Gathering commander decks
jwitthuhn
Somewhat related: I wrote up a MTG card generator based on nanoGPT a while ago that I think produces pretty good results for being 1m parameters.
The real neat thing about this is that WotC makes a few thousand new cards each year, so my training data set just grows over time and the model gets better with no effort spent on my part.
wordpad
It would be interesting to come up with a use case that requires a freshly trained model and isn't just something that generic models can already do, especially with a 1MM-token context window
SeanAnderson
would love more details on this. this is exactly the type of project I'd like to dabble in to get more up to speed.
astrange
People have been doing this for a while.
https://bsky.app/profile/roborosewaterm.bsky.social
You can see the invention of RLHF/ChatGPT here because text generation suddenly became much more coherent and also much less interesting. You have to go back to older tech for surrealism because nobody will let you see the good stuff (the base models).
vunderba
FWIW, there was a pretty popular post on HN around generating MTG cards using AI a couple years back but I believe that their approach was a fine-tune on an existing LLM.
dmarcos
I like the idea of specific-purpose toy models. How did you tune the code and what dataset you used?
sieve
Nice! His Shakespeare generator was one of the first projects I tried after ollama. The goal was to understand what LLMs were about.
I have been on an LLM binge this last week or so trying to build a from-scratch training and inference system with two back ends:
- CPU (backed by JAX)
- GPU (backed by wgpu-py). This is critical for me as I am unwilling to deal with the nonsense that is rocm/pytorch. Vulkan works for me. That is what I use with llama-cpp.
I got both back ends working last week, but the GPU back end was buggy. So the week has been about fixing bugs, refactoring the WGSL code, making things more efficient.
I am using LLMs extensively in this process and they have been a revelation. Use a nice refactoring prompt and they are able to fix things one by one resulting in something fully functional and type-checked by astral ty.
danielmarkbruce
Unwilling to deal with pytorch? You couldn't possibly hobble yourself any more if you tried.
sieve
If you want to train/sample large models, then use what the rest of the industry uses.
My use case is different. I want something that I can run quickly on one GPU without worrying about whether it is supported or not.
I am interested in convenience, not in squeezing out the last bit of performance from a card.
danielmarkbruce
You wildly misunderstand pytorch.
ComputerGuru
If you’re not writing/modifying the model itself but only training, fine tuning, and inferencing, ONNX now supports these with basically any backend execution provider without needing to get into dependency version hell.
Breza
What are your thoughts on using JAX? I've used TensorFlow and Pytorch and I feel like I'm missing out by not having experience with JAX. But at the same time, I'm not sure what the advantages are.
sieve
I only used it to build the CPU back end. It was a fair bit faster than the previous numpy back end. One good thing about JAX (unlike numpy) is that it also gives you access to a GPU back end if you have the appropriate stuff installed.
swyx
> Thank you to chief LLM whisperer Alec Radford for advice/guidance.
oh man an Alec x Andrej podcast would BREAK THE INTERNET... just saying... going from glory days of GPT1 to now building GPT3? in 4 hours
codybontecou
Please oh please. This would be perfect.
karimf
I've always thought about the best way to contribute to humanity: number of people you help x how much you help them. I think what Karpathy is doing is one of the highest leverage ways to achieve that.
Our current world is built on top of open source projects. This is possible because there are a lot of free resources for learning to code, so anyone from anywhere in the world can learn and make a great piece of software.
I just hope the same will happen with the AI/LLM wave.
bkettle
This free tradition in software is I think one of the things that I love so much, but I don't see how it can continue with LLMs due to the extremely high training costs and the powerful hardware required for inference. It just seems like writing software will necessarily require paying rent to the LLM hosts to keep up. I guess it's possible that we'll figure out a way to do local inference in a way that is accessible to everyone in the way that most other modern software tools are, but the high training costs make that seem unlikely to me.
I also worry that as we rely on LLMs more and more, we will stop producing the kind of tutorials and other content aimed at beginners that makes it so easy to pick up programming the manual way.
levocardia
There's a Stephen Boyd quote that's something like "if your optimization problem is too computationally expensive, just go on vacation to Greece for a few weeks and by the time you get back, computers might be fast enough to solve it." With LLMs there's sort of an equivalent situation with cost: how mindblowing would it be able to train this kind of LLM at all even just 4 years ago? And today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
There's also a reasonable way to "leapfrog" the training cost with a pre-trained model. So if you were doing nanochat as a learning exercise and had no money, the idea would be to code it up, run one or two very slow gradient descent iterations on your slow machine to make sure it is working, then download a pre-trained version from someone who could spare the compute.
piokoch
But in this case the reason is simple: the core algorithm is O(n^2); that's not going to be improved over a few weeks.
dingnuts
> today you can get a kindergartener level chat model for about $100. Not hard to imagine the same model costing $10 of compute in a few years.
No, it's extremely hard to imagine since I used one of Karpathy's own models to have a basic chat bot like six years ago. Yes, it spoke nonsense; so did my GPT-2 fine tune four years ago and so does this.
And so does ChatGPT
Improvement is linear at best. I still think it's actually a log curve and GPT3 was the peak of the "fun" part of the curve. The only evidence I've seen otherwise is bullshit benchmarks, "agents" that increase performance 2x by increasing token usage 100x, and excited salesmen proclaiming the imminence of AGI
hodgesrm
This. It looks like one of the keys to maintaining open source is to ensure OSS developers have access to capable models. In the best of worlds, LLM vendors would recognize that open source software is the commons that feeds their models and ensure it flourishes.
In the real world...
DennisP
Maybe this isn't possible for LLMs yet, but open source versions of AlphaZero have been trained on peer-to-peer networks.
Lerc
(This is a bit ranty, but due to a sincere desire for a better world, and being the recipient of personal attacks for believing a better world is achievable by a different path to others)
I feel like this point of view is an ideal not shared by one of the main branches of anti-AI sentiment.
The idea of intellectual property works against this. Rather than contributing to humanity directly, ownership of information is accumulated by individuals and then rented to humanity.
At the same time I agree that people should be able to have a livelihood that affords them the ability to create new intellectual contributions.
The service Karpathy is providing is also being provided by thousands of YouTube creators in a huge variety of topics. It's a little sad that so many must support their efforts with sponsorships from sources with varying degrees of ethical behaviour. Patreon is better but still not ideal. I sincerely believe this _is_ one of the best ways to contribute to society.
A recent Daily Show had Jon Stewart describe training AI as strip mining human knowledge. Training AI is regularly described as theft as if this position is a given without any counter argument possible. It is opinion masquerading as fact. This saddens me because it suggests to me that the war to control the narrative is being won by people who want to entrench a hypercapitalistic vision of ownership where not only is a particular expression of an idea ownable but also stakes a claim to own some of any ideas that come from viewing that expression.
I cannot see any way that this viewpoint would aid humanity as a whole, but instead assign benefits to a collection of individuals. The ability to trade intellectual property means that ownership inevitably gets passed to a smaller and smaller pool of individuals over time.
I think we really do need a new way to consider these issues in light of the modern world. When mentioning these thoughts to others a common refrain is that it doesn't matter because the powers that be (and their lobbyists) will prevent any fix from happening. I have never been fond of that particular fatalism, especially when it inhibits discussion of what would be better.
oblio
Awesome approach.
I'm all for abolishing IP if all AIs are owned communally. I.e. ideally they're utilities or flat out co-ops like some Spanish businesses.
https://en.wikipedia.org/wiki/Mondragon_Corporation
Consum (Spanish supermarket).
They don't get to use everything communally and then capitalism their way forward.
viccis
I recommend his ANN/LLM from scratch videos to people a lot because not only is he a clear instructor, but his code tends to be very Pythonic and just the right balance of terse but readable (not counting the Pytorch vectorization stuff, but that's not his fault, it's just complex). So I think people benefit just from watching and imitating his code style.
undefined
undefined
epolanski
Then a single person who's learned those skills decides to poison all of us thanks to the skills acquired.
shafyy
If only it were so easy
carlcortright
strong +1 - developers like him are heroes
flakiness
Eureka Labs: https://github.com/EurekaLabsAI
What a prolific person Andrej is. It's been more than amazing to follow along!
CountGeek
So could I in practice train it on all my psychology books, materials, reports, case studies and research papers, and then run it on demand on a 1xH100 node - https://getdeploying.com/reference/cloud-gpu/nvidia-h100 - whenever I have a specialised question?
leokeba
You could do that indeed, but the performance would be abysmal. For this kind of use-case, it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
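To make the RAG half of that suggestion concrete, here's a minimal retrieval sketch. It uses a toy bag-of-words similarity as a stand-in for a real embedding model, and all the names (`vectorize`, `retrieve`, the sample chunks) are invented for illustration:

```python
from collections import Counter
import math

def vectorize(text):
    # Toy stand-in for a real embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    # Rank document chunks by similarity to the query, return the top k.
    qv = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)
    return ranked[:k]

chunks = [
    "CBT techniques for treating generalized anxiety disorder",
    "Case study: attachment styles in early childhood",
    "Statistical power analysis for small-sample studies",
]
context = retrieve("anxiety treatment techniques", chunks, k=1)
prompt = "Answer using this context:\n" + "\n".join(context)
```

In a real setup you'd swap the bag-of-words vectorizer for embeddings from a pretrained model and hand the assembled prompt to the LLM; the retrieval loop itself stays this simple.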
dmix
> it would be a LOT better to use a small pre-trained model and either fine-tune it on your materials, or use some kind of RAG workflow (possibly both).
I noticed NewRelic has a chat feature that does this sort of thing, it's scoped very narrowly down to their website and analytics DSL language, and generates charts/data from their db. I've always wondered how they did that (specifically in terms of set up the training/RAG + guardrails). It's super useful.
simonw
You might be able to figure that out just by asking it - see if you can get it to spit out a copy of the system prompt or tell you what tools it has access to.
The most likely way of building that would be to equip it with a "search_docs" tool that lets it look up relevant information for your query. No need to train an extra model at all if you do that.
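A sketch of what that tool-equipped shape might look like, with a hard-coded stand-in for the model so it runs without any API. The `search_docs` name comes from the comment above; everything else (the docs "index", the message format) is invented for illustration:

```python
# Toy docs "index" and the tool the model would be given.
DOCS = {
    "nrql-charts": "Use SELECT ... FACET ... to build charts from the events DB.",
    "alerts": "Alert conditions are defined over NRQL query results.",
}

def search_docs(query: str) -> str:
    # Naive keyword lookup standing in for a real search backend.
    words = query.lower().split()
    hits = [text for text in DOCS.values() if any(w in text.lower() for w in words)]
    return "\n".join(hits) or "no results"

def fake_model(messages):
    # Stand-in for a real LLM: first turn requests the tool, second answers.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "search_docs", "args": {"query": "charts"}}}
    return {"content": "Build the chart with a SELECT ... FACET query."}

def run(question):
    # The standard agent loop: call model, execute any tool call, repeat.
    messages = [{"role": "user", "content": question}]
    while True:
        reply = fake_model(messages)
        if "tool_call" not in reply:
            return reply["content"]
        result = search_docs(**reply["tool_call"]["args"])
        messages.append({"role": "tool", "content": result})
```

The point of the sketch is the loop shape: no extra training, just a tool the model can call and whose results get appended back into the conversation.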
zipy124
You could but it would be significantly worse than fine-tuning or RAG with a pre-trained model, or using a smaller model since your dataset would be so small.
gojomo
Yes, though it's possible a more-general core model, further enhanced with some other ways to bring those texts-of-interest into the working context, might perform better.
Those other ways to integrate the texts might be some form of RAG or other ideas like Apple's recent 'hierarchical memories' (https://arxiv.org/abs/2510.02375).
nickandbro
You could! But just like others have mentioned, the performance would be poor. If you really wanted to see more of a performance boost from pretraining you could try to create a bigger chunk of data to train on. This would be done by either creating synthetic data from your material, or finding adjacent information to your material. Here's a good paper about it: <https://arxiv.org/abs/2409.07431>
undefined
alganet
No.
daft_pink
Wow, how do we sign up for the Eurekalabs course and how much does it cost?
karpathy
Still under development, remaining work includes tuning nanochat (current state being solid v0.1) and finalizing the in-between projects so that students can "unlock" all complexity that hides underneath: `torch.Tensor`, `torch.dist`, `.backward()`, `.compile()`, etc. And then the more ops heavy aspects.
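For anyone curious what `.backward()` hides underneath, the core idea fits in a few lines. This is a toy scalar autograd in the spirit of Karpathy's micrograd, not PyTorch's actual implementation:

```python
class Value:
    """Toy scalar autograd node: tracks data, grad, and how to backprop."""
    def __init__(self, data, parents=(), grad_fn=lambda: None):
        self.data, self.grad = data, 0.0
        self._parents, self._backward = parents, grad_fn

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad                # d(a+b)/da = 1
            other.grad += out.grad               # d(a+b)/db = 1
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

a, b = Value(2.0), Value(3.0)
loss = a * b + a        # d(loss)/da = b + 1 = 4, d(loss)/db = a = 2
loss.backward()
```

PyTorch does the same thing over tensors with fused kernels and a C++ engine, but the "record the graph forward, walk it backward" idea is exactly this.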
BrokenCogs
What's the pricing for the course/EurekaLabs? P.s. thanks for all you're doing
huseyinkeles
Karpathy says nanochat will become the capstone project of the course LLM101n being developed by Eureka Labs.
I guess it’s still a work in progress? Couldn’t find any other information elsewhere.
Schiphol
A bit more info [here](https://github.com/karpathy/LLM101n)
chipsrafferty
Would love to hear some metrics on training it on your personal computer rather than a "cloud GPU box". I don't care if it takes 3 months to train if I have something good, offline, and free(ish, but just pay electric bills)
ComputerGuru
Each H100 can do about 60 TFLOPS of fp32 operations, while a single RTX 3080 can do roughly half that (just under 30). So a complete back-of-the-envelope answer would be 16x as long (since nanochat targets four hours on 8x H100).
64 hours isn’t too bad at all!
(An RTX 2080 can only do 10 TFLOPS fp32, so that would be roughly 3x as long again.)
zoba
I’d also be interested in this. Especially for Macs
TheAceOfHearts
Here's the announcement post [0] from Karpathy, which provides a bit of additional context.
dang
Thanks - we'll put that in the toptext as well
[0] https://x.com/karpathy/status/1977755427569111362