Original post

Yeah, I kind of like to think about the training process essentially as iterative testing, like table tests of some function. If you’re thinking about the trial and error process, then you parameterize these bits of the function… And if you’re wondering “Well, how good is my parameterization? How well did I pick my numbers?”, then what you wanna do is try some examples and see how many you get right.

The difference with a traditional software engineering function is that you always expect to get all the examples right. For your API endpoints you have a bunch of examples in a table; you want to get 100% of those right, and the test fails if you miss one.
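
To make that table-test picture concrete, here’s a minimal sketch in Python. The `is_even` function and its cases are made up for the example; the point is just the 100%-or-fail expectation.

```python
# A minimal sketch of the "table test" idea for a traditional function:
# every case in the table must pass, and a single miss fails the test.
# The function and the cases here are invented for illustration.

def is_even(n: int) -> bool:
    return n % 2 == 0

TEST_CASES = [
    (0, True),
    (1, False),
    (2, True),
    (7, False),
]

def run_table_test() -> None:
    for value, expected in TEST_CASES:
        # Traditional software: we expect 100% of these to be right.
        assert is_even(value) == expected, f"is_even({value}) should be {expected}"

if __name__ == "__main__":
    run_table_test()
    print("All cases passed")
```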

In the case of a machine learning or AI model, you’re gonna have a bunch of example images. Some of them are gonna have cats, some of them are not gonna have cats. You would never expect to get all of them right, but you want to get as many right as possible. So what you do is you choose some random parameters to start with, and then you run your examples through and see how many you got right. Maybe you got like 20% right, or something.
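
As a toy version of that, the sketch below invents a one-parameter “cat detector” (a single threshold on a hand-made feature value) and counts how many labeled examples a random starting parameter gets right. The data and the classifier are purely illustrative; a real cat detector would look at image pixels, not one number.

```python
import random

# A toy sketch of "pick random parameters, run the examples through, and
# count how many you got right". The data and the single-threshold
# classifier are invented for illustration.

# Each example is (feature value, label), where label 1 means "cat".
EXAMPLES = [(0.9, 1), (0.8, 1), (0.75, 1), (0.3, 0), (0.2, 0), (0.1, 0)]

def predict(feature: float, threshold: float) -> int:
    # One parameter: the threshold above which we say "cat".
    return 1 if feature > threshold else 0

def accuracy(threshold: float) -> float:
    correct = sum(1 for x, label in EXAMPLES if predict(x, threshold) == label)
    return correct / len(EXAMPLES)

# Start with a random parameter and see how many examples we get right.
random.seed(0)
threshold = random.random()
print(f"threshold={threshold:.2f}, accuracy={accuracy(threshold):.0%}")
```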

Then you tweak your parameters a little bit and try again, and maybe you get 25% right, so you’re kind of going in the right direction with your parameters… And you just do this iteratively, over and over, until you get the best parameter set that you can find. That’s how the training process works.
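
Continuing that toy example, a rough sketch of the tweak-and-retry loop might look like this, keeping whichever parameter value scores best. Again, the data and setup are invented for illustration.

```python
import random

# A sketch of the trial-and-error loop: nudge the parameter a little, keep
# the change if accuracy improves, and repeat. Same toy setup as above.

EXAMPLES = [(0.9, 1), (0.8, 1), (0.75, 1), (0.3, 0), (0.2, 0), (0.1, 0)]

def accuracy(threshold: float) -> float:
    predictions = [1 if x > threshold else 0 for x, _ in EXAMPLES]
    correct = sum(p == label for p, (_, label) in zip(predictions, EXAMPLES))
    return correct / len(EXAMPLES)

random.seed(1)
best_threshold = random.random()
best_accuracy = accuracy(best_threshold)

for step in range(200):
    # Tweak the parameter a little bit and try again.
    candidate = best_threshold + random.uniform(-0.1, 0.1)
    if accuracy(candidate) > best_accuracy:
        best_threshold, best_accuracy = candidate, accuracy(candidate)

print(f"best threshold={best_threshold:.2f}, accuracy={best_accuracy:.0%}")
```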

Now, there are various mathematical techniques that help with that: not just randomly choosing parameters, but moving them in the right direction… But it’s essentially that trial and error. As for the training data and how much you need, that depends on how complicated your model is.
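
For a flavour of that mathematics, here’s a sketch of plain gradient descent on a single parameter, with a made-up dataset: instead of random tweaks, each step moves the parameter in the direction that reduces the error.

```python
# A rough sketch of moving parameters "in the right direction": compute the
# gradient of the error and step against it (gradient descent). The tiny
# dataset and learning rate are invented for illustration.

# Points roughly following y = 3x; we want to learn the single parameter w.
DATA = [(1.0, 3.1), (2.0, 5.9), (3.0, 9.2), (4.0, 11.8)]

w = 0.0             # start from an arbitrary parameter value
learning_rate = 0.01

for step in range(100):
    # Gradient of the mean squared error of y = w * x with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in DATA) / len(DATA)
    w -= learning_rate * grad   # step in the direction that reduces the error

print(f"learned w = {w:.2f}")   # ends up close to 3
```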

If you have just one if statement, then it’s gonna be fairly quick to parameterize that, and you might not need that many examples. But if you have over a billion parameters, like some of these larger models that we see now, you’re not going to find good values for all of those parameters with 100 examples. You need very, very many examples, which is why, alongside the growth in model complexity over recent years, we’ve seen a similar boom in how much training data is needed.

At the same time, we’ve seen various tricks that allow you to adapt or fine-tune models, rather than always starting the training process from scratch… Which has been one of the reasons why things are moving so quickly – there’s this idea of piggy-backing off of others’ work. Google might have already trained on 200 terabytes of data, and you’re just fine-tuning to a particular problem, so you don’t need as much.
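
As a rough sketch of what fine-tuning can look like in code (assuming PyTorch, with placeholder shapes and random data rather than a real pretrained model), you freeze the existing parameters and train only a small new layer for your own task:

```python
import torch
import torch.nn as nn

# A rough sketch of the fine-tuning idea: take a model someone else already
# trained, freeze its parameters, and train only a small new layer on top
# for your own problem. The "pretrained" base, shapes and data here are
# placeholders, not a real pretrained network.

pretrained_base = nn.Sequential(          # stands in for a big pretrained model
    nn.Linear(512, 256), nn.ReLU(),
)
for param in pretrained_base.parameters():
    param.requires_grad = False           # keep the expensively-learned weights fixed

new_head = nn.Linear(256, 2)              # small task-specific layer: cat / not cat
model = nn.Sequential(pretrained_base, new_head)

optimizer = torch.optim.SGD(new_head.parameters(), lr=0.01)

# Only the new head gets updated, which is why far less data is needed.
features = torch.randn(8, 512)            # placeholder batch of "image features"
labels = torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(model(features), labels)
loss.backward()
optimizer.step()
```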