You are probably familiar with the long list of various benchmarks that new models are tested on and compared against. These benchmarks are supposedly designed to assess the model’s ability to perform in various aspects of language understanding, logical reasoning, information recall, and so on.
However, while I understand the need for an objective and scientific measurement scale, I have long felt that these benchmarks are not particularly representative of the actual experience of using the models. For example, people will claim that a model performs at “some percentage of GPT-3” and yet not one of these models has ever been able to produce correctly-functioning code for any non-trivial task or follow a line of argument/reasoning. Talking to GPT-3 I have felt that the model has an actual in-depth understanding of the text, question, or argument, whereas other models that I have tried always feel as though they have only a superficial/surface-level understanding regardless of what the benchmarks claim.
My most recent frustration, and the one that prompted this post, is regarding the newly-released OpenOrca preview 2 model. The benchmark numbers claim that it performs better than other 13B models at the time of writing, supposedly outperforms Microsoft’s own published benchmark results for their yet-unreleased model, and scores an “average” result of 74.0% against GPT-3’s 75.7% while the LLaMa model that I was using previously apparently scores merely 63%.
I’ve used GPT-3 (text-davinci-003), and this model does not “come within comparison” of it. Even giving it as much of a fair chance as I can, giving it plenty of leeway and benefit of the doubt, not only can it still not write correct code (or even valid code in a lot of cases) but it is significantly worse at it than LLaMa 13B (which is also pretty bad). This model does not understand basic reasoning and fails at basic reasoning tasks. It will write a long step-by-step explanation of what it claims that it will do, but the answer itself contradicts the provided steps or the steps themselves are wrong/illogical. The model has only learnt to produce “step by step reasoning” as an output format, and has a worse understanding of what that actually means than any other model does when asked to “explain your reasoning” (at least, for other models that I have tried, asking them to explain their reasoning produces at least a marginal improvement in coherence).
There is something wrong with these benchmarks. They do not relate to real-world performance. They do not appear to measure a model’s ability to actually understand the prompt/task, but possibly only its ability to produce output that “looks correct” according to some format. These benchmarks are not a reliable way to compare model performance, and as long as we keep using them we will keep producing models that score higher on benchmarks and claim to perform “almost as good as GPT-3” yet fail spectacularly at any task/prompt that I can think of to throw at them.
(I keep using coding as an example, but I have also tried other tasks besides code, as I realise that code is possibly a particularly challenging task due to requirements like exact syntax. My interpretation of the various models’ level of understanding is based on experience across a variety of tasks.)
Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes there are contradictions between what the documentation says and what the preset for the model in text-generation-webui says, in which case I often try both, with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate); sometimes I also try a deterministic setting.
I wasn’t aware that the benchmarks are multi-shot. I haven’t looked so much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.
Most default settings have the temperature around 0.8-0.9, which is likely way too high for code generation. Default settings also frequently include stuff like a repetition penalty. Imagine the LLM is trying to generate Python: it has to produce a bunch of spaces before every line, but something like a repetition penalty can severely reduce the probability of exactly the tokens it must select for the result to be valid. With code, there’s often very little leeway for choosing what to write.
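To make that concrete, here's a toy sketch of how a repetition penalty and temperature shift the probability of an indentation token. Everything in it is made up for illustration: the three-token vocabulary, the logits, and the penalty value of 1.3 are assumptions, and the penalty uses the common "divide positive logits, multiply negative ones" rule, which may not match exactly what your backend does.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, recent_token_ids, penalty):
    """Push down the logits of tokens that already appeared recently.

    Assumed rule: positive logits are divided by the penalty,
    negative logits are multiplied by it.
    """
    penalized = list(logits)
    for tok in set(recent_token_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized

# Hypothetical vocabulary: 0 = four-space indent, 1 = "return", 2 = "print"
INDENT = 0
logits = [4.0, 2.0, 1.0]   # the model strongly prefers the indent
recent_tokens = [INDENT]   # the indent was already used on the previous line

p_plain = softmax(logits, temperature=0.8)[INDENT]
p_penalized = softmax(
    apply_repetition_penalty(logits, recent_tokens, penalty=1.3),
    temperature=0.8,
)[INDENT]
p_cold = softmax(logits, temperature=0.2)[INDENT]

print(f"indent probability, T=0.8, no penalty:  {p_plain:.3f}")
print(f"indent probability, T=0.8, penalty 1.3: {p_penalized:.3f}")
print(f"indent probability, T=0.2, no penalty:  {p_cold:.3f}")
```

The penalized probability comes out lower than the plain one, and the low-temperature probability comes out higher. Over hundreds of lines of indented code, even a modest per-token drop adds up to a lot of chances to pick something invalid.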
So you said you’re aware of how sampling and prompt format affect models. But judging the model by what it outputs with the default settings (I checked, and it looks like llama-cpp-python has both a pretty high temperature setting and a repetition penalty enabled by default) kind of contradicts that.
By the way, you might also want to look into the grammar sampling stuff that recently got added to llama.cpp. This can force the model to generate only tokens that conform to some grammar, which is pretty useful for code and other cases where the output has to follow a fixed structure. You should still look carefully at the other settings to make sure they suit the type of result you want to generate, though. The defaults are not suitable for every use case.
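If it helps, here's the general idea behind grammar-constrained sampling as a toy sketch. This is not llama.cpp's implementation or its grammar format: the state-machine "grammar", the token strings, and the per-step scores are all hypothetical, just to show how tokens the grammar doesn't allow get filtered out before one is picked.

```python
# Toy grammar for a bracketed list like "[1,2]", written as a state machine:
# state -> {allowed token: next state}. Purely illustrative.
GRAMMAR = {
    "start": {"[": "item"},
    "item": {"1": "sep", "2": "sep"},
    "sep": {",": "item", "]": "done"},
}

def constrained_greedy(step_scores, grammar, state="start"):
    """Greedy decoding that only considers tokens the grammar allows."""
    output = []
    for scores in step_scores:
        if state == "done":
            break
        allowed = grammar[state]
        # Drop every candidate the grammar does not permit in this state.
        candidates = {tok: s for tok, s in scores.items() if tok in allowed}
        if not candidates:
            raise ValueError(f"no grammar-valid token available in state {state!r}")
        best = max(candidates, key=candidates.get)
        output.append(best)
        state = allowed[best]
    return "".join(output)

# Hypothetical per-step scores from a model that would rather chat
# than emit the list we asked for.
step_scores = [
    {"Sure": 5.0, "[": 2.0, "1": 1.0},
    {"here": 4.0, "1": 3.0, "2": 1.0, ",": 0.5},
    {"!": 3.0, ",": 2.0, "]": 1.0},
    {"2": 2.5, "1": 2.0, "]": 0.1},
    {"ok": 9.0, "]": 1.0, ",": 0.5},
]

# Unconstrained greedy just takes the top-scoring token at every step.
unconstrained = "".join(max(s, key=s.get) for s in step_scores)
constrained = constrained_greedy(step_scores, GRAMMAR)

print("unconstrained:", unconstrained)
print("constrained:  ", constrained)
```

The constrained run can only ever produce something the grammar accepts, even though the model's top choice at most steps was something else. Note that this only guarantees the shape of the output, not that the content is right.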
I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.
Well, you said you sometimes did that, so it’s not entirely clear which of your conclusions are based on deterministic sampling and which aren’t. Anyway, like I said, it’s not just temperature that may be causing issues.
I want to be clear I’m not criticizing you personally or anything like that. I’m not trying to catch you out and you don’t have to justify anything about your decisions or approach to me. The only thing I’m trying to do here is provide information that might help you and potentially other people get better results or understand why the results with a certain approach may be better or worse.