What is wrong with LLM benchmarks, and why are we still using them?

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

Yeah, I’m aware of how sampling and prompt format affect models. I always try to use the correct prompt format (although sometimes what the documentation says contradicts the preset for the model in text-generation-webui, in which case I usually try both, with no noticeable difference in results). For sampling I normally use the llama-cpp-python defaults and give the model a few attempts to answer the question (regenerate). Sometimes I also try it on a deterministic setting.

I wasn’t aware that the benchmarks are multi-shot. I haven’t looked so much into how the benchmarks are actually performed, tbh. But this is useful to know for comparison.

Kerfuffle@sh.itjust.works · 1 year ago

For sampling I normally use the llama-cpp-python defaults

Most default settings have the temperature around 0.8-0.9, which is likely way too high for code generation. Default settings also frequently include stuff like a repetition penalty. Imagine the LLM is trying to generate Python: it has to produce a bunch of spaces before every line, but something like a repetition penalty can severely reduce the probability of the tokens it basically must select for the result to be valid. With code, there’s often very little leeway for choosing what to write.
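To make the indentation point concrete, here is a minimal sketch of how temperature and a repetition penalty reshape next-token probabilities. Everything in it is an assumption for illustration: the four-token vocabulary, the logit values, and the penalty of 1.3 are made up, and the penalty rule (divide positive logits, multiply negative ones) is one common formulation rather than a claim about what any particular library does internally.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def apply_repetition_penalty(logits, recent_token_ids, penalty):
    """Penalize tokens that already appeared recently (one common formulation)."""
    out = list(logits)
    for token_id in set(recent_token_ids):
        if out[token_id] > 0:
            out[token_id] /= penalty
        else:
            out[token_id] *= penalty
    return out

# Hypothetical toy vocabulary and logits, not taken from any real model.
VOCAB = ["    ", "return", "x", "\n"]
logits = [4.0, 2.0, 1.5, 1.0]  # the model strongly prefers the indentation token
recent = [0, 0, 0]             # indentation was already emitted on earlier lines

p_plain = softmax(logits, temperature=0.8)
p_penalized = softmax(apply_repetition_penalty(logits, recent, penalty=1.3), temperature=0.8)
p_cold = softmax(logits, temperature=0.2)

for name, probs in [("temp 0.8", p_plain), ("temp 0.8 + penalty", p_penalized), ("temp 0.2", p_cold)]:
    print(f"{name:>20}: P(indent) = {probs[0]:.3f}")
```

In this toy setup the penalty lowers the probability of the indentation token even though it is the only valid choice, while lowering the temperature raises it.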

So you said:

I’m aware of how sampling and prompt format affect models.

But judging the model by what it outputs with the default settings (I checked, and it looks like llama-cpp-python has both a pretty high temperature setting and a repetition penalty enabled by default) kind of contradicts that.

By the way, you might also want to look into the grammar sampling stuff that recently got added to llama.cpp. This can force the model to generate only tokens that conform to some grammar, which is pretty useful for code and other cases where the output has to follow a fixed format. You should still look carefully at the other settings to make sure they suit the type of result you want to generate, though. The defaults are not suitable for every use case.
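The general idea behind grammar-constrained sampling can be sketched as masking out every token the grammar does not allow at the current position before picking one. This is only a toy illustration of that idea, not llama.cpp's implementation: the vocabulary, the logits, and the "output must start with an opening brace" rule are all hypothetical, and the pick is greedy (highest remaining logit) to keep it deterministic.

```python
import math

def constrained_greedy_pick(logits, allowed_ids):
    """Return the id of the highest-logit token among those the grammar allows."""
    masked = [x if i in allowed_ids else -math.inf for i, x in enumerate(logits)]
    return max(range(len(masked)), key=lambda i: masked[i])

# Hypothetical toy vocabulary and logits, not taken from any real model.
VOCAB = ["{", "}", "hello", '"key"']
logits = [0.5, 0.1, 3.0, 1.0]

# Toy "grammar" rule: the output must begin with an opening brace.
allowed_at_start = {0}

unconstrained = max(range(len(logits)), key=lambda i: logits[i])
constrained = constrained_greedy_pick(logits, allowed_at_start)

print("unconstrained pick:", repr(VOCAB[unconstrained]))
print("constrained pick:  ", repr(VOCAB[constrained]))
```

Without the mask the model's favorite token wins even though it would make the output invalid; with the mask only valid continuations are ever considered, whatever the other sampler settings are.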

micheal65536@lemmy.micheal65536.duckdns.org · 1 year ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Kerfuffle@sh.itjust.works · 1 year ago

I have also tried to generate code using deterministic sampling (always pick the token with the highest probability). I didn’t notice any appreciable improvement.

Well, you said you sometimes did that, so it’s not entirely clear which of your conclusions are based on deterministic sampling and which aren’t. Anyway, like I said, it’s not just temperature that may be causing issues.

I want to be clear I’m not criticizing you personally or anything like that. I’m not trying to catch you out and you don’t have to justify anything about your decisions or approach to me. The only thing I’m trying to do here is provide information that might help you and potentially other people get better results or understand why the results with a certain approach may be better or worse.