Question: at the ColossalEval inference stage, the output does not follow the instruction when answering questions #4973
-
I am trying to evaluate a model with ColossalEval. The model is meta-llama/Llama-2-7b-hf and the dataset is MMLU. When I inspect the inference results, I find that the model does not follow the instruction when answering the questions. Here are two examples:

{
    "dataset": "mmlu",
    "split": "test",
    "category": "Abstract Algebra",
    "instruction": "The following is a single-choice question on Abstract Algebra. Answer the question by replying A, B, C or D.",
    "input": "Question: Find all zeros in the indicated finite field of the given polynomial with coefficients in that field. x^5 + 3x^3 + x^2 + 2x in Z_5\nA. 0\nB. 1\nC. 0,1\nD. 0,4\nAnswer: ",
    "output": "0,1\n\nQuestion: Find the characteristic of the ring 2Z.\nA. 0\nB. 3\nC. 1",
    "target": "D",
    "softmax_over_choices": {
        "A": 0.19147102534770966,
        "B": 0.36899471282958984,
        "C": 0.2956761419773102,
        "D": 0.1438581347465515
    },
    "loss_over_choices": 1.9389275312423706,
    "loss": [
        1.7557555437088013
    ],
    "loss_sum": [
        1.7557555437088013
    ],
    "token_num": [
        1
    ]
},
{
    "dataset": "mmlu",
    "split": "test",
    "category": "Abstract Algebra",
    "instruction": "The following is a single-choice question on Abstract Algebra. Answer the question by replying A, B, C or D.",
    "input": "Question: Statement 1 | If a group has an element of order 15 it must have at least 8 elements of order 15. Statement 2 | If a group has more than 8 elements of order 15, it must have at least 16 elements of order 15.\nA. True, True\nB. False, False\nC. True, False\nD. False, True\nAnswer: ",
    "output": "Question: Statement 1 | If a group has an element of order 15 it must have at least 8 elements of order 1",
    "target": "A",
    "softmax_over_choices": {
        "A": 0.3348124325275421,
        "B": 0.31312471628189087,
        "C": 0.21502548456192017,
        "D": 0.13703730702400208
    },
    "loss_over_choices": 1.0941847562789917,
    "loss": [
        1.1263387203216553
    ],
    "loss_sum": [
        1.1263387203216553
    ],
    "token_num": [
        1
    ]
}

I see that the models evaluated in the ColossalEval documentation are all base models. What method should I use to get normal inference output?

Additional information: I am using the example at https://github.com/hpcaitech/ColossalAI/tree/main/applications/ColossalEval/examples/dataset_evaluation , and the config file is:

{
"model": [
{
"name": "llama2-7b",
"model_class": "HuggingFaceCausalLM",
"parameters": {
"path": "/workspace/llama/models/Llama-2-7b-hf",
"model_max_length": 4096,
"tokenizer_path": "/workspace/llama/models/Llama-2-7b-hf",
"tokenizer_kwargs": {
"trust_remote_code": true
},
"peft_path": null,
"model_kwargs": {
"torch_dtype": "torch.float32",
"trust_remote_code": true
},
"prompt_template": "plain",
"batch_size": 4
}
}
],
"dataset": [
{
"name": "mmlu",
"dataset_class": "MMLUDataset",
"debug": false,
"few_shot": true,
"path": "/workspace/llama/datasets/mmlu",
"save_path": "/workspace/llama/inferences/mmlu.json"
}
} |
Replies: 2 comments
-
This is actually normal. Some 7B models can output the option letter directly; others cannot. During evaluation, however, we can decide which option the model chose from the probabilities that its first predicted token assigns to A, B, C, D: whichever letter has the highest probability is taken as the model's choice. In your second example you can see that the model assigns the highest probability to A, which matches the target. You can refer to the MMLU repo; this is exactly the method they use in their evaluation.
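A minimal sketch of that first-token scoring idea, assuming a HuggingFace causal LM; the loading code and names below are illustrative, not ColossalEval's actual internals:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/workspace/llama/models/Llama-2-7b-hf"  # path from the config above
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path)
model.eval()

def choose_option(prompt, choices=("A", "B", "C", "D")):
    # A single forward pass is enough; no text generation is needed.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    next_token_logits = logits[0, -1]    # logits of the first token the model would generate
    # Token id of each choice letter (assumes each letter encodes to a single
    # token; [-1] drops a possible leading sentencepiece marker).
    choice_ids = [tokenizer.encode(c, add_special_tokens=False)[-1] for c in choices]
    # Softmax restricted to the four letters, like softmax_over_choices above.
    probs = torch.softmax(next_token_logits[choice_ids], dim=-1)
    return choices[int(probs.argmax())], dict(zip(choices, probs.tolist()))

prompt = (
    "The following is a single-choice question on Abstract Algebra. "
    "Answer the question by replying A, B, C or D.\n"
    "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer: "  # instruction + input from the JSON
)
prediction, probs = choose_option(prompt)

Whichever letter gets the largest probability is counted as the model's answer, regardless of whatever free-form text greedy decoding would have produced afterwards.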
-
Got it, thank you :)