#1 Fudan "betrays the motherland": catches qwen red-handed training on benchmarks, yet somehow lets llama off the hook
Posted: July 16, 2025, 21:26
https://arxiv.org/abs/2507.10532
Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks.
Actually, I'm not so sure. I remember that back in middle school, I could often guess where a problem was headed just from reading its opening. Humans have that same mix of memorization and reasoning too.
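The kind of contamination check at issue can be sketched as a partial-prompt completion probe: feed a model the first part of a benchmark problem and measure how much of the held-out remainder it reproduces verbatim. This is a minimal illustrative sketch, not the paper's actual code; `model_complete` is a hypothetical stand-in for a real LLM call.

```python
def ngram_overlap(candidate: str, reference: str, n: int = 5) -> float:
    """Fraction of the reference's token n-grams that appear in the candidate."""
    cand_tokens = candidate.split()
    ref_tokens = reference.split()
    if len(ref_tokens) < n:
        return 0.0
    cand_ngrams = {tuple(cand_tokens[i:i + n])
                   for i in range(len(cand_tokens) - n + 1)}
    ref_ngrams = [tuple(ref_tokens[i:i + n])
                  for i in range(len(ref_tokens) - n + 1)]
    hits = sum(1 for g in ref_ngrams if g in cand_ngrams)
    return hits / len(ref_ngrams)


def contamination_score(problem: str, model_complete, frac: float = 0.5) -> float:
    """Split a benchmark problem, ask the model to continue the prefix,
    and score verbatim overlap with the hidden remainder.
    A score near 1.0 suggests the problem was seen during training."""
    tokens = problem.split()
    cut = int(len(tokens) * frac)
    prefix = " ".join(tokens[:cut])
    remainder = " ".join(tokens[cut:])
    completion = model_complete(prefix)
    return ngram_overlap(completion, remainder)
```

Of course, as the commentary above notes, a high completion rate can reflect genuine familiarity with common problem templates as well as memorization, so a probe like this gives evidence, not proof.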