🏆 MHPP Leaderboard 🏆
MHPP Evaluates AI Coders' Performance against Diverse Code-Generation Challenges
📝 Notes
- Models are ranked by their pass@1 scores under greedy decoding. For the sampling results, we set the temperature to 0.7 and sampled 100 times. We recommend a context length of 1024 tokens, given the length of the problems and of potential responses.
- In the table, a '-' indicates data that was not collected due to limited resources or budget constraints.
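For reference, pass@k from sampled generations is commonly computed with the standard unbiased estimator (given n samples per problem, of which c pass the tests). This is a minimal sketch of that estimator, not necessarily the exact script used for this leaderboard:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: total samples generated for a problem
    c: number of samples that pass all tests
    k: budget of attempts
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 100 samples, 37 correct, estimate pass@1
score = pass_at_k(100, 37, 1)  # equals 37/100 for k=1
```

For k=1 the estimator reduces to the fraction of passing samples, c/n; the per-problem scores are then averaged over the benchmark.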
🤗 Acknowledgement and More Leaderboards
We sincerely thank the authors of the EvalPlus Leaderboard for allowing us to borrow their leaderboard code! Beyond the MHPP leaderboard, we recommend assessing LLM coding ability comprehensively through a diverse set of benchmarks and leaderboards, such as: