aix-bench: a Java benchmark for code synthesis.


AiXcoder NL2Code Evaluation Benchmark (aix-bench)


Paper available:


This is a method-level benchmark for evaluating code generation (synthesis) models, which take natural language as input and produce code as output. The AiXcoder NL2Code Evaluation Benchmark is divided into two datasets:

  1. Automated Test Dataset: Each sample in this part of the dataset contains a functionally independent and well-described natural language function description, the Java function signature of the function, and a set of Java unit tests that verify the correctness of this function.

    The main use of this dataset is to automatically evaluate the correctness of the code generated by the model.

  2. NL Task Description Dataset: Each sample in this part of the dataset contains a relatively independent functional description. This data is closer to real method descriptions found in code and includes some functional descriptions whose details are not fully specified.

    The code generated by the model requires human evaluation. See the evaluation criteria described later in this document.

Datasets         Automated Test Dataset    NL Task Description Dataset
Test Set Size    175                       161

Currently, these two datasets contain only Java code, and the natural language descriptions are in English and Chinese. If you only care about code correctness, you can use just the automated test dataset.


The code in this project uses the MIT open source license.

The data in this project is licensed under the Computational Use of Data Agreement (C-UDA).


If you use code or data from this project, it is recommended that you reference it like this:

  Author = {Yiyang Hao and Ge Li and Yongqiang Liu and Xiaowei Miao and He Zong and Siyuan Jiang and Yang Liu and He Wei},
  Title = {AixBench: A Code Generation Benchmark Dataset},
  Year = {2022},
  Eprint = {arXiv:2206.13179},

Automated Test Dataset

Data file path: src/main/resources/dataset_autotest.jsonl

This data is a collection of hand-picked batches of "Method Comments" from open-sourced "Method Comments - Java Method Implementation" pairs. Our selection criteria are:

  1. The comment clearly describes a function that can be implemented.
  2. The functions are relatively independent and do not depend on the understanding of the context of the project and business logic.
  3. The functionality is reasonable and could occur in a developer's day-to-day work, rather than in programming competition quizzes or coursework.
  4. Comments are descriptions of the objective, rather than descriptions of the implementation process.

On this basis, we extracted the descriptions in the comments, and then made some supplements, so that:

  1. The description contains the specific information necessary to implement the function. For example, "Returns whether or no the JDK version is high enough." gives no clear standard for "high enough", so we manually changed it to "Returns whether or no the JDK version is 1.7u40 and above.".
  2. Parts of the description that are irrelevant to the task are deleted. For example, we removed the second half of the original description "max() that works on three integers. Like many of the other max() functions in this class."

Just like in real-world scenarios, the natural language descriptions contain some grammatical errors, punctuation errors, and inconsistent capitalization. We keep these because we think such perturbations test the model's robustness to noisy input.

NL Task Description Dataset

Data file path: src/main/resources/dataset_manual_nl.jsonl

This data is a collection of hand-picked batches of "Method Comments" from open-sourced "Method Comments - Java Method Implementation" pairs. Our selection criteria are:

  1. The comment clearly describes a function that can be implemented.
  2. The functions are relatively independent and do not depend on the understanding of the context of the project and business logic.
  3. The functionality is reasonable and could occur in a developer's day-to-day work, rather than in programming competition quizzes or coursework.
  4. We allow a certain degree of ambiguity, such as in "Read the encoded image data from a JPEG image.", we do not specify how the read data should be handled. During evaluation, as long as the code generated by the model fully implements the functions described in the description, then a full score is awarded for correctness.

Evaluation standard

We manually evaluate the code generated by the model in three dimensions.


Correctness:

  • 4 points: The specified function is fully realized.
  • 3 points: The main function is realized. However, some details are missing, which does not affect the correctness of the overall logic. A little modification is needed to meet all the requirements.
  • 2 points: Only the core function is implemented. Most of the requirements are not reflected in the code. More modifications are required to meet the requirements.
  • 1 point: The specified function is not implemented at all.

Code Quality:

  • 3 points: The details are in place. No obviously better code in terms of performance exists. If possible, resources are released accordingly. No obvious code smell.
  • 2 points: Some details are not in place. There is code smell of low severity.
  • 1 point: There is a significantly better solution in terms of performance, or there is a serious code smell.


Maintainability:

  • 5 points: The method implementation is very standardized, the variable naming is semantically straightforward, the method is not unnecessarily bloated, the readability is good, the code is short, and the code blocks are clearly structured.
  • 4 points: The method implementation is relatively standardized, the variable naming is basically semantically straightforward, and the readability is better.
  • 3 points: The method implementation meets certain specifications, but some variable names are meaningless, and defective code or deprecated methods are used.
  • 2 points: The code is written in a confusing way, or does not follow a consistent specification, or many variable names are meaningless, or there is repeated and redundant code. Poor readability.
  • 1 point: Very confusing, completely illogical, hard-to-read code.


The dataset includes 175 hand-picked code examples that occur frequently in Java programming, and each example includes the following fields:

{
    "task_id": 166,
    "raw_nl": "通过反射为对象的对应字段注入值",
    "signature": "public <T> T initByReflect(String name, String value, T t)"
}

task_id is the serial number of the example, raw_nl is the natural language description (in this sample, Chinese for "Inject a value into the corresponding field of an object through reflection"), and signature is the signature of the function to be generated. raw_nl and signature together form the input to the model.
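
As a rough illustration of how raw_nl and signature might be combined into a single model input, the sketch below formats them as a Javadoc comment followed by the signature. The PromptSketch class, the buildPrompt method, and the prompt layout itself are assumptions made for this example; the benchmark does not prescribe a prompt format.

```java
/** Hypothetical helper, not part of aix-bench: builds a model input from one sample. */
final class PromptSketch {

    /**
     * Joins the natural language description and the signature into one prompt.
     * The Javadoc-plus-signature layout is an assumption; use whatever your model expects.
     */
    static String buildPrompt(String rawNl, String signature) {
        return "/**\n * " + rawNl + "\n */\n" + signature + " {";
    }

    public static void main(String[] args) {
        // Field values taken from the sample with task_id 166.
        String prompt = buildPrompt(
                "通过反射为对象的对应字段注入值",
                "public <T> T initByReflect(String name, String value, T t)");
        System.out.println(prompt);
    }
}
```

Ending the prompt with an opening brace nudges a completion-style model to continue with the method body; an instruction-style model would likely need a different layout.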

Project structure

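     evaluation package (src/main/java/com/aixcode/autoTest/evaluation): automated test classes, one for each example
     generate package (com.aixcode.autoTest.generate): function-level code that stores model output; a class must be created manually for each example
     Executor: automated test executor
     predictionHelper: converts predicted methods into classes that the automated tests can run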

How to use

1. Download the dataset

git clone

2. Get model predictions

For each test sample, take raw_nl and signature as input and get the model's output. The output becomes the only method of a class named prefix+task_id, where the prefix is user-defined. The class must also extend GenerateMethodBase. In the following example, the user manually creates the class Aixcoder166 (Aixcoder+166) from the model's prediction, and it extends GenerateMethodBase.

import java.lang.reflect.Field;

public class Aixcoder166 extends GenerateMethodBase {
    /**
     * 通过反射为对象的对应字段注入值
     * (Inject a value into the corresponding field of an object through reflection)
     */
    public <T> T initByReflect(String name, Object value, T t) {
        if (null == t) {
            throw new NullPointerException("t can not be null");
        }

        if (null == value) {
            return null;
        }

        Class<?> clazz = t.getClass();

        if (!clazz.isAssignableFrom(value.getClass())) {
            throw new IllegalArgumentException("value must be assignable to" + clazz);
        }

        try {
            Field field = clazz.getDeclaredField(name);
            field.set(t, value);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field:" + name);
        } catch (IllegalAccessException e) {
            throw new IllegalArgumentException("illegal access:" + name);
        }

        return t;
    }
}

The above process can be done in batches. Using the assembleFile method in the predictionHelper class, all classes can be generated at once from the model's prediction output. You still need to manually add the imports that each class requires. Execute the following code:

public class predictionHelper {
    public static void main(String[] args) {
        // ... call assembleFile with your model's prediction output ...
    }
}
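
To make the naming rule concrete, here is a minimal sketch of wrapping one predicted method in a class named prefix+task_id that extends GenerateMethodBase. ClassWrapperSketch, className, and wrapPrediction are hypothetical names; this is not the project's assembleFile implementation. The package name and imports are passed in by the caller because they depend on your setup, and an import for GenerateMethodBase may also be needed, since its package is not stated here.

```java
/** Hypothetical sketch; this is not the assembleFile method from predictionHelper. */
final class ClassWrapperSketch {

    /** Returns the class name for a sample, e.g. "Aixcoder" and 166 give "Aixcoder166". */
    static String className(String prefix, int taskId) {
        return prefix + taskId;
    }

    /**
     * Wraps a predicted method in a class that extends GenerateMethodBase.
     * Imports required by the method body still have to be supplied by the caller.
     */
    static String wrapPrediction(String basePackage, String imports,
                                 String prefix, int taskId, String methodSource) {
        StringBuilder sb = new StringBuilder();
        sb.append("package ").append(basePackage).append(";\n\n");
        if (!imports.isEmpty()) {
            sb.append(imports).append("\n\n");
        }
        sb.append("public class ").append(className(prefix, taskId))
          .append(" extends GenerateMethodBase {\n");
        sb.append(methodSource).append("\n");
        sb.append("}\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        String source = wrapPrediction(
                "com.aixcode.autoTest.generate.aixcoder",
                "import java.lang.reflect.Field;",
                "Aixcoder", 166,
                "    public <T> T initByReflect(String name, Object value, T t) { return t; }");
        System.out.println(source);
    }
}
```

The returned string can then be written to a file named after the class (Aixcoder166.java in this case) inside the package directory.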

3. Finally execute Executor

3.1 Execute test samples one at a time
class Executor{
    private static void evaluationOneExample(String basePackage,String prefix,String fileId){
        try {
            int[] result= evaluationGenerateMethod(fileId,basePackage,prefix);
            System.out.println(prefix+" result:"+result[0]+"/"+result[1]);
        }catch (Exception e){
            e.printStackTrace();
        }
    }
}

You can execute the example above like this:

class Executor{
    public static void main(String[] args) {
        try {
            String taskId = "166";
            String basePackage = "com.aixcode.autoTest.generate.aixcoder";
            String prefix = "Aixcoder";
            // Argument order follows the method signature: basePackage, prefix, fileId.
            evaluationOneExample(basePackage, prefix, taskId);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

3.2 Execute all test samples at once
class Executor{
    //Executing all samples. This will iterate through all evaluation classes under src/main/java/com/aixcode/autoTest/evaluation
    public static double[] runAllTest(String basePackage, String prefix, int minFileId, int maxFileId) {
        try {
            List<String> fileNames = listFiles("src/main/java/com/aixcode/autoTest/evaluation");
            List<String> fileIds = fileNames.stream().map(fileName -> fileName.substring("Evaluation".length(), fileName.lastIndexOf("."))).collect(Collectors.toList());

            double copilot_score = 0;
            int CopilotExacttCount = 0;
            int totalCount = 0;
            for (String fileId : fileIds) {
                // Skip samples whose id falls outside [minFileId, maxFileId].
                if (!(Integer.parseInt(fileId) >= minFileId && Integer.parseInt(fileId) <= maxFileId)) {
                    continue;
                }
                int[] result = evaluationGenerateMethod(fileId, basePackage, prefix);
                if (result != null && result.length == 2 && result[1] != 0) {
                    copilot_score += (double) result[0] / result[1];
                    if (result[0] == result[1]) {
                        CopilotExacttCount++;
                    }
                    totalCount++;
                }
            }

            return new double[]{copilot_score, CopilotExacttCount, totalCount};
        } catch (Exception e) {
            e.printStackTrace();
        }
        return new double[]{0, 0, 0};
    }
}

To perform the above tasks, you can do the following:

class Executor {
    public static void main(String[] args) {
        try {
            double[] res=runAllTest("com.aixcode.autoTest.generate.aixcoderFirstHalf", "AixcoderAuto", 0, 103);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
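
runAllTest returns a summed pass ratio, a count of samples that passed every test, and a sample count. A minimal sketch of turning per-sample results into summary figures follows. ScoreSummarySketch, summarize, and the choice to report an average pass ratio and an exact-pass rate are assumptions for illustration; they are not part of Executor.

```java
/** Hypothetical sketch of aggregating {passed, total} pairs; not part of Executor. */
final class ScoreSummarySketch {

    /**
     * Each row is {passedTests, totalTests} for one sample.
     * Returns {averagePassRatio, exactPassRate}.
     * Rows that are null, malformed, or have no tests are skipped.
     */
    static double[] summarize(int[][] results) {
        double ratioSum = 0;
        int exactCount = 0;
        int counted = 0;
        for (int[] r : results) {
            if (r == null || r.length != 2 || r[1] == 0) {
                continue;
            }
            ratioSum += (double) r[0] / r[1];
            if (r[0] == r[1]) {
                exactCount++;
            }
            counted++;
        }
        if (counted == 0) {
            return new double[]{0, 0};
        }
        return new double[]{ratioSum / counted, (double) exactCount / counted};
    }

    public static void main(String[] args) {
        // Made-up results for four samples, used only to exercise the method.
        int[][] results = {{3, 3}, {1, 2}, {0, 4}, {2, 2}};
        double[] summary = summarize(results);
        System.out.println("average pass ratio: " + summary[0]);
        System.out.println("exact pass rate: " + summary[1]);
    }
}
```

Dividing by the number of counted samples keeps the two figures comparable across runs that cover different task id ranges.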


Contributing

  • Fork the repository
  • Create a Feat_xxx branch
  • Submit code
  • Create a pull request