AixBench: a Java benchmark for the code synthesis problem.

Overview

AiXcoder NL2Code Evaluation Benchmark (aix-bench)

A Simplified Chinese (简体中文) version of this README is also available.

Paper available: https://arxiv.org/abs/2206.13179

Introduction

This is a method-level benchmark for evaluating code generation (synthesis) models, which take natural language as input and produce code as output. The AiXcoder NL2Code Evaluation Benchmark is divided into two datasets:

  1. Automated Test Dataset: Each sample in this part of the dataset contains a functionally independent, well-described natural language function description, the Java signature of the function, and a set of Java unit tests that verify the correctness of the function.

    The main use of this dataset is to automatically evaluate the correctness of the code generated by the model.

  2. NL Task Description Dataset: Each sample in this part of the dataset contains a relatively independent functional description. This data is closer to real method descriptions found in code, and some descriptions leave certain details underspecified.

    The code generated by the model requires human evaluation; refer to the evaluation criteria described later in this document.

Dataset        | Automated Test Dataset | NL Task Description Dataset
Test Set Size  | 175                    | 161

Currently, both datasets contain only Java code, and the natural language descriptions are written in English and Chinese. If you only care about code correctness, you can use just the Automated Test Dataset.

License

The code in this project is licensed under the MIT open source license.

The data in this project is licensed under the Computational Use of Data Agreement (C-UDA).

Referencing

If you use code or data from this project, please cite it as follows:

@misc{2206.13179,
  Author = {Yiyang Hao and Ge Li and Yongqiang Liu and Xiaowei Miao and He Zong and Siyuan Jiang and Yang Liu and He Wei},
  Title = {AixBench: A Code Generation Benchmark Dataset},
  Year = {2022},
  Eprint = {arXiv:2206.13179},
}

Automated Test Dataset

Data file path: src/main/resources/dataset_autotest.jsonl

This data is a hand-picked collection of "Method Comments" taken from open-source "Method Comment - Java Method Implementation" pairs. Our selection criteria are:

  1. Comments well describe a function that can be implemented.
  2. The functions are relatively independent and do not depend on the understanding of the context of the project and business logic.
  3. The functionality is reasonable and could occur in a developer's day-to-day work, rather than being a programming competition quiz or coursework.
  4. Comments are descriptions of the objective, rather than descriptions of the implementation process.

On this basis, we extracted the descriptions from the comments and then supplemented them, so that:

  1. The description contains the specific information necessary to implement the function. For example, "Returns whether or no the JDK version is high enough." gives no clear standard for "high enough", so we manually supplemented it to "Returns whether or no the JDK version is 1.7u40 and above.".
  2. Parts of the description irrelevant to the task are deleted. For example, from the original description "max() that works on three integers. Like many of the other max() functions in this class.", we removed the second half.

Just as in real-world scenarios, the natural language descriptions contain occasional grammatical errors, punctuation mistakes, and inconsistent capitalization. We keep these because such perturbations test the model's robustness to noisy input.

NL Task Description Dataset

Data file path: src/main/resources/dataset_manual_nl.jsonl

This data is a hand-picked collection of "Method Comments" taken from open-source "Method Comment - Java Method Implementation" pairs. Our selection criteria are:

  1. Comments well describe a function that can be implemented.
  2. The functions are relatively independent and do not depend on the understanding of the context of the project and business logic.
  3. The functionality is reasonable and could occur in a developer's day-to-day work, rather than being a programming competition quiz or coursework.
  4. We allow a certain degree of ambiguity. For example, in "Read the encoded image data from a JPEG image.", we do not specify how the read data should be handled. During evaluation, as long as the code generated by the model fully implements the functionality described, it receives a full score for correctness.

Evaluation standard

We manually evaluate the code generated by the model along three dimensions; an illustrative sketch of recording such scores follows the rubric below.

Correctness:

  • 4 points: The specified function is fully realized.
  • 3 points: The main function is realized, but some details are missing; this does not affect the correctness of the overall logic, and a little modification is needed to meet all the requirements.
  • 2 points: Only the core function is implemented. Most of the requirements are not reflected in the code. More modifications are required to meet the requirements.
  • 1 point: The specified function is not implemented at all.

Code Quality:

  • 3 points: The details are in place. No obviously better-performing code exists. Where applicable, resources are released accordingly. No obvious code smell.
  • 2 points: Some details are not in place. There are code smells of low severity.
  • 1 point: There is a significantly better solution in terms of performance, or there is serious code smell.

Maintainability:

  • 5 points: The method implementation is well standardized, variable naming is semantically clear, the method is not unnecessarily bloated, the code is short and readable, and the code blocks are clearly structured.
  • 4 points: The method implementation is relatively standardized, variable naming is mostly semantically clear, and readability is good.
  • 3 points: The method implementation meets certain standards, but some variable names are meaningless and defective code or deprecated methods are used.
  • 2 points: The code is written in a confusing way, does not follow a consistent standard, uses many meaningless variable names, or contains repeated and redundant code. Readability is poor.
  • 1 point: Very confusing, completely illogical, hard-to-read code.
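
Nothing in this repository computes these scores automatically; they are assigned by human reviewers. Purely as an illustration of how per-sample scores could be recorded and summarized, the sketch below defines a hypothetical ManualScore class with a simple per-dimension mean; neither the class nor the aggregation is part of the benchmark or the paper's official methodology.

import java.util.Arrays;
import java.util.List;

// Hypothetical container for one manually scored sample; not part of this repository.
public class ManualScore {
    final int correctness;      // 1..4, per the rubric above
    final int codeQuality;      // 1..3
    final int maintainability;  // 1..5

    ManualScore(int correctness, int codeQuality, int maintainability) {
        if (correctness < 1 || correctness > 4
                || codeQuality < 1 || codeQuality > 3
                || maintainability < 1 || maintainability > 5) {
            throw new IllegalArgumentException("score out of range");
        }
        this.correctness = correctness;
        this.codeQuality = codeQuality;
        this.maintainability = maintainability;
    }

    // One possible summary: the mean of each dimension over all scored samples.
    static double[] mean(List<ManualScore> scores) {
        double c = 0, q = 0, m = 0;
        for (ManualScore s : scores) {
            c += s.correctness;
            q += s.codeQuality;
            m += s.maintainability;
        }
        return new double[]{c / scores.size(), q / scores.size(), m / scores.size()};
    }

    public static void main(String[] args) {
        List<ManualScore> scores = Arrays.asList(new ManualScore(4, 3, 5), new ManualScore(3, 2, 4));
        System.out.println(Arrays.toString(mean(scores)));
    }
}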

Dataset

The dataset includes 175 hand-picked code examples that occur frequently in Java programming, and each example includes the following fields:

{
  "task_id": 166,
  "raw_nl": "通过反射为对象的对应字段注入值",
  "signature": "public <T> T initByReflect(String name, String value, T t)"
}

task_id is the serial number of the example; raw_nl is the natural language description (the example above is Chinese for "inject a value into the corresponding field of an object via reflection"); signature is the signature of the function to be generated. raw_nl and signature are used together as the input of the model.
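
As a sketch of how these samples might be consumed, the snippet below reads dataset_autotest.jsonl line by line and assembles a prompt from raw_nl and signature. It assumes the fields shown above and a JSON library such as Gson on the classpath; the DatasetReader class and the prompt layout are illustrative, not part of this repository.

import com.google.gson.Gson;
import com.google.gson.JsonObject;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Illustrative reader; the prompt format is an assumption, not the benchmark's required input format.
public class DatasetReader {
    public static void main(String[] args) throws IOException {
        Gson gson = new Gson();
        List<String> lines = Files.readAllLines(Paths.get("src/main/resources/dataset_autotest.jsonl"));
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue;
            }
            JsonObject sample = gson.fromJson(line, JsonObject.class);
            int taskId = sample.get("task_id").getAsInt();
            String rawNl = sample.get("raw_nl").getAsString();
            String signature = sample.get("signature").getAsString();
            // raw_nl and signature together form the model input.
            String prompt = "/**\n * " + rawNl + "\n */\n" + signature;
            System.out.println("task " + taskId + ":\n" + prompt + "\n");
        }
    }
}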

Project structure

src/main/java/com/aixcode/autoTest/evaluation/
     Automated test classes for testing each example
src/main/java/com/aixcode/autoTest/generate/
     Function-level code storing the model output; a class must be created manually for each example
src/main/java/com/aixcode/autoTest/Executor.java
     Automated test executor
src/main/java/com/aixcode/autoTest/predictionHelper.java
     Convert predicted methods into classes that can be tested by automation

How to use

1. Download the dataset

git clone https://github.com/aixcoder-plugin/nl2code-dataset.git

2. Get model predictions

For each test sample, take raw_nl and signature as the model input and obtain the model's output. The output is the only method of a class whose name is prefix+task_id, where the prefix is user-defined, and the class must extend the GenerateMethodBase class. For the example below, based on the model's prediction output, the user manually creates the following class, named Aixcoder166 (Aixcoder + 166), which extends GenerateMethodBase:

public class Aixcoder166 extends GenerateMethodBase {
    /**
     * 通过反射为对象的对应字段注入值
     */
    public <T> T initByReflect(String name, Object value, T t) {
        if (null == t) {
            throw new NullPointerException("t can not be null");
        }

        if (null == value) {
            return null;
        }

        Class<?> clazz = t.getClass();

        if (!clazz.isAssignableFrom(value.getClass())) {
            throw new IllegalArgumentException("value must be assignable to " + clazz);
        }

        try {
            Field field = clazz.getDeclaredField(name);
            field.setAccessible(true);
            field.set(t, value);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("no such field:" + name);
        } catch (IllegalAccessException e) {
            throw new IllegalArgumentException("illegal access:" + name);
        }

        return t;
    }
}

The above process can be automated in batches. The assembleFile method in the predictionHelper class generates all classes from the model's prediction output; the required dependency imports still need to be added manually to each class. Execute the following code:

public class predictionHelper {
    public static void main(String[] args) {
        assembleFile("src/main/resources/prediction.jsonl");
    }
}

3. Finally, execute Executor

3.1 Executing test samples one at a time

class Executor {
    private static void evaluationOneExample(String basePackage, String prefix, String fileId) {
        try {
            int[] result = evaluationGenerateMethod(fileId, basePackage, prefix);
            System.out.println(prefix + " result:" + result[0] + "/" + result[1]);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

You can execute the example above like this:

class Executor {
    public static void main(String[] args) {
        try {
            String taskId = "166";
            String basePackage = "com.aixcode.autoTest.generate.aixcoder";
            String prefix = "Aixcoder";
            // Argument order follows the evaluationOneExample signature: basePackage, prefix, fileId.
            evaluationOneExample(basePackage, prefix, taskId);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
3.2 Executing all test samples at once

class Executor {
    // Executes all samples by iterating over the evaluation classes under src/main/java/com/aixcode/autoTest/evaluation
    public static double[] runAllTest(String basePackage, String prefix, int minFileId, int maxFileId) {
        try {
            List<String> fileNames = listFiles("src/main/java/com/aixcode/autoTest/evaluation");
            List<String> fileIds = fileNames.stream()
                    .map(fileName -> fileName.substring("Evaluation".length(), fileName.lastIndexOf(".")))
                    .collect(Collectors.toList());

            double totalScore = 0;     // sum of per-sample pass ratios
            int exactMatchCount = 0;   // samples for which every unit test passed
            int totalCount = 0;        // samples executed
            for (String fileId : fileIds) {
                if (!(Integer.parseInt(fileId) >= minFileId && Integer.parseInt(fileId) <= maxFileId)) {
                    continue;
                }
                totalCount++;
                int[] result = evaluationGenerateMethod(fileId, basePackage, prefix);
                if (result != null && result.length == 2 && result[1] != 0) {
                    totalScore += (double) result[0] / result[1];
                    if (result[0] == result[1]) {
                        exactMatchCount++;
                    }
                }
            }

            return new double[]{totalScore, exactMatchCount, totalCount};
        } catch (Exception e) {
            e.printStackTrace();
        }
        return new double[]{0, 0, 0};
    }
}

To run all samples, you can do the following:

class Executor {
    public static void main(String[] args) {
        try {
            double[] res = runAllTest("com.aixcode.autoTest.generate.aixcoderFirstHalf", "AixcoderAuto", 0, 103);
            System.out.println("result:" + res[0] + "/" + res[1] + "/" + res[2]);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
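
The array returned by runAllTest packs three values: the summed per-sample pass ratios, the number of samples whose unit tests all passed, and the number of samples executed. As a hedged follow-up (the rate names below are ours, not part of the repository), these can be converted into summary rates:

class Executor {
    public static void main(String[] args) {
        // res[0]: summed per-sample pass ratios, res[1]: samples with all tests passing, res[2]: samples executed
        double[] res = runAllTest("com.aixcode.autoTest.generate.aixcoderFirstHalf", "AixcoderAuto", 0, 103);
        double avgTestPassRate = res[2] > 0 ? res[0] / res[2] : 0;    // average fraction of unit tests passed per sample
        double fullyPassingRate = res[2] > 0 ? res[1] / res[2] : 0;   // fraction of samples passing every unit test
        System.out.println("average unit-test pass rate: " + avgTestPassRate);
        System.out.println("fully passing samples: " + fullyPassingRate);
    }
}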

Contributing

  • Fork the repository
  • Create a Feat_xxx branch
  • Commit your code
  • Create a pull request