About BIRD
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) represents a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs, 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare and education, etc.
News
- Sept 25, 2023: We have released a cleaner version of
dev set
. Please download dev set again. We checked all cases of dev set and fixed all errors that we found. After cleaning, the ChatGPT (gpt-3.5-turbo) and GPT4 (gpt-4-32k) EX scores have improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. Thanks for all feedbacks! - Sept 21, 2023: Our paper has been accepted by
NeurIPS 2023
as aSpotlight
!!! Thanks for all the efforts and suggestions of co-authors, anonymous reviewers, awesome researchers/users in github or emails. - July 17, 2023: We update newest results of GPT-4, Claude-2 and Palm-2. Additionally, we have some Azure OpenAI API quotas for free LLM-based test now.
- July 14, 2023: The data link has been updated, fixing the schema names in the CSV files. Additionally, tied results caused by
order_by limit 1
are now considered. Both SQL queries - with and without accounting for tied results - are valid at this time. - Jun 12, 2023: We are welcome to any suggestions and reported gold errors in help_wanted. Any of your help is appreciated!
- Jun 5, 2023: We open-sourced our Graphix-T5, a graph-aware semi-pretrained text-to-text PLM specifically designed to improve multi-hop reasoning for the complex text-to-SQL task.
- May 30, 2023: If you are interested in ICL, please check out our interesting work deep-thinkingπ€. Generate 1000 models for 1000 people smoothly!
Surprise from BIRD
1. Large and Dirty values: Due to the nature of the real-world scenarios from which BIRD's database values were collected, they typically retain their original and frequently "dirty" format. Hence, text-to-SQL parsers must first analyze these values to account for their non-standard format before engaging in reasoning.


2. External Knowledge: "account.type = 'OWNER'" can be inferred by the knowledge evidence: "The condition of the loans require the account type should be the owner."

3. Text-to-Efficient-SQL: BIRD is the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient. This emphasis on efficiency is especially valuable in real-world data / business analysis circumstances.

Submission
We support evaluation for open or closed-source LLMs. Please connect bird.bench23@gmail.com
for test evaluation. Currently, we have several free quota from Azure OpenAI for GPT-4, GPT-4-32k, GPT-turbo supported by HKU ITS.
These API can be called via ChatCompletion
instead of Completion
. Feel free to contact us!
Subscribe to BIRD Update
Bird is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates of the dataset, you can leave your email address.
Citation
@misc{li2023llm, title={Can LLM Already Serve as A Database Interface? A BIg Bench for Large-Scale Database Grounded Text-to-SQLs}, author={Jinyang Li and Binyuan Hui and Ge Qu and Binhua Li and Jiaxi Yang and Bowen Li and Bailin Wang and Bowen Qin and Rongyu Cao and Ruiying Geng and Nan Huo and Chenhao Ma and Kevin C. C. Chang and Fei Huang and Reynold Cheng and Yongbin Li}, year={2023}, eprint={2305.03111}, archivePrefix={arXiv}, primaryClass={cs.CL} }
Model | Code | Size | Orale Knowledge | Dev (%) | Test (%) | |
---|---|---|---|---|---|---|
Human Performance Data Engineers + DB Students |
βοΈ | 92.96 | ||||
π1 Aug 15, 2023 |
DIN-SQL + GPT-4 University of Alberta [Pourreza et al. 2023] |
[link] | UNK | βοΈ | 50.72 | 55.90 |
π₯2 Jul 01, 2023 |
GPT-4 Baseline |
[link] | UNK | βοΈ | 46.35 | 54.89 |
π₯3 Jul 16, 2023 |
Claude-2 Baseline |
[link] | UNK | βοΈ | 42.70 | 49.02 |
4 Mar 17, 2023 |
ChatGPT + CoT HKU & DAMO [Li et al. 2023] |
[link] | UNK | βοΈ | 36.64 | 40.08 |
5 Mar 17, 2023 |
ChatGPT Baseline |
UNK | βοΈ | 37.22 | 39.30 | |
6 Feb 17, 2023 |
Codex Baseline |
175B | βοΈ | 34.35 | 36.47 | |
7 Jul 16, 2023 |
Palm-2 Baseline |
[link] | UNK | βοΈ | 27.38 | 33.04 |
8 Mar 17, 2023 |
ChatGPT + CoT HKU & DAMO [Li et al. 2023] |
[link] | UNK | 25.88 | 28.95 | |
9 Mar 17, 2023 |
ChatGPT Baseline |
UNK | 24.05 | 26.77 | ||
10 Feb 17, 2023 |
Codex Baseline |
175B | 25.42 | 24.86 | ||
11 Feb 5, 2023 |
T5-3B Baseline |
3B | βοΈ | 23.34 | 24.05 | |
12 Feb 3, 2023 |
T5-Large Baseline |
770M | βοΈ | 19.75 | 20.94 | |
13 Feb 3, 2023 |
T5-Base Baseline |
220M | βοΈ | 11.54 | 12.89 | |
14 Feb 5, 2023 |
T5-3B Baseline |
3B | 10.37 | 11.17 | ||
15 Feb 3, 2023 |
T5-Large Baseline |
770M | 9.71 | 10.38 | ||
16 Feb 3, 2023 |
T5-Base Baseline |
220M | 6.32 | 7.06 |
Model | Code | Size | Oracle Knowledge | Dev | Test | |
---|---|---|---|---|---|---|
Human Performance Data Engineers + DB Students |
βοΈ | 90.27 | ||||
π1 Jul 01, 2023 |
GPT-4 Baseline |
[link] | UNK | βοΈ | 49.77 | 60.77 |
π₯2 Aug 15, 2023 |
DIN-SQL + GPT-4 University of Alberta [Pourreza et al. 2023] |
[link] | UNK | βοΈ | 58.79 | 59.44 |
π₯3 Mar 17, 2023 |
ChatGPT + CoT HKU & DAMO [Li et al. 2023] |
[link] | UNK | βοΈ | 42.30 | 56.56 |
4 Mar 17, 2023 |
ChatGPT Baseline |
UNK | βοΈ | 43.81 | 51.40 | |
5 Mar 17, 2023 |
ChatGPT + CoT HKU & DAMO [Li et al. 2023] |
[link] | UNK | 32.33 | 49.69 | |
6 Feb 17, 2023 |
Codex Baseline |
175B | βοΈ | 43.41 | 41.60 | |
7 Mar 17, 2023 |
ChatGPT Baseline |
UNK | 27.97 | 36.68 | ||
8 Feb 17, 2023 |
Codex Baseline |
175B | 33.37 | 35.40 | ||
9 Feb 5, 2023 |
T5-3B Baseline |
3B | βοΈ | 25.57 | 27.80 | |
10 Feb 3, 2023 |
T5-Large Baseline |
770M | βοΈ | 22.74 | 25.00 | |
11 Feb 5, 2023 |
T5-3B Baseline |
3B | 13.62 | 15.17 | ||
12 Feb 3, 2023 |
T5-Base Baseline |
220M | βοΈ | 12.90 | 14.70 | |
13 Feb 3, 2023 |
T5-Large Baseline |
770M | 9.90 | 12.25 | ||
14 Feb 3, 2023 |
T5-Base Baseline |
220M | 7.78 | 8.97 |