About BIRD
BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a pioneering, cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 big databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.
News
- Aug. 4, 2024: The Reward-based Valid Efficiency Score (R-VES) will be used as the efficiency metric for future test submissions. The rationale and formula for R-VES can be found in the Mini-Dev repository. You can check the legacy VES scores for previous submissions here.
- Jun. 30, 2024: Due to the large number of test submission requests involving mixed models (open-source + GPU-based closed-source), we have updated the submission instructions to shorten your waiting time. Please check them out!
- Jun. 30, 2024: If you are interested in code agents, do not miss the SOTA code agent implementation by OpenDevin for BIRD Dev!
- Jun. 27, 2024: Excited to announce the release of our BIRD Mini-Dev dataset with 500 high-quality examples. This dataset includes all BIRD keywords, with modifications for questions such as the addition of window functions. We are the first to deliver it not only in SQLite, but also in MySQL and PostgreSQL. We include Soft-F1 and R-VES metrics to reduce bias. Don't miss the column_meaning.json file, preprocessed by TA-SQL, available for the dev and test sets. Check out our work here: Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation (TA-SQL), appearing in ACL 2024 Findings.
- Apr. 27, 2024: Due to the large volume of requests, we have changed the license of our data to CC BY-SA 4.0. However, we take no responsibility for any misuse of our data, since we developed this benchmark for research and healthy applications only.
- Mar 13, 2024: Please also take a look at our related work: Tapilot-Crossing, the first challenging and more realistic benchmark designed to evaluate Large Language Model (LLM) agents on interactive data analysis tasks. The code includes Python and Private Library, and it covers 6 common agent actions in evaluation.
- Sept 25, 2023: We have released a cleaner version of the dev set. Please download the dev set again. We checked all cases in the dev set and fixed all errors that we found. After cleaning, the ChatGPT (gpt-3.5-turbo) and GPT4 (gpt-4-32k) EX scores improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. Thanks for all the feedback!
- Sept 21, 2023: Our paper has been accepted by NeurIPS 2023 as a Spotlight! Thanks for all the efforts and suggestions of co-authors, anonymous reviewers, and awesome researchers/users on GitHub or via email.
- July 17, 2023: We have added the newest results of GPT-4, Claude-2 and Palm-2.
- July 14, 2023: The data link has been updated, fixing the schema names in the CSV files. Additionally, tied results caused by order_by limit 1 are now considered. Both SQL queries, with and without accounting for tied results, are valid at this time.
- Jun 12, 2023: We welcome any suggestions and reports of gold errors in help_wanted. Any help is appreciated!
- Jun 5, 2023: We open-sourced our Graphix-T5, a graph-aware semi-pretrained text-to-text PLM specifically designed to improve multi-hop reasoning for the complex text-to-SQL task.
- May 30, 2023: If you are interested in ICL, please check out our interesting work deep-thinking🤔. Generate 1000 models for 1000 people smoothly!
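The July 14, 2023 note about tied results can be illustrated with a small sketch. The table, column names, and values below are hypothetical and are not taken from BIRD; the point is only that a query ending in ORDER BY ... LIMIT 1 returns a single row even when several rows share the top value, while a MAX() subquery returns all of them.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (name TEXT, goals INTEGER);
    INSERT INTO player VALUES ('Ann', 30), ('Bob', 30), ('Cy', 12);
""")

# Variant A: ORDER BY ... LIMIT 1 keeps one row, even though two players tie.
top_one = conn.execute(
    "SELECT name FROM player ORDER BY goals DESC LIMIT 1"
).fetchall()

# Variant B: a MAX() subquery keeps every row that shares the top value.
top_all = conn.execute(
    "SELECT name FROM player "
    "WHERE goals = (SELECT MAX(goals) FROM player) ORDER BY name"
).fetchall()

print(len(top_one), "row from variant A:", top_one)
print(len(top_all), "rows from variant B:", top_all)
```

Which of the tied rows variant A returns is not guaranteed, which is presumably why both forms are accepted as valid.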
Surprise from BIRD
1. Large and Dirty values: Due to the nature of the real-world scenarios from which BIRD's database values were collected, they typically retain their original and frequently "dirty" format. Hence, text-to-SQL parsers must first analyze these values to account for their non-standard format before engaging in reasoning.
2. External Knowledge: "account.type = 'OWNER'" can be inferred from the knowledge evidence: "The condition of the loans require the account type should be the owner."
3. Text-to-Efficient-SQL: BIRD is the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient. This emphasis on efficiency is especially valuable in real-world data / business analysis circumstances.
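As a rough illustration of the external-knowledge point above, the sketch below builds a tiny in-memory SQLite database and compares a query written without the evidence to one that applies the condition account.type = 'OWNER'. The schema, rows, and variable names are assumptions made up for this example; they are not BIRD's actual database or evaluation code.

```python
import sqlite3

# Hypothetical mini-schema; table and column names are illustrative,
# not BIRD's actual schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE loan (loan_id INTEGER PRIMARY KEY, account_id INTEGER, amount INTEGER);
    INSERT INTO account VALUES (1, 'OWNER'), (2, 'DISPONENT'), (3, 'OWNER');
    INSERT INTO loan VALUES (10, 1, 5000), (11, 2, 7000), (12, 3, 1200);
""")

# Without the evidence, a parser has no reason to filter on account type.
naive_sql = "SELECT COUNT(*) FROM loan"

# With the evidence, the condition account.type = 'OWNER' joins the query.
grounded_sql = """
    SELECT COUNT(*)
    FROM loan
    JOIN account ON account.account_id = loan.account_id
    WHERE account.type = 'OWNER'
"""

naive_count = conn.execute(naive_sql).fetchone()[0]
grounded_count = conn.execute(grounded_sql).fetchone()[0]
print("without evidence:", naive_count)
print("with evidence:", grounded_count)
```

The two counts differ on this toy data, which is the kind of gap the knowledge evidence is meant to close.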
Submission
Please follow the Submission Guideline (below) and contact bird.bench23@gmail.com for test evaluation. Usually, we will return your results within 10 days!
Subscribe to BIRD Update
BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.
Citation
@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}
| Date | Model | Code | Size | Oracle Knowledge | Dev (%) | Test (%) |
|---|---|---|---|---|---|---|
| | Human Performance Data Engineers + DB Students | | | ✔️ | | 92.96 |
| Sep 1, 2024 | AskData + GPT-4o AT&T - CDO | | UNK | ✔️ | 72.03 | 72.39 |
| Aug 21, 2024 | OpenSearch-SQL, v2 + GPT-4o Alibaba Cloud | | UNK | ✔️ | 69.30 | 72.28 |
| Jul 22, 2024 | Distillery + GPT-4o Distyl AI Research [Maamari et al. '24] | | UNK | ✔️ | 67.21 | 71.83 |
| Aug 1, 2024 | ExSL + granite-34b-code IBM Research AI | | 34B | ✔️ | 67.47 | 70.37 |
| Aug 28, 2024 | Insights AI Uber Freight | | UNK | ✔️ | 72.16 | 70.26 |
| Aug 30, 2024 | PURPLE + RED + GPT-4o Fudan University + Transwarp Technology | | UNK | ✔️ | 68.12 | 70.21 |
| Jul 14, 2024 | RECAP + Gemini Google Cloud | | UNK | ✔️ | 66.95 | 69.03 |
| Jul 2, 2024 | ByteBrain ByteDance Infra Lab | | 33B | ✔️ | 65.45 | 68.87 |
| May 14, 2024 | ExSL + granite-20b-code IBM Research AI | | 20B | ✔️ | 65.38 | 67.86 |
| May 21, 2024 | CHESS Stanford [Talaei et al.'24] | [link] | UNK | ✔️ | 65.00 | 66.69 |
| Aug 29, 2024 | Arcwise + GPT-4o Arcwise | | UNK | ✔️ | 67.99 | 66.21 |
| Sep 1, 2024 | AskData + GPT-4o AT&T - CDO | | UNK | | 65.19 | 65.62 |
| Jan 14, 2024 | MCS-SQL + GPT-4 Dunamu | | UNK | ✔️ | 63.36 | 65.45 |
| Aug 20, 2024 | SCL-SQL Xffuture | | UNK | ✔️ | 64.73 | 65.23 |
| Apr 08, 2024 | OpenSearch-SQL,v1 + GPT-4 Alibaba Cloud | | UNK | ✔️ | 61.34 | 64.95 |
| Feb 27, 2024 | PB-SQL, v1 Seoul National University | | UNK | ✔️ | 60.50 | 64.84 |
| Jun 7, 2024 | SFT CodeS-15B + SQLFixAgent Soochow University | | UNK | ✔️ | -- | 64.62 |
| Aug 30, 2024 | PURPLE + GPT-4o Fudan University + Transwarp Technology | | UNK | ✔️ | 62.97 | 64.51 |
| Feb 21, 2024 | Sense Anonymous | | 13B | ✔️ | 55.48 | 63.39 |
| Apr 10, 2024 | GRA-SQL Tencent CDP-youpu | | UNK | ✔️ | 62.58 | 63.22 |
| Jun 1, 2024 | SuperSQL HKUST(GZ) [Li et al. '24] | [link] | UNK | ✔️ | 58.50 | 62.66 |
| Mar 27, 2024 | {Chat2Query} (GPT-4 + data entity modeling) (PingCAP) PingCAP | [link] | UNK | ✔️ | 58.15 | 60.98 |
| Nov 16, 2023 | Dubo-SQL, v1 Mercator Technologies | | UNK | ✔️ | 59.71 | 60.71 |
| Oct 12, 2023 | SFT CodeS-15B Renmin University of China [Li et al. SIGMOD'24] | [link] | 15B | ✔️ | 58.47 | 60.37 |
| Feb 27, 2024 | DTS-SQL + DeepSeek 7B University of Alberta [Pourreza et al. '24] | [link] | 7B | ✔️ | 55.8 | 60.31 |
| Nov 21, 2023 | MAC-SQL + GPT-4 BUAA & Tencent [Wang et al. '23] | | UNK | ✔️ | 57.56 | 59.59 |
| Oct 12, 2023 | SFT CodeS-7B Renmin University of China [Li et al. SIGMOD'24] | [link] | 7B | ✔️ | 57.17 | 59.25 |
| May 27, 2024 | TA-SQL + GPT-4 HKU [Qu et al. ACL Findings'24] | [link] | UNK | ✔️ | 56.19 | 59.14 |
| Nov 09, 2023 | DAIL-SQL + GPT-4 Alibaba Group [Gao and Wang et al. VLDB'24] | [link] | UNK | ✔️ | 54.76 | 57.41 |
| May 24, 2024 | ExSL + granite-20b-code IBM Research AI | | 20B | | 51.69 | 57.13 |
| Aug 10, 2024 | DeepSeek Baseline | [link] | 236B | ✔️ | 56.13 | 56.68 |
| Aug 15, 2023 | DIN-SQL + GPT-4 University of Alberta [Pourreza et al. '23] | [link] | UNK | ✔️ | 50.72 | 55.90 |
| Aug 08, 2024 | Mistral Baseline | [link] | 123B | ✔️ | 53.52 | 55.84 |
| Jul 01, 2023 | GPT-4 Baseline | [link] | UNK | ✔️ | 46.35 | 54.89 |
| Jul 16, 2023 | Claude-2 Baseline | [link] | UNK | ✔️ | 42.70 | 49.02 |
| Nov 23, 2023 | Open-SQL Anonymous | | 7B | ✔️ | 37.68 | 47.74 |
| Mar 17, 2023 | ChatGPT + CoT HKU & DAMO [Li et al. NeurIPS'23] | [link] | UNK | ✔️ | 36.64 | 40.08 |
| Mar 17, 2023 | ChatGPT Baseline | | UNK | ✔️ | 37.22 | 39.30 |
| Feb 17, 2023 | Codex Baseline | | 175B | ✔️ | 34.35 | 36.47 |
| Jul 16, 2023 | Palm-2 Baseline | [link] | UNK | ✔️ | 27.38 | 33.04 |
| Mar 17, 2023 | ChatGPT + CoT HKU & DAMO [Li et al. NeurIPS'23] | [link] | UNK | | 25.88 | 28.95 |
| Mar 17, 2023 | ChatGPT Baseline | | UNK | | 24.05 | 26.77 |
| Feb 17, 2023 | Codex Baseline | | 175B | | 25.42 | 24.86 |
| Feb 5, 2023 | T5-3B Baseline | | 3B | ✔️ | 23.34 | 24.05 |
| Feb 3, 2023 | T5-Large Baseline | | 770M | ✔️ | 19.75 | 20.94 |
| Feb 3, 2023 | T5-Base Baseline | | 220M | ✔️ | 11.54 | 12.89 |
| Feb 5, 2023 | T5-3B Baseline | | 3B | | 10.37 | 11.17 |
| Feb 3, 2023 | T5-Large Baseline | | 770M | | 9.71 | 10.38 |
| Feb 3, 2023 | T5-Base Baseline | | 220M | | 6.32 | 7.06 |

| Date | Model | Code | Size | Oracle Knowledge | Test |
|---|---|---|---|---|---|
| | Human Performance Data Engineers + DB Students | | | ✔️ | 83.26 |
| Aug 21, 2024 | OpenSearch-SQL, v2 + GPT-4o Alibaba Cloud | | UNK | ✔️ | 69.36 |
| Aug 1, 2024 | ExSL + granite-34b-code IBM Research AI | | 34B | ✔️ | 68.79 |
| Jul 22, 2024 | Distillery + GPT-4o Distyl AI Research | | UNK | ✔️ | 67.41 |
| Sep 1, 2024 | AskData + GPT-4o AT&T - CDO | | UNK | ✔️ | 66.92 |
| Jul 5, 2024 | Insights AI Uber Freight | | UNK | ✔️ | 66.39 |
| May 14, 2024 | ExSL + granite-20b-code IBM Research AI | | 20B | ✔️ | 66.25 |
| Jul 14, 2024 | RECAP + Gemini Google Cloud | | UNK | ✔️ | 65.70 |
| Aug 30, 2024 | PURPLE + RED + GPT-4o Fudan University + Transwarp Technology | | UNK | ✔️ | 65.62 |
| Aug 29, 2024 | Arcwise + GPT-4o Arcwise | | UNK | ✔️ | 63.68 |
| May 21, 2024 | CHESS Stanford [Talaei et al.'24] | [link] | UNK | ✔️ | 62.77 |
| Jun 7, 2024 | SFT CodeS-15B + SQLFixAgent Soochow University | | UNK | ✔️ | 61.37 |
| Aug 20, 2024 | SCL-SQL Xffuture | | UNK | ✔️ | 61.28 |
| Jan 14, 2024 | MCS-SQL + GPT-4 Dunamu | | UNK | ✔️ | 61.23 |
| Feb 27, 2024 | PB-SQL Seoul National University | | UNK | ✔️ | 60.36 |
| Sep 1, 2024 | AskData + GPT-4o AT&T - CDO | | UNK | | 60.25 |
| Aug 30, 2024 | PURPLE + GPT-4o Fudan University + Transwarp Technology | | UNK | ✔️ | 60.35 |
| Nov 21, 2023 | MAC-SQL + GPT-4 BUAA & Tencent [Wang et al. '23] | | UNK | ✔️ | 57.60 |
| Oct 12, 2023 | SFT CodeS-15B Renmin University of China [Li et al. SIGMOD'24] | [link] | 15B | ✔️ | 56.73 |
| Apr 10, 2024 | GRA-SQL Tencent CDP-youpu | | UNK | ✔️ | 56.63 |
| Nov 16, 2023 | Dubo-SQL, v1 Mercator Technologies | | UNK | ✔️ | 56.63 |
| May 24, 2024 | ExSL + granite-20b-code IBM Research AI | | 20B | | 56.11 |
| Mar 27, 2024 | {Chat2Query} (GPT-4 + data entity modeling) (PingCAP) PingCAP | [link] | UNK | ✔️ | 56.06 |
| Oct 12, 2023 | SFT CodeS-7B Renmin University of China [Li et al. SIGMOD'24] | [link] | 7B | ✔️ | 55.69 |
| Nov 09, 2023 | DAIL-SQL + GPT-4 Alibaba Group [Gao and Wang et al. VLDB'24] | [link] | UNK | ✔️ | 54.02 |
| Aug 10, 2024 | DeepSeek Baseline | [link] | 236B | ✔️ | 53.25 |
| Aug 15, 2023 | DIN-SQL + GPT-4 University of Alberta [Pourreza et al. '23] | [link] | UNK | ✔️ | 53.07 |
| Aug 08, 2024 | Mistral Baseline | [link] | 123B | ✔️ | 52.59 |
| Jul 01, 2023 | GPT-4 Baseline | [link] | UNK | ✔️ | 51.75 |

BIRD Mini-Dev
A lite version of the development dataset, designed to facilitate efficient and cost-effective development cycles, especially for testing and refining SQL query generation models. For more details, please visit the GitHub repository. To update the leaderboard, please make sure your paper or resource is publicly available and submit a PR.
| Date | Model | Code | Size | Oracle Knowledge | SQLite | MySQL | PostgreSQL |
|---|---|---|---|---|---|---|---|
| Jun 31, 2024 | TA + GPT-4 HKU [Qu et al. ACL Findings'24] | [link] | UNK | ✔️ | 58.00 | 49.20 | 50.80 |
| Jun 31, 2024 | GPT-4 | | UNK | ✔️ | 47.80 | 40.80 | 35.80 |
| Jun 31, 2024 | GPT-4-32k | | UNK | ✔️ | 47.00 | 43.20 | 35.00 |
| Jun 31, 2024 | GPT-4-turbo | | UNK | ✔️ | 45.80 | 41.00 | 36.00 |
| Jun 31, 2024 | Llama3-70b-instruct | | 70B | ✔️ | 40.80 | 37.00 | 29.40 |
| Jun 31, 2024 | GPT-35-turbo | | UNK | ✔️ | 38.00 | 36.00 | 27.40 |
| Jun 31, 2024 | GPT-35-turbo-instruct | | UNK | ✔️ | 33.60 | 31.20 | 26.60 |
| Jun 31, 2024 | Phi-3-medium-128k-instruct | | 13B | ✔️ | 30.60 | 25.00 | 21.60 |
| Jun 31, 2024 | Llama3-8b-instruct | | 8B | ✔️ | 24.40 | 24.60 | 18.40 |
| Jun 31, 2024 | Mixtral-8x7b | | 46.7B | ✔️ | 21.60 | 13.60 | 12.40 |