BIRD-SQL

A Big Bench for Large-Scale Database Grounded Text-to-SQLs

About BIRD

BIRD (BIg Bench for LaRge-scale Database Grounded Text-to-SQL Evaluation) is a pioneering cross-domain dataset that examines the impact of extensive database contents on text-to-SQL parsing. BIRD contains over 12,751 unique question-SQL pairs and 95 large databases with a total size of 33.4 GB. It also covers more than 37 professional domains, such as blockchain, hockey, healthcare, and education.

News

  • Nov. 26, 2024: Thanks for your support of BIRD-SQL 2023! We are pleased to share that the BIRD 2025 project has started. It will contain 4-6 new benchmarks, each with its own focus on professional databases and their knowledge in in-the-wild applications. We will release the first benchmark by early January. Feel free to let us know your needs or suggestions for the next generation of Text-to-SQL challenges. Thanks!
  • Aug. 4, 2024: The Reward-based Valid Efficiency Score (R-VES) will be used as the efficiency metric for future test submissions. The rationale and formula for R-VES can be found in the Mini-Dev repository. You can check the legacy VES scores for previous submissions here.
  • Jun. 30, 2024: Due to the large number of test submission requests involving mixed models (open-source + GPU-based closed source), we have updated the submission instructions to shorten your waiting time. Please check them out!
  • Jun. 30, 2024: If you are interested in code agents, do not miss the SOTA code agent implementation by OpenDevin for BIRD Dev!
  • Jun. 27, 2024: Excited to announce the release of our BIRD Mini-Dev dataset with 500 high-quality examples. This dataset includes all BIRD keywords, with modifications to questions such as the addition of window functions. We are the first to deliver it not only in SQLite, but also in MySQL and PostgreSQL. We include Soft-F1 and R-VES metrics to reduce bias. Don't miss the column_meaning.json file, preprocessed by TA-SQL, which is available for the dev and test sets. Check out our work here: Before Generation, Align it! A Novel and Effective Strategy for Mitigating Hallucinations in Text-to-SQL Generation (TA-SQL), appearing at ACL 2024 Findings.
  • Apr. 27, 2024: Due to the large volume of requests, we have changed the license of our data to CC BY-SA 4.0. However, we take no responsibility for any misuse of our data, since we developed this benchmark for research and healthy applications only.
  • Mar 13, 2024: Please also take a look at our related work: Tapilot-Crossing, which is the first challenging and more realistic benchmark designed to evaluate Large Language Model (LLM) agents on interactive data analysis tasks. The code covers Python and private libraries, and the evaluation covers 6 common agent actions.
  • Sept 25, 2023: We have released a cleaner version of the dev set. Please download the dev set again. We checked all cases in the dev set and fixed all the errors that we found. After cleaning, the ChatGPT (gpt-3.5-turbo) and GPT4 (gpt-4-32k) EX scores have improved to 42.24 (from 37.22) and 49.15 (from 46.35), respectively. Thanks for all the feedback!
  • Sept 21, 2023: Our paper has been accepted by NeurIPS 2023 as a Spotlight! Thanks to our co-authors, the anonymous reviewers, and the awesome researchers and users who reached out on GitHub or by email for all their efforts and suggestions.
  • July 17, 2023: We have updated the latest results for GPT-4, Claude-2, and Palm-2.
  • July 14, 2023: The data link has been updated, fixing the schema names in the CSV files. Additionally, tied results caused by order_by limit 1 are now considered. Both SQL queries - with and without accounting for tied results - are valid at this time.
  • Jun 12, 2023: We welcome any suggestions and reports of gold errors in help_wanted. Any help is appreciated!
  • Jun 5, 2023: We open-sourced our Graphix-T5, a graph-aware semi-pretrained text-to-text PLM specifically designed to improve multi-hop reasoning for the complex text-to-SQL task.
  • May 30, 2023: If you are interested in ICL, please check out our interesting work deep-thinking🤔. Generate 1000 models for 1000 people smoothly!
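
The July 14, 2023 note on tied results can be illustrated with a small sketch. The table, column names, and rows below are made up for illustration and are not taken from BIRD. The example shows why a query using ORDER BY with LIMIT 1 and a query using a MAX() subquery can both be reasonable answers to a "who has the most ..." question when several rows share the top value.

```python
import sqlite3

# Hypothetical toy table; names and values are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (name TEXT, goals INTEGER);
    INSERT INTO player VALUES ('Ann', 30), ('Bob', 30), ('Cy', 12);
""")

# Variant 1: ORDER BY ... LIMIT 1 returns a single row even though two players tie.
top_one = conn.execute(
    "SELECT name FROM player ORDER BY goals DESC LIMIT 1"
).fetchall()

# Variant 2: a MAX() subquery returns every player who shares the top value.
top_all = conn.execute(
    "SELECT name FROM player "
    "WHERE goals = (SELECT MAX(goals) FROM player) ORDER BY name"
).fetchall()

print(top_one)  # one of the tied players; which one is not guaranteed
print(top_all)  # all tied players
```

Because the two variants return different row sets on tied data, an evaluation that accepts only one of them would penalize a semantically defensible query.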

Surprise from BIRD

1. Large and Dirty values: Due to the nature of the real-world scenarios from which BIRD's database values were collected, they typically retain their original and frequently "dirty" format. Hence, text-to-SQL parsers must first analyze these values to account for their non-standard format before engaging in reasoning.

2. External Knowledge: The condition "account.type = 'OWNER'" can be inferred from the knowledge evidence: "The condition of the loans require the account type should be the owner."

3. Text-to-Efficient-SQL: BIRD is the first text-to-SQL benchmark designed to encourage semantic parsers to produce SQL queries that are not only correct but also efficient. This emphasis on efficiency is especially valuable in real-world data and business analysis settings.
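
The first two points can be made concrete with a minimal sketch. Everything except the condition account.type = 'OWNER' is an assumption for illustration: the schema, the 'USER' account type, and the currency-formatted amount strings are hypothetical and not copied from a BIRD database. The query first normalizes a "dirty" text value before aggregating, and it applies the filter that the knowledge evidence implies.

```python
import sqlite3

# Hypothetical schema and rows; only the 'OWNER' condition comes from the text above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_id INTEGER PRIMARY KEY, type TEXT);
    CREATE TABLE loan (loan_id INTEGER PRIMARY KEY, account_id INTEGER, amount TEXT);
    INSERT INTO account VALUES (1, 'OWNER'), (2, 'USER'), (3, 'OWNER');
    -- Amounts are stored as formatted text, a typical "dirty" value.
    INSERT INTO loan VALUES (10, 1, '$1,200.50'), (11, 2, '$900.00'), (12, 3, '$80.00');
""")

sql = """
    SELECT SUM(CAST(REPLACE(REPLACE(loan.amount, '$', ''), ',', '') AS REAL))
    FROM loan
    JOIN account ON account.account_id = loan.account_id
    WHERE account.type = 'OWNER'   -- condition implied by the knowledge evidence
"""
total = conn.execute(sql).fetchone()[0]
print(total)  # 1280.5
```

Without stripping the currency symbol and thousands separator, the CAST would not yield the intended numbers, and without the evidence the parser has no way to know that only owner accounts qualify.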

Submission

Please follow the Submission Guideline (below) and contact bird.bench23@gmail.com for test evaluation. Usually, we will return your results within 10 days!

Subscribe to BIRD Update

BIRD is a long-term research project aimed at bridging the gap between semantic parsing models and the success of database applications. To receive the latest updates on the dataset, you can leave your email address.

Citation

@article{li2024can,
  title={Can llm already serve as a database interface? a big bench for large-scale database grounded text-to-sqls},
  author={Li, Jinyang and Hui, Binyuan and Qu, Ge and Yang, Jiaxi and Li, Binhua and Li, Bowen and Wang, Bailin and Qin, Bowen and Geng, Ruiying and Huo, Nan and others},
  journal={Advances in Neural Information Processing Systems},
  volume={36},
  year={2024}
}

Leaderboard - Execution Accuracy (EX)

Human Performance (Data Engineers + DB Students, Oracle Knowledge ✔️): 92.96

| Date | Model | Organization | Paper | Code | Size | Oracle Knowledge | Dev (%) | Test (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Nov 24, 2024 | CHASE-SQL + Gemini | Google Cloud | [Pourreza et al. '24] | | UNK | ✔️ | 74.46 | 74.79 |
| Nov 11, 2024 | DSAIR + GPT-4o | AT&T - CDO | | | UNK | ✔️ | 74.32 | 74.12 |
| Oct 27, 2024 | ExSL + granite-34b-code | IBM Research AI | | | 34B | ✔️ | 72.43 | 73.17 |
| Aug 21, 2024 | OpenSearch-SQL, v2 + GPT-4o | Alibaba Cloud | | | UNK | ✔️ | 69.30 | 72.28 |
| Jul 22, 2024 | Distillery + GPT-4o | Distyl AI Research | [Maamari et al. '24] | | UNK | ✔️ | 67.21 | 71.83 |
| May 21, 2024 | CHESSIR +CG +UT | Stanford | [Talaei et al.'24] | [link] | UNK | ✔️ | 68.31 | 71.10 |
| Aug 28, 2024 | Insights AI | Uber Freight | | | UNK | ✔️ | 72.16 | 70.26 |
| Aug 30, 2024 | PURPLE + RED + GPT-4o | Fudan University + Transwarp Technology | | | UNK | ✔️ | 68.12 | 70.21 |
| Nov 10, 2024 | PB-SQL, GPT-4o | Seoul National University | | | UNK | ✔️ | 68.64 | 69.26 |
| Jul 14, 2024 | RECAP + Gemini | Google Cloud | | | UNK | ✔️ | 66.95 | 69.03 |
| Jul 2, 2024 | ByteBrain | ByteDance Infra Lab | | | 33B | ✔️ | 65.45 | 68.87 |
| May 14, 2024 | ExSL + granite-20b-code | IBM Research AI | | | 20B | ✔️ | 65.38 | 67.86 |
| Nov 11, 2024 | DSAIR + GPT-4o | AT&T - CDO | | | UNK | | 65.91 | 67.41 |
| May 21, 2024 | CHESSIR +SS +CG | Stanford | [Talaei et al.'24] | [link] | UNK | ✔️ | 65.00 | 66.69 |
| Sep 23, 2024 | E-SQL + GPT-4o | Bilkent University | [Caferoğlu et al.'24] | [link] | UNK | ✔️ | 65.58 | 66.29 |
| Aug 29, 2024 | Arcwise + GPT-4o | Arcwise | | | UNK | ✔️ | 67.99 | 66.21 |
| Nov 18, 2024 | RSL-SQL + DeepSeek | VLR-Lab | [Cao et al.'24] | [link] | UNK | ✔️ | 63.56 | 65.51 |
| Jan 14, 2024 | MCS-SQL + GPT-4 | Dunamu | | | UNK | ✔️ | 63.36 | 65.45 |
| Aug 20, 2024 | SCL-SQL | Xffuture | | | UNK | ✔️ | 64.73 | 65.23 |
| Apr 08, 2024 | OpenSearch-SQL,v1 + GPT-4 | Alibaba Cloud | | | UNK | ✔️ | 61.34 | 64.95 |
| Jun 7, 2024 | SFT CodeS-15B + SQLFixAgent | Soochow University | | | UNK | ✔️ | -- | 64.62 |
| Aug 30, 2024 | PURPLE + GPT-4o | Fudan University + Transwarp Technology | | | UNK | ✔️ | 62.97 | 64.51 |
| Oct 10, 2024 | MSL-SQL + DeepSeek-V2.5 | Wuhan University of Technology | | | 236B | ✔️ | 66.82 | 64.00 |
| Feb 21, 2024 | Sense | Anonymous | | | 13B | ✔️ | 55.48 | 63.39 |
| Apr 10, 2024 | GRA-SQL | Tencent CDP-youpu | | | UNK | ✔️ | 62.58 | 63.22 |
| Jun 1, 2024 | SuperSQL | HKUST(GZ) | [Li et al. '24] | [link] | UNK | ✔️ | 58.50 | 62.66 |
| Mar 27, 2024 | {Chat2Query} (GPT-4 + data entity modeling) (PingCAP) | PingCAP | | [link] | UNK | ✔️ | 58.15 | 60.98 |
| Nov 16, 2023 | Dubo-SQL, v1 | Mercator Technologies | | | UNK | ✔️ | 59.71 | 60.71 |
| Oct 12, 2023 | SFT CodeS-15B | Renmin University of China | [Li et al. SIGMOD'24] | [link] | 15B | ✔️ | 58.47 | 60.37 |
| Feb 27, 2024 | DTS-SQL + DeepSeek 7B | University of Alberta | [Pourreza et al. '24] | [link] | 7B | ✔️ | 55.8 | 60.31 |
| Sep 23, 2024 | E-SQL + GPT-4o mini | Bilkent University | [Caferoğlu et al.'24] | [link] | UNK | ✔️ | 61.60 | 59.81 |
| Nov 21, 2023 | MAC-SQL + GPT-4 | BUAA & Tencent | [Wang et al. '23] | | UNK | ✔️ | 57.56 | 59.59 |
| Oct 12, 2023 | SFT CodeS-7B | Renmin University of China | [Li et al. SIGMOD'24] | [link] | 7B | ✔️ | 57.17 | 59.25 |
| May 27, 2024 | TA-SQL + GPT-4 | HKU | [Qu et al. ACL Findings'24] | [link] | UNK | ✔️ | 56.19 | 59.14 |
| Nov 09, 2023 | DAIL-SQL + GPT-4 | Alibaba Group | [Gao and Wang et al. VLDB'24] | [link] | UNK | ✔️ | 54.76 | 57.41 |
| May 24, 2024 | ExSL + granite-20b-code | IBM Research AI | | | 20B | | 51.69 | 57.13 |
| Aug 10, 2024 | DeepSeek | Baseline | | [link] | 236B | ✔️ | 56.13 | 56.68 |
| Aug 15, 2023 | DIN-SQL + GPT-4 | University of Alberta | [Pourreza et al. '23] | [link] | UNK | ✔️ | 50.72 | 55.90 |
| Aug 08, 2024 | Mistral | Baseline | | [link] | 123B | ✔️ | 53.52 | 55.84 |
| Jul 01, 2023 | GPT-4 | Baseline | | [link] | UNK | ✔️ | 46.35 | 54.89 |
| Nov 8, 2024 | Interactive-T2S | Anonymous | | | UNK | | 54.56 | 54.11 |
| Sep 19, 2024 | Prem-1B-SQL | Prem AI | | [link] | 1B | ✔️ | - | 51.54 |
| Jul 16, 2023 | Claude-2 | Baseline | | [link] | UNK | ✔️ | 42.70 | 49.02 |
| Nov 23, 2023 | Open-SQL | Anonymous | | | 7B | ✔️ | 37.68 | 47.74 |
| Mar 17, 2023 | ChatGPT + CoT | HKU & DAMO | [Li et al. NeurIPS'23] | [link] | UNK | ✔️ | 36.64 | 40.08 |
| Mar 17, 2023 | ChatGPT | Baseline | | | UNK | ✔️ | 37.22 | 39.30 |
| Feb 17, 2023 | Codex | Baseline | | | 175B | ✔️ | 34.35 | 36.47 |
| Jul 16, 2023 | Palm-2 | Baseline | | [link] | UNK | ✔️ | 27.38 | 33.04 |
| Mar 17, 2023 | ChatGPT + CoT | HKU & DAMO | [Li et al. NeurIPS'23] | [link] | UNK | | 25.88 | 28.95 |
| Mar 17, 2023 | ChatGPT | Baseline | | | UNK | | 24.05 | 26.77 |
| Feb 17, 2023 | Codex | Baseline | | | 175B | | 25.42 | 24.86 |
| Feb 5, 2023 | T5-3B | Baseline | | | 3B | ✔️ | 23.34 | 24.05 |
| Feb 3, 2023 | T5-Large | Baseline | | | 770M | ✔️ | 19.75 | 20.94 |
| Feb 3, 2023 | T5-Base | Baseline | | | 220M | ✔️ | 11.54 | 12.89 |
| Feb 5, 2023 | T5-3B | Baseline | | | 3B | | 10.37 | 11.17 |
| Feb 3, 2023 | T5-Large | Baseline | | | 770M | | 9.71 | 10.38 |
| Feb 3, 2023 | T5-Base | Baseline | | | 220M | | 6.32 | 7.06 |

Leaderboard - Reward-based Valid Efficiency Score (R-VES)

Human Performance (Data Engineers + DB Students, Oracle Knowledge ✔️): 83.26

| Date | Model | Organization | Paper | Code | Size | Oracle Knowledge | Test |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Oct 27, 2024 | ExSL + granite-34b-code | IBM Research AI | | | 34B | ✔️ | 71.37 |
| Nov 24, 2024 | CHASE-SQL + Gemini | Google Cloud | [Pourreza et al. '24] | | UNK | ✔️ | 70.57 |
| Nov 11, 2024 | DSAIR + GPT-4o | AT&T - CDO | | | UNK | ✔️ | 70.13 |
| Aug 21, 2024 | OpenSearch-SQL, v2 + GPT-4o | Alibaba Cloud | | | UNK | ✔️ | 69.36 |
| Jul 22, 2024 | Distillery + GPT-4o | Distyl AI Research | | | UNK | ✔️ | 67.41 |
| May 21, 2024 | CHESSIR +CG +UT | Stanford | [Talaei et al.'24] | [link] | UNK | ✔️ | 66.53 |
| Jul 5, 2024 | Insights AI | Uber Freight | | | UNK | ✔️ | 66.39 |
| May 14, 2024 | ExSL + granite-20b-code | IBM Research AI | | | 20B | ✔️ | 66.25 |
| Jul 14, 2024 | RECAP + Gemini | Google Cloud | | | UNK | ✔️ | 65.70 |
| Aug 30, 2024 | PURPLE + RED + GPT-4o | Fudan University + Transwarp Technology | | | UNK | ✔️ | 65.62 |
| Aug 29, 2024 | Arcwise + GPT-4o | Arcwise | | | UNK | ✔️ | 63.68 |
| Nov 11, 2024 | DSAIR + GPT-4o | AT&T - CDO | | | UNK | | 63.34 |
| May 21, 2024 | CHESSIR +SS +CG | Stanford | [Talaei et al.'24] | [link] | UNK | ✔️ | 62.77 |
| Sep 23, 2024 | E-SQL + GPT-4o | Bilkent University | [Caferoğlu et al.'24] | [link] | UNK | ✔️ | 62.43 |
| Jun 7, 2024 | SFT CodeS-15B + SQLFixAgent | Soochow University | | | UNK | ✔️ | 61.37 |
| Aug 20, 2024 | SCL-SQL | Xffuture | | | UNK | ✔️ | 61.28 |
| Jan 14, 2024 | MCS-SQL + GPT-4 | Dunamu | | | UNK | ✔️ | 61.23 |
| Feb 27, 2024 | PB-SQL | Seoul National University | | | UNK | ✔️ | 60.36 |
| Aug 30, 2024 | PURPLE + GPT-4o | Fudan University + Transwarp Technology | | | UNK | ✔️ | 60.35 |
| Oct 10, 2024 | MSL-SQL + DeepSeek-V2.5 | Wuhan University of Technology | | | 236B | ✔️ | 59.42 |
| Nov 21, 2023 | MAC-SQL + GPT-4 | BUAA & Tencent | [Wang et al. '23] | | UNK | ✔️ | 57.60 |
| Oct 12, 2023 | SFT CodeS-15B | Renmin University of China | [Li et al. SIGMOD'24] | [link] | 15B | ✔️ | 56.73 |
| Apr 10, 2024 | GRA-SQL | Tencent CDP-youpu | | | UNK | ✔️ | 56.63 |
| Nov 16, 2023 | Dubo-SQL, v1 | Mercator Technologies | | | UNK | ✔️ | 56.63 |
| May 24, 2024 | ExSL + granite-20b-code | IBM Research AI | | | 20B | | 56.11 |
| Mar 27, 2024 | {Chat2Query} (GPT-4 + data entity modeling) (PingCAP) | PingCAP | | [link] | UNK | ✔️ | 56.06 |
| Oct 12, 2023 | SFT CodeS-7B | Renmin University of China | [Li et al. SIGMOD'24] | [link] | 7B | ✔️ | 55.69 |
| Sep 23, 2024 | E-SQL + GPT-4o mini | Bilkent University | [Caferoğlu et al.'24] | [link] | UNK | ✔️ | 55.64 |
| Nov 09, 2023 | DAIL-SQL + GPT-4 | Alibaba Group | [Gao and Wang et al. VLDB'24] | [link] | UNK | ✔️ | 54.02 |
| Aug 10, 2024 | DeepSeek | Baseline | | [link] | 236B | ✔️ | 53.25 |
| Aug 15, 2023 | DIN-SQL + GPT-4 | University of Alberta | [Pourreza et al. '23] | [link] | UNK | ✔️ | 53.07 |
| Aug 08, 2024 | Mistral | Baseline | | [link] | 123B | ✔️ | 52.59 |
| Jul 01, 2023 | GPT-4 | Baseline | | [link] | UNK | ✔️ | 51.75 |

BIRD Mini-Dev

A lite version of the development dataset, designed to facilitate efficient and cost-effective development cycles, especially for testing and refining SQL query generation models. For more details, please visit the GitHub repository. To update the leaderboard, please make sure your paper or resource is publicly available and submit a PR.

Mini Dev - Execution Accuracy (EX)

| Date | Model | Organization | Paper | Code | Size | Oracle Knowledge | SQLite | MySQL | PostgreSQL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Jun 31, 2024 | TA + GPT-4 | HKU | [Qu et al. ACL Findings'24] | [link] | UNK | ✔️ | 58.00 | 49.20 | 50.80 |
| Jun 31, 2024 | GPT-4 | | | | UNK | ✔️ | 47.80 | 40.80 | 35.80 |
| Jun 31, 2024 | GPT-4-32k | | | | UNK | ✔️ | 47.00 | 43.20 | 35.00 |
| Jun 31, 2024 | GPT-4-turbo | | | | UNK | ✔️ | 45.80 | 41.00 | 36.00 |
| Jun 31, 2024 | Llama3-70b-instruct | | | | 70B | ✔️ | 40.80 | 37.00 | 29.40 |
| Jun 31, 2024 | GPT-35-turbo | | | | UNK | ✔️ | 38.00 | 36.00 | 27.40 |
| Jun 31, 2024 | GPT-35-turbo-instruct | | | | UNK | ✔️ | 33.60 | 31.20 | 26.60 |
| Jun 31, 2024 | Phi-3-medium-128k-instruct | | | | 13B | ✔️ | 30.60 | 25.00 | 21.60 |
| Jun 31, 2024 | Llama3-8b-instruct | | | | 8B | ✔️ | 24.40 | 24.60 | 18.40 |
| Jun 31, 2024 | Mixtral-8x7b | | | | 46.7B | ✔️ | 21.60 | 13.60 | 12.40 |
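
For readers unfamiliar with execution accuracy, the idea behind the EX numbers above can be sketched roughly as follows. This is a simplified illustration and not the official evaluation script: the helper names execution_match and execution_accuracy, the toy table, and the choice to compare results as unordered sets of rows are all assumptions made here. A prediction counts as correct when executing it returns the same rows as executing the gold SQL.

```python
import sqlite3

def execution_match(conn, predicted_sql, gold_sql):
    """Return True when both queries yield the same set of rows (assumed criterion)."""
    try:
        predicted = set(conn.execute(predicted_sql).fetchall())
    except sqlite3.Error:
        return False  # a prediction that fails to execute counts as incorrect here
    gold = set(conn.execute(gold_sql).fetchall())
    return predicted == gold

def execution_accuracy(conn, pairs):
    """Percentage of (predicted, gold) pairs whose execution results match."""
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return 100.0 * hits / len(pairs)

# Hypothetical toy database for the usage example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE school (name TEXT, enrollment INTEGER);
    INSERT INTO school VALUES ('A', 500), ('B', 1500), ('C', 2500);
""")

gold = "SELECT name FROM school WHERE enrollment > 1000"
pairs = [
    ("SELECT name FROM school WHERE NOT enrollment <= 1000", gold),  # same rows
    ("SELECT name FROM school WHERE enrollment > 2000", gold),       # missing a row
    ("SELEC name FROM school", gold),                                # syntax error
    ("SELECT name FROM school WHERE enrollment >= 1500 ORDER BY name DESC", gold),
]
print(execution_accuracy(conn, pairs))  # 50.0
```

Comparing sets ignores row order and duplicates, which keeps the sketch short; a stricter comparison would be needed for questions where ordering matters.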