NLP-based Data-Centric AI

NAVER AI TECH 2023. 5. 24. 22:49

This session gave me an overview of the many NLP datasets that are currently available.

Led by government agencies

Led by industry

Led by academia

Overseas

Task - Data Table

Task	Data	Notes
Hate Speech Detection	HateXplain
Counter Speech Generation	ProsocialDialog
Sarcasm Detection	iSarcasm
Fake News Detection	LIAR
Fact Checking	FEVER
Quality Estimation	QUAK
Automatic Post Editing	SubEdits
Chat Translation	WMT22	https://wmt-chat-task.github.io/
Persona-grounded Dialogue	PersonaChat / BSBT
Persuasive Dialogue	Persuasion for Good
Dialogue Summarization	DialogSum / SAMSum
Knowledge-grounded Dialogue	Wizard Of Wikipedia
Dialogue for Characters	Harry Potter Dialogue (HPD)
Empathetic Dialogue	Empathetic Dialogues (ED)
Question Generation	Question Generation for Question Answering (EMNLP 2017)
Document-level Relation Extraction	DocRED
Classical-language dataset
CareCall dataset		Building a Role Specified Open-Domain Dialogue System Leveraging Large-Scale Language Models (Bae et al., 2022)
Hate speech detection dataset		BEEP! Korean Corpus of Online News Comments for Toxic Speech Detection (Moon et al., 2020)
Writing assessment dataset		Automatic score-range classification for Korean learners' writing assessment using deep-learning language models, focusing on KoBERT and KoGPT2 (Cho et al., 2021)
Grammatical error correction dataset		Towards standardizing Korean Grammatical Error Correction: Datasets and Annotation (Yoon et al., 2022)

I am sure a time will come when I need these. I am writing them down now so that I can come back to this list when I need a dataset for a specific task.
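
One way to make this list easier to come back to is to keep it as a small searchable lookup. The snippet below is only a minimal sketch: it copies a few rows from the Task - Data table, and the names `TASK_DATASETS` and `find_datasets` are hypothetical helpers made up for illustration, not part of any library.

```python
# Hypothetical lookup built by hand from a few rows of the Task - Data table.
# Only the task and dataset names come from the table; everything else is illustrative.
TASK_DATASETS = {
    "Hate Speech Detection": ["HateXplain"],
    "Fake News Detection": ["LIAR"],
    "Fact Checking": ["FEVER"],
    "Dialogue Summarization": ["DialogSum", "SAMSum"],
    "Knowledge-grounded Dialogue": ["Wizard Of Wikipedia"],
    "Document-level Relation Extraction": ["DocRED"],
}


def find_datasets(keyword):
    """Return every task whose name contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return {
        task: names
        for task, names in TASK_DATASETS.items()
        if needle in task.lower()
    }


if __name__ == "__main__":
    # Print every dialogue-related task together with its datasets.
    for task, names in find_datasets("dialogue").items():
        print(f"{task}: {', '.join(names)}")
```

Searching by a keyword in the task name is the simplest option here. With the full table loaded, the same function would also cover the Korean datasets listed at the bottom.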


동산