Origin paper
NusaWrites: Constructing High-Quality Corpora for Underrepresented and Extremely Low-Resource Languages
InstructAlign: High-and-Low Resource Language Alignment via Continual Crosslingual Instruction Tuning
NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages
Cendol: Open Instruction-tuned Generative Large Language Models for Indonesian Languages
Instruct-Align: Teaching Novel Languages with to LLMs through Alignment-based Cross-Lingual Instruction
LLMs Are Few-Shot In-Context Low-Resource Language Learners
Large Language Models Only Pass Primary School Exams in Indonesia: A Comprehensive Test on IndoMMLU
One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia
NusaCrowd: Open Source Initiative for Indonesian NLP Resources
BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer
Multilingual Large Language Models Are Not (Yet) Code-Switchers
Many-to-Many Multilingual Translation Model for Languages of Indonesia
SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages
NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural
Bactrian-X : A Multilingual Replicable Instruction-Following Model with Low-Rank Adaptation
IndoNLG: Benchmark and Resources for Evaluating Indonesian Natural Language Generation
Okapi: Instruction-tuned Large Language Models in Multiple Languages with Reinforcement Learning from Human Feedback
ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic
MEGA: Multilingual Evaluation of Generative AI
Cross-lingual Few-Shot Learning on Unseen Languages
IndoCulture: Exploring Geographically Influenced Cultural Commonsense Reasoning Across Eleven Indonesian Provinces
Prompting Multilingual Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages
Transfer Learning-Based Neural Machine Translation for Low-Resource Languages
CMMLU: Measuring massive multitask language understanding in Chinese
IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection
Location-based Twitter Filtering for the Creation of Low-Resource Language Datasets in Indonesian Local Languages
Cheetah: Natural Language Generation for 517 African Languages
Not All Languages Are Created Equal in LLMs: Improving Multilingual Capability by Cross-Lingual-Thought Prompting
Improving Bi-LSTM Performance for Indonesian Sentiment Analysis Using Paragraph Vector
Findings of the 1st Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2023
IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding
COPAL-ID: Indonesian Language Reasoning with Local Culture and Nuances
Towards Computational Linguistics in Minangkabau Language: Studies on Sentiment Analysis and Machine Translation
Extending the Pre-Training of BLOOM for Improved Support of Traditional Chinese: Models, Methods and Results
Can the capability of Large Language Models be described by human ability? A Meta Study
C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models
Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Scaling Up Multilingual Evaluation
Zero-shot Sentiment Analysis in Low-Resource Languages Using a Multilingual Sentiment Lexicon
Large-scale Lifelong Learning of In-context Instructions and How to Tackle It
Democratizing access to natural language processing (NLP) technology is crucial, especially for underrepresented and extremely low-resource languages. Previous research has focused on developing labeled and unlabeled corpora for these languages through online scraping and document translation. While these methods have proven effective and cost-efficient, we identify limitations in the resulting corpora, including a lack of lexical diversity and of cultural relevance to local communities. To address this gap, we conduct a case study on Indonesian local languages. We compare the effectiveness of online scraping, human translation, and paragraph writing by native speakers in constructing datasets. Our findings demonstrate that datasets generated through paragraph writing by native speakers exhibit superior quality in terms of lexical diversity and cultural content. In addition, we present the NusaWrites benchmark, encompassing 12 underrepresented and extremely low-resource languages spoken by millions of individuals in Indonesia. Our experiments with existing multilingual large language models indicate the need to extend these models to more underrepresented languages. We release the NusaWrites dataset at https://github.com/IndoNLP/nusa-writes.