These days I'm very interested in pretraining, alignment, and evaluation.
I'm specifically interested in multimodal models and applications to robotics.
I have done work on grounding (NAACL Findings 2024), as well as on controllable text generation and evaluation
(COLING 2022, EACL 2023).
We introduce DataComp for Language Models (DCLM), a testbed for
controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized
corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a
broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies
such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for
DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training
set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot
accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language
models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute.
Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively), and performs
similarly on the average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our
results highlight the importance of dataset design for training language models and offer a starting point for further
research on data curation.
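For a concrete (if highly simplified) picture of what model-based filtering can look like, here is a small sketch that trains a bag-of-words quality classifier and keeps the top-scoring fraction of a document pool. The classifier choice, function names, and threshold are assumptions for illustration, not the actual DCLM-Baseline pipeline.

```python
# Illustrative sketch of model-based quality filtering (not the DCLM pipeline).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(reference_docs, random_web_docs):
    """Train a binary classifier separating reference docs from random web text."""
    texts = list(reference_docs) + list(random_web_docs)
    labels = [1] * len(reference_docs) + [0] * len(random_web_docs)
    vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)
    return vectorizer, clf

def filter_pool(vectorizer, clf, pool, keep_fraction=0.1):
    """Score every candidate document and keep only the top-scoring fraction."""
    scores = clf.predict_proba(vectorizer.transform(pool))[:, 1]
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [doc for doc, score in zip(pool, scores) if score >= cutoff]
```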
Southeast Asia (SEA) is a region rich in linguistic
diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people.
However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets
from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging
due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns
about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative
initiative that consolidates a comprehensive resource hub, filling the resource gap with standardized
corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality
of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in
SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and
resource equity for the future of AI in SEA.
Linear transformers have emerged as a subquadratic-time
alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state
that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms
compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these
shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models
requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by
the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training
linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA), a method to
uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
This allows us to leverage the strong pre-training data and performance of existing transformer LLMs while requiring
only 5% of the training cost. We find that our linearization technique leads to competitive performance on standard
benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest
linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.
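For background, the fixed-size recurrent state mentioned above comes from replacing softmax attention with a kernel feature map, which lets the model process tokens one at a time while carrying only a (d_k x d_v) state. The NumPy sketch below shows that generic recurrence; the feature map and normalization are placeholders, not SUPRA's actual formulation.

```python
# Generic linear-attention recurrence with a fixed-size state (illustrative only).
import numpy as np

def linear_attention_recurrent(q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """q, k: (T, d_k); v: (T, d_v). Returns outputs of shape (T, d_v).
    The state S (d_k, d_v) and normalizer z (d_k,) do not grow with T."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    z = np.zeros(d_k)
    out = np.zeros((T, d_v))
    for t in range(T):
        qt, kt = phi(q[t]), phi(k[t])
        S += np.outer(kt, v[t])              # accumulate key-value outer products
        z += kt                              # accumulate normalizer
        out[t] = (qt @ S) / (qt @ z + 1e-6)  # causal attention output at step t
    return out

# Toy usage:
# out = linear_attention_recurrent(np.random.randn(16, 8), np.random.randn(16, 8), np.random.randn(16, 4))
```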
Language Models Scale Reliably with Over-training and on Downstream Tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
Scaling laws are useful guides for developing
language models, but there are still gaps between current scaling studies and how language models are ultimately
trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime
(i.e., "Chinchilla optimal" regime); however, in practice, models are often over-trained to reduce inference costs.
Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on
downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models
with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate
scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the
ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token
run (i.e., 32x over-trained) and a 6.9B parameter, 138B token run -- each from experiments that take 300x
less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law.
We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments
that take 20x less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
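As a toy illustration of relating loss to downstream performance, the snippet below fits a simple power-law curve with scipy and extrapolates it; the functional form and the synthetic data points are assumptions for illustration, not the paper's parameterization or measurements.

```python
# Toy fit of a loss-to-downstream-error curve (synthetic data, assumed functional form).
import numpy as np
from scipy.optimize import curve_fit

def downstream_error(loss, a, b, c):
    """Hypothetical power-law form: average top-1 error as a function of validation loss."""
    return a * np.power(loss, b) + c

rng = np.random.default_rng(0)
losses = np.linspace(2.4, 3.4, 8)                             # small-run losses (synthetic)
errors = 0.25 * losses**1.1 + 0.02 + rng.normal(0, 0.005, 8)  # matching errors (synthetic)

params, _ = curve_fit(downstream_error, losses, errors, p0=(0.2, 1.0, 0.0), maxfev=10_000)
print(downstream_error(2.2, *params))  # extrapolated error at a larger run's (lower) loss
```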
Reinforcement learning with AI feedback (RLAIF) is a popular
paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs
supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with
reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have
demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity
of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely
due to the widespread practice of using a weaker teacher model (e.g., GPT-3.5) for SFT data collection than the critic
(e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as
the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially
across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic
explanation for when SFT may outperform the full two-step RLAIF pipeline as well as suggestions for making RLAIF
maximally useful in practice.
Environmental conservation organizations
routinely monitor news content on conservation in protected areas to maintain situational awareness of developments
that can have an environmental impact. Existing automated media monitoring systems require large amounts of data
labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such
tools are most needed in the global south where news of interest is mainly in local low-resource languages, and far
fewer experts are available to annotate datasets sustainably. In this paper, we propose NewsSerow, a method to
automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of
summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at
most 10 Nepali news articles as demonstration examples, NewsSerow significantly outperforms other few-shot methods
and achieves comparable performance with models fully fine-tuned using thousands of examples. The World Wide Fund
for Nature (WWF) has deployed NewsSerow for media monitoring in Nepal, significantly reducing their operational
burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow
has also been deployed in other countries, such as Colombia, covering additional languages.
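A rough sketch of what a summarize, classify, and self-reflect pipeline can look like is below; `call_llm`, the prompts, and the yes/no protocol are hypothetical placeholders rather than NewsSerow's actual prompts or models.

```python
# Hypothetical summarize -> few-shot classify -> self-reflect pipeline (illustrative only).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")  # placeholder

def classify_article(article: str, demonstrations: list[tuple[str, str]]) -> str:
    summary = call_llm(f"Summarize this news article in two sentences:\n{article}")
    shots = "\n".join(f"Article: {a}\nRelevant to conservation: {y}" for a, y in demonstrations)
    label = call_llm(f"{shots}\nArticle: {summary}\nRelevant to conservation (yes/no):").strip().lower()
    if label.startswith("yes"):
        # Self-reflection step: ask the model to double-check its positive prediction.
        check = call_llm(
            f"Summary: {summary}\nYou said this article is about environmental conservation. "
            "Briefly justify this, or answer 'no' if that was wrong:"
        )
        if check.strip().lower().startswith("no"):
            label = "no"
    return label
```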
When a model is trying to gather information in an interactive setting,
it benefits from asking informative questions. However, in the case of a grounded multi-turn image identification task, previous
studies have been constrained to polar yes/no questions, limiting how much information the model can gain in a single turn.
We present an approach that formulates more informative, open-ended questions. In doing so, we discover that off-the-shelf
visual question answering (VQA) models often make presupposition errors, which standard information gain question selection
methods fail to account for. To address this issue, we propose a method that can incorporate presupposition handling into both
question selection and belief updates. Specifically, we use a two-stage process, where the model first filters out images which
are irrelevant to a given question, then updates its beliefs about which image the user intends. Through self-play and human
evaluations, we show that our method is successful in asking informative open-ended questions, increasing accuracy over the past
state-of-the-art by 14%, while resulting in 48% more efficient games in human evaluations.
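For intuition, here is a minimal sketch of the two-stage update: images violating the question's presupposition are screened out, and the remaining belief is updated with the answer likelihood. The variable names and numbers are assumptions; in the actual method the relevance and likelihood terms come from learned VQA models.

```python
# Two-stage belief update with presupposition filtering (illustrative sketch).
import numpy as np

def belief_update(belief, relevance, answer_likelihood):
    """belief: prior over candidate images, shape (N,)
    relevance: 1 if the question's presupposition holds for image i, else 0
    answer_likelihood: P(observed answer | image i), shape (N,)"""
    posterior = belief * relevance * answer_likelihood
    total = posterior.sum()
    if total == 0:  # presupposition failed for every image; fall back to the prior
        return belief
    return posterior / total

prior = np.full(4, 0.25)
relevance = np.array([1.0, 1.0, 0.0, 1.0])   # question does not apply to image 2
likelihood = np.array([0.9, 0.2, 0.5, 0.1])  # illustrative answer likelihoods
posterior = belief_update(prior, relevance, likelihood)
```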
Improving the quality of academic writing is a meaningful but
challenging task. Conventional methods of language refinement focus on narrow, specific linguistic features within isolated
sentences, such as grammatical errors and improper word use. We propose a more general task, Academic Writing Formalization (AWF),
to improve the overall quality of formal academic writing at the paragraph level. We formulate this language refinement task as a
formal text style transfer task that converts informal-academic text into formal-academic text, and we contribute a large-scale non-parallel
dataset, Doolittle, for this purpose. Concurrently, we apply a method named metric-oriented reinforcement learning (MORL) to two
large language models (LLMs), incorporating different levels of automatic feedback into the training process. Our experiments
reveal that existing text transfer models and grammatical error correction models address certain aspects of AWF but still have a
significant performance gap compared to human performance. Meanwhile, language models fine-tuned with our MORL method exhibit
considerably improved performance, rivaling ChatGPT, but still show a non-negligible gap compared to the ground
truth formal-academic texts in Doolittle.
Social media classification tasks (e.g., tweet sentiment analysis,
tweet stance detection) are challenging because social media posts are typically short, informal, and ambiguous. Thus, training
on tweets demands large-scale human-annotated labels, which are time-consuming and costly to obtain. In this paper,
we find that providing hashtags to tweets can help alleviate this issue because hashtags can enrich short and ambiguous
tweets with various kinds of information, such as topic, sentiment, and stance. This motivates us to propose a novel Hashtag-guided
Tweet Classification model (HashTation), which automatically generates meaningful hashtags for the input tweet to provide useful
auxiliary signals for tweet classification. To generate high-quality and insightful hashtags, our hashtag generation model retrieves
and encodes the post-level and entity-level information across the whole corpus. Experiments show that HashTation achieves significant
improvements on seven low-resource tweet classification tasks, in which only a limited amount of training data is provided, showing that
automatically enriching tweets with model-generated hashtags could significantly reduce the demand for large-scale human-labeled data.
Further analysis demonstrates that HashTation is able to generate high-quality hashtags that are consistent with the tweets and their
labels. The code is available at https://github.com/shizhediao/HashTation.
Tongue twisters are meaningful sentences that are difficult to
pronounce. Automatically generating tongue twisters is challenging since the generated utterance must satisfy
two conditions at once: it must be phonetically difficult and semantically meaningful. Furthermore, phonetic difficulty is itself hard to characterize
and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony.
In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage
phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue
twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue
twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel,
phonetically difficult, fluent, and semantically meaningful tongue twisters.
We introduce EUREKA, an ensemble-based approach to perform automatic
euphemism detection. We (1) identify and correct potentially mislabelled rows in the dataset, (2) curate an expanded corpus
called EuphAug, (3) leverage model representations of Potentially Euphemistic Terms (PETs), and (4) explore using representations
of semantically close sentences to aid in classification. Using these methods, EUREKA was able to achieve state-of-the-art results
on the public leaderboard of the Euphemism Detection Shared Task, with a macro-F1 score of 0.881. Our code is available at
this https URL.
This work builds upon the Euphemism Detection Shared Task proposed
in the EMNLP 2022 FigLang Workshop, and extends it to few-shot and zero-shot settings. We demonstrate a few-shot and zero-shot
formulation using the dataset from the shared task, and we conduct experiments in these settings using RoBERTa and GPT-3. Our
results show that language models are able to classify euphemistic terms relatively well even on new terms unseen during training,
indicating that they are able to capture higher-level concepts related to euphemisms.
A personification is a figure of speech that endows inanimate entities
with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation.
To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced
generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified
literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to
personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant
gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key
strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative
personifications that enhance the overall appeal of a sentence.
NewsPanda: Media Monitoring for Timely Conservation Action
Sedrick Scott Keh*, Zheyuan Ryan Shi*, David J. Patterson, Nirmal Bhagabati, Karun Dewan, Areendran Gopala, Pablo Izquierdo, Debojyoti Mallick, Ambika Sharma, Pooja Shrestha, Fei Fang
The 35th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2023)
Non-governmental organizations for environmental conservation have a
significant interest in monitoring conservation-related media and getting timely updates about infrastructure construction projects,
as these may have a massive impact on key conservation areas. Such monitoring, however, is difficult and time-consuming. We introduce
NewsPanda, a toolkit which automatically detects and analyzes online articles related to environmental conservation and
infrastructure construction. We fine-tune a BERT-based model using active learning methods and noise correction algorithms to
identify articles that are relevant to conservation and infrastructure construction. For the identified articles, we perform further
analysis, extracting keywords and finding potentially related sources. NewsPanda has been successfully deployed by the World Wide
Fund for Nature teams in the UK, India, and Nepal since February 2022. It currently monitors over 80,000 websites and 1,074 conservation
sites across India and Nepal, saving more than 30 hours of human effort weekly. We have now scaled it up to cover 60,000 conservation
sites globally.
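The abstract above mentions active learning; as a generic illustration (not NewsPanda's actual strategy), here is a minimal uncertainty-sampling step for picking which unlabeled articles to send to annotators next.

```python
# Generic uncertainty sampling for active learning (illustrative, not NewsPanda's method).
import numpy as np

def select_for_annotation(predict_proba, unlabeled_texts, budget=20):
    """Return indices of the `budget` articles whose predicted relevance
    probability is closest to 0.5, i.e., where the model is least certain.
    `predict_proba` maps a list of texts to an array of P(relevant) values."""
    probs = np.asarray(predict_proba(unlabeled_texts))
    uncertainty = -np.abs(probs - 0.5)  # larger means less certain
    return np.argsort(uncertainty)[-budget:][::-1]

# Typical loop: annotate the selected articles, add them to the training set,
# fine-tune the BERT-based classifier, and repeat until the budget is spent.
```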
In recent years, deep learning has vastly improved the identification and
diagnosis of various diseases in plants. In this report, we investigate the problem of pathology classification using images of a single
leaf. We explore the use of standard benchmark models such as VGG16, ResNet101, and DenseNet-161 to achieve a 0.945 score on the task.
Furthermore, we explore the use of the newer EfficientNet model, improving the accuracy to 0.962. Finally, we apply the state-of-the-art
idea of semi-supervised Noisy Student training to the EfficientNet model, resulting in significant improvements in both accuracy and convergence
rate. The final ensembled Noisy Student model performs very well on the task, achieving a test score of 0.982.
The Myers-Briggs Type Indicator (MBTI) is a popular personality metric that uses
four dichotomies as indicators of personality traits. This paper examines the use of pre-trained language models to predict MBTI personality
types based on scraped labeled texts. The proposed model reaches an accuracy of 0.47 for correctly predicting all four dichotomies and 0.86 for correctly
predicting at least two. Furthermore, we investigate the possible uses of a fine-tuned BERT model for personality-specific language generation.
This task is essential for both modern psychology and intelligent empathetic systems.