These days I'm very interested in pretraining, alignment, and evaluation.
I'm specifically interested in multimodal models and applications to robotics.
I have done work on grounding (NAACL Findings 2024), as well as on controllable text generation and evaluation
(COLING 2022, EACL 2023).
We introduce DataComp for Language Models (DCLM), a testbed for
controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized
corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a
broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies
such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for
DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training
set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 64% 5-shot
accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language
models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute.
Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% and 66%, respectively), and performs
similarly on the average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our
results highlight the importance of dataset design for training language models and offer a starting point for further
research on data curation.
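For a concrete (if highly simplified) picture of what model-based filtering can look like, here is a small sketch that trains a bag-of-words quality classifier and keeps the top-scoring fraction of a document pool. The classifier choice, function names, and threshold are assumptions for illustration, not the actual DCLM-Baseline pipeline.

```python
# Illustrative sketch of model-based quality filtering (not the DCLM pipeline).
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def train_quality_filter(reference_docs, random_web_docs):
    """Train a binary classifier separating reference docs from random web text."""
    texts = list(reference_docs) + list(random_web_docs)
    labels = [1] * len(reference_docs) + [0] * len(random_web_docs)
    vectorizer = TfidfVectorizer(max_features=50_000, ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000).fit(vectorizer.fit_transform(texts), labels)
    return vectorizer, clf

def filter_pool(vectorizer, clf, pool, keep_fraction=0.1):
    """Score every candidate document and keep only the top-scoring fraction."""
    scores = clf.predict_proba(vectorizer.transform(pool))[:, 1]
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [doc for doc, score in zip(pool, scores) if score >= cutoff]
```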
Southeast Asia (SEA) is a region rich in linguistic
diversity and cultural variety, with over 1,300 indigenous languages and a population of 671 million people.
However, prevailing AI models suffer from a significant lack of representation of texts, images, and audio datasets
from SEA, compromising the quality of AI models for SEA languages. Evaluating models for SEA languages is challenging
due to the scarcity of high-quality datasets, compounded by the dominance of English training data, raising concerns
about potential cultural misrepresentation. To address these challenges, we introduce SEACrowd, a collaborative
initiative that consolidates a comprehensive resource hub, filling the resource gap with standardized
corpora in nearly 1,000 SEA languages across three modalities. Through our SEACrowd benchmarks, we assess the quality
of AI models on 36 indigenous languages across 13 tasks, offering valuable insights into the current AI landscape in
SEA. Furthermore, we propose strategies to facilitate greater AI advancements, maximizing potential utility and
resource equity for the future of AI in SEA.
Linear transformers have emerged as a subquadratic-time
alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state
that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms
compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these
shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models
requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by
the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training
linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA), a method to
uptrain existing large pre-trained transformers into Recurrent Neural Networks (RNNs) with a modest compute budget.
This allows us to leverage the strong pre-training data and performance of existing transformer LLMs while requiring
only 5% of the training cost. We find that our linearization technique leads to competitive performance on standard
benchmarks, but we identify persistent in-context learning and long-context modeling shortfalls for even the largest
linear models. Our code and models can be found at https://github.com/TRI-ML/linear_open_lm.
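For background, the fixed-size recurrent state mentioned above comes from replacing softmax attention with a kernel feature map, which lets the model process tokens one at a time while carrying only a (d_k x d_v) state. The NumPy sketch below shows that generic recurrence; the feature map and normalization are placeholders, not SUPRA's actual formulation.

```python
# Generic linear-attention recurrence with a fixed-size state (illustrative only).
import numpy as np

def linear_attention_recurrent(q, k, v, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """q, k: (T, d_k); v: (T, d_v). Returns outputs of shape (T, d_v).
    The state S (d_k, d_v) and normalizer z (d_k,) do not grow with T."""
    T, d_k = q.shape
    d_v = v.shape[1]
    S = np.zeros((d_k, d_v))
    z = np.zeros(d_k)
    out = np.zeros((T, d_v))
    for t in range(T):
        qt, kt = phi(q[t]), phi(k[t])
        S += np.outer(kt, v[t])              # accumulate key-value outer products
        z += kt                              # accumulate normalizer
        out[t] = (qt @ S) / (qt @ z + 1e-6)  # causal attention output at step t
    return out

# Toy usage:
# out = linear_attention_recurrent(np.random.randn(16, 8), np.random.randn(16, 8), np.random.randn(16, 4))
```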
Language Models Scale Reliably with Over-training and on Downstream Tasks
Samir Yitzhak Gadre, Georgios Smyrnis, Vaishaal Shankar, Suchin Gururangan, Mitchell Wortsman, Rulin Shao, Jean Mercat, Alex Fang, Jeffrey Li, Sedrick Keh, Rui Xin, Marianna Nezhurina, Igor Vasiljevic, Jenia Jitsev, Alexandros G Dimakis, Gabriel Ilharco, Shuran Song, Thomas Kollar, Yair Carmon, Achal Dave, Reinhard Heckel, Niklas Muennighoff, Ludwig Schmidt
Scaling laws are useful guides for developing
language models, but there are still gaps between current scaling studies and how language models are ultimately
trained and evaluated. For instance, scaling is usually studied in the compute-optimal training regime
(i.e., "Chinchilla optimal" regime); however, in practice, models are often over-trained to reduce inference costs.
Moreover, scaling laws mostly predict loss on next-token prediction, but ultimately models are compared based on
downstream task performance. In this paper, we address both shortcomings. To do so, we create a testbed of 104 models
with 0.011B to 6.9B parameters trained with various numbers of tokens on three data distributions. First, we investigate
scaling in the over-trained regime. We fit scaling laws that extrapolate in both the number of model parameters and the
ratio of training tokens to parameters. This enables us to predict the validation loss of a 1.4B parameter, 900B token
run (i.e., 32x over-trained) and a 6.9B parameter, 138B token run -- each from experiments that take 300x
less compute. Second, we relate the perplexity of a language model to its downstream task performance via a power law.
We use this law to predict top-1 error averaged over downstream tasks for the two aforementioned models using experiments
that take 20x less compute. Our experiments are available at https://github.com/mlfoundations/scaling.
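As a toy illustration of relating loss to downstream performance, the snippet below fits a simple power-law curve with scipy and extrapolates it; the functional form and the synthetic data points are assumptions for illustration, not the paper's parameterization or measurements.

```python
# Toy fit of a loss-to-downstream-error curve (synthetic data, assumed functional form).
import numpy as np
from scipy.optimize import curve_fit

def downstream_error(loss, a, b, c):
    """Hypothetical power-law form: average top-1 error as a function of validation loss."""
    return a * np.power(loss, b) + c

rng = np.random.default_rng(0)
losses = np.linspace(2.4, 3.4, 8)                             # small-run losses (synthetic)
errors = 0.25 * losses**1.1 + 0.02 + rng.normal(0, 0.005, 8)  # matching errors (synthetic)

params, _ = curve_fit(downstream_error, losses, errors, p0=(0.2, 1.0, 0.0), maxfev=10_000)
print(downstream_error(2.2, *params))  # extrapolated error at a larger run's (lower) loss
```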
Reinforcement learning with AI feedback (RLAIF) is a popular
paradigm for improving the instruction-following abilities of powerful pre-trained language models. RLAIF first performs
supervised fine-tuning (SFT) using demonstrations from a teacher model and then further fine-tunes the model with
reinforcement learning (RL), using feedback from a critic model. While recent popular open-source models have
demonstrated substantial improvements in performance from the RL step, in this paper we question whether the complexity
of this RL step is truly warranted for AI feedback. We show that the improvements of the RL step are virtually entirely
due to the widespread practice of using a weaker teacher model (e.g., GPT-3.5) for SFT data collection than the critic
(e.g., GPT-4) used for AI feedback generation. Specifically, we show that simple supervised fine-tuning with GPT-4 as
the teacher outperforms existing RLAIF pipelines. More generally, we find that the gains from RLAIF vary substantially
across base model families, test-time evaluation protocols, and critic models. Finally, we provide a mechanistic
explanation for when SFT may outperform the full two-step RLAIF pipeline as well as suggestions for making RLAIF
maximally useful in practice.
Environmental conservation organizations
routinely monitor news content on conservation in protected areas to maintain situational awareness of developments
that can have an environmental impact. Existing automated media monitoring systems require large amounts of data
labeled by domain experts, which is only feasible at scale for high-resource languages like English. However, such
tools are most needed in the global south where news of interest is mainly in local low-resource languages, and far
fewer experts are available to annotate datasets sustainably. In this paper, we propose NewsSerow, a method to
automatically recognize environmental conservation content in low-resource languages. NewsSerow is a pipeline of
summarization, in-context few-shot classification, and self-reflection using large language models (LLMs). Using at
most 10 Nepali news articles as demonstration examples, NewsSerow significantly outperforms other few-shot methods
and achieves comparable performance with models fully fine-tuned using thousands of examples. The World Wide Fund
for Nature (WWF) has deployed NewsSerow for media monitoring in Nepal, significantly reducing their operational
burden, and ensuring that AI tools for conservation actually reach the communities that need them the most. NewsSerow
has also been deployed in other countries, such as Colombia, covering additional languages.
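A rough sketch of what a summarize, classify, and self-reflect pipeline can look like is below; `call_llm`, the prompts, and the yes/no protocol are hypothetical placeholders rather than NewsSerow's actual prompts or models.

```python
# Hypothetical summarize -> few-shot classify -> self-reflect pipeline (illustrative only).
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in an LLM client here")  # placeholder

def classify_article(article: str, demonstrations: list[tuple[str, str]]) -> str:
    summary = call_llm(f"Summarize this news article in two sentences:\n{article}")
    shots = "\n".join(f"Article: {a}\nRelevant to conservation: {y}" for a, y in demonstrations)
    label = call_llm(f"{shots}\nArticle: {summary}\nRelevant to conservation (yes/no):").strip().lower()
    if label.startswith("yes"):
        # Self-reflection step: ask the model to double-check its positive prediction.
        check = call_llm(
            f"Summary: {summary}\nYou said this article is about environmental conservation. "
            "Briefly justify this, or answer 'no' if that was wrong:"
        )
        if check.strip().lower().startswith("no"):
            label = "no"
    return label
```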
When a model is trying to gather information in an interactive setting,
it benefits from asking informative questions. However, in the case of a grounded multi-turn image identification task, previous
studies have been constrained to polar yes/no questions, limiting how much information the model can gain in a single turn.
We present an approach that formulates more informative, open-ended questions. In doing so, we discover that off-the-shelf
visual question answering (VQA) models often make presupposition errors, which standard information gain question selection
methods fail to account for. To address this issue, we propose a method that can incorporate presupposition handling into both
question selection and belief updates. Specifically, we use a two-stage process, where the model first filters out images which
are irrelevant to a given question, then updates its beliefs about which image the user intends. Through self-play and human
evaluations, we show that our method is successful in asking informative open-ended questions, increasing accuracy over the past
state-of-the-art by 14%, while resulting in 48% more efficient games in human evaluations.
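For intuition, here is a minimal sketch of the two-stage update: images violating the question's presupposition are screened out, and the remaining belief is updated with the answer likelihood. The variable names and numbers are assumptions; in the actual method the relevance and likelihood terms come from learned VQA models.

```python
# Two-stage belief update with presupposition filtering (illustrative sketch).
import numpy as np

def belief_update(belief, relevance, answer_likelihood):
    """belief: prior over candidate images, shape (N,)
    relevance: 1 if the question's presupposition holds for image i, else 0
    answer_likelihood: P(observed answer | image i), shape (N,)"""
    posterior = belief * relevance * answer_likelihood
    total = posterior.sum()
    if total == 0:  # presupposition failed for every image; fall back to the prior
        return belief
    return posterior / total

prior = np.full(4, 0.25)
relevance = np.array([1.0, 1.0, 0.0, 1.0])   # question does not apply to image 2
likelihood = np.array([0.9, 0.2, 0.5, 0.1])  # illustrative answer likelihoods
posterior = belief_update(prior, relevance, likelihood)
```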
Improving the quality of academic writing is a meaningful but
challenging task. Conventional methods of language refinement focus on narrow, specific linguistic features within isolated
sentences, such as grammatical errors and improper word use. We propose a more general task, Academic Writing Formalization (AWF),
to improve the overall quality of formal academic writing at the paragraph level. We formulate this language refinement task as a
formal text style transfer task that converts informal-academic text into formal-academic text, and we contribute a large-scale non-parallel
dataset, Doolittle, for this purpose. Concurrently, we apply a method named metric-oriented reinforcement learning (MORL) to two
large language models (LLMs), incorporating different levels of automatic feedback into the training process. Our experiments
reveal that existing text transfer models and grammatical error correction models address certain aspects of AWF but still have a
significant performance gap compared to human performance. Meanwhile, language models fine-tuned with our MORL method exhibit
considerably improved performance, rivaling ChatGPT, but still show a non-negligible gap compared to the ground
truth formal-academic texts in Doolittle.
Social media classification tasks (e.g., tweet sentiment analysis,
tweet stance detection) are challenging because social media posts are typically short, informal, and ambiguous. Thus, training
on tweets demands large-scale human-annotated labels, which are time-consuming and costly to obtain. In this paper,
we find that providing hashtags to tweets can help alleviate this issue because hashtags can enrich short and ambiguous
tweets with various kinds of information, such as topic, sentiment, and stance. This motivates us to propose a novel Hashtag-guided
Tweet Classification model (HashTation), which automatically generates meaningful hashtags for the input tweet to provide useful
auxiliary signals for tweet classification. To generate high-quality and insightful hashtags, our hashtag generation model retrieves
and encodes the post-level and entity-level information across the whole corpus. Experiments show that HashTation achieves significant
improvements on seven low-resource tweet classification tasks, in which only a limited amount of training data is provided, showing that
automatically enriching tweets with model-generated hashtags could significantly reduce the demand for large-scale human-labeled data.
Further analysis demonstrates that HashTation is able to generate high-quality hashtags that are consistent with the tweets and their
labels. The code is available at https://github.com/shizhediao/HashTation.
Tongue twisters are meaningful sentences that are difficult to
pronounce. Automatically generating tongue twisters is challenging since the generated utterance must satisfy
two conditions at once: it must be phonetically difficult and semantically meaningful. Furthermore, phonetic difficulty is itself hard to characterize
and is expressed in natural tongue twisters through a heterogeneous mix of phenomena such as alliteration and homophony.
In this paper, we propose PANCETTA: Phoneme Aware Neural Completion to Elicit Tongue Twisters Automatically. We leverage
phoneme representations to capture the notion of phonetic difficulty, and we train language models to generate original tongue
twisters on two proposed task settings. To do this, we curate a dataset called PANCETTA, consisting of existing English tongue
twisters. Through automatic and human evaluation, as well as qualitative analysis, we show that PANCETTA generates novel,
phonetically difficult, fluent, and semantically meaningful tongue twisters.
We introduce EUREKA, an ensemble-based approach to perform automatic
euphemism detection. We (1) identify and correct potentially mislabelled rows in the dataset, (2) curate an expanded corpus
called EuphAug, (3) leverage model representations of Potentially Euphemistic Terms (PETs), and (4) explore using representations
of semantically close sentences to aid in classification. Using these methods, EUREKA was able to achieve state-of-the-art results
on the public leaderboard of the Euphemism Detection Shared Task, with a macro-F1 score of 0.881. Our code is available at
this https URL.
This work builds upon the Euphemism Detection Shared Task proposed
in the EMNLP 2022 FigLang Workshop, and extends it to few-shot and zero-shot settings. We demonstrate a few-shot and zero-shot
formulation using the dataset from the shared task, and we conduct experiments in these settings using RoBERTa and GPT-3. Our
results show that language models are able to classify euphemistic terms relatively well even on new terms unseen during training,
indicating that they are able to capture higher-level concepts related to euphemisms.
A personification is a figure of speech that endows inanimate entities
with properties and actions typically seen as requiring animacy. In this paper, we explore the task of personification generation.
To this end, we propose PINEAPPLE: Personifying INanimate Entities by Acquiring Parallel Personification data for Learning Enhanced
generation. We curate a corpus of personifications called PersonifCorp, together with automatically generated de-personified
literalizations of these personifications. We demonstrate the usefulness of this parallel corpus by training a seq2seq model to
personify a given literal input. Both automatic and human evaluations show that fine-tuning with PersonifCorp leads to significant
gains in personification-related qualities such as animacy and interestingness. A detailed qualitative analysis also highlights key
strengths and imperfections of PINEAPPLE over baselines, demonstrating a strong ability to generate diverse and creative
personifications that enhance the overall appeal of a sentence.
NewsPanda: Media Monitoring for Timely Conservation Action
Sedrick Scott Keh*, Zheyuan Ryan Shi*, David J. Patterson, Nirmal Bhagabati, Karun Dewan, Areendran Gopala, Pablo Izquierdo, Debojyoti Mallick, Ambika Sharma, Pooja Shrestha, Fei Fang
The 35th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI 2023)
Non-governmental organizations for environmental conservation have a
significant interest in monitoring conservation-related media and getting timely updates about infrastructure construction projects,
as these may have a massive impact on key conservation areas. Such monitoring, however, is difficult and time-consuming. We introduce
NewsPanda, a toolkit which automatically detects and analyzes online articles related to environmental conservation and
infrastructure construction. We fine-tune a BERT-based model using active learning methods and noise correction algorithms to
identify articles that are relevant to conservation and infrastructure construction. For the identified articles, we perform further
analysis, extracting keywords and finding potentially related sources. NewsPanda has been successfully deployed by the World Wide
Fund for Nature teams in the UK, India, and Nepal since February 2022. It currently monitors over 80,000 websites and 1,074 conservation
sites across India and Nepal, saving more than 30 hours of human effort weekly. We have now scaled it up to cover 60,000 conservation
sites globally.
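The abstract above mentions active learning; as a generic illustration (not NewsPanda's actual strategy), here is a minimal uncertainty-sampling step for picking which unlabeled articles to send to annotators next.

```python
# Generic uncertainty sampling for active learning (illustrative, not NewsPanda's method).
import numpy as np

def select_for_annotation(predict_proba, unlabeled_texts, budget=20):
    """Return indices of the `budget` articles whose predicted relevance
    probability is closest to 0.5, i.e., where the model is least certain.
    `predict_proba` maps a list of texts to an array of P(relevant) values."""
    probs = np.asarray(predict_proba(unlabeled_texts))
    uncertainty = -np.abs(probs - 0.5)  # larger means less certain
    return np.argsort(uncertainty)[-budget:][::-1]

# Typical loop: annotate the selected articles, add them to the training set,
# fine-tune the BERT-based classifier, and repeat until the budget is spent.
```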
In recent years, deep learning has vastly improved the identification and
diagnosis of various diseases in plants. In this report, we investigate the problem of pathology classification using images of a single
leaf. We explore the use of standard benchmark models such as VGG16, ResNet101, and DenseNet-161 to achieve a 0.945 score on the task.
Furthermore, we explore the use of the newer EfficientNet model, improving the accuracy to 0.962. Finally, we apply the state-of-the-art
idea of semi-supervised Noisy Student training to the EfficientNet model, resulting in significant improvements in both accuracy and convergence
rate. The final ensembled Noisy Student model performs very well on the task, achieving a test score of 0.982.
The Myers-Briggs Type Indicator (MBTI) is a popular personality metric that uses
four dichotomies as indicators of personality traits. This paper examines the use of pre-trained language models to predict MBTI personality
types based on scraped labeled texts. The proposed model reaches an accuracy of 0.47 for correctly predicting all four dichotomies and 0.86 for correctly
predicting at least two. Furthermore, we investigate the possible uses of a fine-tuned BERT model for personality-specific language generation.
This task is essential for both modern psychology and intelligent empathetic systems.