Phillip Rust

Publications -  8
Citations -  88

Phillip Rust is an academic researcher. The author has contributed to research in topics: Computer science & Biobank. The author has an h-index of 3, co-authored 4 publications receiving 40 citations.

Papers
Proceedings Article

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models and their monolingual counterparts with regard to monolingual task performance, and find that while pretraining data size is an important factor in downstream performance, a designated monolingual tokenizer plays an equally important role.
Proceedings Article

Language Modelling with Pixels

TL;DR: PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels, and is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
Proceedings Article

Challenges and Strategies in Cross-Cultural NLP

TL;DR: The authors proposed a principled framework for cross-lingual and multilingual NLP, and surveyed existing and potential strategies to accommodate linguistic diversity and serve speakers of many different languages in NLP systems.
Posted Content

How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models

TL;DR: This paper provided a systematic and comprehensive empirical comparison of pretrained multilingual language models and their monolingual counterparts with regard to monolingual task performance, and found that while pretraining data size is an important factor in the downstream performance of the multilingual model, a designated monolingual tokenizer plays an equally important role.
Proceedings Article

PuzzLing Machines: A Challenge on Learning From Small Data

TL;DR: PuzzLing Machines is a dataset of Rosetta Stone puzzles from Linguistic Olympiads for high school students. It contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.