Phillip Rust
Publications - 8
Citations - 88
Phillip Rust is an academic researcher. The author has contributed to research in topics: Computer science & Biobank. The author has an h-index of 3 and has co-authored 4 publications receiving 40 citations.
Papers
Proceedings Article
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
TL;DR: The authors provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and find that while the pretraining data size is an important factor in the downstream performance, a designated monolingual tokenizer plays an equally important role.
Proceedings Article
Language Modelling with Pixels
Phillip Rust,Jonas F. Lotz,Emanuele Bugliarello,Elizabeth Salesky,Miryam de Lhoneux,Desmond Elliott +5 more
TL;DR: PIXEL is a pretrained language model that renders text as images, making it possible to transfer representations across languages based on orthographic similarity or the co-activation of pixels, and is more robust to noisy text inputs than BERT, further confirming the benefits of modelling language with pixels.
Proceedings Article
Challenges and Strategies in Cross-Cultural NLP
Daniel Hershcovich,Stella Frank,Heather C. Lent,Miryam de Lhoneux,Mostafa Gharib Mostafa Abdou,Stephanie Brandl,Emanuele Bugliarello,Laura Cabello Piqueras,Ilias Chalkidis,Ruixiang Cui,Constanza Fierro,Katerina Margatina,Phillip Rust,Anders Søgaard +13 more
TL;DR: The authors proposed a principled framework for cross-lingual and multilingual NLP, and surveyed existing and potential strategies to accommodate linguistic diversity and serve speakers of many different languages in NLP systems.
Posted Content
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models.
TL;DR: This paper provides a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance, and finds that while the pretraining data size is an important factor in the downstream performance of the multilingual model, a designated monolingual tokenizer plays an equally important role.
Proceedings Article
PuzzLing Machines: A Challenge on Learning From Small Data
TL;DR: PuzzLing Machines is a dataset of Rosetta Stone puzzles drawn from Linguistic Olympiads for high school students. It contains around 100 puzzles covering a wide range of linguistic phenomena from 81 languages.