

Poster in Workshop: Machine Learning for Drug Discovery (MLDD)

Evaluating Prompt Tuning for Conditional Protein Sequence Generation

Andrea Nathansen · Kevin Klein · Bernhard Renard · Melania Nowicka · Jakub Bartoszewicz


Abstract:

Text generation models originally developed for natural language processing have proven successful in generating protein sequences. These models are often fine-tuned for improved performance on more specific tasks, such as generation of proteins from families unseen in training. Given the high computational cost of fine-tuning a separate model for each downstream task, prompt tuning has been proposed as an alternative. However, no openly available implementation of this approach compatible with protein language models has previously been published. Thus, we adapt an open-source codebase designed for NLP models to build a pipeline for prompt tuning on protein sequence data, supporting the protein language models ProtGPT2 and RITA. We evaluate our implementation by learning prompts for conditional sampling of sequences belonging to a specific protein family, which yields improved performance compared to the base model. However, in the presented use case, we observe discrepancies between text-based evaluation and predicted biological properties of the generated sequences, identifying open problems for a principled assessment of protein sequence generation quality.
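To illustrate the general idea of prompt tuning a protein language model, the minimal sketch below uses the Hugging Face PEFT library with a ProtGPT2 checkpoint. This is an assumption for illustration only: the abstract states that the authors adapted their own open-source NLP codebase, so the library choice, checkpoint name, and hyperparameters here are not the authors' setup.

```python
# Hypothetical sketch: prompt tuning (soft prompts) for a protein language model.
# The base model stays frozen; only a small set of continuous prompt embeddings
# prepended to every input is trained. All settings below are illustrative.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PromptTuningConfig, TaskType, get_peft_model

model_name = "nferruz/ProtGPT2"  # ProtGPT2 checkpoint on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Configure a learned soft prompt of 20 virtual tokens (length is an assumption).
peft_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()  # only the prompt embeddings are trainable

# Training on sequences from the target protein family would then proceed with a
# standard causal-LM loop (e.g., transformers.Trainer). After tuning, calling
# model.generate() conditions sampling on the learned prompt, enabling
# family-specific sequence generation without fine-tuning the full model.
```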
