CellCLIP – Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning

Paul G. Allen School of Computer Science & Engineering
University of Washington
NeurIPS 2025


Abstract

High-content screening (HCS) assays based on high-throughput microscopy techniques such as Cell Painting have enabled the interrogation of cells’ morphological responses to perturbations at an unprecedented scale. The collection of such data promises to facilitate a better understanding of the relationships between different perturbations and their effects on cellular state. Towards achieving this goal, recent advances in cross-modal contrastive learning could, in theory, be leveraged to learn a unified latent space that aligns perturbations with their corresponding morphological effects. However, the application of such methods to HCS data is not straightforward due to substantial differences in the semantics of Cell Painting images compared to natural images, and the difficulty of representing different classes of perturbations (e.g. small molecule vs CRISPR gene knockout) in a single latent space. In response to these challenges, here we introduce CellCLIP, a cross-modal contrastive learning framework for HCS data. CellCLIP leverages pre-trained image encoders coupled with a novel channel encoding scheme to better capture relationships between different microscopy channels in image embeddings, along with natural language encoders for representing perturbations. Our framework outperforms current open-source models, demonstrating the best performance in both cross-modal retrieval and biologically meaningful downstream tasks while also achieving significant reductions in computation time.

Method

We introduce CellCLIP, a framework for contrastive learning (CL) on Cell Painting data. Unlike natural image CL methods, our approach accounts for the unique structure of multi-channel cellular images and the many-to-one relationship between perturbations and image sets. CellCLIP integrates three key components:

CrossChannelFormer

To encode images, we adapt pretrained natural image models (e.g., DINOv2) by treating each Cell Painting channel as a grayscale input. We then aggregate images from the same perturbation using gated attention pooling, producing a per-perturbation profile. Finally, our CrossChannelFormer introduces lightweight cross-channel reasoning by combining pooled profiles with channel-specific embeddings and a global CLS token. This design enables efficient modeling of stain-specific structures while requiring only C+1 tokens per perturbation.

For a perturbation \( i \), CrossChannelFormer takes as input the images of all cells receiving perturbation \( i \) and pools them into a single embedding \( p_i \) that accounts for the relationships between the information contained in the different Cell Painting channels.
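
As a concrete illustration of this design, a minimal PyTorch sketch of the cross-channel reasoning step is shown below. The module structure, dimensions, and number of encoder layers are illustrative assumptions rather than the exact CellCLIP implementation.

import torch
import torch.nn as nn

class CrossChannelFormerSketch(nn.Module):
    """Illustrative sketch: cross-channel reasoning over C per-channel
    profile tokens plus a global CLS token (C + 1 tokens in total)."""

    def __init__(self, num_channels: int = 5, dim: int = 768,
                 num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        # One learned embedding per Cell Painting channel, plus a CLS token.
        self.channel_embed = nn.Parameter(torch.randn(num_channels, dim) * 0.02)
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

    def forward(self, channel_profiles: torch.Tensor) -> torch.Tensor:
        # channel_profiles: (batch, C, dim) -- per-channel features (e.g., from a
        # pretrained natural image encoder applied to each grayscale channel),
        # pooled over all images receiving the same perturbation.
        x = channel_profiles + self.channel_embed        # add channel identity
        cls = self.cls_token.expand(x.size(0), -1, -1)   # (batch, 1, dim)
        x = torch.cat([cls, x], dim=1)                   # (batch, C + 1, dim)
        x = self.encoder(x)
        return x[:, 0]                                   # CLS output = profile p_i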

Perturbation Encoding

Instead of building modality-specific encoders, we represent perturbations as natural language prompts. These descriptions capture both cell type and perturbation details (e.g., compounds or CRISPR targets), enabling a unified representation across diverse perturbation types. We encode prompts with a pretrained BERT model. For example, to encode the chemical compound butyric acid, a drug affecting cell growth, we use the prompt:

A cell painting image of U2OS cells treated with butyric acid, SMILES: CCCC(O)=O.

Similarly, for a CRISPR perturbation, the prompt is structured as:

A cell painting image of U2OS cells treated with CRISPR, targeting genes: AP2S1.
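
To make the prompt-based encoding concrete, the following sketch encodes such prompts with a pretrained BERT model via Hugging Face Transformers; the specific checkpoint (bert-base-uncased) and the use of the [CLS] embedding as the perturbation representation are illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModel

# Illustrative checkpoint; the method only requires a pretrained BERT model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_encoder = AutoModel.from_pretrained("bert-base-uncased")

def encode_perturbation(prompt: str) -> torch.Tensor:
    """Encode a natural-language perturbation description into a vector q_i."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = text_encoder(**inputs)
    # Assumption: use the [CLS] token embedding as the perturbation representation.
    return outputs.last_hidden_state[:, 0]

q_compound = encode_perturbation(
    "A cell painting image of U2OS cells treated with butyric acid, SMILES: CCCC(O)=O."
)
q_crispr = encode_perturbation(
    "A cell painting image of U2OS cells treated with CRISPR, targeting genes: AP2S1."
)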

Training Objective

To align image profiles with perturbation text, we combine the standard CLIP loss with a Continuously Weighted Contrastive Loss (CWCL). CWCL softly reweights pairs based on morphological similarity, ensuring that biologically related perturbations remain close in the embedding space while preserving retrieval performance. This objective balances cross-modal alignment with biologically meaningful intra-modal structure.

For the profile-to-perturbation direction, \( \mathcal{U} \rightarrow \mathcal{V} \), our adapted CWCL objective is given by

\[ \mathcal{L}_{\mathrm{CWCL}, \mathcal{U} \rightarrow \mathcal{V}} = -\frac{1}{N} \sum_{i=1}^N \frac{1}{\sum_{j \in [N]} w_{ij}^{\mathcal{U}}} \left[ \sum_{j=1}^N w_{ij}^{\mathcal{U}} \cdot \log \frac{\exp(\langle p_i, q_j \rangle / \tau )} {\sum_{k=1}^N \exp(\langle p_i, q_k \rangle / \tau )} \right]. \]

Here \( p_i \) represents the output of CrossChannelFormer applied to the images corresponding to perturbation \( i \), \( q_i \) corresponds to our encoding of the natural language description of perturbation \( i \), and \(w_{ij}^{\mathcal{U}}\) denotes the similarity between the pooled profiles \(\mu(U_i)\) and \(\mu(U_j)\), which is used to reweight the alignment from modality \(\mathcal{U}\) to modality \(\mathcal{V}\).

For the perturbation-to-profile direction, \( \mathcal{V} \rightarrow \mathcal{U} \), we apply the standard CLIP loss. Thus, the final training loss for CellCLIP is

\[ \mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CWCL}, \mathcal{U} \rightarrow \mathcal{V}} + \mathcal{L}_{\mathrm{CLIP}, \mathcal{V} \rightarrow \mathcal{U}}. \]
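
A minimal PyTorch sketch of this combined objective is given below, assuming L2-normalized profile and text embeddings and precomputed intra-modal weights \(w_{ij}^{\mathcal{U}}\) (e.g., similarities between pooled profiles); it illustrates the loss above rather than reproducing the reference implementation.

import torch
import torch.nn.functional as F

def cellclip_loss(p: torch.Tensor, q: torch.Tensor,
                  w: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """p: (N, d) profile embeddings, q: (N, d) perturbation text embeddings,
    w: (N, N) nonnegative intra-modal weights (e.g., pooled-profile similarities),
    tau: temperature."""
    p = F.normalize(p, dim=-1)
    q = F.normalize(q, dim=-1)
    logits = p @ q.t() / tau                      # (N, N) cross-modal similarities

    # Profile-to-perturbation direction: CWCL with soft weights w_ij.
    log_probs_u2v = F.log_softmax(logits, dim=1)  # row i: p_i against all q_j
    cwcl = -(w * log_probs_u2v).sum(dim=1) / w.sum(dim=1)
    loss_u2v = cwcl.mean()

    # Perturbation-to-profile direction: standard CLIP loss (one-hot targets).
    targets = torch.arange(p.size(0), device=p.device)
    loss_v2u = F.cross_entropy(logits.t(), targets)

    return loss_u2v + loss_v2u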

Together, these design choices allow CellCLIP to leverage powerful pretrained models, efficiently represent high-dimensional Cell Painting data, and generalize across diverse perturbation types.

CellCLIP aligns perturbation embeddings generated from natural language descriptions with embeddings of corresponding images produced by a new CrossChannelFormer architecture designed to account for the idiosyncrasies of Cell Painting.

Results

Benchmarking CellCLIP and baseline methods on perturbation-to-profile and profile-to-perturbation retrieval performance for unseen molecules from Bray et al. (2016). We report mean Recall@1, @5, and @10 \( \pm \) standard deviation across random seeds for both tasks. Higher recall corresponds to better performance. Best results are shown in bold.

Cross-modality retrieval

We present our results for cross-modality retrieval tasks (i.e., perturbation-to-profile and profile-to-perturbation) in Table 1. For our benchmarking, we compared CellCLIP against previous state-of-the-art CL methods for Cell Painting perturbation screens: CLOOME and MolPhenix. Overall, we found that CellCLIP demonstrated substantially higher performance on both cross-modal retrieval tasks compared to baseline methods. To understand the source of these gains, we conducted a series of ablations assessing the contribution of each CellCLIP component to retrieval performance. Specifically, starting with CLOOME's proposed encoding scheme, in which individual images encoded using ResNet50 are aligned with chemical perturbations encoded using Morgan fingerprints combined with an MLP, we gradually replaced each of CLOOME's components with those of CellCLIP and assessed each change's impact on model performance.
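
For reference, Recall@k for the profile-to-perturbation direction can be computed along the lines of the sketch below; details such as tie handling and averaging across seeds are assumptions and may differ from the evaluation code used in the paper.

import torch

def recall_at_k(profile_emb: torch.Tensor, text_emb: torch.Tensor, k: int) -> float:
    """Profile-to-perturbation Recall@k: fraction of profiles whose matching
    perturbation text is among the k nearest texts by cosine similarity.
    Assumes row i of each tensor corresponds to the same perturbation."""
    p = torch.nn.functional.normalize(profile_emb, dim=-1)
    q = torch.nn.functional.normalize(text_emb, dim=-1)
    sims = p @ q.t()                                 # (N, N)
    topk = sims.topk(k, dim=1).indices               # (N, k)
    match = torch.arange(p.size(0)).unsqueeze(1)     # ground-truth index per row
    return (topk == match).any(dim=1).float().mean().item()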

Ablation studies of various vision and perturbation encoder combinations and retrieval performance on Bray et al. (2016). We report mean Recall@1, @5, and @10 \( \pm \) standard deviation across random seeds for perturbation-to-profile and profile-to-perturbation retrieval tasks. \(\triangle\) indicates Morgan fingerprint; \(\Box\) indicates text prompt.

Language can effectively represent perturbations.

We began by replacing CLOOME's chemical structure encoder, which feeds chemicals' Morgan fingerprints through a multi-layer perceptron (MLP), with the natural language encoder used in CellCLIP while holding all other model components fixed. We found that this change alone yielded significant performance gains for both retrieval tasks. Notably, this result continued to hold even after replacing the generic MLP used in CLOOME with an MPNN++ network designed specifically for molecular property prediction.

Cross-channel reasoning improves retrieval performance

We next investigated the impact of CrossChannelFormer's ability to reason across global Cell Painting channel information. To do so, we replaced CLOOME's ResNet image encoder with our CrossChannelFormer encoder. To isolate the effects of CrossChannelFormer's cross-channel reasoning from the effects of per-perturbation pooling, in this experiment we removed CrossChannelFormer's pooling function and trained the resulting model (denoted CrossChannelFormer\(^{\dagger}\) in the table above) on individual image-perturbation pairs. We found that this change led to an additional increase in model performance. To understand how our approach compared to previous channel-aware vision encoding approaches, we ran the same experiment using a channel-agnostic MAE (CA-MAE) and a ViT (ChannelViT) as vision encoders. We found that CrossChannelFormer consistently outperformed these baselines on both tasks, demonstrating that image embeddings extracted from models pretrained on natural images can be effectively leveraged with CrossChannelFormer. Additionally, our approach substantially reduced training time, achieving a 3.9 times speedup compared to CLOOME and a 2.2 times speedup compared to other channel-agnostic methods. This efficiency stems from CrossChannelFormer operating in a compact feature space and requiring only \(C+1\) tokens per instance.

Retrieval performance of CellCLIP trained with different pooling strategies on perturbation-to-profile and profile-to-perturbation tasks.

Pooling yields improved alignment and computational efficiency

Finally, we evaluated the impact of the attention-based pooling operator within CrossChannelFormer compared to instance-level training (i.e., no pooling). We found that including pooling resulted in yet another increase in model performance. In addition, by reducing the number of pairs for contrastive loss computation, pooling yielded a 6.7 times speedup in training time relative to instance-level training. We also explored the impact of varying our choice of pooling operator and found that attention-based pooling yielded the best results among the operators we considered.
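
For illustration, the attention-based pooling operator can be sketched as follows, in the spirit of gated attention pooling from multiple-instance learning (Ilse et al., 2018); the exact parameterization is an assumption, and in CellCLIP such pooling would be applied per channel to produce the per-channel profiles fed to CrossChannelFormer.

import torch
import torch.nn as nn

class GatedAttentionPooling(nn.Module):
    """Pool a variable-size set of per-image embeddings into one profile vector."""

    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.V = nn.Linear(dim, hidden)
        self.U = nn.Linear(dim, hidden)
        self.w = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_images, dim) -- embeddings of all images for one perturbation.
        gates = torch.tanh(self.V(x)) * torch.sigmoid(self.U(x))
        attn = torch.softmax(self.w(gates), dim=0)   # (num_images, 1) attention weights
        return (attn * x).sum(dim=0)                 # (dim,) pooled profile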

Overall, our results demonstrate that each of our main design choices in CellCLIP contributes to improved retrieval performance and computational efficiency compared to prior work.

Recovering known biological relationships

We report recall of known biological relationships among genetic perturbations in RxRx3-core. Across all benchmark databases, we found that CellCLIP achieved the best recovery of known gene-gene relationships compared to baseline models. Notably, we found that replacing the CWCL loss used in CellCLIP with the standard CLIP loss led to worse performance on this task, illustrating the benefits of using soft labeling for alignment in the image profile space \( \mathcal{P} \). Altogether, these results further demonstrate that CellCLIP can recover meaningful relationships between perturbations in its image profile latent space.

Zero-shot gene–gene relationship recovery on RxRx3-core, evaluated across varying thresholds from Recall@2% [0.01, 0.99] (top and bottom 1%) to Recall@20% [0.10, 0.90] (top and bottom 10%), using pathway annotations from CORUM, HuMAP, Reactome, SIGNOR, and STRING.
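
As an illustration of this type of evaluation, the sketch below computes recall of known relationships among the most extreme cosine similarities between gene perturbation embeddings; the exact RxRx3-core protocol may differ, so treat this as an assumption-laden approximation rather than the benchmark's reference code.

import numpy as np

def relationship_recall(emb: np.ndarray, known: np.ndarray, frac: float = 0.01) -> float:
    """emb: (G, d) gene perturbation embeddings; known: (G, G) boolean matrix of
    annotated gene-gene relationships; frac=0.01 corresponds to Recall@2%
    (top and bottom 1% of similarities treated as predicted relationships)."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = emb @ emb.T
    iu = np.triu_indices(len(emb), k=1)              # unique gene pairs
    vals, labels = sims[iu], known[iu]
    lo, hi = np.quantile(vals, [frac, 1.0 - frac])   # e.g. 1st / 99th percentiles
    predicted = (vals <= lo) | (vals >= hi)
    return float((predicted & labels).sum() / max(labels.sum(), 1))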

BibTeX

@article{lu2025cellclip,
  title={CellCLIP--Learning Perturbation Effects in Cell Painting via Text-Guided Contrastive Learning},
  author={Lu, Mingyu and Weinberger, Ethan and Kim, Chanwoo and Lee, Su-In},
  journal={arXiv preprint arXiv:2506.06290},
  year={2025}
}