Opinion

RFDiffusion3: A Brief Exploration

Last December, the Institute for Protein Design dropped RFDiffusion3, a protein design model that operates at the level of individual atoms. Before the AIs figure out how to use it to craft mirror life bacteria and kill everyone, I wanted to understand its architecture and do a mini exploration of protein design models. How does it differ from an LLM? What are proteins?

Disclaimer: This is what RFDiffusion3 looks like to a curious bio noob. I am not a computational biologist and corrections are welcome! Nothing in this article is health or financial advice etc.

Proteins

Proteins are chains of amino acids (aka “residues”). An amino acid is a molecule with both an amino group (NH2) and a carboxyl group (COOH). They’re sort of like tiny magnets: the amino group of one amino acid sticks to the carboxyl group of another to form a peptide bond (C(=O)-N). Amino acids are often represented as having one N-terminus (amino group) and one C-terminus (carboxyl group) along with a side chain R that can be any set of atoms (including other carboxyl and amino groups, making cool rings and graphs!).

[Figure: Pictures of amino acids. On the right is a skeletal diagram; kinks are carbons (Cs).]

[Figure: Two amino acids forming a peptide bond via a condensation reaction (water leaves).]

There are over 500 amino acids found in nature but only 20-22 in the genetic code that create proteins (proteinogenic), and most protein models are trained on the canonical set of 20. You’ll often see amino acids from this canonical set represented with three-letter or one-letter codes, e.g. alanine as Ala or A.

[Figure: Some amino acids. Notice how they all start with A and start hashing into increasingly wild single-letter codes.]

Since these amino acids and their side chains have charges and push and pull each other, proteins have an energetically favorable 3D structure they will “fold” into, kind of like those snake toys that can fold into 3D shapes from a 1D sequence.

[Figure: One of my favorite toys as a kid.]

Proteins can be described at four levels of structure: primary, secondary, tertiary, and (optionally) quaternary.
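The one-letter and three-letter codes mentioned above amount to a lookup table. Here is a minimal sketch of that table for the 20 canonical amino acids; the helper name `to_three_letter` and the toy peptide `GAVL` are made up for illustration and do not come from any protein library.

```python
# One-letter -> three-letter codes for the 20 canonical amino acids.
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def to_three_letter(sequence: str) -> str:
    """Spell out a one-letter sequence as dash-separated three-letter codes.

    Hypothetical helper: raises ValueError on anything outside the
    canonical set of 20 instead of guessing.
    """
    unknown = sorted(set(sequence) - set(ONE_TO_THREE))
    if unknown:
        raise ValueError(f"non-canonical residue codes: {unknown}")
    return "-".join(ONE_TO_THREE[aa] for aa in sequence)


# Toy four-residue peptide (glycine, alanine, valine, leucine).
print(to_three_letter("GAVL"))  # Gly-Ala-Val-Leu
```

The "wild" single-letter codes show up here too: once A is taken by alanine, arginine ends up as R, asparagine as N, and aspartate as D.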
Let’s explore with Protein Data Bank (PDB) entry 4zxb: IRDeltabeta construct, human insulin receptor ectodomain in complex with four Fab molecules (English: the part of the insulin receptor sticking out of the cell, synthetically constructed in a lab, interacting with mouse antibodies).

The primary structure is just the string of amino acids. It looks like

HLYPGEVCPGMDIRNNLTRLHELENCSVIEGHLQILLMFKTRPEDFRDLSFPKLIMITDYLLLFRVYGLESLKDLFPNLTVIRGSRLFFNYALVIFEMVHLKELGLYNLMNITRGSVRIEKNNELCYLATIDWSRILDSVEDNHIVLNKDDNEECGDICPGTAKGKTNCPATVINGQFVERCWTHSHCQKVCPTICKSHGCTAEGLCCHSECLGNCSQPDDPTKCVACRNFYLDGRCVETCPPPYYHFQDWRCVNFSFCQDLHHKCKNSRRQGCHQYVIHNNKCIPECPSGYTMNSSNLLCTPCLGPCPKVCHLLEGEKTIDSVTSAQELRGCTVINGSLIINIRGGNNLAAELEANLGLIEEISGYLKIRRSYALVSLSFFRKLRLIRGETLEIGNYSFYALDNQNLRQLWDWSKHNLTITQGKLFFHYNPKLCLSEIHKMEEVSGTKGRQERNDIALKTNGDQASCENELLKFSYIRTSFDKILLRWEPYWPPDFRDLLGFMLFYKEAPYQNVTEFDGQDACGSNSWTVVDIDPPLRSNDPKSQNHPGWLMRGLKPWTQYAIFVKTLVTFSDERRTYGAKSDIIYVQTDATNPSVPLDPISVSNSSSQIILKWKPPSDPNGNITHYLVFWERQAEDSELFELDYCLKGLKLPSRTWSPPFESEDSQKHNQSEYEDSAGECCSCPKTDSQILKELEESSFRKTFEDYLHNVVFVPRPSSLGDVGNAGNNEEHRPFEKVVNKESLVISGLRHFTGYRIELQACNQDTPEERCSVAAYVSARTMPEAKADDIVGPVTHEIFENNVVHLMWQEPKEPNGLIVLYEVSYRRYGDEELHLCVSRKHFALERGCRLRGLSPGNYSVRIRATSLAGNGSWTEPTYFYVTDYLDVPSNIAK

The secondary structure describes the local geometry of substrings in the primary sequence (typically 4-40 residues) and is composed mainly of alpha helices (spirals), beta sheets (sheets), and loops (chaos). Here we zoom into residues 695-709 (ELEESSFRKTFEDYL), an alpha helix, and residues 857-862, 842-851, 880-890, and 901-905, beta strands that interlock into a beta sheet.

The tertiary structure is the shape of the entire protein chain.

[Figure: Secondary and tertiary structure. Notice the beta sheets represented by arrows and the alpha helices represented by spirals.]

Quaternary structure describes how two or more folded protein ‘subunits’ interact.
In 4zxb, the quaternary structure is composed of both the insulin receptor ectodomain and the mouse antibody tertiary structures.

[Figure: Quaternary structure. The insulin receptor protein is in green and the mouse antibodies are in magenta.]

Protein Folding and Design

At a high level, people simulating proteins with computers create functions that map between sequence, structure, and protein function. The two main areas are protein folding and protein design. In a nutshell, protein folding is a prediction problem while protein design is a generation problem.

When people think of proteins + AI, the first thing that comes to mind is DeepMind’s AlphaFold. AlphaFold does sequence-to-structure prediction. Given a string of amino acids like MISKIDKNKVRLKRHARVRTNLSGT, AlphaFold outputs likely 3D coordinates and bond rotation angles. Open source protein folding models include RoseTTAFold (fine-tuned to become the denoiser in RFDiffusion) and ByteDance Protenix.

The inverse of folding a protein is called inverse folding (surprise!). A popular inverse folding model is ProteinMPNN (message passing neural network), which predicts sequence given an input structure.

Protein design involves generating completely new protein structures and sequences from constraints and input structures. The idea is like asking the model “given this fragment of amino acids and target hotspots, can you make a new protein (of this length with these attributes) that binds to it?” RFDiffusion, PXDesign, and DISCO are some example open source models.

Some things you can do with protein design models:

- Generate new proteins that bind to existing proteins
- Generate new proteins that bind to nucleic acid sequences (DNA, RNA, etc)
- Generate new enzymes (proteins that speed up chemical reactions)
- Generate symmetric proteins
- And more!

Diffusion

Why do we use diffusion models to generate proteins? It turns out diffusion models are just generally good at mapping between complicated high-dimensional data distributions.
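As a rough illustration of what a diffusion model is trained on, here is a minimal sketch of the forward (noising) process applied to toy 3D points. It assumes the linear variance schedule from the original DDPM formulation (beta going from 1e-4 to 0.02 over 1000 steps). RFDiffusion3's actual noise schedule and parameterization may well differ, and all function names here are made up for illustration. A trained model learns to run this process in reverse, starting from pure noise.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed DDPM-style schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bar(betas):
    """Cumulative product of (1 - beta_t): how much of the signal survives after t steps."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def noise_coords(coords, a_bar_t, rng):
    """Sample x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps, with eps ~ N(0, I)."""
    signal = math.sqrt(a_bar_t)
    sigma = math.sqrt(1.0 - a_bar_t)
    return [
        tuple(signal * c + sigma * rng.gauss(0.0, 1.0) for c in point)
        for point in coords
    ]


rng = random.Random(0)
x0 = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]  # three toy "atoms" on a line
a_bars = alpha_bar(linear_beta_schedule(1000))

# Early steps barely move the points; by the last step almost no signal is left.
for t in (0, 499, 999):
    x_t = noise_coords(x0, a_bars[t], rng)
    print(f"t={t:4d}  signal kept={a_bars[t]:.5f}  middle atom={x_t[1]}")
```

The closed form lets you jump straight to any noise level without simulating every intermediate step, which is what makes training on randomly chosen timesteps cheap.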
A classic use case is image generation, for example Denoising Diffusion Probabilistic Models (DDPMs).

RFDiffusion3 Architecture

At a very high level, RFDiffusion3 is a black box that takes in input atom/protein structures (.cif or .pdb file) and outputs atom/protein structures (.cif file with atom and protein labels). Although it does provide a sequence, it’s recommended that you double check it by refolding the sequence (with a model like AlphaFold) or inverse folding the structure (with a model like ProteinMPNN).

[Figure: RFDiffusion3 as a black box.]

Looking closer, the two main components are the token initializer and the diffusion module. The token initializer takes the input protein/atomic sequence and design constraints and converts them to atom-level and token-level embeddings. A token is an amino acid, a nucleic acid base, or a single atom in ligands (small molecules). This module runs once, and the embeddings are used to condition each pass through the diffusion module.

The diffusion module runs multiple forward passes (default 20), gradually denoising 3D atom coordinates from pure noise. Within each forward pass, there is recycling, a trick from AlphaFold where the same blocks are run multiple times (default 3) to extend the effective depth of the model. The idea is that the forward pass makes “messy” guesses and recycling refines them by letting the model see its predicted coordinates and distograms multiple times. A distogram is a tensor of shape [num_tokens, num_tokens, c] that represents binned distances between token pairs. Another thing to note is the U-Net-like skip connections (raw atom features skipping the token initializer, conditioning embeddings skipping the encoder, and embeddings passing straight from the encoder to the decoder).

[Figure: Main components of RFDiffusion3.]

Here’s a more in-depth diagram, which I might explain more in a followup post.

The Whole Workflow

Let’s design a protein! I’ll be using this example of protein-protein interface design from the docs.
The task: design a protein that binds to a human insulin receptor. This might be useful for managing diabetes.

We first take the protein structure of the target receptor from the Protein Data Bank and crop it to the section we care about. It looks something like this:

[Figure: The cropped target structure.]

Then, we specify configs for the generation process:

[Figure: The config for the run.]

- The input is the pdb file of the structure we want to design around.
- Contig is a string in a domain-specific language (InputSelection) that says “generate a protein 40-120 amino acids long, then a chain break, then take amino acids 6-155 from chain E in the input pdb and design around that”.
- Length means the length of the final generated protein + input structure.
- select_hotspots is a way to select specific atoms on the amino acids in the input structure. For example, on residue 64 of chain E (E64) of the input, we want to bind to the second delta carbon and zeta carbon atoms in the side chain.
- infer_ori_strategy is a way to automatically calculate the origin token, the center of the designed structure in 3D space.
- is_non_loopy discourages loopy, chaotic secondary structure (anything that is not an alpha helix or beta sheet).

Now we grab an L40 GPU and run

rfd3 design inputs=ppi_tutorial.yaml out_dir=outputs/0

Then we get a batch of output .cif files. Notice how the part on the left is the input structure, and the part on the right is newly generated, sort of like those generative image inpainting tools.

The next step is typically verifying the amino acid sequence by inverse folding the structure with a model like ProteinMPNN, and then refolding the sequence with AlphaFold3 or RoseTTAFold3 to validate the accuracy of the original generated structure. Then you would score the designs on some criteria (like toxicity, immunogenicity) and finally validate the top few in a wet lab.

Conclusion

Today I learned that not all generative AI slop is slop! Protein design and structure models are awesome and only getting better.
They might elevate the risk of human extinction, but they might also make lifesaving drugs, enzymes to break down plastics, better GMOs, and more. Current protein models can be run on a single GPU: design your protein today!

