14 votes

A watershed moment for protein structure prediction

8 comments

  1. [6]
    JakeTheDog

    If anyone has any questions—I'm an enthusiastic researcher in this field (not specifically AI/ML but structural biology) and I'd be happy to answer!

    3 votes
    1. [2]
      skybrian

      I don't have anything specific to ask about, but I'm curious what areas are you working in and what do you think are the most promising areas for research or development?

      1 vote
      1. JakeTheDog

I work somewhat broadly on method development, most specifically in mass spectrometry. I've worked on modelling a bunch of different protein systems. As awesome as these computational predictive methods are, nothing beats good ol' fashioned empirical analysis with wet-lab techniques.

        One of the major issues is that, despite there being a lot of different techniques you can use to acquire data on protein structures—all with their own strengths and weaknesses—there are some serious limitations that may be impossible to overcome by any one technique. I say "impossible" because the chemistry and/or physics of the protein is simply not compatible with the technique (like large and "floppy" proteins for X-ray crystallography).

        So, the current frontier of the field is to solve protein structures using data obtained by several different techniques in parallel, and then computationally combine the data to model the structure.

The usefulness of the purely computational approach, as you share here, is that a) predicting structures for a set of 1000 new sequences you just obtained is way easier and faster than going the empirical route, so it's a good first pass, and b) predicted structures can also be combined with (and/or validated against) empirical methods.

        5 votes
    2. [2]
      Comment deleted by author
      1. JakeTheDog

        Hah, I probably should have made a top-level comment answering this in anticipation—such is the burden of knowledge (i.e. forgetting that others don't know what you know).

        In the simplest terms: knowing the structure (i.e. the "fold(s)") of a protein means we can design drugs to target the protein.

If genes are the "blueprint", then proteins are the physical "things" that make up an object. For a car, proteins would be the tires, pedals, gears, and fuel lines. Let's say this car has some sort of mechanical failure, caused by a bad gear. If a gear is missing a tooth, and you know exactly how the gear is shaped, where the missing tooth is, and what it should look like, then you can design a new tooth to weld onto the gear. In biology, we are the cars, a mechanical failure is the disease (or symptom), a gear is the protein, and the "new tooth" you fix the gear with is the drug.

        Right now most drugs are not so much designed as they are discovered by massive library screens, testing tens of thousands of candidate molecules by trial and error. The holy grail would be to computationally predict the ideal drug for a particular issue.

So to answer your question: it has everything to do with tackling diseases effectively. Not only that, but also the manufacturing of valuable molecules (via synthetic biology: using bacteria or yeast to synthesize complex drugs) and new materials (think of new versions of spider silk for construction).

I'll 1-up this by also adding that not only is the structure important, but also the dynamics. Proteins are not solid objects; many of them are floppy and move around a lot. Another tricky part is knowing when the movement is "wrong" and how to make it "move better".

        Here's one of my favorite proteins that is essentially a motor: ATP synthase. A similar protein powers a sperm's flagellum, so literally a motor.

        Here is an awesome group that does more protein animations (based on real simulations).

        2 votes
    3. [2]
      moocow1452

So my understanding is that this builds on the Folding@Home project, but with a new algorithm that sorts out all of the impossible/junk answers to find the ones that could exist and be helpful, more quickly than the brute-force method?

      1. JakeTheDog

        Sort of. The efficiency is acquired before the possible answers are found, by essentially simplifying the search space.

First some terms: a "residue" is an amino acid; a chain of amino acids makes up a protein. A protein is like a string of beads (residues) "folded" up into a particular shape.
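A minimal way to picture that in code (the residue names and coordinates below are invented for illustration):

```python
# "String of beads": each residue is a bead with a name and a 3D position.
protein = [
    ("MET", (0.0, 0.0, 0.0)),
    ("ALA", (3.8, 0.0, 0.0)),  # consecutive residues sit ~3.8 angstroms apart
    ("LYS", (5.5, 3.0, 1.2)),  # the chain bends as it folds
]
```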

There are two steps involved. The first is using their neural network, trained on the existing database of protein structures (which is big, to say the least), to predict residue-residue distances, i.e. to approximate the structure based on what seems to occur in nature.

The second is using this approximate candidate structure as the starting point for a physics-based simulation (or a good-enough approximation of one), essentially energy minimization, to make it realistic. Sort of like when an architect designs a crazy new building and the engineers have to make it functional/realistic for the real world.

This pairing is optimal. The issue with step 1 is that no "real world" physics is involved, so it generally gives a very rough solution, and it's also based on a biased data set of preexisting structures. The issue with step 2 is that, in the absence of any concrete starting point, the search space is effectively infinite (as in, it's so computationally expensive that it's pointless to attempt).
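To make the two steps concrete, here's a deliberately tiny sketch in Python (numpy only). Everything in it is invented for illustration: the real method predicts full distance distributions with a trained network rather than single target distances, and uses a more careful potential. But the shape of step 2, gradient descent on a smooth pseudo-energy from a rough starting structure, is the same idea.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res = 10  # a toy 10-residue "protein"

# Step 1 stand-in: pretend a trained network handed us one predicted
# distance (in angstroms) per residue pair. Invented numbers.
idx = np.arange(n_res)
target = np.minimum(np.abs(np.subtract.outer(idx, idx)) * 3.8, 12.0)

# Step 2: start from a rough candidate structure and minimize a smooth
# harmonic pseudo-energy that penalizes deviation from the predictions.
coords = rng.normal(scale=5.0, size=(n_res, 3))

def energy_and_grad(x):
    diff = x[:, None, :] - x[None, :, :]        # pairwise displacement vectors
    dist = np.linalg.norm(diff, axis=-1)
    np.fill_diagonal(dist, 1.0)                 # avoid divide-by-zero on the diagonal
    err = dist - target
    np.fill_diagonal(err, 0.0)
    e = 0.5 * np.sum(err ** 2)
    grad = 2.0 * np.sum((err / dist)[:, :, None] * diff, axis=1)
    return e, grad

for _ in range(500):                            # plain gradient descent
    e, grad = energy_and_grad(coords)
    coords -= 1e-3 * grad
print(f"final pseudo-energy: {e:.1f}")
```

The point is just that the network's output turns a hopeless blind search into a well-posed minimization.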

To be honest, this isn't really that much of an innovation (the techniques are in regular use), but it's surprising just how good it manages to be: more than the fractional increment in accuracy we usually see.

        1 vote
  2. skybrian

    From the article:

    The course of the structure-prediction field changed nearly a decade ago with the publication of a series of seminal papers exploring the idea that the evolutionary record contains clues about how proteins fold. The idea is predicated on the following premise: if two amino-acid residues in a protein are close together in 3D space, then a mutation that replaces one of them with a different residue (for example, large for small) will probably induce, at a later time, a mutation that alters the other residue in a compensatory direction (in our example, swapping small for large). The set of co-evolving residues therefore encodes valuable spatial information, and can be found by analysing the sequences of evolutionarily related proteins.

    By transforming this co-evolutionary information into a matrix known as a binary contact map, which encodes which residues are proximal, the set of conformations that merit consideration by algorithmic searches can be restricted. This in turn makes it possible to accurately predict the most favourable protein conformation, especially for proteins for which many evolutionarily related sequences are known. The idea was not new, but the rapid growth in available sequence data in the early 2010s, coupled with crucial algorithmic breakthroughs, meant that its time had finally come.

    [...]

    In lieu of binary contact data, AlphaFold predicts the probabilities of residues being separated by different distances. Because probabilities and energies are interconvertible, AlphaFold predicts an energy landscape — one that overlaps in its lowest basin with the true landscape, but is much smoother. In fact, AlphaFold’s landscape is so smooth that it nearly eliminates the need for searching. This makes it possible to use a simple procedure to find the most favourable conformation, rather than the complex search algorithms employed by other methods.
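For a concrete feel for the co-evolution idea in the first quoted paragraph, here's a toy Python sketch. The alignment is invented, and real methods use far more sophisticated statistics than raw mutual information (e.g. direct-coupling analysis, which corrects for indirect correlations), but the principle is the same: columns that mutate together get flagged as likely spatial contacts.

```python
import numpy as np
from itertools import combinations

# Toy alignment: rows are evolutionarily related sequences, columns are
# residue positions. Column pairs (0, 3) and (1, 4) co-vary by
# construction, standing in for residues in spatial contact.
msa = [
    "AKLSE",
    "ARLSD",
    "GKLAE",
    "GRLAD",
    "AKLSE",
    "GRLAD",
]

def column(i):
    return [seq[i] for seq in msa]

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def mutual_information(i, j):
    joint = [a + b for a, b in zip(column(i), column(j))]
    return entropy(column(i)) + entropy(column(j)) - entropy(joint)

n = len(msa[0])
for i, j in combinations(range(n), 2):
    mi = mutual_information(i, j)
    flag = "  <- predicted contact" if mi > 0.5 else ""
    print(f"columns {i} and {j}: MI = {mi:.2f}{flag}")
```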
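And the probabilities-to-energies interconversion in the last quoted paragraph is the standard Boltzmann relation, E = -kT ln P. A sketch with an invented distance distribution for a single residue pair (the per-pair "distogram" the excerpt describes):

```python
import numpy as np

# Hypothetical predicted probabilities for one residue pair over distance
# bins. The numbers are invented and normalized to sum to 1.
bins = np.arange(2.0, 22.0, 2.0)   # bin centers, angstroms
probs = np.array([0.01, 0.04, 0.30, 0.40, 0.15, 0.05, 0.02, 0.01, 0.01, 0.01])

kT = 0.593                         # kcal/mol at ~298 K
energy = -kT * np.log(probs)       # Boltzmann inversion: E = -kT ln P

best = bins[np.argmin(energy)]
print(f"lowest-energy separation: {best:.1f} A")  # the most probable bin
```

Because the network's probabilities vary smoothly over the bins, the resulting energy profile is smooth too, which is why a simple minimization suffices where other methods need elaborate search.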