Kurt Wüthrich

Structural Genomics - Exploring the Protein Universe

Category: Lectures

Date: 1 July 2009

Duration: 36 min

Thank you for the kind introduction. My business is to determine 3-dimensional structures of proteins at atomic resolution by NMR in solution. NMR stands for Nuclear Magnetic Resonance; it does not stand for “No Meaningful Results”. In 1984 we solved the first structure of a globular protein in solution, established the protocol on how to do things, got the structure, went to Stockholm. Today I would like to tell you what we are doing with this technique these days. At the very start I should emphasise, and you’ll see that this goes all through the lecture, that whenever we use NMR in structural biology we are tightly connected to results obtained with x-ray crystallography, which has been around in the field for a much longer time, since the late 1950s with the heroics of Max Perutz and John Kendrew and their postdocs and students at the time. It was only in the mid-1980s that NMR came to be a second technique by which de novo structures could be solved. Now there are two lines of uses of these techniques that I would like to discuss with you. The first part, and what I’m going to say in the next few minutes applies to both NMR and crystallography, is about the use of these techniques in conventional or classical structural biology. I take the example of haemoglobin, which is work done by Max Perutz. He started studying haemoglobin in 1936, he had not finished his work when he died in 2002, and he saw a lot of additional questions that were raised during his studies of haemoglobin. That’s a typical archetype of a structural biology project. You come to a molecule and you stay with it, possibly for life. Now I’m going to use my belt, for you see I eat so well here that I don’t need it for its original purpose. I usually use my belt to demonstrate folding of proteins, but today I need it for another purpose. The belt represents a human genome. Not entirely, the human genome is 1.8 metres of DNA and I’m not quite there yet.
But the best part of the human genome is present here. Now when we spend a lifetime working with haemoglobin, we touch upon about a hundredth of a per mil of those 1.8 metres of DNA, and that’s where we sit for a lifetime. My involvement with haemoglobin came through sports, and as you will see it still occupies me actively these days. I’m not Max Perutz, but I got hooked on haemoglobin and that’s it for life, because it’s a structural biology project. So I was always out of place and so we were thinking intensively, that was in the late 1950s, early 1960s, on how to improve stamina. We realised that the major limiting factor was the uptake of oxygen. So what we were talking about in the early 1960s was high altitude training. We even had the idea that high altitude training would not only lead to an increase of the haemoglobin content, the so-called hematocrit in the blood, but that it might even lead to an increase of muscle haemoglobin. I mean what is referred to as myoglobin; the heart muscle is red because it’s full of myoglobin. If we could increase the content of myoglobin by high altitude training we would have a much more long-lasting effect on stamina than by increasing the level of the much shorter-lived haemoglobin. So here you have the structure of haemoglobin as it is shown in a book, the first book that made 3-dimensional structures generally accessible, the book by Dickerson and Geis. Geis is an artist. There were no computers in the 1960s that would have been able to draw structures of proteins, so Irving Geis would paint the structures. This is a hand painting by Irving Geis. Those of you who are not quite as old as me may not be knowledgeable about this: if you go to New York, there is a museum with 300 paintings of proteins and nucleic acids by Irving Geis.
The reality looked like this. This is the model of haemoglobin built by Max Perutz from plywood pieces that were pasted on top of each other, and you can well see that the resolution is far from showing individual atoms. So when I recorded the spectrum of my haemoglobin it was easy to add some information to the x-ray data at this resolution, and this was quite important. As I said, it didn’t help me at all in running faster or longer when I played football, but it was still satisfying to know how the molecule behaves that let me run around. Now where am I with this today? Today I’m a member of anti-doping Swiss, a governing body that was institutionalised last year by the Olympic Committee and the other sports organisations in Switzerland. We are tightly connected with the seat of the Olympic Committee in Lausanne. We have an active laboratory in Lausanne that develops analytical tools to catch competitive athletes who cheat. And we just had an exciting story relating to haemoglobin. You see, nowadays it’s no longer a question of doing high altitude training or not, because everybody goes to high altitude training. We have a centre in the Engadine at 2,000 metres altitude. The Kenyans who win all the marathons live at 4,000 metres or approximately there, and that’s what it’s all about. So you have to go on. The next step was to use EPO, erythropoietic factor. EPO is widely used for cancer patients and others, I mean kidney failure patients who cannot synthesise red blood cells anymore; you treat them with EPO. It’s a billion dollar drug, it’s a big business, and it can be misused in sports. You can increase the hematocrit beyond 50, that’s an extremely high value, and then your stamina is greatly increased. I have tested it on myself, I can certify that the effect is dramatic. And it was not possible to detect EPO analytically in blood or in urine until approximately 2001.
So the athletes, it is now known that the bicycle racers used EPO for a decade before they were caught, particularly in the Tour de France, in triathlon, in marathon running. Then came Roche and pegylated EPO and this pegylated EPO is sold under the name of CERA. Now CERA could not be detected with the available analytical tools and so starting in about 2002 the athletes started to inject CERA. And we were working, not me personally, I’m just in the advisory body, but the laboratory in Lausanne was working on a test. And last July we knew the test would be coming along but it wasn’t ready for the Tour de France, nor for the Olympics in Beijing. So the organizers of these 2 events were informed about the upcoming test and told that they should simply freeze the blood samples from athletes with suspiciously high hematocrit, that is content of red blood cells. So they did send the samples to Lausanne, 2 months later the number 2, number 3 and the best climber in the Tour de France were disqualified because they had used CERA. There has been a big scandal in Austria, everything happened including tri-athlete, female tri-athlete who has just been put in jail for 3 months because she tried to bribe an anti-doping laboratory for cheating, exchanging samples so that she would get out of the suspicion of having misused CERA. And only 4 weeks ago finally there was enough money to start analysing the samples from Beijing and 3 athletes from Beijing so far were disqualified in May, it was just weeks ago, the most famous of those is a Moroccan middle distance runner who set all the world records in 1,500 metres since 2003. There has never before been a Moroccan middle distance runner of any renown and he has been pushing CERA ever since he started his career, so he was caught and now disqualified. So you see I’m still involved with haemoglobin, this is a structural biology project, it catches you for life. 
So nowadays I’m not working so much in structural biology; I have joined a consortium that works in structural genomics. Now what is the point there? I showed you the 2/3 of the human DNA that’s represented by my belt. As I said, we have focused on about a hundredth of a per mil of the DNA when talking about haemoglobin. And over the years, with all the structural biology that has been done, only small spots have been covered. For the rest we do not even know that the gene products exist, and we don’t know at what stage of life, before birth, at old age, in diseased states, they are expressed. So the job that we are working on in structural genomics is to go after all the dark spots that have not been studied so far. The only condition is that we annotate the DNA and select annotated presumed genes that have less than a certain percentage of sequence similarity with other genes for which either a structure or a function has previously been studied. That’s a job to do. It’s done by a number of groups of scientists; it’s not the kind of work that can be done by individuals or even by a small group. The whole program is financed in the US by the Protein Structure Initiative, and the group I’m with is the Joint Center for Structural Genomics, which is located at Scripps in La Jolla in California. We are a group of about 80 scientists who are organised into 4 cores. There is a bioinformatics core that annotates the DNA and selects the targets. Then we have a biochemistry core, it’s called the crystallomics core, that clones, expresses and crystallises the targets that were proposed by the bioinformatics group. Then there is a structure determination core, that’s located at the SSRL in Stanford. And then there is an NMR core, and you see the NMR core is of modest size in this particular group; the biochemistry that prepares crystals and the crystallography are significantly bigger.
And the administrative core is commendably small when compared to a classical university in Switzerland or in Germany. Now what is the kind of problem that we are facing? Currently about 6 million protein sequences have been deposited in the data bank. Every day tens of thousands are added, and it’s very cheap and very fast to determine more protein sequences. But what does this mean? It means that the DNA sequence that codes for the protein has been determined, and we know nothing more. We don’t know whether that particular protein can actually be expressed. We don’t know whether it is soluble, we don’t know whether it can be crystallised. We don’t know whether it has a function. It can well be, and in many cases it has been so, that these protein sequences have been mis-assigned, that they have not been properly annotated; we are still learning how to do this, and so on. At the other end we have a database of approximately 50,000 3-dimensional structures. Here in this drawing, in the centre, this is represented by this small circle, which should actually be much smaller to represent 50,000 items as compared to the 6 million that are represented by the red circle. Now these 50,000 protein structures have a leverage of 50 to several hundred fold, because proteins can be classified into families with similar structures and often similar functions. So we have a leverage from the experimentally determined protein structures that represents a significantly large proportion of the number of protein sequences that are known. Now what does all this mean? It means that the proteins represented by the dark blue circle can be used as targets in applied research, say for drug development, say in agricultural research and similar. The light blue area covers proteins for which sufficient information is available to make tentative approaches, which might then be substantiated by subsequent experimental structure determination.
And the red area means that so little is known about the proteins that we cannot rationally use them as targets for applied research. The idea now is that we increase these 2 circles and that we look for important structural leverage, so that we increase the number of experimental 3-D structures by a factor of 10, and that the leverage is several hundred and helps us to cover a significant portion of the protein universe that is spanned by the available sequences. So that is the job that we face. I like this so much, I didn’t do it on purpose but. Now I have to wait for it to grow. It’s a slow business; it still costs about $40,000 on average to solve one 3-dimensional protein structure. So here you have another illustration of what I was just talking about. We know today that there are about 32,000 protein families. Of these a large percentage, about 20,000, have only one member, so if we select a protein from these 20,000 we have no leverage; there will be one structure and that’s it. If we go to the largest families we may cover a significant part of the protein universe with a single structure determination. Now in order to attack this problem we need high throughput technology. This high throughput technology has been developed in a very impressive way by crystallography. Just to show you what the numbers are: during this 3 month period late last year the crystallomics core, that is the biochemists, worked on 320 different targets. They generated close to 8,000 crystals, these 8,000 crystals were automatically screened at the synchrotron, and the end of the whole exercise was 51 structures solved. You see, it is a very wasteful approach but in the end it’s a very efficient approach. You see, 8,000 crystals, 50 structures. Now this needs a lot of robotics in crystallography. JCSG is well equipped, and NMR has not approached high throughput in any way.
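As a rough illustration of the numbers quoted above, the following Python sketch works through the arithmetic. The function names are hypothetical, and the coverage estimate is a naive upper bound: it assumes every solved structure reaches a distinct set of sequences, which the talk does not claim. Only the input figures (6 million sequences, 50,000 structures, leverage of 50 to several hundred, 320 targets, about 8,000 crystals, 51 structures) come from the lecture.

```python
# Figures quoted in the talk (2009). Everything else here is an
# illustrative assumption, not part of the JCSG pipeline.
SEQUENCES = 6_000_000
STRUCTURES = 50_000


def covered_fraction(structures, leverage, sequences=SEQUENCES):
    """Naive upper bound on the fraction of known sequences reachable
    from solved structures, assuming no overlap between families."""
    return min(1.0, structures * leverage / sequences)


def pipeline_yield(targets, crystals, solved):
    """Success ratios for one reporting period of a crystallography pipeline."""
    return {
        "structures_per_target": solved / targets,
        "structures_per_crystal": solved / crystals,
        "crystals_per_structure": crystals / solved,
    }


if __name__ == "__main__":
    # Leverage values chosen from the "50 to several hundred" range.
    for leverage in (50, 100, 300):
        print(leverage, round(covered_fraction(STRUCTURES, leverage), 3))
    # The 3 month period described above: 320 targets, ~8,000 crystals, 51 structures.
    print(pipeline_yield(320, 8000, 51))
```

With these inputs roughly one target in six and fewer than one crystal in a hundred ends in a solved structure, which is the sense in which the approach is called wasteful but efficient.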
So we have to find a role for NMR or stop using it in this business. Now there has been a recent publication, exclusively by crystallographers and bioinformaticians, and what they did was to analyse the reports on positive and negative outcomes of individual projects. This is a big contrast to classical structural biology. In this structural genomics enterprise we keep track of all positive and all negative experiments. This enables one to do statistics on good and bad amino acid sequences for crystallography. So they now classify the genes, on the basis of the DNA annotation, into providing optimal, sub-optimal, all the way to very difficult proteins for crystallography. If you have one of the very difficult ones the chance of getting a structure is about 0.1; if you have an optimal sequence it’s 0.3, 3 times higher. Now they also analysed the data banks for NMR structures, and they find that this classification does not carry over to NMR. So the probability of solving an NMR structure for a protein that has been classified as being very difficult for crystallography is actually very high. So the situation is at the moment clear for us: we are not going to try high throughput structural genomics with NMR, we are going to do high output structural genomics. We’ll let the crystallographers go ahead. Whenever they fail down here we take the targets and try to fill in the gaps. I have illustrated this here. This is a particular group of proteins, a group of Pfam families, so let’s assume crystallography has gone across this part of the protein universe. They have solved all those protein families that are indicated in green. The red ones have not been touched, so they are off limits for us.
The black ones are those where crystallography went through cloning, expression and crystallisation trials, so that a lot of the biochemistry is already available, and they failed; they didn’t get crystals that would diffract at high resolution. These targets we take over and try to solve by NMR. Now what is the possible outcome? I show you 2 examples. We got a sequence from a database of proteins that are regulated in human cancers. The sequence had less than 5% homology with known proteins. We solved the structure, and it immediately turned out to be a non-specific nucleoside triphosphatase, and so the project was finished. We had the structure, it was clear where the active site is, we could just run NTPase assays and prove that it is an NTPase. That’s it, the project is finished in principle. In this case it’s probably not an interesting target; if this happens with an important target we may immediately have a new potential target for drug development. Second example, the SARS coronavirus. We got a special grant to study the entire proteome of the SARS coronavirus, which, it was predicted, contains 28 non-structural proteins. And so we started this exercise. Crystallography solved the proteins that are indicated in black; with all the others crystallisation failed, as it turned out because the annotation was not very precise. The proteins invariably had tails, you see this here, and so we solved in this case more structures by NMR than by crystallography. Now these 3 here have now also been solved, so for example we now have all the structures for the first 1,300 amino acid residues of the non-structural protein 3, and there are many more protein domains than had been predicted, about 3 times more. There are also 3 times more functions, and in most cases we only have suspicions regarding the function. The structures are now available as a basis for biochemical work to investigate the functions of these proteins.
And in many cases we also found new folds, which have no resemblance with previously determined proteins. Now how am I doing with time? 5 minutes. OK, I only have to finish by giving you some glimpses of technical developments that we are pursuing within the project of structural genomics. Because even if we talk about high output use of NMR we have to improve the efficiency. I mean, typically it was a matter of several months to determine the structure of proteins that are, for us, of medium size. We have now developed an automated protocol with which we would determine a structure within 2 weeks or less. We only use 2 types of NMR measurements. I show here what is involved: NMR data collection and automated NMR structure determination take typically 1 week. We have run this now for more than 20 proteins, and just as an exercise we ran a dozen proteins for which the crystal structure had been done on site. In all cases we would get to within about 1 angstrom RMSD for the backbone relative to the x-ray structure. There would be another week of interactive work and validation, and that is it. You see here a superposition, with the blue lines representing the NMR structure and the red line the corresponding crystal structure. What are the important things that come in? We would go from 600 microlitre samples to 5 microlitre samples; that’s very essential for work in structural genomics. So there is new probe design with much improved sensitivity, so we go from typically 9,000 micrograms to using 150 micrograms of protein. Another essential ingredient for the technology is the development of Automated Projection Spectroscopy or APSY, which was done in Zurich.
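The backbone RMSD quoted above can be made concrete with a minimal Python sketch. It assumes the two coordinate sets list the same backbone atoms in the same order and have already been superimposed; the optimal superposition step that a real comparison of an NMR and a crystal structure needs is deliberately left out. The function name and the coordinates in the usage lines are hypothetical.

```python
import math


def backbone_rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally long lists of
    (x, y, z) atom positions, in the units of the input (e.g. angstrom).

    Assumes the structures are already superimposed and that atoms are
    paired by position in the lists.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    # Two made-up three-atom fragments, each atom displaced by 1 unit along z.
    nmr = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
    xray = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (3.0, 0.0, 1.0)]
    print(backbone_rmsd(nmr, xray))
```

A value of about 1 angstrom, as reported in the talk, means the paired backbone atoms differ on average by less than a typical covalent bond length.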
Very little of this has been done at Scripps, so I transferred this technology to Scripps. What happens is that we followed work that had been done by Kupče and Freeman, famous NMR spectroscopists: instead of running high dimensional NMR experiments we would project the spectra on 2-dimensional planes. The point is that we can record 30 such 2-dimensional projection planes in half a day. We then have enough redundancy for the computer to distinguish between random noise and artefacts on the one hand and significant peaks on the other. This is a typical result of such an experiment. Because of the way we treat the data we can go to high dimensions; 5 dimensions is now routine. And so this then is the protocol. The backbone assignment using APSY is fully automated, but one would be foolish not to check before going on, of course, so there are a couple of hours of interactive checks on the automated assignment. Then we only need one additional type of experiment in addition to APSY, and these are NOESY experiments, which serve to get side chain assignments and to solve the structure. In all cases we obtain an accurate backbone fold at this stage, before any interactive involvement of the scientists other than the checking of the backbone assignments. Then we go through the interactive structure validation and finish the structural work. As I said, the NMR experiments used are APSY for the backbone assignment and NOESY for the side chain assignments and the structure determination. And then there are another 12 years or so of software development. The program names are GAPRO, MATCH, ASCAN, ATNOS and CANDID; they are now assembled into UNIO, and these programs do the work for the automated steps of the structure determination. Here you have another result that I will call a representative result of what we get. I show this to make sure that you see we get the complete structure with side chains at atomic resolution. These are just the parameters that define the quality of the structure determination.
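The projection idea described above, where redundancy across many 2-dimensional planes separates real peaks from noise, can be illustrated with a toy Python sketch. This is not the GAPRO algorithm: it assumes a 3-dimensional experiment with two indirect dimensions, a single tilt angle per projection, and a simple vote count. All function names, peak positions and the tolerance are hypothetical.

```python
import math


def project(peak, angle_deg):
    """Project a 3D peak (w1, w2, w3) onto a 2D plane.

    The two indirect dimensions w1 and w2 are combined along a direction
    tilted by angle_deg; the direct dimension w3 is kept unchanged.
    """
    w1, w2, w3 = peak
    a = math.radians(angle_deg)
    return (w1 * math.cos(a) + w2 * math.sin(a), w3)


def support(candidate, projections, tol=0.05):
    """Count the projections that contain a peak within tol of the
    position where the candidate would appear in that projection."""
    count = 0
    for angle, peaks in projections.items():
        u, v = project(candidate, angle)
        if any(abs(u - pu) <= tol and abs(v - pv) <= tol for pu, pv in peaks):
            count += 1
    return count


def accept(candidates, projections, min_support, tol=0.05):
    """Keep candidates that are supported by at least min_support projections."""
    return [c for c in candidates if support(c, projections, tol) >= min_support]


if __name__ == "__main__":
    true_peak = (1.0, 2.0, 8.0)   # made-up positions
    artefact = (3.0, 0.5, 7.0)    # shows up in one projection only
    angles = (0, 30, 60, 90)
    projections = {a: [project(true_peak, a)] for a in angles}
    projections[30].append(project(artefact, 30))
    print(accept([true_peak, artefact], projections, min_support=3))
```

In this toy setup the real peak is found in all four projections while the artefact is found in one, so a threshold of three keeps only the real peak. Recording more projections raises the number of independent votes, which is the redundancy the talk refers to.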
That’s the end of my talk; I just want to make one additional comment. I have now talked exclusively about the use of NMR for structure determination, that means for the determination of the shape of the proteins. In the future, the next phase of structural genomics will be much more directly linked to biomedical and biological research. There we will be forced by those who pay for the program to study the functional properties in more depth, in addition to the structural properties. You see, in the past in structural biology we would usually work with systems where the biochemists and the biologists had accumulated a huge amount of data, and then you would get the structure and rationalise what had happened. Now we have structures and no biological or biochemical knowledge yet. So we are now ahead of the pack, and we are asked to help the biochemists to catch up to the level where we are with the structures. And here NMR will play a very different role. You see, this is about getting the shape of a molecule; this has the resemblance of Switzerland, here would be Lake Constance, and here are the Alps around here. There’s a lot of colour on here. Now this is again the molecule that I showed at the very beginning, the protein that we solved in 1984, but now we have done a large number of additional measurements, giving us information on the hydration, giving us information on interactions with small molecules, giving us information on the mobility of parts of the protein, and so on and so forth. So in addition to getting the shape of the molecules at atomic resolution, NMR will be more and more important in adding all those colours on top of the structure and helping us to get information on the functions based on the structural information. And that’s for the next 5 years. Thank you for your attention.


In today’s post-genomic era, with the availability of the complete DNA sequences of a wide range of organisms, structural biologists are faced with new opportunities and challenges in “structural genomics”. In contrast to classical structural biology, research in structural genomics is focused on gene products with unknown structures, unknown functions, and minimal similarity to previously studied proteins. A precisely formulated goal of structural genomics is to determine representative three-dimensional structures for all protein families, which requires ‘high-throughput’ technology for protein production and structure determination, and the long-term outlook is to predict physiological protein functions from knowledge of new three-dimensional structures. The California-based Joint Center for Structural Genomics (JCSG; PI Dr. Ian A. Wilson) is one of four large-scale consortia in the NIH-funded Protein Structure Initiative (PSI). The JCSG developed and operates an extensively automated high-throughput pipeline for protein production, crystallization and crystal structure determination. However, there remain gaps in the coverage of protein fold space that arise because certain proteins are not readily amenable to crystal structure determination. My research team (the “NMR Core”) works on filling such gaps with a ‘high-output’ approach, which involves novel strategies of target selection as well as new technology for NMR structure determination. When compared to structure determination by X-ray crystallography, the NMR method is complementary by the fact that atomic resolution structure determination and measurements of supplementary function-related data can be performed under solution conditions that can be very close to the physiological milieu in body fluids. 
By generating data on protein structure stability, dynamics and intermolecular interactions in solution, NMR has an exciting role also in the longer-term challenge leading from the expanding protein structure universe to new insights into protein functions, chemical biology and biomedical applications.

The JCSG is supported by the National Institute of General Medical Sciences, Protein Structure Initiative: Grant U54 GM074898.