
Viral Resources at NCBI

December 14, 2019

I’d like to virtually introduce Rodney Brister, who is the head of the Viral RefSeq Group, and he’s going to talk about viral resources at NCBI. These are rapidly evolving resources. I learn about new stuff that we have every time he talks, so this should be good. Okay, thank you guys. And so it’s been mentioned, a lot of the things I’m going to be talking about are rapidly evolving. Basically, as we go through things today, be aware that the core functionalities will probably be maintained over the next couple of years but the pages may look differently as we improve our data models and try to serve the public in better ways. So to find the viral general resources at NCBI, the first thing one should do is just search Google, for viral genomes. And luckily enough, we are the top hit right down here. And so if you click on that hit, you will come to our homepage and this homepage aggregates a number of different resources, and depending on what you’re trying to do, we have a variety of tools and sub-report resources designed to help you do it. Right off the bat I’ll tell you a little bit about what we do as part of the RefSeq project. Actually, my group deals with RefSeq, but we also deal with a number of other resources, so we have a very large scope that includes creating reference genomes, and also aggregating other genomes that one might find in a species for example, and curating those. And we also build a lot of value-added resources that are not genomes, per se, that are dependent on viral sequences via nucleotide or protein, and enhancements to sequence data, annotation pipelines, and mapping of metadata associated with sequence records to standardize terms. 
And finally, working with people in the community to improve something we call actionable intelligence, where we bring in a lot of metadata fields to help describe sequences so that you can know something about the sequence in terms of maybe potential pathogenicity or know more about the sequence in terms of potential biological relevance. And so as you come here, you’ll see that we kind of have this broken down into some subtitles of exploring sequences and downloading sequences, as well as some tools that support the functionalities I just mentioned. So right off the bat, for a lot of people who are heavy users, right here, this whole section is really for you. And what this section is here is a group of, in one case an FTP site, and accession lists that are hosted at another conceptualized FTP site, where you can find every reference genome for viruses, and something that we call neighbor genomes. And so our data model is pretty simple, we create one or more reference genomes for each viral species and then we aggregate all other genomes at that same viral species as neighbors. And essentially these all go through some sort of curation process that’s machine aided, but human beings are basically making decisions about whether or not this is the specific genome based on some criteria. Generally that criteria is that the genome includes all of the coding sequences. As many of you know; for a number of RNA viruses, current sequencing protocols require primers to be placed at the non-translated regions of the genome to amplify the internal sequences. And so for many, many RNA viruses, and some single-stranded DNA viruses, there are very few actual complete genomes there because the primer sequences actually obscure the sequence at the very end. But anyway, so if you are interested in simply grabbing a list of every single viral genome out there that’s passed through our process, you can get that right here. 
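The data model described above (one or more reference genomes per species, with all other validated genomes attached as neighbors) can be sketched in a few lines of Python. This is only an illustration for readers building a local database: the tab-delimited layout, the column order, the `#` comment convention, and the accessions (`REF_A`, `NBR_1`, and so on) are all assumptions made for the example, not the documented format of the accession lists on the FTP site.

```python
import csv
import io
from collections import defaultdict


def group_neighbors(table_text):
    """Map each reference accession to the list of its neighbor accessions.

    Assumes a tab-delimited table whose first two columns are the reference
    accession and the neighbor accession (a hypothetical layout).
    """
    neighbors = defaultdict(list)
    # Drop comment lines (assumed to start with '#') before parsing.
    rows = (line for line in io.StringIO(table_text) if not line.startswith("#"))
    for row in csv.reader(rows, delimiter="\t"):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        reference, neighbor = row[0], row[1]
        neighbors[reference].append(neighbor)
    return dict(neighbors)


# Made-up rows standing in for a downloaded neighbor list.
sample_table = (
    "# reference\tneighbor\thost\tspecies\n"
    "REF_A\tNBR_1\thuman\tExample virus A\n"
    "REF_A\tNBR_2\thuman\tExample virus A\n"
    "REF_B\tNBR_3\talgae\tExample virus B\n"
)

grouped = group_neighbors(sample_table)
for reference, members in grouped.items():
    print(reference, len(members), members)
```

Keying the dictionary on the reference accession mirrors the model in the talk, where neighbors are always defined relative to the reference genome of their species.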
And so if you’re creating your own local databases or something like that, that’s probably the place for you. Now for a lot of other users, they’re more interested in exploring viral genomes, and so I’m going to take you through that process now. Now right here at the top you see this thing, this link that says viral genome browser, and this is it here. Basically what this does is it lists every single viral genome by species, and in this table here you will see that we have a lot of information, including information about the — excuse me, about the taxonomy in the space. The names of over here are species names. The accession for the reference genome and other information about the genome, such as its length, its number of proteins. And then something here where we’re calling host. Actually, excuse me, I’m going to launch this page. And I’ve actually already kind of sorted here by host, and I didn’t want to do that. Sorry about that. And you can go into this table and you can select, for example, a particular host if you want to find viruses that infect that host and then filter all of the results by that. And let’s see if the whole thing loads here. And it takes a while. Sorry about that. So here we go, here’s the entire list. And as you can see, it’s sorted by taxonomy first, and then finally by alphanumeric order. So I can come in here and I can just simply select the host. I can select and sort by taxonomy class. And, anyway, so that’s taking long, so I’ll go back here. So here I’ve presorted by algae, so now we’re listing every virus that we have that infects algae. These host terms are curated, we’re dependent on the information in the sequence record sometimes, sometimes on literature, to find the host. And so once you have the host or any other filtering that you want to do, you can come in and retrieve sequences via a number of different ways. 
So one way is to basically come over here and say, I just want to see all the RefSeq nucleotide sequences for that particular host, and that will bring you to Entrez Nucleotide, where you will find a list of all the RefSeqs, and you can go and look at them if you want. You can also go in here and download sequences and do all the normal things you do in Entrez Nucleotide, and I’ll come back to that in a little bit. You can also come in here and select, for example, all the RefSeq proteins associated with the particular species that you’ve now filtered. RefSeq proteins are basically all the protein records that exist on the RefSeq nucleotide sequences. And then of course we have the ability to find all the neighbor sequences, which are the non-reference sequences of each particular species, as well as a dataset that includes RefSeq plus neighbor sequences. You can now download all this information directly as well simply by hitting this button here: you can download the table, you can download the list of accessions, and you can download the genome neighbors and RefSeq data. And that’s a pretty comprehensive table that’s very similar to the table that I showed you that exists right here, these two tables, except it’s now been sorted by host class. All right, so we have a couple other functionalities in here, so if you go in here and click on the links to neighbors you can see a couple different bits of information. One is the alignment to the RefSeq. I can tell you that we’re in discussions right now on how to improve this tool. It should look very different, hopefully within a year, and give you added functionality to pinpoint the differences between the reference sequence and the various other neighbor genome sequences of that same species. 
And then with neighbors you can, again, see the link to Entrez Nucleotide, which puts you basically on the same page I showed you before, and this time it’s been restricted to only the neighbors of that particular species. So now I’m going back to the full table here. And one of the ways of interacting with this type of data is these links up here, which allow you to see different groups based on taxonomy. And if we go back to the homepage, you can see here I can browse for all genomes by family, and so when you do that you come to this page, which basically lists every family we have in the NCBI taxonomy database, plus some non-family groupings based on unclassified. These are basically groups of viruses that have not been classified by ICTV, the International Committee on Taxonomy of Viruses, so we maintain them in these unclassified bins. And so if you go in here, you can, for example, pick a particular family and you can ask to see the complete genomes for that family, and, again, it just takes a little while to load. And you’ll be brought to a page that’s been filtered simply by taxonomy, and, again, you can look through here. You have a little bit of information about the genomes, the references for that particular species. And then, for example, here, you can see that in human mastadenovirus D, we have 109 curated validated genomes for that species. And you can then go click on this and go to Entrez Nucleotide, and now you have all the genomes for that species, and you can do what you would like with them. You can download them. You can go here and find related data to them. For example, I can go in here and I can say I want to find protein data here, and then say, okay, give me all proteins that are derived from this nucleotide dataset, and now I have 3,900 proteins that represent all the GenBank proteins from those 109 validated nucleotide genomes. 
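The same "nucleotide records, then their linked proteins" navigation can be approximated outside the browser with the NCBI E-utilities. The sketch below only builds the request URLs and does not contact NCBI. The `esearch.fcgi` and `elink.fcgi` endpoints and their `db`, `term`, `retmax`, `dbfrom`, and `id` parameters are part of E-utilities, but the organism query shown is an assumption: it matches every nucleotide record for the species and is not restricted to the curated, validated genome set shown in the talk. The record IDs are placeholders.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=200):
    """Build an ESearch URL that returns record IDs matching a query."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?" + urlencode(params)


def elink_url(dbfrom, db, ids):
    """Build an ELink URL that finds records in `db` linked to `ids` in `dbfrom`."""
    params = {"dbfrom": dbfrom, "db": db, "id": ",".join(ids)}
    return f"{EUTILS}/elink.fcgi?" + urlencode(params)


# Step 1: nucleotide records for one species (illustrative query only).
search = esearch_url("nuccore", '"Human mastadenovirus D"[Organism]')

# Step 2: proteins linked to those records (placeholder IDs, not real ones).
links = elink_url("nuccore", "protein", ["111", "222"])

print(search)
print(links)
```

In practice the IDs returned by the first request would be fed into the second, which is the programmatic equivalent of choosing the protein database under the related-data links on the page.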
Okay, so now this is — you know, obviously we’re a sequence database, so sequences are important to us, and these tools I just showed you are fairly old. They’ve been around for a while, and as I said, that we’re working on actually improving them right now. And one model that we’re using to improve data is something that we call the virus variation model, and here you can find links to the resource itself right here, and then to individual modules within the resource right here. So let’s see, I’m just going to go ahead and skip to this page. So if you click the homepage virus variation — and, again, if you are just wondering how to get there and forget the links, you just enter “virus variation” in your Google search, and we are the top hit. So that brings you to this homepage, and the homepage basically tells you that we have these five modules. I can tell you that by the end of the week, there will be another module for rotavirus that will be up, and then some how-to and other information about the resource over here on the left-hand column. So virus variation, in general, is kind of a value-add approach to viral sequence data. The scope of the resource is not just restricted to genomes but is all viral sequences that are in GenBank. So I think there are currently about 2.5 million viral sequences, including those for influenza. And I think at this point we have about 600,000 of those loaded in the virus variation. We are quickly moving to adapt this model for a number of other things, and in the next few months we will be loading our complete viral genomes viral database, the RefSeq stuff that I was just talking about, this will all be loaded into the virus variation resource. And, as I’m about to explain, this is going to be a good thing because you’re going get a lot more value out of all the data. 
So we’ll click on the module here, for dengue virus, and you get to a homepage that has a number of links; for example, the links to CDC and WHO, which explain diseases associated with dengue, as well as public health records. And then HealthMap, which is a very cool little resource that aggregates data available from both clinical reporting and media reporting on disease and viruses and things like that. If you go to this HealthMap link you’ll see that we have a specialized map that’s been built that only has dengue reporting on it, and we have these for some of our other modules as well. And so rather than wait for that to load, we’ll move on. So the key functionality here is that we have a specialized database, and in the specialized database we have brought in all the viral sequence data for, in this case, dengue, and in other cases the other viruses in the resource. And we have a set of computational pipelines, which automatically annotate or validate annotation, depending on the virus we’re talking about, and map the proteins to standardized protein names. I don’t know how many times you guys have experienced this: you go and you search for something like DNA polymerase for adenovirus, and you know that every single adenovirus needs a DNA polymerase, but you find that your Entrez search only gives you 69 DNA polymerases, and you know there are 109 genomes, and that’s just not possible. And sometimes the reason is that there may be DNA polymerases in all those genome records but they are not annotated as DNA polymerases. They may be annotated with a different word or group of words, and that makes it very difficult to find data. So what we’ve done here is we’ve mapped all the protein data and all the CDS data to standardized names. We also are parsing the records. 
Right now we’re working with GenBank records, but we’re about to improve our scope to BioSample records as well, and we’re extracting information out of the record, and from that information we are mapping things like host, like region or country, where the isolation occurred, the disease that may be associated with that particular sample, and now there are metadata. Now, metadata varies from virus to virus, depending on what’s important metadata for that particular virus. So what does that allow me to do? Well, what that allows me to do is say I want to search for all E proteins that come from humans and were isolated in Africa, and I just hit “Add to query” and I get a certain number of sequences returned, 13 in this case. And so let’s say, well I want a bigger dataset than that, let’s go any region, and I hit “Add query” again, and now I see that there are actually 3,800 sequences in total, 3,826 human region sequences in the database. Well let’s see, I want to actually restrict my search a little bit more and I only want to get things that have been isolated since 2010. And now I say okay. Now I have 467 sequences. So that’s pretty interesting. So I’m going to deselect these other guys here, and with this single query I want to show results. So now I’m brought to this results page, and here I have a tabular view of everything I’ve retrieved. And I can, of course, sort this table based on these headers and I can see that I have a bunch of sequences from Argentina and Brazil, and now I can do things with these sequences. So one thing I can do with these sequences is I can select individual ones and I can download them, do a multiple alignment, or build a tree from them. So I’ll talk about this a little bit. 
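The two ideas in this part of the talk, mapping submitter-supplied protein names to standardized names and then querying on curated metadata such as host, region, and year, can be sketched together. Everything here is hypothetical: the synonym table, the record fields, and the sample records are invented for illustration and say nothing about how the Virus Variation pipelines or database are actually implemented.

```python
from dataclasses import dataclass

# Hypothetical synonym table: lower-cased submitter names -> standard name.
STANDARD_NAMES = {
    "e": "E",
    "envelope protein": "E",
    "envelope glycoprotein": "E",
}


@dataclass
class Record:
    accession: str
    protein: str  # name as written by the submitter
    host: str
    region: str
    year: int


def standardize(name):
    """Return the standard protein name, or the cleaned input if unknown."""
    cleaned = name.strip()
    return STANDARD_NAMES.get(cleaned.lower(), cleaned)


def query(records, protein, host=None, region=None, since=None):
    """Filter records by standardized protein name and optional metadata."""
    hits = []
    for rec in records:
        if standardize(rec.protein) != protein:
            continue
        if host is not None and rec.host != host:
            continue
        if region is not None and rec.region != region:
            continue
        if since is not None and rec.year < since:
            continue
        hits.append(rec)
    return hits


# Made-up records; accessions are placeholders.
records = [
    Record("ACC1", "envelope protein", "human", "Africa", 2008),
    Record("ACC2", "E", "human", "South America", 2011),
    Record("ACC3", "Envelope glycoprotein", "mosquito", "Asia", 2012),
    Record("ACC4", "NS1", "human", "Africa", 2011),
]

africa = query(records, "E", host="human", region="Africa")
recent = query(records, "E", host="human", since=2010)
print([r.accession for r in africa], [r.accession for r in recent])
```

Without the synonym table, a search for "E" would miss the records submitted as "envelope protein", which is the same problem as the missing DNA polymerases in the adenovirus example.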
Oftentimes people are downloading sequences, so with the download I’m able to download to a number of different formats, as you can see here, and for those of you who like to download FASTA sequences, sometimes you come to the conclusion that the headers associated with FASTA sequences are kind of useless, because, depending on who has submitted them, the headers are very different, and so we have this customized def line here that the user can use to set their own def line the way they want it. And so, in this case, I’m taking out the definition, and I’ve got here accession, GI, accession, type, country, and maybe we want to add host to that, and I simply just click it like that. And then when you download your table, your FASTA header will look just like this. Very helpful for a lot of people. I can also take this and I can build an alignment. We’re using an optimized version of MUSCLE here. As you can see, this is very fast. These are not pre-calculated. These are being done on the fly. And this allows you to look at an alignment for your entire dataset. And we have up here the differences, marked in red, between the consensus sequence in this case and the various individual isolates. You can go in here and you can change the anchor to a different sequence, and now the alignment is redrawn. And you can also do things like, if we go in here, look at the feature table associated with a particular virus rather than just looking at it for the reference feature table. And that’s sometimes useful for spotting things that have truncations in them. You can also change the scoring method. And I should point out that we have really been thinking a lot about doing more with these sort of visual displays and thinking about how we want to use them; for example, I could move this over to go to the regions. 
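The customized def line idea is easy to reproduce locally if you already have the metadata in hand. The sketch below is an assumption-laden illustration, not the format the download tool emits: the field names, the `|` separator, and the sample record are all made up.

```python
def make_defline(record, fields, sep="|"):
    """Build a FASTA header from the chosen metadata fields, in order."""
    return ">" + sep.join(str(record.get(field, "")) for field in fields)


def to_fasta(record, fields, width=70):
    """Format one record as FASTA with a custom header and wrapped sequence."""
    seq = record["sequence"]
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([make_defline(record, fields)] + lines)


# Hypothetical record; the accession and sequence are placeholders.
record = {
    "accession": "ACC0001",
    "country": "Brazil",
    "host": "Homo sapiens",
    "year": 2011,
    "sequence": "ATGCGT" * 20,
}

header = make_defline(record, ["accession", "country", "host"])
print(to_fasta(record, ["accession", "country", "host", "year"]))
```

Putting the fields in a fixed, user-chosen order is what makes the headers usable downstream, for example for coloring tree tips by country with a simple split on the separator.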
And we are really, really eager to enhance functionality from these sorts of displays. And I think that maybe a year or two we’re going to have a lot more utility built into these displays. At some point you have a set of a thousand sequences. An alignment done like this is actually not particularly useful. We’re sort of aware of that. And hopefully you’ll be seeing new things coming out of this, where we’re trying to make these really large datasets visually more informative so you can make a lot of decisions about what data you want and do some exploration of the data prior to downloading. You can download this alignment in a number of different formats, and with this downloaded alignment, obviously you can use them to display in your own viewer but also to build trees and do other sorts of data analysis. So these alignments can be very useful to users. We also have the built-in tree-building algorithm. This particular algorithm that I’m about to show you is our historic older one. And as you can see, it’s fairly slow compared to the alignments. We are actually about to release a new tree-building tool that will be released for, actually, all viruses except for flu and dengue, probably by the end of this week, actually. We will be making improvements to that. That alignment viewer is a little bit more bare bones. Unfortunately, this is exactly why we’re going to replace this. It doesn’t like very large sequence sets. So, anyway, we will have a new viewer coming out very soon, and this new tree viewer, while more bare bones, will handle much, much larger datasets and will do it much quicker. And we hope that over the next coming months, as people use it and we get feedback from folks, that we will greatly improve it. Let’s just try one small tree-building set and see if this will work. As I said, we’re rolling resources out right now, so there could be some server issues going on. 
So here you basically just select the particular clustering algorithm that you want to use. I’ll stick with neighbor joining. And then this is the distance matrix being used, and so basically you get a tree. And what’s kind of cool about this tree is there is a menu here that allows you to mark up points on the tree based on particular metadata. And, for example, I can enter human here. I think all these are human actually. Or maybe none of them are human. So, anyway, if you go here and just mark up, for example, year, just very quickly, you can see that everything from 2010 turns red and you can easily spot this on the tree. This menu is kind of weird. The way it was built, it reflects more historical step-wise additions to the functionality. With the new tree viewer, we’re going to make it a lot more intuitive and create something where people are not restricted just to the fields here to look up things associated with the particular sequences, i.e., you can search for particular GIs, not just accessions or other information that may be associated with the record. So, anyway, tune in later this week and you’ll see the new tree viewer up for, again, all the resources except for dengue. Moving on from there, it’s important to just point out that you can get a link to the query and you can share it with other people, so that if you’re working with a group of people, you can just send them this link that I just highlighted up top and they can go and see exactly what you’re looking at, and then obviously there’s information about how [indiscernible] and stuff like that is on the page as well. So moving on, and just to sum up virus variation: like I said, we have modules for these viruses right now. 
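The tree-building step earlier in this passage starts from a pairwise distance matrix, which a clustering method such as neighbor joining then turns into a tree. The talk does not say which distance measure the NCBI tool uses, so the sketch below substitutes the simplest one, the p-distance (the fraction of compared alignment columns that differ), purely as an illustration. The sequences are made up and are assumed to be already aligned to equal length.

```python
def p_distance(a, b):
    """Fraction of differing positions, ignoring columns with a gap in either."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x != y:
            differences += 1
    return differences / compared if compared else 0.0


def distance_matrix(seqs):
    """Return a nested dict of pairwise p-distances keyed by sequence name."""
    names = list(seqs)
    return {a: {b: p_distance(seqs[a], seqs[b]) for b in names} for a in names}


# Toy aligned sequences (placeholders, not real isolates).
aligned = {"s1": "ACGT", "s2": "ACGA", "s3": "AC-T"}
matrix = distance_matrix(aligned)
for name, row in matrix.items():
    print(name, row)
```

A matrix like this is the input that a neighbor-joining implementation would consume; building the tree itself is left out to keep the example short.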
We’ll be adding a module for rotavirus by the end of the week, and we’re improving the tree viewer, the alignment viewer, and a number of other things. So, over the coming year, we see this as a model for everything: this sort of mapping, improving the data as best we can, standardizing and mapping all the metadata associated with the sequence data to standardized terms. All this sort of stuff we see as the future, and we’ll be adding more virus modules. We will be improving the tools associated with the modules in the coming, literally, days and months. So, in addition to the virus variation resource, we maintain the retrovirus resource, and this is a little bit different. The focus of the retrovirus resource isn’t necessarily just the sequences themselves; probably the more key module within this resource is our HIV human interaction database. And this is a collaboration between us and Southern Research Institute. And the idea here is that they are mapping all the known associations between human and HIV proteins, as well as RNA interference experiments, and we’re trying to create an environment where you can retrieve this interaction data. So if I go in here and I’m interested in, say, a particular GO term, for example, DNA replication checkpoint, I could also filter that search based on whether or not something has a phenotype, whether there are biological pathways that have been mapped out associated with that, or if there’s gene expression data. Let’s just click that and hit “Search.” And basically what you get back is a group of, in this case, gene records for which there are known associations between HIV and humans. And so, again, these can be protein/protein interactions. They can also be knockout experiments done by RNA interference. And so if I go here, I can basically see that gene record, and in that gene record learn a lot more about that particular protein. 
And so you can see here the whole list of things under this HIV interaction. And so this is the protein, Vpr, and here are all the various interactions. Here is the PubMed citation for those interactions. And, in this case, we actually have a couple different protein interactions. So it’s a really cool way of interacting with data in the Gene database, which, for those of you who haven’t experienced it, is a very interesting database with a lot of curated data that goes beyond just giving you the sequence of a gene and tells you about functionality and other aspects of the gene. And so we’ll go over here. We can click, and now we’ll switch from looking at human genes. We’ll start looking at HIV genes. And so if I go to Tat, for example, it takes a while to load. Let’s go to another one. There we go. So here’s Gag-Pol, and, again, you have these HIV interactions, and you can see a very long list of proteins and genes that Gag-Pol interacts with. There are 1,078 interactions between Gag-Pol and human genes, so pretty extensive. This is manually curated information, and all the information includes links to PubMed. So a very useful resource for people in the HIV community. I should point out that this model could be replicated for other systems as well. And we are certainly looking for collaborators who might be interested in something like this. We’ve done it for HIV-1, but there certainly could be other interesting viruses where the same sort of data exists, and we’d be happy to talk to people about that. That’s kind of true for anything I’m talking about today. As we expand the virus variation modules, as we are changing our data structure with genomes, we are particularly interested in speaking with stakeholders, seeing how they use the database and the current resources, and having them help us improve any future resources. So now let’s go back here to the retrovirus resource. 
And so one other thing I’m going to point out very quickly is that we have something called the retrovirus genotyping tool. This is something that actually extends beyond retroviruses at this point. This is another tool that is, right now, under redevelopment. Basically this was created as a sliding window algorithm that was created quite a while ago, and I’m just going to give an example. And as you can see, it has a very ugly output. It’s actually fairly powerful because what you’re seeing here is basically I have a query sequence, and then I’m comparing it to, in this case, three different groups of references, one green, one red, one blue, and you’re seeing a sliding window that, as I’m passing, I believe here it’s a 200 nucleotide window across each sequence, and where this goes up, that means you have high amount of similarity, and when it goes down, low amount of similarity. So you can see for this HIV query sequence, this particular sequence has high similarity to the red references in this region, high similarity to the blue references in this region, back to the red. Actually sort of across the board the same. And then high similarity to the green references in this region. And so that allows you to, among other things, spot recombination. Now we want to expand greatly the utility of this, and there’s a number of different fields, including human adenovirus, where these sorts of tools are very useful in spotting not just very wholesale recombination events like this but much more discrete ones. Right now we have reference sets for polio and hepatitis B and C, as well as retroviruses, but if you play around with this tool you’ll realize that you can select your reference sets. So if I go here, for example, to hepatitis B, I have quite a few references, and, you know, there’s a number of different subtypes. 
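The sliding-window comparison described above can be sketched as follows. This is a simplified stand-in, not the genotyping tool's algorithm: it assumes the query and every reference are already aligned to the same length and scores each window by plain percent identity, whereas the talk does not say how the tool computes similarity. The group names and sequences are toy data.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length strings match."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a) if a else 0.0


def sliding_similarity(query, references, window=200, step=50):
    """Score the query against each reference group in successive windows.

    `references` maps a group name to a list of sequences aligned to the
    query. Each window records the best identity found within each group.
    """
    profile = []
    for start in range(0, len(query) - window + 1, step):
        q = query[start:start + window]
        scores = {
            group: max(identity(q, ref[start:start + window]) for ref in refs)
            for group, refs in references.items()
        }
        profile.append((start, scores))
    return profile


def best_group_per_window(profile):
    """Name the highest-scoring reference group for each window."""
    return [(start, max(scores, key=scores.get)) for start, scores in profile]


# Toy recombinant: first half resembles "red", second half resembles "blue".
query = "A" * 8 + "C" * 8
references = {"red": ["A" * 16], "blue": ["C" * 16]}

profile = sliding_similarity(query, references, window=8, step=8)
calls = best_group_per_window(profile)
print(calls)
```

A change in the best-scoring group from one window to the next is the signal the plot makes visible, which is how a recombination breakpoint shows up as one colored line dropping while another rises.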
I can now go in here and select those references, so hepatitis B, and then give it a query sequence, presumably hepatitis B-1, and see the output for hepatitis B. And you can do the same thing with multiple sets of HIV references, as well as polio virus and hepatitis C references. So it’s got a lot of functionality. It’s kind of ugly. We want to improve it. We’re working on it. And if anyone out there is using it, please let us know how you’re using it, and we’ll definitely be interested in talking to you over sort of a longer term as we redevelop the code and start developing the interfaces and displays. Okay, I wanted to finish up by saying that a lot of the viral sequence data is projected through Entrez, if I can find the tab. So, here, all I’ve done is I’ve gone to Entrez Genome, which you can find at the NCBI homepage, and I’ve simply searched by taxonomy group, in this case filoviridae, and now I see all the genomes we have for filoviridae. And it’s important to note that when you go into any of the Entrez resources, you can find related data by selecting, first, the database, and then, here, components or other genomes by species. And this a little bit cryptic but what components mean are essentially the RefSeq records I showed you, our references for each species, and other genomes for species are all those neighbor sequences I showed you. So I can just click on this, I can go “Find Items,” and you can basically get — unfortunately, the link’s not working. Very sorry. Usually the link works. When it does work, you should be able to get — let’s try it one more time — all the other genomes. And, okay, so that’s not going to work today. And when you’re going to an individual nucleotide — or excuse me, genome record, you’ll see this sort of display. And you’ll also see the related information links, including the other genomes by species link. And this time it actually works, so, again, you get all the other validated genomes here. 
And here’s another view, this time from Entrez Nucleotide. And, again, you’ll see filters here: I can say I only want RefSeqs, as you can see here, or I only want GenBank records, as you can see here. I can also filter based on particular taxonomy nodes. I can select an individual record, and now I’m taken to that GenBank record. You can analyze the sequence by BLAST and everything like that. But, again, you can go and find related information, the proteins for that particular sequence, and things like that. So you’ll see our data kind of throughout the NCBI Entrez resources, although you may not actually recognize it as our data. So you should always be only a click away. So, for example, when I go to filoviridae and I want to go and see genome information, I just go to the database genome and select other options genome. And, again, this link is not working. I’m very sorry about that. So, anyway, when they work they’re great. [Ben Busby speaking] Rodney, we should wrap up in the next minute to allow a second. [Rodney Brister speaking] I’ll wrap up. So, anyway, that’s sort of an overview of our resources. If you have any questions, obviously the people to direct those questions to are there in front of you right now. I’m very easy to find as well on the Net. If you talk to Wayne or send Wayne an e-mail, he can forward it to me as well. And, like I said, we’re very interested in collaborations with users and trying to figure out how people are using things and trying to make all this information more useful to a variety of different types of users, from power users who are trying to identify, you know, different viruses from SRA reads or short reads, to people who are just exploring the genomes of particular viruses, and everything in between. So that’s about it. [Ben Busby speaking] Well, thank you. Are there any questions? That’s a great question. “Is there a bacteriophage database?” is the question from the audience. How do you mean “a bacteriophage database”? 
Rodney, were you able to hear the question? Unfortunately, I could not hear a word. Let me repeat it real quick. So she said that they’re using phages for bacterial genome markers; right — for bacterial genotyping, and so what they’re wondering is, in the bacterial records are there ever links to phage genomes? [Rodney Brister speaking] Okay, so first of all, bacteriophage, we have a number of different projects that are in the conception phase. I would very much suggest that if you’re interested in bacteriophage in the context of relationships between free living phage and integrated prophage, e-mail me. So the reason I say that is, as we change our data model, we’re going to be moving to a data model based on what I’ll loosely call alignments, and instances. So if I have a reference genome for a particular bacteriophage that is a free-living bacteriophage, that particular reference may have homologies in a number of places, including bacterial genomes, and we’re trying to build a model that supports that, so that if I went and I said, “Ooh, this is my favorite bacteriophage, where do I find it,” you would not just find free-living phage neighbors, but you would also find prophage neighbors. And so to go to a different part of your question, if I come here and I filter by host, I select “bacteria,” you will get all the documented bacteriophage genomes, both references and neighbors. I will tell you that the bacteriophage taxonomy landscape is very rapidly changing. I know that because we’re the ones changing it. We’ve been working with ICTV. They are going to be fundamentally updating taxonomy across all bacteriophages over the course of the next year. You should already be seeing changes in our database. But here is a list of all bacteriophage records and all the neighbors. So if you’re looking to identify bacteriophage in host genomes, in bacterial genomes, and you need sequences to begin to identify that, this is where you would go. 
If you want to just download this to fill your own database, go over here to download, and you can download a list of accessions or you can download this neighbors data here, and hopefully this happens quickly. And that will give you the accessions for both the references and the neighbors, as well as some taxonomy information about the sequences and, oh, and host type. But in this case you’ve already selected the bacteria, so you already know the host type. Also, I should point out that if you are on this page right here and you download the accession list of all viral genomes, the table that you get also includes host type. So you can download this whole table and, you know, load it into your own database or just work with it as an Excel file, and sort by host type. And so the host types include, if I go back here, bacteria, archaea, as well as some environmental samples that might include a bacteriophage as well. Does that kind of answer your question? [Ben Busby speaking] Also, Wayne is probably going to mention this type of stuff when he talks about BLAST resources. Are there any other questions for Rodney? If not, we’re running a little bit behind, and we should probably move on. Well, thank you, Rodney, and, as always, it is appreciated. And we will get questions from the audience e-mailed to you. [Rodney Brister speaking] Okay, thank you. [Ben Busby speaking] All right, great. Thanks. Have a great day.
