Why a protein model search tool?

Example 1 - Example 2 - Methods

Surprisingly, some very useful modern resources (databases and datasets) of protein models do not allow sequence searches. This is especially true of resources that contain models built with techniques other than comparative modeling. Hence, if you want to see whether anybody has ever modeled a given protein or a similar sequence through an ab initio technique, or with tools that integrate different kinds of sparse and low-resolution data, you have to browse the datasets and the literature manually. The chances of finding something useful that way are very low.

The community needs an agreement among all groups running modeling services and, more importantly, those running databases of literature-associated models, to create a single repository with features that facilitate discovery, especially through sequence queries.

In the meantime I have created this palliative solution, by no means definitive, which gathers >5000 sequences for which models are available in resources based primarily on methods other than homology modeling: mostly coevolution-based predictions (Gremlin/Rosetta, DMPfold, PConsC3, Tara3D), integrative models (PDB-Dev and SASBDB), CASP models, and models taken from the literature. My web app lets you search these models at the sequence level.

How could this help your research?
Essentially, by letting you find possible templates for modeling the topology of a protein when there are no clear templates in the PDB (especially when querying the Gremlin, DMPfold, PConsC3, Tara3D and CASP datasets).
It can also help you find your query protein as part of larger complexes (especially when querying PDB-Dev and SASBDB).



Example 1: Search for ubiquitin in the PDB-Dev and in SASBDB

If you search for the exact sequence of human ubiquitin, the result points to model PDBDEV_00000004 of the PDB-Dev, including a link to the download page. Likewise, searching in SASBDB returns links to SAXS-based models of covalently linked ubiquitin molecules.

example



Example 2: Search for a possible template to model a lipoprotein from Flavobacteriaceae

Searching the PDB or PMP with this protein sequence does not retrieve any similar sequences. But searching in the web app reveals a coevolution-based model with 77% similarity from the Gremlin 2017 dataset, which actually corresponds to the same gene in another bacterial species:

example



Methods: Compilation of entries

Models from the PDB-Dev and from Ovchinnikov et al. eLife 2015 were downloaded manually from https://gremlin2.bakerlab.org/struct.php; models from Ovchinnikov et al. Science 2017 were extracted from the file at https://gremlin2.bakerlab.org/meta/aah4043_final.zip. Models from SASBDB were downloaded from https://www.sasbdb.org/media/pdb_file/(identifier)_fit1_model1.pdb where (identifier) is each entry from a list maintained at http://ftp.pdbj.org/emnavi/data/ids/sasbdb.txt. CASP models were downloaded from CASP's Prediction Center website at http://predictioncenter.org/. PConsC3 models were extracted from the file at http://pconsc3.bioinfo.se/static/download/pfam.tar.gz. DMPfold models were extracted from the file at http://bioinf.cs.ucl.ac.uk/downloads/dmpfold/pfam_models.tgz. Tara models were obtained from https://zhanglab.ccmb.med.umich.edu/Tara-3D/. All these models were obtained between March and November of 2019.
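
As an illustration of the SASBDB step only, here is a minimal Python sketch that turns a list of identifiers into download URLs following the pattern given above. It is not the script actually used: the function name sasbdb_model_urls is hypothetical, the identifier list is assumed to be plain text with whitespace-separated entries, the file-name suffix is the one read from the pattern above, and the identifiers in the usage lines are made-up placeholders.

```python
from typing import List

# URL pieces taken from the pattern described in the text; the suffix is
# kept as a parameter because it is an assumed reading of that pattern.
SASBDB_BASE = "https://www.sasbdb.org/media/pdb_file/"
SASBDB_SUFFIX = "_fit1_model1.pdb"


def sasbdb_model_urls(id_list_text: str,
                      base: str = SASBDB_BASE,
                      suffix: str = SASBDB_SUFFIX) -> List[str]:
    """Build one model URL per identifier found in the list text.

    Assumption: identifiers are separated by whitespace (for example,
    one per line), and blank lines carry no information.
    """
    identifiers = id_list_text.split()
    return [f"{base}{identifier}{suffix}" for identifier in identifiers]


# Placeholder identifiers, for illustration only.
example_urls = sasbdb_model_urls("ID0001\nID0002\n\n")
for url in example_urls:
    print(url)
```

The resulting URLs could then be fetched with any download tool. Nothing here checks that a given entry actually has a model file.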

Protein sequences were extracted for all chains, retaining their chain IDs and sources in the FASTA headers. Searches of user-provided query sequences proceed through alignments using a BLOSUM62 matrix, with three parameters that the user can adjust to filter results (fraction of internal gaps, similarity and coverage). Preset values are provided for searching nearly exact sequences in larger assemblies, and for searching at lower similarity for homology modeling.
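
To make the three filters concrete, here is a minimal Python sketch of how they could be computed from an alignment that has already been produced. It is not the web app's code, and the definitions are assumptions: coverage is taken as the fraction of query residues paired with a hit residue, similarity as the fraction of paired positions judged similar, and internal gaps as the fraction of gapped columns between the first and last paired positions. The names alignment_metrics and passes_filters are hypothetical. By default the is_similar test is plain identity; a BLOSUM62-based test (for example, a positive substitution score) would be passed in its place.

```python
from typing import Callable, Dict

GAP = "-"


def alignment_metrics(
    aligned_query: str,
    aligned_hit: str,
    is_similar: Callable[[str, str], bool] = lambda a, b: a == b,
) -> Dict[str, float]:
    """Compute coverage, similarity and internal-gap fraction.

    Both inputs are gapped strings of equal length, one column per
    alignment position. The metric definitions are assumptions made
    for this sketch (see the paragraph above).
    """
    if len(aligned_query) != len(aligned_hit):
        raise ValueError("aligned sequences must have equal length")

    columns = list(zip(aligned_query, aligned_hit))
    # Indices of columns where both sequences carry a residue.
    paired = [i for i, (q, h) in enumerate(columns) if q != GAP and h != GAP]
    query_len = sum(1 for q, _ in columns if q != GAP)
    if not paired or query_len == 0:
        return {"coverage": 0.0, "similarity": 0.0, "internal_gaps": 0.0}

    # Only gaps between the first and last paired columns are "internal".
    span = columns[paired[0]:paired[-1] + 1]
    gap_columns = sum(1 for q, h in span if q == GAP or h == GAP)
    similar = sum(1 for i in paired if is_similar(*columns[i]))

    return {
        "coverage": len(paired) / query_len,
        "similarity": similar / len(paired),
        "internal_gaps": gap_columns / len(span),
    }


def passes_filters(metrics: Dict[str, float], min_coverage: float,
                   min_similarity: float, max_internal_gaps: float) -> bool:
    """Keep a hit only if it satisfies all three user thresholds."""
    return (metrics["coverage"] >= min_coverage
            and metrics["similarity"] >= min_similarity
            and metrics["internal_gaps"] <= max_internal_gaps)


# Toy alignment: one gap in the hit and one mismatch.
metrics = alignment_metrics("MQIFVKTLTG", "MQIF-KTLSG")
print(metrics)
print(passes_filters(metrics, min_coverage=0.8,
                     min_similarity=0.8, max_internal_gaps=0.2))
```

Under these definitions, a "nearly exact sequence in a larger assembly" preset would correspond to high similarity and coverage thresholds with few internal gaps allowed, while a homology-modeling preset would relax the similarity threshold. The threshold values in the usage lines are arbitrary and are not the app's presets.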