Sequence-based Clustering

● Introduction

○ Sequences and Sequence Clusters

○ What are Sequence Clusters?

○ Why use Sequence Clusters?

● Documentation

○ How are protein sequences in the PDB clustered?

○ How to use sequence clusters to explore the 3D structures in RCSB.org?

● References

Introduction

The amino acid sequence of a protein directs its folding into shapes that enable specific functions. For most proteins in cells, folding is rapid and, in most cases, reproducible (Anfinsen, 1973), suggesting that a protein sequence carries the information needed to fold into a functional protein and that each sequence forms a characteristic structure. While local regions of a protein may adopt slightly different conformations in different biological contexts, the overall structure remains the same. In a few exceptional cases, a protein may adopt a completely different shape in a specific environment or in the presence of specific binding partner(s). Research on predicting protein structure from sequence has been ongoing for more than 50 years. Recently, the ability to compute the 3D shapes of proteins from their amino acid sequences has progressed tremendously through the application of machine learning techniques to the experimental structural data archived in the PDB (Baek et al., 2021; Jumper et al., 2021).

When exploring PDB structures, the level of similarity between the amino acid sequences of two or more proteins can be used to infer their structural and functional similarity (Sander and Schneider, 1991). Sequences that are 100% identical to each other belong to the same protein, and high levels of sequence identity (e.g., >90%) are also indicative of the same protein, perhaps with a few mutations or with variations due to different sources of the protein. Lower levels of sequence similarity may still indicate some relationship between structures and functions. The threshold of sequence similarity that indicates structural homology depends on the length of the alignment. As a rule of thumb, for protein sequences longer than 100 amino acids, >25% sequence identity indicates similar structure and function (Sander and Schneider, 1991).

Sequences and Sequence Clusters

As the single worldwide repository for macromolecular structures, the Protein Data Bank holds many structures with the same or similar sequences. This redundancy enables a deep understanding of the biology of these proteins. However, some bioinformatics analyses benefit from grouping these redundant sequences and structures. For example, structures of the same protein that share exactly the same sequence may be grouped together. Two protein sequences in which 90% of the residues are identical are said to have 90% sequence identity, while sequences that are only 30% identical have 30% sequence identity. Grouping proteins into clusters by sequence identity is a way to reduce or remove redundancy in 3D structures (including experimental structures and Computed Structure Models, or CSMs). The sequences in a particular cluster are expected to share structural and functional properties to a degree that depends on the level of sequence identity.
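To make the idea of percent sequence identity concrete, here is a minimal Python sketch. It assumes the two sequences have already been aligned to equal length with "-" marking gaps, and it divides matches by the number of gap-free columns; other definitions use a different denominator. The function name and example sequences are illustrative only, and this simple count is not the "approx-id" score used for the pre-computed clusters.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of gap-free alignment columns in which both residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Made-up 10-residue sequences that differ at one position.
print(percent_identity("MKTAYIAKQR", "MKTAYIAKQK"))  # 9 of 10 columns match: 90.0
```

Under this definition, the two example sequences would fall into the same cluster at the 90% level and below, but not at the 95% or 100% levels.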

What are Sequence Clusters?

The amino acid sequences of all proteins whose 3D structures are available from RCSB.org (including experimental structures and CSMs) are grouped at different levels of sequence identity (e.g., 100%, 95%, 90%, 70%, 50%, and 30%) to yield sequence clusters. These pre-computed sequence groups are available for exploring the PDB archive and for grouping search results.

Why use Sequence Clusters?

Instead of using all sequences of the 3D structures available from RCSB.org for analysis, representative sequences from each of the sequence clusters can be used. Depending on the level of sequence similarity, properties and features of the representative proteins can be extended to other members in the cluster. Using sequence clusters has the following advantages:

It reduces the size of the sequence data set for all available 3D structures, which can simplify analyses and make them more efficient.
It enables tracking the variety of structures deposited to the PDB by monitoring growth in the number of non-redundant sequence clusters.
It can be used to organize sequences from both experimental structures and CSMs to explore evolutionary relationships between specific proteins.
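The first advantage can be illustrated with a short Python sketch that reduces a set of clustered sequences to one representative per cluster. The input layout (one cluster per line, whitespace-separated identifiers) and the choice of the first member as representative are assumptions made for this example, and the identifiers are made up.

```python
from typing import List


def parse_clusters(text: str) -> List[List[str]]:
    """Read one cluster per non-empty line; members are whitespace-separated."""
    return [line.split() for line in text.splitlines() if line.strip()]


def representatives(clusters: List[List[str]]) -> List[str]:
    """Assumption for this sketch: the first member stands in for its cluster."""
    return [members[0] for members in clusters]


# Hypothetical identifiers in an ENTRY_ENTITY style, for illustration only.
sample = """\
1ABC_1 2DEF_1 3GHI_2
4JKL_1
5MNO_1 6PQR_1
"""

clusters = parse_clusters(sample)
reps = representatives(clusters)
total = sum(len(members) for members in clusters)
print(f"{len(reps)} representatives for {total} sequences")
```

Any downstream analysis can then run on `reps` alone, with results extended to the other cluster members as far as the chosen identity level justifies.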

Documentation

How are protein sequences in the PDB clustered?

Sequence clusters are calculated with DIAMOND software (Buchfink, 2023).

DNA and RNA sequences are filtered out of the results, as are small peptides of fewer than 10 amino acids. Clustering involves an all-against-all comparison of protein sequences in the PDB. As described in the DIAMOND article and documentation, sequence identity is defined as an “approx-id” score, which takes into account the local alignment and the substitution matrix. Sequence coverage is defined as “member-coverage”: it is measured on the alignment of a member with its representative sequence, and the threshold applies only to the member sequence.
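The length filter and the member-coverage definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the production pipeline: the function names are hypothetical, and coverage is computed simply as the aligned fraction of the member sequence.

```python
MIN_PEPTIDE_LENGTH = 10  # peptides with fewer residues are excluded, per the text


def is_long_enough(sequence: str) -> bool:
    """True if the sequence has at least MIN_PEPTIDE_LENGTH residues."""
    return len(sequence) >= MIN_PEPTIDE_LENGTH


def member_coverage(aligned_member_residues: int, member_length: int) -> float:
    """Percent of the member sequence covered by its alignment to the representative.

    Only the member's length appears in the denominator, so a short member can
    be fully covered by a much longer representative.
    """
    if member_length <= 0:
        raise ValueError("member_length must be positive")
    return 100.0 * aligned_member_residues / member_length


print(is_long_enough("ACDEFGHIK"))   # 9 residues: False
print(member_coverage(80, 100))      # 80.0
```

Because the threshold applies only to the member, a fragment of a protein can join the cluster of its full-length representative even though the fragment covers only a small part of that representative.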

DIAMOND is run with the following parameters:

The “cluster” option is specified for highest sensitivity.
Alignments are calculated with identity thresholds of 100%, 95%, 90%, 70%, 50%, and 30%.
The member-coverage threshold is set to 80%.
The default BLOSUM62 substitution matrix is used for local alignments.
For more details on the procedure, please refer to the documentation.
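As a rough illustration of the parameters listed above, the following Python sketch assembles a DIAMOND command line for each identity level and groups a two-column result table into clusters. The flag spellings (`cluster`, `-d`, `-o`, `--approx-id`, `--member-cover`) and the representative/member output layout follow one reading of the DIAMOND documentation and should be checked against the installed version; the file names are placeholders, and this is not presented as the exact invocation used by RCSB PDB. The commands are only built and printed, not executed.

```python
from typing import Dict, List

IDENTITY_LEVELS = [100, 95, 90, 70, 50, 30]  # identity thresholds from the text
MEMBER_COVERAGE = 80                          # member-coverage threshold from the text


def diamond_cluster_command(fasta_path: str, out_path: str, identity: int) -> List[str]:
    """Build (but do not run) an assumed DIAMOND clustering command."""
    return [
        "diamond", "cluster",
        "-d", fasta_path,                       # input protein sequences (placeholder)
        "-o", out_path,                         # output table (placeholder)
        "--approx-id", str(identity),
        "--member-cover", str(MEMBER_COVERAGE),
    ]


def group_members(tsv_text: str) -> Dict[str, List[str]]:
    """Group an assumed two-column table (representative, member) into clusters."""
    clusters: Dict[str, List[str]] = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        representative, member = line.split("\t")[:2]
        clusters.setdefault(representative, []).append(member)
    return clusters


for level in IDENTITY_LEVELS:
    print(" ".join(diamond_cluster_command("proteins.fasta", f"clusters_{level}.tsv", level)))
```

Building the command as a list keeps it ready to pass to `subprocess.run` without shell quoting issues, should one choose to execute it.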

Note: The sequence clusters are subject to change over time as new protein sequences continue to be added to the archive.

How to use sequence clusters to explore the 3D structures in RCSB.org?

Each week, RCSB PDB computes sequence clusters for all protein sequences available from RCSB.org [including experimental (PDB) structures and available CSMs]. You can use these pre-computed clusters in the following ways:

Examine and explore growth of sequences in experimental (PDB) structures:

Learn about the number of unique protein sequences added to the PDB annually: review the non-redundant protein sequence statistics, then download these sequences or explore groups of structures with these sequences for further analysis.
Learn about the cumulative growth of unique protein sequences in the PDB, then download these sequences or explore groups of structures with these sequences for further analysis.

Organize search results (including experimental structures and CSMs) based on sequence similarity:

Group search results by sequence identity into meaningful, manageable groups, and either view representative members of each group or explore all members of these groups (using the Group Summary Pages).
Explore the sequences, structural features, and ligand interactions of other sequences in the same protein sequence cluster to gather structural and functional insights and develop hypotheses.
Explore the pre-computed clusters that include a specific polymer entity in a PDB entry.
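For programmatic access, the clusters that contain a given polymer entity can be looked up in a few lines of Python. The sketch below assumes the RCSB Data API exposes polymer entities at the URL pattern shown and reports membership in a field named `rcsb_cluster_membership` with `cluster_id` and `identity` keys; treat these names as assumptions to verify against the current API schema. The sample record and its cluster IDs are made up so that the example runs without network access.

```python
import json
from typing import Dict
from urllib.request import urlopen

# Assumed base URL of the RCSB Data API polymer-entity endpoint.
DATA_API = "https://data.rcsb.org/rest/v1/core/polymer_entity"


def entity_url(entry_id: str, entity_id: int) -> str:
    """Build the assumed Data API URL for one polymer entity."""
    return f"{DATA_API}/{entry_id.upper()}/{entity_id}"


def fetch_entity(entry_id: str, entity_id: int) -> dict:
    """Download the entity record as JSON (requires network access)."""
    with urlopen(entity_url(entry_id, entity_id), timeout=30) as response:
        return json.load(response)


def cluster_ids_by_identity(entity_record: dict) -> Dict[int, int]:
    """Map identity level (%) to cluster ID, using the assumed field names."""
    membership = entity_record.get("rcsb_cluster_membership") or []
    return {int(item["identity"]): item["cluster_id"] for item in membership}


# Made-up record in the assumed shape; real cluster IDs change over time.
sample_record = {
    "rcsb_cluster_membership": [
        {"cluster_id": 12, "identity": 100},
        {"cluster_id": 7, "identity": 30},
    ]
}
print(entity_url("4hhb", 1))
print(cluster_ids_by_identity(sample_record))
```

With network access, `cluster_ids_by_identity(fetch_entity("4HHB", 1))` would be the corresponding live call. Since clusters are recomputed as new sequences are added, cluster IDs should not be treated as stable identifiers.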

References

Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181, 223–230; doi: 10.1126/science.181.4096.223
Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G. R., Wang, J., Cong, Q., Kinch, L. N., Schaeffer, R. D., Millán, C., Park, H., Adams, C., Glassman, C. R., DeGiovanni, A., Pereira, J. H., Rodrigues, A. V., van Dijk, A. A., Ebrecht, A. C., Opperman, D. J., Sagmeister, T., Buhlheller, C., Pavkov-Keller, T., Rathinaswamy, M. K., Dalwadi, U., Yip, C. K., Burke, J. E., Garcia, K. C., Grishin, N. V., Adams, P. D., Read, R. J., Baker, D. (2021). Accurate prediction of protein structures and interactions using a three-track neural network. Science, 373, 871–876; doi: 10.1126/science.abj8754
Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., Bridgland, A., Meyer, C., Kohl, S., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., Adler, J., Back, T., Petersen, S., Reiman, D., Clancy, E., Zielinski, M., Steinegger, M., Pacholska, M., Berghammer, T., Bodenstein, S., Silver, D., Vinyals, O., Senior, A. W., Kavukcuoglu, K., Kohli, P., Hassabis, D. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596, 583–589; doi: 10.1038/s41586-021-03819-2
Sander, C., Schneider, R. (1991). Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, 9, 56–68; doi: 10.1002/prot.340090107
Steinegger, M., Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35, 1026–1028; doi: 10.1038/nbt.3988


Last updated: 9/3/2025