Arthroverse Downloads

All datasets are available for free download. Click a file type to begin.

Summary

This TSV file contains a unique identifier for each family (e.g., IF00002, IF00067), the classification type of the family, the number of members in the family, and the number of datasets and scaffolds associated with it.
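As a minimal sketch of how a tab-separated file like this could be loaded with the Python standard library, the example below reads rows into dictionaries keyed by the header line. The column names (family_id, type, members) and the two sample rows are made up for illustration; check the header of the downloaded file for the real names.

```python
import csv
import io


def read_tsv(handle):
    """Return the rows of a tab-separated stream as dicts keyed by the header line."""
    return list(csv.DictReader(handle, delimiter="\t"))


# Made-up sample standing in for the downloaded file.
# The real column names and values may differ.
sample = io.StringIO(
    "family_id\ttype\tmembers\n"
    "IF00002\texample\t12\n"
    "IF00067\texample\t7\n"
)

rows = read_tsv(sample)

# Values are read as strings, so convert before comparing numerically.
largest = max(rows, key=lambda row: int(row["members"]))
print(largest["family_id"])
```

To read the real download, pass an open file instead of the sample, for example open(path, newline="") where path points at the downloaded TSV.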

Metadata

This TSV file contains metadata for genomic families. Each row represents a family and includes details such as the number and percentage of metagenomes, metatranscriptomes, and isolates, as well as taxonomy group distributions across Bacteria, Archaea, Eukaryota, Viruses and Unclassified groups.

Domains

This TSV file contains Pfam domain annotations for each family, including the Pfam hit, HMM alignment start/end positions, genomic start/end positions, and an accuracy score for each alignment.

Sequences

This TSV file contains representative sequence data for each family, including the representative sequence length, average family sequence length, the sequence header, and the sequence itself.

Families

This archive contains FASTA files for all protein families. Each file corresponds to a specific family and includes its protein sequences for use in alignment, annotation, and phylogenetic analyses.
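Once the archive is unpacked, each family file can be read with a few lines of Python. The parser below is a minimal sketch that assumes standard FASTA layout (a header line starting with ">" followed by one or more sequence lines); the sample records are invented and are not taken from the archive.

```python
import io


def parse_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record.
    if header is not None:
        yield header, "".join(chunks)


# Invented two-record sample standing in for one family file.
sample = io.StringIO(">seq1 demo\nMKT\nAYI\n>seq2\nGG\n")
records = list(parse_fasta(sample))
for name, seq in records:
    print(name, len(seq))
```

Sequences that wrap over several lines are joined into one string, so the reported length is the full sequence length.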

Families Aligned

This archive contains FASTA files of aligned sequences for all protein families, where each file includes the multiple sequence alignment of the representative protein sequences within a family.
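A quick sanity check on an aligned file is to confirm that every sequence has the same length and to see how gappy each column is. The sketch below assumes gaps are written as "-" or ".", which is a common convention but is not confirmed for these files; the function name and the input strings are illustrative only.

```python
def gap_fraction_per_column(aligned):
    """Return the fraction of gap characters in each alignment column.

    `aligned` is a list of equal-length strings, one per aligned sequence.
    Gap characters are assumed to be "-" or ".".
    """
    if not aligned:
        return []
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")
    return [
        sum(seq[col] in "-." for seq in aligned) / len(aligned)
        for col in range(width)
    ]


# Invented toy alignment of two sequences and three columns.
fractions = gap_fraction_per_column(["AC-", "A-G"])
print(fractions)
```

Columns with a high gap fraction are often trimmed before building a phylogeny, although the right threshold depends on the analysis.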

HMM

This archive contains HMM profile files for all protein families. These are probabilistic models built from aligned sequences, for use in sequence similarity searches and annotation.

Structural Models

This archive contains predicted protein structure files in CIF format for all protein families, each corresponding to the representative 3D structure of a family's protein sequence.