qualibact v1 methods
General Strengths
-
The pipeline is fully automated, generic, and can be applied to any set of genomes — including arbitrary subsets such as species, clonal complexes, or lineages.
-
Quality assessment is based on multiple standard metrics (e.g. N50, number of contigs, genome size, GC%), allowing reproducible filtering.
-
Species-specific thresholds can be derived from available reference genomes, and thresholds can be updated as more genomes are added.
-
Variation between species — even within a genus — supports the need for species-level cutoffs, which this approach accommodates.
-
Variation between SRA and Refseq: We have observed that Genome size and assembly length distributions differ significantly between RefSeq and SRA (i.e. ATB). The cause is unclear, but relying on RefSeq-derived thresholds alone may result in unfairly excluding valid genomes. This approach combines both datasets to ensure a more inclusive and representative set of thresholds.
Caveats
- Species Definitions Depend on GTDB: This tools uses Sylph for species designation, so all GTDB-related quirks apply. E.g., Shigella spp. is included in E. coli, and there are issues separating Burkholderia mallei from Burkholderia pseudomallei and Bordetella pertussis/Bordetella parapertussis from Bordetella bronchiseptica.
- No Ground Truth Claims: This evaluation reflects what has been previously observed in available datasets. It does not attempt to define a universal "ground truth" for any species.
- Assembly-Method Specific: The metrics (e.g. N50, number of contigs) are meaningful primarily for assemblies generated with Shovill (or similar SPAdes-based pipelines). Exact thresholds will vary for long-read or alternative assemblers like SKESA. However, not using Shovill implies rejection of the Torstyverse, which is heresy.
- Long-Read Assemblies Not Explicitly Handled: These cutoffs are not designed for long-read assemblies. That said, genome size and GC content thresholds should still apply, and it's reasonable to expect long-read assemblies to exceed the quality of short-read derived thresholds — not fall below them.
- Generic vs. Specific Tradeoff: While the generic approach is broadly applicable, it may miss species-specific quality nuances or lineage-level exceptions.
Citation
If you use QualiBact, please cite the following:
Alikhan, NF. Species specific quality control of bacterial de novo genome assemblies using QualiBact. Available at: https://github.com/happykhan/qualibact (Accessed: [insert date]).
Datasets
Criteria cutoffs were determined using a combination of standard statistical methods and machine learning techniques to identify outliers and establish robust QC thresholds. Two datasets were used for this purpose:
-
Allthebacteria dataset: The 2024-08 release comprises 2.4 million uniformly reprocessed genome assemblies, including taxonomic estimates aligned to the GTDB phylogeny. Taxonomic classification was based on sylph (v0.5.1) results with the GTDB r214 database. For more information on how sylph was run, see the Allthebacteria documentation. Only species with more than 1,000 genome records (as defined by sylph) were included. For more information about sylph visit sylph documentation.
-
NCBI RefSeq complete genomes: Complete genomes from RefSeq were used to compare metrics such as genome size, number of coding sequences, and GC content. Metadata was downloaded using the NCBI Datasets command-line tool, filtered for completeness, cached as JSON, and parsed for analysis.
Quality control metrics
Pre-calculated "assembly stats" (N50, longest contig, total assembly length, number of contigs, and longest contig size) and CheckM results (contamination and completeness) were used. The GC content was recalculated to obtain data with at least two decimals (higher precision).
Outlier Detection and Filtering
To identify and remove potential outliers prior to downstream analyses, an unsupervised machine learning approach using the Isolation Forest algorithm was implemented. This method is well-suited for high-dimensional biological data and does not require labeled training examples.
The outlier filter was applied to features representing key genome quality and assembly metrics: total_length, GC_Content, N50, number (contigs), longest (contig length), Completeness_Specific, and Contamination. The scikit-learn (v1.6.1) IsolationForest implementation was used with a fixed random seed (random_state=42) for reproducibility. Each sample received an anomaly score, and was classified as either inlier (anomaly = 1) or outlier (anomaly = -1). Outlier-filtered datasets were retained in parallel with unfiltered data for comparative analyses. Source code for this process is available at QualiBact.
Statistical Analysis and Threshold Selection
Summary statistics (mean, standard deviation, quartiles, interquartile range) were calculated for genome quality metrics. Normality was assessed using the D’Agostino–Pearson test. For metrics without RefSeq values (e.g., N50, number of contigs), lower and upper bounds were set by applying Isolation Forest filtering, then selecting the 0.5th and 99.5th percentiles of the filtered data. For genome size and GC content, distributions were compared to RefSeq using the Kolmogorov–Smirnov test and Wasserstein distance. Final bounds were set as the most extreme values between RefSeq-derived limits and the 0.5th–99.5th percentile range of SRA genomes after filtering. Histograms were generated to visualize comparative distributions.