Contributing to QualiBact
We welcome contributions to QualiBact! Contributors who make major contributions to source code, manuscript development, metric validation and adoption, or additional data for calibrating quality thresholds will be granted authorship on publications. Pull requests are welcome through GitHub.
We Need More Data!
There are many bacterial species for which we have limited data to establish robust quality control (QC) thresholds. If there is a species you work with that is not mentioned in our current dataset, please make a request. See our current list of requests, and for details on how to suggest a species, see the Requests page.
The QualiBact standard of QC thresholds included here (usually qualibact-v1) is based on AllTheBacteria data, which were assembled with Shovill. We are aware that not all species of interest are covered, even though the dataset contains 2.4M genomes. We are also aware that the choice of assembler affects certain metrics (such as N50 and number of contigs), and that many species have limited public data, which limits our ability to set robust thresholds.
We are actively seeking contributions of additional data. There are two avenues for contribution:
- Direct contribution of genome assembly statistics, nucleotide composition data, and metadata for species of interest (see below for details). We will run the qualibact pipeline on these data and include them in future releases. This is the preferred route for most users, and provides consistent automated processing of data.
- Presenting thresholds you have established for species of interest based on your own data and analyses. We will include these thresholds in future releases, with appropriate attribution.
Route 1 - Direct Data Contribution
In this route, you provide us with values for common assembly metrics, metadata describing assembly methods, nucleotide composition data, and (optionally) CheckM2 results for genome assemblies of the species you are interested in. Specifically, we require:
- Genome assembly statistics
- Nucleotide composition data for genome assemblies
- Metadata describing assembly methods and parameters
- OPTIONAL: CheckM2 quality assessment results
We do not need the assembly sequences themselves. The data you provide should be for genomes you consider to be of sufficient quality for tasks such as genotyping (e.g., cgMLST, nkST), antimicrobial resistance gene detection, and SNP-based phylogeny.
As part of your submission, please also include the list of contributors, so we can appropriately attribute them. Please also include a description of the dataset you are submitting: number of genomes, how they were selected, and where they were sourced.
Data Requirements
Please provide the following files for each dataset:
- Nucleotide counts (TSV format, compressed as .xz or .gz).
- Metadata description TSV file with assembly/sequencing details.
- Genome assembly statistics TSV file with metrics such as N50.
- CheckM2 results (TSV format, compressed as .xz or .gz) - This is optional.
File Naming Convention
- Nucleotide counts: nucleotide_counts_[dataset_name].tsv.xz
- Metadata: metadata_[dataset_name].tsv.xz
- Assembly stats: assembly_stats_[dataset_name].tsv.xz
- CheckM2 results: checkm2_results_[dataset_name].tsv.xz
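As a minimal sketch of this naming convention, the helper below builds the expected file name for each table. The function name `submission_name` and the dataset name `my_dataset` are illustrative only and are not part of QualiBact.

```shell
#!/bin/bash
# Hypothetical helper: print the expected submission file name for one table.
# Usage: submission_name <kind> <dataset_name>
# <kind> is one of: nucleotide_counts, metadata, assembly_stats, checkm2_results
submission_name() {
    case "$1" in
        nucleotide_counts|metadata|assembly_stats|checkm2_results)
            printf '%s_%s.tsv.xz\n' "$1" "$2" ;;
        *)
            echo "unknown table type: $1" >&2
            return 1 ;;
    esac
}
```

For example, `xz -c assembly_stats.tsv > "$(submission_name assembly_stats my_dataset)"` would write the compressed table to `assembly_stats_my_dataset.tsv.xz`, assuming the uncompressed table is called `assembly_stats.tsv`.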
You can email links to these files to: nabil.alikhan@cgps.group
Metadata Requirements
Please include the following information in your metadata file:
- Assembly software/pipeline and version
- Assembly parameters used
- Sequencing platform and instrument
- Species name
- Data source and accession numbers (if applicable)
For example, the metadata TSV file could look like this (fields are tab-separated in the actual file):
| filename | platform | instrument | species | accession | software | version | custom_parameters | Notes |
|---|---|---|---|---|---|---|---|---|
| SAMN40089455.fa | Illumina | NextSeq550 | Salmonella enterica | SAMN40089455 | SKESA | 1.1 | None | SKESA, default parameters |
| SAMN40089456.fa | Illumina | NextSeq550 | Escherichia coli | SAMN40089456 | SPAdes | 2.3 | --isolate | SPAdes |
| my_reads_pri.fa | ONT | MinION | Escherichia coli | NA | Flye | 1.3 | None | None |
The values in the filename field should match what you have provided in the other files.
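One way to check this before submitting is to compare the first column of two tables. The sketch below assumes both files are uncompressed, tab-separated, and start with a header row; the function name `missing_filenames` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: list filenames present in the first column of table $1
# but absent from the first column of table $2.
# Usage: missing_filenames metadata.tsv nucleotide_counts.tsv
missing_filenames() {
    awk -F'\t' '
        FNR == 1 { file_index++; next }           # skip each header and track which file is being read
        file_index == 1 { seen[$1] = 1; next }    # first file read by awk: the reference table ($2)
        !($1 in seen) { print $1 }                # second file read by awk ($1): report unmatched names
    ' "$2" "$1"
}
```

An empty result means every filename in the first table also appears in the second. Running the check in both directions covers names missing from either table.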
Genome assembly statistics
We must have the following metrics:
| Metric | Description |
|---|---|
| total_length | Total number of base pairs across all contigs/scaffolds |
| number | Total number of contigs/scaffolds |
| N50 | Contig length such that 50% of the assembly is in contigs ≥ this size |
DO NOT APPLY A MINIMUM CONTIG SIZE FILTER.
For the sake of consistency, please use assembly-stats as described by AllTheBacteria.
assembly-stats is available from conda and BioContainers; you can also download the source from GitHub and compile it yourself.
It is very easy to use. The following command will run on all files in the folder matching the wildcard:
assembly-stats -t /workdir/assembly/*.fa.gz
The file should look something like this:
filename total_length number mean_length longest shortest N_count Gaps N50 N50n N70 N70n N90 N90n
SAMD00127152.fa.gz 5180815 165 31398.88 486012 206 0 0 143489 12 83834 22 25975 43
SAMD00127153.fa.gz 5278830 113 46715.31 633980 211 100 1 245414 6 143082 12 43792 24
SAMD00127154.fa.gz 5263743 218 24145.61 390782 208 100 1 147090 12 94642 22 26394 41
SAMD00127155.fa.gz 5261650 147 35793.54 533596 209 100 1 178971 10 119101 16 33179 32
SAMD00127156.fa.gz 5262861 207 25424.45 588167 208 100 1 147090 11 96125 20 30157 39
We will also accept output from other tools, such as Quast, as long as the required metrics are present and clearly labelled.
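To confirm that a statistics table carries the required metrics, something like the following could be used. It is a sketch that assumes an uncompressed, tab-separated file whose header uses the assembly-stats column names shown above; the function name `required_metrics` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: print filename, total_length, number and N50 from a
# tab-separated assembly statistics table, locating columns by header name.
# Returns a non-zero status if any required column is absent.
required_metrics() {
    awk -F'\t' -v OFS='\t' '
        NR == 1 {
            # Map each header name to its column position
            for (i = 1; i <= NF; i++) col[$i] = i
            n = split("filename total_length number N50", need, " ")
            for (j = 1; j <= n; j++) {
                if (!(need[j] in col)) {
                    print "missing column: " need[j] > "/dev/stderr"
                    missing = 1
                }
            }
            if (missing) exit 1
        }
        { print $col["filename"], $col["total_length"], $col["number"], $col["N50"] }
    ' "$1"
}
```

Output from another tool would first need its columns renamed to these headers for the check to pass.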
Nucleotide Composition Analysis
Other tools do provide GC content, but for some analyses we require more detailed nucleotide composition data. We need a table containing counts for each nucleotide in genome assemblies:
Required Output Format
Filename A T G C N Other
SAMN40089455.fa 1234567 1200000 1345000 1320000 1000 50
SAMN40089456.fa 1000000 980000 1010000 990000 500 10
The Filename should match what you have provided in the other files.
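A quick format check along these lines may help catch problems before submission. It is only a sketch: it assumes an uncompressed, tab-separated table with the seven columns shown above, and the function name `validate_counts` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check that a nucleotide counts table has seven
# tab-separated columns and that every count is a non-negative whole number.
# Prints one message per problem and returns a non-zero status if any is found.
validate_counts() {
    awk -F'\t' '
        BEGIN { bad = 0 }
        NF != 7 {
            printf "line %d: expected 7 columns, found %d\n", NR, NF
            bad = 1
            next
        }
        NR > 1 {
            # Columns 2 to 7 hold the A, T, G, C, N and Other counts
            for (i = 2; i <= 7; i++) {
                if ($i !~ /^[0-9]+$/) {
                    printf "line %d: column %d is not a whole number: %s\n", NR, i, $i
                    bad = 1
                }
            }
        }
        END { exit bad }
    ' "$1"
}
```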
Example script
Here is an example script that you can adapt:
#!/bin/bash
# Usage: ./count_bases.sh path/to/file.fa
FASTA="$1"
FILENAME=$(basename "$FASTA")
# Write header if needed
echo -e "Filename\tA\tT\tG\tC\tN\tOther"
# Count nucleotides
grep -v "^>" "$FASTA" | tr -d '\n' | awk -v file="$FILENAME" '
BEGIN {
    A=0; T=0; G=0; C=0; N=0; other=0;
}
{
    for (i = 1; i <= length($0); i++) {
        b = toupper(substr($0, i, 1));
        if (b == "A") A++;
        else if (b == "T") T++;
        else if (b == "G") G++;
        else if (b == "C") C++;
        else if (b == "N") N++;
        else other++;
    }
}
END {
    printf "%s\t%d\t%d\t%d\t%d\t%d\t%d\n", file, A, T, G, C, N, other;
}'
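The script above prints a header for every file it is run on, so concatenating its output across many assemblies would repeat the header. The variant below is a sketch that prints the header once and then one row per uncompressed FASTA file, counting with `tr` and `wc` rather than a per-character loop. The function names `count_one` and `count_bases_table` are hypothetical.

```shell
#!/bin/bash
# Sketch: count A, T, G, C, N and other characters in one uncompressed FASTA file.
# The body runs in a subshell, so its variables do not leak into the caller.
count_one() (
    fasta="$1"
    seq=$(mktemp)
    # Keep sequence lines only and remove line endings
    grep -v '^>' "$fasta" | tr -d '\r\n' > "$seq"
    # Count the characters in $seq that belong to the given set, e.g. "Aa"
    count() { echo $(( $(tr -cd "$1" < "$seq" | wc -c) )); }
    total=$(( $(wc -c < "$seq") ))
    a=$(count 'Aa'); t=$(count 'Tt'); g=$(count 'Gg'); c=$(count 'Cc'); n=$(count 'Nn')
    other=$(( total - a - t - g - c - n ))
    printf '%s\t%d\t%d\t%d\t%d\t%d\t%d\n' "$(basename "$fasta")" "$a" "$t" "$g" "$c" "$n" "$other"
    rm -f "$seq"
)

# Print one header line, then one row per FASTA file given as an argument.
count_bases_table() {
    printf 'Filename\tA\tT\tG\tC\tN\tOther\n'
    for fasta in "$@"; do
        count_one "$fasta"
    done
}
```

A call such as `count_bases_table /workdir/assembly/*.fa > nucleotide_counts.tsv` would then produce a single table. Compressed assemblies would need to be decompressed first.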
Running CheckM2 - Optional
We are aware that running CheckM2 on thousands of genomes can be a tall order, so this step is optional for submission.
To ensure consistency with existing analyses, please follow the same protocol used by AllTheBacteria:
Requirements
- CheckM2 version 1.0.1
- CheckM2 database: uniref100.KO.1.dmnd
We recommend using the same Singularity container used by AllTheBacteria:
Container download:
- Source: https://osf.io/7vpy3
wget -O checkm2.1.0.1--pyh7cba7a3_0.img https://osf.io/download/7vpy3/
CheckM2 database download:
- Source: https://osf.io/x5vtj
wget -O uniref100.KO.1.dmnd https://osf.io/download/x5vtj/
Example CheckM2 Command
# Define variables
WORKDIR="/path/to/working_directory"
IMG="$WORKDIR/checkm2.1.0.1--pyh7cba7a3_0.img"
DB="$WORKDIR/path/to/uniref100.KO.1.dmnd"
OUTDIR="$WORKDIR/output"
FASTA="$WORKDIR/path/to/assembly.fa"
# Run CheckM2 inside the container (variables are quoted so paths with spaces work)
singularity exec --bind "$WORKDIR" "$IMG" checkm2 predict --allmodels --lowmem --database_path "$DB" --remove_intermediates --force -i "$FASTA" --threads 4 -o "$OUTDIR"
The output from CheckM2 will look like:
| Name | Completeness_General | Contamination | Completeness_Specific | Completeness_Model_Used | Translation_Table_Used | Coding_Density | Contig_N50 | Average_Gene_Length | Genome_Size | GC_Content | Total_Coding_Sequences | Additional_Notes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SAMD00127152 | 100.0 | 0.09 | 100.0 | Neural Network (Specific Model) | 11 | 0.875 | 143489 | 303.8069048574869 | 5180815 | 0.5 | 4982 | None |
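If CheckM2 is run separately for each assembly, the per-run tables need to be combined into one file for submission. A minimal sketch follows, assuming every table has the same header line; the function name `merge_reports` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: concatenate several CheckM2 result tables,
# keeping the header line from the first file only.
# Usage: merge_reports report1.tsv report2.tsv ... > combined.tsv
merge_reports() {
    awk 'FNR == 1 && NR != 1 { next } { print }' "$@"
}
```

For instance, `merge_reports */quality_report.tsv | xz > checkm2_results_my_dataset.tsv.xz` would build the submission file, assuming each run wrote its table as `quality_report.tsv` inside its own output directory and `my_dataset` is your dataset name.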
Route 2 - Present Your Own Thresholds
If you have established quality control thresholds for a species of interest based on your own data and analyses, we welcome you to present these thresholds for inclusion in future releases of QualiBact. Please provide:
- Species name
- Quality control thresholds for relevant metrics (e.g., N50, total length, number of contigs, GC content, completeness, contamination) as a CSV. The table should include columns for metric name and threshold values. See example below.
- Description of the dataset used (summary table) to establish these thresholds. See example below.
- Any supporting analyses or visualizations
An example of the threshold table is shown below.
| Species | Metric | Lower Bounds | Upper Bounds |
|---|---|---|---|
| Acinetobacter baumannii | N50 | 17000.0 | |
| Acinetobacter baumannii | no_of_contigs | | 490 |
| Acinetobacter baumannii | GC_Content | 38 | 40 |
| Acinetobacter baumannii | Completeness | 96.0 | |
| Acinetobacter baumannii | Contamination | | 7 |
| Acinetobacter baumannii | Total_Coding_Sequences | 3400 | 4500 |
| Acinetobacter baumannii | Genome_Size | 3600000 | 4600000 |
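To illustrate how such a table could be stored and read as a CSV, the sketch below tests one metric value against the bounds for a species. It assumes a comma-separated file with the header `Species,Metric,Lower Bounds,Upper Bounds`, and it treats an empty bound as "no limit", which is an assumption of this example rather than a QualiBact rule. The function name `passes_threshold` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: test one metric value against a thresholds CSV.
# Usage: passes_threshold thresholds.csv "<species>" <metric> <value>
# Returns 0 if the value lies within the bounds, 1 if it does not
# or if no row matches the species and metric.
passes_threshold() {
    awk -F',' -v species="$2" -v metric="$3" -v value="$4" '
        NR > 1 && $1 == species && $2 == metric {
            found = 1
            # An empty bound is treated as "no limit" (assumption)
            if ($3 != "" && value + 0 < $3 + 0) fail = 1
            if ($4 != "" && value + 0 > $4 + 0) fail = 1
        }
        END {
            if (found && !fail) exit 0
            exit 1
        }
    ' "$1"
}
```

With a row such as `Acinetobacter baumannii,N50,17000.0,` in `thresholds.csv`, the call `passes_threshold thresholds.csv "Acinetobacter baumannii" N50 142216` would succeed, since only a lower bound is set for that metric.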
An example of the summary table is shown below.
| metric | mean | std | median | q1 | q3 | iqr | min | max | MY_LOWER | MY_UPPER | species | count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| N50 | 142216.91 | 76223.46 | 134688.50 | 96516.00 | 164093.75 | 67577.75 | 9210.00 | 718672.00 | 17467.69 | 466321.33 | Acinetobacter baumannii | 26690 |
| number | 120.38 | 66.22 | 105.00 | 83.00 | 136.00 | 53.00 | 19.00 | 779.00 | 34.00 | 482.00 | Acinetobacter baumannii | 26690 |
| longest | 339276.47 | 165020.98 | 301308.50 | 250723.00 | 373248.00 | 122525.00 | 44374.00 | 1648648.00 | 73316.14 | 1110621.82 | Acinetobacter baumannii | 26690 |
| GC_Content | 0.3910 | 0.00 | 0.3910 | 0.3901 | 0.3902 | 0.00 | 0.3800 | 0.4000 | 0.39 | 0.40 | Acinetobacter baumannii | 26690 |
| Completeness_Specific | 100.00 | 0.04 | 100.00 | 99.99 | 100.00 | 0.01 | 94.33 | 100.00 | 96.86 | 100.00 | Acinetobacter baumannii | 26690 |
| Contamination | 0.19 | 0.20 | 0.13 | 0.08 | 0.24 | 0.16 | 0.00 | 4.32 | 0.00 | 6.63 | Acinetobacter baumannii | 26690 |
| Total_Coding_Sequences | 3790.81 | 133.14 | 3785.00 | 3704.00 | 3868.00 | 164.00 | 3232.00 | 4545.00 | 3452.00 | 4445.00 | Acinetobacter baumannii | 26690 |
| Genome_Size | 3956786.76 | 110442.03 | 3953434.50 | 3884469.75 | 4019485.25 | 135015.50 | 3216153 | 4532195 | 3638126 | 4566577 | Acinetobacter baumannii | 26690 |
You can look to the qualibact-v1 thresholds as an example of how to present your data.
As part of your submission, please also include the list of contributors, so we can appropriately attribute them. Please also include a description of the dataset you are submitting: number of genomes, how they were selected, and where they were sourced.
Getting Help
If you have questions about:
- Data formats: Check our example in qualibact-v1 thresholds.
- Technical issues: Open an issue on GitHub.
- Collaboration opportunities: Contact nabil.alikhan@cgps.group
Thank you for contributing to improving bacterial genome quality assessment!