nf-core/genomicrelatedness
Bioinformatics pipeline for estimating genetic relatedness from low-coverage whole-genome sequencing (sWGS) data
Introduction
This document describes the output produced by the nf-core/genomicrelatedness pipeline. The output consists of two main components: output files (e.g. TXT, CRAM, BAM or VCF files) and summary statistics for the whole run presented in a MultiQC report. Intermediate files and module-specific statistics files are also retained.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. The results directory must be specified with the --outdir parameter when running the pipeline. During the run, intermediate files are written to the work directory, which can be specified with the -work-dir option.
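For illustration, a launch might look like the following; the profile, samplesheet, and reference paths are placeholders and not taken from this document:

```shell
# Hypothetical launch command; samplesheet.csv, genome.fa and the docker
# profile are placeholders -- adjust to your own setup.
nextflow run nf-core/genomicrelatedness \
    -profile docker \
    --input samplesheet.csv \
    --fasta genome.fa \
    --outdir results \
    -work-dir /scratch/work
```

Note the convention: single-dash options such as -work-dir belong to Nextflow itself, while double-dash options such as --outdir are pipeline parameters.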
{outdir}
├── bootstrapping
│   ├── round_1
│   │   ├── bqsr
│   │   │   ├── cram
│   │   │   │   └── merged
│   │   │   └── qc
│   │   ├── stats
│   │   └── variants
│   │       ├── db
│   │       ├── filtered
│   │       └── merged
│   ├── round_2
│   │   ├── …
│   │   ⋮
│   └── round_3
│       ├── …
│       ⋮
├── intervals
├── multiqc
├── pipeline_info
├── preprocessing
│   ├── alignment
│   │   ├── bam
│   │   │   └── bwamem2
│   │   └── cram
│   ├── coverage
│   ├── fastp
│   ├── preseq
│   └── stats
├── variant_calling
│   ├── bcftools
│   │   ├── merged
│   │   └── bam
│   └── gatk
│       ├── I<interval>_joint
│       ⋮
│       ├── merged
│       └── stats
└── relatedness_estimation
    ├── exclude
    ├── intersection
    ├── thinned
    └── angsd_ngsrelate
work/
.nextflow.log
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Preprocessing Section
- Bootstrapping Section
- Variant Calling Section
- Relatedness Estimation Section
- MultiQC
- Pipeline Information
Preprocessing section
The first section prepares the input files for downstream analyses.
Prepare Reference Genome
The subworkflow prepare_genome indexes the reference genome with bwa-mem2 index, creates a sequence dictionary with GATK4 CreateSequenceDictionary, and generates a FASTA index file with samtools faidx, providing the basis for mapping the sequencing reads.
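The three indexing steps above can be sketched as plain commands; file names are placeholders, and the nf-core modules set their own options:

```shell
# Sketch of the indexing commands wrapped by prepare_genome.
bwa-mem2 index genome.fa                    # index for read mapping
gatk CreateSequenceDictionary -R genome.fa  # genome.dict for GATK4 tools
samtools faidx genome.fa                    # genome.fa.fai FASTA index
```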
Prepare Intervals
The subworkflow build_intervals builds intervals with a gawk-based build_intervals module and splits them with a split_intervals module, providing a computationally efficient basis for the Bootstrapping and Variant Calling sections.
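Conceptually, interval building turns the FASTA index into per-contig BED records. The following is a self-contained sketch of that idea, not the pipeline's exact gawk command; the toy .fai file is fabricated here so the example runs on its own:

```shell
# Build a BED file (contig, 0, length) from a FASTA index.
printf 'chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61\n' > genome.fa.fai
awk 'BEGIN{OFS="\t"} {print $1, 0, $2}' genome.fa.fai > intervals.bed
cat intervals.bed
```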
Prepare Input Files
The raw sequencing reads are prepared for downstream analysis. The input files are the sample table, raw sequencing reads FASTQ files, and the reference genome in FASTA format. The sequencing reads undergo quality control, trimming, filtering, and merging of paired reads with fastp.
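The read preprocessing described above can be sketched for one paired-end sample; file names are placeholders, and the pipeline's fastp module may use different settings:

```shell
# Illustrative fastp call: QC, trimming, filtering and merging of read pairs.
fastp \
    --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
    --out1 sample_trimmed_R1.fastq.gz --out2 sample_trimmed_R2.fastq.gz \
    --merge --merged_out sample_merged.fastq.gz \
    --json sample.fastp.json --html sample.fastp.html
```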
Map to Reference
The processed sequencing reads are mapped to the reference genome with bwa-mem2 mem. The reads are then sorted with samtools sort, and read groups are added with GATK4 AddOrReplaceReadGroups.
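A sketch of the mapping chain, assuming placeholder file names and read-group values:

```shell
# Map, sort, and attach read groups for one read set.
bwa-mem2 mem -t 8 genome.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -o sample.sorted.bam -
gatk AddOrReplaceReadGroups \
    -I sample.sorted.bam -O sample.rg.bam \
    --RGID lane1 --RGLB lib1 --RGPL ILLUMINA --RGPU unit1 --RGSM sample
```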
Mark Duplicates
Reads stemming from the same sample are merged with samtools merge, duplicates are removed with GATK4 MarkDuplicates, and the result is indexed with samtools index, providing the CRAM file for further analysis.
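These three steps can be sketched as follows (placeholder file names; the conversion to CRAM is omitted here for brevity):

```shell
# Merge per-read-group BAMs, remove duplicates, and index the result.
samtools merge sample.merged.bam sample_rg1.bam sample_rg2.bam
gatk MarkDuplicates \
    -I sample.merged.bam -O sample.dedup.bam \
    -M sample.dedup.metrics.txt --REMOVE_DUPLICATES true
samtools index sample.dedup.bam
```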
Preprocessing statistics
Library complexity is assessed with preseq c_curve and preseq lc_extrap (errorStrategy = 'ignore'; maxRetries = 1). mosdepth and samtools stats are used to summarize coverage and mapping statistics.
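The QC commands above can be sketched for one sample (placeholder file names; the modules choose their own options):

```shell
# Library complexity, coverage, and mapping statistics for one sample.
preseq c_curve -B -o sample.c_curve.txt sample.dedup.bam
preseq lc_extrap -B -o sample.lc_extrap.txt sample.dedup.bam
mosdepth sample sample.dedup.bam
samtools stats sample.dedup.bam > sample.stats.txt
```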
Output files
- preprocessing/alignment/: directory containing the BAM files for each sample_RGID (in the subdirectory bam/bwamem2/) and the deduplicated CRAM and CRAI files for each individual with corresponding deduplication metrics (in the subdirectory cram/).
- preprocessing/coverage/: directory containing text files with coverage statistics output from mosdepth (global and summarized for each contig) for each individual.
- preprocessing/fastp/: directory containing, for each individual, log, HTML, and JSON files with the fastp preprocessing statistics, as well as the trimmed and filtered FASTQ files.
- preprocessing/preseq/: directory containing the library complexity estimates for each individual provided by c_curve and lc_extrap.
- preprocessing/stats/: directory containing the sequencing read statistics for each individual.
Bootstrapping Section
The second section, bootstrapping, corrects systematic errors introduced during sequencing by recalibrating base quality scores. If an external VCF file with a reference variant set is provided, this section starts directly with the BQSR subworkflow.
Call Variants
In the absence of a known variant set, the workflow first performs an internal bootstrapping procedure to create a temporary high-confidence variant resource. The CRAM files from the Preprocessing section, the BED file with intervals from the prepare_intervals subworkflow, and the FASTA reference genome are combined and used for an initial round of variant calling with the subworkflow GATK_variant_calling. GATK4 HaplotypeCaller produces gVCFs for each sample and scaffold (-ERC GVCF), which are then combined with GATK4 GenomicsDBImport, jointly genotyped with GATK4 GenotypeGVCFs, and merged with GATK4 MergeVcfs.
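The joint-calling chain for one interval can be sketched as below; all file names and the two-sample cohort are placeholders:

```shell
# Per-sample gVCFs, joint genotyping per interval, then merging.
gatk HaplotypeCaller -R genome.fa -I sample1.cram \
    -L interval.bed -ERC GVCF -O sample1.g.vcf.gz
gatk GenomicsDBImport -L interval.bed \
    -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    --genomicsdb-workspace-path interval_db
gatk GenotypeGVCFs -R genome.fa -V gendb://interval_db -O interval.vcf.gz
gatk MergeVcfs -I interval1.vcf.gz -I interval2.vcf.gz -O cohort.vcf.gz
```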
Hard Filter Variants
The subworkflow variant_filtering then applies hard filter criteria with GATK4 VariantFiltration (QUAL >= 100; QD < 2.0; MQ < 35.0; FS > 60.0; HaplotypeScore > 13.0; MQRankSum < -12.5; ReadPosRankSum < -8.0) and GATK4 SelectVariants to create a reference variant set in VCF format (.vcf).
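A sketch of how such hard filtering might be expressed with the thresholds listed above (file names are placeholders, and the pipeline may phrase the filter expressions differently):

```shell
# Flag sites failing the hard filters, then keep only passing sites.
gatk VariantFiltration -R genome.fa -V cohort.vcf.gz \
    --filter-name "hard_filter" \
    --filter-expression "QD < 2.0 || MQ < 35.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    -O cohort.flagged.vcf.gz
gatk SelectVariants -R genome.fa -V cohort.flagged.vcf.gz \
    --exclude-filtered -O reference_set.vcf.gz
```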
Base Quality Score Recalibration
In the subworkflow BQSR, GATK4 BaseRecalibrator takes as input the output from the Preprocessing section and the VCF file with the reference variant set (either the one produced previously or an existing one) to compute recalibration tables. These are compiled with GATK4 GatherBQSRReports and applied with GATK4 ApplyBQSR, followed by samtools merge and samtools index, to produce recalibrated CRAMs. The BQSR subworkflow is designed to be iterated, using the recalibrated CRAM files instead of the ones from the Preprocessing section. For each round, GATK4 AnalyzeCovariates generates diagnostic plots to evaluate the effectiveness of recalibration, and bcftools stats produces variant quality summaries.
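One BQSR round for a single sample can be sketched as follows (placeholder file names):

```shell
# Compute, gather, apply, and evaluate base quality score recalibration.
gatk BaseRecalibrator -R genome.fa -I sample.cram \
    --known-sites reference_set.vcf.gz -O sample.recal.table
gatk GatherBQSRReports -I part1.recal.table -I part2.recal.table \
    -O sample.merged.table
gatk ApplyBQSR -R genome.fa -I sample.cram \
    --bqsr-recal-file sample.merged.table -O sample.recal.cram
gatk AnalyzeCovariates -bqsr sample.merged.table -plots sample.recal.pdf
```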
Output files
- bootstrapping/round_X/bqsr/: directory containing the CRAM files for each individual and interval (in the subdirectory cram/), the recalibrated CRAM files with corresponding .crai index files merged per individual (in the subdirectory cram/merged/), and the recalibration diagnostics as PDF and CSV files for each sample (in the subdirectory qc/).
- bootstrapping/round_X/stats/: directory containing the text file with variant statistics.
- bootstrapping/round_X/variants/: directory containing the called variants for each individual as VCF and TBI files per interval, the merged VCF and TBI files (in the subdirectory merged/), and the hard-filtered variant set (in the subdirectory filtered/).
Variant Calling Section
In the third section, variant calling, the recalibrated CRAMs are processed to generate genotype likelihoods and multi-sample VCFs. Two variant calling approaches, bcftools and GATK4, are used to mitigate caller-specific biases, and both subworkflows run in parallel. The subworkflow call_variants_gatk is executed as in the previous section with GATK4 HaplotypeCaller, GATK4 GenomicsDBImport, GATK4 GenotypeGVCFs, and GATK4 MergeVcfs. The subworkflow call_variants_bcftools converts the recalibrated CRAM files with samtools, runs bcftools mpileup (--output-type z -d 100) and bcftools call (--output-type z -m -v --write-index=tbi) per scaffold, and concatenates the results into a full cohort VCF with bcftools concat (--output-type z --write-index=tbi). The callsets from both subworkflows are accompanied by variant-level quality summaries from bcftools stats. This stage therefore outputs two harmonized multi-sample VCFs, one from GATK4 and one from bcftools. Additional summary statistics, such as transition/transversion ratios, can be used to judge the quality of (and improvement in) the called variants and whether further rounds of BQSR could be beneficial. Finally, in the subworkflow intersect_variants, the two callsets are intersected with bcftools isec to retain only sites called by both subworkflows, filtered with vcftools (via the parameters --include_scaffolds or --exclude_scaffolds), e.g. to exclude mitochondrial or gonosomal scaffolds or to retain only autosomal scaffolds, and thinned with vcftools --thin (--remove-filtered-all --remove-indels --maf 0.025 --recode --recode-INFO-all --max-missing 0.75), producing the final variant set.
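The bcftools branch and the final intersection/thinning might look roughly like this; all file names are placeholders, and the thinning distance is an illustrative value not stated in this document:

```shell
# Per-scaffold calling, concatenation, intersection, and thinning.
bcftools mpileup -f genome.fa --output-type z -d 100 sample.recal.bam \
    | bcftools call --output-type z -m -v --write-index=tbi \
        -o scaffold1.vcf.gz
bcftools concat --output-type z --write-index=tbi \
    -o bcftools_cohort.vcf.gz scaffold1.vcf.gz scaffold2.vcf.gz
bcftools isec -p isec_dir gatk_cohort.vcf.gz bcftools_cohort.vcf.gz
vcftools --gzvcf shared_sites.vcf.gz \
    --thin 1000 \
    --remove-filtered-all --remove-indels --maf 0.025 \
    --max-missing 0.75 --recode --recode-INFO-all --out final_set
```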
Output files
- variant_calling/bcftools/: directory containing the called variants as VCF and TBI files per interval, the individual BAM files per sample (in the subdirectory bam/), and the merged VCF and TBI files (in the subdirectory merged/).
- variant_calling/gatk/: directory containing the called variants for each individual as VCF and TBI files per interval, the merged VCF and TBI files (in the subdirectory merged/), and a text file with variant statistics (in the subdirectory stats/).
Relatedness Estimation Section
The fourth section, relatedness estimation, uses the final variant set to produce robust estimates of pairwise relatedness suitable for low-coverage whole-genome sequencing data. It infers relatedness and inbreeding from genotype likelihood data using NgsRelate v2 (from the ANGSD ecosystem). The final output is a pairwise relatedness matrix.
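A VCF-based NgsRelate v2 invocation can be sketched as below; the file names are placeholders, and the pipeline's module may pass additional options:

```shell
# Estimate pairwise relatedness from genotype likelihoods in a VCF.
ngsRelate -h final_set.vcf.gz -O relatedness.tsv
```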
Output files
- relatedness_estimation/angsd_ngsrelate/: directory containing the tab-delimited file with the pairwise relatedness matrix of the analyzed samples.
- relatedness_estimation/excluded/: directory containing the VCF file filtered to include only variants on the scaffolds defined to be included.
- relatedness_estimation/include_contigs/: directory containing the BED file with the scaffolds defining which variants to use for relatedness estimation.
- relatedness_estimation/intersection/: directory containing the VCF files produced by bcftools isec, a README.txt explaining the content of the four different VCF files, and a text file with the sites.
- relatedness_estimation/thinned/: directory containing the VCF file with the filtered variant data used as input for the relatedness estimation.
MultiQC
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results from the preprocessing and bootstrapping section are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastP. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info. An excellent overview of MultiQC outputs and how to interpret the various plots and values is provided in the nf-core/eager pipeline.
Output files
- multiqc/multiqc_report.html: a standalone HTML file, which includes the most important overview statistics and can be viewed in your web browser.
- multiqc/multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc/multiqc_plots/: directory containing static images from the report in various formats.
Pipeline information
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email/--email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.