nf-core/genomicrelatedness
Bioinformatics pipeline for estimating genetic relatedness from low-coverage whole-genome sequencing (sWGS) data
Introduction
This document describes the output produced by the nf-core/genomicrelatedness pipeline. The output consists of two main components: output files (e.g. TXT, CRAM, BAM or VCF files) and summary statistics for the whole run presented in a MultiQC report. Intermediate files and module-specific statistics files are also retained.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. The results directory must be specified with the --outdir parameter when running the pipeline. During the run, intermediate files are written to the work directory, which can be specified with the -work-dir option.
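For illustration, a launch might look like the following; the profile, samplesheet, and reference paths are placeholders and not taken from this document:

```shell
# Hypothetical launch command; samplesheet.csv, genome.fa and the docker
# profile are placeholders -- adjust to your own setup.
nextflow run nf-core/genomicrelatedness \
    -profile docker \
    --input samplesheet.csv \
    --fasta genome.fa \
    --outdir results \
    -work-dir /scratch/work
```

Note the convention: single-dash options such as -work-dir belong to Nextflow itself, while double-dash options such as --outdir are pipeline parameters.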
{outdir}
├── bootstrapping
│   ├── round_1
│   │   ├── bqsr
│   │   │   ├── cram
│   │   │   │   └── merged
│   │   │   └── qc
│   │   ├── stats
│   │   └── variants
│   │       ├── db
│   │       ├── filtered
│   │       └── merged
│   ├── round_2
│   │   ├── …
│   │   ⋮
│   └── round_3
│       ├── …
│       ⋮
├── intervals
├── multiqc
├── pipeline_info
├── preprocessing
│   ├── alignment
│   │   ├── bam
│   │   │   └── bwamem2
│   │   └── cram
│   ├── coverage
│   ├── fastp
│   ├── preseq
│   └── stats
├── variant_calling
│   ├── bcftools
│   │   ├── merged
│   │   └── bam
│   └── gatk
│       ├── I<interval>_joint
│       ⋮
│       ├── merged
│       └── stats
└── relatedness_estimation
    ├── exclude
    ├── intersection
    ├── thinned
    └── angsd_ngsrelate
work/
.nextflow.log
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Preprocessing Section
- Bootstrapping Section
- Variant Calling Section
- Relatedness Estimation Section
- MultiQC
- Pipeline Information
Preprocessing section
The first section prepares the input files for downstream analyses.
Prepare Reference Genome
The subworkflow prepare_genome indexes the reference genome with bwa-mem2 index, creates a sequence dictionary with GATK4 CreateSequenceDictionary, and generates a FASTA index file with samtools faidx, providing the basis for mapping the sequencing reads.
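The three indexing steps above can be sketched as plain commands; file names are placeholders, and the nf-core modules set their own options:

```shell
# Sketch of the indexing commands wrapped by prepare_genome.
bwa-mem2 index genome.fa                    # index for read mapping
gatk CreateSequenceDictionary -R genome.fa  # genome.dict for GATK4 tools
samtools faidx genome.fa                    # genome.fa.fai FASTA index
```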
Prepare Intervals
The subworkflow build_intervals builds intervals with a gawk-based build_intervals module and splits them with a split_intervals module, providing a computationally efficient basis for the Bootstrapping and Variant Calling sections.
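Conceptually, interval building turns the FASTA index into per-contig BED records. The following is a self-contained sketch of that idea, not the pipeline's exact gawk command; the toy .fai file is fabricated here so the example runs on its own:

```shell
# Build a BED file (contig, 0, length) from a FASTA index.
printf 'chr1\t1000\t6\t60\t61\nchr2\t800\t1030\t60\t61\n' > genome.fa.fai
awk 'BEGIN{OFS="\t"} {print $1, 0, $2}' genome.fa.fai > intervals.bed
cat intervals.bed
```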
Prepare Input Files
The raw sequencing reads are prepared for downstream analysis. The input files are the sample table, raw sequencing reads FASTQ files, and the reference genome in FASTA format. The sequencing reads undergo quality control, trimming, filtering, and merging of paired reads with fastp.
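The read preprocessing described above can be sketched for one paired-end sample; file names are placeholders, and the pipeline's fastp module may use different settings:

```shell
# Illustrative fastp call: QC, trimming, filtering and merging of read pairs.
fastp \
    --in1 sample_R1.fastq.gz --in2 sample_R2.fastq.gz \
    --out1 sample_trimmed_R1.fastq.gz --out2 sample_trimmed_R2.fastq.gz \
    --merge --merged_out sample_merged.fastq.gz \
    --json sample.fastp.json --html sample.fastp.html
```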
Map to Reference
The processed sequencing reads are mapped to the reference genome with bwa-mem2 mem. The reads are then sorted with samtools sort, and read groups are added with GATK4 AddOrReplaceReadGroups.
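A sketch of the mapping chain, assuming placeholder file names and read-group values:

```shell
# Map, sort, and attach read groups for one read set.
bwa-mem2 mem -t 8 genome.fa sample_R1.fastq.gz sample_R2.fastq.gz \
    | samtools sort -o sample.sorted.bam -
gatk AddOrReplaceReadGroups \
    -I sample.sorted.bam -O sample.rg.bam \
    --RGID lane1 --RGLB lib1 --RGPL ILLUMINA --RGPU unit1 --RGSM sample
```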
Mark Duplicates
Reads stemming from the same sample are merged with samtools merge, duplicates are removed with GATK4 MarkDuplicates, and the result is indexed with samtools index, providing the CRAM file for further analysis.
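These three steps can be sketched as follows (placeholder file names; the conversion to CRAM is omitted here for brevity):

```shell
# Merge per-read-group BAMs, remove duplicates, and index the result.
samtools merge sample.merged.bam sample_rg1.bam sample_rg2.bam
gatk MarkDuplicates \
    -I sample.merged.bam -O sample.dedup.bam \
    -M sample.dedup.metrics.txt --REMOVE_DUPLICATES true
samtools index sample.dedup.bam
```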
Preprocessing statistics
Library complexity is assessed with preseq c_curve and preseq lc_extrap (errorStrategy = 'ignore'; maxRetries = 1). mosdepth and samtools stats are used to summarize coverage and mapping statistics.
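The QC commands above can be sketched for one sample (placeholder file names; the modules choose their own options):

```shell
# Library complexity, coverage, and mapping statistics for one sample.
preseq c_curve -B -o sample.c_curve.txt sample.dedup.bam
preseq lc_extrap -B -o sample.lc_extrap.txt sample.dedup.bam
mosdepth sample sample.dedup.bam
samtools stats sample.dedup.bam > sample.stats.txt
```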
Output files
- preprocessing/alignment/: directory containing the BAM files for each sample_RGID (in the subdirectory bam/bwamem2/) and the deduplicated CRAM and CRAI files for each individual with corresponding deduplication metrics (in the subdirectory cram/).
- preprocessing/coverage/: directory containing text files with coverage statistics output from mosdepth (global and summarized for each contig) for each individual.
- preprocessing/fastp/: directory containing, for each individual, log, HTML, and JSON files with the fastp preprocessing statistics, as well as the trimmed and filtered FASTQ files.
- preprocessing/preseq/: directory containing the library complexity estimates for each individual provided by c_curve and lc_extrap.
- preprocessing/stats/: directory containing the sequencing read statistics for each individual.
Bootstrapping Section
The second section, bootstrapping, corrects systematic errors introduced during sequencing by recalibrating base quality scores. If an external VCF file with a reference variant set is provided, this section starts directly with the BQSR subworkflow.
Call Variants
In the absence of a known variant set, the workflow first performs an internal bootstrapping procedure to create a temporary high-confidence variant resource. The CRAM files from the Preprocessing section, the BED file with intervals from the prepare_intervals subworkflow, and the FASTA reference genome are combined and used for an initial round of variant calling with the subworkflow GATK_variant_calling. GATK4 HaplotypeCaller produces gVCFs for each sample and scaffold (-ERC GVCF), which are then combined with GATK4 GenomicsDBImport, jointly genotyped with GATK4 GenotypeGVCFs, and merged with GATK4 MergeVcfs.
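The joint-calling chain for one interval can be sketched as below; all file names and the two-sample cohort are placeholders:

```shell
# Per-sample gVCFs, joint genotyping per interval, then merging.
gatk HaplotypeCaller -R genome.fa -I sample1.cram \
    -L interval.bed -ERC GVCF -O sample1.g.vcf.gz
gatk GenomicsDBImport -L interval.bed \
    -V sample1.g.vcf.gz -V sample2.g.vcf.gz \
    --genomicsdb-workspace-path interval_db
gatk GenotypeGVCFs -R genome.fa -V gendb://interval_db -O interval.vcf.gz
gatk MergeVcfs -I interval1.vcf.gz -I interval2.vcf.gz -O cohort.vcf.gz
```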
Hard Filter Variants
The subworkflow variant_filtering then applies hard filter criteria with GATK4 VariantFiltration (QUAL >= 100; QD < 2.0; MQ < 35.0; FS > 60.0; HaplotypeScore > 13.0; MQRankSum < -12.5; ReadPosRankSum < -8.0) and GATK4 SelectVariants to create a reference variant set in VCF format (.vcf).
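A sketch of how such hard filtering might be expressed with the thresholds listed above (file names are placeholders, and the pipeline may phrase the filter expressions differently):

```shell
# Flag sites failing the hard filters, then keep only passing sites.
gatk VariantFiltration -R genome.fa -V cohort.vcf.gz \
    --filter-name "hard_filter" \
    --filter-expression "QD < 2.0 || MQ < 35.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    -O cohort.flagged.vcf.gz
gatk SelectVariants -R genome.fa -V cohort.flagged.vcf.gz \
    --exclude-filtered -O reference_set.vcf.gz
```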
Base Quality Score Recalibration
In the subworkflow BQSR, GATK4 BaseRecalibrator takes as input the output from the Preprocessing section and the VCF file with the reference variant set (either the one produced previously or an existing one) to compute recalibration tables. These are compiled with GATK4 GatherBQSRReports and applied with GATK4 ApplyBQSR, followed by samtools merge and samtools index, to produce recalibrated CRAMs. The BQSR subworkflow is designed to be iterated, using the recalibrated CRAM files instead of the ones from the Preprocessing section. For each round, GATK4 AnalyzeCovariates generates diagnostic plots to evaluate the effectiveness of recalibration, and bcftools stats produces variant quality summaries.
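One BQSR round for a single sample can be sketched as follows (placeholder file names):

```shell
# Compute, gather, apply, and evaluate base quality score recalibration.
gatk BaseRecalibrator -R genome.fa -I sample.cram \
    --known-sites reference_set.vcf.gz -O sample.recal.table
gatk GatherBQSRReports -I part1.recal.table -I part2.recal.table \
    -O sample.merged.table
gatk ApplyBQSR -R genome.fa -I sample.cram \
    --bqsr-recal-file sample.merged.table -O sample.recal.cram
gatk AnalyzeCovariates -bqsr sample.merged.table -plots sample.recal.pdf
```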
Output files
- bootstrapping/round_X/bqsr/: directory containing the CRAM files for each individual and interval (in the subdirectory cram/), the recalibrated CRAM files with corresponding .crai index files merged per individual (in the subdirectory cram/merged/), and the recalibration diagnostics as PDF and CSV files for each sample (in the subdirectory qc/).
- bootstrapping/round_X/stats/: directory containing the text file with variant statistics.
- bootstrapping/round_X/variants/: directory containing the called variants for each individual as VCF and TBI files per interval, the merged VCF and TBI files (in the subdirectory merged/), and the hard-filtered variant set (in the subdirectory filtered/).
Variant Calling Section
In the third section, variant calling, the recalibrated CRAMs are processed to generate genotype likelihoods and multi-sample VCFs. Two variant calling approaches, bcftools and GATK4, are used to mitigate caller-specific biases, and both subworkflows run in parallel. The subworkflow call_variants_gatk is executed as in the previous section with GATK4 HaplotypeCaller, GATK4 GenomicsDBImport, GATK4 GenotypeGVCFs, and GATK4 MergeVcfs. The subworkflow call_variants_bcftools converts the recalibrated CRAM files with samtools, runs bcftools mpileup (--output-type z -d 100) and bcftools call (--output-type z -m -v --write-index=tbi) per scaffold, and concatenates the results into a full cohort VCF with bcftools concat (--output-type z --write-index=tbi). The callsets from both subworkflows are accompanied by variant-level quality summaries from bcftools stats. This stage therefore outputs two harmonized multi-sample VCFs, one from GATK4 and one from bcftools. Additional summary statistics, such as transition/transversion ratios, can be used to judge the quality of (and improvement in) the called variants and whether further rounds of BQSR could be beneficial. Finally, in the subworkflow intersect_variants, the two callsets are intersected with bcftools isec to retain only sites called by both subworkflows, filtered with vcftools (via the parameters --include_scaffolds or --exclude_scaffolds), e.g. to exclude mitochondrial or gonosomal scaffolds or to retain only autosomal scaffolds, and thinned with vcftools --thin (--remove-filtered-all --remove-indels --maf 0.025 --recode --recode-INFO-all --max-missing 0.75), producing the final variant set.
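The bcftools branch and the final intersection/thinning might look roughly like this; all file names are placeholders, and the thinning distance is an illustrative value not stated in this document:

```shell
# Per-scaffold calling, concatenation, intersection, and thinning.
bcftools mpileup -f genome.fa --output-type z -d 100 sample.recal.bam \
    | bcftools call --output-type z -m -v --write-index=tbi \
        -o scaffold1.vcf.gz
bcftools concat --output-type z --write-index=tbi \
    -o bcftools_cohort.vcf.gz scaffold1.vcf.gz scaffold2.vcf.gz
bcftools isec -p isec_dir gatk_cohort.vcf.gz bcftools_cohort.vcf.gz
vcftools --gzvcf shared_sites.vcf.gz \
    --thin 1000 \
    --remove-filtered-all --remove-indels --maf 0.025 \
    --max-missing 0.75 --recode --recode-INFO-all --out final_set
```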
Output files
- variant_calling/bcftools/: directory containing the called variants as VCF and TBI files per interval, the individual BAM files per sample (in the subdirectory bam/), and the merged VCF and TBI files (in the subdirectory merged/).
- variant_calling/gatk/: directory containing the called variants for each individual as VCF and TBI files per interval, the merged VCF and TBI files (in the subdirectory merged/), and a text file with variant statistics (in the subdirectory stats/).
Relatedness Estimation Section
The fourth section, relatedness estimation, uses the final variant set to produce robust estimates of pairwise relatedness suitable for low-coverage whole-genome sequencing data. It infers relatedness and inbreeding from genotype likelihood data using NgsRelate v2 (from the ANGSD ecosystem). The final output is a pairwise relatedness matrix.
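A VCF-based NgsRelate v2 invocation can be sketched as below; the file names are placeholders, and the pipeline's module may pass additional options:

```shell
# Estimate pairwise relatedness from genotype likelihoods in a VCF.
ngsRelate -h final_set.vcf.gz -O relatedness.tsv
```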
Output files
- relatedness_estimation/angsd_ngsrelate/: directory containing the tab-delimited file with the pairwise relatedness matrix of the analyzed samples.
- relatedness_estimation/excluded/: directory containing the VCF file filtered to include only variants on the scaffolds defined to be included.
- relatedness_estimation/include_contigs/: directory containing the BED file with the scaffolds defining which variants to use for relatedness estimation.
- relatedness_estimation/intersection/: directory containing the VCF files produced by bcftools isec, a README.txt explaining the content of the four different VCF files, and a text file with the sites.
- relatedness_estimation/thinned/: directory containing the VCF file with the filtered variant data used as input for the relatedness estimation.
MultiQC
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results from the preprocessing and bootstrapping section are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastP. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info. An excellent overview of MultiQC outputs and how to interpret the various plots and values is provided in the nf-core/eager pipeline.
Output files
- multiqc/multiqc_report.html: a standalone HTML file, which includes the most important overview statistics and can be viewed in your web browser.
- multiqc/multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc/multiqc_plots/: directory containing static images from the report in various formats.
Pipeline information
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email/--email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.