nf-core/genomeassembler
Introduction
This document describes the output produced by the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- Read preparation
- Assembly, choice between assemblers
- Polishing
- Scaffolding
- Annotation liftover
- Quality control
- Reporting
Output structure
Annotation and quality control are performed at several stages of the pipeline. The output is organized by subworkflow, corresponding to the steps listed above.
Read preparation
ONT reads
If the basecalls are scattered across multiple files, collect can be used to gather them into a single file.
porechop is a tool that identifies and trims adapter sequences from ONT reads.
nanoq generates descriptive statistics of the nanopore reads.
genomescope estimates genome size and ploidy from the k-mer spectrum computed by jellyfish.
Output files
- ont_reads/
  - collect/: single fastq.gz files per sample
  - porechop/: output from porechop, fastq.gz
  - nanoq/: output from nanoq
  - genomescope/: output from jellyfish and genomescope
    - jellyfish/
      - count/
        - <SampleName>/: output from jellyfish count
      - stats/
        - <SampleName>/: output from jellyfish stats
      - histo/
        - <SampleName>/: output from jellyfish histogram
      - dump/
        - <SampleName>/: output from jellyfish dump
    - genomescope/
      - <SampleName>/: genomescope plots
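The histogram produced by jellyfish is a plain two-column text file: k-mer multiplicity, then the number of distinct k-mers with that multiplicity. As a minimal sketch of how such a file could be inspected outside the pipeline, the snippet below parses histogram text and reports the most frequent multiplicity above a cutoff, which is a rough proxy for k-mer coverage. The function names, the cutoff value, and the example numbers are illustrative assumptions and are not part of the pipeline.

```python
def parse_histo(text):
    """Parse jellyfish histogram text into (multiplicity, count) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def coverage_peak(rows, min_multiplicity=5):
    """Return the multiplicity with the most distinct k-mers.

    Multiplicities below min_multiplicity are ignored because they are
    usually dominated by sequencing errors. The default of 5 is an
    arbitrary illustrative choice, not a pipeline setting.
    """
    candidates = [row for row in rows if row[0] >= min_multiplicity]
    if not candidates:
        return None
    return max(candidates, key=lambda row: row[1])[0]


# Made-up histogram: an error peak at multiplicity 1 and a main peak at 21.
example = "1 900000\n2 50000\n5 1000\n20 8000\n21 9000\n22 8500\n"
print(coverage_peak(parse_histo(example)))
```

genomescope fits a full model to the same spectrum, so its estimates should be preferred over this simple peak search.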
HiFi reads
lima trims adapters from PacBio HiFi reads.
Output files
- hifi_reads/
  - lima/: hifi reads after adapter removal with lima
Short reads
TrimGalore! can remove adapters from Illumina short reads. meryl calculates the k-mer spectrum of short reads.
Output files
- short_reads/
  - trimgalore/
    - <SampleName>_val_1.fq.gz: Trimmed forward reads
    - <SampleName>_val_2.fq.gz: Trimmed reverse reads (if included)
    - <SampleName>_1.fastq.gz.trimming_report.txt: Trimming report forward
    - <SampleName>_2.fastq.gz.trimming_report.txt: Trimming report reverse (if included)
  - meryl/: output from meryl
    - count/: k-mer counts per file
    - unionsum/: union of k-mer counts per sample
Assembly
This folder contains the initial assemblies of the provided reads.
Depending on the assembly strategy chosen, different assemblers are used.
flye performs assembly of ONT reads.
hifiasm performs assembly of HiFi reads, or of combinations of HiFi reads and ONT reads in --ul mode.
ragtag performs scaffolding and can be used to scaffold assemblies of ONT reads onto assemblies of HiFi reads.
Output files
- assemble/
  - flye/: output from flye.
    - <SampleName>/
      - <SampleName>.assembly.fasta.gz: Assembly in gzipped fasta format
      - <SampleName>.assembly_graph.gfa.gz: Assembly graph in gzipped gfa format
      - <SampleName>.assembly_graph.gv.gz: Assembly graph in gzipped gv format
      - <SampleName>.assembly_info.txt: Information on the assembly
      - <SampleName>.flye.log: flye log file
      - <SampleName>.params.json: params used for running flye
  - hifiasm/: output from hifiasm. Contains one folder per sample.
    - <SampleName>
      - <SampleName>.asm.bp.p_ctg.fa.gz: gzipped fasta file of the primary contigs
      - <SampleName>.asm.bp.p_ctg.gfa: primary contigs in gfa format
      - <SampleName>.asm.bp.p_utg.gfa: processed unitigs in gfa format
      - <SampleName>.asm.bp.r_utg.gfa: raw unitigs in gfa format
      - <SampleName>.stderr.log: Any output from hifiasm to stderr
  - ragtag/: output from RagTag, only if 'flye_on_hifiasm' was used as the assembler. Contains one folder per sample.
    - <SampleName>
      - <SampleName>.assembly.fasta.gz_on_<SampleName>.asm.bp.p_ctg.fa.gz/
        - <SampleName>.assembly.fasta.gz_ragtag_<SampleName>.asm.bp.p_ctg.fa.gz.agp: Scaffolds in agp format
        - <SampleName>.assembly.fasta.gz_ragtag_<SampleName>.asm.bp.p_ctg.fa.gz.fasta: Scaffolds in fasta format
        - <SampleName>.assembly.fasta.gz_ragtag_<SampleName>.asm.bp.p_ctg.fa.gz.stats: Scaffolding statistics.
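For a quick look at one of the gzipped assemblies listed above (for example <SampleName>.assembly.fasta.gz) without waiting for the QC step, contig lengths and N50 can be computed with the standard library alone. This is a minimal sketch; the function names are assumptions made for illustration, and QUAST remains the pipeline's own source for these statistics.

```python
import gzip


def contig_lengths(fasta_gz_path):
    """Return the length of every sequence in a gzipped fasta file."""
    lengths = []
    current = 0
    seen_header = False
    with gzip.open(fasta_gz_path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if seen_header:
                    lengths.append(current)  # close the previous record
                seen_header = True
                current = 0
            elif seen_header:
                current += len(line)
    if seen_header:
        lengths.append(current)  # close the last record
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

A call such as n50(contig_lengths("sample1.assembly.fasta.gz")) would then return the N50 in base pairs; the file name here is a placeholder.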
Polishing
Polishing can be used to correct errors in the assembly. This pipeline supports two polishing tools. medaka polishes assemblies using the ONT reads that were used for assembly. pilon polishes any type of assembly using short-reads.
Output files
- polish/
  - pilon/: output from pilon
  - medaka/: output from medaka
Scaffolding
The (polished) assembly can be scaffolded using different tools.
links performs scaffolding of the assembly using long reads.
longstitch performs correction via Tigmint and scaffolding using long reads via ntLink and ARKS.
Output files
- scaffold/
  - links/: output from links
    - <SampleName>/
      - <SampleName>_links.gv: scaffolding graph
      - <SampleName>_links.log: log file
      - <SampleName>_links.scaffolds: scaffold statistics
      - <SampleName>_links.scaffolds.fa: scaffold fasta
  - longstitch/: output from longstitch
    - <SampleName>/
      - <SampleName>_tigmint-ntLinks.arks.longstitch-scaffolds.fa: Scaffolds after scaffolding with tigmint, ntLinks, and arks
      - <SampleName>_tigmint-ntLinks.longstitch-scaffolds.fa: Scaffolds after scaffolding with tigmint and ntLinks
  - ragtag/: output from RagTag
    - <SampleName>/
      - <SampleName><suffix>_ragtag_<Reference>/
        - <SampleName><suffix>_ragtag_<Reference>.agp: agp file, scaffolding results
        - <SampleName><suffix>_ragtag_<Reference>.fasta: Scaffold fasta file
        - <SampleName><suffix>_ragtag_<Reference>.stats: Scaffolding statistics
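The agp files written by RagTag describe how contigs were placed into scaffolds. AGP is tab-separated: column 1 names the scaffold (object) and column 5 gives the component type, where N and U denote gaps and other values denote sequence components. The sketch below counts components and gaps per scaffold. The function name and the example records are made up for illustration.

```python
from collections import defaultdict


def summarize_agp(text):
    """Count sequence components and gaps per scaffold in AGP text."""
    summary = defaultdict(lambda: {"components": 0, "gaps": 0})
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split("\t")
        scaffold, component_type = fields[0], fields[4]
        if component_type in ("N", "U"):
            summary[scaffold]["gaps"] += 1
        else:
            summary[scaffold]["components"] += 1
    return dict(summary)


# Made-up example: two contigs joined by one gap of unknown size.
example_agp = (
    "## agp-version 2.1\n"
    "scaf1\t1\t100\t1\tW\tctg1\t1\t100\t+\n"
    "scaf1\t101\t200\t2\tU\t100\tscaffold\tyes\talign_genus\n"
    "scaf1\t201\t350\t3\tW\tctg2\t1\t150\t-\n"
)
print(summarize_agp(example_agp))
```

A scaffold with a single component and no gaps was carried over unchanged, which makes this a quick way to see how much the scaffolding step actually joined.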
Annotations
If a reference is provided, and annotation liftover is desired, the pipeline will lift-over annotations at each stage of the assembly. liftoff performs lift-over of annotations from a closely related species / individual.
Output files
- assemble/<SampleName> | polish/<tool>/<SampleName> | scaffold/<tool>/<SampleName>
  - liftoff/
    - <SampleName>.<suffix>_liftoff.gff: gff file produced by liftoff. Exact name depends on the stage of the pipeline.
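Because a gff file is produced at each stage, comparing feature counts between stages is a simple way to see whether polishing or scaffolding changed how many annotations could be lifted over. In GFF, the third tab-separated column holds the feature type. The following is a minimal sketch with a hypothetical function name and made-up records.

```python
from collections import Counter


def count_feature_types(gff_text):
    """Count features per type (column 3) in GFF text."""
    counts = Counter()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a complete feature line
        counts[fields[2]] += 1
    return counts


# Made-up example with two genes and one transcript.
example_gff = (
    "##gff-version 3\n"
    "chr1\tLiftoff\tgene\t1\t100\t.\t+\t.\tID=g1\n"
    "chr1\tLiftoff\tmRNA\t1\t100\t.\t+\t.\tID=m1;Parent=g1\n"
    "chr1\tLiftoff\tgene\t200\t300\t.\t-\t.\tID=g2\n"
)
print(count_feature_types(example_gff)["gene"])
```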
Quality control
All quality control files end up in QC. Below is the tree assuming that all steps of the pipeline were run.
For each step, three quality control tools can be run:
- QUAST provides assembly statistics (e.g. size, N50, etc.).
- BUSCO assesses genome quality based on the presence of lineage-specific single-copy orthologs.
- merqury compares the genome k-mer spectrum to the short-read k-mer spectrum to assess the base accuracy of the assembly.
Folder contents
- busco: BUSCO analysis of the assembly
  - <SampleName>/
    - <SampleName>-<Stage>-<BuscoLineage>-busco/: BUSCO output folder, please refer to BUSCO documentation for details.
    - <SampleName>-<Stage>-<BuscoLineage>-busco.batch_summary.txt: BUSCO batch summary output
    - short_summary.specific.<FastaFile>.{txt,json}: BUSCO short summaries in txt and json format
- quast: QUAST analysis of the assembly, per sample, contains:
  - <Sample Name>
    - map_to_ref and map_to_assembly: mapping of long reads to the reference and assembly respectively. map_to_ref is only performed once, during the first run of QUAST, typically in assemble
      - align/: Alignment of long reads to the genome
        - <FastaFile>.bam: Alignment of long reads to the genome
      - samtools/
        - <FastaFile>.bam.bai: bam index
        - <FastaFile>.bam.idxstats: samtools idxstats
        - <FastaFile>.bam.flagstat: samtools flagstats
        - <FastaFile>.bam.stats: samtools stats
    - <Sample Name>_<stage>/: QUAST results, cp. QUAST Docs
      - report.txt: summary table
      - report.tsv: tab-separated version, for parsing, or for spreadsheets (Google Docs, Excel, etc.)
      - report.tex: LaTeX version
      - report.pdf: PDF version, includes all tables and plots for some statistics
      - report.html: everything in an interactive HTML file
      - icarus.html: Icarus main menu with links to interactive viewers
      - contigs_reports/: [only if a reference genome is provided]
        - misassemblies_report: detailed report on misassemblies
        - unaligned_report: detailed report on unaligned and partially unaligned contigs
      - reads_stats/: [only if reads are provided]
        - reads_report: detailed report on mapped reads statistics
    - <Sample Name>_<stage_report.tsv>: QUAST summary report
- merqury: merqury analysis of the assembly
  - <SampleName>
    - <FastaFile>.<SampleName>.assembly.qv: QV of the assembly (per sequence)
    - <FastaFile>.<SampleName>.assembly.spectra-cn.fl.png: Copy Number plot, filled
    - <FastaFile>.<SampleName>.assembly.spectra-cn.ln.png: Copy Number plot, lines
    - <FastaFile>.<SampleName>.assembly.spectra-cn.st.png: Copy Number plot, semi-transparent
    - <FastaFile>.<SampleName>.assembly.spectra-cn.hist: Copy Number histogram file
    - <FastaFile>.completeness.stats: Assembly completeness statistics (overall)
    - <FastaFile>.qv: Assembly QV (overall)
    - <FastaFile>.spectra-asm.fl.png: Assembly k-mer spectrum, filled
    - <FastaFile>.spectra-asm.ln.png: Assembly k-mer spectrum, lines
    - <FastaFile>.spectra-asm.st.png: Assembly k-mer spectrum, semi-transparent
    - <FastaFile>.spectra-asm.hist: Assembly k-mer spectrum histogram file
    - <FastaFile>.dist_only.hist: Number of k-mers distinct to the assembly
    - <SampleName>.assembly_only.bed: bp errors in assembly (bed)
    - <SampleName>.assembly_only.wig: bp errors in assembly (wig)
    - <SampleName>.unionsum.hist.ploidy: ploidy estimates from short reads
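The report.tsv written by QUAST is the version intended for parsing. Its first line starts with Assembly followed by one column per assembly, and each later line holds a metric name followed by the values. A minimal sketch for loading it into a nested dictionary follows; the function name and the example values are made up, and values are kept as strings because some QUAST metrics are not plain numbers.

```python
def parse_quast_report(tsv_text):
    """Parse QUAST report.tsv text into {assembly: {metric: value}}."""
    lines = [line for line in tsv_text.splitlines() if line.strip()]
    if not lines:
        return {}
    assemblies = lines[0].split("\t")[1:]  # first cell is the label "Assembly"
    table = {name: {} for name in assemblies}
    for line in lines[1:]:
        fields = line.split("\t")
        metric = fields[0]
        for name, value in zip(assemblies, fields[1:]):
            table[name][metric] = value
    return table


# Made-up example with a single assembly column.
example_tsv = "Assembly\tsampleA\n# contigs\t12\nN50\t1500000\n"
print(parse_quast_report(example_tsv)["sampleA"]["N50"])
```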
Output folders
- QC/
  - assemble/: qc after the initial assembly
  - polish/
    - pilon/: qc after polishing with pilon
    - medaka/: qc after polishing with medaka
  - scaffold/: qc of scaffolding
    - links/: qc after scaffolding with links
    - longstitch/: qc after scaffolding with longstitch
    - ragtag/: qc after scaffolding with ragtag
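Since only the steps that were actually run leave a folder under QC, a short check of which stage folders exist shows at a glance what a given run covered. The sketch below uses the folder names from the tree above; the function name and the way the results directory is passed in are assumptions.

```python
from pathlib import Path

# Stage folders as listed in the QC tree of this document.
QC_STAGES = [
    "assemble",
    "polish/pilon",
    "polish/medaka",
    "scaffold/links",
    "scaffold/longstitch",
    "scaffold/ragtag",
]


def present_qc_stages(results_dir):
    """Return the QC stage folders that exist below <results_dir>/QC."""
    qc_root = Path(results_dir) / "QC"
    return [stage for stage in QC_STAGES if (qc_root / stage).is_dir()]
```

For a run that only assembled and polished with medaka, present_qc_stages("results") would return ["assemble", "polish/medaka"], where "results" is a placeholder for the top-level results directory.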
Report
The pipeline collects the quality control outputs into an html report. Below is the tree assuming that all steps of the pipeline were run:
Output files
- report/
  - busco_files/reports.tsv: Table containing aggregated BUSCO reports
  - quast_files/reports.tsv: Table containing aggregated QUAST reports
  - report.html: The report file
  - report_files/: Folder containing js and css. Required to properly display the .html file
Pipeline information
Output files
- pipeline_info/
  - Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
  - Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
  - Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
  - Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
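When troubleshooting, the trace file is often the quickest place to start. execution_trace.txt is tab-separated with a header row that includes a status column, so counting tasks per status takes only a few lines. The sketch below is illustrative: the function name is an assumption, and the task names in the example are made up rather than taken from this pipeline.

```python
import csv
import io
from collections import Counter


def count_task_status(trace_text):
    """Count tasks per status in the text of a Nextflow trace file."""
    reader = csv.DictReader(io.StringIO(trace_text), delimiter="\t")
    return Counter(row["status"] for row in reader)


# Made-up trace excerpt with a reduced set of columns.
example_trace = (
    "task_id\thash\tname\tstatus\n"
    "1\tab/123456\tSTEP_A (s1)\tCOMPLETED\n"
    "2\tcd/654321\tSTEP_B (s1)\tFAILED\n"
    "3\tef/111111\tSTEP_C (s1)\tCOMPLETED\n"
)
print(count_task_status(example_trace))
```

Filtering the same rows for a status of FAILED and printing the name and hash columns points directly at the work directories that need inspection.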