Question

Processing fastq files for genome assembly

0

Entering edit mode

15 months ago

Neil ▴ 20

Hello. I had sequencing project conducted with the AmpliSeq for Illumina SARS-CoV-2 Research Panel. For further analysis I have several questions.

I have several questions:

Do I need fastq preprocessing? Fastq analysis looks good to me, but I might be wrong.
I have 4 forward and 4 reverse fastq files for one sample. For alignment, should I concatenate the 4 forward files into 1 and the 4 reverse files into 1 to get 2 fastq files? Is this correct?
After this step, I have to obtain an assembled fasta file of the sample. To do this, should I align the two concatenated fastq files against the reference and obtain a fasta file? I believe that I might get these from bam files, but I am not certain. If this is correct, how can I get this assembled fasta file?
Is there a way to get information about amino acid substitution from VCF file?

I would greatly appreciate any help.

fastq vcf assembly amino-acid • 2.5k views

ADD COMMENT • link updated 15 months ago by Brian Bushnell 20k • written 15 months ago by Neil ▴ 20

1

Entering edit mode

These are quite broad concepts that would be hard to answer specifically. I suggest you consult to a bioinformatician near you who can work on your project.

ADD REPLY • link 15 months ago by barslmn ★ 2.4k

score 3 · Answer 1 · 2024-03-09

I have general guidelines for preprocessing Illumina data prior to assembly. But for SARS-CoV-2 specifically, I had to make a different pipeline because the data was very nuanced, with amplification-specific artifacts. You can download BBTools and look in /bbmap/pipelines/covid/processCorona.sh, but I'll post it here for convenience. The script is a bit customized for UCSF's conventions of storing/naming files but the general process should be applicable to any artic3-amplified Illumina samples.

#!/bin/bash

##  Written by Brian Bushnell
##  Last modified April 11, 2020
##  Description:  Calls SARS-CoV-2 variants from Illumina amplicon data.
##                This script assumes input data is paired-end.

##  Usage:  processCorona.sh <prefix>
##  For example, "processCorona.sh Sample1" if the data is in Sample1.fq.gz

##  Grab the sample name from the command line.
NAME="$1"

##  Set minimum coverage for genotype calls.
##  Areas below this depth will be set to N in the consensus genome.
MINCOV=5

##  Specify the viral reference file.
##  NC_045512.fasta contains the SARS-CoV-2 genome, equivalent to bbmap/resources/Covid19_ref.fa
REF="NC_045512.fasta"

##  PCR primers
##  artic3 primers are in /bbmap/resources/artic3.fasta
PRIMERS="artic3.fasta"

##  This line is in case the script is being re-run, to clear the old output.
rm "$NAME"*.sam.gz "$NAME"*.bam "$NAME"*.bai "$NAME"*.txt "$NAME"*.fa "$NAME"*.vcf

##  If data is paired in twin files, interleave it into a single file.
##  Otherwise, skip this step.
##  In this case, the files are assumed to be named "Sample1_R1.fq.gz" and "Sample1_R2.fq.gz"
#reformat.sh in="$NAME"_R#.fq.gz out="$NAME".fq.gz

##  Split into Covid and non-Covid reads if this has not already been done.
##  This step can be skipped if non-Covid was already removed.
bbduk.sh ow -Xmx1g in="$NAME".fq.gz ref="$REF" outm="$NAME"_viral.fq.gz outu="$NAME"_nonviral.fq.gz k=25

##  Recalibrate quality scores prior to any trimming.
##  *** Requires a recalibration matrix in the working directory (see recal.sh for details). ***
##  This step is optional but useful for Illumina binned quality scores.
bbduk.sh in="$NAME"_viral.fq.gz out="$NAME"_recal.fq.gz recalibrate -Xmx1g ow

##  Discover adapter sequence for this library based on read overlap.
##  You can examine the adapters output file afterward if desired;
##  If there were too few short-insert pairs this step will fail (and you can just use the default Illumina adapters).
bbmerge.sh in="$NAME"_recal.fq.gz outa="$NAME"_adapters.fa ow reads=1m

##  Remove duplicates by sequence similarity.
##  This is more memory-efficient than dedupebymapping.
clumpify.sh in="$NAME"_recal.fq.gz out="$NAME"_clumped.fq.gz zl=9 dedupe s=2 passes=4 -Xmx31g

##  Perform adapter-trimming on the reads.
##  Also do quality trimming and filtering.
##  If desired, also do primer-trimming here by adding, e.g., 'ftl=20' to to trim the leftmost 20 bases.
##  If the prior adapter-detection step failed, use "ref=adapters"
bbduk.sh in="$NAME"_clumped.fq.gz out="$NAME"_trimmed.fq.gz minlen=60 ktrim=r k=21 mink=9 hdist=2 hdist2=1 ref="$NAME"_adapters.fa altref=adapters maq=14 qtrim=r trimq=10 maxns=0 tbo tpe ow -Xmx1g ftm=5

##  Trim artic3 primers, if present.
##  Disable this line if you are not amplifying with primers.
bbduk.sh in="$NAME"_trimmed.fq.gz out="$NAME"_trimmed2.fq.gz ref="$PRIMERS" ktrim=l restrictleft=30 k=22 hdist=3 qhdist=1 rcomp=f mm=f

##  Align reads to the reference.
##  Local flag is due to primer-amplification-induced anomalies at read ends;
##  for randomly-sheared, unamplified data, "local" should be omitted.
bbmap.sh ref="$REF" in="$NAME"_trimmed2.fq.gz outm="$NAME"_mapped.sam.gz nodisk local maxindel=500 -Xmx4g ow k=12

##  Deduplicate based on mapping coordinates.
##  Note that if you use single-ended amplicon data, you will lose most of your data here.
dedupebymapping.sh in="$NAME"_mapped.sam.gz out="$NAME"_deduped.sam.gz -Xmx31g ow

##  Remove junk reads with unsupported unique deletions; these are often chimeric.
filtersam.sh ref="$REF" ow in="$NAME"_deduped.sam.gz out="$NAME"_filtered.sam.gz mbad=1 del sub=f mbv=0 -Xmx4g

##  Remove junk reads with multiple unsupported unique substitutions; these are often junk, particularly on Novaseq.
##  This step is not essential but reduces noise.
filtersam.sh ref="$REF" ow in="$NAME"_filtered.sam.gz out="$NAME"_filtered2.sam.gz mbad=1 sub mbv=2 -Xmx4g

##  Trim soft-clipped bases.
bbduk.sh in="$NAME"_filtered2.sam.gz trimclip out="$NAME"_trimclip.sam.gz -Xmx1g ow

##  Call variants from the sam files.
##  The usebias=f/minstrandratio=0 flags are necessary due to amplicon-induced strand bias,
##  and should be removed if the data is exclusively shotgun/metagenomic or otherwise randomly fragmented.
callvariants.sh in="$NAME"_trimclip.sam.gz ref="$REF" out="$NAME"_vars.vcf -Xmx4g ow strandedcov usebias=f minstrandratio=0 maf=0.6 minreads="$MINCOV" mincov="$MINCOV" minedistmax=30 minedist=16 flagnearby

##  Calculate reduced coverage as per CallVariants defaults (ignoring outermost 5bp of reads).
pileup.sh in="$NAME"_trimclip.sam.gz basecov="$NAME"_basecov_border5.txt -Xmx4g ow border=5

##  Generate a mutant reference by applying the detected variants to the reference.
##  This is essentially the reference-guided assembly of the strain.
##  Also changes anything below depth MINCOV to N (via the mindepth flag).
##  Does not apply indels below MINCOV
applyvariants.sh in="$REF" out="$NAME"_genome.fa vcf="$NAME"_vars.vcf basecov="$NAME"_basecov_border5.txt ow mindepth="$MINCOV"

##  Make bam/bai files; requires samtools to be installed.
##  This step is only necessary for visualization, not variant-calling.
#samtools view -bShu "$NAME"_trimclip.sam.gz | samtools sort -m 2G -@ 3 - -o "$NAME"_sorted.bam
#samtools index "$NAME"_sorted.bam

##  At this point, "$NAME"_sorted.bam, "$NAME"_sorted.bam.bai, "$REF", and "$NAME"_vars.vcf can be used for visualization in IGV.

score 2 · Answer 2 · 2024-03-07

2

Entering edit mode

15 months ago

Michael 55k

with the AmpliSeq for Illumina SARS-CoV-2 Research Panel

If you want to submit for inclusion in a certain panel or database, I would look out for their analysis guidelines and standardized workflows they might have set up and follow these closely.

Do I need fastq preprocessing? Fastq analysis looks good to me, but I might be wrong.

So you did a QC and it looked good without adapters and high quality? In principle, you don't need to do trimming and there are few adapters in modern Illumina data. However, I do it anyway to avoid having to do everything out of fear of an "em, actually,... you should have..." type of reviewer. As it is the first step of the analysis, you would have to run everything else again. But no one can complain about doing read trimming when it wasn't necessary. I'd run them through fastp, as little effort as necessary.

I have 4 forward and 4 reverse fastq files for one sample. For alignment, should I concatenate the 4 forward files into 1 and the 4 reverse files into 1 to get 2 fastq files? Is this correct?

I'd try to avoid this at all costs for several reasons and rather merge the BAM files afterward. This is better for QC and keeping all the files properly paired.

After this step, I have to obtain an assembled fasta file of the sample. To do this, should I align the two concatenated fastq files against the reference and obtain a fasta file? I believe that I might get these from bam files, but I am not certain. If this is correct, how can I get this assembled fasta file?

It depends on your research question. You can either do a de novo assembly or an alignment against a reference with variant calling. What do you need? These two options will produce different results and have different strengths and abilities to detect different kinds of variation. If you need a genome in fasta format, don't confuse a genome assembly with an ALT-genome (genome generated by inserting all alternative alleles into the reference). They are not comparable.

Is there a way to get information about amino acid substitution from VCF file?

Yes, by using tools like SnpEff with a reference annotation.

ADD COMMENT • link 15 months ago by Michael 55k

1

Entering edit mode

I'd try to avoid this at all costs for several reasons and rather merge the BAM files afterward.

Trying to understand what those reasons are. Lane specific file output is a by-product of how the data is post-processed by Illumina software. A single file per sample can be produced by the same software with an appropriate option. It is also possible that some workflows expect a single pair of files per sample and thus may require concatenation of lane specific pieces. Keeping the order of files identical when concatenating is critical as you mention.

ADD REPLY • link 15 months ago by GenoMax 151k

1

Entering edit mode

There are several theoretical reasons, even though I haven't encountered any of the problems and all would be recoverable if you keep the raw data:

Preconditions: Are these technical replicates and are you not interested in any specific variation? Then it should be ok in principle
Robustness: If one file is broken (truncated, not in proper order), the merged file will be too.
QC: IF one of the runs had low alignment or pairing rate, you might not notice at all because it averages out. You have a harder time to find out which file is the reason if you discover a problem anyway.
Redundancy: Due to reasons 2,3 you have to keep the original files anyway, doubling your storage requirement at this step.
It's not necessary, because you can merge at the next step.

ADD REPLY • link 15 months ago by Michael 55k

1

Entering edit mode

I should clarify that I was referring to concatenating lane specific files across a single flowcell. Good thing you mention the reasons are "theoretical". A single large pool is generally run across four S4 FC lanes in cores for cost efficiency. Having looked hundreds of such runs over the last decade I have never seen any major lane specific issues. There is generally some variation in read numbers across "lanes" but the data itself has never been affected.

If one has the resources available then concatenating the files is not going to cause any obvious issues and may simplify file handling.

ADD REPLY • link 15 months ago by GenoMax 151k

0

Entering edit mode

It depends on your research question. You can either do a de novo assembly or an alignment against a reference with variant calling. What do you need? These two options will produce different results and have different strengths and abilities to detect different kinds of variation. If you need a genome in fasta format, don't confuse a genome assembly with an ALT-genome (genome generated by inserting all alternative alleles into the reference). They are not comparable.

I definitely need an alignment with variant calling. I have one reference genome and two different samples (actually are strains) aligned on this reference. How can I get the latest two as fasta files after variant calling?

ADD REPLY • link 15 months ago by Neil ▴ 20

1

Entering edit mode

What do you need the fasta file for exactly? I am asking because you are going to lose a lot of information when going from VCF to fasta. Most analyses, like variant effect prediction, can be better done on the VCF file.

ADD REPLY • link 15 months ago by Michael 55k

0

Entering edit mode

These fasta files will be used in the Pango nomenclature: https://cov-lineages.org/resources/pangolin.html

ADD REPLY • link 15 months ago by Neil ▴ 20

1

Entering edit mode

You could use standard NextStrain pipeline for all of this: https://nextstrain.org/sars-cov-2/ specifically https://docs.nextstrain.org/projects/ncov/en/latest/

ADD REPLY • link 15 months ago by GenoMax 151k

0

Entering edit mode

Thank you for help! But my advisor told me that I MUST use https://cov-lineages.org/resources/pangolin.html. Can you help me with extracting fasta file from VCF after variant calling?

ADD REPLY • link 15 months ago by Neil ▴ 20

1

Entering edit mode

Looks like Nextclade can be used to create pango lineages: https://docs.nextstrain.org/projects/nextclade/en/stable/user/algorithm/nextclade-pango.html

You can use this tutorial to generate consensus fasta after variant calling from your alignment files: Generating consensus sequence from bam file

ADD REPLY • link 15 months ago by GenoMax 151k