Question

SNP analysis with an assembly

19 months ago

luzglongoria ▴ 50

Hi there,

I am new to SNP analysis, so before starting anything I would like to check whether my pipeline is correct. What I have now: RNA-seq samples (.fq.gz) plus a Trinity assembly built from those reads.

My model organism does not have an assembled genome, which is why I need to use the Trinity assembly.

The steps would be as follows:

1) The idea is to use this Trinity assembly as a reference for the SNP analyses. For the mapping I would use Bowtie, since it is recommended for RNA samples and (as far as I know) supports RNA assemblies. I would get a .bam file as output.

2) Then I'd use the .bam files for variant calling. At this step I'm not sure which software to use: SAMtools, GATK, or FreeBayes.

3) I have read that at this step the SNPs need to be filtered on various criteria, such as read depth, mapping quality, and allele frequency, to remove potential false positives and low-quality variants. I'm not sure which software I need here. I'm mainly interested in allele frequency (in case there is specific software for that analysis).

4) I would also like to perform a population-level analysis with my several individuals (the same individuals sampled at different time points). Is it correct to use tools like PLINK, VCFtools, or ADMIXTURE for these analyses?

Any help is more than welcome.

Thank you so much in advance.

SNP Trinity Bowtie • 1.5k views

score 2 · Accepted Answer · 2023-06-14

You are using RNA-seq, so what you are assembling is the transcriptome, and you can use the resulting contigs as a reference for other RNA-seq data. Any variant detection you perform should therefore be consistent with best practices for variant detection from RNA-seq (https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-). Note that the recommended aligner there ("STAR 2-PASS") is not really available to you, as it is designed for spliced alignment against a genome, so you can replace it with any aligner that takes a single FASTA (I believe kallisto just uses transcript FASTAs, although as a pseudoaligner it may not give you the base-level alignments a variant caller needs).

Your settings may depend on the ploidy of your organism, and you should pay close attention to genotype likelihoods. In fact, it is preferable to use a "likelihood-aware" or "dosage-aware" method for downstream population genetics, rather than treating every called genotype as certain.
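To make the "dosage-aware" idea a bit more concrete, here is a minimal Python sketch that turns the Phred-scaled genotype likelihoods (the PL field of a VCF) into an expected alternate-allele dosage instead of a hard genotype call. It assumes a diploid organism, a biallelic site, PL values ordered 0/0, 0/1, 1/1, and a flat prior over genotypes; the function name `pl_to_dosage` is just illustrative and not part of any tool.

```python
def pl_to_dosage(pl):
    """Expected alt-allele dosage (0..2) from diploid, biallelic PL values.

    Assumes pl = [PL(0/0), PL(0/1), PL(1/1)] and a flat genotype prior.
    """
    # PL is -10 * log10(likelihood), scaled so the best genotype is 0.
    likelihoods = [10 ** (-p / 10) for p in pl]
    total = sum(likelihoods)
    probs = [l / total for l in likelihoods]
    # Dosage = P(het) * 1 + P(hom alt) * 2
    return probs[1] + 2 * probs[2]


# A confident hom-ref call gives a dosage near 0; an uninformative site
# (all PLs equal) gives 1.0, which a hard call would hide.
confident_ref = pl_to_dosage([0, 30, 60])
uninformative = pl_to_dosage([0, 0, 0])
```

Tools that work from likelihoods do something along these lines internally, usually with a more careful prior than the flat one assumed here.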

score 2 · Accepted Answer · 2023-06-14

Once you have your .bam files, you can use samtools to filter reads by mapping quality score (for example >20), and then sort and index your BAM files before running your data through your chosen GATK pipeline.

Examples of samtools commands are all over Biostars, for example in the answers here.
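As a rough sketch of that filter, sort, and index sequence, the Python below only builds the samtools command lines so you can inspect them before running anything. The file names and the helper names (`samtools_prep_commands`, `run_all`) are made up for illustration. Note that `samtools view -q 20` keeps reads with MAPQ of 20 or more, so adjust the number if you want a strict ">20".

```python
import subprocess


def samtools_prep_commands(in_bam, prefix, min_mapq=20):
    """Return samtools commands that filter by MAPQ, then sort and index."""
    filtered = f"{prefix}.q{min_mapq}.bam"
    sorted_bam = f"{prefix}.q{min_mapq}.sorted.bam"
    return [
        # -b: write BAM, -q: drop reads with MAPQ below min_mapq
        ["samtools", "view", "-b", "-q", str(min_mapq), "-o", filtered, in_bam],
        # coordinate-sort the filtered reads
        ["samtools", "sort", "-o", sorted_bam, filtered],
        # write the index next to the sorted BAM
        ["samtools", "index", sorted_bam],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


commands = samtools_prep_commands("sample1.bam", "sample1")
# run_all(commands)  # uncomment once samtools is installed and on your PATH
```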

The GATK pipeline consists of a number of steps, and I believe they have an RNA variant detection pipeline. More specifically, the SNP discovery steps usually consist of HaplotypeCaller (this step takes in your BAM files), GenomicsDBImport, GenotypeGVCFs (you might need GatherVcfs) and SelectVariants. The GATK website has a lot of really good documentation with examples.
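For orientation, here is a hedged Python sketch that lays out those steps as GATK4 command lines in order, without running them. The paths, the `gatk_out` directory, the `contigs.list` interval file and the function name are all placeholders I made up; check the options against the GATK documentation for your version. It also assumes the reference FASTA already has its .fai and .dict files and that the BAMs carry read groups, which GATK requires.

```python
def gatk_snp_commands(reference, sample_bams, workdir="gatk_out",
                      intervals="contigs.list"):
    """Build the per-sample and joint-calling GATK4 commands in order.

    sample_bams maps a sample name to its sorted, indexed BAM file.
    """
    commands = []
    gvcfs = []
    for sample, bam in sample_bams.items():
        gvcf = f"{workdir}/{sample}.g.vcf.gz"
        gvcfs.append(gvcf)
        # One GVCF per sample
        commands.append(["gatk", "HaplotypeCaller", "-R", reference,
                         "-I", bam, "-O", gvcf, "-ERC", "GVCF"])

    # Collect all GVCFs into one GenomicsDB workspace (needs -L intervals)
    db = f"{workdir}/db"
    import_cmd = ["gatk", "GenomicsDBImport",
                  "--genomicsdb-workspace-path", db, "-L", intervals]
    for gvcf in gvcfs:
        import_cmd += ["-V", gvcf]
    commands.append(import_cmd)

    # Joint genotyping across samples
    joint = f"{workdir}/joint.vcf.gz"
    commands.append(["gatk", "GenotypeGVCFs", "-R", reference,
                     "-V", f"gendb://{db}", "-O", joint])

    # Keep SNPs only
    snps = f"{workdir}/snps.vcf.gz"
    commands.append(["gatk", "SelectVariants", "-R", reference, "-V", joint,
                     "--select-type-to-include", "SNP", "-O", snps])
    return commands


for cmd in gatk_snp_commands("Trinity.fasta", {"s1": "s1.bam", "s2": "s2.bam"}):
    print(" ".join(cmd))
```

With a Trinity assembly the interval list can get very long (one entry per contig), so it is worth testing on a small subset of contigs first.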

Filtering-wise, there is a hard-filtering step that you can use in GATK with recommended settings, and afterwards vcftools can be used for further filtering if required, for example on depth. ADMIXTURE is an analysis tool for examining population structure, and I think it takes in .ped/.map or .bed files. You can convert a VCF file to these using plink or vcftools.
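Since the question was mainly about allele frequency, here is a small Python sketch of what a depth filter plus a per-sample alternate-allele fraction looks like on a single VCF record. It assumes a biallelic site with an AD (allelic depth) entry in the FORMAT column; the function name, the example record and the depth cutoff of 10 are illustrative choices, not recommended settings. Keep in mind that with RNA-seq the read fraction also reflects allele-specific expression, not only genotype.

```python
def alt_fraction_per_sample(vcf_line, min_depth=10):
    """Per-sample alt-allele read fraction from the AD field of one VCF line.

    Returns None for samples with missing AD or fewer than min_depth reads.
    Assumes a biallelic site (AD = ref_count,alt_count).
    """
    fields = vcf_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    ad_index = format_keys.index("AD")

    fractions = []
    for sample in fields[9:]:
        values = sample.split(":")
        if ad_index >= len(values) or "." in values[ad_index].split(","):
            fractions.append(None)  # missing genotype data
            continue
        ref_count, alt_count = (int(x) for x in values[ad_index].split(",")[:2])
        depth = ref_count + alt_count
        fractions.append(alt_count / depth if depth >= min_depth else None)
    return fractions


record = "\t".join([
    "TRINITY_DN1_c0_g1_i1", "100", ".", "A", "G", "50", "PASS", ".",
    "GT:AD:DP", "0/1:12,8:20", "0/0:3,0:3", "./.:.:.",
])
fractions = alt_fraction_per_sample(record)
```

For the same individuals sampled at different time points, comparing these per-sample fractions at each site is one simple way to see whether anything changes over time.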