Question

Creating a "reference sequence" using velvet?

0

Entering edit mode

10.4 years ago

jt.bricker • 0

I'm pretty novice when it comes to bioinformatics. I am currently working on a project where I am trying to call SNPs from frog species which have not been completely sequenced. I have RNA from 10 different tissues among 5 closely related frog species.

I've tried aligning these sequences to a reference genome of a more distantly related frog that has been completely sequenced but the results are very poor (2-3% of the reads are mapped to the reference genome).

I thought perhaps I may be able to do a de novo assembly of the RNA sequences of each tissue and use the resulting contigs as a "reference genome" for the alignment.

Does anyone have any experience with this? Is this a reasonable way to call SNPs from data without a reference genome?

velvet Assembly SNP reference sequence • 2.9k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by jt.bricker • 0

Ram · Answer 1 · 2014-07-03

1

Entering edit mode

10.4 years ago

pierre.peterlongo ▴ 900

Hi there,

If you're curious and/or if you cannot achieve a confident enough assembly, I may suggest to try using discoSnp which detects SNPs from raw read sets without a reference genome.

Pros: needs no memory - fast - easy to use - 1 to n datasets - similar results than assembly+mapping approaches

Cons: snps are output with no genome location (as no reference is provided).

It has not been tested on RNA seq data yet. The ranking score, based on allele coverages will certainly not be efficient. However, we (authors of this work) would be happy to guide you using the tool and analyzing the results.

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.4 years ago by pierre.peterlongo ▴ 900

1

Entering edit mode

Thank you for the fantastic tool !

I am working on it currently and have a small doubt with the genotypes output file.

Please see it below:

GENOTYPES_SNP_38543_THRESHOLD_10 0 0 TAAAATATGAGAAATAGTTGTGTTGATATAA/GGGTAAAAGTGTAAGGTTTTTTGAAAATTTA
GENOTYPES_SNP_36100_THRESHOLD_10 -1 0 ACGTCCAAGATCAAGTCGTCACCAGGGATGT/GTGTTCACTGTTGAAGGCTTTGACTTCAGTG
GENOTYPES_SNP_27996_THRESHOLD_10 -1 1 ATCGAACAATTGAAACTGATGTTTCAAGTTC/TAGGCTAAGCAAGAGCGTTTTGCAACCATGA
GENOTYPES_SNP_22993_THRESHOLD_10 0 0 CTCCTTAGATGAAATCTCCTTAAATATACCC/GGGGCTTCCAGCTATTAATATGGAATACCAG
GENOTYPES_SNP_10511_THRESHOLD_10 2 0 AGAGGAGTTTACGGCCCAAGCATATTCTCGA/GGAAATAAACTCATGTTTACTCTCGAAAATG
GENOTYPES_SNP_42798_THRESHOLD_10 2 2 CCCGGATGCAAATCCGATAAGGGCCAGCACC/TGGTGCGTTAGACCTTGTTGATATAGTGAGG
GENOTYPES_SNP_39034_THRESHOLD_10 1 -1 CCATTTTAAGGCCATTTGAGCCAAATCCCCC/TATGAAAGGAAGTAAAGTTCGTAGCTTTACG
GENOTYPES_SNP_67935_THRESHOLD_10 -1 1 ACGGTCCTTATTAGCTGTAAGTGACTAATTA/CCCAATGCGAATCTCGGTAACTCATGAATTA
GENOTYPES_SNP_60582_THRESHOLD_10 -1 1 CAATCATCAAGTGACTCAGACTCCGAGTACT/GTTCCTACTGGTCATAAACAACCCATTCCTA

I am trying to compare 2 samples and ended up with a lot of combinations of genotype coverage values i.e., (2 2) / (2 1) / (2 0)/ (2 -1) / (1 -1) etc.. I understand that I can choose (1 -1) combination. Please let me know what other combinations can I choose?

Many Thanks,

Siva

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.3 years ago by sivawarwick ▴ 10

0

Entering edit mode

Hi there,

Thanks for this question Siva.

A value is associated to each SNP and to each sample (explaining why you have 2 values per SNP here while usng two samples).

You've indicated a coverage threshold T to the "genotyper".

Given one sample, for each allele of the SNP you have a coverage. Thus you end with two coverage values per SNP.

Code 0 indicates that both coverages are < T.
Code 2 indicates that both coverages are >= T
Code 1 indicates that only one of the two paths is >=T. Code -1 is symmetrical, depending on which of the two allele is >=T.

Choosing (1 -1) or (-1 1) is a way to conserve SNPs that are homozygous and distinct in the two samples. Choosing (2 2) is a way to conserve SNPs that are heterozygous in both samples,

...

Best, Pierre

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.3 years ago by pierre.peterlongo ▴ 900

score 0 · Answer 2 · 2014-07-01

Hello there,

I followed quite a similar approach for my project. This is a cool approach, but you need to be really confident of your assembly. And there's no objective way because N50 is not a great measure for transcriptome assembly.

Use Velvet/Oases or Trinity, then map back and check coverage. Also ensure you have a reasonable N50. Use GATK's pipeline and BaseRecalibrate/VariantRecalibrate until you see a AnalyzeCovariate run with a good score match to the empirical quality values. Even so, you cannot be a 100% sure.

Let me know if you have any specific questions. Cheers!