Question

How to work with SNPs when coverage is low

5.0 years ago

luzglongoria ▴ 50

Hi there,

I am working with RNA-seq data from an organism (Plasmodium) that does not have a reference genome. What is readily available is the genome of a closely related species, so I will use the reads and that reference genome to determine the SNPs and differences between the two species. The problem is that, since I am working with RNA-seq, not all genes will be expressed, and in many cases some genes (like the one in the middle of the picture) will get zero SNPs simply because they are not expressed or their coverage is too low. If such genes accumulate, it might look like there are no SNPs in them, when in fact I just don't know.

[picture]

What is the proper way of dealing with this issue? Maybe choosing a threshold? And in that case, how do I decide which one?

Thank you so much in advance

RNA-Seq SNP coverage • 1.3k views

ADD COMMENT • link 5.0 years ago by luzglongoria ▴ 50

You can downsample your dataset and determine the precision and recall of variant calling by comparing the downsampled calls against those from the full-coverage dataset.
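As a rough illustration of that comparison, here is a minimal Python sketch. It assumes you already have two sets of calls (one from the full dataset, one from a downsampled copy) and treats the full-coverage calls as the "truth". The function names `load_variants` and `precision_recall` and the toy records are made up for this example; a dedicated VCF comparison tool would handle normalization and genotypes more carefully.

```python
def load_variants(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys from tab-separated VCF text lines."""
    variants = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        # Multi-allelic sites list several ALT alleles separated by commas.
        for allele in alt.split(","):
            variants.add((chrom, int(pos), ref, allele))
    return variants


def precision_recall(truth, called):
    """Precision and recall of `called` relative to `truth` (both are sets)."""
    true_positives = len(truth & called)
    precision = true_positives / len(called) if called else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy records standing in for the full-coverage and downsampled VCF files.
full_vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr1\t250\t.\tC\tT\t.\t.\t.",
    "chr2\t40\t.\tG\tA,C\t.\t.\t.",
]
downsampled_vcf = [
    "chr1\t100\t.\tA\tG\t.\t.\t.",
    "chr2\t40\t.\tG\tA\t.\t.\t.",
    "chr2\t90\t.\tT\tC\t.\t.\t.",
]

truth = load_variants(full_vcf)
called = load_variants(downsampled_vcf)
precision, recall = precision_recall(truth, called)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With real data you could pass open file handles instead of lists, since `load_variants` only iterates over lines.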

I wouldn't try to approximate the number of SNVs you miss in the low-coverage regions: this lower coverage may correlate with DNA accessibility, and it is well known that there is a correlation between DNase-accessible regions and the number of SNVs observed there.

You may also want to check out this tool: https://academic.oup.com/gigascience/article/8/9/giz100/5559527

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

Thank you so much for your response. I have already done some analysis and got a .vcf file with the variant calls. Should I compare this data with a full-coverage dataset? And how can I do that?

ADD REPLY • link 5.0 years ago by luzglongoria ▴ 50

No, in theory you should prepare a VCF with your initial dataset, then downsample it (e.g. 10%), then check whether you can still retrieve the same SNVs, and stop when you see that your coverage is no longer enough. This will be your limit, and you'll have to discard all the regions from the initial dataset that are covered less than this value. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4296149/ - I guess they have it described here (but I am not sure)
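To make the last step concrete, here is a small Python sketch of how the limit could be picked and applied. Everything in it is an assumption for illustration: the recall values per depth would come from your own downsampling runs, the 0.9 target is arbitrary, and the names `min_reliable_depth` and `split_by_coverage` are invented here.

```python
def min_reliable_depth(recall_by_depth, target_recall=0.9):
    """Lowest depth at which recall still meets the target.

    Walks from the highest depth downwards and stops at the first depth
    whose recall drops below the target. Returns None if even the highest
    depth fails.
    """
    limit = None
    for depth in sorted(recall_by_depth, reverse=True):
        if recall_by_depth[depth] < target_recall:
            break
        limit = depth
    return limit


def split_by_coverage(region_depth, min_depth):
    """Separate regions with enough coverage from those with unknown SNP status."""
    callable_regions = {r for r, d in region_depth.items() if d >= min_depth}
    unknown_regions = set(region_depth) - callable_regions
    return callable_regions, unknown_regions


# Hypothetical recall of the full-dataset SNVs at each mean depth after downsampling.
recall_by_depth = {40: 0.98, 20: 0.95, 10: 0.91, 5: 0.70, 2: 0.30}
# Hypothetical mean depth per gene in the initial dataset.
region_depth = {"geneA": 35.0, "geneB": 0.0, "geneC": 12.5, "geneD": 4.0}

limit = min_reliable_depth(recall_by_depth)
if limit is None:
    # No tested depth was good enough, so no region can be trusted.
    callable_regions, unknown_regions = set(), set(region_depth)
else:
    callable_regions, unknown_regions = split_by_coverage(region_depth, limit)

print("depth limit:", limit)
print("callable:", sorted(callable_regions))
print("unknown:", sorted(unknown_regions))
```

Genes in the "unknown" group are the ones to report as "no data" rather than "zero SNPs", which is the distinction the original question is about.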

ADD REPLY • link 5.0 years ago by German.M.Demidov ★ 2.9k

Thank you so much. I think the paper you sent about the subSeq R package can help me :)

ADD REPLY • link 5.0 years ago by luzglongoria ▴ 50

Is it not P. falciparum? Have you done a de novo transcriptome assembly?

ADD REPLY • link 5.0 years ago by Kevin Blighe 88k

It is Plasmodium relictum lineage GRW4, an avian malaria parasite. And yes, I have done a de novo transcriptome assembly.

ADD REPLY • link 5.0 years ago by luzglongoria ▴ 50