too many non-synonymous SNPs
10.4 years ago

Hello,

I have a between-species RNA-seq data set from a non-model system. I did the transcriptome assembly with Trinity, called SNPs with GATK, and then called the ORFs with TransDecoder. I then fed this information into SNPdat.

The problem is that I get 20,000 SNPs labeled as non-synonymous and only 10,000 labeled as synonymous. I know there should be far more synonymous than non-synonymous SNPs, so I am baffled by this result. Has anyone else working with a non-model system tried to annotate their SNPs? Has anyone had this problem? I would be open to using another annotation program if anyone knows of one!

SNP RNA-Seq • 2.5k views

Hi Emma, it's hard to offer specific advice because there are so many variables in assembly, SNP calling and annotation. One thing I'm not clear about is whether your SNPs are segregating within species, are substitutions between species, or are both. If you split them into within-species and among-species differences, you might see a different pattern.

10.4 years ago
Bert Overduin ★ 3.7k

I'm not so sure your expectation is right. Wouldn't you expect that most substitutions in the coding sequence would be missense?

A quick look at Ensembl tells me that for human there are more than twice as many missense variants as synonymous ones:

mysql -u anonymous -h ensembldb.ensembl.org
mysql> use homo_sapiens_variation_75_37;
mysql> SELECT COUNT(*) FROM variation_feature;
+----------+
| COUNT(*) |
+----------+
| 69114909 |
+----------+
1 row in set (0.01 sec)
mysql> SELECT COUNT(*) FROM variation_feature WHERE consequence_types LIKE 'missense_variant';
+----------+
| COUNT(*) |
+----------+
|   321337 |
+----------+
1 row in set (26.05 sec)
mysql> SELECT COUNT(*) FROM variation_feature WHERE consequence_types LIKE 'synonymous_variant';
+----------+
| COUNT(*) |
+----------+
|   137136 |
+----------+
1 row in set (27.98 sec)

Good point. Most (about two thirds) of the mutations in a coding sequence will be non-synonymous, and many of them will segregate for a while at least. But when you look at _substitutions_ between species, you expect non-synonymous ones to be selected against, so you'd get fewer. The OP's numbers are what you'd expect if non-synonymous and synonymous mutations had equal chances of fixing, which is unlikely in the long run. So how "surprising" these numbers are might depend on how much within-species sampling is included in the between-species comparison.
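To make the "most mutations are non-synonymous" point concrete, here is a rough Python sketch that enumerates every single-nucleotide change to every sense codon in the standard genetic code and classifies it. It assumes all codons are equally common and all base changes equally likely, which real genomes do not satisfy (codon usage and transition/transversion bias both shift the numbers), so treat it as a back-of-the-envelope check rather than an expectation for any particular data set. The function name `classify_point_mutations` is just for illustration.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G ordering
# of first, second and third codon positions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def classify_point_mutations(table=CODON_TABLE):
    """Count synonymous, missense and nonsense single-base changes.

    Only sense codons are used as starting points, so each of the 61
    sense codons contributes 9 possible changes (3 positions x 3 bases).
    """
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in table.items():
        if aa == "*":
            continue  # skip stop codons as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a change
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = table[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_point_mutations()
total = sum(counts.values())
nonsyn_fraction = (counts["missense"] + counts["nonsense"]) / total
print(counts)
print(f"non-synonymous fraction: {nonsyn_fraction:.2f}")
```

Under these simplifying assumptions roughly three quarters of possible point mutations change the protein, in the same ballpark as the "about two thirds" figure above. For comparison, the OP's 20,000 non-synonymous out of 30,000 coding SNPs is a fraction of about 0.67.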
