How to create a distance matrix from a huge file of nucleotide sequences using R?
1
0
Entering edit mode
22 months ago
Manaswini • 0

I want to generate a distance matrix from many nucleotide sequences. I need to perform clustering (NMDS analysis) with that data. Can anyone suggest which R package I should use to create the distance matrix, so that I can go on to the clustering analysis?

clustering NMDS R • 1.9k views

Did you google first?

I am sorry, sir, I am an extreme beginner in R. I found DistanceMatrix through Google, but I do not want to calculate the Hamming distance between sequences; I want to calculate the p-distance. My sequences are also not all the same length, so I got confused about which package to use. I do not intend to generate a phylogenetic tree. I just need an NMDS (non-metric multidimensional scaling) plot to see which sequences cluster together.

If your sequences are not aligned, you need to align them first, and that is better done outside of R. How many sequences are there really?

You can possibly do that with MEGA, see here: https://www.megasoftware.net/mega1_manual/Distance.html#:~:text=p%2Ddistance,p%20%3D%20nd%2Fn.
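
For reference, the p-distance named in that link is the proportion of compared sites at which two aligned sequences differ. Using the symbols from the link, with $n_d$ the number of differing sites and $n$ the number of sites compared:

```latex
p = \frac{n_d}{n}
```

As a worked example, two aligned sequences that differ at 2 of 10 compared sites have $p = 2/10 = 0.2$. Whether sites containing gaps count toward $n$ depends on the tool and its settings, so check that before comparing values across programs.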

It really depends on the size of your dataset.

I have around 30,000 sequences per file. Thank you for the help.

Aligning them will be a challenge, especially if you don't have a high-memory server available. Try MAFFT, e.g. on this server: https://mafft.cbrc.jp/alignment/server/large.html , or in Galaxy, or on a local server if you have one. But these options may take a long time or fail.

Typical data size is up to ∼200,000 sequences × ∼5,000 sites (including gaps), but depends on similarity. Not for long genomic sequences.

OK sir, thank you.

22 months ago
Michael 55k

One easy-to-find way to compute a distance matrix is DECIPHER's DistanceMatrix: https://rdrr.io/bioc/DECIPHER/man/DistanceMatrix.html

However, whether this is applicable or even advisable depends on your data and your goal. First, your data must be aligned. Then, you say you need to perform clustering, but what for? For generating phylogenetic trees, for example, clustering a distance matrix is not state of the art; it should only be used if there are so many sequences that not even maximum-likelihood (ML) methods like FastTree or IQ-TREE are applicable.
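
To make concrete what a p-distance matrix over aligned sequences contains, here is a minimal, language-neutral sketch in plain Python. It is not the DECIPHER implementation. The function names, the toy alignment, and the choice to skip sites where either sequence has a gap (pairwise deletion) are all assumptions made for illustration.

```python
from itertools import combinations

GAP_CHARS = {"-", "."}  # assumption: these characters mark alignment gaps


def p_distance(a, b):
    """Proportion of compared sites at which two aligned sequences differ.

    Sites where either sequence has a gap are skipped (pairwise deletion,
    an assumption of this sketch). Returns None if no site can be compared.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else None


def distance_matrix(seqs):
    """Symmetric matrix (list of lists) of pairwise p-distances."""
    n = len(seqs)
    matrix = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = p_distance(seqs[i], seqs[j])
        matrix[i][j] = matrix[j][i] = d
    return matrix


# Hypothetical toy alignment, for illustration only
alignment = ["ACGT-A", "ACGTTA", "TCGATA"]
matrix = distance_matrix(alignment)
```

Note that the number of pairs grows quadratically: 30,000 sequences give roughly 450 million pairwise comparisons, so a plain loop like this only serves to show the definition. At that scale a compiled implementation, or reducing the data first (for example by removing duplicate sequences), is likely needed.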
