Question

Frequency of a unique sequence in a fasta file

0

Entering edit mode

9.8 years ago

Tanvir Ahamed ▴ 350

Hi,

I have a FASTA file containing sequences with id as header . The header for all sequence are unique but some sequences are similar to each other. (all.fasta)

To get a FASTA file of unique sequences, I used gt-sequniq command (unique.fasta)

Now I want to find which unique sequence occurs how many times including the header name of similar sequences (Frequency of each unique sequence from all.fasta file).

Thank you

sequencing FASTA • 8.4k views

ADD COMMENT • link updated 16 months ago by Shashi • 0 • written 9.8 years ago by Tanvir Ahamed ▴ 350

3

Entering edit mode

9.8 years ago

umer.zeeshan.ijaz ★ 1.8k

Step 1: Linearize your FASTA file

$ awk 'NR==1 {print ; next} {printf /^>/ ? "\n"$0"\n" : $1} END {print}' yourfastafile.fasta > yourfastafile_linearized.fasta

Step 2: Dereplicate your FASTA file and keep an account of unique sequences

$ grep -v "^>" yourfastafile_linearized.fasta | 
  grep -v [^ACGTacgt] | \
  sort -d | \
  uniq -c | \
  while read abundance sequence
  do
    hash=$(printf "${sequence}" | sha1sum);
    hash=${hash:0:40};
    printf ">%s;size=%d;\n%s\n" "${hash}" "${abundance}" "${sequence}";
  done >  yourfastafile_linearized_dereplicated.fasta

The output will be in USEARCH format with size giving the frequency of unique reads:

$ head yourfastafile_linearized_dereplicated.fasta
>dff2e3c82439e27949c51a5e45acae06b4a2cafe;size=1;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAAACTGACTATTCCACTGGAACCACTGAACGGAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCAAGAAGGTGATAAGCAGGAGAAACATACGAAGGCGCATAACGATACCACTGACCCTCAGCAATCTTAAACTTCTTAGACGAATCACCAGAACGGAAAACAGCCTTCATAGAAATTTCACGCGGCGG
>a6387c43c45ee62484eafb6c3301a8c0ed0fd546;size=30;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAG
>66a3448cf6f606e3f843ea5f81d5a77d461bbf5e;size=13;
AAAAAAAGTTTGAATTATGGCGAGAAATAAAAGTCTGAAACATGATTAAACTCCTAAGCAGAAAACCTACCGCGCTTCGCTTGGTCAACCCCTCAGCGGCAAAAATTAAAATTTTTACCGCTTCGGCGTTATAACCTCACACTCAATCTTTTATCACGAAGTCATGATTGAATCGCGAGTGGTCGGCAGATTGCGATAAACGGTCACATTAAATTTAACCTGACTATTCCACTGCAACAACTGAACGGACTGGAAACACTGGTCATAATCATGGTGGCGAATAAGTACGCGTTCTTGCAAATCACCAGAAGGCGGTTCCTGAATGAATGGGAAGCCTTCA
>0746a116007d044ac40d6ead5d8fa029e034c9a6;size=1;

Best Wishes,
Umer

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by umer.zeeshan.ijaz ★ 1.8k

0

Entering edit mode

Thanks.

But I am more interested about the members (name or ID) of each group.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

0

Entering edit mode

Step #2 doesn't work

ADD REPLY • link 7.8 years ago by Israel Barrantes ▴ 790

0

Entering edit mode

When you say something doesn't work, it always helps to give details. A sample of your input, the exact command you ran, and any errors you faced, plus, most importantly, why you think it doesn't "work". Remember, it may seem like it doesn't work while it works perfectly fine.

ADD REPLY • link 7.8 years ago by Ram 44k

0

Entering edit mode

9.8 years ago

Ram 44k

I wrote a script that deals with the problem that gt-sequniq deals with. You can change it a bit to suit your case: https://github.com/RamRS/myPerlScripts/blob/master/mulFaDeDup.pl

Modify the existing hash to hold sequence:comma-separated list of headers as key:value

Then, add a new hash with ~~header~~ sequence as key and a counter as value, increase counter to number of times the sequence is found. Filter the final hash by value > 1 and ~~write the keys out with their values~~ map the remaining keys back to the previous hash.

This way, you can retain header info as well as get a count.

EDIT; Updated with a HUGE bug fix.

ADD COMMENT • link 2.6 years ago by Ram 44k

Ram · Accepted Answer · 2015-02-16

4

Entering edit mode

9.8 years ago

5heikki 11k

So, you're looking for a clustering algorithm. I would recommend either Usearch or cd-hit..

ADD COMMENT • link 9.8 years ago by 5heikki 11k

0

Entering edit mode

I think it's not a clustering algorithm !! its more about frequency distribution of each unique sequence.

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

0

Entering edit mode

i.e. cluster size at 100% identity

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by 5heikki 11k

0

Entering edit mode

This is a MUCH better option.

ADD REPLY • link 2.6 years ago by Ram 44k

Ram · Accepted Answer · 2015-02-19

3

Entering edit mode

9.8 years ago

Tanvir Ahamed ▴ 350

My solution:

Main file: all.fasta

Unique sequence file, generated from main file: unique.fasta

gt sequniq -o unique.fasta all.fasta

R function to find the similar sequence

funiq<-function(all.fa,uniq.fa)
        {
        all<-unlist(all.fa)
        sequ.names <- names(uniq.fa)
        sequ <- NULL
        sequ[sequ.names] <- list(NULL)
        for(i in 1: length(uniq.fa))
            {
            sequ[[i]]<-names(all[which(all%in%(uniq.fa[[i]][1]))])
            }
        return(sequ)
        }

Read FASTA file in R

library(seqinr)
all<-read.fasta("all.fasta",as.string = TRUE,seqtype="AA")
uniq<-read.fasta("unique.fasta",as.string = TRUE,seqtype="AA")

Name of similar sequences:

nam<- funiq(all,uniq)

Result:

$ADC37925
[1] "ADC37925" "EYO75956" "EVE92773"

$AFR73793
[1] "AFR73793" "EPZ11191" "EPZ09632" "EQM92926" "EOR47863" "CAQ50233"

$EFB95474
[1] "EFB95474"

Frequency count:

fcount<-lapply(nam,length)

Result:

$ADC37925
[1] 3

$AFR73793
[1] 6

$EFB95474
[1] 1

ADD COMMENT • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

0

Entering edit mode

Well, that was rude! Anyway, glad you solved the problem.

ADD REPLY • link 2.6 years ago by Ram 44k

0

Entering edit mode

Thanks for your help!!

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

1

Entering edit mode

You're welcome, but just so you know, adding your own answer as the accepted answer after people have given you suggestions and answers is rude.

ADD REPLY • link 2.6 years ago by Ram 44k

0

Entering edit mode

Thanks for your suggestion.

Definitely I highly expect and prefer peoples suggestions and answers. That's why we share our problem and thinking.

But the the meaning of "Accepted " answer is most relevant solution of the problem/question.

is it ok to accept suggestion as solution or not to share my own solution here?

ADD REPLY • link updated 2.6 years ago by Ram 44k • written 9.8 years ago by Tanvir Ahamed ▴ 350

0

Entering edit mode

Your interpretation of "Accepted" is one of many possible interpretations, but the point is, as long as answers such as 5heikki's give you an optimal solution to your problem, and have been added in a relevant time frame, adding your own answer and accepting only your answer is equivalent to hosting a party, holding a contest and kinda rigging it to win it yourself. It is a crude anomaly, but you get the gist. It is not bad, just rude.

Consider accepting all answers that get you from your question to the desired answer and do not need much effort (in this case, 5heikki's). This is good for people that may have the same problem in the future, so they have choices when it comes to solutions.