Question

Script To Print Number Of Occurrences Of Genes From Multifasta File

0


13.2 years ago

Syed Imtiyaz ▴ 40

Hi, I need help writing a script that takes a multi-FASTA file as input and outputs two columns: each gene name and the number of times that gene occurs in the file.

fasta scripting • 5.0k views

ADD COMMENT • link updated 13.2 years ago by Rm 8.3k • written 13.2 years ago by Syed Imtiyaz ▴ 40

6


It will be useful if you indicate whether the answers are helpful. When we get this type of "please write my code for me" question, I often wonder whether the answer even means anything to the questioner. If your problem is that you know nothing about scripting, my advice is to go away and learn some.

ADD REPLY • link 13.2 years ago by Neilfws 49k

score 7 · Answer 1 · 2011-09-16

7


13.2 years ago

Rm 8.3k

The simplest approach, assuming FASTA headers of the form ">gene_name description...":

grep "^>" multi_fasta.txt | sed 's/>//' | awk '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}' >gene_count.txt
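To see what this pipeline produces, here is a small worked example. The temporary directory, the file name toy.fa and its three records are made up for illustration; the pipeline is the one above, unchanged apart from the file names.

```shell
# Hypothetical test data: geneA appears twice, geneB once.
tmp=$(mktemp -d)
printf '>geneA first copy\nACGT\n>geneB only copy\nGGCC\n>geneA second copy\nTTAA\n' > "$tmp/toy.fa"

# Keep header lines, drop the '>', take the first whitespace-delimited word,
# then count identical names and print "name<TAB>count".
grep "^>" "$tmp/toy.fa" | sed 's/>//' | awk '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}' > "$tmp/toy_counts.txt"

cat "$tmp/toy_counts.txt"
```

On this toy input, geneA is reported with a count of 2 and geneB with a count of 1.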

Edit: using the file at the link provided:

curl http://dl.dropbox.com/u/43445136/examplefasta.fa |  grep "^>" | sed 's/>//' | awk -F"|" '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}'

Output

ENSTGUG00000000002      1
ENSTGUG00000000010      1
ENSTGUG00000000018      1
ENSTGUG00000000021      1
ENSTGUG00000000026      1
ENSTGUG00000000027      1
ENSTGUG00000000029      1
ENSTGUG00000000037      1
ENSTGUG00000000043      1
....

ADD COMMENT • link 13.2 years ago by Rm 8.3k

4


This could also be done with a single awk command, something like: awk '/^>/ { a[substr($0, 2)]++ } END { for (header in a) print a[header], header }'
Note that this counts whole header lines and prints the count before the header.
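A variant of the single-awk idea, sketched under the assumption that the gene name is the first whitespace-delimited word of the header (as in the answer above). It keys on the name alone rather than the whole header line and prints the name first and the count second. The file name and records are hypothetical.

```shell
# Hypothetical test data: geneA appears twice, geneB once.
tmp=$(mktemp -d)
printf '>geneA first copy\nACGT\n>geneB only copy\nGGCC\n>geneA second copy\nTTAA\n' > "$tmp/toy.fa"

# For each header line, strip the leading '>' from the first field and
# increment that name's counter; print all counters at the end.
# awk's for-in order is unspecified, so the result is sorted afterwards.
awk '/^>/ { name = substr($1, 2); count[name]++ }
     END  { for (name in count) print name "\t" count[name] }' "$tmp/toy.fa" | sort > "$tmp/counts.txt"

cat "$tmp/counts.txt"
```

This reads the file once and needs no uniq stage, at the cost of holding one counter per distinct name in memory.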

ADD REPLY • link 13.2 years ago by brentp 24k

0


Thanks a lot for spending your precious time on my problem, Rm. But I think there is a problem with the commands: they give the gene names, but the counting part does not work as expected. Could you please check? Thank you.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


To test it, remove the final awk stage and see whether uniq -c produces any output. By the way, can you paste a sample of the file by editing your question?

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


This code works for me using a simple test fasta file. If it doesn't work for you, the problem might lie with your data.

ADD REPLY • link 13.2 years ago by Neilfws 49k

0


ENSTGUG00000012287|ENSTGUT00000012814|1475 ACCGGTGCCAGGGGCCGCGGTTGGCTGCGAAGCGGCGGCTCCCGCCCCCTGCGGAATCAGCCCCAGGTCCGGGGCGGCTCTACCTGCCGGCACGATGAACCTCACCGCCGAGAGCCACCGCATTCCGCTGAGCGACGGCAACAGCATCCCGCTCTTGGGGCTGGGCACCTACGCCGACCCGCAGAAAACTCCCAAAGGTTCCTGTCTGGAGGCGGTGAAGATTGCCATCGATGCTGGTTACCGCCACATCGACGGTGCCTTTGTCTACTTCAATGAGCATGAAGTGGGACAAGCCATCCGGGAGAAGATTGCTGAAGGGAAGATCAAGAGAGAAGACATATTTTACTGTGGCAAGCTGTGGAATACCTGCCACCCCCCAGAGCTGGTGCGTCCCACACTGGAGAAAACCCTGAAGATCCTGCAGCTGGACTACGTTGACCTCTACATTATTGAGCTGCCAATGGCTTTCAAGCCTGGAGATGCACTCTACCCAAAAGATGAAAATGGAAAATTTATCTACCATGAGACAGACTTATGTGCCACTTGGGAGGCTCTG

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


@syed: It's working for me with the sequence you provided. Output: ENSTGUG00000012287|ENSTGUT00000012814|1475 1

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


Thanks for checking. But the count is not what I expected: it always gives 1 for every gene.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Could you give me your email ID or something similar, so that I can send you the original file?

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Check whether the headers are consistent across all the sequences. Try replacing awk '{print $1}' with awk -F"|" '{print $1}'.
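Why the field separator matters can be shown on headers shaped like the one posted above. With awk's default whitespace splitting, the first field is the whole gene|transcript|length string, so two transcripts of the same gene count as two different names; with -F"|" the first field is the gene ID alone. The IDs and lengths below are invented for illustration.

```shell
# Hypothetical data: GENE1 has two transcripts, GENE2 has one.
tmp=$(mktemp -d)
printf '>GENE1|TX1|100\nACGT\n>GENE1|TX2|200\nGGCC\n>GENE2|TX3|300\nTTAA\n' > "$tmp/pipe.fa"

# Default splitting: $1 is the full "gene|transcript|length" string,
# so every header is unique and every count is 1.
by_space=$(grep "^>" "$tmp/pipe.fa" | sed 's/>//' | awk '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}')

# Splitting on '|': $1 is the gene ID only, so GENE1 is counted twice.
by_pipe=$(grep "^>" "$tmp/pipe.fa" | sed 's/>//' | awk -F"|" '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}')

printf 'whitespace split:\n%s\n\npipe split:\n%s\n' "$by_space" "$by_pipe"
```

If both variants still report 1 for every name on a real file, the gene IDs themselves are most likely unique in that file.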

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


Hi, the problem is still the same.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


I will send you the whole file; please test it.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Upload the file and share the link.

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


Here is the file: http://www.4shared.com/folder/GgfoEnew/_online.html Please check it out, thanks.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Hi, I think there is a problem downloading the file. Please use this link instead: http://uploading.com/files/7bbe159c/examplefasta.fa/ Thank you.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Same problem downloading it: it's blocked by Websense.

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


http://www.fileserve.com/file/kts2Z3h/examplefasta.fa This should work, I think.

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Hi Rm, is the above link working?

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


Content blocked by your organization

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


Upload it to Dropbox or another, safer site.

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


http://dl.dropbox.com/u/43445136/examplefasta.fa

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


I ran the script and it's working fine: all your sequences are present only once. I am editing my answer to include part of the result.
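One quick way to confirm a result like this, sketched here on made-up data, is to list only the names that occur more than once; an empty list means every gene is present exactly once. The sketch relies on uniq -d, which prints one copy of each repeated line in sorted input. The file name and IDs are hypothetical.

```shell
# Hypothetical data in which every gene ID is unique.
tmp=$(mktemp -d)
printf '>GENE1|TX1|100\nACGT\n>GENE2|TX2|200\nGGCC\n>GENE3|TX3|300\nTTAA\n' > "$tmp/unique.fa"

# Extract gene IDs (text before the first '|') and keep only repeated ones.
dups=$(grep "^>" "$tmp/unique.fa" | sed 's/>//' | awk -F"|" '{print $1}' | sort | uniq -d)

if [ -z "$dups" ]; then
    echo "every gene name occurs exactly once"
else
    printf 'repeated gene names:\n%s\n' "$dups"
fi
```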

ADD REPLY • link 13.2 years ago by Rm 8.3k

0


Thank you for all your help :)

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

score 2 · Answer 2 · 2011-09-16

2


13.2 years ago

Martin A Hansen 3.0k

This can be done using Biopieces (www.biopieces.org) like this:

read_fasta -i test_big.fna -n 10 |
count_vals -k SEQ_NAME |
uniq_vals -k SEQ_NAME |
write_tab -ck SEQ_NAME_COUNT,SEQ_NAME -x

Cheers,

Martin

ADD COMMENT • link 13.2 years ago by Martin A Hansen 3.0k

0


Thanks for the reply. Actually, I am not familiar with Biopieces. Could you please suggest how to use it?

Thank you

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40

0


The above does exactly what you wanted, but do have a look at the website and the documentation there.

ADD REPLY • link 13.2 years ago by Martin A Hansen 3.0k

0


Thanks for your suggestion, maasha. I checked, and it is not installed on the server I am working on. Could you please try it in Perl or awk?

Thank you

ADD REPLY • link 13.2 years ago by Syed Imtiyaz ▴ 40