Script To Print Number Of Occurences Of Genes From Multifasta File
2
0
Entering edit mode
13.2 years ago
Syed Imtiyaz ▴ 40

Hi, i need help to write a script that will take for input a multi fasta file and output the gene names and the number of times the gene is found in the file in two columns.

fasta scripting • 5.0k views
ADD COMMENT
6
Entering edit mode

It will be useful if you indicate whether the answers are helpful. When we get this type of "please write my code for me" question, I often wonder whether the answer even means anything to the questioner. If your problem is that you know nothing about scripting, my advice is to go away and learn some.

ADD REPLY
7
Entering edit mode
13.2 years ago
Rm 8.3k

simplest will be: Asuming fasta header as ">gene_name description..."

grep "^>" multi_fasta.txt | sed 's/>//' | awk '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}' >gene_count.txt

Edit: using the link provided

curl http://dl.dropbox.com/u/43445136/examplefasta.fa |  grep "^>" | sed 's/>//' | awk -F"|" '{print $1}' | sort | uniq -c | awk '{print $2 "\t" $1}'

Output

ENSTGUG00000000002      1
ENSTGUG00000000010      1
ENSTGUG00000000018      1
ENSTGUG00000000021      1
ENSTGUG00000000026      1
ENSTGUG00000000027      1
ENSTGUG00000000029      1
ENSTGUG00000000037      1
ENSTGUG00000000043      1
....
ADD COMMENT
4
Entering edit mode

could also be a single awk command something like: awk '(/^>/){ a[substr($0, 2)]++ }END { for(header in a){ print a[header], header }}'

ADD REPLY
0
Entering edit mode

thanks alot for spending ur precious time on my problem RM. But there is some problem with the commands i guess coz it's giving the gene name but it is not doing the counting part as expected could you plz check it out thank you

ADD REPLY
0
Entering edit mode

to test it, take off last part of the awk to see if uniq -c is giving out put or not...: BTW can you paste the sample file by editing your Question...

ADD REPLY
0
Entering edit mode

This code works for me using a simple test fasta file. If it doesn't work for you, the problem might lie with your data.

ADD REPLY
0
Entering edit mode

ENSTGUG00000012287|ENSTGUT00000012814|1475 ACCGGTGCCAGGGGCCGCGGTTGGCTGCGAAGCGGCGGCTCCCGCCCCCTGCGGAATCAGCCCCAGGTCCGGGGCGGCTCTACCTGCCGGCACGATGAACCTCACCGCCGAGAGCCACCGCATTCCGCTGAGCGACGGCAACAGCATCCCGCTCTTGGGGCTGGGCACCTACGCCGACCCGCAGAAAACTCCCAAAGGTTCCTGTCTGGAGGCGGTGAAGATTGCCATCGATGCTGGTTACCGCCACATCGACGGTGCCTTTGTCTACTTCAATGAGCATGAAGTGGGACAAGCCATCCGGGAGAAGATTGCTGAAGGGAAGATCAAGAGAGAAGACATATTTTACTGTGGCAAGCTGTGGAATACCTGCCACCCCCCAGAGCTGGTGCGTCCCACACTGGAGAAAACCCTGAAGATCCTGCAGCTGGACTACGTTGACCTCTACATTATTGAGCTGCCAATGGCTTTCAAGCCTGGAGATGCACTCTACCCAAAAGATGAAAATGGAAAATTTATCTACCATGAGACAGACTTATGTGCCACTTGGGAGGCTCTG

ADD REPLY
0
Entering edit mode

@syed: its working for me with the sequence you provided.: output ENSTGUG00000012287|ENSTGUT00000012814|1475 1

ADD REPLY
0
Entering edit mode

thanks for checking. But for the count part it is not giving the expected it is always giving 1 as the count for each and every gene.

ADD REPLY
0
Entering edit mode

could u give me ur email id r some thing by taht i can send u the original file

ADD REPLY
0
Entering edit mode

check the headers for all the sequences if they are consisitant; Try replacing awk '{print $1}' with awk -F"|" '{print $1}'

ADD REPLY
0
Entering edit mode

Hi, Still the problem is same

ADD REPLY
0
Entering edit mode

i will send u the whole file plz test it

ADD REPLY
0
Entering edit mode

ENSTGUG00000012287|ENSTGUT00000012814|1475 ACCGGTGCCAGGGGCCGCGGTTGGCTGCGAAGCGGCGGCTCCCGCCCCCTGCGGAATCAGCCCCAGGTCCGGGGCGGCTCTACCTGCCGGCACGATGAACCTCACCGCCGAGAGCCACCGCATTCCGCTGAGCGACGGCAACAGCATCCCGCTCTTGGGGCTGGGCACCTACGCCGACCCGCAGAAAACTCCCAAAGGTTCCTGTCTGGAGGCGGTGAAGATTGCCATCGATGCTGGTTACCGCCACATCGACGGTGCCTTTGTCTACTTCAATGAGCATGAAGTGGGACAAGCCATCCGGGAGAAGATTGCTGAAGGGAAGATCAAGAGAGAAGACATATTTTACTGTGGCAAGCTGTGGAATACCTGCCACCCCCCAGAGCTGGTGCGTCCCACACTGGAGAAAACCCTGAAGATCCTGCAGCTGGACTACGTTGACCTCTACATTATTGAGCTGCCAATGGCTTTCAAGCCTGGAGATGCACTCTACCCAAAAGATGAAAATGGAAAATTTATCTACCATGAGACAGACTTATGTGCCACTTGGGAGGCTCTG

ADD REPLY
0
Entering edit mode

upload the file and share the link

ADD REPLY
0
Entering edit mode

http://www.4shared.com/folder/GgfoEnew/_online.html here is the file plz check it out thanks

ADD REPLY
0
Entering edit mode

Hi, I think there is some problem in downloading the file plz go to this link http://uploading.com/files/7bbe159c/examplefasta.fa/ THank you

ADD REPLY
0
Entering edit mode

same problem downloading it..its blocked by websense

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

Hi RM, the above link is working?

ADD REPLY
0
Entering edit mode

Content blocked by your organization

ADD REPLY
0
Entering edit mode

upload it to dropbox or other more safe site...

ADD REPLY
0
Entering edit mode
ADD REPLY
0
Entering edit mode

i ran the script its working fine, all your sequences are present only once...Iam editing my answer to include part of the result

ADD REPLY
0
Entering edit mode

THank you for all your help :)

ADD REPLY
2
Entering edit mode
13.2 years ago

This can be done using Biopieces www.biopieces.org) like this:

read_fasta -i test_big.fna -n 10 |
count_vals -k SEQ_NAME |
uniq_vals -k SEQ_NAME |
write_tab -ck SEQ_NAME_COUNT,SEQ_NAME -x

Cheers,

Martin

ADD COMMENT
0
Entering edit mode

Thanks for reply, actually i am not aware of 'Biopieces'. Could you please suggest how to use it

Thank you

ADD REPLY
0
Entering edit mode

The above does exactly what you wanted, but do have a look at the website and the documentation there.

ADD REPLY
0
Entering edit mode

thanks for ur suggestion maasha, i checked it is not installed on the server which i am working. could u plz try it in perl or awk

THank you

ADD REPLY

Login before adding your answer.

Traffic: 1484 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6