Change FASTA file header to include number of times a read appears
8.8 years ago

Hello,

I have a data set that looks like this:

>JAMESBROWN_1_FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>JAMESBROWN_1_FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>JAMESBROWN_1_FC20423AAXX_7_1_115_888
GGGGGTGTAGGGTGGGGTTGG
>JAMESBROWN_1_FC20423AAXX_7_1_99_894
GTTCGTATCCCACTTCTGACACCA
>JAMESBROWN_1_FC20423AAXX_7_1_226_900
GCAAACTGTGCGTCATCGTGT

And I'd like to edit it to look like this:

>cel1_count=3
TGCCTTGTCTGTCCTAAAAATC
>cel2_count=9
GTTAAGTGGGAAACGATGT
>cel3_count=7
CCGACCTTGAAATACCAC
>cel4_count=7
TAGAAATCCACTATGCTTTGG
>cel5_count=5
CGCGGGTGAGCAGCCTGGTAGCTCGTC

Count in the header line specifies the number of times a sequence occurs in the data set. Kindly assist. Thanks!

sequence sequencing • 2.0k views
8.8 years ago
cat in.fa | paste - - | cut -f 2 | LC_ALL=C sort |\
 uniq -c | sed 's/^[ ]*//' |\
 awk '{printf(">cel%d_count=%s\n%s\n",NR,$1,$2);}'
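The pipeline above relies on `paste - -`, so it assumes every record is exactly two lines (one header, one sequence). If your sequences are wrapped over several lines, something like the following sketch could be run first to put each sequence on a single line. The function name `linearize_fasta` is made up for illustration, and it assumes the first line of the input is a header.

```shell
# Join wrapped sequence lines so each record becomes header + one sequence line.
# Assumption: the input starts with a ">" header line.
linearize_fasta() {
    awk '
        /^>/ {
            if (NR > 1) print seq   # flush the previous record (even if empty)
            print                   # print the header unchanged
            seq = ""
            next
        }
        { seq = seq $0 }            # append this line to the current sequence
        END { if (NR > 0) print seq }
    '
}

# Possible usage in front of the pipeline:
#   linearize_fasta < in.fa | paste - - | cut -f 2 | LC_ALL=C sort | uniq -c ...
```

An empty sequence is still printed as an empty line, so the header/sequence pairing that `paste - -` expects is kept.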

Thanks for your input. For some reason, the reads are also altered instead of the header line only. A lot of bases are replaced by A. Here is what my result looked like:

>dme1_count=329515

>dme2_count=534
A
>dme3_count=15
AA
>dme4_count=4
AAA
>dme5_count=1719
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>dme6_count=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAC
>dme7_count=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAN
>dme8_count=2
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAT
>dme9_count=3
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAACA

The command does nothing to your sequences other than count them. Because the sequences are sorted before counting, those made up mostly of "A" appear first in your output; the original order is not maintained.
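If the original order matters, one option is to count in a single awk pass and print the sequences in the order they were first seen, instead of sorting. This is only a sketch: the function name `count_in_order` and the optional prefix argument are made up here, and it assumes two-line records (header, then sequence), like the original command.

```shell
# Count identical sequences, keeping first-appearance order (no sort needed).
# Assumption: two-line FASTA records, so sequences are on the even lines.
# Optional first argument: header prefix (defaults to "cel").
count_in_order() {
    awk -v prefix="${1:-cel}" '
        NR % 2 == 0 {
            if (!($0 in count)) order[++n] = $0   # remember first-seen order
            count[$0]++
        }
        END {
            for (i = 1; i <= n; i++)
                printf(">%s%d_count=%d\n%s\n", prefix, i, count[order[i]], order[i])
        }
    '
}

# Possible usage:
#   count_in_order < in.fa > out.fa
#   count_in_order dme < in.fa > out.fa
```

Note that this keeps every distinct sequence in memory, which the sort-based pipeline avoids, so it may not suit very large files.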


Thank you, Pierre and Goutham! It's clear now.

8.8 years ago
Charles Plessy ★ 2.9k

The command fastx_collapser from the FASTX-Toolkit will produce what you want, except for the sequence name, which will look like x-y, where x is the position of the sequence in the output file, and y is the number of times it occurred in the input file.

For example:

$ cat in.fa
>JAMESBROWN_1_FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>JAMESBROWN_1_FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>JAMESBROWN_1_FC20423AAXX_7_1_82_883
GTTAGAGGTTCGAAG
>JAMESBROWN_1_FC20423AAXX_7_1_198_886
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>JAMESBROWN_1_FC20423AAXX_7_1_115_888
GGGGGTGTAGGGTGGGGTTGG
>JAMESBROWN_1_FC20423AAXX_7_1_99_894
GTTCGTATCCCACTTCTGACACCA
>JAMESBROWN_1_FC20423AAXX_7_1_226_900
GCAAACTGTGCGTCATCGTGT

$ fastx_collapser < in.fa
>1-2
GTTAGAGGTTCGAAG
>2-2
GGCTCAGTGGTCTAGTGGTATGATTCTCGCTT
>3-1
GGGGGTGTAGGGTGGGGTTGG
>4-1
GTTCGTATCCCACTTCTGACACCA
>5-1
GCAAACTGTGCGTCATCGTGT
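
If you need the exact header style from the question, the x-y names shown above could probably be rewritten with sed afterwards. A minimal sketch, assuming the headers always look like `>x-y` as in this example (the function name `rename_collapsed` and the `cel` prefix are just for illustration):

```shell
# Turn ">x-y" headers into ">celx_count=y"; sequence lines pass through unchanged.
# Assumption: header lines consist only of ">", digits, "-", digits.
rename_collapsed() {
    sed -E 's/^>([0-9]+)-([0-9]+)$/>cel\1_count=\2/'
}

# Possible usage:
#   fastx_collapser < in.fa | rename_collapsed > out.fa
```

Lines that do not match the pattern are left untouched, so a header in a different format would show up unchanged in the output rather than being silently mangled.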