fasta redundant sequence
7.0 years ago
qudrat ▴ 100

Hello all, I have downloaded GENCODE and RefSeq transcripts, and I want to combine them and remove redundant sequences. Please suggest how to proceed. Thank you.

redundant sequence removal • 1.9k views

Please post example input (one or two records each from GENCODE and RefSeq) and the expected output.


Cluster the sequences with CD-HIT

>uc001aal.1 CDS=1-916
ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCAC
>uc001aak.4
CACACAACGGGGTTTCGGGGCTGTGGACCCTGTGCCAGGAAAGGAAGGGCGCAGCTCCTGCAATGCGGAGCAGCCAGGGCAGTGGGCACCAGGCTTTAGCCTCCCTTTCTCACCCTACAGAGGGCAG
>NM_001005484.1 CDS=1-916
ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCAC
>XM_017003010.1 CDS=512-893
AAATATGGGATTCCTGGGTTTAAAAGTATAAAATAAATATGTTTAATTTGTTAACTGATTACTATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTT

The first two sequences above are from GENCODE and the last two are from RefSeq, so there are four sequences in total. The first and third sequences are identical but have different IDs. I want one of these two to be removed when merging. The result should look like this:

>uc001aak.4
CACACAACGGGGTTTCGGGGCTGTGGACCCTGTGCCAGGAAAGGAAGGGCGCAGCTCCTGCAATGCGGAGCAGCCAGGGCAGTGGGCACCAGGCTTTAGCCTCCCTTTCTCACCCTACAGAGGGCAG
>NM_001005484.1 CDS=1-916
ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTA
>XM_017003010.1 CDS=512-893
AAATATGGGATTCCTGGGTTTAAAAGTATAAAATAAATATGTTTAATTTGTTAACTGATTACTATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTT

If you are OK with running legacy BLAST, this link could be of use: https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html

7.0 years ago

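With seqkit (-s compares records by sequence instead of by ID, -i ignores case, and -w 0 writes each sequence on a single line):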

$ seqkit rmdup -w 0 -s -i test.fa --quiet

output:

>uc001aal.1 CDS=1-916
ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCAC
>uc001aak.4
CACACAACGGGGTTTCGGGGCTGTGGACCCTGTGCCAGGAAAGGAAGGGCGCAGCTCCTGCAATGCGGAGCAGCCAGGGCAGTGGGCACCAGGCTTTAGCCTCCCTTTCTCACCCTACAGAGGGCAG
>XM_017003010.1 CDS=512-893
AAATATGGGATTCCTGGGTTTAAAAGTATAAAATAAATATGTTTAATTTGTTAACTGATTACTATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTT

Input:

$ cat test.fa 
>uc001aal.1 CDS=1-916
ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCAC
>uc001aak.4
CACACAACGGGGTTTCGGGGCTGTGGACCCTGTGCCAGGAAAGGAAGGGCGCAGCTCCTGCAATGCGGAGCAGCCAGGGCAGTGGGCACCAGGCTTTAGCCTCCCTTTCTCACCCTACAGAGGGCAG
>NM_001005484.1 CDS=1-916
ATGGTGACTGAATTCATTTTTCTGGGTCTCTCTGATTCTCAGGAACTCCAGACCTTCCTATTTATGTTGTTTTTTGTATTCTATGGAGGAATCGTGTTTGGAAACCTTCTTATTGTCATAACAGTGGTATCTGACTCCCAC
>XM_017003010.1 CDS=512-893
AAATATGGGATTCCTGGGTTTAAAAGTATAAAATAAATATGTTTAATTTGTTAACTGATTACTATCAGAATTGTACTGTTCTGTATCCCACCAGCAATGTCTAGGAATGCCTGTTTCTCCACAAAGTGTTT
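
For readers without seqkit, the same idea (drop records whose sequence has already been seen, ignoring case, and keep the first occurrence) can be sketched in plain Python. This is only an illustration of exact-duplicate removal, not seqkit's actual implementation; the function names parse_fasta and dedup_by_sequence and the toy records g1, g2, r1, r2 are hypothetical and chosen for this example.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def parse_fasta(text: str) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def dedup_by_sequence(records: Iterable[Record]) -> List[Record]:
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = set()
    kept: List[Record] = []
    for header, seq in records:
        key = seq.upper()
        if key in seen:
            continue  # same sequence already kept under another ID
        seen.add(key)
        kept.append((header, seq))
    return kept


# Toy data (hypothetical): g1 and r1 carry the same sequence.
gencode = ">g1\nATGGTG\n>g2\nCACACA\n"
refseq = ">r1\natggtg\n>r2\nAAATAT\n"

merged = dedup_by_sequence(list(parse_fasta(gencode)) + list(parse_fasta(refseq)))
print([header for header, _ in merged])  # ['g1', 'g2', 'r2']
```

Because the first occurrence wins, the input order decides which ID survives. To keep the RefSeq ID for a duplicated sequence, as in the expected output posted earlier in the thread, put the RefSeq records before the GENCODE records when concatenating.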