Excluding sequences from fasta based on a id list - faSomeRecords not working
2
1
Entering edit mode
2.8 years ago

Hello! So, first of all, I have already read and tried several solutions exposed on Biostars and other foruns for my issue, but none of them seem to work. The ones I have most invested my time are faSomeRecords and SeqFilter. The problems I had with them are purely technical (python version, several modules needed, pipeline dont work or works the wrong way because the inpired version author forgot to put some extra line codes) and because of my lack of knowledge to solve them, i gave up

I have several limitations because i'm a newbie with a short deadline and I use my university server, so a not very complicated and quick solution would be nice.

So......... Here's what I have to do:

I have a fasta file with several transcripts sequences

>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC.......

>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC.....

And I have a list with the IDs of transcripts that i dont want my output to have, so i want the ID's sequences out of my output file :

TCONS_00002379 gene=ENSBTAG00000006648
TCONS_00000007 gene=RCAN1
TCONS_00002389 gene=RCAN1
TCONS_00002405 gene=ITSN1
TCONS_00002406 gene=ITSN1
TCONS_00002407 gene=ITSN1
TCONS_00002408 gene=ITSN1
TCONS_00002409 gene=ITSN1
TCONS_00002410 gene=ITSN1
TCONS_00002411 gene=ITSN1
TCONS_00002412 gene=ITSN1
TCONS_00002413 gene=ITSN1
TCONS_00000015 gene=CRYZL1

Currently im trying to run a pipeline with seqtk which I got from a post here, but its taking a long time, and i dont even know if its going to work:

seqtk subseq results_15m_200.fa $(grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt) > finalfile_without_orf_15m.fa

The solution you'll give me can also be on R!!

Thanks in advance

filter faSomeRecords fasta • 2.7k views
ADD COMMENT
0
Entering edit mode

what is the output of grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt.

input:

$ cat file.txt
TCONS_00002379 gene=ENSBTAG00000006648
TCONS_00002389 gene=RCAN1
TCONS_00002413 gene=ITSN1
TCONS_00000015 gene=CRYZL1

$ cat file.fa                  
>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC.......

>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC.....

output:

$ seqtk subseq file.fa file.txt
>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC.......

$ faSomeRecords -exclude file.fa file.txt out.fa

$  cat out.fa 
>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC.....
ADD REPLY
0
Entering edit mode

Thank you for your input!! how could i use it without faSomeRecords? im not being able to make mine run, I have already tried several solutions....

ADD REPLY
0
Entering edit mode

Actually, if any of you could help me with solutions for faSomeRecords, I would be appreciate.

I downloaded the linux version for my machine but it is not working. I have already tried different Linux versions and even thought there was something wrong with my files, but they seem right. Can you give me some light on how should I proceed? My command line is:

/media/disk4/gopec/cracco/lncRNA/15m/faSomeRecords –exclude results_15m_200.fa transcripts.with.orf_15m_full.name.txt finalfile_without_orf_15m.fa

And when I run it I just get:

faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

The output file is not even created and I cannot find a solution for this online.

ADD REPLY
0
Entering edit mode

but it is not working

Just saying this does not help you or us. We have shown you that the program works as expected. You are able to get the in-line help to print to screen so it means that the program is executing fine on your machine. So that leaves your input data as the problem.

As asked by @shenwei365 did you manipulate any of the files on a machine other than a linux server at any time (e.g. on windows or macOS). If so that file(s) could have a problem with line endings. Those can be fixed with dos2unix command.

Please show us the output of command @cpad0112 asked in their comment.

What does cat -vet transcripts.with.orf_15m_full.name.txt | head -4 show?

ADD REPLY
0
Entering edit mode

Sorry, i missed those questions!! None of the files were manipulated on windows, I did all through linux on my uni's server.

The output of this command grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt never came out. After I press enter, nothing happen, and yesterday, with some hope, i runned it with nohup and &.

Now the output for cat -vet transcripts.with.orf_15m_full.name.txt | head -4 is:

TCONS_00002379 gene=ENSBTAG00000006648$
TCONS_00000007 gene=RCAN1$
TCONS_00002389 gene=RCAN1$
TCONS_00002405 gene=ITSN1$
ADD REPLY
1
Entering edit mode

If you get nothing from the grep command then that tells us that there is some discrepancy between your ID list and the original fasta headers. If I run that command on my toy example I get

$ grep ">" test.fa | tr -d ">" | grep -v -w -f list
NC_053005.1 UNVERIFIED: Enterobacter phage KNP7 genomic sequence
test2
test3

Which shows the fasta headers that are not common between input fasta and file with list of headers that need to be removed. In other words sequences that will be kept in final output.

As for the cat output your line endings do look to be in proper unix format without any additional characters.

ADD REPLY
1
Entering edit mode
2.8 years ago
GenoMax 147k

Not sure why you can't get faSomeRecords to work. Here is an example.

$ more test.fa
>NC_053005.1 UNVERIFIED: Enterobacter phage KNP7 genomic sequence
CCCGTTAATGACGATGGTGTACAGTTCTCTCGGCCAGACGCATATTCGTCAAACGCCGCCGTCGCGGTAACGCCGATCTC
>test2
ACATTGACATTTCCAAAGTCAGCACACGTCTGGCGTCGATTGAAGCGGTTAACGTCCGGACGATGGCCCGAAATCCGTTG
>test3
AGGCGGCCTGTGGCGTTTCAACACGTTGCAGATGGTGTCTGACCCGGCGGACGACGCACGTTGGAACCTTACAAAACTGG
>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC
>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC

# Need to make sure that ">" is removed from the fasta headers when you put them in list file

$ more list
TCONS_00002379 gene=ENSBTAG00000006648
TCONS_00000007 gene=RCAN1

$ faSomeRecords -exclude test.fa list outtest

$ more outtest 
>NC_053005.1 UNVERIFIED: Enterobacter phage KNP7 genomic sequence
CCCGTTAATGACGATGGTGTACAGTTCTCTCGGCCAGACGCATATTCGTCAAACGCCGCCGTCGCGGTAACGCCGATCTC
>test2
ACATTGACATTTCCAAAGTCAGCACACGTCTGGCGTCGATTGAAGCGGTTAACGTCCGGACGATGGCCCGAAATCCGTTG
>test3
AGGCGGCCTGTGGCGTTTCAACACGTTGCAGATGGTGTCTGACCCGGCGGACGACGCACGTTGGAACCTTACAAAACTGG
ADD COMMENT
0
Entering edit mode

Thank you for your comment!!! Thats the way I was using it, and I have no idea either why it was not working. I have even sent an email for dr. kent to see if he could give me some light.

Thank you once again

ADD REPLY
0
Entering edit mode

Does the ID file created in Windows? Try "dos2unix id.txt".

ADD REPLY
0
Entering edit mode

No, it wasnt!! I did everything through linux. I just tried this on my Id file, but it didnt work as well. I keep getting the same output:

faSomeRecords - Extract multiple fa records
usage:
   faSomeRecords in.fa listFile out.fa
options:
   -exclude - output sequences not in the list file.

But thank you

ADD REPLY
1
Entering edit mode
2.8 years ago
GenoMax 147k

If you want to try a different solution then filterbyname.sh from BBMap suite will do this as well.

filterbyname.sh -Xmx4g in=results_15m_200.fa out=filtered.fa names=transcripts.with.orf_15m_full.name.txt 
ADD COMMENT

Login before adding your answer.

Traffic: 1799 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6