Hello! So, first of all, I have already read and tried several solutions exposed on Biostars and other foruns for my issue, but none of them seem to work. The ones I have most invested my time are faSomeRecords and SeqFilter. The problems I had with them are purely technical (python version, several modules needed, pipeline dont work or works the wrong way because the inpired version author forgot to put some extra line codes) and because of my lack of knowledge to solve them, i gave up
I have several limitations because i'm a newbie with a short deadline and I use my university server, so a not very complicated and quick solution would be nice.
So......... Here's what I have to do:
I have a fasta file with several transcripts sequences
>TCONS_00002379 gene=ENSBTAG00000006648
ATGAATTGCAGCACGCCAGGCCTCCCTGTCCATCACCAACTCCTGGAGTTCACCCAGACTCACATCCATC.......
>TCONS_00000007 gene=RCAN1
GACAGCTTCTGGTAAAGGAACTCCATCCACTTGGGGCTCGACTGCGGGAGTCGCTGTAGCTCTCACTGCC.....
And I have a list with the IDs of transcripts that i dont want my output to have, so i want the ID's sequences out of my output file :
TCONS_00002379 gene=ENSBTAG00000006648
TCONS_00000007 gene=RCAN1
TCONS_00002389 gene=RCAN1
TCONS_00002405 gene=ITSN1
TCONS_00002406 gene=ITSN1
TCONS_00002407 gene=ITSN1
TCONS_00002408 gene=ITSN1
TCONS_00002409 gene=ITSN1
TCONS_00002410 gene=ITSN1
TCONS_00002411 gene=ITSN1
TCONS_00002412 gene=ITSN1
TCONS_00002413 gene=ITSN1
TCONS_00000015 gene=CRYZL1
Currently im trying to run a pipeline with seqtk which I got from a post here, but its taking a long time, and i dont even know if its going to work:
seqtk subseq results_15m_200.fa $(grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt) > finalfile_without_orf_15m.fa
The solution you'll give me can also be on R!!
Thanks in advance
what is the output of
grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt
.input:
output:
Thank you for your input!! how could i use it without faSomeRecords? im not being able to make mine run, I have already tried several solutions....
Actually, if any of you could help me with solutions for faSomeRecords, I would be appreciate.
I downloaded the linux version for my machine but it is not working. I have already tried different Linux versions and even thought there was something wrong with my files, but they seem right. Can you give me some light on how should I proceed? My command line is:
And when I run it I just get:
The output file is not even created and I cannot find a solution for this online.
Just saying this does not help you or us. We have shown you that the program works as expected. You are able to get the in-line help to print to screen so it means that the program is executing fine on your machine. So that leaves your input data as the problem.
As asked by @shenwei365 did you manipulate any of the files on a machine other than a linux server at any time (e.g. on windows or macOS). If so that file(s) could have a problem with line endings. Those can be fixed with
dos2unix
command.Please show us the output of command @cpad0112 asked in their comment.
What does
cat -vet transcripts.with.orf_15m_full.name.txt | head -4
show?Sorry, i missed those questions!! None of the files were manipulated on windows, I did all through linux on my uni's server.
The output of this command
grep ">" results_15m_200.fa | tr -d ">" | grep -v -w -f transcripts.with.orf_15m_full.name.txt
never came out. After I press enter, nothing happen, and yesterday, with some hope, i runned it with nohup and &.Now the output for
cat -vet transcripts.with.orf_15m_full.name.txt | head -4
is:If you get nothing from the
grep
command then that tells us that there is some discrepancy between your ID list and the original fasta headers. If I run that command on my toy example I getWhich shows the fasta headers that are not common between input fasta and file with list of headers that need to be removed. In other words sequences that will be kept in final output.
As for the
cat
output your line endings do look to be in proper unix format without any additional characters.