8.0 years ago
ashkan
I have a file like this small example:
>ENSG00000004142|ENST00000003607|POLDIP2|||2118
Sequence unavailable
>ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756
CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA
However, the file contains many "Sequence unavailable" records. I want to get rid of those transcripts, so that the result looks like this:
>ENSG00000003056|ENST00000000412|M6PR|9099001;9102084|9099001;9102551|2756
CCAGGTTGTTTGCCTCTGGTCGGAAAGGGAAACTACCCCTGCTTCCACTCTGACAGCAGA
I tried to filter those records out in bash with:
grep -v "$(grep -B 1 "Sequence unavailable" file.txt)" file.txt
but it gave this error:
Argument list too long
How can I filter them out in bash or Python?
How about this (it should work as long as the first record is "Sequence unavailable"; you can be creative otherwise):
grep -A 2 "Sequence" your.fa | grep -v "\-\-" | sed -n '/Sequence/!p' > new.fa
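Since the question also asks about Python, here is a minimal sketch that does not depend on record order or on sequences fitting on one line. It assumes every record starts with a ">" header line and that an unavailable record contains the exact text "Sequence unavailable" on a line of its own. The names `filter_fasta` and `filter_fasta_file` are made up for this illustration.

```python
MARKER = "Sequence unavailable"  # placeholder text used in the example file


def _has_sequence(record):
    # record[0] is the header; any later line equal to MARKER marks it as empty
    return not any(line.strip() == MARKER for line in record[1:])


def filter_fasta(lines):
    """Return the lines of every record whose sequence is available."""
    kept = []
    record = []
    for line in lines:
        line = line.rstrip("\n")
        # A new header closes the record collected so far
        if line.startswith(">") and record:
            if _has_sequence(record):
                kept.extend(record)
            record = []
        record.append(line)
    # Do not forget the last record in the file
    if record and _has_sequence(record):
        kept.extend(record)
    return kept


def filter_fasta_file(src, dst):
    """Read src, drop unavailable records, write the rest to dst."""
    with open(src) as handle:
        kept = filter_fasta(handle)
    with open(dst, "w") as out:
        out.write("\n".join(kept) + "\n")
```

For example, `filter_fasta_file("file.txt", "new.fa")` would write the cleaned records to new.fa. The file is read line by line, so nothing is passed on the command line and the "Argument list too long" error cannot occur.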
It would be nice to provide feedback to the proposed solution of genomax2. In addition, you have more questions which you left "open/unsolved" after people tried to help you. That's not respectful.
I pledged to help you on your previous thread, but my questions remain unanswered, although it's clear that you have been active multiple times on biostars since my comment. You shouldn't take our help for granted.
Dear ashkan, please respond to questions/give follow up comments on your past posts. Abandoning a question after you ask it borders on troll-like behavior. Unless you follow up on your past questions, your future questions may not be taken seriously or your posts may be treated even more sternly.