Hi guys,
I have two files: a FASTA file (containing sequences and sequence IDs) and a text file containing a list of sequence IDs.
I would like to extract the corresponding sequence for each sequence ID in the text file. I searched the net and found that the following command can do the job:
cut -c 2- EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA
I used it, but it gives me an error message like this:
[fai_fetch] Warning - Reference omp2352_c1_seq1 not found in FASTA file, returning empty sequence
xargs: samtools: terminated by signal 11
My FASTA file looks like this:
>comp904_c0_seq2 len=1452 path=[4675:0-87 7837:88-89 7839:90-96 6352:97-97 6299:98-1223 5917:1224-1227 5921:1228-1451]
TCTAATAAACCATCCATTCATTCACACCACCACTCTTGCTCATTGGAGACACCATGCACT
AATCAAAAAAAAAACCTTAATGTATATTAAAAAAAAATAATAGCGGGAGGAGGAAGAGAC
GGGAGGGGCTTTTCTATCAAGCTTTTCTCATAAATAATACCGAAAGTACGATCAAATTTT
TCTCATTGTTTTTCTTGCTGTGATAGAAAAAGATGCGCGCGTGCCGAACGAGGGGAAACG
GGGAAACAAGAGAGAACGAAGAAAAGGGAACAAGGATGAAAAATAAATCGCTCATCGTAC
AACACGAGAGGACAAAAGGGTCGTAAATAGTAATAGTAGTAATAGTAAGTAAGTAAATAG
GAGGAGAGGCTCTCTCATTCATACAAATTGAGACAAACAGATACAATATAGAGTTACGGC
TAAGATAATTATAATAATACAGTATGTACTGGGATCGAGACGGTTCCCGTCTCTGATTCA
AATGTCTCTGTGGGCCTGGCCCGCACTACACAGCACTGTTAGAGTTCGTAATTTTTATTT
TATTTTTATTTTTTTTTAATAATGTGTGGAGATGATTGTGTCGTTCGCAAGTTACGACAC
GTCCTGGTCCTCTACGTCCTGGAGTTGTTCTTTGGGTTCCGCTTCTCCGTCCCCTTGCAT
ATCACTGGTCCACAGCGTCAGGTTGTCGCGCAACAGTTGCATGATCAACGTGCTGTCTTT
ATAGCTCTCCTCGGATAAAGTGTCTAGTTCTGCAATGGCGTCGTCGAAGGCGGCTTTCGC
TAGACGACAGGCCCTGTCCGGACTGTTCAATATTTCGTAATAGAATACGGAGAAATTAAG
GGCCAATCCGAGCCTGATCGGATGAGTGGGTGGCAGTTCTGTCATTGCGATATCACTGGC
TGATTTGTAAGCGACCAACGAATGTTCAGCGGCGTCTTTACGGTCGTTGCCTGTCGCGAA
TTCGGCCAGATATCGGTGGTAGTCACCCTTCATTTTGTAATAAAACACTTTAGATTCGCC
CGTGGAAGCCGCCGGGATAAGATGTTTGTCCAAAACGGCCAATATATCCGAACAGATGTC
CCTCAACTCCTTCTCCACCTGCGCCCGGTACTGCCTTATCATCTCCAACTTGTCGTCCGT
CCCTTTGCTCTCTTCTTTCTGTTCGATCGAGGAGATTATACGCCAAGAAGCCCTCCGTGC
ACCTATCACGTTCTTGTAAGCGACAGAAAGAAGGTTTCGCTCCTCCACGGTCAGCTCCAG
GTCCAGCTTGGCCACCTTCTTCATAGCGTCCACCATTTCATCGTAACGTTCGGCCTGCTC
CGCCAGTTTCGCCTTGTAGACGTTATCCTCCCGTTCAGACATATTGCGATGATTACCGAT
TGAAACTCCACAACAACACTTTATTCACTCGAGACACTCCGCGTACTGCCAATATGGCCG
CGCCCGAGATCG
My ID list file looks like this:
comp2352_c1_seq1
comp3842_c0_seq2
comp3842_c0_seq4
comp3842_c0_seq6
comp2145_c1_seq3
comp2145_c1_seq4
comp5304_c1_seq5
comp5696_c0_seq18
comp5237_c0_seq5
comp5237_c0_seq7
comp7076_c0_seq2
I am new to this kind of work. Could anyone give me some suggestions?
How can I make samtools work? Should I change the format of the sequence IDs in both files?
E.g., for the first file (the FASTA file), should I keep just the >comp904_c0_seq2 part and erase the rest of the header?
For the ID.txt file, should I add > before each sequence ID?
If that works, how can I do it? There are many sequences in the FASTA file, so it would be hard to do one by one.
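The error is because your IDs file has no '>' at the start of each line, so cut -c 2- removes the first letter of every ID instead (comp2352_c1_seq1 becomes omp2352_c1_seq1, which is what the warning reports). Instead of the cut, just use a cat.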
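To see why the warning mentions omp2352_c1_seq1 rather than comp2352_c1_seq1, here is a small sketch of what cut -c 2- does to an ID list that has no leading '>'. The two-line ID file is a made-up stand-in for EXAMPLE.TXT, and the samtools call is left as a comment because it needs the real FASTA file.

```shell
# Hypothetical two-line ID file in the same format as the question (no leading '>').
ids_file=$(mktemp)
printf 'comp2352_c1_seq1\ncomp3842_c0_seq2\n' > "$ids_file"

# cut -c 2- keeps characters 2 onward on each line, so the leading 'c' is lost.
stripped=$(cut -c 2- "$ids_file")
printf '%s\n' "$stripped"   # first line printed: omp2352_c1_seq1

# cat passes the IDs through unchanged.
intact=$(cat "$ids_file")
printf '%s\n' "$intact"     # first line printed: comp2352_c1_seq1

# With the real files, the lookup would then be (not run here):
#   cat EXAMPLE.TXT | xargs -n 1 samtools faidx EXAMPLE.FA

rm -f "$ids_file"
```

As far as I know, samtools faidx matches on the header text up to the first whitespace, so the len=... path=... part of the FASTA headers should not need to be removed.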
Thanks @RamRS, yeah, after adding ">" it worked fine.
You could use sed:
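For instance, something along these lines might do it. This is only a sketch of the kind of sed commands that cover the two edits asked about in the question, not necessarily the exact command meant here. The file names and their tiny contents are hypothetical stand-ins for ID.txt and EXAMPLE.FA.

```shell
# Tiny stand-in files (hypothetical); the real ones would be ID.txt and EXAMPLE.FA.
workdir=$(mktemp -d)
printf 'comp2352_c1_seq1\ncomp3842_c0_seq2\n' > "$workdir/ID.txt"
printf '>comp904_c0_seq2 len=1452 path=[4675:0-87]\nTCTAATAAACC\n' > "$workdir/EXAMPLE.FA"

# Prepend '>' to every line of the ID list.
with_prefix=$(sed 's/^/>/' "$workdir/ID.txt")
printf '%s\n' "$with_prefix"

# On header lines only (those starting with '>'), delete everything
# from the first whitespace onward; sequence lines are left untouched.
short_headers=$(sed '/^>/ s/[[:space:]].*//' "$workdir/EXAMPLE.FA")
printf '%s\n' "$short_headers"

rm -rf "$workdir"
```

Redirect the output to a new file (e.g. sed 's/^/>/' ID.txt > ID_with_prefix.txt, where the output name is just an example) rather than overwriting the original.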