Hi, I have a file (file 1) containing more than 100 k assembled sequences in FASTA format from an RNA-seq experiment. I ran a differential expression analysis using DESeq, and the output (file 2) contains only the identifiers (~5 k, no sequences) and the expression profiles. I want to run BLAST+ on the differentially expressed sequences.
How can I extract the sequences of those differentially expressed transcripts from the main FASTA file?
Samples for file 1:
>TRINITY_DN0_c0_g1_i1 len=307 path=[1:0-195 309:196-219 198:220-306] [-1, 1, 309, 198, -2]
GATGGCTACTTGTGATTCCTCTGAAGATATGTCCGTGAAGGCCGACCAAACACTTTATTG
GCTAACTTCTGGGGATAGAAATATGACAAGTATCTAGCCAAGTCAAGCAAGGAAGCGTCC
GACGAGATTCAATAGATGCAATCAAGAAATATGACCACGTAGCCTTGCTCCCCAGCCTTG
AAGCCGCCACGATTGATAAACAGATTTCCTCAGTCAAGATTTATCTCGGGGGCCCAACCA
GCCACCAGCCAGCGAACCTGGTGTTTCTCCACACGCAGAGTAGGACACTCCCCTTTCTCA
ACCCCCC
>TRINITY_DN10_c0_g1_i1 len=242 path=[1:0-52 31:53-241] [-1, 1, 31, -2]
CTGTGGGTGGTGGAAGGTCCAGCTCCGGCGGGTACAAATGTTTCGTGGTATGTTATGTTC
TTGAGATGATGGGTCCAGATCAAGAAGGAACCTCGAGCCGTACTACGCCCAAGGTAGTCA
TGCCTGAGATGGATTCGTCAATAAACAATTTCCTCATTGGTCAGTCAACGGTTTACCCGG
TATTTTTCACGACGAGGCGTAAATCGTGGCTCGACGCGAATTTGACCGCAGCTTGGATGT
CC
Samples for file 2:
>sampleA sampleB baseMeanA baseMeanB baseMean log2FoldChange lfcSE stat pvalue padj cond_A_rep1 cond_A_rep2 cond_A_rep3 cond_B_rep1 cond_B_rep2 cond_B_rep3 cond_C_rep1 cond_C_rep2 cond_C_rep3
TRINITY_DN10404_c0_g4_i1 cond_A cond_B 6096.14223224768 85.246531387864 3090.69438181777 6.14425950079493 0.153685401193497 39.9794609837992 0 0 69.396 63.534 64.984 0.882 0.759 0.730 0.590 0.465 0.380
TRINITY_DN13442_c0_g3_i1 cond_A cond_B 18162.5464494724 75.5054910306573 9119.02597025152 7.90260390907864 0.192771903141031 40.9945836520435 0 0 267.322 310.844 274.750 0.872 0.958 1.222 1.583 0.671 0.951
I managed to pull the identifiers out of file 2, but I still cannot figure out how to extract the matching sequences from file 1.
How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File
Extract A Group Of Fasta Sequences From A File
etc. This question has been asked like at least 100 times. Did you search before posting?
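For what it's worth, here is a minimal Python sketch of one way to do this with only the standard library. It is not taken from any of the threads above. It assumes the transcript ID is the first whitespace-separated field of each row in file 2, that the first line of file 2 is a column header, and that the FASTA ID is the first token after `>`. The function names `read_ids`, `filter_fasta`, and `main` are made up for this example.

```python
import sys


def read_ids(lines, skip_header=True):
    """Collect the first whitespace-separated field of each row as an ID.

    Assumption: the first line is a column header and is skipped.
    """
    ids = set()
    for n, line in enumerate(lines):
        if skip_header and n == 0:
            continue
        fields = line.split()
        if fields:
            ids.add(fields[0])
    return ids


def filter_fasta(lines, wanted):
    """Yield the lines of every FASTA record whose ID is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # Header looks like ">TRINITY_DN0_c0_g1_i1 len=307 ...";
            # the ID is assumed to be the first token after ">".
            tokens = line[1:].split()
            keep = bool(tokens) and tokens[0] in wanted
        if keep:
            yield line


def main(table_path, fasta_path, out_path):
    """Write the records of fasta_path listed in table_path to out_path."""
    with open(table_path) as table:
        wanted = read_ids(table)
    # The FASTA file is streamed line by line, so it is never held in memory.
    with open(fasta_path) as fasta, open(out_path, "w") as out:
        out.writelines(filter_fasta(fasta, wanted))


if __name__ == "__main__" and len(sys.argv) == 4:
    main(sys.argv[1], sys.argv[2], sys.argv[3])
```

Saved under a hypothetical name such as `extract_de.py`, it would be run as `python extract_de.py file2.txt file1.fasta de_sequences.fasta`, and the resulting FASTA file can then be used as the BLAST+ query. Keeping the IDs in a set makes each header lookup cheap, which matters with more than 100 k records.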
I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.
In addition, I modified this post to a 'Question' because that's what it is.