Extract fasta sequences from main list using a list of titles
7.7 years ago
sbchua.1990 ▴ 50

Hi, I have a file (file 1) containing more than 100 k assembled sequences in FASTA format from an RNA-seq experiment. I ran a differential expression analysis using DESeq, and the output (file 2) contains only identifiers (~5 k, no sequences) and the expression profiles. I want to run BLAST+ on the differentially expressed sequences.

How can I extract the sequences of those differentially expressed transcripts from the main list?

Samples for file 1:

>TRINITY_DN0_c0_g1_i1 len=307 path=[1:0-195 309:196-219 198:220-306] [-1, 1, 309, 198, -2]
GATGGCTACTTGTGATTCCTCTGAAGATATGTCCGTGAAGGCCGACCAAACACTTTATTG
GCTAACTTCTGGGGATAGAAATATGACAAGTATCTAGCCAAGTCAAGCAAGGAAGCGTCC
GACGAGATTCAATAGATGCAATCAAGAAATATGACCACGTAGCCTTGCTCCCCAGCCTTG
AAGCCGCCACGATTGATAAACAGATTTCCTCAGTCAAGATTTATCTCGGGGGCCCAACCA
GCCACCAGCCAGCGAACCTGGTGTTTCTCCACACGCAGAGTAGGACACTCCCCTTTCTCA
ACCCCCC
>TRINITY_DN10_c0_g1_i1 len=242 path=[1:0-52 31:53-241] [-1, 1, 31, -2]
CTGTGGGTGGTGGAAGGTCCAGCTCCGGCGGGTACAAATGTTTCGTGGTATGTTATGTTC
TTGAGATGATGGGTCCAGATCAAGAAGGAACCTCGAGCCGTACTACGCCCAAGGTAGTCA
TGCCTGAGATGGATTCGTCAATAAACAATTTCCTCATTGGTCAGTCAACGGTTTACCCGG
TATTTTTCACGACGAGGCGTAAATCGTGGCTCGACGCGAATTTGACCGCAGCTTGGATGT
CC

Samples for file 2:

>sampleA    sampleB baseMeanA   baseMeanB   baseMean    log2FoldChange  lfcSE   stat    pvalue  padj    cond_A_rep1 cond_A_rep2 cond_A_rep3 cond_B_rep1 cond_B_rep2 cond_B_rep3 cond_C_rep1 cond_C_rep2 cond_C_rep3
TRINITY_DN10404_c0_g4_i1    cond_A  cond_B  6096.14223224768    85.246531387864 3090.69438181777    6.14425950079493    0.153685401193497   39.9794609837992    0   0   69.396  63.534  64.984  0.882   0.759   0.730   0.590   0.465   0.380
TRINITY_DN13442_c0_g3_i1    cond_A  cond_B  18162.5464494724    75.5054910306573    9119.02597025152    7.90260390907864    0.192771903141031   40.9945836520435    0   0   267.322 310.844 274.750 0.872   0.958   1.222   1.583   0.671   0.951

I managed to fish out the identifiers from file 2, but I still cannot figure out how to extract the corresponding sequences from file 1.

sequence gene next-gen • 1.7k views

I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

In addition, I modified this post to a 'Question' because that's what it is.

7.7 years ago

I have my own tool for this (like 99% of the bioinformaticians I guess):

https://github.com/MatteoSchiavinato/Utilities/blob/master/select-sequences-from-filename.py
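
For readers who would rather not depend on an external script, the same idea can be sketched in a few lines of plain Python. This is not the code from the linked repository, only a minimal illustration under some assumptions: the identifier is the first whitespace-separated field on each line of file 2, it matches the text between `>` and the first space in each FASTA header of file 1, and the function names `read_ids` and `filter_fasta` are made up for this example.

```python
import io


def read_ids(handle):
    """Collect the first whitespace-separated field of every non-empty line.

    A leading '>' is stripped. A table header line also contributes its
    first field, which is harmless as long as no sequence carries that name.
    """
    ids = set()
    for line in handle:
        fields = line.split()
        if fields:
            ids.add(fields[0].lstrip(">"))
    return ids


def filter_fasta(fasta_handle, wanted_ids):
    """Yield the lines of every FASTA record whose ID is in wanted_ids."""
    keep = False
    for line in fasta_handle:
        if line.startswith(">"):
            # The ID is the text between '>' and the first whitespace.
            parts = line[1:].split(None, 1)
            keep = bool(parts) and parts[0] in wanted_ids
        if keep:
            yield line


# Small in-memory demonstration with made-up records.
demo_fasta = io.StringIO(
    ">seq1 len=10\nACGTACGT\nAC\n>seq2 len=4\nGGCC\n>seq10 len=4\nTTAA\n"
)
demo_ids = io.StringIO("seq1\tcond_A\tcond_B\nseq10\tcond_A\tcond_B\n")
selected = "".join(filter_fasta(demo_fasta, read_ids(demo_ids)))
print(selected, end="")
```

With real data, the two `io.StringIO` objects would be replaced by `open(...)` handles on file 2 and file 1, and the yielded lines written to a new FASTA file. Matching on the exact ID, rather than on a substring of the header, keeps `seq1` from also pulling in `seq10`.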
