Question

Extract fasta sequences from main list using a list of titles

0


7.8 years ago

sbchua.1990 ▴ 50

Hi, I have a file (file 1) containing more than 100k assembled sequences in FASTA format from an RNA-seq experiment. I ran a differential expression analysis with DESeq, and the output (file 2) contains only the identifiers (~5k, no sequences) and the expression profiles. I want to run BLAST+ on the differentially expressed sequences.

How can I extract the sequences of those differentially expressed transcripts from the main file?

Sample of file 1:

>TRINITY_DN0_c0_g1_i1 len=307 path=[1:0-195 309:196-219 198:220-306] [-1, 1, 309, 198, -2]
GATGGCTACTTGTGATTCCTCTGAAGATATGTCCGTGAAGGCCGACCAAACACTTTATTG
GCTAACTTCTGGGGATAGAAATATGACAAGTATCTAGCCAAGTCAAGCAAGGAAGCGTCC
GACGAGATTCAATAGATGCAATCAAGAAATATGACCACGTAGCCTTGCTCCCCAGCCTTG
AAGCCGCCACGATTGATAAACAGATTTCCTCAGTCAAGATTTATCTCGGGGGCCCAACCA
GCCACCAGCCAGCGAACCTGGTGTTTCTCCACACGCAGAGTAGGACACTCCCCTTTCTCA
ACCCCCC
>TRINITY_DN10_c0_g1_i1 len=242 path=[1:0-52 31:53-241] [-1, 1, 31, -2]
CTGTGGGTGGTGGAAGGTCCAGCTCCGGCGGGTACAAATGTTTCGTGGTATGTTATGTTC
TTGAGATGATGGGTCCAGATCAAGAAGGAACCTCGAGCCGTACTACGCCCAAGGTAGTCA
TGCCTGAGATGGATTCGTCAATAAACAATTTCCTCATTGGTCAGTCAACGGTTTACCCGG
TATTTTTCACGACGAGGCGTAAATCGTGGCTCGACGCGAATTTGACCGCAGCTTGGATGT
CC

Sample of file 2:

>sampleA    sampleB baseMeanA   baseMeanB   baseMean    log2FoldChange  lfcSE   stat    pvalue  padj    cond_A_rep1 cond_A_rep2 cond_A_rep3 cond_B_rep1 cond_B_rep2 cond_B_rep3 cond_C_rep1 cond_C_rep2 cond_C_rep3
TRINITY_DN10404_c0_g4_i1    cond_A  cond_B  6096.14223224768    85.246531387864 3090.69438181777    6.14425950079493    0.153685401193497   39.9794609837992    0   0   69.396  63.534  64.984  0.882   0.759   0.730   0.590   0.465   0.380
TRINITY_DN13442_c0_g3_i1    cond_A  cond_B  18162.5464494724    75.5054910306573    9119.02597025152    7.90260390907864    0.192771903141031   40.9945836520435    0   0   267.322 310.844 274.750 0.872   0.958   1.222   1.583   0.671   0.951

I managed to pull the identifiers out of file 2, but I still cannot figure out how to extract the corresponding sequences from file 1.

sequence gene next-gen • 1.8k views

updated 7.8 years ago by Matteo Schiavinato ★ 3.6k • written 7.8 years ago by sbchua.1990 ▴ 50

3


How To Extract A Sequence From A Big (6Gb) Multifasta File ?
Extract Sequence From Fasta File Using Ids From A Separate File
Extract A Group Of Fasta Sequences From A File

etc. This question has been asked at least 100 times. Did you search before posting?

7.8 years ago by 5heikki 11k

0


I added markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button, which is in your toolbar when you compose or edit a post.

In addition, I modified this post to a 'Question' because that's what it is.

7.8 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2017-04-12

0


7.8 years ago

Matteo Schiavinato ★ 3.6k

I have my own tool for this (like 99% of bioinformaticians, I guess):

https://github.com/MatteoSchiavinato/Utilities/blob/master/select-sequences-from-filename.py
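
For anyone who just wants the general idea: a common approach is to load the identifiers into a set and stream through the FASTA file once, keeping every record whose header ID is in the set. Below is a minimal Python sketch of that idea; it is not the linked script. It assumes the ID is the first whitespace-delimited word after `>` in file 1 and the first column in file 2, and that file 2 starts with a header line. The function names and the file names in the usage comment are made up for illustration.

```python
import sys


def read_ids(table_path, skip_header=True):
    """Return the set of IDs found in the first column of a whitespace-delimited table."""
    ids = set()
    with open(table_path) as handle:
        if skip_header:
            # Assumption: the first line is a column header, not a data row.
            next(handle, None)
        for line in handle:
            fields = line.split()
            if fields:  # ignore blank lines
                ids.add(fields[0])
    return ids


def filter_fasta(fasta_path, wanted, out):
    """Copy FASTA records whose ID is in `wanted` to `out`; return how many were written."""
    keep = False
    n_written = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                # The ID is everything between '>' and the first whitespace.
                parts = line[1:].split()
                keep = bool(parts) and parts[0] in wanted
                if keep:
                    n_written += 1
            if keep:
                # Header and sequence lines of a wanted record are written unchanged.
                out.write(line)
    return n_written


# Hypothetical usage: python extract_de.py file1.fasta file2.tsv > de.fasta
if __name__ == "__main__" and len(sys.argv) == 3:
    n = filter_fasta(sys.argv[1], read_ids(sys.argv[2]), sys.stdout)
    print(f"wrote {n} records", file=sys.stderr)
```

The set lookup keeps this to one pass over the FASTA file, so 100k sequences against 5k IDs should not be a problem. If you already have the IDs in a plain one-per-line file with no header, call `read_ids(path, skip_header=False)` instead.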
