Question

How to remove the tail of each header in a multi-FASTA file with sed or another tool

0

10.1 years ago

tremblayemilie9 • 0

Hi!

I have a multi-FASTA file with read headers such as:

>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26_Run_2_for_its_oom_and_phyg_run2withbarcode.fastq_VG6RM_00181_00132
CCTGCGGAAGGATCATTAATGAAAATGTGTTGCCGGGGCCCATAATCCCGGCACTAACCTTCTTATCCATAACACCTGTGCACTGTTGGATGCTTGCATCCACTTTTATACTAAACAATTTGTAACAAATGTAGTCTTATTATAATTAATAAAACTTTTAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGC
>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26-Run_2_for_its_oom_and_phyg_run2withbarcode.fastq_VG6RM_00171_00907
CCTGCGGAAGGATCATTACCGAGTTAGGGTCCTCTGGGGCCGAACCTCCCAACCCTGTGTCTATTGTTACCTTTTAGTTGCTTCGGCGGGCCGGCCGTCCTGACCAACTGGTCTCGCCGGCCGCCGGTCGTGGGTCTCCACGA

Now I would like to remove the tail of each header, the part that holds the sequence ID. I do not know how to do this when the tail is different for every read. I thought of something like this:

sed s'/^.fastq/s/[^ ]* //'g

but it does not work for some reason.

I would like to get something like this:

>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26_Run_2_for_its_oom_and_phyg_run2withbarcode.fastq
CCTGCGGAAGGATCATTAATGAAAATGTGTTGCCGGGGCCCATAATCCCGGCACTAACCTTCTTATCCATAACACCTGTGCACTGTTGGATGCTTGCATCCACTTTTATACTAAACAATTTGTAACAAATGTAGTCTTATTATAATTAATAAAACTTTTAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGC
>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26-Run_2_for_its_oom_and_phyg_run2withbarcode.fastq
CCTGCGGAAGGATCATTACCGAGTTAGGGTCCTCTGGGGCCGAACCTCCCAACCCTGTGTCTATTGTTACCTTTTAGTTGCTTCGGCGGGCCGGCCGTCCTGACCAACTGGTCTCGCCGGCCGCCGGTCGTGGGTCTCCACGA

sequence • 3.2k views

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by tremblayemilie9 • 0

0

Hi again,

I also have to remove that sequence number from another file, but in that case it sits in the middle of the header:

>barcodelabel= #ITS2_A_B10_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_00944_01804;size=52893;
CAGAACCAAGAGATCCGTTGTTGAAAGTTGTAACTATTATGTTTTTTCAGACGCTGATTGCAACTGCAAAGGGTTTGAAT
GTTGTCCAATCGGCGGGCGGACCCGCCGAGGAAACGAAGGTACTCAAAAGACATGGGTAAGAGGTAGCAGACCGAAGTCT
ACAAACTCTAGGTAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG
>barcodelabel= #ITS1F_A_B21_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_03245_02705;size=33771;
AAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTCATAATAAGTGTTTTATGGCACTTTTTAAATCCAT
ATCCACCTTGTGTGCAATGTCAGTCGGTCTTCTTTATGGAGATCGGCCAAACATCAACCTAATTTTTAACTCTTTGTCTG
AAAAATATTATGAATAAAATAATTCAAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAG
C

So I want to keep the size=52893 part but remove the 72JCK_00944_01804 part.

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 10.1 years ago by tremblayemilie9 • 0

0

You might want to spend more time on regular expressions; they come more easily with practice. As long as you don't overwrite the file, nothing should go wrong while experimenting.

In this case, you want to match something that starts after a fastq_ and ends before the next ;

Should be easy enough to do that from the answer in your other question on the forum.
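A minimal sketch of that pattern, assuming every header line begins with ">" and that the unwanted ID always sits between ".fastq_" and the next ";". The function name strip_mid_id and the shortened sample header are made up for illustration.

```shell
# Hypothetical helper: drop the "_ID" between ".fastq" and the next ";".
# [^;]* matches any run of characters other than ";", so the match stops
# at the first semicolon and the ";size=...;" part is kept.
# The /^>/ address restricts the substitution to header lines.
strip_mid_id() {
  sed '/^>/ s/\.fastq_[^;]*;/.fastq;/' "$@"
}

# Shortened, made-up record in the same shape as the headers above.
printf '%s\n' \
  '>barcodelabel= #ITS2_A_B10_Ana140224.fastq_72JCK_00944_01804;size=52893;' \
  'CAGAACCAAGAGATCCGTTG' | strip_mid_id
```

With this sample the header should come out as ">barcodelabel= #ITS2_A_B10_Ana140224.fastq;size=52893;" and the sequence line is left untouched. The function also accepts file names, so redirecting its output to a new file keeps the original intact.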

ADD REPLY • link 2.8 years ago by Ram 44k

0

Hey, I want to remove every header from a multi-FASTA file except the first one. Is that possible?

ADD REPLY • link 4.8 years ago by zhamouda • 0

0

This is not an answer to the top-level question and hence must not be added as an answer. I'm moving it to a comment.

Please open a new post describing your exact problem as well as what you've tried in your efforts to solve that problem.

ADD REPLY • link 4.8 years ago by Ram 44k

Ram · Accepted Answer · 2014-11-07

4

10.1 years ago

dariober 15k

What about:

sed 's/fastq_.*/fastq/' myseq.fa

This assumes the string "fastq_" occurs only once, at the end of the sequence name; everything from that "_" onward is stripped.
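If you would rather not rely on that assumption for the sequence lines, one possible variant restricts the substitution to header lines and anchors on the literal ".fastq_". This is only a sketch: the function name trim_fastq_tail and the shortened sample headers are invented for illustration.

```shell
# Hypothetical helper: on header lines only (/^>/), replace ".fastq_" and
# everything after it with ".fastq". The dot is escaped so it matches a
# literal "." rather than any character.
trim_fastq_tail() {
  sed '/^>/ s/\.fastq_.*$/.fastq/' "$@"
}

# Shortened, made-up records in the same shape as the question's headers.
printf '%s\n' \
  '>ITS1F_A_B10_run2withbarcode.fastq_VG6RM_00181_00132' \
  'CCTGCGGAAGGATCATTAATG' \
  '>ITS1F_A_B10_run2withbarcode.fastq_VG6RM_00171_00907' \
  'CCTGCGGAAGGATCATTACCG' | trim_fastq_tail
```

On a real file, something like trim_fastq_tail myseq.fa > myseq.trimmed.fa (the output name is just an example) writes the result to a new file and leaves the original untouched.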

ADD COMMENT • link updated 4.8 years ago by Ram 44k • written 10.1 years ago by dariober 15k