How to remove the tail of the headers in a multi-FASTA file with sed or another tool
10.1 years ago

Hi!

I have a multi-FASTA file with read headers such as:

>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26_Run_2_for_its_oom_and_phyg_run2withbarcode.fastq_VG6RM_00181_00132
CCTGCGGAAGGATCATTAATGAAAATGTGTTGCCGGGGCCCATAATCCCGGCACTAACCTTCTTATCCATAACACCTGTGCACTGTTGGATGCTTGCATCCACTTTTATACTAAACAATTTGTAACAAATGTAGTCTTATTATAATTAATAAAACTTTTAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGC
>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26-Run_2_for_its_oom_and_phyg_run2withbarcode.fastq_VG6RM_00171_00907
CCTGCGGAAGGATCATTACCGAGTTAGGGTCCTCTGGGGCCGAACCTCCCAACCCTGTGTCTATTGTTACCTTTTAGTTGCTTCGGCGGGCCGGCCGTCCTGACCAACTGGTCTCGCCGGCCGCCGGTCGTGGGTCTCCACGA

Now I would like to remove the tail part of my headers, which holds the sequence ID. I do not know how to do this when the tail is different for each read. I thought of something like this:

sed s'/^.fastq/s/[^ ]* //'g

but it does not work for some reason.

I would like to get something like this:

>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26_Run_2_for_its_oom_and_phyg_run2withbarcode.fastq
CCTGCGGAAGGATCATTAATGAAAATGTGTTGCCGGGGCCCATAATCCCGGCACTAACCTTCTTATCCATAACACCTGTGCACTGTTGGATGCTTGCATCCACTTTTATACTAAACAATTTGTAACAAATGTAGTCTTATTATAATTAATAAAACTTTTAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAGC
>ITS1F_A_B10_R_2014_04_24_15_26_33_user_SN2-26-Run_2_for_its_oom_and_phyg_run2withbarcode.fastq
CCTGCGGAAGGATCATTACCGAGTTAGGGTCCTCTGGGGCCGAACCTCCCAACCCTGTGTCTATTGTTACCTTTTAGTTGCTTCGGCGGGCCGGCCGTCCTGACCAACTGGTCTCGCCGGCCGCCGGTCGTGGGTCTCCACGA

Hi again,

I also have to remove that sequence number from another file, but in that case it sits in the middle of the header:

>barcodelabel= #ITS2_A_B10_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_00944_01804;size=52893;
CAGAACCAAGAGATCCGTTGTTGAAAGTTGTAACTATTATGTTTTTTCAGACGCTGATTGCAACTGCAAAGGGTTTGAAT
GTTGTCCAATCGGCGGGCGGACCCGCCGAGGAAACGAAGGTACTCAAAAGACATGGGTAAGAGGTAGCAGACCGAAGTCT
ACAAACTCTAGGTAATGATCCTTCCGCAGGTTCACCTACGGAAACCTTGTTACGACTTTTACTTCCTCTAAATGACCAAG
>barcodelabel= #ITS1F_A_B21_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_03245_02705;size=33771;
AAGTCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTCATAATAAGTGTTTTATGGCACTTTTTAAATCCAT
ATCCACCTTGTGTGCAATGTCAGTCGGTCTTCTTTATGGAGATCGGCCAAACATCAACCTAATTTTTAACTCTTTGTCTG
AAAAATATTATGAATAAAATAATTCAAAATACAACTTTCAACAACGGATCTCTTGGCTCTCGCATCGATGAAGAACGCAG
C

So I want to keep the size=52893 part but remove the 72JCK_00944_01804 part.


You might wanna start working on regular expressions more. These come best when you practice a bit. As long as you don't overwrite the file, nothing should go wrong in experimentation.

In this case, you wanna match something that starts after a fastq_ and ends before the next ;

Should be easy enough to do that from the answer in your other question on the forum.
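As a rough sketch of that hint (not necessarily the only or best way), the negated character class `[^;]*` matches everything up to, but not including, the next `;`. The header below is copied from the question. The `/^>/` address, which limits the substitution to header lines, is an added precaution and assumes every header starts with `>`.

```shell
# Header taken from the question; the variable names are only for illustration.
header='>barcodelabel= #ITS2_A_B10_R_2014_02_19_15_00_39_user_SN2-19-FUNGI_OOMYCETE-EMVSAMPLES_et_2014-02-19_RUN1_Fungi_oomycete_Run1_Ana140224.fastq_72JCK_00944_01804;size=52893;'

# On header lines only, replace ".fastq_" plus any run of non-";" characters
# with ".fastq", so the ";size=...;" part is left in place.
cleaned=$(printf '%s\n' "$header" | sed '/^>/ s/\.fastq_[^;]*/.fastq/')
printf '%s\n' "$cleaned"
```

On a real file, the same sed expression would be given the file name as its argument, with the output redirected to a new file so that the original is not overwritten.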


Hey, I want to remove the headers from a multi-FASTA file, except for the first one. Is that possible?


This is not an answer to the top-level question and hence must not be added as an answer. I'm moving it to a comment.

Please open a new post describing your exact problem as well as what you've tried in your efforts to solve that problem.

10.1 years ago

What about:

sed 's/fastq_.*/fastq/' myseq.fa

This assumes the string "fastq_" occurs only once per header, right before the part you want to drop. Everything from that "_" to the end of the line will be stripped.
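A slightly more defensive variant, offered as a sketch rather than a definitive fix, restricts the substitution to header lines and escapes the dot so that only a literal ".fastq_" matches. The two records below are shortened, hypothetical stand-ins for the real headers, and the variable names are only for illustration.

```shell
# Hypothetical, shortened records standing in for the real multi-FASTA file.
fasta='>sampleA_run2withbarcode.fastq_VG6RM_00181_00132
CCTGCGGAAGG
>sampleB_run2withbarcode.fastq_VG6RM_00171_00907
CCTGCGGAAGGATC'

# /^>/ applies the substitution to header lines only;
# \.fastq_.*$ matches the literal ".fastq_" and everything after it on that line.
trimmed=$(printf '%s\n' "$fasta" | sed '/^>/ s/\.fastq_.*$/.fastq/')
printf '%s\n' "$trimmed"
```

Sequence lines pass through unchanged because they do not start with ">".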
