Question

Parsing header of FASTA File

6.2 years ago

Shelle ▴ 30

I have many bacterial RefSeq FASTA files and want to parse their headers to see whether the word 'chromosome' appears in them. As a side note, the sequences in the FASTA files start with '>', so I want to parse all lines starting with '>'. I know that some of my files do not contain the word 'chromosome'. I would like to separate the files that have 'chromosome' in a header from the rest of the files. Is there a way to do so?

Any help would be appreciated.

FASTA sequence header Parse • 2.2k views

ADD COMMENT • link updated 6.2 years ago by ATpoint 85k • written 6.2 years ago by Shelle ▴ 30

If you want to get faster/better/more accurate answers it would really help if you show some examples of your data, and how these have to be "parsed".

ADD REPLY • link 6.2 years ago by WouterDeCoster 47k

I'm not sure what exactly the aim of filtering the genomes by the word 'chromosome' is.

To my knowledge, the word 'chromosome' in the header doesn't tell you anything about that assembly specifically.

ADD REPLY • link 6.2 years ago by Joe 21k

Assuming that the FASTA is linearized (i.e. each sequence is on a single line after its header):

sed -n '/>/p' test.fa | grep -vw chromosome | grep --no-group-separator -Fx -f - -A 1 test.fa

should give you all the FASTA records without 'chromosome' in the header.

sed -n '/>/p' test.fa | grep -w chromosome | grep --no-group-separator -Fx -f - -A 1 test.fa

should give you all the FASTA records with 'chromosome' in the header. The -Fx options make grep treat each header as a literal whole line, so characters such as '[' or '.' in a header are not interpreted as regex syntax.
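
If the FASTA is not linearized, the `-A 1` approach would drop every sequence line after the first. A minimal awk sketch that avoids this is shown here. It assumes the file starts with a header line and that a plain substring match on 'chromosome' is good enough (unlike `grep -w`, it would also match 'chromosomes'). The function name `fasta_filter` and the file names are made up for illustration.

```shell
# fasta_filter MODE FILE   (hypothetical helper, not from the thread)
#   MODE=with    -> print records whose header contains "chromosome"
#   MODE=without -> print all other records
# The decision made on a header line applies to every following sequence
# line until the next header, so multi-line FASTA records stay intact.
fasta_filter() {
  awk -v mode="$1" '
    /^>/ { has = ($0 ~ /chromosome/) }   # re-evaluate at each header
    (mode == "with" && has) || (mode == "without" && !has)
  ' "$2"
}

# Example usage (test.fa is a placeholder file name):
# fasta_filter with    test.fa > with_chromosome.fa
# fasta_filter without test.fa > without_chromosome.fa
```

This reads the file only once and does not depend on GNU-specific grep options.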

ADD REPLY • link 6.2 years ago by cpad0112 21k

score 0 · Answer 1 · 2018-09-08

6.2 years ago

ATpoint 85k

You can do it with this one-liner:

grep '>chromosome' *.fasta | awk -F ":" '{print $1}' | tee haveChr.txt | diff /dev/stdin <(ls *.fasta) | awk -F "> " '{print $2}' | awk NF > haveNOChr.txt

By the way, fasta headers must start with '>'. Do you have some that do not start with it?

ADD COMMENT • link 6.2 years ago by ATpoint 85k

I tried this command and all the files go to haveNOChr.txt, which is not correct, as I have files with headers (first lines) such as the following examples:

>NZ_LS483492.1 Serratia rubidaea strain NCTC10848 genome assembly, chromosome: 1
>NC_013791.2 Bacillus pseudofirmus OF4, complete genome  
>NZ_CP016324.1 Vibrio cholerae 2740-80 chromosome 1, complete sequence

I have gone through the whole file for the second example and didn't see any line starting with '>' that includes 'chromosome'. I am not sure why this one-liner doesn't put at least this file in haveNOChr.txt.

ADD REPLY • link 6.2 years ago by Shelle ▴ 30

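Ok, I see. The pattern '>chromosome' only matches headers where the word comes directly after '>', which is not the case in your files. In this case, simply grep for 'chromosome' instead of '>chromosome':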

grep 'chromosome' *.fasta | awk -F ":" '{print $1}' | tee haveChr.txt | diff /dev/stdin <(ls *.fasta) | awk -F "> " '{print $2}' | awk NF > haveNOChr.txt
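
Note that plain `grep` prints one line per matching header, so a file with several 'chromosome' records is listed more than once by the first step. A simpler alternative is to test each file once and write its name to one of the two lists. This is a rough sketch: the function name `classify_fasta` is hypothetical, the output file names follow the one-liner above, and the `^>` anchor is my assumption that only header lines should count.

```shell
# classify_fasta WORD FILE...   (hypothetical helper, not from the thread)
# Writes the names of files with at least one header line containing WORD
# to haveChr.txt, and all other file names to haveNOChr.txt.
classify_fasta() {
  word=$1
  shift
  : > haveChr.txt      # start with empty output lists
  : > haveNOChr.txt
  for f in "$@"; do
    # -q: print nothing, only report via exit status whether a line matched;
    # "^>" restricts the match to header lines
    if grep -q "^>.*$word" "$f"; then
      printf '%s\n' "$f" >> haveChr.txt
    else
      printf '%s\n' "$f" >> haveNOChr.txt
    fi
  done
}

# Example usage, run in the directory holding the files:
# classify_fasta chromosome *.fasta
```

Each file name ends up in exactly one list, and the lists can then be used to move the files, for example with a `while read` loop.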