Question

Filter files based on number in the filenames

4.1 years ago

vanessagpds ▴ 10

Hi everyone,

I am new to bioinformatics and I have the following question to resolve:

In one of the projects, the genome assembler stored each contig in an individual FASTA file. The files follow the name pattern “Contig_[number]_cov_[number].fasta”.

The “cov” information refers to the coverage obtained for that contig. How would you go about summarizing this data? How would you go about obtaining only the contigs with coverage greater than 500, and storing them in a single multi-FASTA file?

regex bash • 815 views

updated 4.1 years ago by zx8754 12k • written 4.1 years ago by vanessagpds ▴ 10

Are these contigs from SPAdes output? If so, just be aware that coverage in that case refers to k-mer coverage, not 'sequencing depth'. This may also apply in other cases that I'm not aware of.

4.1 years ago by Joe 21k

score 2 · Answer 1 · 2020-11-02

This might give you what you're looking for, but I'm not sure!

ls *.fasta | sed 's/.*_//' | sed 's/\.fasta$//' | awk '$1 > 500' | sort -u | while read -r line ; do cat *_"${line}".fasta >> cov_above_500.fasta ; done

or, if you want to be more specific:

ls Contig_*_cov_*.fasta | sed 's/.*_//' | sed 's/\.fasta$//' | awk '$1 > 500' | sort -u | while read -r line ; do cat Contig_*_cov_"${line}".fasta >> cov_above_500.fasta ; done

Basically, the first sed removes everything up to and including the last underscore (.*_). Because there is an underscore right before the coverage value and none after it, you can use it as a marker. The second sed deletes what follows the number, which is ".fasta". awk then keeps only the values above 500, and the loop concatenates the corresponding files. When mapping the numbers back to the file names, make sure to place *_ before the number and .fasta after it (as in the commands); otherwise you would also pick up 6005 and 5600 when searching for 600.
If two contigs both have a coverage of 600, both are selected, which is fine. The value 600 then appears twice in the list, though, so the values should be deduplicated before the loop (for example with sort -u); otherwise each of those files is concatenated twice.
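
For what it's worth, the same idea can be sketched without parsing ls output, by looping over the files and cutting the coverage out of each name with shell parameter expansion. This is only a sketch under the assumption that every file is named Contig_<id>_cov_<coverage>.fasta as in the question; the function names filter_contigs_by_cov and summarize_cov, the default threshold and the default output name are made up for illustration. The second function is a rough take on the "summarizing" part of the question.

```shell
#!/usr/bin/env bash
# Sketch only: assumes file names of the form Contig_<id>_cov_<coverage>.fasta.
# Function names, default threshold and default output name are hypothetical.

# Concatenate every contig whose filename coverage exceeds the threshold.
filter_contigs_by_cov() {
    local threshold="${1:-500}"
    local out="${2:-cov_above_500.fasta}"
    local f cov

    : > "$out"    # start from an empty output file so reruns do not append twice

    for f in Contig_*_cov_*.fasta; do
        [ -e "$f" ] || continue      # the glob matched nothing
        cov="${f##*_cov_}"           # drop everything up to and including "_cov_"
        cov="${cov%.fasta}"          # drop the ".fasta" suffix
        # awk does the comparison so decimal coverage values also work
        if awk -v c="$cov" -v t="$threshold" 'BEGIN { exit !(c + 0 > t + 0) }'; then
            cat "$f" >> "$out"
        fi
    done
}

# Print the number of contigs, the mean coverage and how many exceed the threshold.
summarize_cov() {
    local f
    for f in Contig_*_cov_*.fasta; do
        [ -e "$f" ] || continue
        f="${f##*_cov_}"
        printf '%s\n' "${f%.fasta}"
    done | awk -v t="${1:-500}" '
        { n++; sum += $1; if ($1 > t) above++ }
        END { if (n) printf "contigs=%d mean_cov=%.2f above_threshold=%d\n", n, sum / n, above }'
}
```

Run from the directory holding the contig files, filter_contigs_by_cov 500 cov_above_500.fasta would write the multi-FASTA file and summarize_cov 500 would print a one-line summary. Each file is visited exactly once, so two contigs with the same coverage are each copied once.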