I have whole-genome sequence data (FASTA files) for Salmonella bacteria. On average the files are about 4.7 MB, but some are too big (around 7 MB) and others too small (around 500 KB). The oversized files likely contain extraneous sequence data, and the undersized ones likely cover only part of the organism's genome; either would skew my data.
I would like to keep only the files in the range 3.0-6.0 MB. I am running on a Linux server. Is there a way to do this?
Regards.
Addendum:
I would like to do core and accessory gene analysis on these bacterial genome sequences. I will be running Roary after annotation. Alternatively, given what I want to do, what quality checks should I run?
Ah, OK, not surprising that there are built-in functions for size filtering. #TIL +1
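For anyone who prefers a script, here is a minimal sketch of the 3.0-6.0 MB filter from the question in Python. It assumes 1 MB means 10^6 bytes, that the genomes are uncompressed files in a single directory, and that they use one of the listed extensions. The function name `fasta_files_in_size_range` and its parameters are made up for illustration. Keep in mind that on-disk size includes header lines and newlines, so it only approximates assembly length.

```python
from pathlib import Path

MB = 1_000_000  # assumption: 1 MB = 10^6 bytes (not 2^20)


def fasta_files_in_size_range(directory, min_bytes=3 * MB, max_bytes=6 * MB,
                              suffixes=(".fasta", ".fa", ".fna")):
    """Return FASTA files in `directory` whose on-disk size lies in
    [min_bytes, max_bytes], sorted by path.

    Hypothetical helper: the name, defaults, and suffix list are
    illustrative assumptions, not part of any existing tool.
    """
    kept = []
    for path in sorted(Path(directory).iterdir()):
        # Skip subdirectories and files without a FASTA-like extension.
        if not path.is_file() or path.suffix.lower() not in suffixes:
            continue
        size = path.stat().st_size  # file size in bytes
        if min_bytes <= size <= max_bytes:
            kept.append(path)
    return kept
```

Calling `fasta_files_in_size_range("genomes/")` (the directory name is a placeholder) returns the paths to keep. You could then copy or symlink those into a separate folder before annotation, which leaves the original files untouched.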