Count specific sequences in a .fasta file
2.2 years ago
x_ma_x • 0

This is probably a super-simple problem, but I must be searching it wrong because I can't find anything useful!

I basically have an NGS .fastq file with millions of reads from a PCR amplicon library. I want to check the coverage of my library against a reference .fasta of 100k sequences and count the frequencies of all those reference sequences. At this point I don't care about mutations, so this is a simple find-and-count problem (just at a rather large scale).

What's the simplest way of doing it? Or what should I be searching for?

frequency library fasta count

A Python3 program that counts sequence occurrences in raw FASTQ files

https://github.com/afombravo/2FAST2Q seems like it could do this.
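For context, here is a minimal sketch (not 2FAST2Q itself) of the kind of counting such a tool performs: tallying how often each exact read sequence occurs in a FASTQ file, assuming a plain-text file with standard 4-line records (file name is just an example):

```python
from collections import Counter

counts = Counter()
with open("reads.fastq") as fq:      # hypothetical file name; gunzip first if compressed
    for i, line in enumerate(fq):
        if i % 4 == 1:               # in 4-line FASTQ records, the sequence is the 2nd line
            counts[line.strip()] += 1

# print the ten most common read sequences
for seq, n in counts.most_common(10):
    print(n, seq)
```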

2.2 years ago

Map your reads to the reference fasta file, then count the mapping results.

You should be searching for: "NGS read mapping", "counting reads", "coverage plots", ...

More specifically, you can search for tools such as: STAR, HISAT2, BBMap, samtools, ...
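As a rough illustration of the counting step, here is a sketch using pysam, assuming the reads have already been mapped to the reference fasta with one of the tools above and saved as a BAM file (the file name is just an example):

```python
from collections import Counter
import pysam

counts = Counter()
with pysam.AlignmentFile("aligned.bam", "rb") as bam:   # hypothetical file name
    for read in bam.fetch(until_eof=True):
        # skip unmapped reads and secondary/supplementary alignments
        if read.is_unmapped or read.is_secondary or read.is_supplementary:
            continue
        counts[read.reference_name] += 1

# one count per reference sequence
for ref, n in counts.most_common():
    print(ref, n)
```

On a sorted and indexed BAM, samtools idxstats reports the same per-reference mapped-read counts directly.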


Though if your reference sequences are shorter than your reads, you'll probably want to trim the reads or pad the refs.


Sure thing; unfortunately my answer pre-dated the addition of that info.

2.2 years ago
GenoMax 147k

Are the 100K reference sequences unique, i.e. do they have no sequence overlap? What is the length of the sample sequences, and is it fixed?

If they are unique then you could use bbduk.sh in filter mode to find sequences that match each one of your references (you can search using the reference). You could also use clumpify.sh to count reads and reduce them to a single-sequence representation (and then search with your reference).

Guide for BBduk.sh: https://jgi.doe.gov/data-and-tools/software-tools/bbtools/bb-tools-user-guide/bbduk-guide/
Clumpify - Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.

Or you could simply use blat to search the two files against each other and parse the result file to get counts.
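If you go the blat route, here is a minimal parsing sketch for tallying hits per reference from the PSL output, assuming blat was run with -noHead so there are no header lines (the file name is just an example). The target name (the reference that was hit) is the 14th PSL column:

```python
from collections import Counter

counts = Counter()
with open("output.psl") as psl:          # hypothetical file name, produced with blat -noHead
    for line in psl:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 17:             # skip blank or malformed lines
            continue
        target_name = fields[13]         # tName column: the reference sequence hit by this read
        counts[target_name] += 1

for ref, n in counts.most_common():
    print(ref, n)
```

Note that one read can produce several alignment lines, so depending on how strict you want to be you may want to keep only the best hit per query before counting.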


The reference sequences are all 120bp, of which 87bp in the middle are unique to each sequence. The remaining 33bp are primer handles used to create the library. I can in theory trim those and leave only the unique sequences.

The actual sample fasta is messier, with sequences of different lengths that are in most cases longer than the reference. For this reason I'm not looking for exact occurrences of whole reference sequences, if that makes sense. In other words, if my reference is GGGG then TGGGGT should still be counted.


I can in theory trim those and leave only the unique sequences.

It may be best to do that and then use the unique sequences with bbduk.sh in filter mode. It would allow you to satisfy the following requirement:

if my reference is GGGG then TGGGGT should still be counted.
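For completeness, here is a rough pure-Python sketch of that containment-style counting, assuming the 33bp primer handles have been trimmed so every reference core is exactly 87bp, and the reads are in a plain-text 4-line FASTQ (all file names are examples). It slides an 87bp window along each read and looks it up in a dictionary of reference cores, so a read still counts for a reference even when the read is longer:

```python
from collections import Counter

CORE_LEN = 87  # length of the unique part of each reference, per the thread

# load the trimmed 87bp reference cores from a FASTA file
refs = {}
with open("ref_cores.fasta") as fa:                 # hypothetical file name
    name = None
    for line in fa:
        line = line.strip()
        if line.startswith(">"):
            name = line[1:].split()[0]
            refs[name] = ""
        elif name is not None:
            refs[name] += line

core_to_name = {seq: name for name, seq in refs.items()}
counts = Counter()

with open("reads.fastq") as fq:                     # hypothetical file name
    for i, line in enumerate(fq):
        if i % 4 != 1:                              # sequence lines only
            continue
        read = line.strip()
        # slide an 87bp window over the read; count each reference at most once per read
        seen = set()
        for start in range(len(read) - CORE_LEN + 1):
            hit = core_to_name.get(read[start:start + CORE_LEN])
            if hit is not None and hit not in seen:
                counts[hit] += 1
                seen.add(hit)

for ref, n in counts.most_common():
    print(ref, n)
```

This only catches exact 87bp matches; reads with sequencing errors or indels would need the mapping or bbduk.sh approaches above.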
