Question

Finding a specific gene in a fastq file without mapping the reads?

1

Entering edit mode

9.4 years ago

prostoesh ▴ 20

Hi! I have a .fastq file with reads and i need to check if there is a specific gene in them, without assembling reads into contigs. The wanted gene is in nucleotide sequence

Assembly gene alignment blast • 9.3k views

ADD COMMENT • link updated 9.4 years ago by Sukhi Singh 11k • written 9.4 years ago by prostoesh ▴ 20

score 5 · Answer 1 · 2016-04-07

5

Entering edit mode

9.4 years ago

Jorge Amigo 14k

the only way you can be sure which reads come from which region is actually by mapping them to the entire genome, and then extracting the reads that mapped on the gene of interest by chromosome positions.

you can save time by mapping all reads using only the gene sequence as a reference, because you can discard all the reads that do not map being almost sure that the reads that mapped would be the gene reads. the key is that almost, since you will lose the multiple mapping location information if you decide to use the gene as the mapping reference instead of the entire genome..

ADD COMMENT • link 9.4 years ago by Jorge Amigo 14k

3

Entering edit mode

A mixed solution is to map reads to the gene, and then re-map hitting reads to a reference genome. Faster this way. Mapping to a genome greatly reduces the chance of getting reads from close paralogs or pseudogenes.

ADD REPLY • link 9.4 years ago by lh3 33k

2

Entering edit mode

2 round mapping, outputting 1 single bam file containing all the reads's information? nice solution.

ADD REPLY • link 9.4 years ago by Jorge Amigo 14k

0

Entering edit mode

@lh3 and @JorgeAmigo: This might be a simple question, but how would this two-step process be accomplished (in terms of input and output files)? I have a gene of interest called vaca (vacuolating cytotoxin), which has about 3873 base pairs (https://www.ncbi.nlm.nih.gov/nuccore/NC_000915.1?report=fasta&from=938415&to=942287). I downloaded the fasta file from that webpage and called it gene.fasta. Then, I have my sample of interest (call them r1.fasta and r2.fasta) for which I would like to know whether they might contain the gene. My tentative plan would be to map the sample reads to the gene using the following code: bwa mem gene.fasta r1.fastq r2.fastq > r.sam. However, at this point, how would I know whether or not the gene is present in my sample?

ADD REPLY • link 6.0 years ago by LRStar ▴ 200

1

Entering edit mode

all mapped reads from r.sam would be that gene's reads.

# total reads
samtools view -c r.sam
# mapped reads
samtools view -cF4 r.sam

if you get a non 0 value from the -cF4 option, then those reads would be that gene's.

if you want to be sure that those reads are that gene's, and that gene's only, then follow lh3's idea:

# extract gene reads
samtools view -ubF4 r.sam > r.gene.bam
# map reads to whole genome
samtools sort -n r.gene.bam | samtools fastq - | bwa mem -p genome.fa - > r.genome.sam 
# count uniquely mapped reads
samtools view -cq1 r.genome.sam

those reads that map on the gene, and on the gene only, would be that gene's reads.

ADD REPLY • link 6.0 years ago by Jorge Amigo 14k

score 2 · Answer 2 · 2016-04-07

2

Entering edit mode

9.4 years ago

natasha.sernova ★ 4.0k

Hopefully you can easily transform *.fastq files to *.fa files

To do it you can use the following biostars-post: Fastq Convert To Fasta

Then make a database out of your fasta-reads and search inside the database with blastn, for example.

1) First you need to make a database from your nucleotide sequences.

To do this:

makeblastdb -in input_file (file-name of the contigs or whatever) -dbtype nucl (if nucleotide) -out dbname (the database name)

Use input file in *.fa-format.

2) Run the blast-program:

blastn -query input (with the gene file) -db (database name, which was created in step 1) -out outname (file name with the results)

So use blastn for this search in the nucleotide database. You said you know the sequence of needed gene in the genome?

Hopefully the size of the database will not be the bottle neck.

Good luck!

ADD COMMENT • link 9.4 years ago by natasha.sernova ★ 4.0k

0

Entering edit mode

fastq reads from NGS experiments are usually much shorter than a gene, so looking up a gene sequence against a database of short sequences doesn't make much sense. what it does make sense is to look up each read against the gene sequence, which is indeed mapping the reads using the gene as an ad hoc reference. plus mapping does also take into account phred qualities of the aligned bases, which helps a lot in the alignment process.

ADD REPLY • link 9.4 years ago by Jorge Amigo 14k

0

Entering edit mode

I see, I was wrong.Sorry!

ADD REPLY • link 9.4 years ago by natasha.sernova ★ 4.0k

1

Entering edit mode

but you rose an interesting point. the idea would be methodologically correct if the gene sequence is similar in length to the reads'. blast has been among us for several decades now and it's very useful indeed, but when you take into account the nature and the characteristics of the NGS data you have to overcome the mapping bottlenecks, like indexing the reference like the first generation of short-read mappers did, or even indexing also the reads like the second generation of short-read mappers did.

ADD REPLY • link 9.4 years ago by Jorge Amigo 14k

0

Entering edit mode

Thanks everyone, i'm trying all of your suggestions now, even the contradictory ones

ADD REPLY • link 9.4 years ago by prostoesh ▴ 20

score 0 · Answer 3 · 2016-04-07

0

Entering edit mode

9.4 years ago

mastal511 ★ 2.1k

That is usually done by assembling/aligning the reads.

ADD COMMENT • link 9.4 years ago by mastal511 ★ 2.1k