Hello,
I am working with Illumina paired end reads. I am trying to use Blast to find read pairs in which one pair aligns to the nucleus and the other pair aligns to the mitochondria. Any reference genome I could align the reads to would not have the gene transfers I'm looking for, so the unaligned fastq files are my best bet.
Basically what I've done so far is turn the fastq files into fasta files and blasted them to the mitochondrial genome, then I've taken the read names from the blast output and used sed to change the ".1's" to ".2's" so that I have the name of that read's pair.
So I have a text file with 100000 read names, and I need to get the sequences of these reads. Originally I used grep to find the names and print the next line (grep -f -A1
) however, because my files are so large, this took too much RAM and I had to cancel the task.
I found that using:
while read line; do grep -wF -A1 "$line" query.file; done < subject.file
gives me the output I need and doesn't crash our server, but it is also EXTREMELY slow.
Any advice on how to more efficiently perform this task? Even using grep -m 1 will take over a day to finish each sample.
Wouldn't STAR-Fusion or Tophat-Fusion give you the insight you need?
With these tools, you'll detect chimeric reads and search for those with one read in the nucleus and the other in mt.
I think you can use
samtools faidx
to index the FASTA, that'll make the access faster, although I'm not sure if any tools exist that can use the faidx to speed up their operation as well as use identifiers to get the sequence.You could try
samtools faidx file.fasta; samtools faidx file.fasta <identifier>