Question

Faster way to get sequences from large fasta files?

0

Entering edit mode

6.8 years ago

emilywynn6 ▴ 10

Hello,

I am working with Illumina paired end reads. I am trying to use Blast to find read pairs in which one pair aligns to the nucleus and the other pair aligns to the mitochondria. Any reference genome I could align the reads to would not have the gene transfers I'm looking for, so the unaligned fastq files are my best bet.

Basically what I've done so far is turn the fastq files into fasta files and blasted them to the mitochondrial genome, then I've taken the read names from the blast output and used sed to change the ".1's" to ".2's" so that I have the name of that read's pair.

So I have a text file with 100000 read names, and I need to get the sequences of these reads. Originally I used grep to find the names and print the next line (grep -f -A1) however, because my files are so large, this took too much RAM and I had to cancel the task.

I found that using: while read line; do grep -wF -A1 "$line" query.file; done < subject.file gives me the output I need and doesn't crash our server, but it is also EXTREMELY slow.

Any advice on how to more efficiently perform this task? Even using grep -m 1 will take over a day to finish each sample.

grep fasta paired-ends • 2.4k views

ADD COMMENT • link updated 6.8 years ago by n,n ▴ 390 • written 6.8 years ago by emilywynn6 ▴ 10

1

Entering edit mode

Wouldn't STAR-Fusion or Tophat-Fusion give you the insight you need?

With these tools, you'll detect chimeric reads and search for those with one read in the nucleus and the other in mt.

ADD REPLY • link 6.8 years ago by michael.ante ★ 4.0k

0

Entering edit mode

I think you can use samtools faidx to index the FASTA, that'll make the access faster, although I'm not sure if any tools exist that can use the faidx to speed up their operation as well as use identifiers to get the sequence.

You could try samtools faidx file.fasta; samtools faidx file.fasta <identifier>

ADD REPLY • link 6.8 years ago by Ram 45k

1

Entering edit mode

6.8 years ago

n,n ▴ 390

if you only need all the sequences without headers you could:

cat fasta | paste - - | cut -f2 > output

if you need each sequence in a separate file with the fasta header you could:

for id in $(cat file_with_headers_list)
do
  filename=$(cat test | grep "$id" | cut -d\> -f2)
  cat fasta | paste - - | grep $id | tr '\t' '\n' > ${filename}_output
done

These should be faster I think, don't know if I understood what you wanted correctly though so I apologize in advance if I misunderstood.

ADD COMMENT • link 6.8 years ago by n,n ▴ 390

0

Entering edit mode

obviously to make the "file_with_headers_list" you would:

cat fasta | grep \> > file_with_headers_list

ADD REPLY • link 6.8 years ago by n,n ▴ 390

score 3 · Accepted Answer · 2018-10-19

3

Entering edit mode

6.8 years ago

Pierre Lindenbaum 166k

linearize your fasa file, sort and use joiin.

Not tested but it should be something like

join  -t $'\t' -1 1 -2 1 \
   <(cat query.file | paste - - | cut -c 2- | sort -t $'\t' -k1,1) \
   <(sort subject.file) |\
tr "\t" "\n"

ADD COMMENT • link 6.8 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I've tried this out on a small test file, and I think this strategy will work and be much faster. Thank you!!!

ADD REPLY • link 6.8 years ago by emilywynn6 ▴ 10

0

Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

Upvote|Bookmark|Accept

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 166k