I have internal data for NCI-H660 with 8M mapped pairs (HiSeq, 50 bp paired-end) and an external dataset with 4M mapped pairs (50 bp paired-end, generated on a GAII).
Questions:
1. I observe the TMPRSS2-ERG fusion in the external dataset, but not in the internal HiSeq data. What could be the reasons? I used tophat2 fusion with the same parameters for both datasets.
Use grep to search your fastq for a specific sequence.
Something like
grep -A 2 -B 1 GGAATAACCTGCCGCG myfastq.fastq > junctions.fastq
The -A 2 means "get the 2 lines after the line that matches the sequence", and -B 1 means "get the one line before it". Together with the matching line, this gives you the full 4 lines of each fastq entry. If you don't need that, you can omit those two options. Note that grep prints a "--" line between non-adjacent groups of matches, so strip those lines if you want a valid fastq. Check the rev-comp of that sequence too.
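To check both orientations in one pass, here is a minimal sketch. It assumes GNU grep (for --no-group-separator, which suppresses the "--" lines between match groups) and the rev utility from util-linux. The revcomp helper and the demo.fastq file are made up for illustration; point the final grep at your own fastq instead.

```shell
# Reverse-complement a DNA sequence: reverse the string, then swap A<->T and C<->G.
# (revcomp is a hypothetical helper name, not part of any tool.)
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

seq=GGAATAACCTGCCGCG
rc=$(revcomp "$seq")

# Tiny made-up fastq for demonstration: read1 carries the sequence,
# read2 carries neither orientation, read3 carries the reverse complement.
q=IIIIIIIIIIIIIIIIIIIIIIII
printf '%s\n' \
    '@read1' "AAAA${seq}TTTT" '+' "$q" \
    '@read2' 'ACGTACGTACGTACGTACGTACGT' '+' "$q" \
    '@read3' "CCCC${rc}GGGG" '+' "$q" > demo.fastq

# -e can be given more than once; a line matching either pattern is reported.
# --no-group-separator (GNU grep only) keeps the output a valid fastq.
grep -B 1 -A 2 --no-group-separator -e "$seq" -e "$rc" demo.fastq > junctions.fastq
```

If your grep does not support --no-group-separator, drop that option and filter the output through grep -v '^--$' instead.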
If your fastq is gzipped, use zgrep instead of grep. If you have a .bam file, do something like this to search the .bam (with myfile.bam standing in for your own file):
samtools view myfile.bam | grep GGAATAACCTGCCGCG > junctions.sam
Here samtools view reads the .bam, converts it to plain-text .sam, and feeds that one line at a time to grep, which outputs only the lines that contain your sequence to junctions.sam.