Hi
Problem:
I have genome seq PE reads of an Arabidopsis mutant. What I know of this mutant is that: a) it has a t-DNA insertion somewhere, containing a GUS transgene. b) it has at least another mutation, causing a phenotype I am interested in.
What could be a possible strategy to identify where the GUS transgene is placed, and also to find other possible insertions or deletions in the genome?
What I tried I mapped reads to the genome, and assembled contigs from unmapped reads. This way I could assemble a contig containing part of the t-dna insertion, but couldn't find the flanking sequence (so I don't know where it is). I could find this contig just by searching for the GUS sequence. But I also would like to find a way to detect other possible insertions of which I don't know any sequence, and I don't know how.
I also tried to look at all the reads that were mapped and had their mate unmapped. I thought I could detect this way the insertion, but I end up with lots of reads everywhere (I don't know if it is normal to have so many unmapped reads, or if I used wrong mapping parameters).
Any suggestion?
-> At this point you have the unmapped reads that come from your gene of insertion. To know where it is inserted in your genome, you can look at the mapping position of their mapped mates
I used samtools flagstats, it gave : 45809554 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 secondary 0 + 0 supplementary 0 + 0 duplicates 33530536 + 0 mapped (73.20% : N/A) 45809554 + 0 paired in sequencing 22904777 + 0 read1 22904777 + 0 read2 31581052 + 0 properly paired (68.94% : N/A) 31766210 + 0 with itself and mate mapped 1764326 + 0 singletons (3.85% : N/A) 84670 + 0 with mate mapped to a different chr 60488 + 0 with mate mapped to a different chr (mapQ>=5) Then i extracted unmapped reads using the command $samtools view -b -f 4 input.bam > unmapped.output and it gave an output bam file
Now I need to blast them to my gene sequence right? but I dont understand this part of your answer " To know where it is inserted in your genome, you can look at the mapping position of their mapped mates" can you please explain it a bit.. Sorry for any inconvenience it may cause :(
No problem ! Imagine a pair of reads that come from the region of insertion. One read is unmapped on the genome because it is located within the insertion. The second read (they are mates), however, mapped on the genome because it is located outside the gene of insertion. Now if you can confirm that the first read (the unmapped one) really comes from the insertion with blast, the second one (the mapped one) will give you the approximate genomic location of the insert.
Hmm I got it thanks. But how can I extract those mate reads? using: $samtools view -b -f 8 input.bam > mate.output? Thanks for your kind help :)
You can play with -f and -F. Note that both reads are each other mates so don't be confused. To extract the alignments of the mates of the reads that blast on your sequence, you can simply use grep (both mates have the same name). Here is the code I would use.
@Carlo Yague, thanks a lot. I tried to execute these codes, first three worked but there are some errors when I try to run the blast like:
Error: (106.18) NCBI C++ Exception: Error: (CArgException::eSynopsis) Too many positional arguments (1), the offending value: /Users/znasim09/Documents/ncbi-blast-2.2.18+/bin/blastn Error: (CArgException::eSynopsis) Application's initialization failed
I don't know whats wrong with it :(
There were some errors, now its working. I got a blasted file of about 50kb, is it ok to get such small sized file?
I guess its ok, the output gives you one line by significant match. You can try
wc -l oneEndUnmapped.blasted
to know the exact number of lines.The
wc -l oneEndUnmapped.blasted
gave "468 oneEndUnmapped.blasted" result.However, by using the last code (mentioned by you), I got a bam file of around 400bytes. I guess there is something wrong with it. As I tried to convert it to fasta but it resulted empty file.
Look at the intermediate sam file :