Hi guys, I am relatively new to alignment tools and bioinformatics in general. I have seen several questions on the forums pertaining to my question, but after following through on the answers, I can't seem to get the right result. Basically, I am trying to find where a plasmid tDNA insertion of known sequence is in an arabidopsis thaliana genome. My plan was to use split alignment methods for this. I have a reference genome consisting of 5 chromosomes in fasta format, to which I apended the known plasmid contruct sequence with the header ">Plasmid". I have paired end reads that I want to align to multiple places in the "Genome+Plasmid" reference with the hope that I can later use BLAST to find which reads align to one of the chromosomes as well as the plasmid. With this information I hope to use BLAST to determine the gene that the tDNA insertion is next to. It is my understanding thus far that I want split or "chimeric" reads, but not supplementary reads. I have a good understanding of how to use BLAST, but I am having difficulty finding split alignment reads. So far, I have tried bwa-mem, segemehl, and bowtie2 on ubuntu, but none have worked. I'm not sure if this is because I am incorrectly setting parameters or some other issue. Also, once the split alignments are found and marked, I'm not sure how to extract them. I also am unable to open .sam files, (not sure if people normally can?) If anyone could show me a step-by-step process with one of the above alignment tools (or a new tool if necessary) of how to find and extract split reads, this help would be greatly appreciated. If I am unclear in any way, please let me know. Thanks
segemehl commands- resulted in no reads ./segemehl.x -i refindex.idx -d ref.fas -q reads1.fasta -p reads2.fasta > output.sam samtools view -bS -o output.bam output.sam samtools sort output.bam sorted.bam samtools view -h -o sorted.sam sorted.bam ./testrealign.x -d ref.fas -q sorted.sam -n
bowtie2- not sure how to extract bowtie2 -x ref.fas.bowtie2 -1 reads1.fq.gz -2 reads2.fq.gz -S output.sam --local
bwa mem bwa mem ref.fas reads1.fq reads2.fq > output.sam samtools view output.sam I tried several different commands after this but I think I was going in the wrong direction. I believe I tried the virst command with the -M function as well. how can I extract split reads from this point?
What does this mean? Do they produce an error? Don't forget to add the exact commands you used for the alignment.
How did you try to open them?
I've looked online but have not been able to find any program that can open .sam files in a readable form. Basically I have not idea how to open them, so I haven't tried any method. For the alignment tools, when I used bwa-mem, I went through all the steps but did not find any split reads in the end. I'm not sure why. When I used bowtie2, it told me after the alignment that about 6000 reads aligned more than once, but I was not sure how to extract these. Also, I don't believe these reads are completely what I want, as they include reads where the same parts of the read align to different places. I want reads where different parts of the read align to different places in the reference. With segemehl, the program did not find any split reads in the end. I will soon post the exact commands I used.
Look into
samtools
and supplementarySAM flags
(2048): https://samtools.github.io/hts-specs/SAMv1.pdfI've tried something like this, but I will look into it more.
Have you seen this thread: Identification of the sequence insertion site in the genome
I had not seen that forum. I'm not sure its the exact issue as mine because it seems that their insertion sequence is unknown. I forgot to mention that mine is known, which should make my problem simpler.
Indeed!
So what is the exact problem. BTW, SAM files are plain text and should be viewable using
cat/less
etc on unix. Don't try to open them on Windows with a GUI based editor.The problem is that when I go through all the steps for the programs, then try to extract using SAMtools view (and setting different flags), it does not appear to find any split reads. Is there a typical protocol for finding split reads with programs such as bwa mem? Should I be marking short split reads with the -M function? And how should I be attempting to extract the split reads after alignment?
Very Large Insertion Detection Methodology Query In this thread, the answers basically say that bwa mem can make a sam file for me that contains any read where one end maps to one place in the genome and one end maps to another place in the genome. I'm not sure how to get bwa mem to do this for me. Somehow, I guess maybe with Samtools, although I'm not totally sure how, I want to extract these reads and individually BLAST them against the reference to make sure they are actually aligning to a place in the genome as well as the appended plasmid sequence (and not just two normal places in the genome).