Hi, after hours of research I can't seem to find answers to a few questions. I want to make a reference for bowtie2. The reference consists of a few RNA sequences of different lengths.
At first I simply placed all sequences after each other with N's in between them (to allow reads to extend past the reference a bit; I don't know if that is a good thing to do): NNNNAGTGATCGGANNNNNNNAGCGTGATCGCATCGANNNNNNNNAGCGTGAGGAATAGTCTCGCATCGANNNNNNNN etc. It seemed to work, but I wanted to assign names to the sequences so that later I can see which reference sequence my reads mapped to. So I created a reference by making a multi-fasta of them like this:
>seq1
NNNNAGTGATCGGANNNN
>seq2
NNNNAGCGTGATCGCATCGANNNN
>seq3
NNNNAGCGTGAGGAATAGTCTCGCATCGANNNN
etc.
This also seemed to work for Bowtie2. However, the .sam files resulting from the two references are different, and I don't know why. Should I even use N's to allow reads to extend past the sequences, or is that not needed? And for some reason I cannot load the second reference (where I assigned names) into IGV for visualization. Am I doing something wrong?
I realize there are a few questions here; they can be summarized as this one: How can I assign names to the different reference sequences in my reference fasta file?
Thanks in advance!
You should not need to append any N's. A multi-fasta file is fine to use.
Are your reference sequences really < 25 bp long? If that is the case, there may be other tools that could be used instead of bowtie2. What kind of data do you want to align to this file?
To name the sequences, do just what you did above (in the example, which I have formatted using the code option).
No wonder the SAM files differ: as far as the aligner is concerned, you converted your multiple fasta sequences into a single one.
Most likely. A simple multi-fasta file should easily be recognized by IGV.
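
As a minimal sketch of the named multi-fasta layout discussed above, the following Python writes one `>name` header line per sequence. The names and sequences are the ones from the question with the flanking N's removed; the helper name `to_multifasta` is made up for illustration.

```python
# Example names and sequences from the question, without the flanking N's.
references = {
    "seq1": "AGTGATCGGA",
    "seq2": "AGCGTGATCGCATCGA",
    "seq3": "AGCGTGAGGAATAGTCTCGCATCGA",
}


def to_multifasta(records):
    """Return a multi-FASTA string: a '>name' header line, then the sequence."""
    lines = []
    for name, seq in records.items():
        lines.append(f">{name}")
        lines.append(seq)
    return "\n".join(lines) + "\n"


fasta_text = to_multifasta(references)
print(fasta_text, end="")
```

Saving this text to a file (say a hypothetical `refs.fa`) and indexing it with `bowtie2-build refs.fa refs` should give an index in which each header becomes its own reference name in the SAM output.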
"You should not need to append any N's. A multi-fasta file is fine to use."
Alright, thanks. Good to hear that this way is fine and I don't need to use the N's.
"Are your reference sequences really < 25 bp long? There may be other tools that could be used instead of bowtie2, if that is the case. What kind of data do you want to align to this file?"
Yes, very short indeed. I just want to map fastq reads to them, to see how many mismatches there are and what the full-length distribution looks like.
"Just like you did above (in the example I have formatted using code option)."
Ah, thank you.
"No wonder since you converted your multiple fasta sequences into a single one as far as the aligner is concerned."
Right. I thought that because I placed the N's in between them, the reads should still map the same, but I guess there might be some exceptions.
"Most likely. A simple multi-fasta file should easily be recognized by IGV."
Yes, this is really strange. I started doubting whether what I did was correct because this didn't work. But I'll give this some more thought.
Thanks a lot for the help!
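
To see which named reference each read mapped to, one option is to tally column 3 (RNAME) of the SAM file. The sketch below assumes a standard tab-separated SAM, skips `@` header lines, and treats reads with the 0x4 flag bit set as unmapped. The example records are made up for illustration and are not from the poster's data.

```python
from collections import Counter


def count_reads_per_reference(sam_lines):
    """Count alignments per reference name (SAM column 3)."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname = fields[2]
        if flag & 0x4:  # 0x4 = read is unmapped
            counts["unmapped"] += 1
        else:
            counts[rname] += 1
    return counts


# Made-up SAM records, for illustration only.
example_sam = [
    "@SQ\tSN:seq1\tLN:10",
    "@SQ\tSN:seq2\tLN:16",
    "read1\t0\tseq1\t1\t42\t10M\t*\t0\t0\tAGTGATCGGA\tIIIIIIIIII",
    "read2\t0\tseq2\t1\t42\t16M\t*\t0\t0\tAGCGTGATCGCATCGA\tIIIIIIIIIIIIIIII",
    "read3\t16\tseq1\t1\t42\t10M\t*\t0\t0\tAGTGATCGGA\tIIIIIIIIII",
    "read4\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTTTTTTT\tIIIIIIIIII",
]

counts = count_reads_per_reference(example_sam)
for name, n in sorted(counts.items()):
    print(name, n)
```

With the concatenated N-joined reference every read would report the same single reference name, which is why per-sequence counts like these only make sense with the multi-fasta version.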
This is somewhat of an outside-the-box application, since you have very small reference sequences. You could literally grep them out of the reads or use a tool like fuzznuc (from EMBOSS) after converting your fastq sequences to fasta.
I am not sure if IGV expects the references to be of a certain size. Perhaps that is why it is having trouble with your multi-fasta file.
Are you sure these alignments are working? Can you post a couple of representative entries from your SAM file?
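
The grep-style idea suggested above can be sketched in Python as an exact substring search. This is only an illustration under simplifying assumptions: 4-line FASTQ records, forward strand only, and no mismatches allowed (a tool such as fuzznuc would be needed for fuzzy matching). The reads and function names here are made up.

```python
def fastq_sequences(lines):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def count_exact_hits(read_seqs, references):
    """Count reads that contain each reference sequence as an exact substring."""
    counts = {name: 0 for name in references}
    for read in read_seqs:
        for name, ref in references.items():
            if ref in read:
                counts[name] += 1
    return counts


# Reference sequences from the question (N's removed) and made-up reads.
references = {
    "seq1": "AGTGATCGGA",
    "seq2": "AGCGTGATCGCATCGA",
}
example_fastq = [
    "@read1", "TTAGTGATCGGATT", "+", "IIIIIIIIIIIIII",
    "@read2", "AGCGTGATCGCATCGA", "+", "IIIIIIIIIIIIIIII",
    "@read3", "CCCCCCCC", "+", "IIIIIIII",
]

hits = count_exact_hits(fastq_sequences(example_fastq), references)
print(hits)
```

Unlike an aligner, this gives no mismatch or position information, so it is more of a quick sanity check than a replacement for the bowtie2 run.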
It seems to work well; at least the results make a lot of sense. Why would it not work? The tRNA sequences are synthetic, so reads can't map to multiple places on the reference, and I like the log files I can make from the .sam files. One of the .sam files looks like this; is this what you asked for? :) Here is an imgur link: https://imgur.com/wvX2kaQ It's not very sharp :/
btw I figured out how to load the reference seqs into IGV as a multi-fasta: after loading it, I have to select which of the sequences I want to see. It doesn't show anything right away, so I thought it had failed.
When your reference target is that small, I was not sure if the aligner would be able to properly soft-clip longer reads (or perhaps you have trimmed them already). It certainly looks like it is working from the image.
If you have a lot of sequence redundancy in your dataset, you could look at simplifying it by using the tool described in this post: "Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files". Among other things, you can add counts for sequences to the fastq headers.
That is correct. For IGV each of these is a chromosome, and if you have a lot of them then they would not show up at the top of the first page.
Awesome, I'll give them a try, thanks very much!