I have one samples in fastq files from bovine (bovine_sample1.fastq
). If I want to see whether a sequence (supposed named "GQ2") is expressed in bovine cells. (I have got the FASTA of this sequence) The sequence is unannotated in the current bovine genome, so might not have been tested in the analyses thus far.
As we know , in RNA-Seq, the expression level is number of raw reads counts mapped to genome.so my solution is like this:
My solution (2 steps):
(1) generate the bowtie2 index of GQ2.
>GQ2
ATGGAGCACTTTCCCCGCTGTGTGCACGAGTCCTGGGGTTCCTCAAAGGA
GCCCCAGAAAACAGAGGTGCTGCAACTCTTGAGCTTAGCGGACCCTGAGG
.....
mkdir GQ2Bowtie2Index
cd GQ2Bowtie2Index
bowtie2-build GQ2.fa GQ2
ls
GQ2.1.bt2 GQ2.2.bt2 GQ2.3.bt2 GQ2.4.bt2 GQ2.fa GQ2.rev.1.bt2 GQ2.rev.2.bt2
(2) map the reads file on to 'GQ2' genome
bowtie2 -p 8 -x ./GQ2Bowtie2Index/GQ2 -U bovine_sample1.fastq
But when I did it, I got the error.
After I ran the bowtie2, it goes to the endless loop (obviously it is error). If I terminate the bowtie2, I found the error.
bowtie2-align died with signal 2 (INT)
Could you please tell me:
- Is my solution correct? If it is not, do you have any solution?
- If mine is correct, why did I get the error? Do my command lines have the problem?
Hi Cytosine, Thank you! For 2) , it works.
So what does the result mean? Is GQ2 expressed or not expressed in bovine?
And how about this one?
And for 1), what do you mean ? Cause blast is only find the similarity of that Sequence. Could you say it in detail? Or show me some command or pseudo-code? Thank you
The numbers you showed here are just the mapping statistics. It won't give you the answer to whether GQ2 is expressed or not. Do the steps I already mentioned in the previous post and query your bam file to find the answer.
Well if you compare sequences that are similar then you also know the location your query aligns to, right? You can either use the web-GUI for the blast or make your own blast database from the bovine genome and blast it locally with the blast+ suite.
Sorry I didn't reply to you on your post (my mistake), please see my reply below your post.
I disagree with previous statements that those map counts are definitely not informative. For one sample there are 71k reads mapping to that transcript. in another there are 5316 reads mapping to it. Most reads do not map, as expected.
I would simply check if the coverage along the whole transcript is good by using the fasta file for the single transcript as reference in the IGV browser, and then load the bam file above to view the coverage graphically. if you see coverage across the whole transcript, you can be pretty certain it's there. If that's all you want to know, case is closed. if all reads stack up in one place or are very unevenly distributed, then there's a risk that they come from a related similar transcript, or a low-complexity/repetitive sequence. you'll then want to use better methods and a) map against all transcripts or b) against the whole genome followed by intersection with transcript locations.
I put a more elaborate comment in a separate answer just previously, suggesting eXpress to estimate abundance across all transcripts.
Hi David If I don't use IGV but directly go to eXpress, is that OK? cause the results will show the FPKM. And if not expressed, it will be 0 not not shown, am I right?
Sorry, I am still confused about your steps.
Bos taurus breed Hereford chromosome 20, Bos_taurus_UMD_3.1
Sequence ID: ref|AC_000177.1|Length: 72042655 Number of Matches: 1
Related Information
Range 1: 31667545 to 31668074 GenBank Graphics Next Match Previous Match
Alignment statistics for match #1
Features:
60570 bp at 5' side: sex determination protein fruitless
119104 bp at 3' side: selenoprotein P precursor
Sorry for these rookie questions.
The blast output tells you where your gene is most likely located:
(Bos taurus breed Hereford chromosome 20 - Range 1: 31667545 to 31668074)
Now you convert your sam file from Tophat to bam and index it with samtools.
When you have those files, you can either load them to a genome-viewer (e.g. IGV) and take a look at the coverage there
or use samtools view to parse out the reads mapping to the chr20:31667545-31668074 region from the indexed bam file or even use samtools depth to see the coverage in that region.
Hi of course I have got the bam/sam file generated by Tophat (map my
bovine_sample1.fastq
to full bovine genome).However, I don't know how to go further.
For example this is my aligned SAM file
Do you mean when select the column3 =20 meanwhile column4 between 31667545 and 31668074. If there isn't any hits, so it means that 'GQ2' is not expressed. And how many hits is the expression level? Am I right?
Another question, if No significant similarity found when blast 'GQ2' gene, how could I do?