How to get the expression level of specific gene without annotation in RNA-Seq?
3
0
Entering edit mode
10.3 years ago
super ▴ 60

I have one samples in fastq files from bovine (bovine_sample1.fastq). If I want to see whether a sequence (supposed named "GQ2") is expressed in bovine cells. (I have got the FASTA of this sequence) The sequence is unannotated in the current bovine genome, so might not have been tested in the analyses thus far.

As we know , in RNA-Seq, the expression level is number of raw reads counts mapped to genome.so my solution is like this:

My solution (2 steps):

(1) generate the bowtie2 index of GQ2.

>GQ2
ATGGAGCACTTTCCCCGCTGTGTGCACGAGTCCTGGGGTTCCTCAAAGGA
GCCCCAGAAAACAGAGGTGCTGCAACTCTTGAGCTTAGCGGACCCTGAGG
.....

mkdir GQ2Bowtie2Index
cd GQ2Bowtie2Index
bowtie2-build GQ2.fa GQ2
ls
GQ2.1.bt2 GQ2.2.bt2 GQ2.3.bt2 GQ2.4.bt2 GQ2.fa GQ2.rev.1.bt2 GQ2.rev.2.bt2

(2) map the reads file on to 'GQ2' genome

bowtie2 -p 8 -x ./GQ2Bowtie2Index/GQ2 -U bovine_sample1.fastq

But when I did it, I got the error.

After I ran the bowtie2, it goes to the endless loop (obviously it is error). If I terminate the bowtie2, I found the error.

bowtie2-align died with signal 2 (INT)

Could you please tell me:

  1. Is my solution correct? If it is not, do you have any solution?
  2. If mine is correct, why did I get the error? Do my command lines have the problem?
software-error RNA-Seq • 5.6k views
ADD COMMENT
2
Entering edit mode
10.3 years ago
Cytosine ▴ 460

Mapping RNA-Seq reads onto a single gene will produce a massively overestimated expression level and may even show expression where there is none in the case of mapping against the whole genome.

  1. Create an index of the whole bovine genome and map your RNA-Seq reads against that.

    Since you say you know the GQ2 sequence but not the region, why not just blast the sequence against the genome and see where it is most likely situated?

    Once you know the location you just slice the GQ2 region of the bam file and you can extract raw counts from there or visualize the file in some bam-viewer to see the coverage.

  2. The command you used looks fine, you only need to save the results somewhere:

    (either appending -S <output> or > output.sam to your command).

    I find it highly unlikely your run went into an infinite loop... How long did you wait for it? Did you check your cpu/ram usage to see if it's running?

bowtie2-align died with signal 2 (INT) ---> that just means you killed the process by sending it an interrupt signal (most likely with Ctrl+C). It is definitely not an error, just a log message from the OS for receiving a SIGINT.

ADD COMMENT
0
Entering edit mode

Hi Cytosine, Thank you! For 2) , it works.

54560452 (99.83%) aligned 0 times
71526 (0.13%) aligned exactly 1 time
20474 (0.04%) aligned >1 times
0.17% overall alignment rate

So what does the result mean? Is GQ2 expressed or not expressed in bovine?

And how about this one?

74179401 (99.99%) aligned 0 times
5316 (0.01%) aligned exactly 1 time
0 (0.00%) aligned >1 times
0.01% overall alignment rate

And for 1), what do you mean ? Cause blast is only find the similarity of that Sequence. Could you say it in detail? Or show me some command or pseudo-code? Thank you

ADD REPLY
0
Entering edit mode

The numbers you showed here are just the mapping statistics. It won't give you the answer to whether GQ2 is expressed or not. Do the steps I already mentioned in the previous post and query your bam file to find the answer.

Well if you compare sequences that are similar then you also know the location your query aligns to, right? You can either use the web-GUI for the blast or make your own blast database from the bovine genome and blast it locally with the blast+ suite.

ADD REPLY
0
Entering edit mode

Sorry I didn't reply to you on your post (my mistake), please see my reply below your post.

ADD REPLY
0
Entering edit mode

I disagree with previous statements that those map counts are definitely not informative. For one sample there are 71k reads mapping to that transcript. in another there are 5316 reads mapping to it. Most reads do not map, as expected.

I would simply check if the coverage along the whole transcript is good by using the fasta file for the single transcript as reference in the IGV browser, and then load the bam file above to view the coverage graphically. if you see coverage across the whole transcript, you can be pretty certain it's there. If that's all you want to know, case is closed. if all reads stack up in one place or are very unevenly distributed, then there's a risk that they come from a related similar transcript, or a low-complexity/repetitive sequence. you'll then want to use better methods and a) map against all transcripts or b) against the whole genome followed by intersection with transcript locations.

I put a more elaborate comment in a separate answer just previously, suggesting eXpress to estimate abundance across all transcripts.

ADD REPLY
0
Entering edit mode

Hi David If I don't use IGV but directly go to eXpress, is that OK? cause the results will show the FPKM. And if not expressed, it will be 0 not not shown, am I right?

ADD REPLY
0
Entering edit mode

Sorry, I am still confused about your steps.

Q1: I have blast my 'GQ2' use web GUI and I don't know how to go to next steps.

which part is most interresting for my work?!

Bos taurus breed Hereford chromosome 20, Bos_taurus_UMD_3.1

Sequence ID: ref|AC_000177.1|Length: 72042655 Number of Matches: 1

Related Information

Range 1: 31667545 to 31668074 GenBank Graphics Next Match Previous Match

Alignment statistics for match #1

Score            Expect    Identities       Gaps         Strand
979 bits(530)    0.0       530/530(100%)    0/530(0%)    Plus/Minus

Features:

60570 bp at 5' side: sex determination protein fruitless
119104 bp at 3' side: selenoprotein P precursor

Query  1         ATGGAGCACTTTCCCCGCTGTGTGCACGAGTCCTGGGGTTCCTCAAAGGAGCCCCAGAAA  60
                 ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  31668074  ATGGAGCACTTTCCCCGCTGTGTGCACGAGTCCTGGGGTTCCTCAAAGGAGCCCCAGAAA  31668015

Q2: And for your query your bam file, what do you mean?

  1. which bam file? alignment bam file generate by Tophat?
  2. how to do that query?

Sorry for these rookie questions.

ADD REPLY
0
Entering edit mode

The blast output tells you where your gene is most likely located:

(Bos taurus breed Hereford chromosome 20 - Range 1: 31667545 to 31668074)

Now you convert your sam file from Tophat to bam and index it with samtools.

When you have those files, you can either load them to a genome-viewer (e.g. IGV) and take a look at the coverage there

or use samtools view to parse out the reads mapping to the chr20:31667545-31668074 region from the indexed bam file or even use samtools depth to see the coverage in that region.

ADD REPLY
0
Entering edit mode

Hi of course I have got the bam/sam file generated by Tophat (map my bovine_sample1.fastq to full bovine genome).

However, I don't know how to go further.

For example this is my aligned SAM file

XHM7T:00010:04775       0       27      34436721        50      16M98N2M        *       0       0       AAAAAAAAAAAAAAAGTC      57666666666776%1.-      AS:i:0  XN:i:0  XM:i:0  XO:i:0      XG:i:0  NM:i:0  MD:Z:18 YT:Z:UU XS:A:+  NH:i:1
XHM7T:00010:05264       0       3       49099319        2       26M14S  *       0       0       AAAAAAAAAAAAAAAAGCTGTATGTGAAAACTAACATTAC        677777777777777%7885548::6666*333-23./14    AS:i:52 XS:i:48 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:26 YT:Z:UU

Do you mean when select the column3 =20 meanwhile column4 between 31667545 and 31668074. If there isn't any hits, so it means that 'GQ2' is not expressed. And how many hits is the expression level? Am I right?

Another question, if No significant similarity found when blast 'GQ2' gene, how could I do?

ADD REPLY
2
Entering edit mode
10.3 years ago
David Fredman ★ 1.1k

If all you want to know is whether the gene is expressed or not, mapping reads directly to the single transcript (requiring perfect alignment) should be fine - if you have reads covering the whole transcript (view the output bam file in the IGV browser, using the single transcript as reference sequence), then it's present in appreciable amounts. Single reads or very uneven coverage may suggest low abundance or possibly reads from a different source mapping to your transcript.

However, if you wish to estimate its relative abundance compared to other genes, then you should map reads either to the complete transcriptome or genome, followed by read count based analyses that take various biases into account, as suggested previously. One tool that'll help you do this is eXpress, which is pretty easy to run and very fast. Have a look at the tutorial linked to, and see if you could get that to work with the bam file you generated by mapping reads to the genome.

ADD COMMENT
0
Entering edit mode

Thank you David.

  1. As shown in my post replied to Cytosine, if I just map the reads directly to the single gene (suppose it called GQ2), the result only represents the mapping statistics but I still don't know if it is mapping or not.
  2. About the eXpress, from the website it seems useful. cause currently I have bovine full genome and 'GQ2' Fasta sequence, so the commands could be sth. listed as below, am I right? (It seems that it doesn't need full bovine genome.....)
bowtie2 GQ2.fasta GQ2
bowtie2 -p 8 -x ./GQ2Bowtie2Index/GQ2 -U bovine_sample1.fastq |samtools view -Sb - > hits.bam
express GQ2.fasta hits.bam
ADD REPLY
0
Entering edit mode

Hi David,

If I use bowtie2 build the genome and index of GQ2 gene, and map bovine_sample1.fastq to the GQ2 geome.

I got the alignment result

4652452 reads; of these:
  54652452 (100.00%) were unpaired; of these:
    54560452 (99.83%) aligned 0 times
    71526 (0.13%) aligned exactly 1 time
    20474 (0.04%) aligned >1 times   0.17% overall alignment rate

But if I use

bowtie2  -p 8 -x ./GQ2Bowtie2Index/GQ2 -U bovine_sample1.fastq |samtools view -Sb -> hits.bam
express GQ2.fasta hits.bam

Then the result is listed as below:

Is that mean that the GQ2 has been expressed ?

bundle_id       target_id       length  eff_length      tot_counts      uniq_counts     est_counts      eff_counts      ambig_distr_alpha       ambig_distr_beta        fpkm    fpkm_conf_low       fpkm_conf_high  solvable        tpm
1       GQ2    530     329.494259      92000   92000   92000.000000    147984.369194   0.000000e+00    0.000000e+00    3.034954e+06    3.034954e+06    3.034954e+06    T       1.000000e+06
ADD REPLY
0
Entering edit mode
10.3 years ago
cdsouthan ★ 1.9k

It's important to do at least some manual reality checking. When I use BLAT to map GQ2 against UCSC Cow

https://genome.ucsc.edu/cgi-bin/hgTracks?db=bosTau7&position=chr20%3A33550848-33570847&hgsid=384395739_3dzqjrVrvuxAtLCYAd387xV5wC68

and open up the tracks a) it lands in a complete transcription desert b) no local genomic conservation patches and c) BLASTX finds nothing. I therefore don't believe this particular piece of RNA-Seq has any biological credibility

ADD COMMENT

Login before adding your answer.

Traffic: 1916 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6