How To Ignore Pseudogenes Or miRNA During Alignment With TopHat
11.6 years ago
AsoInfo ▴ 300

Greetings!

Is it possible to ignore the pseudogenes or miRNA during aligning with TopHat?

Thanking you!

tophat • 3.7k views
11.6 years ago

I am not sure about the context of your analysis, but emerging evidence suggests that several pseudogenes have a role in cancer and that miRNA could act as decoys. So if you are out to understand novel biology from your RNA-seq data, retain them or analyze them for new insights.

See:


Thanks, very helpful... I will consider it.

11.6 years ago
Asaf 10k

Yes. Give TopHat a GTF/GFF3 file containing only the genes you want to map the reads to (using -G), and add -T to make it match the reads only to the genes you provided. Without -T, it will search first in the list of genes you provided and then in the rest of the genome.
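As a rough sketch of what that invocation could look like: the file names below (filtered.gtf, genome_index, reads.fastq, tophat_out) are placeholders I made up, not anything from the question. The snippet only builds and prints the command so you can check it before running it.

```shell
#!/bin/sh
# Hypothetical file names; replace them with your own.
GTF="filtered.gtf"      # annotation restricted to the genes you want
INDEX="genome_index"    # basename of the Bowtie index
READS="reads.fastq"     # input reads
OUTDIR="tophat_out"     # output directory (-o)

# -G supplies the annotation; -T (--transcriptome-only) restricts
# alignment to the transcripts in that annotation.
TOPHAT_CMD="tophat -G $GTF -T -o $OUTDIR $INDEX $READS"

# Print the command for inspection.
echo "$TOPHAT_CMD"
```

Once the printed command looks right, run it directly in the shell. Dropping -T gives the default behaviour of mapping to the annotated genes first and then to the rest of the genome.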

11.6 years ago
Rm 8.3k

You can exclude specific biotypes from the annotation you give TopHat. I generally mask only rRNA and mitochondrial genes, or rRNAs/tRNAs.

For example, download the GTF from Ensembl: http://uswest.ensembl.org/info/data/ftp/index.html

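Script 1 (get.biotypes.awk) lists every biotype in the GTF. Run it as: awk -f get.biotypes.awk Homo_sapiens.GRCh37.71.gtf | sort -u > all.biotypes.txt

BEGIN {OFS=FS="\t"}

# Skip comment/header lines, which start with "#".
!/^#/ {
        # Column 9 holds the attributes, separated by ";".
        n = split($9,format,";");
        for (i=1; i<=n; i++){
                if (format[i] ~ /gene_biotype|gene_type/){
                        # Strip the key (and any leading space) and the quotes, leaving only the value.
                        sub(/^ *(gene_biotype|gene_type) /, "", format[i]);
                        gsub(/"/,"",format[i]);
                        print format[i];
                }
        }
}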

Script 2 (get.gtf.mask.biotypes.awk) writes a GTF without the unwanted biotypes. Run it as: awk -f get.gtf.mask.biotypes.awk Homo_sapiens.GRCh37.71.gtf > output.gtf

BEGIN {OFS=FS="\t"}

# Skip comment/header lines; records without a biotype attribute are not printed.
!/^#/ {
        n = split($9,format,";");
        for (i=1; i<=n; i++){
                if (format[i] ~ /gene_type|gene_biotype/){
                        # Change the pattern to the biotypes you want ( ~ ) or don't want ( !~ ).
                        # (I generally mask Mt and rRNA in RNA-seq.)
                        if (format[i] !~ /pseudogene|miRNA/){
                                print;
                        }
                        break;  # stop at the first biotype attribute so a record is printed at most once
                }
        }
}
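Before pointing TopHat at the filtered file, it may be worth checking which biotypes actually survived. Here is a minimal, self-contained sketch of that check using a made-up three-line GTF (toy.gtf is not real annotation, and the gene IDs are invented). Note one difference from script 2: this variant keeps header lines and records that have no biotype attribute.

```shell
#!/bin/sh
# Tiny made-up GTF (tab-separated) with three biotypes; not real annotation.
printf '1\thavana\tgene\t100\t200\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";\n' >  toy.gtf
printf '1\thavana\tgene\t300\t400\t.\t-\t.\tgene_id "G2"; gene_biotype "processed_pseudogene";\n' >> toy.gtf
printf '1\tensembl\tgene\t500\t560\t.\t+\t.\tgene_id "G3"; gene_biotype "miRNA";\n' >> toy.gtf

# Keep header lines and every record whose biotype is not pseudogene/miRNA.
awk -F'\t' '
/^#/ { print; next }
{
    biotype = ""
    if (match($9, /gene_(bio)?type "[^"]+"/)) {
        biotype = substr($9, RSTART, RLENGTH)
    }
    if (biotype !~ /pseudogene|miRNA/) print
}' toy.gtf > toy.filtered.gtf

# Tally the biotypes that survived the filter, as biotype=count.
KEPT=$(awk -F'\t' 'match($9, /gene_(bio)?type "[^"]+"/) {
    s = substr($9, RSTART, RLENGTH)
    gsub(/gene_(bio)?type |"/, "", s)
    print s
}' toy.filtered.gtf | sort | uniq -c | awk '{print $2 "=" $1}')
echo "$KEPT"
```

On a real annotation, run the tally part on output.gtf and confirm that no pseudogene or miRNA entries remain in the list.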

Thank you so much... I'll try running it on my data.

11.6 years ago

I don't think you want to ignore them. If you have reads that align to those things, you need your aligner to report their correct mapping position. The last thing you want is for the aligner to place those reads in the wrong gene, because you told it not to put them in the right place.
