How to remove uncharacterized chromosomes from GTF file?
22 months ago
Vasu ▴ 790

I have a GTF and here I'm showing an example:

GL000008.2      Cufflinks       exon    83383   83545   .       +       .       transcript_id "SHARED_00000001"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "1"; 
GL000008.2      Cufflinks       transcript      83383   85626   .       +       .       transcript_id "SHARED_00000001"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; oId "SHARED_00000001"; class_code "u"; tss_id "TSS1"; 
GL000008.2      Cufflinks       exon    85567   85626   .       +       .       transcript_id "SHARED_00000001"; gene_id "XLOC_000001"; gene_name "XLOC_000001"; exon_number "2"; 
chr1    HAVANA  exon    11869   12227   .       +       .       transcript_id "SHARED_00000341"; gene_id "ENSG00000223972.5"; gene_name "ENSG00000223972.5"; exon_number "1"; 
chr1    HAVANA  transcript      11869   14409   .       +       .       transcript_id "SHARED_00000341"; gene_id "ENSG00000223972.5"; gene_name "ENSG00000223972.5"; oId "ENST00000456328.2"; tss_id "TSS213"; 
chr1    HAVANA  exon    12613   12721   .       +       .       transcript_id "SHARED_00000341"; gene_id "ENSG00000223972.5"; gene_name "ENSG00000223972.5"; exon_number "2"; 
chr1    HAVANA  exon    13221   14409   .       +       .       transcript_id "SHARED_00000341"; gene_id "ENSG00000223972.5"; gene_name "ENSG00000223972.5"; exon_number "3"; 
chr10_GL383545v1_alt    ncbiRefSeq      exon    3012    3170    .       +       .       transcript_id "SHARED_00065395"; gene_id "XLOC_011047"; gene_name "XLOC_011047"; exon_number "1"; 
chr10_GL383545v1_alt    ncbiRefSeq      transcript      3012    96701   .       +       .       transcript_id "SHARED_00065395"; gene_id "XLOC_011047"; gene_name "XLOC_011047"; 
chr10_GL383546v1_alt    ncbiRefSeq      transcript      295416  305254  .       -       .       transcript_id "SHARED_00065412"; gene_id "XLOC_011055"; gene_name "XLOC_011055"; 
chr10_GL383546v1_alt    ncbiRefSeq      exon    299951  300098  .       -       .       transcript_id "SHARED_00065412"; gene_id "XLOC_011055"; gene_name "XLOC_011055"; exon_number "2";

From this GTF, I would like to remove uncharacterized chromosomes such as chr10_GL383545v1_alt and chr10_GL383546v1_alt; several others are present in the original GTF.

I would like to keep chr1-chr22, chrX, chrY, chrM, and also contigs like GL000008.2, KI270364.1, KI270740.1, and several other contigs.

grep gtf awk • 1.8k views
22 months ago
seidel 11k

If you're on Linux at the command line, one simple method is to use grep to match a regular expression and drop the matching lines:

 grep -v -E "^chr[0-9]+_" genes.gtf > genes_cannonical_chr.gtf

This matches every line that starts with "chr", followed by one or more digits and an underscore, and filters it out. The -v option inverts the match (lines that match are not returned), and -E enables extended regular expressions so that you can use "+" unescaped (the + means match one or more occurrences of the preceding item).

This solution assumes all your unwanted chromosomes follow the pattern chrN_, where N is a number. If there happen to be chrM_ lines, you can filter these out in a second step.
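To see what that assumption means in practice, here is a minimal sketch run on a made-up four-line file. The file toy.gtf, its "src" column, and the variable names are placeholders for illustration only; the sequence names are taken from this thread.

```shell
# Build a tiny tab-separated stand-in for genes.gtf (hypothetical toy data).
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gtf"
# printf reuses the format string once per argument, giving one line per name.
printf '%s\tsrc\texon\n' \
    GL000008.2 chr1 chr10_GL383545v1_alt chrX_KI270880v1_alt > "$toy"

# -v drops matching lines; -E lets "+" be written without a backslash.
kept=$(grep -v -E "^chr[0-9]+_" "$toy" | cut -f1)
printf '%s\n' "$kept"
```

On this toy input the chr10_ line is dropped, while GL000008.2, chr1 and the chrX_ line all survive, because "X" is not a digit.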

Sometimes you will also have chrUn (for unknown chromosomes) as a prefix. For these you could simply pipe the first command into a second grep (to avoid thinking of a complicated regex):

grep -v -E "^chr[0-9]+_" genes.gtf | grep -v "^chrUn_" > genes_cannonical_chr.gtf

On the other hand, we could simply see in advance what we need, and the regex probably isn't that complicated. Since you want to control inclusion of lines based on the chromosome in the first field of each line, you could examine what is left after the first filtering step:

# how many different chromosome patterns do we get?
# (sort -u rather than uniq, since uniq alone only collapses adjacent duplicates)
cut -f1 genes.gtf | grep -v -E "^chr[0-9]+_" | sort -u

This would show you the patterns you still have. And you would see that you didn't catch problematic things like: chrUn_, chrX_, chrY_. So you can simply add these to the regex pattern, and do all the filtering in one step:

grep -v -E "^chr[0-9UnYXM]+_" genes.gtf > genes_cannonical_chr.gtf

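This works because [0-9UnYXM]+ is a bracket expression that matches one or more of the listed characters, so the chr10_, chrUn_, chrX_, chrY_ and chrM_ prefixes are all caught and filtered out in one step.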
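If you prefer a pattern that spells out the prefixes rather than relying on a set of characters, two hedged alternatives are sketched here on a made-up file: an explicit alternation for grep, and an awk filter that tests only the first column. The file toy.gtf, its "src" column, the transcript_id value and the variable names are placeholders; the sequence names come from this thread.

```shell
# Hypothetical toy data: one line per sequence name, with an attribute column
# that itself contains an underscore (transcript_id), as real GTF lines do.
tmpdir=$(mktemp -d)
toy="$tmpdir/toy.gtf"
printf '%s\tsrc\texon\ttranscript_id "T1";\n' \
    GL000008.2 chr1 chrM chr10_GL383545v1_alt \
    chrUn_GL000195v1 chrX_KI270880v1_alt chrY_KN196487v1_fix > "$toy"

# Alternation: "chr", then digits or X, Y, M, Un, then an underscore.
kept_grep=$(grep -v -E "^chr([0-9]+|X|Y|M|Un)_" "$toy" | cut -f1)

# awk: look only at column 1 and drop any "chr" name containing "_".
# Restricting to $1 matters here, because the attribute column has "_" too.
kept_awk=$(awk -F'\t' '$1 !~ /^chr.*_/' "$toy" | cut -f1)

printf '%s\n' "$kept_grep"
```

Both variants keep GL000008.2, chr1 and chrM on this toy input. The awk form does not need a list of prefixes, but it assumes that no chromosome you want to keep has a name starting with "chr" and containing an underscore.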


Thanks a lot for the reply. I can still see some unwanted chromosomes, such as chrUn_GL000195v1, chrX_KI270880v1_alt and chrY_KN196487v1_fix.


I added a step about how to explore the result, and modify the filter. Hopefully it makes sense, and is useful to you. There are a variety of ways to solve this problem. This is just one.
