Question

bioawk command fails to generate tx2gene.tsv

0

Entering edit mode

2.7 years ago

Claire • 0

Hi Everyone

I am trying to prepare tx2gene.tsv for salmon alevin for sc/snRNASeq. I use this command but I usually get empty output, any help on what's wrong in the command, thanks

 > bioawk -c gff '$feature=="transcript" {print $attribute}' <(gunzip -c
 > gencode.v31.primary_assembly.annotation.gtf.gz) | awk -F ' ' '{print
 > substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' - >
 > txp2gene.tsv

I checked gencode.v31.primary_assembly.annotation.gtf.gz does exist in my directory, subset of it as below. No error, just txp2gene is empty. Thanks a lot

> ##description: evidence-based annotation of the human genome (GRCh38), version 31 (Ensembl 97)
> ##provider: GENCODE
> ##contact: gencode-help@ebi.ac.uk
> ##format: gtf
> ##date: 2019-06-27 chr1    HAVANA  gene    11869   14409   .       +       .       gene_id "ENSG00000223972.5"; gene_type
> "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2;
> hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2"; chr1   
> HAVANA  transcript      11869   14409   .       +       .      
> gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";
> gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1";
> transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2;
> transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic";
> havana_gene "OTTHUMG00000000961.2"; havana_transcript
> "OTTHUMT00000362751.1"; chr1    HAVANA  exon    11869   12227   .  
> ...

salmon bioawk alevin • 1.9k views

ADD COMMENT • link updated 2.7 years ago by D. Puthier ▴ 350 • written 2.7 years ago by Claire • 0

1

Entering edit mode

hum I don't know bioawk , but you're not giving bioawk a GTF/GFF input.

<(gunzip -c input.gtf.gz) | awk -F ' ' '{print substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' -

it looks like a 2 columns file.

ADD REPLY • link 2.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

It should be in this part in my command

< (gunzip -c gencode.v31.primary_assembly.annotation.gtf.gz)

I tried yours I get something like:
18 AVA 18 AVA 18 AVA 26 AVA 616 NSEM 615 NSEM 380 NSEM I need the transcripts to gene. Thanks Pierre a lot. I will try to work around yours. That's helpful.

ADD REPLY • link 2.7 years ago by Claire • 0

0

Entering edit mode

gunzip -c > gencode.v31.primary_assembly.annotation.gtf.gz)

no gunzip -c won't produce a gtf.gz file...

ADD REPLY • link 2.7 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

I think it is a formatting issue here, not familiar with biostar formatting. But if you look at my original command in the question it is right.

< (gunzip -c gencode.v31.primary_assembly.annotation.gtf.gz)

The issue is not with gunzip -c, it is with the attr and substring I guess, but not sure how to fix it. The command is exactly as here in salmon alevin: https://combine-lab.github.io/alevin-tutorial/2018/setting-up-resources/

Thanks Pierre any how :)

ADD REPLY • link 2.7 years ago by Claire • 0

1

Entering edit mode

gzip -c operation on gzip file is a incorrect as pierre said. bioawk can handle gzipped files. In addition, try printing bioawk -c gff '{print $attribute}' <gencode_gtf> and it will be an empty one as attribute does nothing here.

Probably you are looking output like this:

$ bioawk -c gff '$source == "ENSEMBL" && $feature=="transcript" {print $group}' gencode.v39.primary_assembly.annotation.gtf.gz |  awk -v OFS="\t" -F "\"" '{print $4,$2}' | head                             

ENST00000619216.1   ENSG00000278267.1
ENST00000607096.1   ENSG00000284332.1
ENST00000610542.1   ENSG00000238009.6
ENST00000410691.1   ENSG00000222623.1
ENST00000612080.1   ENSG00000273874.1
ENST00000614007.1   ENSG00000278757.1
ENST00000621981.1   ENSG00000278791.1
ENST00000411249.1   ENSG00000223181.1
ENST00000433179.3   ENSG00000187642.9
ENST00000620552.4   ENSG00000188157.15

I added source also to the filter and you may also want to consider the evidence level for the transcript. Remove head at the end of the function for full list.

ADD REPLY • link 2.7 years ago by cpad0112 21k

0

Entering edit mode

It seems I did not understand Pierre response well till you explained it cpad0112. Got you. Sorry Pierre and thanks both Pierre and cpad0112.

ADD REPLY • link 2.7 years ago by Claire • 0

1

Entering edit mode

you can also do this, but it prints records from all sources. You need to figure out how to print source as well:

$ gffread  -E gencode.v39.primary_assembly.annotation.gtf --table @id,@geneid  | head

Command line was:
gffread -E gencode.v39.primary_assembly.annotation.gtf --table @id,@geneid
   .. loaded 244998 genomic features from gencode.v39.primary_assembly.annotation.gtf
ENST00000456328.2   ENSG00000223972.5
ENST00000450305.2   ENSG00000223972.5
ENST00000488147.1   ENSG00000227232.5
ENST00000619216.1   ENSG00000278267.1
ENST00000473358.1   ENSG00000243485.5
ENST00000469289.1   ENSG00000243485.5
ENST00000607096.1   ENSG00000284332.1
ENST00000417324.1   ENSG00000237613.2
ENST00000461467.1   ENSG00000237613.2
ENST00000606857.1   ENSG00000268020.3

ADD REPLY • link 2.7 years ago by cpad0112 21k

0

Entering edit mode

Thanks cpad0112 :) Will play around it. Appreciated :)

ADD REPLY • link 2.7 years ago by Claire • 0

score 0 · Answer 1 · 2022-03-25

Hi, You may also use gtftk (CLI interface of pygtftk) to achieve this.

   gtftk select_by_key --select-transcripts  -i  input.gtf.gz | gtftk tabulate -k transcript_id,gene_id

Alternatively you can also merge gene_id and gene_name which can be useful for downstream analyses.

 gtftk select_by_key --select-transcripts  -i  input.gtf.gz | gtftk merge_attr -k gene_id,gene_name -d merge_id | gtftk tabulate -k transcript_id,merge_id

Best

Disclaimer: I'm the pygtftk developper.