2.7 years ago
Claire
Hi everyone,
I am trying to prepare a tx2gene.tsv file for salmon alevin (sc/snRNA-seq). I use the command below, but the output usually comes out empty. Any help on what is wrong with the command? Thanks.
> bioawk -c gff '$feature=="transcript" {print $attribute}' <(gunzip -c
> gencode.v31.primary_assembly.annotation.gtf.gz) | awk -F ' ' '{print
> substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' - >
> txp2gene.tsv
I checked that gencode.v31.primary_assembly.annotation.gtf.gz exists in my directory; a subset of it is shown below. There is no error, txp2gene.tsv is just empty. Thanks a lot.
> ##description: evidence-based annotation of the human genome (GRCh38), version 31 (Ensembl 97)
> ##provider: GENCODE
> ##contact: gencode-help@ebi.ac.uk
> ##format: gtf
> ##date: 2019-06-27
> chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; hgnc_id "HGNC:37102"; havana_gene "OTTHUMG00000000961.2";
> chr1 HAVANA transcript 11869 14409 . + . gene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; transcript_type "lncRNA"; transcript_name "DDX11L1-202"; level 2; transcript_support_level "1"; hgnc_id "HGNC:37102"; tag "basic"; havana_gene "OTTHUMG00000000961.2"; havana_transcript "OTTHUMT00000362751.1";
> chr1 HAVANA exon 11869 12227 .
> ...
Hmm, I don't know bioawk, but you're not giving bioawk a GTF/GFF input.
<(gunzip -c input.gtf.gz) | awk -F ' ' '{print substr($4,2,length($4)-3) "\t" substr($2,2,length($2)-3)}' -
It looks like a two-column file.
It should be in this part of my command.
I tried yours and I get something like:
> 18 AVA
> 18 AVA
> 18 AVA
> 26 AVA
> 616 NSEM
> 615 NSEM
> 380 NSEM
I need the transcript-to-gene mapping. Thanks a lot, Pierre. I will try to work from yours. That's helpful.
No, gunzip -c won't produce a gtf.gz file...
I think it is a formatting issue here; I am not familiar with Biostars formatting. But if you look at my original command in the question, it is right.
The issue is not with gunzip -c; I guess it is with the attr and the substring, but I am not sure how to fix it. The command is exactly as given in the salmon alevin tutorial: https://combine-lab.github.io/alevin-tutorial/2018/setting-up-resources/
Thanks Pierre anyhow :)
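One way to test the guess about the attribute field, sketched with plain awk so it does not depend on bioawk: count the tab-separated columns of a record and look at columns 9 and 10. The function name `gtf_field_report` and the one-record sample are hypothetical, modelled on the excerpt in the question. As far as I know, bioawk's gff preset names column 9 `group` and reserves `attribute` for a tenth column, which a GTF record does not have; `bioawk -c help` should list the names on a given install, so treat that mapping as an assumption to verify.

```shell
# Hypothetical helper: report the column count and columns 9/10 of the
# first non-comment record read from stdin (tab-separated GTF/GFF).
gtf_field_report() {
  awk -F '\t' '!/^#/ {
    print "fields=" NF          # a GTF record is expected to have 9 columns
    print "col9=" $9            # the key "value"; pairs live here
    print "col10=[" $10 "]"     # empty when there is no tenth column
    exit
  }'
}

# One-record sample modelled on the excerpt in the question.
printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2";\n' |
  gtf_field_report
```

If column 10 comes back empty, printing a field mapped to it can only produce blank lines, and the downstream `substr` calls then have nothing to cut.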
The gzip -c operation on the gzip file is incorrect, as Pierre said. bioawk can handle gzipped files. In addition, try printing
bioawk -c gff '{print $attribute}' <gencode_gtf>
and the output will be empty, as attribute does nothing here.
Probably you are looking for output like this:
I also added source to the filter, and you may also want to consider the evidence level for the transcript. Remove head at the end of the command for the full list.
It seems I did not understand Pierre's response well until you explained it, cpad0112. Got you. Sorry Pierre, and thanks to both Pierre and cpad0112.
You can also do this, but it prints records from all sources. You need to figure out how to print the source as well:
Thanks cpad0112 :) I will play around with it. Appreciated :)
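For anyone landing on this thread later, here is a minimal sketch of building the transcript-to-gene table with plain awk instead of bioawk. It is not the command from the alevin tutorial or from the replies above; the function name `gtf_to_tx2gene` is hypothetical, and it assumes GENCODE-style attributes of the form `key "value";` in column 9. It looks the keys up by name rather than by position, so it does not rely on `gene_id` and `transcript_id` being the first two attributes.

```shell
# Hypothetical helper: read an uncompressed GTF on stdin and print
# transcript_id <TAB> gene_id for every "transcript" record.
gtf_to_tx2gene() {
  awk -F '\t' '
    !/^#/ && $3 == "transcript" {
      gene = ""; tx = ""
      n = split($9, attrs, ";")            # one key "value" pair per element
      for (i = 1; i <= n; i++) {
        kv = attrs[i]
        sub(/^ +/, "", kv)                 # trim leading spaces
        if (kv ~ /^gene_id /)       { gene = kv; sub(/^gene_id /, "", gene) }
        if (kv ~ /^transcript_id /) { tx = kv;   sub(/^transcript_id /, "", tx) }
      }
      gsub(/"/, "", gene); gsub(/"/, "", tx)   # drop the surrounding quotes
      if (gene != "" && tx != "") print tx "\t" gene
    }'
}

# Two-record sample modelled on the excerpt in the question:
# the gene line is skipped, the transcript line is reported.
{
  printf 'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; gene_name "DDX11L1";\n'
  printf 'chr1\tHAVANA\ttranscript\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5"; transcript_id "ENST00000456328.2"; gene_name "DDX11L1";\n'
} | gtf_to_tx2gene
```

On the real annotation this would be called as `gunzip -c gencode.v31.primary_assembly.annotation.gtf.gz | gtf_to_tx2gene > txp2gene.tsv`. Filtering on the source column (for example `$2 == "HAVANA"`) could be added to the awk pattern if only some sources are wanted.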