I am currently working on a GTF file that is tab-separated, and I need to use bash to find and extract any Gene IDs with more than one transcript, then print out these Gene IDs, how many transcripts was found (count), and the transcripts themselves.
I am currently working on a GTF file that is tab-separated, and I need to use bash to find and extract any Gene IDs with more than one transcript, then print out these Gene IDs, how many transcripts was found (count), and the transcripts themselves.
Save the following script to duplicate.sh and place it in your directory where the gtf/gff file is:
#!/bin/bash
#this will set a variable for the duplicated gene ids
duplicated_gene_ids=$(cat yourgtffile.txt | awk -F " " '{ print $10,$12 }' | tr -d '"' | tr -d ';' | cut -d " " -f1 | uniq -d)
#the following line removes the output file (extracted.txt) prior to start
rm extracted.txt
#the following for loop will cycle throug the variable $duplicated_gene_ids
#generating extracted.txt and print the counted occurrence of duplicates
for singlegeneid in ${duplicated_gene_ids}
do
grep ${singlegeneid} yourgtffile.txt | awk -F " " '{ print $10,$12 }' | tr -d '"' | tr -d ';' >> extracted.txt
grep --count ${singlegeneid} yourgtffile.txt | awk '{ print "'"${singlegeneid}"'" " occurred " $1 " times" } '
done
Use it like this:
pratik@pratik:~/Desktop/So$ bash duplicate.sh
Output:
STRG.14852 occurred 3 times
Output of extracted.txt:
STRG.14852 STRG.14852.1
STRG.14852 STRG.14852.2
STRG.14852 STRG.14852.3
I must comment that using "'"${singlegeneid}"'" is generally not a good idea as I've read that people can inject code into that somehow like I guess run entire scripts (ie. viruses etc) through the variable somehow. I guess the double "'"
is what allows this. Therefore there probably is a better way to write this. but this accomplishes what you need, I think.
Hope this helps!
Hi So,
Here is your script in both bash and R, as this was easier for me. (I also used some shortcuts here), I will make a bash solution at some point... it will just take more time as I have a Master's project to tend to that I've been putting aside!!!
I must say though Google is your friend. you can google things like: count duplicates in bash
and the second link is someone asking the question on stackoverflow:
and then piece that together with the next question you have to build your script.
Anyways here's my try at this.
So you would just have to change the file paths appropriately.
Your current output is this:
pratik@Pratiks-MacBook-Air So % cat transcripts.txt
VFFH01000932.1 StringTie transcript 716340 733649 1000 + . "gene_id ""STRG.14845""; transcript_id ""STRG.14845.6""; cov ""2.191051""; FPKM ""0.287366""; TPM ""0.383253"";"
VFFH01001189.1 StringTie transcript 11702 12012 1000 . . "gene_id ""STRG.14846""; transcript_id ""STRG.14846.1""; cov ""4.823151""; FPKM ""0.632579""; TPM ""0.843654"";"
...
Doing the following:
cat transcripts.txt | tr -d '"' | tr -d ';' | awk -F " " '{ print $3,$4,$5,$10,$12 }' | tr "." ' ' | awk -F " " '{ print $1,$2,$3,$4 "." $5,$6 "." $7 " " $8 }' > transcripts_cleaned.txt
Will result in a text file "transcripts_cleaned.txt" that looks like this:
pratik@Pratiks-MacBook-Air So % cat transcripts_cleaned.txt
transcript 716340 733649 STRG.14845 STRG.14845 6
transcript 11702 12012 STRG.14846 STRG.14846 1
transcript 72502 81013 STRG.14847 STRG.14847 1
...
You can see here that I basically used the number after transcript id to provide a number for the counts of transcripts. The last column being the counts.
Now in R:
The following script or variations should get you what you want:
df <- read.table("~/Desktop/biostars/So/transcripts_cleaned.txt", sep = " ")
df2 <- df[!duplicated(as.list(df))]
install.packages('dplyr')
library(dplyr)
df.with.more.than.one.count <- df2 %>% group_by(V4) %>% filter(n()>1)
df.with.gene.ids.transcripts <- df.with.more.than.one.count %>% group_by(V4) %>% filter(V6==max(V6))
df.with.gene.ids.transcripts$V4
write.table(df.with.gene.ids.transcripts$V4, file = "~/Desktop/biostars/So/geneids_with_more_than_one_transcript.txt", sep= " ", col.names = F, row.names = F)
write.table(df.with.gene.ids.transcripts, file = "~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame.txt", sep= " ", col.names = F, row.names = F)
if you want to remove quotes in the file you can use the tr
for translate like this:
cat ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame.txt | tr -d '"' > ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_whole_data_frame_no_quotes.txt
and
cat ~/Desktop/biostars/So/geneids_with_more_than_one_transcript.txt | tr -d '"' > ~/Desktop/biostars/So/geneids_with_more_than_one_transcript_without_quotes.txt
I hope this helps.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
What have you tried? This is pretty straightforward with core utils (
grep
/cut
/uniq
) andawk
.I haven't worked with bash before that's why I don't know the utilities, so I thought of asking here.
If you do not need to explicitly use bash you can have a look at the AGAT tool, I'm pretty sure it will have an option to achieve this.
AGAT - Another Gff/Gtf Analysis Toolkit
So : I second this. As a beginner in bash, you will run into a bunch of challenges that are great as a learning experience, but will result in delays in getting things done. For the moment, use this tool that already exists. The
agat_sp_statistics
script seems interesting - it might give you number of transcripts per gene statistic. If not, I'd recommend using R instead of bash as you're looking to do a group/aggregate operation which is done easier on R.Ram you've provided alot of help, thank you so much. I will make sure to learn more R and try challenges abit far from Bash.
Unfortunately, I have to use bash, but I will still check AGAT for future use. Thank you!
Could you provide a sample of the data, please?
You can do the following to print out the top (head) 10 lines.
head -10 your_gtf_file.gtf
or print out the top (head) 5 lines:
head -5 your_gtf_file.gtf
and then past your output in your question. Remember to use the '010101' button to put it in code format so it's easier to visualize and copy and paste. This will help myself or someone else work with it to give you the output you need.
So this is the last few heads of the gtf file, there are over a thousand in there.
So basically what I want is to write bash code that checks if a gene id has more than one transcript. If so it collects the Gene Ids, how many transcripts map to it, and what are the ids of these transcripts "Transcript ID".
Please, provide me mainly with the sources to learn so I can handle such tasks on my own, I love having such helpful community, but I mainly love to rely on myself, so please do provide me with what I need to learn as well so I can do even more complex tasks on my own.
Thank you!
Could you please check my latest reply below?
Okay, so I worked abit on it and here is what I have come to, and I still need help.
I will rephrase the whole problem real quick: I am currently working on a GTF file, here is an example of one header for the file.
Every line looks like this with different values, and I simply wanted to Extract Gene ID (e.g. STRG.1) and then find if this gene ID is repeated, if it is repeated (duplicate) extract all the Transcript IDs that matches that gene ID (in this line only STRG.1.1) and put all of this in one row! and tell me how many times did you find that Gene ID.
What I tried to do was
Gave me this output ( a single header as an example )
It counted the occurrence correctly, but compressed all Transcript IDs to just one, didn't add them in a single row.
Then, I stumbled upon this piece of code from the user in stackoverflow (credits to the user) https://unix.stackexchange.com/users/65304/steeldriver?tab=profile answered it as follows
I edited it to
It gave me this
Now I am stuck with two things:
I changed the separator to be white space instead of tabs to recognize the fields in column 3, I don't understand the use of OFS=FS. For the second part I understand that it creates an array for the row that will be concatenated to. don't understand a thing after the equal sign.
As for the last post, it has 5 different transcript IDs for the same gene ID. How do I print it out next to them in a different field?
I tried awk print NF, using comma separation for the file but it replaced the previous output with the count. I only added this to the previous code.