Hi,
I have a file containing a list of mutations of some genes, with the format -
rsID Location GeneID Feature Exon NucleotideSubstitution AminoAcidSubstitution cDNA Type
rs541031071 11:47331423-47331423 MYBPC3 ENST00000256993.8 34/34 aaG/aaT K/N 4200 3_prime_UTR
rs541031071 11:47331423-47331423 MYBPC3 ENST00000399249.6 35/35 aaG/aaT K/N 4200 3_prime_UTR
rs541031071 11:47331423-47331423 MYBPC3 ENST00000387238.6 - aaG/aaT K/N 4200 upstream_gene_variant
Our focus is to identify unique variants, but as shown above, the rows describe the same variant and differ only in a few columns, in this case 'Feature' and 'Exon'. Is there a way to check whether rows 1 to n of column 1 are equal and, if they are, use the Exon column (column 5) to pick out the longest isoform (the one with 35 exons)?
My (hasty) solution to this would be to delete the variable columns (except Exon), grep all rows containing 35 exons, and then remove duplicates, if any. I would like to know if there is a cleaner and more sophisticated way of doing the same.
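A rough sketch of that hasty approach, for reference. It assumes the file is tab-separated with no header line and has the nine columns listed above; the temporary file and the variable names are made up for the illustration, and dropping the Type column along with Feature is an assumption too.

```shell
# Hypothetical input: the three example rows, written tab-separated to a temp file.
TAB="$(printf '\t')"
input="$(mktemp)"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000256993.8 34/34 aaG/aaT K/N 4200 3_prime_UTR \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000399249.6 35/35 aaG/aaT K/N 4200 3_prime_UTR \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000387238.6 - aaG/aaT K/N 4200 upstream_gene_variant \
  > "$input"

# 1. Drop the variable columns Feature (4) and Type (9), keeping Exon (5).
# 2. Keep only rows whose Exon field is exactly 35/35.
# 3. Remove duplicate rows.
hasty="$(cut -f1-3,5-8 "$input" | grep "${TAB}35/35${TAB}" | sort -u)"
printf '%s\n' "$hasty"

rm -f "$input"
```

The obvious weakness is the hard-coded 35/35: it only works when the exon count of the longest isoform is known in advance and is the same for every gene in the file.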
Thanks in advance,
Vinay
This works perfectly! I was wondering if I could modify the command so that (the last row and the last column have been added to the question) I separately pick the Type of variant without bothering about the Exon column. I tried
sort -t $'\t' -k1,1 -k6,6gr inputFile | sort -t $'\t' -uk1,1 --merge
and
sort -t $'\t' -k1,1 -k6,6gr inputFile | sort -t $'\t' -uk6,6 --merge
both with and without the -u option, but I only retrieve 3_prime_UTR (while I'm trying to retrieve both 3_prime_UTR and upstream_gene_variant). This is not a key requirement, it's just out of curiosity, so don't put too much time into this. Thanks again :)
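One likely reason only 3_prime_UTR comes back: with -u on key 1 alone, sort can keep just one row per rsID, so two Types for the same rsID can never both survive. The key that defines "unique" has to include the Type column as well. Below is a minimal sketch of that idea, assuming tab-separated input with the nine columns from the question (Exon in column 5, Type in column 9; the commands above use column 6, so adjust the numbers to the real layout). It swaps the second sort for awk's seen[] idiom, which keeps the first row per key, and uses -n rather than -g for the numeric sort. The temporary file and variable names are hypothetical.

```shell
# Hypothetical input: the three example rows, written tab-separated to a temp file.
TAB="$(printf '\t')"
input="$(mktemp)"
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000256993.8 34/34 aaG/aaT K/N 4200 3_prime_UTR \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000399249.6 35/35 aaG/aaT K/N 4200 3_prime_UTR \
  rs541031071 11:47331423-47331423 MYBPC3 ENST00000387238.6 - aaG/aaT K/N 4200 upstream_gene_variant \
  > "$input"

# Sort by rsID, then by the leading exon number in descending order,
# so the longest isoform comes first within each rsID.
# awk then keeps the first row seen for every rsID + Type combination.
by_type="$(LC_ALL=C sort -t "$TAB" -k1,1 -k5,5nr "$input" \
  | awk -F'\t' '!seen[$1 FS $9]++')"
printf '%s\n' "$by_type"

rm -f "$input"
```

On the example rows this should leave two lines: the 35/35 row for 3_prime_UTR and the single upstream_gene_variant row.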
Consider this mock file:
I want the rows with the highest value from the 2nd column for unique 1st + 3rd column combinations:
Like this:
Almost back to the original format:
Back to the original format:
!! NOTE !!
Choose your "join separator" wisely, i.e. know your data. Here sed replaces the 1st occurrence of an underscore with a tab. This works because there is no underscore in the 1st column of my mock data.
!! NOTE !!
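The steps above can be sketched roughly as follows. Everything specific here is an assumption made for illustration: the mock file contents, the variable names, and the exact commands (awk's seen[] idiom is used for the "first row per key" step, which may differ from the commands originally used). Only the overall idea is taken from the description: join columns 1 and 3 with an underscore, keep the row with the highest column 2 per joined key, then split the key again with sed.

```shell
# Hypothetical three-column, tab-separated mock file.
TAB="$(printf '\t')"
mock="$(mktemp)"
printf '%s\t%s\t%s\n' \
  a 1 x \
  a 5 x \
  a 3 y \
  b 2 x \
  b 7 x \
  > "$mock"

# Join columns 1 and 3 into one key (col1_col3), sort by key and then by
# value in descending order, and keep the first row of each key.
joined="$(awk -F'\t' -v OFS='\t' '{ print $1 "_" $3, $2 }' "$mock" \
  | LC_ALL=C sort -t "$TAB" -k1,1 -k2,2nr \
  | awk -F'\t' '!seen[$1]++')"

# Almost back to the original format: split the key at its first underscore.
almost="$(printf '%s\n' "$joined" | sed "s/_/${TAB}/")"

# Back to the original format: restore the column order col1, col2, col3.
final="$(printf '%s\n' "$almost" | awk -F'\t' -v OFS='\t' '{ print $1, $3, $2 }')"
printf '%s\n' "$final"

rm -f "$mock"
```

With this mock data the three surviving rows are a/5/x, a/3/y and b/7/x. If the first column could contain an underscore, a separator that never occurs in the data would be needed instead, as the note says.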