Extract column which contain <20 exons in a transcript
2
0
Entering edit mode
3.8 years ago
harry ▴ 40

The first 3 column contains the genome coordinates 4th column contains transcript name 5th column contains exon name and 6th column contain exon rank. So I want to separate this file into 2 files: 1st files if the transcript contains a total <20 exons then print all exons and another file in which if the transcript contains a total of more than 20 exons then print all exons in different file.

X   70444861    70445558    ENST00000194900 ENSE00001463308 1
X   70448580    70448633    ENST00000194900 ENSE00001463307 2
X   70448913    70448963    ENST00000194900 ENSE00003643927 3
X   70449359    70449483    ENST00000194900 ENSE00003548284 4
X   70449690    70449859    ENST00000194900 ENSE00003690291 5
X   70450169    70450305    ENST00000194900 ENSE00000436378 6
X   70450639    70450783    ENST00000194900 ENSE00000672250 7
X   70451867    70452026    ENST00000194900 ENSE00003562095 8
X   70453637    70453793    ENST00000194900 ENSE00003664498 9
X   70454214    70454316    ENST00000194900 ENSE00003617252 10
X   70479150    70479264    ENST00000194900 ENSE00003589163 11
X   70492107    70492283    ENST00000194900 ENSE00003626140 12
X   70492521    70492596    ENST00000194900 ENSE00003469450 13
X   70495408    70495453    ENST00000194900 ENSE00003692914 14
X   70497180    70497221    ENST00000194900 ENSE00003595614 15
X   70498520    70498570    ENST00000194900 ENSE00003511316 16
X   70499176    70499277    ENST00000194900 ENSE00003615707 17
X   70499877    70500049    ENST00000194900 ENSE00003483338 18
X   70500471    70500580    ENST00000194900 ENSE00003686632 19
X   70500898    70500989    ENST00000194900 ENSE00003626326 20
X   70502163    70505490    ENST00000194900 ENSE00003733449 21
X   120426148   120431462   ENST00000200639 ENSE00001187748 9
X   120441730   120441894   ENST00000200639 ENSE00000854520 8
X   120442599   120442662   ENST00000200639 ENSE00000374515 7
X   120446305   120446427   ENST00000200639 ENSE00000854522 6
X   120447841   120448025   ENST00000200639 ENSE00000854523 5
X   120448970   120449128   ENST00000200639 ENSE00000854524 4
X   120455357   120455570   ENST00000200639 ENSE00000675764 3
X   120456651   120456769   ENST00000200639 ENSE00003615412 2
X   120469106   120469296   ENST00000200639 ENSE00003896841 1
X   68829021    68829904    ENST00000204961 ENSE00001041113 1
X   68838617    68838894    ENST00000204961 ENSE00000672435 2
X   68839664    68839756    ENST00000204961 ENSE00000672436 3
X   68839960    68840088    ENST00000204961 ENSE00000672437 4
X   68840242    68842160    ENST00000204961 ENSE00001041114 5
X   100843836   100844078   ENST00000217885 ENSE00001857295 12
X   100848630   100848754   ENST00000217885 ENSE00000401023 11
X   100849772   100849934   ENST00000217885 ENSE00000401021 10
X   100850151   100850386   ENST00000217885 ENSE00001661840 9
X   100851233   100851325   ENST00000217885 ENSE00001607924 8
X   100862171   100862303   ENST00000217885 ENSE00001702593 7
X   100862392   100862573   ENST00000217885 ENSE00001701549 6
X   100862669   100862820   ENST00000217885 ENSE00001637130 5
X   100863159   100863243   ENST00000217885 ENSE00001653972 4
X   100863485   100863595   ENST00000217885 ENSE00001767233 3
X   100870719   100870814   ENST00000217885 ENSE00001602753 2
X   100874095   100874209   ENST00000217885 ENSE00001187812 1
X   2913614 2914773 ENST00000217890 ENSE00001203379 7
X   2915556 2915692 ENST00000217890 ENSE00003684272 6
X   2917804 2918227 ENST00000217890 ENSE00003509340 5
X   2920601 2920723 ENST00000217890 ENSE00003691518 4
X   2921903 2922024 ENST00000217890 ENSE00003652714 3
X   2925616 2925765 ENST00000217890 ENSE00003691123 2
X   2929232 2929275 ENST00000217890 ENSE00001881224 1

After the output, the 2 files looks like this:- The first files look like:-

 X  120426148   120431462   ENST00000200639 ENSE00001187748 9
    X   120441730   120441894   ENST00000200639 ENSE00000854520 8
    X   120442599   120442662   ENST00000200639 ENSE00000374515 7
    X   120446305   120446427   ENST00000200639 ENSE00000854522 6
    X   120447841   120448025   ENST00000200639 ENSE00000854523 5
    X   120448970   120449128   ENST00000200639 ENSE00000854524 4
    X   120455357   120455570   ENST00000200639 ENSE00000675764 3
    X   120456651   120456769   ENST00000200639 ENSE00003615412 2
    X   120469106   120469296   ENST00000200639 ENSE00003896841 1
    X   68829021    68829904    ENST00000204961 ENSE00001041113 1
    X   68838617    68838894    ENST00000204961 ENSE00000672435 2
    X   68839664    68839756    ENST00000204961 ENSE00000672436 3
    X   68839960    68840088    ENST00000204961 ENSE00000672437 4
    X   68840242    68842160    ENST00000204961 ENSE00001041114 5
    X   100843836   100844078   ENST00000217885 ENSE00001857295 12
    X   100848630   100848754   ENST00000217885 ENSE00000401023 11
    X   100849772   100849934   ENST00000217885 ENSE00000401021 10
    X   100850151   100850386   ENST00000217885 ENSE00001661840 9
    X   100851233   100851325   ENST00000217885 ENSE00001607924 8
    X   100862171   100862303   ENST00000217885 ENSE00001702593 7
    X   100862392   100862573   ENST00000217885 ENSE00001701549 6
    X   100862669   100862820   ENST00000217885 ENSE00001637130 5
    X   100863159   100863243   ENST00000217885 ENSE00001653972 4
    X   100863485   100863595   ENST00000217885 ENSE00001767233 3
    X   100870719   100870814   ENST00000217885 ENSE00001602753 2
    X   100874095   100874209   ENST00000217885 ENSE00001187812 1
    X   2913614 2914773 ENST00000217890 ENSE00001203379 7
    X   2915556 2915692 ENST00000217890 ENSE00003684272 6
    X   2917804 2918227 ENST00000217890 ENSE00003509340 5
    X   2920601 2920723 ENST00000217890 ENSE00003691518 4
    X   2921903 2922024 ENST00000217890 ENSE00003652714 3
    X   2925616 2925765 ENST00000217890 ENSE00003691123 2
    X   2929232 2929275 ENST00000217890 ENSE00001881224 1

The second files look like this:-

X   70444861    70445558    ENST00000194900 ENSE00001463308 1
    X   70448580    70448633    ENST00000194900 ENSE00001463307 2
    X   70448913    70448963    ENST00000194900 ENSE00003643927 3
    X   70449359    70449483    ENST00000194900 ENSE00003548284 4
    X   70449690    70449859    ENST00000194900 ENSE00003690291 5
    X   70450169    70450305    ENST00000194900 ENSE00000436378 6
    X   70450639    70450783    ENST00000194900 ENSE00000672250 7
    X   70451867    70452026    ENST00000194900 ENSE00003562095 8
    X   70453637    70453793    ENST00000194900 ENSE00003664498 9
    X   70454214    70454316    ENST00000194900 ENSE00003617252 10
    X   70479150    70479264    ENST00000194900 ENSE00003589163 11
    X   70492107    70492283    ENST00000194900 ENSE00003626140 12
    X   70492521    70492596    ENST00000194900 ENSE00003469450 13
    X   70495408    70495453    ENST00000194900 ENSE00003692914 14
    X   70497180    70497221    ENST00000194900 ENSE00003595614 15
    X   70498520    70498570    ENST00000194900 ENSE00003511316 16
    X   70499176    70499277    ENST00000194900 ENSE00003615707 17
    X   70499877    70500049    ENST00000194900 ENSE00003483338 18
    X   70500471    70500580    ENST00000194900 ENSE00003686632 19
    X   70500898    70500989    ENST00000194900 ENSE00003626326 20
    X   70502163    70505490    ENST00000194900 ENSE00003733449 21

So please can anyone tell me how to make 2 files. Thanks in advance.

transcript exons extract • 1.1k views
ADD COMMENT
1
Entering edit mode

with awk (since each exon is preceded by transcript entry)

Transcripts with more than 20 exons:

$ awk 'NR==FNR{a[$4]++; next} a[$4] > 20' test.txt test.txt

Transcripts with less than 20 exons:

$ awk 'NR==FNR{a[$4]++; next} a[$4] < 20' test.txt test.txt

With datamash, awk and join:

For transcripts with more than 20 exons:

$ datamash -sf -g 4 count 5   < test.txt | awk '$7 > 20 {print}' | join -1 4 -2 4 test.txt - -o 1.1,1.2,1.3,1.4,1.5,1.6

For transcripts with less than 20 exons:

$ datamash -sf -g 4 count 5   < test.txt | awk '$7 < 20 {print}' | join -1 4 -2 4 test.txt - -o 1.1,1.2,1.3,1.4,1.5,1.6
ADD REPLY
2
Entering edit mode
3.8 years ago
lessismore ★ 1.4k

This is really not a bioinformatic question. Anyway, using R, this is the solution.

library(tidyverse)



biostars_clean <- c("X 70444861 70445558 ENST00000194900 ENSE00001463308 1", "X 70448580 70448633 ENST00000194900 ENSE00001463307 2", 
"X 70448913 70448963 ENST00000194900 ENSE00003643927 3", "X 70449359 70449483 ENST00000194900 ENSE00003548284 4", 
"X 70449690 70449859 ENST00000194900 ENSE00003690291 5", "X 70450169 70450305 ENST00000194900 ENSE00000436378 6", 
"X 70450639 70450783 ENST00000194900 ENSE00000672250 7", "X 70451867 70452026 ENST00000194900 ENSE00003562095 8", 
"X 70453637 70453793 ENST00000194900 ENSE00003664498 9", "X 70454214 70454316 ENST00000194900 ENSE00003617252 10", 
"X 70479150 70479264 ENST00000194900 ENSE00003589163 11", "X 70492107 70492283 ENST00000194900 ENSE00003626140 12", 
"X 70492521 70492596 ENST00000194900 ENSE00003469450 13", "X 70495408 70495453 ENST00000194900 ENSE00003692914 14", 
"X 70497180 70497221 ENST00000194900 ENSE00003595614 15", "X 70498520 70498570 ENST00000194900 ENSE00003511316 16", 
"X 70499176 70499277 ENST00000194900 ENSE00003615707 17", "X 70499877 70500049 ENST00000194900 ENSE00003483338 18", 
"X 70500471 70500580 ENST00000194900 ENSE00003686632 19", "X 70500898 70500989 ENST00000194900 ENSE00003626326 20", 
"X 70502163 70505490 ENST00000194900 ENSE00003733449 21", "X 120426148 120431462 ENST00000200639 ENSE00001187748 9", 
"X 120441730 120441894 ENST00000200639 ENSE00000854520 8", "X 120442599 120442662 ENST00000200639 ENSE00000374515 7", 
"X 120446305 120446427 ENST00000200639 ENSE00000854522 6", "X 120447841 120448025 ENST00000200639 ENSE00000854523 5", 
"X 120448970 120449128 ENST00000200639 ENSE00000854524 4", "X 120455357 120455570 ENST00000200639 ENSE00000675764 3", 
"X 120456651 120456769 ENST00000200639 ENSE00003615412 2", "X 120469106 120469296 ENST00000200639 ENSE00003896841 1", 
"X 68829021 68829904 ENST00000204961 ENSE00001041113 1", "X 68838617 68838894 ENST00000204961 ENSE00000672435 2", 
"X 68839664 68839756 ENST00000204961 ENSE00000672436 3", "X 68839960 68840088 ENST00000204961 ENSE00000672437 4", 
"X 68840242 68842160 ENST00000204961 ENSE00001041114 5", "X 100843836 100844078 ENST00000217885 ENSE00001857295 12", 
"X 100848630 100848754 ENST00000217885 ENSE00000401023 11", "X 100849772 100849934 ENST00000217885 ENSE00000401021 10", 
"X 100850151 100850386 ENST00000217885 ENSE00001661840 9", "X 100851233 100851325 ENST00000217885 ENSE00001607924 8", 
"X 100862171 100862303 ENST00000217885 ENSE00001702593 7", "X 100862392 100862573 ENST00000217885 ENSE00001701549 6", 
"X 100862669 100862820 ENST00000217885 ENSE00001637130 5", "X 100863159 100863243 ENST00000217885 ENSE00001653972 4", 
"X 100863485 100863595 ENST00000217885 ENSE00001767233 3", "X 100870719 100870814 ENST00000217885 ENSE00001602753 2", 
"X 100874095 100874209 ENST00000217885 ENSE00001187812 1", "X 2913614 2914773 ENST00000217890 ENSE00001203379 7", 
"X 2915556 2915692 ENST00000217890 ENSE00003684272 6", "X 2917804 2918227 ENST00000217890 ENSE00003509340 5", 
"X 2920601 2920723 ENST00000217890 ENSE00003691518 4", "X 2921903 2922024 ENST00000217890 ENSE00003652714 3", 
"X 2925616 2925765 ENST00000217890 ENSE00003691123 2", "X 2929232 2929275 ENST00000217890 ENSE00001881224 1"
)

biostars_clean <- 
  biostars_clean %>%
  as_tibble() %>%
  separate(value, sep = " ", into = c("V1", "V2","V3","V4", "V5", "V6")) %>%
  group_by(V4) %>%
  mutate(n_exons = n()) %>%
  ungroup()

# file 1 (<20 exons)
file1 <- biostars_clean %>%
  filter(n_exons < 20) %>%
  select(-n_exons) %>%
  write_delim("/home/file1.txt", delim = "\t")

# file 2 (> 20 exons)
file2 <- biostars_clean %>%
  filter(n_exons > 20) %>%
  select(-n_exons) %>%
  write_delim("/home/file2.txt", delim = "\t")
ADD COMMENT
2
Entering edit mode
3.8 years ago

1st files if the transcript contains a total <20 exons then print all exons

 grep -w -F -f  <(cat  input.bed  | cut -f4 | sort | uniq -c | awk '($1<20) {print $2}' ) input.bed
ADD COMMENT
0
Entering edit mode

Translation for us mortals :) Create a subprocess to count transcripts IDs (column 4), use awk to split the id and the count and return the ID if it has a count less than 20 (print) to use as grep patterns to target lines in the original file with: -w Select only those lines containing matches that form whole words. -F Interpret PATTERNS as fixed strings not regular expressions. -f Obtain patterns from FILE (e.g. the subprocess)

ADD REPLY

Login before adding your answer.

Traffic: 2496 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6