The first 3 column contains the genome coordinates 4th column contains transcript name 5th column contains exon name and 6th column contain exon rank. So I want to separate this file into 2 files: 1st files if the transcript contains a total <20 exons then print all exons and another file in which if the transcript contains a total of more than 20 exons then print all exons in different file.
X 70444861 70445558 ENST00000194900 ENSE00001463308 1
X 70448580 70448633 ENST00000194900 ENSE00001463307 2
X 70448913 70448963 ENST00000194900 ENSE00003643927 3
X 70449359 70449483 ENST00000194900 ENSE00003548284 4
X 70449690 70449859 ENST00000194900 ENSE00003690291 5
X 70450169 70450305 ENST00000194900 ENSE00000436378 6
X 70450639 70450783 ENST00000194900 ENSE00000672250 7
X 70451867 70452026 ENST00000194900 ENSE00003562095 8
X 70453637 70453793 ENST00000194900 ENSE00003664498 9
X 70454214 70454316 ENST00000194900 ENSE00003617252 10
X 70479150 70479264 ENST00000194900 ENSE00003589163 11
X 70492107 70492283 ENST00000194900 ENSE00003626140 12
X 70492521 70492596 ENST00000194900 ENSE00003469450 13
X 70495408 70495453 ENST00000194900 ENSE00003692914 14
X 70497180 70497221 ENST00000194900 ENSE00003595614 15
X 70498520 70498570 ENST00000194900 ENSE00003511316 16
X 70499176 70499277 ENST00000194900 ENSE00003615707 17
X 70499877 70500049 ENST00000194900 ENSE00003483338 18
X 70500471 70500580 ENST00000194900 ENSE00003686632 19
X 70500898 70500989 ENST00000194900 ENSE00003626326 20
X 70502163 70505490 ENST00000194900 ENSE00003733449 21
X 120426148 120431462 ENST00000200639 ENSE00001187748 9
X 120441730 120441894 ENST00000200639 ENSE00000854520 8
X 120442599 120442662 ENST00000200639 ENSE00000374515 7
X 120446305 120446427 ENST00000200639 ENSE00000854522 6
X 120447841 120448025 ENST00000200639 ENSE00000854523 5
X 120448970 120449128 ENST00000200639 ENSE00000854524 4
X 120455357 120455570 ENST00000200639 ENSE00000675764 3
X 120456651 120456769 ENST00000200639 ENSE00003615412 2
X 120469106 120469296 ENST00000200639 ENSE00003896841 1
X 68829021 68829904 ENST00000204961 ENSE00001041113 1
X 68838617 68838894 ENST00000204961 ENSE00000672435 2
X 68839664 68839756 ENST00000204961 ENSE00000672436 3
X 68839960 68840088 ENST00000204961 ENSE00000672437 4
X 68840242 68842160 ENST00000204961 ENSE00001041114 5
X 100843836 100844078 ENST00000217885 ENSE00001857295 12
X 100848630 100848754 ENST00000217885 ENSE00000401023 11
X 100849772 100849934 ENST00000217885 ENSE00000401021 10
X 100850151 100850386 ENST00000217885 ENSE00001661840 9
X 100851233 100851325 ENST00000217885 ENSE00001607924 8
X 100862171 100862303 ENST00000217885 ENSE00001702593 7
X 100862392 100862573 ENST00000217885 ENSE00001701549 6
X 100862669 100862820 ENST00000217885 ENSE00001637130 5
X 100863159 100863243 ENST00000217885 ENSE00001653972 4
X 100863485 100863595 ENST00000217885 ENSE00001767233 3
X 100870719 100870814 ENST00000217885 ENSE00001602753 2
X 100874095 100874209 ENST00000217885 ENSE00001187812 1
X 2913614 2914773 ENST00000217890 ENSE00001203379 7
X 2915556 2915692 ENST00000217890 ENSE00003684272 6
X 2917804 2918227 ENST00000217890 ENSE00003509340 5
X 2920601 2920723 ENST00000217890 ENSE00003691518 4
X 2921903 2922024 ENST00000217890 ENSE00003652714 3
X 2925616 2925765 ENST00000217890 ENSE00003691123 2
X 2929232 2929275 ENST00000217890 ENSE00001881224 1
After the output, the 2 files looks like this:- The first files look like:-
X 120426148 120431462 ENST00000200639 ENSE00001187748 9
X 120441730 120441894 ENST00000200639 ENSE00000854520 8
X 120442599 120442662 ENST00000200639 ENSE00000374515 7
X 120446305 120446427 ENST00000200639 ENSE00000854522 6
X 120447841 120448025 ENST00000200639 ENSE00000854523 5
X 120448970 120449128 ENST00000200639 ENSE00000854524 4
X 120455357 120455570 ENST00000200639 ENSE00000675764 3
X 120456651 120456769 ENST00000200639 ENSE00003615412 2
X 120469106 120469296 ENST00000200639 ENSE00003896841 1
X 68829021 68829904 ENST00000204961 ENSE00001041113 1
X 68838617 68838894 ENST00000204961 ENSE00000672435 2
X 68839664 68839756 ENST00000204961 ENSE00000672436 3
X 68839960 68840088 ENST00000204961 ENSE00000672437 4
X 68840242 68842160 ENST00000204961 ENSE00001041114 5
X 100843836 100844078 ENST00000217885 ENSE00001857295 12
X 100848630 100848754 ENST00000217885 ENSE00000401023 11
X 100849772 100849934 ENST00000217885 ENSE00000401021 10
X 100850151 100850386 ENST00000217885 ENSE00001661840 9
X 100851233 100851325 ENST00000217885 ENSE00001607924 8
X 100862171 100862303 ENST00000217885 ENSE00001702593 7
X 100862392 100862573 ENST00000217885 ENSE00001701549 6
X 100862669 100862820 ENST00000217885 ENSE00001637130 5
X 100863159 100863243 ENST00000217885 ENSE00001653972 4
X 100863485 100863595 ENST00000217885 ENSE00001767233 3
X 100870719 100870814 ENST00000217885 ENSE00001602753 2
X 100874095 100874209 ENST00000217885 ENSE00001187812 1
X 2913614 2914773 ENST00000217890 ENSE00001203379 7
X 2915556 2915692 ENST00000217890 ENSE00003684272 6
X 2917804 2918227 ENST00000217890 ENSE00003509340 5
X 2920601 2920723 ENST00000217890 ENSE00003691518 4
X 2921903 2922024 ENST00000217890 ENSE00003652714 3
X 2925616 2925765 ENST00000217890 ENSE00003691123 2
X 2929232 2929275 ENST00000217890 ENSE00001881224 1
The second files look like this:-
X 70444861 70445558 ENST00000194900 ENSE00001463308 1
X 70448580 70448633 ENST00000194900 ENSE00001463307 2
X 70448913 70448963 ENST00000194900 ENSE00003643927 3
X 70449359 70449483 ENST00000194900 ENSE00003548284 4
X 70449690 70449859 ENST00000194900 ENSE00003690291 5
X 70450169 70450305 ENST00000194900 ENSE00000436378 6
X 70450639 70450783 ENST00000194900 ENSE00000672250 7
X 70451867 70452026 ENST00000194900 ENSE00003562095 8
X 70453637 70453793 ENST00000194900 ENSE00003664498 9
X 70454214 70454316 ENST00000194900 ENSE00003617252 10
X 70479150 70479264 ENST00000194900 ENSE00003589163 11
X 70492107 70492283 ENST00000194900 ENSE00003626140 12
X 70492521 70492596 ENST00000194900 ENSE00003469450 13
X 70495408 70495453 ENST00000194900 ENSE00003692914 14
X 70497180 70497221 ENST00000194900 ENSE00003595614 15
X 70498520 70498570 ENST00000194900 ENSE00003511316 16
X 70499176 70499277 ENST00000194900 ENSE00003615707 17
X 70499877 70500049 ENST00000194900 ENSE00003483338 18
X 70500471 70500580 ENST00000194900 ENSE00003686632 19
X 70500898 70500989 ENST00000194900 ENSE00003626326 20
X 70502163 70505490 ENST00000194900 ENSE00003733449 21
So please can anyone tell me how to make 2 files. Thanks in advance.
with awk (since each exon is preceded by transcript entry)
Transcripts with more than 20 exons:
Transcripts with less than 20 exons:
With datamash, awk and join:
For transcripts with more than 20 exons:
For transcripts with less than 20 exons: