I have a file with many records like this:
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
TTAACTATAATCCCACTGCCTATTTTTTTATTTCTAAAAATATCATAAAAAGACACAAAA
The first line of each record (starting with >) is the identifier and the following lines are its sequence; each identifier has its own sequence. In the example above, ENSG00000100206 is the name and ENST00000216024 is the isoform. In my file, some identifier lines share the same name but differ in everything else.
I would like to keep only the longest sequence for each name and write the result to a new file, so that each name appears only once (with its longest sequence).
For the above example, the result would look like this:
>ENSG00000100206|ENST00000216024|DMC1|2371|38568257;38570043|38568289;38570286
CTCAGACGTCGGGCCGACGCAAGGCCACGCGCGCGAACACACAGGTGCGGCCCCGGGCCA
CACGCACACCGTACAC
>ENSG00000001630|ENST00000003100|CYP51A1|3210|92134365|92134530
TATATCACAGTTTCTTTCTTTTTTTTTTTTTTTTTTTTGAGACAGAGTTTTGCTCTTGTT
GCCCAGGCTGGAGTACAGTGACGCAATCTCGGCTCACTGCAACCTTTGCCTCCCAGGTTC
I can count the number of identifiers by looping over the lines of the file:
# Count the identifier (header) lines in the file
count = 0
with open("file.txt", "r") as file:
    for line in file:
        if line.startswith(">"):
            count += 1
But I don't know how to do the filtering. Do you know how to do that in Python?
You'll need to use a dictionary for this. Do you know what a dictionary is? How familiar are you with (Bio)python?
I can write the script you want, but I think it's more interesting for you if you do some thinking yourself. I don't mind assisting and steering your efforts :-)
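To give you something to start from, here is a minimal sketch of the dictionary idea using only the standard library. It assumes the name is always the first `|`-separated field of the header, and that when two sequences for the same name have equal length the first one in the file should be kept. The function names (`read_fasta`, `longest_per_gene`, `write_fasta`) are just illustrative, not from any library.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def longest_per_gene(records):
    """Map each name (first header field) to its longest (header, sequence)."""
    best = {}
    for header, seq in records:
        gene = header[1:].split("|")[0]  # assumption: name is the first field
        # Strict '>' keeps the first record when lengths are tied
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return best


def write_fasta(best, handle, width=60):
    """Write the kept records, wrapping sequences at `width` characters."""
    for header, seq in best.values():
        handle.write(header + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")
```

You could then call it with something like `with open("file.txt") as src, open("longest.fasta", "w") as dst: write_fasta(longest_per_gene(read_fasta(src)), dst)`, where `longest.fasta` is just a placeholder output name. Since dictionaries keep insertion order, the names come out in the order they first appear in the input.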
It would make more sense to call a file like this a FASTA file... it's not just a txt file.