22 months ago
SaltedPork
Hi, my input looks like:
>ref
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>sample1
GGTGCCCACACTAATGATGTAAAACAATTAACAGAGGCAGTGCAAA
>ref
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGCATTTTGGAATTCCCTACAAT
>sample2
GGTTAGGGCCGCCTGTTGGTGGGCGGGAATCAAGCAGGTATTTGGAATTCCCTACAAT
The entries in the fasta file are paired, so that each ref is paired with the sample# below it.
I want to identify the pairs where the nt sequences for sample# and ref are identical, and remove them from the fasta (or put them into another fasta file of their own). The output would hopefully be a fasta file containing only the pairs where the nt sequences for ref and sample# differ.
So far I have tried the seqkit rmdup command; however, it doesn't treat the entries as paired. How can I accomplish this, ideally with a bash command or another program?
I don't know of an existing tool that would do it, but if your fasta files aren't that large, it would be quite easy to do in R. You could create 2 objects, 1 with ref and the other with sample, then find matching sequences with something like ref$sequence %in% sample$sequence to emit rows with matching entries. Again, this only really works if the fastas are not large.
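For anyone who would rather not use R, here is a minimal Python sketch of the same idea. It is not an existing tool, only an illustration that assumes the records strictly alternate ref, sample, ref, sample as in the question. The function names (read_fasta, split_pairs, write_fasta) are made up for this example, and the case-insensitive comparison is an assumption. Note that %in% checks each ref against every sample sequence, whereas this sketch compares each ref only with the sample directly below it.

```python
def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequences may span several lines
    if header is not None:
        yield header, "".join(chunks)


def split_pairs(records):
    """Split consecutive (ref, sample) records into identical and differing pairs.

    Assumes the records alternate ref, sample, ref, sample, ...
    """
    records = list(records)
    if len(records) % 2:
        raise ValueError("expected an even number of records (ref/sample pairs)")
    identical, different = [], []
    for i in range(0, len(records), 2):
        ref, sample = records[i], records[i + 1]
        # Assumption: upper- and lower-case bases count as the same sequence.
        same = ref[1].upper() == sample[1].upper()
        (identical if same else different).extend([ref, sample])
    return identical, different


def write_fasta(records, path):
    """Write (header, sequence) tuples to a FASTA file, one line per sequence."""
    with open(path, "w") as handle:
        for header, seq in records:
            handle.write(f">{header}\n{seq}\n")
```

To use it, open the input file, pass the handle to read_fasta, hand the result to split_pairs, and write the two returned lists to separate files with write_fasta (for example different.fasta and identical.fasta; both names are placeholders). It holds all records in memory, so the same caveat about very large fastas applies.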