I have a query.fasta file that looks like:
>seq0 hello_world
GAACCTAAGTACGCG
...
>seq83 hello_world
CACGCGGCTAGTACG
...
>seq1170 hello_world
CGTACTAGCCGCGTG
...
>seq4420 hello_world
CGCGTACTTAGGTTC
...
Every sequence in this file is unique. However, when I use bowtie2 to map these reads to a RefSeq genome:
bowtie2 -x GCF_ref -p 12 --end-to-end -f -U query.fasta -S result.sam
I get:
seq0 16 chromosA 456940 3 15M * 0 0 CGCGTACTTAGGTTC IIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:7G7 YT:Z:UU
seq83 16 chromosB 869078 42 15M * 0 0 CGTACTAGCCGCGTG IIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:15 YT:Z:UU
seq1170 0 chromosB 869078 42 15M * 0 0 CGTACTAGCCGCGTG IIIIIIIIIIIIIII AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:15 YT:Z:UU
seq4420 0 chromosA 456940 3 15M * 0 0 CGCGTACTTAGGTTC IIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:7G7 YT:Z:UU
How is it even possible that seq1170's sequence (CGTACTAGCCGCGTG) also appears in seq83's row? The same happens for the seq4420 / seq0 pair, and for every other mapped sequence: each one has a partner in the SAM file with an identical sequence.
Any ideas?
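One check that might narrow this down: the two rows that look wrong both have FLAG 16, and as far as I understand the SAM format, bit 0x10 marks a reverse-strand alignment whose SEQ is stored reverse-complemented. The sketch below tests that assumption against the four records shown above. The dictionaries are copied by hand from this post, and `reverse_complement` and `original_read` are just helper names made up for the check, not part of bowtie2 or any library.

```python
# Query sequences copied from the FASTA records shown above.
queries = {
    "seq0": "GAACCTAAGTACGCG",
    "seq83": "CACGCGGCTAGTACG",
    "seq1170": "CGTACTAGCCGCGTG",
    "seq4420": "CGCGTACTTAGGTTC",
}

# (FLAG, SEQ) pairs copied from the SAM rows shown above.
sam_records = {
    "seq0": (16, "CGCGTACTTAGGTTC"),
    "seq83": (16, "CGTACTAGCCGCGTG"),
    "seq1170": (0, "CGTACTAGCCGCGTG"),
    "seq4420": (0, "CGCGTACTTAGGTTC"),
}

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def original_read(flag, sam_seq):
    """Recover the read as given in the FASTA.

    Assumption: FLAG bit 0x10 means SEQ was reverse-complemented.
    """
    return reverse_complement(sam_seq) if flag & 0x10 else sam_seq


# Compare each SAM record, strand-corrected, with its FASTA sequence.
results = {
    name: original_read(flag, sam_seq) == queries[name]
    for name, (flag, sam_seq) in sam_records.items()
}

for name, ok in results.items():
    print(name, sam_records[name][0], "matches FASTA" if ok else "differs from FASTA")
```

If every record matches once FLAG 16 rows are reverse-complemented, the "duplicates" would simply be reads that are reverse complements of each other in query.fasta, each reported on the strand it aligned to.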