8.5 years ago
Daniel Gerlach
I am looking at RNA-seq data from TCGA and noticed that some nucleotides are replaced by '.' in the FASTQ files. So far, I have only encountered this encoding in repeat-masked FASTA files. Could it cause mappers or other downstream software to fail on such files, and should I replace the '.' with 'N'?
Please see the second read in this output:
zcat TCGA/RNAseq/OV/data/246cf927-a407-4c5a-9f60-0822bf34b208/D0W8YACXX_7_TGACCA_R1.fastq.gz | head -8
@HS2_288:7:1101:1487:2168/1 1:N:0:TGACCA
GGTCAGTCTGCTTTCCCCCTGTTTTATAATGTTGGTGGTTTTAATCCGTATTTCTTTGCAACTTCTGTCTGGGCA
+
BC?DFFFFHHHHHJJJJJJJJJJJJEIEIHIFHGIGHIFHIJIIJIGHHGHJJHIJJJJIJJJJJJJJJJIGHHH
@HS2_288:7:1101:1723:2143/1 1:N:0:TGACCA
.CTGGTTATGTGCTCCCTTCCACAGGGCTACATGACGGCACTTTATTTTAAATCCTTTAAACAAAATACATATGG
+
#1=DFDFFHGFHHJJJJJJJJJJJJJJJJEHHJID9DHGIJJJJIJJJJIJJJJJJJIIEIIJJGHHHHHGFFFF
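In case replacing turns out to be the safer route, here is a minimal sketch of how it might be done. It assumes strict four-line FASTQ records (no wrapped sequences), and the names `mask_dots` and `mask_fastq` are only illustrative. The substitution is restricted to the sequence line because '.' is also a legal quality character (ASCII 46, i.e. Q13 in Phred+33), as in the quality strings above, so a blanket search-and-replace over the whole file would corrupt the qualities.

```python
import gzip


def mask_dots(lines):
    """Yield FASTQ lines with '.' replaced by 'N' in sequence lines only.

    Assumes each record is exactly four lines: header, sequence, '+', quality.
    """
    for i, line in enumerate(lines):
        # The sequence is the second line of every four-line record.
        if i % 4 == 1:
            line = line.replace(".", "N")
        yield line


def mask_fastq(in_path, out_path):
    """Read a gzipped FASTQ file and write a masked, gzipped copy."""
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        dst.writelines(mask_dots(src))
```

Calling `mask_fastq("in.fastq.gz", "out.fastq.gz")` (placeholder file names) would write a copy in which only the sequence lines are changed; headers, '+' separators, and quality strings pass through untouched.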
Best, Daniel