Demultiplexing SRA data
8.8 years ago
Adrian Pelin ★ 2.6k

Hello,

I am looking at this experiment

Looks like the authors uploaded a multiplexed dataset with no info on barcoding. Any way to guess the barcodes? Just by running fastqc you can sorta guess what some barcodes may be, but is there a better way?

I downloaded the sra file and used --split-3, which gave me 2 files for PE.

Here is what the file sorta looks like:

@SRR2046220.999996 HISEQ:108:C24D5ACXX:4:1101:5712:37823 length=102
AAGGATGCAATTCCTGGTGGTGCCATGGAGGTAAAGTCATAGTATTTTTATGATTTATATTTACATATTTTTACACTTCATAGTCATTTTTATAAAACTTTN
+SRR2046220.999996 HISEQ:108:C24D5ACXX:4:1101:5712:37823 length=102
CCCFFFFFHHHHHJJJIFGGFHIIHIIJAHIEGHJHFGGFIIBGIIJJJJJIJJIJJJJIJJJJJJIJIIJJJIIHHIIJIHHHFHHEFFFFFFEDDECEE#
@SRR2046220.999997 HISEQ:108:C24D5ACXX:4:1101:5536:37823 length=102
CAGGACGAAAATGAAGGTTTGGTTTTAACATTTGATCTGAGTTTATAGTATAGAAAGAGATCTATATTGACTCAGCTTTGCATATAAATCATACATTCTAGN
+SRR2046220.999997 HISEQ:108:C24D5ACXX:4:1101:5536:37823 length=102
######################################################################################################
@SRR2046220.999998 HISEQ:108:C24D5ACXX:4:1101:5653:37824 length=102
TAACTCTCTATTCACGAAAATCTGATCAATTGGATGACGGCTCGAAGAGCTTGATTCTACCAGATAGTACAGTTACATCAGGATGAAGTGCAGAAACGCTTN
+SRR2046220.999998 HISEQ:108:C24D5ACXX:4:1101:5653:37824 length=102
0;8@##################################################################################################
@SRR2046220.999999 HISEQ:108:C24D5ACXX:4:1101:5739:37825 length=102
CTCAATTCAATTCGGAGCTTCGTCCCCTACAGGACCTCACCCTTCGATCAAACTAAATTATTATTCTTTTTCCAATATTACAATATCAACAATATGTACGTN
+SRR2046220.999999 HISEQ:108:C24D5ACXX:4:1101:5739:37825 length=102
1++44=BDF>FFFEEG1CFCGIDGHFH@G>FGEHDGGIIIIID;;FFHEG8@FG@AGHH;CAEFFEHHHEEEDDE>CCEDCA5>>CDEC?@?CCCC>@CB<#
@SRR2046220.1000000 HISEQ:108:C24D5ACXX:4:1101:5569:37827 length=102
TCTCTTACAATTCCAAAAGATATAGATAAGGCAATTTATTGGTATGAAGAATCTGCTAAACAAGGAAATCAAGGTGCACAAAATAGTTTAGAAGGACTTCAN
+SRR2046220.1000000 HISEQ:108:C24D5ACXX:4:1101:5569:37827 length=102
+++22?@A+?CCCCBBBCBCABBBBBCBBBBCCBBBBBBBBBABABBBBBBBBBBBBBBBBBBBBBABBBBABB>=ABAAAA<>>@?@@B>@@@@==;???#
SRA NGS Illumina multiplex • 4.2k views

Based on the SRA entry, this is ddRAD-Seq data. ENA also has only paired-end fastq files. These must be EcoRI-MspI digested fragments.


How to detect if a certain SRA RNA-seq fastq file has been demultiplexed or not?


Check the index sequences in the fastq headers. If there is more than one, the file is likely not demultiplexed.
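
The header check can be sketched roughly as below. This assumes Illumina CASAVA 1.8+ style headers, where the index is the last colon-separated field (e.g. `1:N:0:ACGTAC`), and plain 4-line FASTQ records. The function name `count_header_indexes` is made up for illustration. Headers rewritten by SRA, like the ones in the question above, often no longer carry an index field, in which case this tells you nothing.

```shell
# Count distinct index sequences found in FASTQ header lines.
# Assumption: the index is the last colon-separated field of each header,
# e.g. "@M01:1:FC:1:1101:5712:3782 1:N:0:ACGTAC".
# usage: count_header_indexes reads.fq
count_header_indexes() {
    # Header lines are lines 1, 5, 9, ... of a 4-line-per-record FASTQ.
    awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' "$1" |
        sort | uniq -c | sort -rn
}
```

A single dominant index suggests a demultiplexed file; several indexes with comparable counts suggest it is still multiplexed.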

8.8 years ago

I have a semi-empirical "method": cut the first N bases from the reads in the first 100,000 lines (25,000 reads), then see how many unique prefixes you get. Something like this (untested, just typed it in):

# Count the most common 8-base read prefixes in the first 100,000 lines (25,000 reads)
head -n 100000 data.fq |\
    bioawk -c fastx '{ print substr($seq,1,8) }' | sort |\
    uniq -c | sort -rn

Then keep raising the 8 to 9, 10, etc. As long as you are still reading the indexed adapter, you'll have clear groups with approximately the same number of sequences.
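
To avoid editing the number by hand, the same idea can be wrapped in a loop. This is only a sketch: it swaps bioawk for plain awk (so it assumes 4-line FASTQ records with no wrapped sequences), and the function name `prefix_scan` and its default values are invented for illustration.

```shell
# Tabulate the most frequent read prefixes for a range of prefix lengths.
# Assumption: 4 lines per FASTQ record, sequence on the second line.
# usage: prefix_scan reads.fq [n_reads] [min_len] [max_len] [top]
prefix_scan() {
    fq=$1; n_reads=${2:-100000}; min_len=${3:-6}; max_len=${4:-12}; top=${5:-10}
    k=$min_len
    while [ "$k" -le "$max_len" ]; do
        echo "== prefix length $k =="
        # Take the first n_reads records and keep the first k bases of each read.
        head -n $((n_reads * 4)) "$fq" |
            awk -v k="$k" 'NR % 4 == 2 { print substr($0, 1, k) }' |
            sort | uniq -c | sort -rn | head -n "$top"
        k=$((k + 1))
    done
}
```

For example, `prefix_scan data.fq 250000 6 12 10` prints the ten most common prefixes for lengths 6 through 12, which makes it easier to see at which length the groups start to break apart.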


Very very cool idea, love it! I tried on 1 million

So, at 8 got:

1490 TCAATATC
1376 TAAAAAAA
1351 GGCATATC
1144 TGATCGCC
1063 GGACTCAC
1055 GGATATAC
1026 GGCAAGGC
1006 GGCCATCC

A low-complexity prefix like TAAAAAAA is probably a false one.

At 10:

1465 TCAATATCAA
1347 GGCATATCAA
1130 TGATCGCCAA
1039 GGATATACAA
 996 GGCAAGGCAA
 984 GGCCATCCAA
 965 GGACTCACAA
 907 GAACTTGCAA

Interesting that all end in AA.

At 15:

457 TAANNNNNNNNNNNN
187 TCAATATCAATTCAA
182 TCAATATCAATTCTT
166 GGCATATCAATTCCT
157 TCAATATCAATTCAT
149 TGATCGCCAATTCAA
143 GGCATATCAATTCAA
142 GGCAAGGCAATTCAA
134 GGCCATCCAATTCAA
133 GGATATACAATTCAA
128 GGATATACAATTCTT

Is everything before AA the index sequence?
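
One way to probe this is to ask at which offset a shared motif starts in most reads. The sketch below assumes that the AATTC visible after the first 8 bases in the listing above is the EcoRI cut-site overhang, which would fit the ddRAD-Seq suggestion but is not confirmed anywhere in this thread. The function name `motif_offsets` and its defaults are made up, and it expects plain 4-line FASTQ records.

```shell
# For each offset k (0-based), count reads where the motif starts right after
# the first k bases. A strong peak at one k suggests a k-base barcode.
# Assumption: motif AATTC (EcoRI overhang) marks the end of the barcode.
# usage: motif_offsets reads.fq [motif] [n_reads] [max_offset]
motif_offsets() {
    fq=$1; motif=${2:-AATTC}; n_reads=${3:-100000}; max_off=${4:-15}
    head -n $((n_reads * 4)) "$fq" |
        awk -v m="$motif" -v max="$max_off" 'NR % 4 == 2 {
            for (k = 0; k <= max; k++)
                if (substr($0, k + 1, length(m)) == m) hits[k]++
            total++
        }
        END {
            # Columns: offset, number of reads, fraction of reads
            for (k = 0; k <= max; k++)
                printf "%d\t%d\t%.3f\n", k, hits[k], (total ? hits[k] / total : 0)
        }'
}
```

If most reads hit at a single offset, the bases before it are a reasonable barcode candidate; a spread across several offsets could mean variable-length barcodes or that the motif is not a barcode boundary at all.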


Normally there should be a big drop in the groupings. But make sure to use it on the first file; only that one will have the index, the other file of the pair won't.


Yeah, I am using it on _1 from SRA. Will try on _2. A big drop in grouping would make sense, but I don't think I see it. Could it be that the file was demultiplexed, the barcodes removed, and then all groups merged back together? That hardly makes sense, but it may be possible.


Yeah, your data does not seem to indicate the presence of common sequences at the start.
