Content Comparison

...

The work is done by run_parse_count_onSplitInput.pl. As the name implies, we split the raw data into many files (240492) so that many nodes can parse them in parallel. The approximate string matching that we are doing requires ~140 hours of CPU time; by splitting the task across many jobs, the parsing finishes in less than one hour of elapsed time.
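As an illustration of the splitting step only, here is a minimal Python sketch that cuts a FASTQ file into fixed-size chunks that separate jobs could then parse. It is not the code in run_parse_count_onSplitInput.pl. The function names, the chunk file naming, and the records_per_chunk parameter are assumptions made for this example.

```python
from itertools import islice
from pathlib import Path
from typing import Iterator, List


def read_fastq_records(lines: Iterator[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (info, sequence, '+', quality)."""
    while True:
        record = list(islice(lines, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def split_fastq(in_path: str, out_dir: str, records_per_chunk: int) -> List[Path]:
    """Write consecutive groups of records to numbered chunk files.

    The chunk naming scheme (chunk_000000.fastq, ...) is hypothetical.
    Returns the chunk paths in the order they were written.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    chunk_paths: List[Path] = []
    with open(in_path) as handle:
        records = read_fastq_records(handle)
        while True:
            chunk = list(islice(records, records_per_chunk))
            if not chunk:
                break
            path = out / f"chunk_{len(chunk_paths):06d}.fastq"
            with open(path, "w") as chunk_file:
                for record in chunk:
                    chunk_file.writelines(record)
            chunk_paths.append(path)
    return chunk_paths
```

Each chunk holds whole records, so a job can parse its chunk without looking at any other file. For scale, finishing ~140 CPU hours of work in under one hour means that, on average, at least about 140 jobs must be running at the same time.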

...

NS5_Demux.csv is used to map MIDs to sample names and projects.
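The lookup that a demux key provides could be sketched in Python as follows. The column layout of NS5_Demux.csv is not shown on this page, so the column names used here (locus, forward_mid, reverse_mid, sample_name, project) are hypothetical and would need to be replaced with the real header names.

```python
import csv
from typing import Dict, TextIO, Tuple

# (locus, forward MID, reverse MID) -> (sample name, project)
DemuxKey = Dict[Tuple[str, str, str], Tuple[str, str]]


def load_demux_key(handle: TextIO) -> DemuxKey:
    """Read a demux CSV into a dictionary keyed by MID combination.

    ASSUMPTION: the CSV has a header row with the hypothetical column
    names used below; the real NS5_Demux.csv may differ.
    """
    key: DemuxKey = {}
    for row in csv.DictReader(handle):
        combo = (row["locus"], row["forward_mid"], row["reverse_mid"])
        if combo in key:
            # A repeated combination would make the mapping ambiguous.
            raise ValueError(f"duplicate MID combination: {combo}")
        key[combo] = (row["sample_name"], row["project"])
    return key
```

Keying on the locus together with both MIDs means a lookup returns exactly one sample and project, and a duplicated combination in the key is reported instead of silently overwritten.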

Below This Point is yet to be done

Splitting to fastq for individuals

The info lines for each read in parsed_NS*_R1.fastq and parsed_NS*_R2.fastq have the locus, the forward MID, the reverse MID, and the sample name. These can be used with the demux key to separate reads by locus, project, and sample, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
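To make the routing idea concrete, here is a small Python sketch that derives an output filename from a read's info line and groups records accordingly. It is not splitFastq_manyInputfiles.pl. The exact layout of the info line and the filename pattern are not given on this page, so both are assumptions: the info line is taken to be whitespace-separated as "@read_id locus forward_mid reverse_mid sample_name", and the filename simply joins those fields.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def output_name(info_line: str) -> str:
    """Build a per-sample filename from a read's info line.

    ASSUMED (hypothetical) layout:
        '@read_id locus forward_mid reverse_mid sample_name'
    The filename pattern is also hypothetical.
    """
    fields = info_line.rstrip("\n").split()
    if len(fields) < 5:
        raise ValueError(f"unexpected info line: {info_line!r}")
    _read_id, locus, fwd_mid, rev_mid, sample = fields[:5]
    return f"{locus}.{fwd_mid}.{rev_mid}.{sample}.fastq"


def group_by_sample(records: Iterable[List[str]]) -> Dict[str, List[List[str]]]:
    """Group four-line FASTQ records by the filename their info line maps to."""
    grouped: Dict[str, List[List[str]]] = defaultdict(list)
    for record in records:
        grouped[output_name(record[0])].append(record)
    return dict(grouped)
```

Because both MIDs are part of the name, two replicates of the same sample land in different files, which matches the description above of how replicates are kept apart until they are merged.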

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of files to split reads by sample and project, and place them in /project/microbiome/data_queue/seq/NS5/rawdata/sample_fastq/.

...

Version	Old Version 5	New Version 6
Changes made by	Gregg Randolph	Gregg Randolph
Saved on	Apr 29, 2022	Apr 29, 2022

Versions Compared
