Bioinformatics iSeq100 Pilot1
Status (28 February 2021)
@Alex Buerkle has processed the data to OTU table generation, including both merged and joined paired-end reads.
These are the informatics for the Bioinformatics iSeq100 Pilot1 library prep.
Demultiplexing and splitting
Demultiplexing
The raw sequence data are in /project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/rawdata.
I uncompressed the forward and reverse read files with unpigz on an interactive node (single-threaded gunzip would also work). I fixed the line endings in Brew_20_DEMUX.csv (converting from MS-DOS line endings) with dos2unix Brew_20_DEMUX.csv
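Where dos2unix is not available, the same line-ending fix can be sketched in Python. This is a minimal illustration, not the tool that was used here; the function names dos_to_unix and convert_file are hypothetical, and the sketch only handles CRLF pairs.

```python
from pathlib import Path


def dos_to_unix(data: bytes) -> bytes:
    """Replace MS-DOS (CRLF) line endings with Unix (LF) line endings."""
    return data.replace(b"\r\n", b"\n")


def convert_file(path: str) -> bool:
    """Rewrite the file in place; return True if any line ending changed."""
    p = Path(path)
    original = p.read_bytes()
    converted = dos_to_unix(original)
    if converted != original:
        p.write_bytes(converted)
        return True
    return False
```

For example, convert_file("Brew_20_DEMUX.csv") would rewrite the demux key only if it still contains CRLF line endings.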
I ran sbatch run_slurm_parse_count.sh after editing run_slurm_parse_count.sh to have the correct filenames and string for the sequencer id.
A large number of reads appear to be going into truemiderrors_Brew20Nov_S1_L001_R1_001.fastq, many of which look like they could be ITS coligos. They end in long terminal runs of default G base calls, and the sequence before these runs looks pretty consistent. It is possible that they had mids, but that the primer sequence was too far off to be recognized.
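One way to quantify this observation is to count how many reads in the truemiderrors file end in a long run of G calls. The sketch below assumes a standard four-line-per-record FASTQ; the threshold MIN_G_RUN and the function names are hypothetical choices for illustration, not part of the existing pipeline.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical threshold: minimum number of trailing G's to count as a G tail.
MIN_G_RUN = 20


def read_fastq(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        if not header.strip():
            continue  # skip stray blank lines between or after records
        seq = next(it, "").strip()
        next(it, None)  # '+' separator line
        next(it, None)  # quality line
        yield header.strip(), seq


def terminal_g_run(seq: str) -> int:
    """Length of the run of G's at the 3' end of the sequence."""
    return len(seq) - len(seq.rstrip("G"))


def summarize(lines: Iterable[str], min_run: int = MIN_G_RUN) -> Tuple[int, int]:
    """Return (total reads, reads whose terminal G run is at least min_run)."""
    total = 0
    tailed = 0
    for _, seq in read_fastq(lines):
        total += 1
        if terminal_g_run(seq) >= min_run:
            tailed += 1
    return total, tailed
```

Because read_fastq accepts any iterable of lines, an open file handle can be passed directly, e.g. summarize(open("truemiderrors_Brew20Nov_S1_L001_R1_001.fastq")).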
Splitting
The info lines for each read in parsed_*_R1.fastq and parsed_*_R2.fastq have the locus, the forward mid, the reverse mid, and the sample name. These can be used with the demux key to separate reads into loci, projects, and samples, in the folder sample_fastq/. The reads are in separate files for each sequenced sample, including replicates. The unique combination of forward and reverse MIDs (for a locus) is part of the filename and allows replicates to be distinguished and subsequently merged.
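The idea of distinguishing replicates by their MID pair can be sketched as follows. The exact layout of the info line is not given here, so this sketch ASSUMES four whitespace-separated fields after the leading "@" (locus, forward mid, reverse mid, sample name); that layout, the ReadInfo type, and the function names are all hypothetical and would need to be adapted to the real parsed_*.fastq headers.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class ReadInfo(NamedTuple):
    locus: str
    fwd_mid: str
    rev_mid: str
    sample: str


def parse_info(line: str) -> ReadInfo:
    """Parse an info line. ASSUMED layout: '@locus fwd_mid rev_mid sample'."""
    fields = line.strip().lstrip("@").split()
    if len(fields) < 4:
        raise ValueError(f"expected at least 4 fields, got {len(fields)}: {line!r}")
    return ReadInfo(*fields[:4])


def group_replicates(
    infos: Iterable[ReadInfo],
) -> Dict[Tuple[str, str], List[Tuple[str, str]]]:
    """Map (locus, sample) to the sorted list of distinct (fwd, rev) MID pairs.

    More than one MID pair for the same (locus, sample) indicates replicates.
    """
    groups = defaultdict(set)
    for info in infos:
        groups[(info.locus, info.sample)].add((info.fwd_mid, info.rev_mid))
    return {key: sorted(pairs) for key, pairs in groups.items()}
```

Keeping the MID pair in the grouping value, rather than discarding it, is what lets replicates stay separate until they are deliberately merged.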
run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq.pl, which splits reads by sample and project, and places them in /project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/rawdata/sample_fastq/.
splitFastq.pl will need tweaking in the future, until sample names and the format of the key for demultiplexing and metadata stabilize. The number of columns in the demux key has differed among some of our completed sequence lanes. Brew_20_DEMUX.csv has 10 columns: forward_barcode,reverse_barcode,locus,samplename,project,wellposition,plate,midplate,substrate,client_name.
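Because the number of columns in the demux key has varied between lanes, it may help to check the header before splitting. The sketch below validates a key against the 10 columns listed for Brew_20_DEMUX.csv; the function name read_demux_key is hypothetical and this check is not part of splitFastq.pl.

```python
import csv
from typing import Dict, Iterable, List

# Column names as listed for Brew_20_DEMUX.csv.
EXPECTED_COLUMNS = [
    "forward_barcode", "reverse_barcode", "locus", "samplename", "project",
    "wellposition", "plate", "midplate", "substrate", "client_name",
]


def read_demux_key(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Read a demux key, raising ValueError if the header is not as expected."""
    reader = csv.DictReader(lines)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected demux key header: {reader.fieldnames}")
    return list(reader)
```

An open file can be passed directly, e.g. read_demux_key(open("Brew_20_DEMUX.csv", newline="")). A key from another lane with a different column count would fail fast here instead of being split incorrectly.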
Calculate summary statistics on reads
In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/rawdata), on an interactive node run:
module load swset/2018.05 gcc/7.3.0 usearch/10.0.240
./aggregate_usearch_fastx_info.pl
See summary_sample_fastq.csv for read counts for each sample.
Trim, merge, and filter reads
Using the current steps in https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/1123778569, we trimmed primers from reads with cutadapt, and merged and filtered them with vsearch. Output is in /project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/tfmergedreads, which is further broken down by locus (16S and ITS) and by project name.
Within each of these directories are files for the trimmed, merged, and filtered reads. Each directory also has the subfolders trimmed/, joined/, and unmerged/. The last one is used as a working directory and should be empty: unmerged reads are filtered and joined, and put in joined/ if they can be joined. The joined/ directory can be empty, for example if all unmerged reads were coligos. For an example, see the contents of /project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/tfmergedreads/16S/brew_20/
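Since unmerged/ is a working directory that should be empty once processing is done, a quick check for leftovers can be sketched as below. It assumes the layout described above (tfmergedreads/LOCUS/PROJECT/unmerged/); the function name nonempty_unmerged is hypothetical and not part of the existing scripts.

```python
from pathlib import Path
from typing import List


def nonempty_unmerged(root: str) -> List[str]:
    """List unmerged/ directories under root/LOCUS/PROJECT/ that contain files."""
    base = Path(root)
    return sorted(
        str(d)
        for d in base.glob("*/*/unmerged")
        if d.is_dir() and any(d.iterdir())
    )
```

An empty list from nonempty_unmerged("/project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/tfmergedreads") would suggest that no leftover working files remain.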
Make OTU table
In /project/microbiome/data/seq/gtl_tests/iSeq100Pilot1_brew_30nov20/otu, I ran run_slurm_mkotu.pl, which I modified to also pick up the joined reads (in addition to the merged reads). On Apr 5, 2021 I reran this with a modification to correctly and uniquely label (assign names to) OTUs in the table that is used as a database for OTU table generation. These jobs should be complete later on Apr 5, 2021.