Status (02 May 2022)

  • Data arrived by sftp on 28 April 2022, by Globus on 01 10 2023. Everything below is modified from Bioinformatics for Novaseq run 4. Data processing finished 5-12-22.

Demultiplexing and splitting

...

Code Block
mkdir -p /gscratch/grandol1/5ALANS6/rawdata
cd /gscratch/grandol1/5ALANS6/rawdata
unpigz --to-stdout /project/microbiome/data_queue/seq/5ALANS6/rawdata/5ALNovaSeq6_Redo_S1_R1_001.fastqpool_1.fq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - 5ALANS6_R1_ ;
unpigz --to-stdout /project/microbiome/data_queue/seq/5ALANS6/rawdata/5ALNovaSeq6_Redo_S1_R2_001.fastqpool_2.fq.gz | split -l 16000000 -d --suffix-length=3 --additional-suffix=.fastq - 5ALANS6_R2_

making 257 R1 files and 257 R2 files, with structured names (e.g., for the R1 set):

/gscratch/grandol1/5ALA/rawdata/5ALANS6_R1_000.fastq
/gscratch/grandol1/5ALA/rawdata/5ALANS6_R1_001.fastq
etc.

Stalled here on 1-18-2023
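The split size matters because a FASTQ record spans four lines: 16000000 is a multiple of 4, so no read is cut across two chunks. A minimal sketch of the same split on a toy file, assuming GNU split and single-line sequences; the toy file names are made up, the chunk size is reduced to 16 lines, and gzip -dc stands in for unpigz --to-stdout:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Build a toy gzipped FASTQ with 10 records (4 lines each, 40 lines total).
for i in 1 2 3 4 5 6 7 8 9 10; do
    printf '@read%d\nACGT\n+\nIIII\n' "$i"
done | gzip > toy_R1.fastq.gz

# Same split options as in the real command, but 16 lines (4 records) per chunk.
gzip -dc toy_R1.fastq.gz |
    split -l 16 -d --suffix-length=3 --additional-suffix=.fastq - toy_R1_

n_chunks=$(ls toy_R1_*.fastq | wc -l | tr -d ' ')
echo "chunks: $n_chunks"

# Every chunk should hold a whole number of 4-line records.
bad_chunks=0
for f in toy_R1_*.fastq; do
    n_lines=$(wc -l < "$f" | tr -d ' ')
    if [ $((n_lines % 4)) -ne 0 ]; then
        bad_chunks=$((bad_chunks + 1))
    fi
done
echo "chunks with partial records: $bad_chunks"
```

The same loop can be pointed at the real chunk files to confirm that the R1 and R2 sets have matching chunk counts before parsing.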

run_parse_count_onSplitInput.pl also writes to /gscratch.

NS5NS6_Demux.csv is used to map MIDS to sample names and projects.

...

splitFastq.pl and splitFastq_manyInputfiles.pl will need tweaking in the future, whenever the sample names or the format of the key for demultiplexing and metadata change. The number of columns has differed among some of the early sequence lanes, which necessitated changes to this parsing script.
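Since the column count of the key has varied between lanes, a quick check before running the parsing scripts can catch a changed format. A minimal sketch, assuming a comma-separated key with no quoted commas; toy_Demux.csv and its column names are invented for illustration and do not reflect the real key:

```shell
set -eu
workdir=$(mktemp -d)
key="$workdir/toy_Demux.csv"

# Hypothetical key: the third line is missing a field.
cat > "$key" <<'EOF'
mid_f,mid_r,sample,project
AAGG,CCTT,s1,projA
AAGG,GGAA,s2
TTCC,CCTT,s3,projB
EOF

# Tally how many rows have each field count (field count, number of rows).
awk -F',' '{ n[NF]++ } END { for (k in n) print k, n[k] }' "$key" | sort -n

# More than one distinct field count means the rows do not share one layout.
n_layouts=$(awk -F',' '{ print NF }' "$key" | sort -u | wc -l | tr -d ' ')

# Line numbers whose field count differs from the header line.
odd_rows=$(awk -F',' '
    NR == 1 { want = NF }
    NF != want { printf "%s%s", sep, NR; sep = " " }
' "$key")

echo "distinct field counts: $n_layouts; lines differing from header: $odd_rows"
```

If the key ever contains quoted fields with embedded commas, a plain awk split like this would miscount and a real CSV parser would be needed.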

Stopped at above step on 2/01/23 6:05pm

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/microbiome/data_queue/seq/5ALANS6/rawdata) I made run_aggregate.sh, to run aggregate_usearch_fastx_info.pl with a slurm job. Summaries are written to summary_sample_fastq.csv.
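The summary step amounts to one row of statistics per FASTQ file. The sketch below is only a stand-in to show the shape of that output: it uses awk rather than aggregate_usearch_fastx_info.pl or usearch, assumes four-line FASTQ records, and the toy file names and CSV columns are assumptions, not the real layout of summary_sample_fastq.csv:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Two toy FASTQ files (hypothetical names) with 2 reads and 1 read.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n' > sampleA.fastq
printf '@r1\nACG\n+\nIII\n' > sampleB.fastq

# One CSV row per file: file name, read count, mean read length.
# Line 2 of every 4-line record holds the sequence.
summary=toy_summary.csv
echo "file,reads,mean_length" > "$summary"
for f in sample*.fastq; do
    awk -v name="$f" '
        NR % 4 == 2 { reads++; bases += length($0) }
        END { printf "%s,%d,%.1f\n", name, reads, (reads ? bases / reads : 0) }
    ' "$f" >> "$summary"
done
cat "$summary"
```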

...

In /project/microbiome/data_queue/seq/5ALANS6/tfmergedreads, we used run_slurm_mergereads.pl to crawl the project folders and sample files (created in the splitting step above), merge read pairs, and filter based on base quality. This script conforms to the steps in https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/1123778569/Bioinformatics+v3.0?focusedCommentId=1280377080#comment-1280377080, including trimming primers and joining unmerged reads. It writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. For example, see the contents of /project/microbiome/data/seq/NS5NS6/tfmergedreads/16S/

Within each of these directories are files for the trimmed, merged, and filtered reads, in subfolders trimmed/, joined/, and unmerged/. The unmerged/ folder is used as a working directory and should be empty once the job finishes: unmerged reads are filtered, joined if they can be joined, and put in joined/. The joined/ directory can itself be empty, for example if all unmerged reads were coligos.
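A quick sanity check of this layout is to count the sequences in trimmed/ and joined/ and to confirm that nothing is left behind in unmerged/. A minimal sketch on a toy directory tree; the proj/ folder, the file names, and the sequences are invented for illustration:

```shell
set -eu
workdir=$(mktemp -d)
cd "$workdir"

# Toy layout mirroring the three subfolders described above.
mkdir -p proj/trimmed proj/joined proj/unmerged
printf '>s1_1\nACGT\n>s1_2\nGGCC\n' > proj/trimmed/s1.fasta
printf '>s1_3\nACGTTTGA\n' > proj/joined/s1.fasta

# Count FASTA records (header lines) across all files in a folder.
# An empty folder yields 0 rather than an error.
count_seqs() {
    cat "$1"/* 2>/dev/null | grep -c '^>' || true
}

n_trimmed=$(count_seqs proj/trimmed)
n_joined=$(count_seqs proj/joined)

# unmerged/ is a working directory, so any file left there deserves a look.
n_leftover=$(find proj/unmerged -type f | wc -l | tr -d ' ')

echo "trimmed=$n_trimmed joined=$n_joined leftover_unmerged_files=$n_leftover"
```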

...

In /project/microbiome/data_queue/seq/NS5NS6/otu, I ran run_slurm_mkotu.pl, which I modified to also pick up the joined reads (in addition to the merged reads).

...

In /project/microbiome/data_queue/seq/5ALANS6/coligoISD, coligoIS or /project/microbiome/data/seq/5ALA/coligoISD, and /project/microbiome/data/seq/5ALANS6/coligoISD, there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The file run_slurm_mkcoligoISDtable.pl has the code that passes over all of the projects and uses vsearch for making the table.

...