...

run_splitFastq_fwd.sh and run_splitFastq_rev.sh run splitFastq_manyInputfiles.pl, which steps through the many pairs of input files, splits reads by sample and project, and places them in /project/gtl/data/raw/ALF1/16S/rawdata/sample_fastq/
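For reference, a minimal sketch of how these wrappers might be submitted on the cluster; the sbatch options and log file names here are assumptions, not taken from the scripts themselves.

# hypothetical submission of the splitting wrappers (options are assumptions)
cd /project/gtl/data/raw/ALF1/16S/rawdata
sbatch --job-name=split_fwd --time=12:00:00 --output=split_fwd_%j.log run_splitFastq_fwd.sh
sbatch --job-name=split_rev --time=12:00:00 --output=split_rev_%j.log run_splitFastq_rev.sh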

 

Stopped here on 8-31-22

 

 

Calculate summary statistics on reads

In a sequence library’s rawdata/ directory (e.g., /project/gtl/data/raw/ALF1/ITS/rawdata/) I made run_aggregate.sh to run aggregate_usearch_fastx_info.pl as a slurm job. Summaries are written to summary_sample_fastq.csv.
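For orientation, a minimal sketch of the kind of per-sample summary this step performs, assuming the Perl wrapper drives usearch's fastx_info command; the file layout and the fields parsed into summary_sample_fastq.csv come from aggregate_usearch_fastx_info.pl and may differ.

# assumed shape of the per-sample summaries (the real parsing lives in aggregate_usearch_fastx_info.pl)
cd /project/gtl/data/raw/ALF1/ITS/rawdata/sample_fastq
for fq in */*.fastq; do
    usearch -fastx_info "$fq" -output "${fq%.fastq}_info.txt"
done
# the Perl wrapper then collects these per-sample reports into summary_sample_fastq.csv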

Trim, merge and filter reads

...

Remaining Steps

After mkdir /project/gtl/data/raw/ALF1/16S/rawdata/tfmergedreads, we used run_slurm_mergereads.pl to crawl the project folders and sample files (created in the splitting step above), merge read pairs, and filter based on base quality. This script conforms to the steps in https://microcollaborative.atlassian.net/wiki/spaces/MICLAB/pages/1123778569/Bioinformatics+v3.0?focusedCommentId=1280377080#comment-1280377080, including trimming primers and joining unmerged reads. It writes a new set of fasta files (rather than fastq) for each sample and project, to be used in subsequent steps. These files are found in the 16S/ and ITS/ folders in tfmergedreads/. For example, see the contents of sample_fastq:

cd /project/gtl/data/raw/ALF1/16S/rawdata/sample_fastq

cp -R 16S /project/gtl/data/raw/ALF1/rawdata/sample_fastq

cd /project/gtl/data/raw/ALF1/ITS/rawdata/sample_fastq

cp -R ITS /project/gtl/data/raw/ALF1/rawdata/

Within each of these directories, the trimmed, merged, and filtered reads are in the subfolders trimmed/, joined/, and unmerged/. The unmerged/ folder is used as a working directory and should end up empty: unmerged reads are filtered and, if they can be joined, placed in joined/. The joined/ directory can also be empty, for example if all unmerged reads were coligos.

Statistics on the initial number of reads, the number of reads that merged, and the number of reads that remain after filtering are in filtermergestats.csv in each project folder. Note that this does not include the number of reads that failed to merge but that we were able to join. This category is likely to include ITS sequences whose amplicons were long enough that our 2x250bp reads could not span the whole length; the greater number removed for ITS (orange) in the plot below is consistent with this idea. For the full lane, these summaries were concatenated in tfmergedreads/ with

cat */*/filtermergestats.csv > all_filtermergestats.csv

I used commands.R in that folder to make a plot of the number of reads per sample (horizontal axis) against the number of reads that were removed because they did not merge or did not meet quality criteria and were filtered out (vertical axis). Purple is 16S and orange is ITS. It might be interesting to do that plot for each of the projects in the library (TODO), and possibly to add marginal histograms (or put quantiles on the plots).

...

Make OTU table

In /project/gtl/data/raw/ALF1/ITS/rawdata/otu, I ran run_slurm_mkotu.pl, which I modified to also pick up the joined reads (in addition to the merged reads).
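For orientation, a minimal sketch of a typical vsearch OTU-table workflow over the combined merged and joined reads for one project; the file names and identity threshold below are assumptions, and the actual commands are in run_slurm_mkotu.pl.

# hypothetical per-project OTU workflow (the real steps are in run_slurm_mkotu.pl)
cat merged.fa joined.fa > allreads.fa
vsearch --derep_fulllength allreads.fa --sizeout --output derep.fa
vsearch --cluster_size derep.fa --id 0.97 --sizein --centroids otus.fa
vsearch --usearch_global allreads.fa --db otus.fa --id 0.97 --otutabout otutable.txt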

Make coligo table

In /project/gtl/data/raw/ALF1/ITS/coligoISD and /project/gtl/data/raw/ALF1/ITS/otu there are 16S and ITS directories for all projects. These contain a file named coligoISDtable.txt with counts of the coligos and the ISD found in the trimmed forward reads, per sample. The script run_slurm_mkcoligoISDtable.pl passes over all of the projects and uses vsearch to make the table.
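As a rough illustration, counting coligos and the ISD in the trimmed forward reads could be a single vsearch global search against a reference of coligo and ISD sequences; the reference file name, read file name, and identity threshold below are placeholders, and the actual commands are in run_slurm_mkcoligoISDtable.pl.

# hypothetical coligo/ISD counting for one project (placeholder file names)
vsearch --usearch_global trimmed_fwd_reads.fa --db coligo_ISD_reference.fa --id 0.97 --otutabout coligoISDtable.txt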

...

sample_fastq/

Go to 16S Bioinformatics for ALF1 to see the remaining processing steps from section “Trim, merge and filter reads” 

 

Info

Gregg Randolph : please see /project/gtl/data/raw/ALF1/16S/tfmergedreads where I made mergereads.nf and teton.conf and edited trim_merge.pl into trim_mergecab.pl (initially this was because I didn't have permissions to run the file, so I copied it, but I then found I needed to make some changes, which are in the *cab.pl version).

You run the nextflow script with module load nextflow and then nextflow run -bg mergereads.nf -c teton.config. See inside mergereads.nf for other ways of running it (i.e., not in the background).
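Written out as separate commands (the -resume flag is a standard Nextflow option for picking up a previous run, not something specific to mergereads.nf):

module load nextflow
nextflow run -bg mergereads.nf -c teton.config
# optionally, to resume a previous run: nextflow run -bg mergereads.nf -c teton.config -resume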

I tried this on one pair of input files, and one of the vsearch steps in the middle failed because the inputs were too small. I then ran it on all input. The nextflow script completes, but one of the vsearch steps appears to produce no output. It might be that some of the input files are genuinely too small. I can see that the trimming step is working (see output/trimmed/), but we're not getting the other files (in unmerged/, joined/, or in output/ itself).

Logs and other files for debugging from Nextflow are made automatically in work/. I remove this folder between jobs so that I can see which output came from which run. For example, see work/ff/d9576360a5295e6c92a0183e485944/.command.log and neighboring files with ls -al work/ff/d9576360a5295e6c92a0183e485944/. In this case I think vsearch is silently failing to write anything. Importantly, you can see the command that was being run, which helps with debugging: see work/ff/d9576360a5295e6c92a0183e485944/.command.sh. I think a next step would be to try the commands with some sample data (an R1 and R2 fastq file) that we expect should work.
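As one way to do that check, the merge and filter steps could be run by hand on a known-good pair of files; the file names and the maxee threshold below are placeholders, and the exact options Nextflow actually runs are recorded in the failing task's .command.sh.

# hand-run the suspect steps on a known-good sample (placeholder file names and threshold)
vsearch --fastq_mergepairs test_R1.fastq --reverse test_R2.fastq --fastqout test_merged.fastq --fastqout_notmerged_fwd test_notmerged_R1.fastq --fastqout_notmerged_rev test_notmerged_R2.fastq
vsearch --fastq_filter test_merged.fastq --fastq_maxee 1.0 --fastaout test_filtered.fa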

Right now each job requests 1 hour from SLURM. I suspect the time actually needed will be much smaller, and we can tailor the request down.