1. Install Nextflow manually or using conda:
conda install -c bioconda nextflow
2. Clone the repository:
git clone https://github.com/RasmussenLab/DoBSeqWF.git .
3. Do a quick stub test (dry run with stubbed processes):
nextflow run main.nf -profile (standard/esrum/ngc),test -stub
4. Run the pipeline on the bundled test data:
nextflow run main.nf -profile (standard/esrum),test
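For example, with the standard profile (substitute esrum or ngc to match your compute environment), a first run could look like this:
nextflow -version                                   # confirm Nextflow is installed
nextflow run main.nf -profile standard,test -stub   # quick dry run with stubbed processes
nextflow run main.nf -profile standard,test         # full run on the bundled test data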
The pipeline has multiple optional configurations found in nextflow.config. Configurations can be supplied as a config.json and run with nextflow run main.nf -profile (standard/esrum) -params-file config.json, or directly from the command line:
nextflow run main.nf \
-profile (standard/esrum/ngc) \
--pooltable <path to pool fastq file table> \
--decodetable <path to pool decode tsv> \
--reference_genome <path to indexed reference genome> \
--bedfile <path to bedfile with target regions> \
--ploidy <integer>
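As a sketch, a params-file with the parameters above could be created like this (all paths are placeholders; see nextflow.config for the full list of parameters and their defaults):
cat > config.json <<'EOF'
{
    "pooltable": "path/to/pooltable.tsv",
    "decodetable": "path/to/decodetable.tsv",
    "reference_genome": "path/to/indexed/reference.fa",
    "bedfile": "path/to/target_regions.bed",
    "ploidy": 2
}
EOF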
The pooltable.tsv should connect (user-assigned) pool IDs to input FASTQ files; one entry for each pool:
pool_row_1 path/to/sample1_R1.fq.gz path/to/sample1_R2.fq.gz
pool_column_1 path/to/sample2_R1.fq.gz path/to/sample2_R2.fq.gz
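A helper script for building the pool table is used further below in the NGC batch workflow; as a rough sketch, assuming paired files named <pool_id>_R1.fq.gz / <pool_id>_R2.fq.gz in a single directory, it can also be put together with standard shell tools:
for r1 in /path/to/fastq/*_R1.fq.gz; do
    pool_id=$(basename "$r1" _R1.fq.gz)   # pool ID taken from the file name
    r2=${r1%_R1.fq.gz}_R2.fq.gz           # matching mate file
    printf '%s\t%s\t%s\n' "$pool_id" "$r1" "$r2"
done > pooltable.tsv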
The decodetable.tsv should map (user-assigned) individual IDs in the matrix to the corresponding row and column pool IDs; one entry for each element in the matrix:
individual1 pool_row_1 pool_column_1
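Every pool ID in the decode table must also appear in the pool table; a minimal consistency check (a sketch, assuming the file names above) could look like this:
# print decode table lines whose row or column pool ID is missing from the pool table
awk 'NR==FNR {pools[$1]; next} !($2 in pools) || !($3 in pools)' pooltable.tsv decodetable.tsv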
The workflow will output a results folder containing multiple config-dependent output files:
results
├── pinpointables.vcf # Merged VCF file containing all assigned variants
├── cram/ # CRAM files for each pool
├── logs/ # Log files for each process
├── variants/ # VCF files for each pool
├── variant_tables/ # TSV files converted from pool VCFs
└── pinpoint_variants/
├── all_pins/ # All pinpointables for each sample in individual vcfs (*note)
├── unique_pins/ # All unique pinpointables for each sample in individual vcfs (*note)
├── *_merged.vcf.gz # All pinpointables for all samples in a single vcf without sample information
├── summary.tsv # Variant counts for each sample
└── lookup.tsv # Variant to sample lookup table
A central output file is pinpointables.vcf. This file contains all individually assigned variants. Since each variant carries information from two pools, these are presented as two sample columns: ROW and COLUMN.
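For example, assuming bcftools is available, the genotype reported by each of the two pools can be listed per variant like this:
# print position, alleles and the ROW/COLUMN genotypes for each assigned variant
bcftools query -f '%CHROM\t%POS\t%REF\t%ALT[\t%SAMPLE=%GT]\n' results/pinpointables.vcf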
DoBSeqWF
├── LICENSE
├── VERSION
├── README.md
├── assets
│ ├── data
│ │ ├── reference_genomes
│ │ │ └── small
│ │ │ └── small_reference.*
│ │ └── test_data
│ │ ├── coordtable.tsv
│ │ ├── decodetable.tsv
│ │ ├── pools
│ │ │ └── *.fq.gz
│ │ ├── pooltable.tsv
│ │ ├── snvlist.tsv
│ │ └── target_calling.bed
│ └── helper_scripts
│ └── simulator.py # Script for simulating minimal pipeline data
├── bin # Executable pipeline scripts
│ └── <script>.*
├── conf
│ └── profiles.config # Configuration profiles for compute environments
├── envs
│ └── <name>/
│ └── environment.yaml # Conda environment definitions
├── main.nf # Main workflow
├── modules/
│ └── <module>.nf # Module scripts
├── subworkflows/
│   └── <subworkflow>.nf # Subworkflow scripts
├── next.pbs # Helper script for running on NGC-HPC
└── nextflow.config # Workflow parameters
Create a wrapper script for qsub, so you don't have to keep track of the working directory, group, etc. every time. First run mkdir ~/bin. Then save the following script as a file named ~/bin/myqsub and make it executable with chmod +x ~/bin/myqsub.
#!/bin/bash
# Forward all arguments to qsub with the project group/account set and the current directory as the working directory
qsub -W group_list=icope_staging_r -A icope_staging_r -d "$(pwd)" "$@"
Add ~/bin to your PATH. You can have this done on log-in by appending the following line to your ~/.bashrc:
export PATH="$PATH:$HOME/bin"
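For example, to append the line and apply it to the current session:
echo 'export PATH="$PATH:$HOME/bin"' >> ~/.bashrc
source ~/.bashrc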
Clone the repository:
git clone /ngc/projects/icope_staging_r/git/predisposed/.git .
Run a stub test directly:
bash next.pbs -params-file test_config.json -stub
Run a test directly:
bash next.pbs -params-file test_config.json
Or submit a test run to the queue (note that qsub's -F takes the workflow arguments as a single quoted string):
myqsub next.pbs -F "-params-file test_config.json"
While the pipeline is still under development, it makes sense to create a fresh clone for each pipeline run, to keep track of any changes made while it is running. I propose this folder structure:
predisposed
├── git/DoBSeqWF # Temporary local workflow repository
├── resources # Reference genome and target files.
├── data/ # Raw data for each batch
│ ├── <batch_id_I>/
│ │ └── *.fq.gz
│ ├── <batch_id_II>/
│ │ └── *.fq.gz
│   ├── <batch_id_III>/
│   │   └── *.fq.gz
│   └── ...
└── processed_data/ # Processed data for each batch
├── <batch_id_I>/
│ ├── DoBSeqWF/ # Clone repository here
│ │ ├── config.json # Configuration file
│ │ ├── pooltable.tsv # Pool table (create with helper script)
│ │ └── decodetable.tsv # Decode table (we need a convention for this)
│ └── results
│ ├── cram/ # CRAM files for each pool
│ ├── logs/ # Log files for each process
│ ├── variants/ # VCF files for each pool
│ ├── variant_tables/ # TSV files converted from pool VCFs
│ └── pinpoint_variants/
│ ├── all_pins/ # All pinpointables for each sample in individual vcfs (*note)
│ ├── unique_pins/ # All unique pinpointables for each sample in individual vcfs (*note)
│ ├── *_merged.vcf.gz # All pinpointables for all samples in a single vcf without sample information
│ ├── summary.tsv # Variant counts for each sample
│ └── lookup.tsv # Variant to sample lookup table
├── <batch_id_II>/
│ ├── DoBSeqWF/
│ └── results/
├── <batch_id_III>/
│ ├── DoBSeqWF/
│   └── results/
└── ...
(*note) Each pinpointable variant can be represented by either the horizontal or the vertical pool. In order not to lose any information, there are, for now, six VCF files for each sample: four with representations from either dimension, named {sample}_{pool}_{unique/all}_pins.vcf.gz, and two with all pins merged, named {sample}_{unique/all}.vcf.gz.
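As a sketch of this naming convention, for a hypothetical sample individual1 pooled in pool_row_1 and pool_column_1, the six files would be:
individual1_pool_row_1_all_pins.vcf.gz
individual1_pool_row_1_unique_pins.vcf.gz
individual1_pool_column_1_all_pins.vcf.gz
individual1_pool_column_1_unique_pins.vcf.gz
individual1_all.vcf.gz
individual1_unique.vcf.gz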
cd /ngc/projects2/dp_00005/data/predisposed/
mkdir -p data/<batch_id> processed_data/<batch_id>    # create raw and processed data folders for the batch
mv /ssi/fastq/data /ngc/projects2/dp_00005/data/predisposed/data/<batch_id>/    # move the raw FASTQ delivery into place
cd processed_data/<batch_id>
git clone /ngc/projects2/dp_00005/data/predisposed/git/DoBSeqWF    # clone a fresh copy of the workflow for this batch
cd DoBSeqWF
bash assets/helper_scripts/create_pooltable.sh ../../../data/<batch_id>/    # generate pooltable.tsv from the raw data folder
Fill out config.json with the correct paths and parameters. The decode table is not needed for mapping-only runs. Look in nextflow.config for the possible parameters to set in config.json.
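As a sketch, a mapping-only configuration could look like the following (paths are placeholders and parameter names are assumed to match the command-line flags listed earlier; check nextflow.config for the actual names and any mode switches):
cat > config.json <<'EOF'
{
    "pooltable": "pooltable.tsv",
    "reference_genome": "/ngc/path/to/indexed/reference.fa",
    "bedfile": "/ngc/path/to/target_regions.bed"
}
EOF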
myqsub next.pbs -F "-params-file config.json"
tail nextflow.log
If the pipeline fails, it is most likely due to resource constraints. Adjust these as needed in the conf/profiles.config file under the NGC profile and rerun the PBS script. Be aware that any direct edits of the workflow scripts, i.e. modules and subworkflows, can trigger a complete re-run of the pipeline.
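As a sketch of such an adjustment, resource limits can be raised per process inside the NGC profile (the process name below is a placeholder; use the name of the failing process as reported in the log):
// in conf/profiles.config, inside the NGC profile
process {
    withName: 'MAPPING' {   // placeholder: name of the failing process
        cpus   = 8
        memory = '32 GB'
        time   = '24h'
    }
}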