|
9 | 9 | ## Description
|
10 | 10 |
|
11 | 11 | This is a reimplementation of [pp-sketchlib](https://github.com/bacpop/pp-sketchlib)
|
12 |
| -in the rust language. |
| 12 | +in the rust language. This version is optimised for larger sample numbers, particularly |
| 13 | +allowing subsets of samples to be compared. |
| 14 | + |
| 15 | +Sketch databases have two files: `.skm` which is the metadata (samples names, base counts etc) |
| 16 | +and `.skd` which is the actual sketch data. |
| 17 | + |
| 18 | +## Usage |
| 19 | +With all options we typically recommend using `-v` to see all progress during the run. |
| 20 | + |
| 21 | +### Sketching |
| 22 | + |
| 23 | +Using input fasta/fastq files, create a sketch database. Run `sketchlib sketch -h` to see the help. |
| 24 | + |
| 25 | +- List .fasta files on the command line, or use `-f` to provide a file(s). From file, |
| 26 | +these are one line per sample listing the name and fasta file, or name and two read files |
| 27 | +(fastq). Inputs can be gzipped or not, this is automatically detected. |
| 28 | +- To set the k-mer size in the sketch database you can either give a list of sizes with `--k-vals` |
| 29 | +or a sequence `--k-seq` with start,stop,step. e.g. `--k-seq 17,29,4` would sketch at k=17, 21, 25 and 29. |
| 30 | +- Set the sketch size with `-s`. Typically 1000 is enough for species level resolution, 10000 for within-species/strain |
| 31 | +resolution and 100000-1000000 for SNP level resolution. |
| 32 | +- To sketch amino acid sequences use `--seq-type aa --concat-fasta` if you have the typical case |
| 33 | +of each fasta file being a multifasta with many aa sequences. Each one will then be its own sample. |
| 34 | +- You can also sketch structures with .pdb input, see 'Enabling PDB->3Di' below. This is experimental. |
| 35 | + |
| 36 | +### Distances |
| 37 | + |
| 38 | +To compute internal all-vs-all core and accessory distances use: |
| 39 | +``` |
| 40 | +sketchlib dist db_name |
| 41 | +``` |
| 42 | +Note the database names can be the prefix, or the full path to the .skm file. The output |
| 43 | +is in pairwise 'long' format, which lists the upper triangle of the distance matrix row-by-row. |
| 44 | + |
| 45 | +To calculate distances between two different sample sets, each in their own sketch database, use: |
| 46 | +``` |
| 47 | +sketchlib dist db1 db2 |
| 48 | +``` |
| 49 | +For example, if you want to query distances of a new sample against an existing database, |
| 50 | +first sketch the new sample with e.g. `sketchlib sketch -o db2 new_sample.fasta`, then |
| 51 | +run the above command. |
| 52 | + |
| 53 | +Modifiers: |
| 54 | +- Use `-k` to calculate Jaccard distance at the given k. Otherwise the default is to |
| 55 | +calculate across multiple k and output core and accessory distances. |
| 56 | +- Use `--ani` with `-k` to transform the Jaccard distance into average nucleotide identity. |
| 57 | +- Use `--subset` to provide a list of sample names to include in the distance calculations, |
| 58 | +only these sample will be loaded from the `.skd` file. |
| 59 | +- Use `-o` to write the distances to a file. The default it to write to stdout, so you can also |
| 60 | +use `>` to redirect to a file (progress messages are written to stderr). |
| 61 | +- Use `--knn` to only keep this many nearest neighbour distances. For very large databases |
| 62 | +it may be useful to keep only ~50 distances. This makes the memory use manageable. This sparse output |
| 63 | +can be used with e.g. [mandrake](https://github.com/bacpop/mandrake). |
| 64 | + |
| 65 | +### Other operations |
| 66 | + |
| 67 | +- `merge` joins two existing sketch databases. |
| 68 | +- `append` sketches new input samples, and adds them to an existing database. |
| 69 | +- `delete` removes samples from a sketch database. |
13 | 70 |
|
14 | 71 | ## Enabling PDB->3Di
|
15 | 72 | conda doesn't work, so make sure it is deactivated
|
|
0 commit comments