Updated docs.
rfm-targa committed Feb 22, 2024
1 parent 0d2ab58 commit 77b7d85
Showing 7 changed files with 41 additions and 34 deletions.
4 changes: 2 additions & 2 deletions CHEWBBACA/AlleleCall/allele_call.py
@@ -2597,8 +2597,8 @@ def allele_calling(fasta_files, schema_directory, temp_directory,
config['CPU cores'],
show_progress=False)

################ Need to identify representative candidates that match several loci and remove them from the analysis

# Need to identify representative candidates that match several
# loci and remove them from the analysis
excluded = []
representative_candidates = {}
for r in class_results:
14 changes: 8 additions & 6 deletions CHEWBBACA/docs/user/getting_started/important_notes.rst
@@ -3,15 +3,17 @@ Important Notes

- chewBBACA only works with **Python 3** (automatic testing for Python 3.8-3.11
with GitHub Actions).
- We strongly recommend that users install and use **BLAST 2.9.0+** with chewBBACA, as
chewBBACA's processes have been extensively tested with that version of BLAST.
- We strongly recommend that users install and use **BLAST 2.9.0+** with chewBBACA<=3.3.2, as the files passed to ``-seqidlist`` are not converted to
a binary format with ``blastdb_aliastool``, which might affect performance in some cases if using BLAST>=2.10. chewBBACA>=3.3.3 determines if the files should be converted
to the binary format based on the BLAST version.
- chewBBACA includes Prodigal training files for some species. You can consult the list of
Prodigal training files that are readily available `here <https://github.com/B-UMMI/chewBBACA/tree/master/CHEWBBACA/prodigal_training_files>`_.
We strongly recommend using the same Prodigal training file for schema creation and allele calling to ensure consistent results.
- chewBBACA defines an allele as a complete Coding DNA Sequence, with start and stop codons
- chewBBACA defines an allele as a complete Coding DNA Sequence (CDS), with start and stop codons
according to the `NCBI genetic code table 11 <http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi>`_
(identified using `Prodigal <https://github.com/hyattpd/prodigal/releases/>`_ by default, but with the option to provide FASTA
files with coding sequences). It will automatically exclude any allele for which the DNA sequence does not contain start or stop
(identified using `Prodigal <https://github.com/hyattpd/prodigal/releases/>`_ for chewBBACA<=3.2.0 and
`Pyrodigal <https://github.com/althonos/pyrodigal>`_ for chewBBACA>=3.3.0, but with the option to provide FASTA
files with CDSs). It will automatically exclude any allele for which the DNA sequence does not contain start or stop
codons or whose length is not a multiple of three. Alleles that contain ambiguous bases are also excluded.
- Make sure that your FASTA files are in UNIX format. If they were created in Linux or macOS
systems they should be in the correct format, but if they were created in Windows systems,
@@ -20,7 +22,7 @@ Important Notes
:doc:`PrepExternalSchema </user/modules/PrepExternalSchema>` module to convert the schema to a format
fully compatible with chewBBACA v3.
- If you are running chewBBACA in an environment with multiple processes accessing the same schema please use the ``--no-inferred``
option (see :doc:`Allele call </user/modules/AlleleCall>`)
option (see :doc:`Allele call </user/modules/AlleleCall>`).
- Input files should have short names without blank spaces or special characters. It is also important to ensure each input file has
a unique basename prefix (everything before the first ``.`` in the basename). You can read more about this in the :doc:`FAQ </user/help_support/faq>` section.
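
The allele definition in the notes above can be illustrated with a minimal sketch. This is not chewBBACA's implementation: the function name ``check_allele`` is hypothetical, and the start codon set (ATG, GTG, TTG) is an assumption based on the start codons Prodigal considers. The stop codons are those of genetic code table 11.

```python
# Assumption: the start codons considered by Prodigal.
START_CODONS = {"ATG", "GTG", "TTG"}
# Stop codons in NCBI genetic code table 11.
STOP_CODONS = {"TAA", "TAG", "TGA"}
VALID_BASES = set("ACGT")


def check_allele(sequence):
    """Return None if the sequence passes the checks, else the reason it fails.

    Hypothetical helper that mirrors the exclusion criteria listed in the
    notes: ambiguous bases, length not a multiple of three, missing start
    codon, missing stop codon.
    """
    seq = sequence.upper()
    if set(seq) - VALID_BASES:
        return "ambiguous bases"
    if len(seq) % 3 != 0:
        return "length not a multiple of three"
    if seq[:3] not in START_CODONS:
        return "no start codon"
    if seq[-3:] not in STOP_CODONS:
        return "no stop codon"
    return None
```

A sequence such as ``ATGAAATAA`` passes all four checks, while ``ATGAANTAA`` is rejected because it contains an ambiguous base.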

11 changes: 6 additions & 5 deletions CHEWBBACA/docs/user/getting_started/installation.rst
@@ -8,17 +8,18 @@ Install the latest released version using `conda <https://anaconda.org/bioconda/

::

conda create -c bioconda -c conda-forge -n chewie "chewbbaca=3.3.2"
conda create -c bioconda -c conda-forge -n chewie "chewbbaca=3.3.3"

If you're having issues installing chewBBACA through conda, we recommend that you install
`mamba <https://mamba.readthedocs.io/en/latest/index.html>`_ and run the following command:
If you're having issues installing chewBBACA through conda, please verify that you are using
conda>=22.11, and enable the libmamba solver, which might speed up the installation process.
You can also install `mamba <https://mamba.readthedocs.io/en/latest/index.html>`_ and run the following command:

::

mamba create -c bioconda -c conda-forge -n chewie "chewbbaca=3.3.2"
mamba create -c bioconda -c conda-forge -n chewie "chewbbaca=3.3.3"

.. important::
We strongly recommend that users install and use BLAST 2.9.0+. Please open an
We strongly recommend that users install and use BLAST 2.9.0+ with chewBBACA<=3.3.2. Please open an
`issue <https://github.com/B-UMMI/chewBBACA/issues>`_ if you find any problems with any
of the dependencies.
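
To check which side of the BLAST recommendation an installation falls on, the version reported by BLAST can be compared against 2.10. The sketch below is illustrative only and is not chewBBACA's code: the function names are hypothetical, the input is assumed to be a version string such as ``blastn: 2.9.0+``, and the 2.10 threshold is taken from the note about ``-seqidlist`` files in the Important Notes.

```python
import re


def parse_blast_version(version_output):
    """Extract (major, minor, patch) from text such as 'blastn: 2.9.0+'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return tuple(int(part) for part in match.groups())


def should_convert_seqidlist(version_output):
    """Return True when the BLAST version is 2.10 or newer.

    Assumption: 2.10 is the version from which converting the files passed
    to -seqidlist to the binary format becomes relevant.
    """
    return parse_blast_version(version_output) >= (2, 10, 0)
```

Comparing tuples of integers avoids the pitfall of comparing version strings directly, where ``"2.9.0"`` would sort after ``"2.10.0"``.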

13 changes: 6 additions & 7 deletions CHEWBBACA/docs/user/getting_started/quick_start.rst
@@ -18,11 +18,10 @@ input files, one full path per line):
chewBBACA.py CreateSchema -i InputAssemblies -o OutputSchemaFolder --ptf ProdigalTrainingFile

Option 2 - Coding sequences
............................
Option 2 - Coding DNA Sequences
...............................

If you do not need/want to use Prodigal for gene prediction, you can provide FASTA files with coding
sequences (CDSs) and use the ``--cds`` parameter to skip the gene prediction step:
You can provide FASTA files with Coding DNA Sequences (CDSs) and skip the gene prediction step by passing the ``--cds`` parameter:

::
@@ -35,7 +34,7 @@ sequences (CDSs) and use the ``--cds`` parameter to skip the gene prediction ste
.. note::
The CreateSchema module creates a schema seed with one representative allele per locus in the
schema. To include more allele variants in the schema, we recommend starting by performing
allele calling with the set of genome assemblies/coding sequences used for schema creation.
allele calling with the set of genome assemblies/CDSs used for schema creation.


Option 3 - Adapt an external schema
@@ -63,13 +62,13 @@ Determine the allelic profiles for genome assemblies:

chewBBACA.py AlleleCall -i InputAssemblies -g OutputSchemaFolder/SchemaName -o OutputFolderName

Use a subset of the loci in a schema:
Perform allele calling with a subset of the schema loci:

::

chewBBACA.py AlleleCall -i InputAssemblies -g OutputSchemaFolder/SchemaName -o OutputFolderName --gl LociList.txt

Provide FASTA files with coding sequences (one file per genome/strain):
Provide FASTA files with CDSs (one file per genome/strain):

::

14 changes: 8 additions & 6 deletions CHEWBBACA/docs/user/modules/AlleleCall.rst
@@ -10,7 +10,7 @@ influenced by several factors:

- Quality of the sequence assembly (influenced by several aspects, such as the sequencing
method, the assembler used, etc);
- If the alleles must correspond to coding sequences (CDSs) and open reading frames (ORFs);
- If the alleles must correspond to Coding DNA Sequences (CDSs) and open reading frames (ORFs);
- Presence of possibly homologous loci (this situation can result in an allele assignment
to a possibly wrong locus given the difficulty in distinguishing closely related homologs).

@@ -34,7 +34,7 @@ with Prodigal will be skipped.

The allele calling algorithm has the following main steps:

- Gene prediction with Prodigal followed by coding sequence (CDS) extraction to create FASTA files
- Gene prediction with Prodigal followed by CDS extraction to create FASTA files
that contain all CDSs extracted from the inputs (There is also the option to provide FASTA files
with CDSs and the ``--cds`` parameter to skip the gene prediction step with Prodigal).

@@ -208,9 +208,9 @@ Parameters

.. important::
If you provide the ``--cds-input`` parameter, chewBBACA assumes that the input FASTA files contain
coding sequences and skips the gene prediction step with Prodigal. To avoid issues related to the
CDSs and skips the gene prediction step with Prodigal. To avoid issues related to the
format of the sequence headers, chewBBACA renames the sequence headers based on the unique basename
prefix determined for each input file and on the order of the coding sequences (e.g.: coding sequences
prefix determined for each input file and on the order of the CDSs (e.g.: CDSs
inside a file named ``GCF_000007125.1_ASM712v1_cds_from_genomic.fna`` are renamed to
``GCF_000007125-protein1``, ``GCF_000007125-protein2``, ..., ``GCF_000007125-proteinN``).
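
The renaming described in this note can be sketched as follows. The helper names are hypothetical and this is not chewBBACA's own code. It assumes only what the docs state: the unique prefix is everything before the first ``.`` in the basename, and headers are numbered in the order of the CDSs.

```python
import os
from collections import defaultdict


def basename_prefix(path):
    """Return everything before the first '.' in the file's basename."""
    return os.path.basename(path).split(".")[0]


def renamed_headers(path, cds_count):
    """Build the headers '<prefix>-protein1' ... '<prefix>-proteinN'."""
    prefix = basename_prefix(path)
    return [f"{prefix}-protein{i}" for i in range(1, cds_count + 1)]


def find_prefix_clashes(paths):
    """Group input files that share a basename prefix (these need renaming)."""
    groups = defaultdict(list)
    for path in paths:
        groups[basename_prefix(path)].append(path)
    return {prefix: files for prefix, files in groups.items() if len(files) > 1}
```

Running ``find_prefix_clashes`` over the input list before allele calling is one way to catch files whose prefixes are not unique.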

@@ -236,8 +236,10 @@ Outputs


- The ``cds_coordinates.tsv`` file contains the coordinates (genome unique identifier, contig
identifier, start position, stop position, protein identifier attributed by chewBBACA and coding
strand) of the coding sequences identified in each genome.
identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding
strand (chewBBACA<=3.2.0 assigns 1 to the forward strand and 0 to the reverse strand and
chewBBACA>=3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively)) of the CDSs
identified in each genome.

- The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
excluded based on the minimum sequence size value and presence of ambiguous bases.
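
Because the coding strand encoding in ``cds_coordinates.tsv`` differs between versions, scripts that read files from both may want to normalize it. A minimal sketch, assuming only the two encodings described above (the function name is hypothetical):

```python
def normalize_strand(value):
    """Map a coding strand value to 1 (forward) or -1 (reverse).

    Accepts both encodings described in the docs: 1/0 (chewBBACA<=3.2.0)
    and 1/-1 (chewBBACA>=3.3.0).
    """
    strand = int(value)
    if strand == 1:
        return 1
    if strand in (0, -1):
        return -1
    raise ValueError(f"unexpected strand value: {value!r}")
```

Any other value raises an error rather than being silently mapped, so an unexpected encoding is caught early.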
13 changes: 8 additions & 5 deletions CHEWBBACA/docs/user/modules/CreateSchema.rst
@@ -192,8 +192,11 @@ Outputs
- The training file passed to create the schema is also included in ``OutputFolderName/SchemaName``
and will be automatically detected during the allele calling process.

- A file with the coordinates of the identified genes in each genome passed to create the schema,
``cds_coordinates.tsv``.

- A file with the list of alleles predicted by Prodigal that were excluded based on the
minimum sequence length value and the presence of ambiguous bases, ``invalid_cds.txt``.
- The ``cds_coordinates.tsv`` file contains the coordinates (genome unique identifier, contig
identifier, start position, stop position, protein identifier attributed by chewBBACA, and coding
strand (chewBBACA<=3.2.0 assigns 1 to the forward strand and 0 to the reverse strand and
chewBBACA>=3.3.0 assigns 1 and -1 to the forward and reverse strands, respectively)) of the CDSs
identified in each genome.

- The ``invalid_cds.txt`` file contains the list of alleles predicted by Prodigal that were
excluded based on the minimum sequence size value and presence of ambiguous bases.
6 changes: 3 additions & 3 deletions CHEWBBACA/docs/user/modules/PrepExternalSchema.rst
@@ -112,10 +112,10 @@ Outputs

- ``<adapted_schema>_invalid_alleles.txt`` - contains the identifiers of the alleles that were
excluded and the reason for the exclusion of each allele.
- ``<adapted_schema>_invalid_genes.txt`` - contains the list of genes that had no valid alleles.
- ``<adapted_schema>_summary_stats.tsv`` - contains summary statistics for each gene. Number of
- ``<adapted_schema>_invalid_genes.txt`` - contains the list of genes that had no valid alleles, one gene identifier per line.
- ``<adapted_schema>_summary_stats.tsv`` - contains summary statistics for each gene (number of
alleles in the external schema, number of valid alleles included in the adapted schema and
number of representatives.
number of representative alleles chosen by chewBBACA).

.. note::
For most genes, only one or a few sequences need to be chosen as representatives to