This is a list of questions we have regularly gotten on the project. See below for questions about the ngless language.
First of all, you can actually use NGLess through Python, using NGLessPy.
However, the native mode of NGLess is its internal DSL (domain-specific language). There are several advantages to this approach:
output/results.txt as your output filename, we will check if a directory named output exists and is writable. We also try to be helpful in the error messages (misspelled a parameter value? Here’s an error, but also my best guess of what you meant + all legal values). I really care about error messages.
While the basic types and syntax of the language are fixed, it is not hard to add external modules that introduce new functions. These can be described with a YAML file and can use any command line tool.
Adding new model organisms can similarly be done with a simple YAML file.
More advanced extensions can be done in Haskell, but this is considered a solution for advanced users.
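The up-front checks and "did you mean" style of error message mentioned above can be illustrated with a short Python sketch. This is only an illustration of the idea, not NGLess's own implementation: the function names `check_output_path` and `check_symbol` and the example legal values are hypothetical.

```python
import difflib
import os


def check_output_path(ofile):
    """Return an error message if the directory of `ofile` is unusable, else None."""
    directory = os.path.dirname(ofile) or "."
    if not os.path.isdir(directory):
        return f"output directory '{directory}' does not exist"
    if not os.access(directory, os.W_OK):
        return f"output directory '{directory}' is not writable"
    return None


def check_symbol(value, legal_values):
    """Return an error message (with a best guess, if any) for an illegal value."""
    if value in legal_values:
        return None
    message = f"illegal value '{value}'."
    # Suggest the closest legal value, if one is similar enough.
    guesses = difflib.get_close_matches(value, legal_values, n=1)
    if guesses:
        message += f" Did you mean '{guesses[0]}'?"
    return message + " Legal values are: " + ", ".join(legal_values) + "."
```

Running both checks before any computation starts means a typo is reported immediately rather than after hours of processing.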
Short answer: Bioboxes gets us part of the way there, but not all of it; however, if we think of these technologies as complements, we might get more out of them.
Several of the goals of ngless can be fulfilled with a technology such as bioboxes. Namely, we can obtain reproducibility of computation, including across platforms, using bioboxes without having to bother with ngless. However, the result is less readable than an ngless script, and readability is another important goal of ngless. An ngless script can easily be submitted as supplementary methods to a journal publication and can be scrutinized by a knowledgeable reviewer more easily than a Docker container.
Furthermore, the fact that we work with a smaller domain than a Docker-based solution (we only care about NGS) means that we can provide the users a better experience than is possible with a generic tool. In particular, when the user makes a mistake (and all users will make mistakes), we can diagnose it faster and provide a better error message than is possible to do with Bioboxes.
As with the question above, we consider ngless to be related to but not overlapping with the CWL (Common Workflow Language).
In particular, much of the functionality of ngless can also be accessed in a CWL workflow using our command line wrappers, all of which have CWL wrappers.
Additionally (with some limitations), you can embed a generic NGLess script within a larger CWL workflow by using the --export-cwl functionality. For example, to automatically generate a wrapper for a script called my-script.ngl:
ngless --export-cwl=wrapper.cwl my-script.ngl
The automatically generated wrapper.cwl file can now be used as a CWL tool within a larger pipeline. See more in the CWL page.
Generally speaking, it does not. It can be used with HPC clusters, whereby you simply run a job that calls the ngless binary.
The parallel module can be used to split large jobs into many tasks, so that you can run multiple ngless instances that collaborate. It is written such that it does not depend on the HPC scheduler and can, thus, be used in any HPC system (or even, for smaller jobs, on a single machine).
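One general way for independent processes to share work without relying on a scheduler is to coordinate through a shared filesystem. The Python sketch below illustrates that idea with atomically created lock files. It is an illustration under assumptions, not a description of how the parallel module is implemented: `claim_next`, the lock directory, and the task names are all hypothetical.

```python
import os


def claim_next(tasks, lock_dir):
    """Return the first task this process manages to claim, or None if all are taken.

    A task is claimed by creating `<lock_dir>/<task>.lock`. Opening with
    O_CREAT | O_EXCL fails if the file already exists, so only one process
    can claim a given task.
    """
    os.makedirs(lock_dir, exist_ok=True)
    for task in tasks:
        lock_path = os.path.join(lock_dir, task + ".lock")
        try:
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            continue  # another process already claimed this task
        os.close(fd)
        return task
    return None
```

Each instance would call `claim_next` repeatedly, processing whatever task it receives, until the function returns `None`. Because the only shared state is a directory, the same approach works under any scheduler or on a single machine.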
Yes, you can. Just add them as additional arguments and they will be available
inside your script as
If you are familiar with the concept, you can think of them as enums in other languages.
Whenever a symbol is used in the argument to a function, it means that the function accepts only one of a small number of possible symbols for that argument. This improves error checking and readability.
Does the select function work on inserts (considering both mates) or per-read (treating the data as single-ended)?
select considers the insert as a whole, but you can have it consider each read as single-end by setting the paired argument to
Short answer: If you have a choice, use TSV; if you must, use GFF.
Long answer: The TSV format is much more limited, annotating each reference sequence with a set of annotation terms. This is appropriate for gene catalogues. With the GFF format, you can annotate areas of the reference with different annotations. This is appropriate for (1) mapping metagenomes against reference genomes (where, due to strain variability, different areas of the genome may be differentially present in your samples) and (2) mapping (meta)transcriptomes against reference genomes.
The GFF format is, thus, more powerful than the TSV format (this is meant in the strict sense: everything that you can do with the TSV file can also be done with a GFF by setting the coordinates to cover the whole sequence). However, this implies a significantly higher computational cost (both in terms of time and memory usage), which is why you should not use the functionality unless you need it.
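The claim that a TSV annotation can always be expressed as a GFF can be made concrete with a small Python sketch that builds one GFF line spanning an entire reference sequence. The function name `tsv_row_to_gff`, the source label, the feature type, and the attribute key used in the example are assumptions made for illustration; whether a given tool accepts exactly these values is not something this sketch establishes.

```python
def tsv_row_to_gff(seq_name, seq_length, annotations):
    """Build one GFF line annotating the whole of `seq_name`.

    `annotations` maps a (hypothetical) attribute name to its value,
    for example {"ko": "K00001"}.
    """
    attributes = ";".join(f"{key}={value}" for key, value in annotations.items())
    columns = [
        seq_name,         # seqid
        "tsv2gff",        # source (arbitrary label, an assumption)
        "gene",           # feature type (an assumption)
        "1",              # start: GFF coordinates are 1-based
        str(seq_length),  # end: last base, so the feature covers the sequence
        ".",              # score (not available)
        ".",              # strand (not specified)
        ".",              # phase (not applicable)
        attributes,
    ]
    return "\t".join(columns)
```

For instance, `tsv_row_to_gff("gene1", 900, {"ko": "K00001"})` yields a single feature from position 1 to 900. Note that the conversion needs the sequence lengths, which the TSV file alone does not contain.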