The count()
function takes a MappedReadSet
(the logical equivalent of a
SAM/BAM file) and summarizes the information therein. The count()
function
can perform three types of operation, depending on the features
argument:
seqname
: only the MappedReadSet
is necessary.functional_map
(TSV) fileOption #1 is the simplest to understand: just summarize based on the names of
the sequences. This is appropriate for obtaining per-gene abundances from gene
catalogs. Options #2 and #3 are similar: for each MappedRead
, count()
will use the passed reference to map to a set of features and summarize those.
Using the GFF format is much more flexible and allows for a lot more filtering,
but also significantly costlier in time and memory. At a high-level, the
process is similar:
For example, if you have an insert (either a paired-end or single-end
short-read) that mapped to a gene called G1
and that gene is annotated with
two gene ontology terms, both will be considered and their counts adjusted. If
the insert mapped to multiple genes, then all the terms will be considered, but
as a set: if an insert is mappped to G1
, which has two GO annotations and
also to G2
which has the same GO annotations (which is very frequent), then
those annotations will be counted a single time.
How counts are adjusted in the presence of multiple annotations is defined by
the multiple
argument. Generally, for obtaining gene abundances, distribution
of multiple mappers is the best (using multiple={dist1}
), while for
functional annotations, you want to count them all (using multiple={all1}
).
This implies that the functional annotations will sum to a higher value than
the number of reads. This may seem strange at first, but it is the intended
behaviour.
See also the full description of all count arguments in the API docs.
functional_map
argument¶The file consists of a header and content.
The simplest header is just a single line of tab separated column headers. That
line may start with a #
sign, which is ignored. Alternatively, a multi-line
header consists of multiple lines starting with #
. The last one of these will
be parsed as the header.
ALl the
#
sign#geneID feat1 feat2 feat3
G1 ann ann ann
G2 ann ann ann
#
signgeneID feat1 feat2 feat3
G1 ann ann ann
G2 ann ann ann
#
signs are required# My comment can span multiple lines
# The last one of these is the header!
#geneID feat1 feat2 feat3
G1 ann ann ann
G2 ann ann ann
Note: format #3 is only supported in NGLess version 1.1 and above
Values can be
,
or |
characters.#geneID feat1 feat2 feat3
G1 a1,a2 b c
G2 a1|a3 c
In this case, the mappings are:
G1
has the properties a1
and a2
in the feat1
feature; b
in the
feat2
feature; and c
in the col3
G2
has the properties a1
and a3
in the feat1
feature; and c
in the
col3
. There is no feat2
associated with G2
and, from feat2
’s
point-of-view, inserts mapped to G2
are considered unmappedAs of NGLess 1.1, spaces are not allowed: i.e., a, b
is the feature a
and
the feature b
(space followed by b). This is arguably sub-optimal.
Privacy: Usage of this site follows EMBL’s Privacy Policy. In accordance with that policy, we use Matomo to collect anonymised data on visits to, downloads from, and searches of this site. Contact: bork@embl.de.