Preprocessing FastQ files consists of quality trimming and filtering of reads, as well as (optionally) eliminating reads that match some reference which is not of interest.
Filtering reads based on quality is performed with the preprocess function, which takes a block of code. This block of code will be executed for each read. For example:
ngless "1.4"
input = fastq('input.fq.gz')

input = preprocess(input) using |r|:
    r = substrim(r, min_quality=20)
    if len(r) < 45:
        discard
If it helps you, you can think of the preprocess block as a foreach loop, with the special keyword discard that removes the read from the collection. Note that the name r is just a variable name, which you choose using the |r| syntax.
Within the preprocess block, you can modify the read in several ways:

- you can trim it with the indexing operator: r[trim5:] or r[:-trim3]
- you can call substrim, endstrim or smoothtrim to trim the read based on quality scores. substrim finds the longest substring such that all bases are above a minimum quality (hence the name, which phonetically combines substring and trim). endstrim chops low-quality bases off the ends, and smoothtrim averages quality scores using a sliding window before applying substrim.
- you can test the length of the read with the len function (see the example above).
- you can test its average quality score with the avg_quality() method.

You can combine these in different ways. For example, the behaviour of the fastx quality trimmer can be recreated as:
preprocess(input) using |r|:
    r = endstrim(r, min_quality=20)
    if r.fraction_at_least(20) < 0.5:
        discard
    if len(r) < 45:
        discard
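As a rough illustration of what these quality operations compute, here is a Python sketch of substrim's longest-good-substring search and of a fraction_at_least test. This is a simplified model, not ngless's actual implementation: it assumes qualities are plain lists of Phred scores.

```python
def substrim(qualities, min_quality):
    """Find the longest contiguous run of positions whose quality is
    >= min_quality; return (start, end) of that run (end exclusive)."""
    best_start, best_len = 0, 0
    run_start = None
    # A sentinel below the threshold closes any run still open at the end.
    for i, q in enumerate(qualities + [min_quality - 1]):
        if q >= min_quality:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start > best_len:
                best_start, best_len = run_start, i - run_start
            run_start = None
    return best_start, best_start + best_len

def fraction_at_least(qualities, q):
    """Fraction of bases with quality >= q (cf. r.fraction_at_least(20))."""
    return sum(x >= q for x in qualities) / len(qualities)
```

For example, substrim([10, 30, 30, 30, 10, 30], 20) returns (1, 4): positions 1 through 3 form the longest run of qualities at or above 20.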
When your input is paired-end, the preprocess call above will handle each mate independently. Three things can happen: both mates are kept, both mates are discarded, or one mate is kept while the other is discarded.
The only question is what to do in the third case. By default, the preprocess call keeps the mate, turning the read into an unpaired read (a single), but you can change that behaviour by setting the keep_singles argument to False:
preprocess(input, keep_singles=False) using |r|:
    r = substrim(r, min_quality=20)
    if len(r) < 45:
        discard
Now, the output will consist of only paired-end reads.
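The three outcomes, and the effect of keep_singles, can be summarised with a small Python sketch (illustrative only; this is not ngless code):

```python
def pair_outcome(keep_mate1, keep_mate2, keep_singles=True):
    """How a read pair is resolved after preprocess has independently
    decided whether to keep each mate."""
    if keep_mate1 and keep_mate2:
        return "paired"      # both mates kept: the pair survives intact
    if not keep_mate1 and not keep_mate2:
        return "discarded"   # both mates discarded
    # Exactly one mate survives: kept as a single unless keep_singles=False.
    return "single" if keep_singles else "discarded"
```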
It is often also a good idea to match reads against a database of possible contaminants. For example, when studying the host-associated microbiome, you will often want to remove reads matching the host. It is almost always a good idea to at least check for human contamination (introduced during lab handling).
For this, you map the reads against the human genome:
mapped_hg19 = map(input, reference='hg19')
Now, mapped_hg19 is a set of mapped reads. Mapped reads are reads, their qualities, plus additional information on how they matched. Mapped read sets are the internal ngless representation of SAM files.
To filter the set, we will use select. Like preprocess, select also uses a block for the user to specify the logic:
mapped_hg19 = select(mapped_hg19) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    if mr.flag({mapped}):
        discard
We first set a minimum match size and identity percentage to avoid spurious hits; reads whose matches fail these criteria are kept, but unmatched (i.e., any information related to the match is cleared). Then, we discard any read that still matches by checking the {mapped} flag.
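The effect of this select block can be sketched in plain Python over simplified records, one dict per read. The field names here are hypothetical and do not reflect ngless's internal data model:

```python
def select_unmapped(records, min_match_size=45, min_identity_pc=90):
    """Keep only reads that do not (convincingly) match the reference:
    short or low-identity matches are treated as no match at all."""
    kept = []
    for rec in records:
        if rec['mapped'] and (rec['match_size'] < min_match_size
                              or rec['identity_pc'] < min_identity_pc):
            rec = dict(rec, mapped=False)  # "unmatch": clear the match info
        if rec['mapped']:
            continue                        # still matches: discard the read
        kept.append(rec)
    return kept
```

A read with a long, high-identity hit is discarded; a read with only a short, spurious hit is unmatched and kept, just like a read that never mapped at all.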
Finally, we convert the mapped reads back to simple reads using the as_reads function (this discards the matching information):
input = as_reads(mapped_hg19)
Now, input can be passed to the next step in the pipeline.