Preprocessing FastQ Data

Preprocessing FastQ files consists of quality trimming and filtering of reads as well as (possible) elimination of reads which match some reference which is not of interest.

Quality-based filtering

Filtering reads based on quality is performed with the preprocess function, which takes a block of code. This block of code will be executed for each read. For example:

ngless "1.4"

input = fastq('input.fq.gz')

input = preprocess(input) using |r|:
    r = substrim(r, min_quality=20)
    if len(r) < 45:
        discard

If it helps you, you can think of the preprocess block as a foreach loop, with the special keyword discard that removes the read from the collection. Note that the name r is just a variable name, which you choose using the |r| syntax.

Within the preprocess block, you can modify the read in several ways:

  • you can trim it with the indexing operator: r[trim5:] or r[:-trim3]
  • you can call substrim, endstrim or smoothtrim to trim the read based on quality scores. substrim finds the longest substring such that all bases are above a minimum quality (hence the name, which phonetically combines substring and trim). endstrim chops bases off the ends and smoothtrim averages quality scores using a sliding window before applying substrim.
  • you can test for the length of the sequence (before or after trimming). For this, you use the len function (see example above).
  • you can test for the average quality score (using the avg_quality() method).

You can combine these in different ways. For example, the behaviour of the fastx quality trimmer can be recreated as:

preprocess(input) using |r|:
    r = endstrim(r, min_quality=20)
    if r.fraction_at_least(20) < 0.5:
        discard
    if len(r) < 45:
        discard

Handling paired end reads

When your input is paired-end, the preprocess call above will handle each mate independently. Three things can happen:

  1. both mates are discarded,
  2. both mates are kept (i.e., not discarded),
  3. one mate is kept, the other discarded.

The only question is what to do in the third case. By default, the preprocess call keep the mate turning the read into an unpaired read (a single), but you can change that behaviour by setting the keep_singles argument to False:

preprocess(input, keep_singles=False) using |r|:
    r = substrim(r, min_quality=20)
    if len(r) < 45:
        discard

Now, the output will consist of only paired-end reads.

Filtering reads matching a reference

It is often also a good idea to match reads against some possible contaminant database. For example, when studying the host associated microbiome, you will often want to remove reads matching the host. It is almost always a good to at least check for human contamination (during lab handling).

For this, you map the reads against the human genome:

mapped_hg19 = map(input, reference='hg19')

Now, mapped_hg19 is a set of mapped reads. Mapped reads are reads, their qualities, plus additional information of how they matched. Mapped read sets are the internal ngless representation of SAM files.

To filter the set, we will select. Like preprocess, select also uses a block for the user to specify the logic:

mapped_hg19 = select(mapped_hg19) using |mr|:
    mr = mr.filter(min_match_size=45, min_identity_pc=90, action={unmatch})
    if mr.flag({mapped}):
        discard

We first set a minimum match size and identity percentage to avoid spurious hits. We keep the reads but unmatch them (i.e., we clear any information related to a match). Then, we discard any reads that match by checking the flag {mapped}.

Finally, we convert the mapped reads back to simple reads using the as_reads function (this discards the matching information):

input = as_reads(mapped_hg19)

Now, input can be passed to the next step in the pipeline.

Privacy: Usage of this site follows EMBL’s Privacy Policy. In accordance with that policy, we use Matomo to collect anonymised data on visits to, downloads from, and searches of this site. Contact: bork@embl.de.