======================================= Ocean Metagenomics Functional Profiling ======================================= .. note:: If you are starting out with NGLess for metagenomics profiling, consider using the predefined pipeline collection, `NG-meta-profiler `__. This tutorial is based on deconstructing a pipeline very similar to those. In this tutorial, we will analyse a small dataset of oceanic microbial metagenomes. .. note:: This tutorial uses the full Ocean Microbial Reference Gene Catalog presented in `Structure and function of the global ocean microbiome `__ Sunagawa, Coelho, Chaffron, et al., Science, 2015 This catalog contains ca. 40 million genes and requires **56GiB** of RAM 1. Download the toy dataset First download all the tutorial data:: ngless --download-demo ocean-short This will `download `__ and expand the data to a directory called ``ocean-short``. This is a toy dataset. It is based on real data, but the samples were trimmed so that they contains only 250k paired-end reads. The dataset is organized in so that each directory contains a sample with multiple fastq files). NGLess does not require this structure, but it is convenient:: $ find ./SAMEA2621229.sampled ./SAMEA2621229.sampled/ERR594355_2.short.fq.gz ./SAMEA2621229.sampled/ERR594355_1.short.fq.gz ./SAMEA2621155.sampled ./SAMEA2621155.sampled/ERR599133_1.short.fq.gz ./SAMEA2621155.sampled/ERR599133_2.short.fq.gz ./SAMEA2621033.sampled ./SAMEA2621033.sampled/ERR594391_2.short.fq.gz ./SAMEA2621033.sampled/ERR594391_1.short.fq.gz ./tara.demo.short ./process.ngl The whole script we will be using is there as well (``process.ngl``), so you can immediately run it with:: ngless process.ngl The rest of this tutorial is an explanation of the steps in this script. 2. Preliminary imports To run ngless, we need write a script. We start with a few imports:: ngless "1.4" import "parallel" version "1.0" import "omrgc" version "1.0" These will all be used in the tutorial. 3. Parallelization We are going to process each sample separately. For this, we use the ``lock1`` function from the `parallel `__ module (which we imported before):: samples = readlines('tara.demo.short') sample = lock1(samples) The ``readlines`` function reads a file and returns all lines. In this case, we are reading the ``tara.demo.short`` file, which contains the three samples (``SAMEA2621229.sampled``, ``SAMEA2621155.sampled``, and ``SAMEA2621033.sampled``). ``lock1()`` is a slightly more complex function. It takes a list and *locks one of the elements* and returns it. It always chooses an element which has not been locked before, so you each time you run _ngless_, you will get a different sample. 4. Preprocessing First, we load the data (the FastQ files):: input = load_fastq_directory(sample) And, now, we preprocess the data:: input = preprocess(input, keep_singles=False) using |read|: read = substrim(read, min_quality=25) if len(read) < 45: discard 5. Profiling using the OM-RGC After preprocessing, we map the reads to the ocean microbial reference gene catalog:: mapped = map(input, reference='omrgc', mode_all=True) The line above is the reason we needed to import the ``omrgc`` module: it made the ``omrgc`` reference available. :: mapped = select(mapped, keep_if=[{mapped}, {unique}]) Now, we need to ``count`` the results. This function takes the result of the above and aggregates it different ways. In this case, we want to aggregate by KEGG KOs, and eggNOG OGs:: counts = count(mapped, features=['KEGG_ko', 'eggNOG_OG'], normalization={scaled}) 7. Aggregate the results We have done all this computation, now we need to save it somewhere. We will use the ``collect()`` function to aggregate across all the samples processed:: collect(counts current=sample, allneeded=samples, ofile='omgc.profiles.txt') 8. Run it! This is our script. We save it to a file (``process.ngl`` in this example) and run it from the command line:: $ ngless process.ngl Note that we need a large amount (ca. 64GB) of RAM memory to be able to use the OM-RGC. **You also need to run it once for each sample.** However, this can be done in parallel, taking advantage of high performance computing clusters. Full script ----------- Here is the full script:: ngless "1.4" import "parallel" version "0.0" import "omrgc" version "1.0" samples = readlines('tara.demo.short') sample = lock1(samples) input = load_fastq_directory(sample) input = preprocess(input, keep_singles=False) using |read|: read = substrim(read, min_quality=25) if len(read) < 45: discard mapped = map(input, reference='omrgc', mode_all=True) mapped = select(mapped, keep_if=[{mapped}, {unique}]) collect( count(mapped, features=['KEGG_ko', 'eggNOG_OG'], normalization={scaled}), current=sample, allneeded=samples, ofile='omgc.profile.txt')