Revision 3961deb6 doc/sphinx_doc/tuto.rst

b/doc/sphinx_doc/tuto.rst
1 1
Tutorial
2 2
========
3 3

  
4
This tutorial describes steps allowing performing quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory.
4
This tutorial describes the steps needed to perform a quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory.
5 5

  
6
This tutorial is divided into two main parts. The first part covers the python script `wf.py` that aligns and converts short sequence reads. The second part covers the R scripts that extracts information (nucleosome position and indicators) from the dataset.
6
This tutorial is divided into two main parts. The first part covers the python script `wf.py`, which aligns and converts short sequence reads. The second part covers the R scripts that extract nucleosome-level information (nucleosome position and indicators) from the dataset.
7 7

  
8 8

  
9 9

  
......
14 14
Working Directory Organisation
15 15
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16 16

  
17
After having install NucleoMiner2 environment (Previous section), go to the root working directory of the tutorial by typing the following command in a terminal:
17
After installing the NucleoMiner2 environment (see the previous section), go to the root working directory of the tutorial by typing the following command in a terminal:
18 18

  
19 19
.. code:: bash
20 20

  
......
29 29
$$$ TODO explain how organise Experimental Dataset into the `data` directory of the working directory.
30 30

  
31 31

  
32
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each strain we performed Mnase-Seq and ChIP-Seq using an antibody recognizing the H3K14ac epigenetic mark.
32
In this tutorial, we want to compare the nucleosomes of 2 yeast strains: BY and RM. For each strain, MNase-Seq was performed, as well as ChIP-Seq using an antibody recognizing the H3K14ac epigenetic mark. Illumina sequencing was performed in single-read mode, with reads 50 bp long.
33 33

  
34 34
The dataset is composed of 55 files organised as follows: 
35 35

  
......
62 62
Python and R Common Configuration File
63 63
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
64 64

  
65
First of all we define in one place some configuration variables that will be launched by python and R scripts. These variables are contained in file `configurator.py`. The execution of this python script dumps variables into the `nucleominer_config.json` file that will then be used by both R and python scripts.
65
First, we need to define the configuration variables that will be passed to the python and R scripts. These variables are defined in the file `configurator.py`. Executing this python script dumps the variables into the `nucleominer_config.json` file, which is then used by both the R and python scripts.
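As an illustration of this mechanism, here is a minimal sketch of how a python script can dump variables into a JSON file that other scripts read back. This is not the actual content of `configurator.py`: the helper names `dump_config` and `load_config` and the keys used in the usage note are hypothetical.

```python
import json


def dump_config(config, path="nucleominer_config.json"):
    """Write a dictionary of configuration variables to a JSON file."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2, sort_keys=True)


def load_config(path="nucleominer_config.json"):
    """Read the configuration variables back from the JSON file."""
    with open(path) as handle:
        return json.load(handle)
```

For instance, `dump_config({"read_length": 50, "strains": ["BY", "RM"]})` would write a file that `load_config()` reads back as the same dictionary (the keys are illustrative only). A JSON file is convenient here because it can be parsed by R as well as by python.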
66 66

  
67
To do this, go to the root directory of your project and run the following command:
67
These variables are initialized in the `configurator.py` file. If you need to adapt their values (paths, default parameters...), edit this file. Then go to the root directory of your project and run the following command to dump the configuration file:
68 68

  
69 69
.. code:: bash
70 70

  
......
74 74

  
75 75

  
76 76

  
77

  
78

  
79 77
Preprocessing Illumina Fastq Reads for Each Sample
80 78
--------------------------------------------------
81 79

  
82
This preprocessing step consists of 4 main steps embedded in the `wf.py` script. They are described bellow. As a preamble, this script computes `samples`, `samples_mnase` and `strains` that will be used along the 4 steps.
80
Once the variables and the design have been specified, the `wf.py` script runs the whole analysis automatically; no further intervention is needed.
81
To run the full analysis, run the following command:
82

  
83
.. code:: bash
84

  
85
  python src/current/wf.py
86

  
87
The details of the steps performed by this script are explained below.
88
This preprocessing consists of 4 steps embedded in the `wf.py` script. They are described below. As a preamble, this script computes `samples`, `samples_mnase` and `strains`, which are used throughout the 4 steps.
83 89

  
84 90

  
85 91
.. autodata:: wf.samples
......
95 101
Creating Bowtie Index from each Reference Genome
96 102
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
97 103

  
98
For each strain, we need to create bowtie index. Bowtie index of a strain is a tree view of the genome of this strain. It will be used by bowtie to align reads. This step is performed by the following part of the `wf.py` script:
104
For each strain, the *wf.py* script then creates a bowtie index. The bowtie index of a strain is a compact, searchable representation of the genome of this strain (bowtie builds it using the Burrows-Wheeler transform). It is used by bowtie to align reads. The part of the script that performs this step is the following:
99 105

  
100 106
.. literalinclude:: ../../../snep/src/current/wf.py
101 107
   :start-after: # _STARTOF_ step_1
102 108
   :end-before: # _ENDOF_ step_1
103 109
   :language: python
104 110

  
105
The following table summarizes the file sizes and process durations concerning this step.
111
As an indication, the following table summarizes the file sizes and process durations that we experienced when running this step on a Linux server***.
106 112

  
107 113
======  ======================  ======================  ================
108 114
strain  fasta genome file size  bowtie index file size  process duration
......
117 123
Aligning Reads to Reference Genome 
118 124
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
119 125

  
120
Next, we launch bowtie to align reads to the reference genome. It produces a 
121
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` and `bedtools` are wrapped using python `subprocess` class. This step is performed by the following part of the `wf.py` script:
126
Next, the *wf.py* script launches bowtie to align reads to the reference genome. It produces a `.sam` file that is converted into a `.bed` file. The `bowtie`, `samtools` and `bedtools` binaries are wrapped using the python `subprocess` module. This step is performed by the following part of the script:
122 127

  
123 128
.. literalinclude:: ../../../snep/src/current/wf.py
124 129
   :start-after: # _STARTOF_ step_2
......
128 133
Convert Aligned Reads into TemplateFilter Format
129 134
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
130 135

  
131
TemplateFilter uses particular input formats for reads, so it is necessary to convert the `.bed` files. TemplateFilter expect reads as follows: `chr`, `coord`, `strand` and `#read` where:
136
TemplateFilter uses a particular input format for reads, so it is necessary to convert the `.bed` files. TemplateFilter expects reads in the following format: `chr`, `coord`, `strand` and `#read` where:
132 137

  
133 138
- `chr` is the number of the chromosome;
134 139
- `coord` is the coordinate of the reads;
......
137 142

  
138 143
Each entry is *tab*-separated.
139 144

  
140
**WARNING** for reverse strands, bowtie returns the position of the first nucleotide on the left hand side, whereas TemplateFilter expects the first one on the right hand side.  This step takes this into account by adding the read length (in our case 50) to the reverse reads coordinates.
145
**WARNING** For reverse strands, bowtie returns the position of the first nucleotide on the left hand side, whereas TemplateFilter expects the first one on the right hand side. NucleoMiner2 takes this into account by adding the read length (in our case 50) to the coordinates of reverse-strand reads.
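The coordinate shift described in this warning can be sketched as follows. This is an illustrative example, not the code of `wf.py`: the function names are hypothetical, and we assume that strands are encoded as `+` and `-` and that `#read` counts the reads sharing the same chromosome, coordinate and strand.

```python
from collections import Counter

READ_LENGTH = 50  # read length of the tutorial dataset


def to_tf_coord(coord, strand, read_length=READ_LENGTH):
    """Shift reverse-strand coordinates by the read length; keep others as-is."""
    if strand == "-":  # assumed encoding of the reverse strand
        return coord + read_length
    return coord


def to_tf_lines(reads, read_length=READ_LENGTH):
    """Turn (chr, coord, strand) reads into tab-separated chr/coord/strand/#read lines."""
    counts = Counter(
        (chrom, to_tf_coord(coord, strand, read_length), strand)
        for chrom, coord, strand in reads
    )
    return [
        "\t".join(str(field) for field in (chrom, coord, strand, n))
        for (chrom, coord, strand), n in sorted(counts.items())
    ]
```

With this sketch, a reverse read reported at coordinate 200 is written at coordinate 250, while forward reads keep their coordinate.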
141 146

  
142
This step is performed by the following part of the `wf.py` script:
147
This step is performed by the following part of the *wf.py* script:
143 148

  
144 149
.. literalinclude:: ../../../snep/src/current/wf.py
145 150
   :start-after: # _STARTOF_ step_3
146 151
   :end-before: # _ENDOF_ step_3
147 152
   :language: python
148 153

  
149
The following table summarises the number of reads, the involved file sizes and process durations concerning the two last steps. In our case, alignment process have been multithreaded over 3 cores.
154
The following table summarizes the number of reads, the file sizes involved and the process durations that we experienced when running the last two steps. In our case, the alignment processes were multithreaded over 3 cores.
150 155

  
151 156
==  ==============  =========================  ======  ================  ==================  ================  
152 157
id  Illumina reads  aligned and filtred reads  ratio   `.bed` file size  TF input file size  process duration
......
170 175
Finally, for each sample we perform TemplateFilter analysis. 
171 176

  
172 177
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is 
173
define by its center and its width. An odd width leads us to consider non-
178
defined by its center and its width. An odd width leads us to consider non-
174 179
integer lower and upper bound.
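To make the remark about odd widths concrete, here is a small worked sketch. It assumes that the bounds are taken half a width on each side of the center; the function name `nucleosome_bounds` is hypothetical.

```python
def nucleosome_bounds(center, width):
    """Return (lower_bound, upper_bound), half a width on each side of the center."""
    half_width = width / 2
    return center - half_width, center + half_width
```

For example, a nucleosome centered at 1000 with a width of 147 has bounds 926.5 and 1073.5, which are not integers.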
175 180

  
176
**WARNING** TemplateFilter is not designed to deal with replicates. So we recommend to keep a maximum of nucleosomes and filter the aberrant ones afterwards using the benefits of having replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high value of overlap (300%).
181
**WARNING** TemplateFilter was not designed to handle replicates. We therefore recommend keeping as many nucleosomes as possible and filtering out the aberrant ones afterwards, taking advantage of the replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high value of overlap (300%).
177 182

  
178 183
This step is performed by the following part of the `wf.py` script:
179 184

  
......
367 372

  
368 373

  
369 374

  
370
The second part of the tutorial uses R (http://http://www.r-project.org). It consists of a set of R scripts that will be sourced in an R from a console launched at the root of your project. These scripts are:
375
The second part of the tutorial uses R (http://www.r-project.org). NucleoMiner2 contains a set of R scripts that are sourced in R from a console launched at the root of your project. These scripts are:
371 376

  
372 377
  - headers.R
373 378
  - extract_maps.R
......
380 385
The Script headers.R
381 386
^^^^^^^^^^^^^^^^^^^^
382 387

  
383
The script headers.R is included in each other scripts. It is in charge of: 
388
The script headers.R is included in all other R scripts. It is in charge of: 
384 389

  
385 390
  - loading the libraries used in the scripts
386 391
  - loading the configuration (design, strain, marker...)
387
  - computing and caching CURs (caching means storing the information in the computer's memory)
388

  
389
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks. You may be using NucleoMiner2 to analyse data of a single strain, or of several strains. 
390

  
391
  - All the data corresponds to the same strain (e.g. treatment/control, or only few mutations): Then in step 1), the  regions to use are entire chromosomes. Instep 2) simply use the default translate function which is neutral.
392
  - computing and caching Common Uninterrupted Regions (CURs). Caching means storing the information in the computer's memory.
392 393

  
393
  - The data come from two or more strains: In this case, edit a list of regions and customize the translate function which performs the correspondence between the different genomes. How we did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to the Appendice "Generate .c2c Files"), then use it to produce the list of regions and customise “translate”.
394
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks. 
394 395

  
396
  - You may want to analyze data from a single strain (e.g. treatment/control, or only a few mutations). In this case, the genome is identical across all samples and you do not need to define particular CURs (the CURs are the chromosomes). Simply use the default translate function, which is neutral.
395 397

  
398
  - If you are analyzing data from two or more strains (the case NucleoMiner2 was designed for), then you need to translate the coordinates of one genome into the coordinates of another one. You must do this by aligning the two genomes, which produces a .c2c file (see the appendix "Generate .c2c Files"). Then use it to produce the list of regions and customise “translate”.
396 399

  
397

  
398
In your R console, run the following command line:
400
In this tutorial, we are in the second case. To perform all these steps, run the following command line in your R console:
399 401

  
400 402
.. code:: bash
401 403

  
......
404 406

  
405 407
The Script extract_maps.R
406 408
^^^^^^^^^^^^^^^^^^^^^^^^^
407
This script is in charge of extracting Maps for well-positioned and fuzzy nucleosomes. First of all, this script computes intra and inter-strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects results and produces well-positioned, fuzzy and UNR maps.
409
This script is in charge of extracting maps of well-positioned and sensitive nucleosomes. First, it computes intra- and inter-strain matches of nucleosome maps for each CUR. This step can be executed in parallel on many cores using the BoT library. Next, it collects the results and produces maps of well-positioned nucleosomes, sensitive nucleosomes and Unaligned Nucleosomal Regions (UNRs).
408 410

  
409
The well-positioned map for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of following columns:
411
The map of well-positioned nucleosomes for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of the following columns:
410 412

  
411 413
 - chr, the number of the chromosome 
412 414
 - lower_bound, the lower bound of the nucleosome
......
419 421
 - llr_1, for a well-positioned nucleosome, it is the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain.
420 422
 - llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the third TemplateFilter nucleosome on the chain.
421 423
 - wp_llr, for a well-positioned nucleosome, it is the LLR2 that compares consistency of the positioning over all TemplateFilter nucleosomes.
422
 - wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained with the LLR2 (`1-pchisq(2.LLR2, df=4)`)
424
 - wp_pval, for a well-positioned nucleosome, it is the p-value of the chi-square test obtained from LLR2 (`1-pchisq(2*LLR2, df=4)`)
423 425
 - dyad_shift, for a well-positioned nucleosome, it is the shift between the two extreme TemplateFilter nucleosome dyad positions. 
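The p-value columns of these maps are derived from a log-likelihood ratio with the R expression `1-pchisq(2*LLR, df)`. As an illustration, the same quantity can be computed in python with the standard library only, using the closed form of the chi-square upper tail that holds when the number of degrees of freedom is even (which is the case for df=2 and df=4). The function names below are hypothetical and this is only a sketch, not the code used by NucleoMiner2.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even df.

    For df = 2k this equals exp(-x/2) * sum_{i<k} (x/2)^i / i!,
    i.e. the value returned by R's 1 - pchisq(x, df).
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0   # current term (x/2)^i / i!, starting at i = 0
    total = 1.0  # running sum of the terms
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def llr_to_pval(llr, df):
    """p-value associated with a log-likelihood ratio: 1 - pchisq(2 * llr, df)."""
    return chi2_upper_tail_even_df(2.0 * llr, df)
```

With df=2 the p-value reduces to `exp(-LLR)`, and with df=4 to `exp(-LLR) * (1 + LLR)`.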
424 426

  
425
The fuzzy map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of following columns:
427
The map of sensitive nucleosomes for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of the following columns:
426 428

  
427 429
 - chr, the number of the chromosome 
428 430
 - lower_bound, the lower bound of the nucleosome
......
436 438
 - index_nuc_RM, the index of the RM nucleosome in the CUR
437 439
 - llr_score, the LLR3 score that estimates conservation between the positions in BY and RM
438 440
 - common_wp_pval, the p-value of the chi-square test obtained from LLR3 (`1-pchisq(2*LLR3, df=2)`)
439
 - diff, the dyads shift between the positions in the two strains
441
 - diff, the dyad shift between the positions in the two strains (in bp)
440 442

  
441 443
The common UNR map for BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:
442 444

  
......
454 456
The Script translate_common_wp.R
455 457
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
456 458

  
457
This script is used to translate common well-positioned nucleosome maps from a strain to another strain and stores it into a table. 
459
This script translates the positions of common well-positioned nucleosomes from one strain to another and stores them in a table.
458 460

  
459
For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosome translated into the BY genome coordinates. It is composed of following columns:
461
For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosomes translated into the BY genome coordinates. It is composed of the following columns:
460 462

  
461 463
 - strain_ref, the reference genome (in which positions are defined)
462 464
 - begin, the translated lower bound of the nucleosome
......
476 478
The Script split_samples.R
477 479
^^^^^^^^^^^^^^^^^^^^^^^^^^
478 480

  
479
For memory space usage reasons, we split and compress TemplateFilter input files according to their corresponding  chromosome. for example, `sample_1_TF.tab` will be split into :
481
To optimize memory usage, we split and compress TemplateFilter input files according to their corresponding chromosome. For example, `sample_1_TF.tab` will be split into:
480 482

  
481 483
  - sample_1_chr_1_splited_sample.tab.gz
482 484
  - sample_1_chr_2_splited_sample.tab.gz
......
547 549

  
548 550
The sample_id values are given in the file samples.csv
549 551

  
550
If you don't know which column to use, we recommend using wpunr.
552
If you don't know which column to use for normalization, we recommend using wpunr.
551 553

  
552
If you want the very detailed factors produced by DESeq, here are the information:
554
Here are the details of the factors produced:
553 555

  
554 556
  - unr: factor computed from data of UNR regions. These regions are defined for every pair of aligned genomes (e.g. BY_RM)
555 557
  - wp: same, but for well-positioned nucleosomes.
......
604 606
The 5 last columns concern DESeq analysis:
605 607

  
606 608
  - manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model.
607
  - pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or not the interaction term marker:strain. This is the statsitcial significance of the interaction term and therefore the statistical significance of the SNEP.
609
  - pvalsGLM, the p-value resulting from the comparison of the GLM model that includes the interaction term *marker:strain* with the GLM model that does not include it. This is the statistical significance of the interaction term and therefore the statistical significance of the SNEP.
608 610
  - snep_index, a boolean set to TRUE if the pvalsGLM value is under the threshold computed with the FDR function with a rate set to 0.0001.
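To illustrate how a boolean column such as snep_index can be derived from p-values, here is a minimal sketch of FDR thresholding. It assumes the Benjamini-Hochberg procedure, which may differ from the FDR function actually used by NucleoMiner2, and the function names are hypothetical.

```python
def bh_threshold(pvalues, rate=0.0001):
    """Largest p-value accepted by the Benjamini-Hochberg procedure, or None."""
    ranked = sorted(pvalues)
    m = len(ranked)
    threshold = None
    for k, p in enumerate(ranked, start=1):
        # Keep the largest p-value whose rank k satisfies p <= rate * k / m
        if p <= rate * k / m:
            threshold = p
    return threshold


def snep_flags(pvalues, rate=0.0001):
    """Boolean flag per p-value: True when it is at or below the BH threshold."""
    threshold = bh_threshold(pvalues, rate)
    if threshold is None:
        return [False] * len(pvalues)
    return [p <= threshold for p in pvalues]
```

In this sketch, a p-value equal to the threshold is flagged as significant; whether the comparison should be strict is a detail to check against the actual script.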
609
To execute this script, run the following command 
610 611

  
611 612
To execute this script, run the following command in your R console:
612 613

  
......
636 637
APPENDIX: Generate .c2c Files
637 638
------------------------------
638 639

  
640
A `.c2c` file is a simple table that describes how two genome
641
sequences are aligned. This file can be generated by using scripts that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics 2010) and which we provide in this release of NucleoMiner2.
639 642

  
640
The `.c2c` files is a simple table that describes how the genome sequence can be aligned. We generate it using some NucleoMiner 1.0 scripts.
641 643

  
642
To use NucleoMiner 1.0 scripts on your UNIX/LINUX computer you need first to install MUMmer which is a system for rapidly aligning entire genomes, whether in complete or draft form.
644
To use these scripts on your UNIX/LINUX computer, you first need to install MUMmer, which is designed to rapidly align entire genomes, whether in complete or draft form.
643 645

  
644
Installing the MUMmer library
645
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
646
Installing MUMmer
647
^^^^^^^^^^^^^^^^^
646 648

  
647
Get the last version of MUMmer archive on your computer (MUMmer3.23.tar.gz distributed in the directory deps of your working directory). Copy it in a dedicated directory. Install it locally into the src folder of you working directory by typing (working directory):
649
Get the latest version of the MUMmer archive on your computer (MUMmer3.23.tar.gz is provided in the deps directory of your working directory). Copy it into a dedicated directory. Install it locally into the src folder of your working directory by typing (from the working directory):
648 650

  
649
tar -xvzf gdl-1.0.tar.gz
651
tar -xvzf MUMmer3.23.tar.gz
650 652

  
651
This creates a directory called gdl-1.0. You now need to go into this directory and compile the library, by typing:
652 653

  
653 654
.. code:: bash
654 655

  
......
661 662
Installing NucleoMiner 1.0 scripts
662 663
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
663 664

  
664
Get the nucleominer-1.0.tar.gz archive on your computer (distributed in the directory deps of your working directory). Install it locally into the src folder of you working directory by typing (working directory):
665
Get the nucleominer-1.0.tar.gz archive on your computer (this archive is provided in the deps directory of your working directory). Install it locally into the src folder of your working directory by typing (from the working directory):
665 666

  
666 667

  
667 668
.. code:: bash
......
670 671
  tar xfvz ../deps/nucleominer-1.0.tar.gz 
671 672
  cd ..
672 673

  
673
This creates a directory called that contains  NucleoMiner 1.0 scripts (src/nucleominer-1.0/scripts). 
674
This creates a directory that contains the NucleoMiner 1.0 scripts (src/nucleominer-1.0/scripts).
674 675

  
675 676

  
676 677
Generate .c2c Files
