Révision e5603c3f doc/sphinx_doc/tuto.rst

b/doc/sphinx_doc/tuto.rst
1 1
Tutorial
2 2
========
3 3

  
4
This tutorial describes steps allowing to perform quantitave analysis of 
5
nucleosomal epigenome. We assume that files are organised around a given 
6
hierarchie and that all command lines are launched from project's root.
4
This tutorial describes steps allowing performing quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory.
7 5

  
8
This tutorial is divided into t=wo main parts. First one consists in the python 
9
script `wf.py` that aligns and convert Illumina reads. Second one is the R 
10
script `main.r` that extracts information (nucleosome position and indicators) 
11
from the dataset.
6
This tutorial is divided into two main parts. The first part covers the python script `wf.py` that aligns and converts short sequence reads. The second part covers the R scripts that extracts information (nucleosome position and indicators) from the dataset.
12 7

  
13 8

  
14
Python and R Common Configuration File
15
--------------------------------------
16

  
17
First of all we define in one place some configuration variables that will be launch by python and R scripts. This file is **configurator.py**. The execution of this python script dumps variables into the **nucleo_miner_config.json** that will be launch by both kind of scriopts (R and puython).
18

  
19
To do this launch at the root of your project the following command line:
20 9

  
21
.. code:: bash
22 10

  
23
  python src/current/configurator.py
24
  
11
Experimental Dataset, Working Directory and Configuration File 
12
--------------------------------------------------------------
25 13

  
26
$$$ other python script to describe:
27
- libcoverage.py
28
- wf.py
14
Working Directory Organisation
15
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
29 16

  
17
The working directory... 
30 18

  
19
$$$ TODO: Explain how and where retrieve the workdir
31 20

  
32 21

  
33 22

  
34
Dataset and Configuration Variables
35
-----------------------------------
23
Retrieving Experimental Dataset
24
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
36 25

  
37
We want to compare nucleosomes of 3 yeast strains: 
26
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.
38 27

  
39
- BY
40
- RM
41
- YJM
28
$$$ TODO explain how organise Experimental Dataset into the `data` directory of the working directory.
42 29

  
43
For each strain we perform Mnase-Seq and ChIP-Seq using the 5 following 
44
markers: 
45 30

  
46
- H3K4me1
47
- H3K4me3
48
- H3K9ac
49
- H3K14ac
50
- H4K12ac
31
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each strain we performed Mnase-Seq and ChIP-Seq using an antibody recognizing the H3K14ac epigenetic mark.
51 32

  
52
In order to simplify the design of experiment, we considere Mnase as a marker. 
53
For each couple `(strain, marker)` we perform 3 replicates. So, theoritically 
54
we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2 
55
replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a 
56
uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information.
33
The dataset is composed of 55 files organised as follows: 
57 34

  
58
.. autodata:: configurator.CSV_SAMPLE_FILE
59
    :noindex: 
60
		
61
We use a convention to link sample and Illumina fastq outputs. Illumina output 
62
files of the sample `ID` will be stored in the directory 
63
`ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be 
64
stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
65

  
66
.. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
67
    :noindex: 
68

  
69
For BY (resp. RM and YJM) we use following reference genome 
70
`saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
71
(resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and 
72
`saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
73
The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
35
  - 3 replicates for BY MNase Seq
36
  
37
    - sample 1 (5 fastq.gz files)
38
    - sample 2 (5 fastq.gz files)
39
    - sample 3 (4 fastq.gz files)
40
    
41
  - 3 replicates for RM MNase Seq
42
  
43
    - sample 4 (4 fastq.gz files)
44
    - sample 5 (4 fastq.gz files)
45
    - sample 6 (5 fastq.gz files)
46
    
47
  - 3 replicates for BY ChIP Seq H3K14ac
48
  
49
    - sample 36 (5 fastq.gz files)
50
    - sample 37 (5 fastq.gz files)
51
    - sample 53 (9 fastq.gz files)
52
    
53
  - 2 replicates for RM ChIP Seq H3K14ac
54
  
55
    - sample 38 (5 fastq.gz files)
56
    - sample 39 (4 fastq.gz files)
57
    
74 58

  
75
.. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
76
    :noindex: 
77 59

  
78
Each chromosome/contig is identify in the fasta file by an obscure identifier. 
79
For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when 
80
TemplateFilter is waiting for an integer. So, we translate it. The index 
81
`FASTA_INDEXES` stores this translation.
82 60

  
83
.. autodata:: configurator.FASTA_INDEXES
84
    :noindex: 
61
Python and R Common Configuration File
62
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
85 63

  
86
From a pragamatical point of view we discard some part of the genome (repeated 
87
sequence etc...). The list of the black listed area is explicitely detailled in 
88
`AREA_BLACK_LIST`.
64
First of all we define in one place some configuration variables that will be launched by python and R scripts. These variables are contained in file `configurator.py`. The execution of this python script dumps variables into the `nucleominer_config.json` file that will then be used by both R and python scripts.
89 65

  
90
.. autodata:: configurator.AREA_BLACK_LIST
91
    :noindex: 
66
To do this, go to the root directory of your project and run the following command:
92 67

  
93
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously 
94
compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp. 
95
`BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about 
96
.c2c files, please read section 5 of the manual of `NucleoMiner`, the old 
97
version of `NucleoMiner2` 
98
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
68
.. code:: bash
99 69

  
100
.. autodata:: configurator.C2C_FILES
101
    :noindex: 
70
  python src/current/configurator.py
71
  
102 72

  
103
`nucleominer` uses specific directory to work in, these are described in 
104
`INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
105 73

  
106
Finally, `nucleominer` use external ressources, the path to these resspources 
107
are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`, 
108
`BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`.
109 74

  
110
All paths, prefixes and indexes could be change in the 
111
`src/current/nucleominer_config.json` file.
112 75

  
113
.. autodata:: wf.json_conf_file
114
    :noindex: 
115 76

  
116 77

  
117 78
Preprocessing Illumina Fastq Reads for Each Sample
118 79
--------------------------------------------------
119 80

  
120
This preprocessing step consists in the 4 main steps embed in the `wf.py` and 
121
described bellow. As a preamble, this script computes `samples` `samples_mnase` 
122
and `strains` that will be used along the 4 steps.
81
This preprocessing step consists of 4 main steps embedded in the `wf.py` script. They are described bellow. As a preamble, this script computes `samples`, `samples_mnase` and `strains` that will be used along the 4 steps.
82

  
123 83

  
124 84
.. autodata:: wf.samples
125 85
    :noindex: 
......
134 94
Creating Bowtie Index from each Reference Genome
135 95
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
136 96

  
137
For each strain, we need to create bowtie index. Bowtie index of a strain is a 
138
tree view of the genemoe reference for this strain. It will be used by 
139
bowtie to align reads. This step is performed by the following part of the 
140
`wf.py` script:
97
For each strain, we need to create bowtie index. Bowtie index of a strain is a tree view of the genome of this strain. It will be used by bowtie to align reads. This step is performed by the following part of the `wf.py` script:
141 98

  
142 99
.. literalinclude:: ../../../snep/src/current/wf.py
143 100
   :start-after: # _STARTOF_ step_1
144 101
   :end-before: # _ENDOF_ step_1
145 102
   :language: python
146 103

  
147
The following table sum up involved file sizes and process durations concerning 
148
this step.
104
The following table summarizes the file sizes and process durations concerning this step.
149 105

  
150 106
======  ======================  ======================  ================
151 107
strain  fasta genome file size  bowtie index file size  process duration
152 108
======  ======================  ======================  ================
153 109
BY      12 Mo                          25 Mo                    11 s.
154 110
RM      12 Mo                          24 Mo                    9 s.
155
YJM     12 Mo                          25 Mo                    11 s.
156 111
======  ======================  ======================  ================
157 112

  
158 113

  
......
162 117
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
163 118

  
164 119
Next, we launch bowtie to align reads to the reference genome. It produces a 
165
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` 
166
and `bedtools` are wrapped using python `subprocess` class. This step is 
167
performed by the followinw part of the `wf.py` script:
120
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` and `bedtools` are wrapped using python `subprocess` class. This step is performed by the following part of the `wf.py` script:
168 121

  
169 122
.. literalinclude:: ../../../snep/src/current/wf.py
170 123
   :start-after: # _STARTOF_ step_2
171 124
   :end-before: # _ENDOF_ step_2
172 125
   :language: python
173 126

  
174
Convert Aligned Reads for TemplateFilter
175
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
176
TemplateFilter use particular input format for reads, so we convert `.bed` file. 
177
TemplateFilter expect reads as following: `chr coord strand #read` where:
127
Convert Aligned Reads into TemplateFilter Format
128
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
129

  
130
TemplateFilter uses particular input formats for reads, so it is necessary to convert the `.bed` files. TemplateFilter expect reads as follows: `chr`, `coord`, `strand` and `#read` where:
178 131

  
179
- chr is the number of the chromosome;
180
- coord is the coordinate of the reads;
181
- strand is `F` for forward and `R` for reverse;
182
- #reads the number of reads for this position.
132
- `chr` is the number of the chromosome;
133
- `coord` is the coordinate of the reads;
134
- `strand` is `F` for forward and `R` for reverse;
135
- `#reads` the number of reads covering this position.
183 136

  
184 137
Each entry is *tab*-separated.
185 138

  
186
**WARNING** for reverse strand bowtie returns the position of left first 
187
nucleotid when TemplateFilter is waiting for right one. So this step takes it 
188
into account and add lenght of reads (in our case 50) to reverse reads 
189
coordinate.
139
**WARNING** for reverse strands, bowtie returns the position of the first nucleotide on the left hand side, whereas TemplateFilter expects the first one on the right hand side.  This step takes this into account by adding the read length (in our case 50) to the reverse reads coordinates.
190 140

  
191
This step is performed by the followinw part of the `wf.py` script:
141
This step is performed by the following part of the `wf.py` script:
192 142

  
193 143
.. literalinclude:: ../../../snep/src/current/wf.py
194 144
   :start-after: # _STARTOF_ step_3
195 145
   :end-before: # _ENDOF_ step_3
196 146
   :language: python
197 147

  
198
The following table sum up number of reads, involved file sizes and process durations concerning 
199
the two last steps. In our case, aligment process have been multuthreaded over over 3 cores.
148
The following table summarises the number of reads, the involved file sizes and process durations concerning the two last steps. In our case, alignment process have been multithreaded over 3 cores.
200 149

  
201 150
==  ==============  =========================  ======  ================  ==================  ================  
202 151
id  Illumina reads  aligned and filtred reads  ratio   `.bed` file size  TF input file size  process duration
......
207 156
4   13765584        10381903                   75,42%  931  Mo           59  Mo              352   s.
208 157
5   15168268        11502855                   75,83%  1031 Mo           64  Mo              386   s.
209 158
6   18850820        14024905                   74,40%  1254 Mo           69  Mo              482   s.
210
7   15591124        12126623                   77,78%  1163 Mo           72  Mo              405   s.
211
8   15659905        12475664                   79,67%  1194 Mo           71  Mo              416   s.
212
9   14668641        10960565                   74,72%  1052 Mo           70  Mo              375   s.
213
10  14339179        10454451                   72,91%  1049 Mo           51  Mo              363   s.
214
11  18019895        13688774                   75,96%  1378 Mo           59  Mo              474   s.
215
12  13746796        10810022                   78,64%  1084 Mo           54  Mo              360   s.
216
13  15205065        11766016                   77,38%  990  Mo           54  Mo              381   s.
217
14  17803097        13838883                   77,73%  1154 Mo           60  Mo              452   s.
218
15  15434564        12307878                   79,74%  1032 Mo           57  Mo              394   s.
219
16  16802587        12725665                   75,74%  1221 Mo           48  Mo              438   s.
220
17  16058417        12513734                   77,93%  1192 Mo           63  Mo              422   s.
221
18  16154482        13204331                   81,74%  1277 Mo           52  Mo              430   s.
222
19  21013924        17102120                   81,38%  1646 Mo           59  Mo              555   s.
223
20  17213114        14433357                   83,85%  1389 Mo           53  Mo              459   s.
224
21  17360907        14733001                   84,86%  1203 Mo           55  Mo              450   s.
225
22  18136816        15389581                   84,85%  1257 Mo           53  Mo              469   s.
226
23  14763678        12173025                   82,45%  1140 Mo           56  Mo              393   s.
227
24  15541709        12890345                   82,94%  1057 Mo           48  Mo              398   s.
228
25  16433215        13094314                   79,68%  1241 Mo           57  Mo              433   s.
229
26  17370850        14264136                   82,12%  1347 Mo           51  Mo              466   s.
230
27  14613512        8654495                    59,22%  887  Mo           56  Mo              339   s.
231
28  15248545        11367589                   74,55%  1166 Mo           67  Mo              405   s.
232
29  14316809        10767926                   75,21%  1103 Mo           63  Mo              379   s.
233
30  15178058        12265794                   80,81%  1030 Mo           66  Mo              390   s.
234
31  14968579        11876186                   79,34%  1009 Mo           63  Mo              387   s.
235
32  16912705        13550508                   80,12%  1143 Mo           70  Mo              442   s.
236
33  16782154        12755111                   76,00%  1227 Mo           65  Mo              438   s.
237
34  16741443        13168071                   78,66%  1260 Mo           71  Mo              442   s.
238
35  13096171        10367041                   79,16%  992  Mo           62  Mo              350   s.
239 159
36  17715118        14092985                   79,55%  1404 Mo           68  Mo              483   s.
240 160
37  17288466        7402082                    42,82%  741  Mo           48  Mo              339   s.
241 161
38  16116394        13178457                   81,77%  1101 Mo           63  Mo              420   s.
242 162
39  14241106        10537228                   73,99%  880  Mo           57  Mo              348   s.
243
40  13784738        10598464                   76,89%  1005 Mo           64  Mo              358   s.
244
41  12438007        9620975                    77,35%  911  Mo           60  Mo              326   s.
245
42  13853959        11031238                   79,63%  1045 Mo           64  Mo              365   s.
246
43  12036162        6654780                    55,29%  684  Mo           46  Mo              268   s.
247
44  13873129        10251074                   73,89%  1048 Mo           61  Mo              365   s.
248
45  19817751        14904502                   75,21%  1520 Mo           72  Mo              528   s.
249
46  13368959        10818619                   80,92%  912  Mo           63  Mo              350   s.
250
47  7566467         6139001                    81,13%  520  Mo           44  Mo              201   s.
251
48  32586928        21191363                   65,03%  1816 Mo           82  Mo              766   s.
252
49  30733184        18791373                   61,14%  1801 Mo           89  Mo              721   s.
253
50  41287616        30383875                   73,59%  2911 Mo           112 Mo              1065  s.
254
51  40439965        31177914                   77,10%  2981 Mo           117 Mo              1070  s.
255 163
53  40876476        33780065                   82,64%  3316 Mo           103 Mo              1165  s.
256
55  52424414        47117107                   89,88%  3811 Mo           119 Mo              1477  s.
257 164
==  ==============  =========================  ======  ================  ==================  ================  
258 165

  
259
For some reasons (manipulation efficiency, e.g. PCR...), we remove samples 33, 45, 48 and 55.
260

  
261

  
262 166
Run TemplateFilter on Mnase Samples
263 167
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
264 168

  
265
Finally, for each sample we perfome TemplateFilter analysis. 
169
Finally, for each sample we perform TemplateFilter analysis. 
266 170

  
267 171
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is 
268
define by its center and its width. An odd width leads us to considere non 
269
interger lower and upper bound.
172
define by its center and its width. An odd width leads us to consider non-
173
integer lower and upper bound.
270 174

  
271
**WARNING** TemplateFilter is not design to deal with replicate. So we choose to 
272
keep a maximum of nucleosome and filter in a second time using the benefit of 
273
replicate. To do that we set a low correlation threshold parameter (`0.5`) and a 
274
particularly high value of overlaping (`300%`).
175
**WARNING** TemplateFilter is not designed to deal with replicates. So we recommend to keep a maximum of nucleosomes and filter the aberrant ones afterwards using the benefits of having replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high value of overlap (300%).
275 176

  
276
This step is performed by the followinw part of the `wf.py` script:
177
This step is performed by the following part of the `wf.py` script:
277 178

  
278 179
.. literalinclude:: ../../../snep/src/current/wf.py
279 180
   :start-after: # _STARTOF_ step_4
......
289 190
4    RM     88076       62 Mo          984 s.                      
290 191
5    RM     90141       64 Mo          967 s.                      
291 192
6    RM     87517       62 Mo          980 s.                      
292
7    YJM    88945       64 Mo          566 s.                      
293
8    YJM    88689       64 Mo          570 s.                      
294
9    YJM    88128       63 Mo          565 s.                    
295 193
==  ======  ==========  =============  ================
296 194

  
297 195

  
......
306 204

  
307 205

  
308 206

  
207

  
208

  
209

  
210

  
211

  
212

  
213

  
214

  
215

  
216

  
217

  
218

  
219

  
220

  
221

  
222

  
223

  
224

  
225

  
226

  
227

  
228

  
229

  
230

  
231

  
232

  
233

  
234

  
235

  
236

  
237

  
238

  
239

  
240

  
241

  
242

  
243

  
244

  
245

  
246

  
247

  
248
..
249
..
250
.. $$$ other python script to describe:
251
.. - libcoverage.py
252
.. - wf.py
253
..
254
..
255
..
256
..
257
..
258
..
259
.. In order to simplify the design of experiment, we consider Mnase as a marker.
260
.. For each couple `(strain, marker)` we perform 3 replicates. So, theoritically
261
.. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2
262
.. replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a
263
.. uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information.
264
..
265
.. .. autodata:: configurator.CSV_SAMPLE_FILE
266
..     :noindex:
267
..
268
.. We use a convention to link sample and Illumina fastq outputs. Illumina output
269
.. files of the sample `ID` will be stored in the directory
270
.. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be
271
.. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
272
..
273
.. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
274
..     :noindex:
275
..
276
.. For BY (resp. RM and YJM) we use following reference genome
277
.. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
278
.. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and
279
.. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
280
.. The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
281
..
282
.. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
283
..     :noindex:
284
..
285
.. Each chromosome/contig is identify in the fasta file by an obscure identifier.
286
.. For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when
287
.. TemplateFilter is waiting for an integer. So, we translate it. The index
288
.. `FASTA_INDEXES` stores this translation.
289
..
290
.. .. autodata:: configurator.FASTA_INDEXES
291
..     :noindex:
292
..
293
.. From a pragamatical point of view we discard some part of the genome (repeated
294
.. sequence etc...). The list of the black listed area is explicitely detailled in
295
.. `AREA_BLACK_LIST`.
296
..
297
.. .. autodata:: configurator.AREA_BLACK_LIST
298
..     :noindex:
299
..
300
.. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously
301
.. compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp.
302
.. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about
303
.. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old
304
.. version of `NucleoMiner2`
305
.. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
306
..
307
.. .. autodata:: configurator.C2C_FILES
308
..     :noindex:
309
..
310
.. `nucleominer` uses specific directory to work in, these are described in
311
.. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
312
..
313
.. Finally, `nucleominer` use external ressources, the path to these resspources
314
.. are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`,
315
.. `BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`.
316
..
317
.. All paths, prefixes and indexes could be change in the
318
.. `src/current/nucleominer_config.json` file.
319
..
320
.. .. autodata:: wf.json_conf_file
321
..     :noindex:
322
..
323

  
324

  
325

  
326

  
327

  
328

  
329

  
330

  
331

  
332

  
333

  
334

  
335

  
336

  
337

  
338

  
339

  
340

  
341

  
342

  
343

  
344

  
345

  
346

  
347

  
348

  
349

  
350

  
351

  
352

  
353

  
354

  
355

  
356

  
357

  
358

  
359

  
360

  
361

  
362

  
363

  
364

  
309 365
Inferring Nucleosome Position and Extracting Read Counts
310 366
--------------------------------------------------------
311 367

  
312 368

  
313 369

  
314
The second part of the tutorial uses `R` (http://http://www.r-project.org). It consists in a set of R scripts taht will be sourced in an R console launched at the root of your project. the R srcipts are:
370
The second part of the tutorial uses R (http://http://www.r-project.org). It consists of a set of R scripts that will be sourced in an R from a console launched at the root of your project. These scripts are:
315 371

  
316 372
  - headers.R
317 373
  - extract_maps.R
318
  - compare_common_wp.R
374
  - translate_common_wp.R
319 375
  - split_samples.R
320 376
  - count_reads.R
321 377
  - get_size_factors  
......
324 380
The Script headers.R
325 381
^^^^^^^^^^^^^^^^^^^^
326 382

  
327
The script header.R is included in each other scripts. It is in charge of: 
383
The script headers.R is included in each other scripts. It is in charge of: 
328 384

  
329
  - launching libraries used in thes scripts
385
  - launching libraries used in the scripts
330 386
  - launching configuration (design, strain, marker...)
331
  - computing and caching CURs
387
  - computing and caching CURs (caching means storing the information in the computer's memory)
388

  
389
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks. You may be using NucleoMiner2 to analyse data of a single strain, or of several strains. 
390

  
391
  - All the data corresponds to the same strain (e.g. treatment/control, or only few mutations): Then in step 1), the  regions to use are entire chromosomes. Instep 2) simply use the default translate function which is neutral.
392

  
393
  - The data come from two or more strains: In this case, edit a list of regions and customize the translate function which performs the correspondence between the different genomes. How we did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to the Appendice "Generate .c2c Files"), then use it to produce the list of regions and customise “translate”.
394

  
395

  
396

  
332 397

  
333 398
In your R console, run the following command line:
334 399

  
335 400
.. code:: bash
336 401

  
337
  R CMD BATCH src/current/header.R
402
  source("src/current/headers.R")
338 403

  
339 404

  
340 405
The Script extract_maps.R
341 406
^^^^^^^^^^^^^^^^^^^^^^^^^
342
This script is in charge of extracting Maps for well positioned and fuzzy nucleosomes. First of all, this script computed intra and inter strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects results and produces well positioned, fuzzy and UNR maps.
407
This script is in charge of extracting Maps for well-positioned and fuzzy nucleosomes. First of all, this script computes intra and inter-strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects results and produces well-positioned, fuzzy and UNR maps.
343 408

  
344
The well-positioned map for BY is collected in the result directory and is called **BY_wp.tab**. It is composed of following columns:
409
The well-positioned map for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of following columns:
345 410

  
346 411
 - chr, the number of the chromosome 
347 412
 - lower_bound, the lower bound of the nucleosome
348 413
 - upper_bound, the upper bound of the nucleosome 
349 414
 - cur_index, index of the CUR
350 415
 - index_nuc, the index of the nucleosome in the CUR
351
 - wp, 1 if it is a well positioned nucleosome, 0 else
352
 - nb_reads, the number of reads that supports this nucleosome
353
 - nb_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates if it is a well-positioned nucleosome)
354
 - llr_1, for a well-positioned nucleosome, it is the LLR1 between the first and the second TemplateFilter nucleosome.
355
 - llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the first TemplateFilter nucleosome.
356
 - wp_llr, for a well-positioned nucleosome, it is the LLR2 overall TemplateFilter nucleosomes.
357
 - wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained with the LLR2 (**1-pchisq(2.LLR2, df=4)**)
358
 - dyad_shift, for a well-positioned nucleosome, it is shift between the two extreme TemplateFilter nucleosome dyad positions. 
416
 - wp, 1 if it is a well positioned nucleosome, 0 otherwise
417
 - nb_reads, the number of reads that support this nucleosome
418
 - nb_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates in which it is a well-positioned nucleosome)
419
 - llr_1, for a well-positioned nucleosome, it is the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain.
420
 - llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the third TemplateFilter nucleosome on the chain.
421
 - wp_llr, for a well-positioned nucleosome, it is the LLR2 that compares consistency of the positioning over all TemplateFilter nucleosomes.
422
 - wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained with the LLR2 (`1-pchisq(2.LLR2, df=4)`)
423
 - dyad_shift, for a well-positioned nucleosome, it is the shift between the two extreme TemplateFilter nucleosome dyad positions. 
359 424

  
360
The fuzzy map for BY is collected in the result directory and is called **BY_fuzzy.tab**. It is composed of following columns:
425
The fuzzy map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of following columns:
361 426

  
362 427
 - chr, the number of the chromosome 
363 428
 - lower_bound, the lower bound of the nucleosome
364 429
 - upper_bound, the upper bound of the nucleosome 
365 430
 - cur_index, index of the CUR
366 431

  
367
The common well-position map for BY and RM strains is collected in the result directory and is called **BY_RM_common_wp.tab**. It is composed of following columns:
432
The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of following columns:
368 433

  
369 434
 - cur_index, the index of the CUR
370 435
 - index_nuc_BY, the index of the BY nucleosome in the CUR
371
 - index_nuc_RM,the index of the RM nucleosome in the CUR
372
 - llr_score, the LLR3 score between th eBy and RM nucleosomes
373
 - common_wp_pval,  the p-value chi square test obtained with the LLR3 (**1-pchisq(2.LLR3, df=2)**)
436
 - index_nuc_RM, the index of the RM nucleosome in the CUR
437
 - llr_score, , the LLR3 score that estimates conservation between the positions in BY and RM 
438
 - common_wp_pval,  the p-value chi square test obtained from LLR3 (`1-pchisq(2.LLR3, df=2)`)
439
 - diff, the dyads shift between the positions in the two strains
374 440

  
375
The common UNR map for BY and RM strains is collected in the result directory and is called **BY_RM_common_unr.tab**. It is composed of following columns:
441
The common UNR map for BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:
376 442

  
377 443
 - cur_index, the index of the CUR
378 444
 - index_nuc_BY, the index of the BY nucleosome in the CUR
379 445
 - index_nuc_RM,the index of the RM nucleosome in the CUR
380 446

  
381
To execute this script, run the following command line in your R console:
447
To execute this script, run the following command in your R console:
382 448

  
383 449
.. code:: bash
384 450

  
385 451
  source("src/current/extract_maps.R")
386 452

  
387 453

  
388
The Script compare_common_wp.R
389
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
390

  
391
This script is used to compare inter strain distances between common well-positioned nucleosomes. 
392

  
393
For example, it compute the file **BY_RM_common_wp_diff.tab** that contains dyad shifts between two well-positioned nucleosomes. It is composed of following columns:
394
 - cur_index, the index of the CUR
395
 - index_nuc_BY, the index of the BY nucleosome in the CUR
396
 - index_nuc_RM,the index of the RM nucleosome in the CUR
397
 - llr_score, the LLR3 score between th eBy and RM nucleosomes
398
 - common_wp_pval,  the p-value chi square test obtained with the LLR3 (**1-pchisq(2.LLR3, df=2)**)
399
 - diff, the dyad shifts between two well-positioned nucleosomes
454
The Script translate_common_wp.R
455
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
400 456

  
401
It also translates well-positioned nucleosome maps from a strain to an other strain and stores it into a table. 
457
This script is used to translate common well-positioned nucleosome maps from a strain to another strain and stores it into a table. 
402 458

  
403
For example, the file **results/2014-04/RM_wp_tr_2_BY.tab** contains RM well-positioned nucleosome translated into the BY genome referential. It is composed of following columns:
459
For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosome translated into the BY genome coordinates. It is composed of following columns:
404 460

  
405 461
 - strain_ref, the reference genome (in which positioned are defined)
406 462
 - begin, the translated lower bound of the nucleosome
407 463
 - end, the translated upper bound of the nucleosome
408
 - chr, the number of chromosome for the reference genome (in which positioned are defined)
464
 - chr, the number of chromosomes for the reference genome (in which positioned are defined)
409 465
 - length, the length of the nucleosome (could be negative)
410 466
 - cur_index, the index of the CUR
411 467
 - index_nuc, the index of the nucleosome in the CUR
412 468

  
469
To execute this script, run the following command in your R console:
413 470

  
471
.. code:: bash
414 472

  
415
To execute this script, run the following command line in your R console:
473
  source("src/current/translate_common_wp.R")
416 474

  
417
.. code:: bash
418 475

  
419
  R CMD BATCH src/current/compare_common_wp.R
476
The Script split_samples.R
477
^^^^^^^^^^^^^^^^^^^^^^^^^^
420 478

  
479
For memory space usage reasons, we split and compress TemplateFilter input files according to their corresponding  chromosome. for example, `sample_1_TF.tab` will be split into :
421 480

  
422
Split and Compress Samples According CURs
423
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
481
  - sample_1_chr_1_splited_sample.tab.gz
482
  - sample_1_chr_2_splited_sample.tab.gz
483
  - ...
484
  - sample_1_chr_17_splited_sample.tab.gz
485
  
486

  
487
To execute this script, run the following command in your R console:
424 488

  
425 489
.. code:: bash
426 490

  
427
  R CMD BATCH src/current/split_samples.R
491
  source("src/current/split_samples.R")
428 492

  
429 493

  
430
Count Reads for Each Nucleosome
431
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
494
The Script count_reads.R
495
^^^^^^^^^^^^^^^^^^^^^^^^
496

  
497
To associate a number of observations (read) to each nucleosome we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab` `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab`  
498
for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned nucleosomes and Mnase UNRs respectively. 
499

  
500
For example, the file `BY_RM_H3K14ac_unr_and_nbreads.tab` contains counted reads for well-positioned nucleosomes with the experimental condition ChIP H3K14ac. It is composed of the following columns:
501

  
502
  - chr_BY, the number of the chromosome for BY
503
  - lower_bound_BY, the lower bound of the nucleosome for BY
504
  - upper_bound_BY, the upper bound of the nucleosome  for BY
505
  - index_nuc_BY, the index of the BY nucleosome in the CUR for BY
506
  - chr_RM, the number of the chromosome for RM
507
  - lower_bound_RM, the lower bound of the nucleosome for RM
508
  - upper_bound_RM, the upper bound of the nucleosome  for RM
509
  - index_nuc_RM,the index of the RM nucleosome in the CUR for RM
510
  - cur_index, index of the CUR
511
  - BY_H3K14ac_36, the number of reads for the current nucleosome for the sample 36
512
  - BY_H3K14ac_37, #reads for sample 37
513
  - BY_H3K14ac_53, #reads for sample 53
514
  - RM_H3K14ac_38, #reads for sample 38
515
  - RM_H3K14ac_39, #reads for sample 39
516

  
517
To execute this script, run the following command in your R console:
432 518

  
433 519
.. code:: bash
434 520

  
435
  R CMD BATCH src/current/count_reads.R
521
  source("src/current/count_reads.R")
522

  
523

  
524
The Script get_size_factors.R
525
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
526

  
527

  
528
This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. It corresponds to normalisation of read counts from sample to sample, as determined by DESeq. When a sample has n reads for a nucleosome or a UNR,
529
the normalised count is n/f where f is the factor contained in this file.
530
The script dumps computed size factors into the file `size_factors.tab`. This file has the form:
531

  
532
========= ======= ======= =======
533
sample_id      wp     unr   wpunr
534
========= ======= ======= =======
535
        1 0.87396 0.88097 0.87584
536
        2 1.07890 1.07440 1.07760
537
        3 1.06400 1.05890 1.06250
538
        4 0.85782 0.87948 0.86305
539
        5 0.97577 0.96590 0.97307
540
        6 1.19630 1.18120 1.19190
541
       36 0.93318 0.92762 0.93166
542
       37 0.48315 0.48453 0.48350
543
       38 1.11240 1.11210 1.11230
544
       39 0.89897 0.89917 0.89903
545
       53 2.22650 2.22700 2.22660
546
========= ======= ======= =======
547

  
548
sample_id are given in file samples.csv
436 549

  
550
If you don't know which column to use, we recommend using wpunr.
437 551

  
438
Get Size Factors Using DESeq
439
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
552
If you want the very detailed factors produced by DESeq, here are the information:
553

  
554
  - unr: factor computed from data of UNR regions. These regions are defined for every pairs of aligned genomes (e.g. BY_RM)
555
  - wp: same, but for well-positioned nucleosomes.
556
  - wpunr: both types of regions.
557

  
558
To execute this script, run the following command in your R console:
440 559

  
441 560
.. code:: bash
442 561

  
443
  R CMD BATCH src/current/get_size_factors.R
562
  source("src/current/get_size_factors.R")
444 563

  
445 564

  
446
Performing DESeq Analysis
565
The Script launch_deseq.R
447 566
^^^^^^^^^^^^^^^^^^^^^^^^^
448 567

  
568
Finally, the script `launch_deseq.R` perform statistical analysis on each nucleosome using `DESeq`. It produces files:
569
 
570
  - results/current/BY_RM_H3K14ac_wp_snep.tab
571
  - results/current/BY_RM_H3K14ac_unr_snep.tab
572
  - results/current/BY_RM_H3K14ac_wpunr_snep.tab
573
  - results/current/BY_RM_H3K14ac_wp_mnase.tab
574
  - results/current/BY_RM_H3K14ac_unr_mnase.tab
575
  - results/current/BY_RM_H3K14ac_wpunr_mnase.tab
576

  
577
These files are organised with the following columns (see file `BY_RM_H3K14ac_wp_snep.tab` for an example):
578

  
579
  - chr_BY, the number of the chromosome for BY
580
  - lower_bound_BY, the lower bound of the nucleosome for BY
581
  - upper_bound_BY, the upper bound of the nucleosome  for BY
582
  - index_nuc_BY, the index of the BY nucleosome in the CUR for BY
583
  - chr_RM, the number of the chromosome for RM
584
  - lower_bound_RM, the lower bound of the nucleosome for RM
585
  - upper_bound_RM, the upper bound of the nucleosome  for RM
586
  - index_nuc_RM,the index of the RM nucleosome in the CUR for RM
587
  - cur_index, index of the CUR
588
  - form 
589
  - BY_Mnase_Seq_1, the number of reads for the current nucleosome for the sample 1  
590

  
591
Next columns concern indicators for each sample:
592

  
593
  - BY_Mnase_Seq_2, #reads for sample 2
594
  - BY_Mnase_Seq_3, #reads for sample 3  
595
  - RM_Mnase_Seq_4, #reads for sample 4  
596
  - RM_Mnase_Seq_5, #reads for sample 5
597
  - RM_Mnase_Seq_6, #reads for sample 6
598
  - BY_H3K14ac_36, #reads for sample 36
599
  - BY_H3K14ac_37, #reads for sample 37
600
  - BY_H3K14ac_53, #reads for sample 53
601
  - RM_H3K14ac_38, #reads for sample 38
602
  - RM_H3K14ac_39, #reads for sample 39
603
  
604
The 5 last columns concern DESeq analysis:
605

  
606
  - manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model.
607
  - pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or not the interaction term marker:strain. This is the statsitcial significance of the interaction term and therefore the statistical significance of the SNEP.
608
  - snep_index, a boolean set to TRUE if the pvalueGLM value is under the threshold computed with FDR function with a rate set to 0.0001.
609
To execute this script, run the following command 
610

  
611
To execute this script, run the following command in your R console:
612

  
449 613
.. code:: bash
450 614

  
451
  R CMD BATCH src/current/launch_deseq.R
615
  source("src/current/launch_deseq.R")
452 616

  
453 617

  
454
Results
455
-------
618
Results: Number of SNEPs
619
------------------------
456 620

  
457
Output Files Organisation
458
^^^^^^^^^^^^^^^^^^^^^^^^^
459
Previous steps produce following 45 files. 
460
Each filename is under the form 
621
Here are the number of computed SNEPs for each forms.
461 622

  
462
.. code:: bash
623
===== ======= ===== =======
624
 form strains #nucs H3K14ac
625
===== ======= ===== =======
626
   wp   BY-RM 30464    3549
627
  unr   BY-RM  9497    1559
628
wpunr   BY-RM 39961    5240
629
===== ======= ===== =======
630
  
463 631

  
464
  results/current/[combi]_[marker]_[form]_snep.tab 
465 632

  
466
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination, marker is 
467
in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each post translational 
468
histone modification and form is in {wp, fuzzy, wpfuzzy} considering well 
469
positioned nucleosomes, fuzzy nucleosomes or both for SNEP computation.
470 633

  
471 634

  
472 635

  
473
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM lower_bound_RM upper_bound_RM index_nuc_RM roi_index form BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4 RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 
474
BY_H3K14ac_37 BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM 
636
APPENDICE: Generate .c2c Files
637
------------------------------
475 638

  
476
For each file, there is 1 line per nucleosome and each line is composed of many columns divided into 3 main topics:
477
  - nuc information
478
  - number opf reads for each sample
479
  - DESeq analysis results.
639
The `.c2c` files is a simple table that describes how the genome sequence can be aligned. We generate it using NucleoMiner 1.0.
480 640

  
481
For exemple for the file *BY_RM_H3K14ac_wp_snep.tab* informations are: 
482
  - chr_BY, the BY chr involved
483
  - lower_bound_BY, the lower bound of the BY nuc
484
  - upper_bound_BY, the upper_bound of the BY nuc
485
  - index_nuc_BY, the index of the nuc in the entire list of BY nucs
486
  - chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM 
487
	are the same information for the RM strain
488
  - roi_index, the index of the region of interrest involved.
489
  
490
Next cols concern indicators for each sample. They are labeled [strain]_[marker]_[sample_id] and each value represents the number of reads for the current nuc for the sample *sample_id*. 
641
To install NucleoMiner 1.0 on your UNIX/LINUX computer you need first to install the Genetic Data analysis Library (GDL), which is a dynamic library of useful C functions derived from the GNU Scientific Library.
491 642

  
492
The 5 final columns concern DESeq analysis:
493
  - manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect.  
494
  - pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or the interaction term *marker:strain* 
495
  - snep_index, a boolean set to TRUE if the *pvalueGLM* value is under the threshold computed with FDR function with a rate set to 0.01%. 
643
Installing the GDL library
644
^^^^^^^^^^^^^^^^^^^^^^^^^^
496 645

  
497
It also produces the file that explicts size factor for each involved sample in differents strain combination and nucleosomal region type:
646
Get the gdl-1.0.tar.gz archive on your computer (in the directory deps of your working directory). Copy it in a dedicated directory. Go into this directory using the cd command, and then unfold the archive by typing:
498 647

  
499
TODO: include this file... /home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt
648
tar -xvzf gdl-1.0.tar.gz
500 649

  
650
This creates a directory called gdl-1.0. You now need to go into this directory and compile the library, by typing:
501 651

  
502 652
.. code:: bash
503 653

  
504
  results/current/size_factors.tab
654
  cd gdl-1.0
655
  ./configure
656
  make
505 657

  
658
Now you need to install the library on your system. This needs root priviledges:
506 659

  
660
.. code:: bash
661

  
662
  sudo make install
663

  
664
Installing NucleoMiner
665
^^^^^^^^^^^^^^^^^^^^^^
666

  
667
Get the nucleominer-1.0.tar.gz archive on your computer. Copy it in a dedicated directory. Go into this directory using the cd command, and then unfold the archive by typing:
668

  
669
tar -xvzf nucleominer-1.0.tar.gz
670
This creates a directory called nucleominer-1.0. You now need to go into this directory and compile the library, by typing:
507 671

  
672
.. code:: bash
508 673

  
509
Number of SNEPs
510
^^^^^^^^^^^^^^^
674
  cd nucleominer-1.0
675
  ./configure
676
  make
511 677

  
512
Here are the number of computed for each forms.
678
You can then use the binaries dircetly from this folder (best then is to add the path to this folder in your PATH environment variable). If you want to install nucleominer at the system's level (useful if mutiple users will need it) then type, with root priviledges:
513 679

  
514 680
.. code:: bash
515 681

  
516
  [1] "wp"
517
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
518
  BY-RM  30234     520     798     83    3566      26
519
  BY-YJM 31298     303     619    102     103     128
520
  RM-YJM 29863     129     340     46    3177      18
521
  [1] "fuzzy"
522
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
523
  BY-RM  10748     294     308    101    1681      42
524
  BY-YJM 10669     122     176    124      93      87
525
  RM-YJM 11478      54     112     41    1389      20
526
  [1] "wpfuzzy"
527
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
528
  BY-RM  40982     770    1136    183    5404      73
529
  BY-YJM 41967     439     804    214     198     199
530
  RM-YJM 41341     184     468     87    4687      37
531

  
532

  
533
TODO: 
534
  - Print/study intra/inter strain LODs.
535
  - Check the normality of sample using Shapiro–Wilk (Hypothesis for computing LODs)
536
  	
682
  sudo make install
683

  
684
Generate .c2c Files
685
^^^^^^^^^^^^^^^^^^^
686

  
687
To generate .c2c files you need to type the following command in a terminal:
688

  
689
.. code:: bash
537 690

  
691
  mkdir dir_4_c2c
692
  NMgxcomp Data/BY_S288c/Sequence/Genome.fasta \
693
           Data/RM_11-1a/Sequence/Genome.fasta \
694
           dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log
695
            
696
After execution, the directory dir_4_c2c will hold the .c2c files.

Formats disponibles : Unified diff