Statistiques
| Branche: | Révision :

root / doc / sphinx_doc / tuto.rst @ 1d833b97

Historique | Voir | Annoter | Télécharger (18,39 ko)

1
Tutorial
2
========
3

    
4
This tutorial describes steps allowing to perform quantitave analysis of 
5
nucleosomal epigenome. We assume that files are organised around a given 
6
hierarchie and that all command lines are launched from project's root.
7

    
8
This tutorial is divided into t=wo main parts. First one consists in the python 
9
script `wf.py` that aligns and convert Illumina reads. Second one is the R 
10
script `main.r` that extracts information (nucleosome position and indicators) 
11
from the dataset.
12

    
13

    
14
Dataset and Configuration File
15
------------------------------
16

    
17
We want to compare nucleosomes of 3 yeast strains: 
18

    
19
- BY
20
- RM
21
- YJM
22

    
23
For each strain we perform Mnase-Seq and ChIP-Seq using the 5 following 
24
markers: 
25

    
26
- H3K4me1
27
- H3K4me3
28
- H3K9ac
29
- H3K14ac
30
- H4K12ac
31

    
32
In order to simplify the design of exeriment, we considere Mnase as a marker. 
33
For each couple `(strain, marker)` we perform 3 replicates. So, theoritically 
34
we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2 
35
replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a 
36
uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information.
37

    
38
.. autodata:: configurator.CSV_SAMPLE_FILE
39
    :noindex: 
40
		
41
We use a convention to link sample and Illumina fastq outputs. Illumina output 
42
files of the sample `ID` will be stored in the directory 
43
`ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be 
44
stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
45

    
46
.. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
47
    :noindex: 
48

    
49
For BY (resp. RM and YJM) we use following reference genome 
50
`saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
51
(resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and 
52
`saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
53
The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
54

    
55
.. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
56
    :noindex: 
57

    
58
Each chromosome/contig is identify in the fasta file by an obscure identifier. 
59
For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when 
60
TemplateFilter is waiting for an integer. So, we translate it. The index 
61
`FASTA_INDEXES` stores this translation.
62

    
63
.. autodata:: configurator.FASTA_INDEXES
64
    :noindex: 
65

    
66
From a pragamatical point of view we discard some part of the genome (repeated 
67
sequence etc...). The list of the black listed area is explicitely detailled in 
68
`AREA_BLACK_LIST`.
69

    
70
.. autodata:: configurator.AREA_BLACK_LIST
71
    :noindex: 
72

    
73
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously 
74
compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp. 
75
`BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about 
76
.c2c files, please read section 5 of the manual of `NucleoMiner`, the old 
77
version of `NucleoMiner2` 
78
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
79

    
80
.. autodata:: configurator.C2C_FILES
81
    :noindex: 
82

    
83
`nucleominer` uses specific directory to work in, these are described in 
84
`INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
85

    
86
Finally, `nucleominer` use external ressources, the path to these resspources 
87
are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`, 
88
`BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`.
89

    
90
All paths, prefixes and indexes could be change in the 
91
`src/nucleo_miner/nucleo_miner_config.json` file.
92

    
93
.. autodata:: wf.json_conf_file
94
    :noindex: 
95

    
96

    
97
Preprocessing Illumina Fastq Reads for Each Sample
98
--------------------------------------------------
99

    
100
This preprocessing step consists in the 4 main steps embed in the `wf.py` and 
101
described bellow. As a preamble, this script computes `samples` `samples_mnase` 
102
and `strains` that will be used along the 4 steps.
103

    
104
.. autodata:: wf.samples
105
    :noindex: 
106

    
107
.. autodata:: wf.samples_mnase
108
    :noindex: 
109

    
110
.. autodata:: wf.strains
111
    :noindex: 
112

    
113

    
114
Creating Bowtie Index from each Reference Genome
115
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
116

    
117
For each strain, we need to create bowtie index. Bowtie index of a strain is a 
118
tree view of the genemoe reference for this strain. It will be used by 
119
bowtie to align reads. This step is performed by the following part of the 
120
`wf.py` script:
121

    
122
.. literalinclude:: ../../../snep/src/nucleo_miner/wf.py
123
   :start-after: # _STARTOF_ step_1
124
   :end-before: # _ENDOF_ step_1
125
   :language: python
126

    
127
The following table sum up involved file sizes and process durations concerning 
128
this step.
129

    
130
======  ======================  ======================  ================
131
strain  fasta genome file size  bowtie index file size  process duration
132
======  ======================  ======================  ================
133
BY      12 Mo                          25 Mo                    11 s.
134
RM      12 Mo                          24 Mo                    9 s.
135
YJM     12 Mo                          25 Mo                    11 s.
136
======  ======================  ======================  ================
137

    
138

    
139

    
140

    
141
Aligning Reads to Reference Genome 
142
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
143

    
144
Next, we launch bowtie to align reads to the reference genome. It produces a 
145
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` 
146
and `bedtools` are wrapped using python `subprocess` class. This step is 
147
performed by the followinw part of the `wf.py` script:
148

    
149
.. literalinclude:: ../../../snep/src/nucleo_miner/wf.py
150
   :start-after: # _STARTOF_ step_2
151
   :end-before: # _ENDOF_ step_2
152
   :language: python
153

    
154
Convert Aligned Reads for TemplateFilter
155
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
156
TemplateFilter use particular input format for reads, so we convert `.bed` file. 
157
TemplateFilter expect reads as following: `chr coord strand #read` where:
158

    
159
- chr is the number of the chromosome;
160
- coord is the coordinate of the reads;
161
- strand is `F` for forward and `R` for reverse;
162
- #reads the number of reads for this position.
163

    
164
Each entry is *tab*-separated.
165

    
166
**WARNING** for reverse strand bowtie returns the position of left first 
167
nucleotid when TemplateFilter is waiting for right one. So this step takes it 
168
into account and add lenght of reads (in our case 50) to reverse reads 
169
coordinate.
170

    
171
This step is performed by the followinw part of the `wf.py` script:
172

    
173
.. literalinclude:: ../../../snep/src/nucleo_miner/wf.py
174
   :start-after: # _STARTOF_ step_3
175
   :end-before: # _ENDOF_ step_3
176
   :language: python
177

    
178
The following table sum up number of reads, involved file sizes and process durations concerning 
179
the two last steps. In our case, aligment process have been multuthreaded over over 3 cores.
180

    
181
==  ==============  =========================  ======  ================  ==================  ================  
182
id  Illumina reads  aligned and filtred reads  ratio   `.bed` file size  TF input file size  process duration
183
==  ==============  =========================  ======  ================  ==================  ================
184
1   16436138        10199695                   62,06%  1064 Mo           60  Mo              383   s.
185
2   16911132        12512727                   73,99%  1298 Mo           64  Mo              437   s.
186
3   15946902        12340426                   77,38%  1280 Mo           65  Mo              423   s.
187
4   13765584        10381903                   75,42%  931  Mo           59  Mo              352   s.
188
5   15168268        11502855                   75,83%  1031 Mo           64  Mo              386   s.
189
6   18850820        14024905                   74,40%  1254 Mo           69  Mo              482   s.
190
7   15591124        12126623                   77,78%  1163 Mo           72  Mo              405   s.
191
8   15659905        12475664                   79,67%  1194 Mo           71  Mo              416   s.
192
9   14668641        10960565                   74,72%  1052 Mo           70  Mo              375   s.
193
10  14339179        10454451                   72,91%  1049 Mo           51  Mo              363   s.
194
11  18019895        13688774                   75,96%  1378 Mo           59  Mo              474   s.
195
12  13746796        10810022                   78,64%  1084 Mo           54  Mo              360   s.
196
13  15205065        11766016                   77,38%  990  Mo           54  Mo              381   s.
197
14  17803097        13838883                   77,73%  1154 Mo           60  Mo              452   s.
198
15  15434564        12307878                   79,74%  1032 Mo           57  Mo              394   s.
199
16  16802587        12725665                   75,74%  1221 Mo           48  Mo              438   s.
200
17  16058417        12513734                   77,93%  1192 Mo           63  Mo              422   s.
201
18  16154482        13204331                   81,74%  1277 Mo           52  Mo              430   s.
202
19  21013924        17102120                   81,38%  1646 Mo           59  Mo              555   s.
203
20  17213114        14433357                   83,85%  1389 Mo           53  Mo              459   s.
204
21  17360907        14733001                   84,86%  1203 Mo           55  Mo              450   s.
205
22  18136816        15389581                   84,85%  1257 Mo           53  Mo              469   s.
206
23  14763678        12173025                   82,45%  1140 Mo           56  Mo              393   s.
207
24  15541709        12890345                   82,94%  1057 Mo           48  Mo              398   s.
208
25  16433215        13094314                   79,68%  1241 Mo           57  Mo              433   s.
209
26  17370850        14264136                   82,12%  1347 Mo           51  Mo              466   s.
210
27  14613512        8654495                    59,22%  887  Mo           56  Mo              339   s.
211
28  15248545        11367589                   74,55%  1166 Mo           67  Mo              405   s.
212
29  14316809        10767926                   75,21%  1103 Mo           63  Mo              379   s.
213
30  15178058        12265794                   80,81%  1030 Mo           66  Mo              390   s.
214
31  14968579        11876186                   79,34%  1009 Mo           63  Mo              387   s.
215
32  16912705        13550508                   80,12%  1143 Mo           70  Mo              442   s.
216
33  16782154        12755111                   76,00%  1227 Mo           65  Mo              438   s.
217
34  16741443        13168071                   78,66%  1260 Mo           71  Mo              442   s.
218
35  13096171        10367041                   79,16%  992  Mo           62  Mo              350   s.
219
36  17715118        14092985                   79,55%  1404 Mo           68  Mo              483   s.
220
37  17288466        7402082                    42,82%  741  Mo           48  Mo              339   s.
221
38  16116394        13178457                   81,77%  1101 Mo           63  Mo              420   s.
222
39  14241106        10537228                   73,99%  880  Mo           57  Mo              348   s.
223
40  13784738        10598464                   76,89%  1005 Mo           64  Mo              358   s.
224
41  12438007        9620975                    77,35%  911  Mo           60  Mo              326   s.
225
42  13853959        11031238                   79,63%  1045 Mo           64  Mo              365   s.
226
43  12036162        6654780                    55,29%  684  Mo           46  Mo              268   s.
227
44  13873129        10251074                   73,89%  1048 Mo           61  Mo              365   s.
228
45  19817751        14904502                   75,21%  1520 Mo           72  Mo              528   s.
229
46  13368959        10818619                   80,92%  912  Mo           63  Mo              350   s.
230
47  7566467         6139001                    81,13%  520  Mo           44  Mo              201   s.
231
48  32586928        21191363                   65,03%  1816 Mo           82  Mo              766   s.
232
49  30733184        18791373                   61,14%  1801 Mo           89  Mo              721   s.
233
50  41287616        30383875                   73,59%  2911 Mo           112 Mo              1065  s.
234
51  40439965        31177914                   77,10%  2981 Mo           117 Mo              1070  s.
235
53  40876476        33780065                   82,64%  3316 Mo           103 Mo              1165  s.
236
55  52424414        47117107                   89,88%  3811 Mo           119 Mo              1477  s.
237
==  ==============  =========================  ======  ================  ==================  ================  
238

    
239
For some reasons (manipulation efficency, e.g. PCR...), we remove samples 33, 45, 48 and 55.
240

    
241

    
242
Run TemplateFilter on Mnase Samples
243
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
244

    
245
Finally, for each sample we perfome TemplateFilter analysis. 
246

    
247
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is 
248
define by its center and its width. An odd width leads us to considere non 
249
interger lower and upper bound.
250

    
251
**WARNING** TemplateFilter is not design to deal with replicate. So we choose to 
252
keep a maximum of nucleosome and filter in a second time using the benefit of 
253
replicate. To do that we set a low correlation threshold parameter (`0.5`) and a 
254
particularly high value of overlaping (`300%`).
255

    
256
This step is performed by the followinw part of the `wf.py` script:
257

    
258
.. literalinclude:: ../../../snep/src/nucleo_miner/wf.py
259
   :start-after: # _STARTOF_ step_4
260
   :end-before: # _ENDOF_ step_4
261
   :language: python
262

    
263
==  ======  ==========  =============  ================
264
id  strain  found nucs  nuc file size  process duration
265
==  ======  ==========  =============  ================
266
1    BY     96214       68 Mo          1022 s.                     
267
2    BY     91694       65 Mo          1038 s.                      
268
3    BY     91205       65 Mo          1036 s.                       
269
4    RM     88076       62 Mo          984 s.                      
270
5    RM     90141       64 Mo          967 s.                      
271
6    RM     87517       62 Mo          980 s.                      
272
7    YJM    88945       64 Mo          566 s.                      
273
8    YJM    88689       64 Mo          570 s.                      
274
9    YJM    88128       63 Mo          565 s.                    
275
==  ======  ==========  =============  ================
276

    
277

    
278

    
279

    
280

    
281

    
282

    
283

    
284

    
285

    
286

    
287

    
288

    
289
Inferring Nucleosome Position and Extracting Read Counts
290
--------------------------------------------------------
291

    
292

    
293

    
294
This preprocessing step consists in the 4 main steps embed in the `wf.py` and 
295
described bellow. As a preamble, this script computes `samples` `samples_mnase` 
296
and `strains` that will be used along the 4 steps.
297

    
298

    
299
The second part of the tutoriel use `R` (http://http://www.r-project.org). It 
300
consists in the 3 main steps corresponding to 4 R scripts:
301

    
302
  - compute_rois.R
303
  - extract_maps.R
304
  - count_reads.R
305
  - get_size_factors  
306
  - launch_deseq.R
307

    
308
Computing Common Genome Region Between Strains
309
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
310

    
311
.. code:: bash
312

    
313
  R CMD BATCH src/nucleo_miner/compute_rois.R
314

    
315

    
316
Extracting Maps for Well Positionned and Fuzzy Nucleosomes
317
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
318

    
319
.. code:: bash
320

    
321
  R CMD BATCH src/nucleo_miner/extract_maps.R
322

    
323

    
324
Count Reads for Each Nucleosome
325
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
326

    
327
.. code:: bash
328

    
329
  R CMD BATCH src/nucleo_miner/count_reads.R
330

    
331

    
332
Get Size Factors Using DESeq
333
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
334

    
335
.. code:: bash
336

    
337
  R CMD BATCH src/nucleo_miner/get_size_factors.R
338

    
339

    
340
Performing DESeq Analysis
341
^^^^^^^^^^^^^^^^^^^^^^^^^
342

    
343
.. code:: bash
344

    
345
  R CMD BATCH src/nucleo_miner/launch_deseq.R
346

    
347

    
348
Results
349
-------
350

    
351
Output Files Organisation
352
^^^^^^^^^^^^^^^^^^^^^^^^^
353
Previous steps produce following 45 files. 
354
Each filename is under the form 
355

    
356
.. code:: bash
357

    
358
  results/nucleo_miner/[combi]_[marker]_[form]_snep.tab 
359

    
360
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination, marker is 
361
in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each post translational 
362
histone modification and form is in {wp, fuzzy, wpfuzzy} considering well 
363
positionned nucleosomes, fuzzy nucleosomes or both for SNEP computation.
364

    
365

    
366

    
367
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM lower_bound_RM upper_bound_RM index_nuc_RM roi_index form BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4 RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 
368
BY_H3K14ac_37 BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM 
369

    
370
For each file, there is 1 line per nucleosome and each line is composed of many columns divided into 3 main topics:
371
  - nuc information
372
  - number opf reads for each sample
373
  - DESeq analysis results.
374

    
375
For exemple for the file *BY_RM_H3K14ac_wp_snep.tab* informations are: 
376
  - chr_BY, the BY chr involved
377
  - lower_bound_BY, the lower bound of the BY nuc
378
  - upper_bound_BY, the upper_bound of the BY nuc
379
  - index_nuc_BY, the index of the nuc in the entire list of BY nucs
380
  - chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM 
381
	are the same information for the RM strain
382
  - roi_index, the index of the region of interrest involved.
383
  
384
Next cols concern indicators for each sample. They are labeled [strain]_[marker]_[sample_id] and each value represents the number of reads for the current nuc for the sample *sample_id*. 
385

    
386
The 5 final columns concern DESeq analysis:
387
  - manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect.  
388
  - pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or the interaction term *marker:strain* 
389
  - snep_index, a boolean set to TRUE if the *pvalueGLM* value is under the threshold computed with FDR function with a rate set to 0.01%. 
390

    
391
It also produces the file that explicts size factor for each involved sample in differents strain combination and nucleosomal region type:
392

    
393
TODO: include this file... /home/filleton/analyses/snepcatalog/data/2013-10-09/nucleo_miner/README.txt
394

    
395

    
396
.. code:: bash
397

    
398
  results/nucleo_miner/size_factors.tab
399

    
400

    
401

    
402

    
403
Number of SNEPs
404
^^^^^^^^^^^^^^^
405

    
406
Here are the number of computed for each forms.
407

    
408
.. code:: bash
409

    
410
  [1] "wp"
411
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
412
  BY-RM  30234     520     798     83    3566      26
413
  BY-YJM 31298     303     619    102     103     128
414
  RM-YJM 29863     129     340     46    3177      18
415
  [1] "fuzzy"
416
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
417
  BY-RM  10748     294     308    101    1681      42
418
  BY-YJM 10669     122     176    124      93      87
419
  RM-YJM 11478      54     112     41    1389      20
420
  [1] "wpfuzzy"
421
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
422
  BY-RM  40982     770    1136    183    5404      73
423
  BY-YJM 41967     439     804    214     198     199
424
  RM-YJM 41341     184     468     87    4687      37
425

    
426

    
427
TODO: 
428
  - Print/study intra/inter strain LODs.
429
  - Check the normality of sample using Shapiro–Wilk (Hypothesis for computing LODs)
430
  	
431