root / doc / sphinx_doc / tuto.rst @ a0b91fee
Historique | Voir | Annoter | Télécharger (18,32 ko)
1 |
Tutorial |
---|---|
2 |
======== |
3 |
|
4 |
This tutorial describes steps allowing to perform quantitave analysis of |
5 |
nucleosomal epigenome. We assume that files are organised around a given |
6 |
hierarchie and that all command lines are launched from project's root. |
7 |
|
8 |
This tutorial is divided into t=wo main parts. First one consists in the python |
9 |
script `wf.py` that aligns and convert Illumina reads. Second one is the R |
10 |
script `main.r` that extracts information (nucleosome position and indicators) |
11 |
from the dataset. |
12 |
|
13 |
|
14 |
Dataset and Configuration File |
15 |
------------------------------ |
16 |
|
17 |
We want to compare nucleosomes of 3 yeast strains: |
18 |
|
19 |
- BY |
20 |
- RM |
21 |
- YJM |
22 |
|
23 |
For each strain we perform Mnase-Seq and ChIP-Seq using the 5 following |
24 |
markers: |
25 |
|
26 |
- H3K4me1 |
27 |
- H3K4me3 |
28 |
- H3K9ac |
29 |
- H3K14ac |
30 |
- H4K12ac |
31 |
|
32 |
In order to simplify the design of exeriment, we considere Mnase as a marker. |
33 |
For each couple `(strain, marker)` we perform 3 replicates. So, theoritically |
34 |
we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2 |
35 |
replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a |
36 |
uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information. |
37 |
|
38 |
.. autodata:: configurator.CSV_SAMPLE_FILE |
39 |
:noindex: |
40 |
|
41 |
We use a convention to link sample and Illumina fastq outputs. Illumina output |
42 |
files of the sample `ID` will be stored in the directory |
43 |
`ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be |
44 |
stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`. |
45 |
|
46 |
.. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX |
47 |
:noindex: |
48 |
|
49 |
For BY (resp. RM and YJM) we use following reference genome |
50 |
`saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta` |
51 |
(resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and |
52 |
`saccharomyces_cerevisiae_YJM_789_screencontig.fasta`). |
53 |
The index `FASTA_REFERENCE_GENOME_FILES` stores this information. |
54 |
|
55 |
.. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES |
56 |
:noindex: |
57 |
|
58 |
Each chromosome/contig is identify in the fasta file by an obscure identifier. |
59 |
For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when |
60 |
TemplateFilter is waiting for an integer. So, we translate it. The index |
61 |
`FASTA_INDEXES` stores this translation. |
62 |
|
63 |
.. autodata:: configurator.FASTA_INDEXES |
64 |
:noindex: |
65 |
|
66 |
From a pragamatical point of view we discard some part of the genome (repeated |
67 |
sequence etc...). The list of the black listed area is explicitely detailled in |
68 |
`AREA_BLACK_LIST`. |
69 |
|
70 |
.. autodata:: configurator.AREA_BLACK_LIST |
71 |
:noindex: |
72 |
|
73 |
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously |
74 |
compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp. |
75 |
`BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about |
76 |
.c2c files, please read section 5 of the manual of `NucleoMiner`, the old |
77 |
version of `NucleoMiner2` |
78 |
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf). |
79 |
|
80 |
.. autodata:: configurator.C2C_FILES |
81 |
:noindex: |
82 |
|
83 |
`nucleominer` uses specific directory to work in, these are described in |
84 |
`INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`. |
85 |
|
86 |
Finally, `nucleominer` use external ressources, the path to these resspources |
87 |
are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`, |
88 |
`BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`. |
89 |
|
90 |
All paths, prefixes and indexes could be change in the |
91 |
`src/current/nucleominer_config.json` file. |
92 |
|
93 |
.. autodata:: wf.json_conf_file |
94 |
:noindex: |
95 |
|
96 |
|
97 |
Preprocessing Illumina Fastq Reads for Each Sample |
98 |
-------------------------------------------------- |
99 |
|
100 |
This preprocessing step consists in the 4 main steps embed in the `wf.py` and |
101 |
described bellow. As a preamble, this script computes `samples` `samples_mnase` |
102 |
and `strains` that will be used along the 4 steps. |
103 |
|
104 |
.. autodata:: wf.samples |
105 |
:noindex: |
106 |
|
107 |
.. autodata:: wf.samples_mnase |
108 |
:noindex: |
109 |
|
110 |
.. autodata:: wf.strains |
111 |
:noindex: |
112 |
|
113 |
|
114 |
Creating Bowtie Index from each Reference Genome |
115 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
116 |
|
117 |
For each strain, we need to create bowtie index. Bowtie index of a strain is a |
118 |
tree view of the genemoe reference for this strain. It will be used by |
119 |
bowtie to align reads. This step is performed by the following part of the |
120 |
`wf.py` script: |
121 |
|
122 |
.. literalinclude:: ../../../snep/src/current/wf.py |
123 |
:start-after: # _STARTOF_ step_1 |
124 |
:end-before: # _ENDOF_ step_1 |
125 |
:language: python |
126 |
|
127 |
The following table sum up involved file sizes and process durations concerning |
128 |
this step. |
129 |
|
130 |
====== ====================== ====================== ================ |
131 |
strain fasta genome file size bowtie index file size process duration |
132 |
====== ====================== ====================== ================ |
133 |
BY 12 Mo 25 Mo 11 s. |
134 |
RM 12 Mo 24 Mo 9 s. |
135 |
YJM 12 Mo 25 Mo 11 s. |
136 |
====== ====================== ====================== ================ |
137 |
|
138 |
|
139 |
|
140 |
|
141 |
Aligning Reads to Reference Genome |
142 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
143 |
|
144 |
Next, we launch bowtie to align reads to the reference genome. It produces a |
145 |
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` |
146 |
and `bedtools` are wrapped using python `subprocess` class. This step is |
147 |
performed by the followinw part of the `wf.py` script: |
148 |
|
149 |
.. literalinclude:: ../../../snep/src/current/wf.py |
150 |
:start-after: # _STARTOF_ step_2 |
151 |
:end-before: # _ENDOF_ step_2 |
152 |
:language: python |
153 |
|
154 |
Convert Aligned Reads for TemplateFilter |
155 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
156 |
TemplateFilter use particular input format for reads, so we convert `.bed` file. |
157 |
TemplateFilter expect reads as following: `chr coord strand #read` where: |
158 |
|
159 |
- chr is the number of the chromosome; |
160 |
- coord is the coordinate of the reads; |
161 |
- strand is `F` for forward and `R` for reverse; |
162 |
- #reads the number of reads for this position. |
163 |
|
164 |
Each entry is *tab*-separated. |
165 |
|
166 |
**WARNING** for reverse strand bowtie returns the position of left first |
167 |
nucleotid when TemplateFilter is waiting for right one. So this step takes it |
168 |
into account and add lenght of reads (in our case 50) to reverse reads |
169 |
coordinate. |
170 |
|
171 |
This step is performed by the followinw part of the `wf.py` script: |
172 |
|
173 |
.. literalinclude:: ../../../snep/src/current/wf.py |
174 |
:start-after: # _STARTOF_ step_3 |
175 |
:end-before: # _ENDOF_ step_3 |
176 |
:language: python |
177 |
|
178 |
The following table sum up number of reads, involved file sizes and process durations concerning |
179 |
the two last steps. In our case, aligment process have been multuthreaded over over 3 cores. |
180 |
|
181 |
== ============== ========================= ====== ================ ================== ================ |
182 |
id Illumina reads aligned and filtred reads ratio `.bed` file size TF input file size process duration |
183 |
== ============== ========================= ====== ================ ================== ================ |
184 |
1 16436138 10199695 62,06% 1064 Mo 60 Mo 383 s. |
185 |
2 16911132 12512727 73,99% 1298 Mo 64 Mo 437 s. |
186 |
3 15946902 12340426 77,38% 1280 Mo 65 Mo 423 s. |
187 |
4 13765584 10381903 75,42% 931 Mo 59 Mo 352 s. |
188 |
5 15168268 11502855 75,83% 1031 Mo 64 Mo 386 s. |
189 |
6 18850820 14024905 74,40% 1254 Mo 69 Mo 482 s. |
190 |
7 15591124 12126623 77,78% 1163 Mo 72 Mo 405 s. |
191 |
8 15659905 12475664 79,67% 1194 Mo 71 Mo 416 s. |
192 |
9 14668641 10960565 74,72% 1052 Mo 70 Mo 375 s. |
193 |
10 14339179 10454451 72,91% 1049 Mo 51 Mo 363 s. |
194 |
11 18019895 13688774 75,96% 1378 Mo 59 Mo 474 s. |
195 |
12 13746796 10810022 78,64% 1084 Mo 54 Mo 360 s. |
196 |
13 15205065 11766016 77,38% 990 Mo 54 Mo 381 s. |
197 |
14 17803097 13838883 77,73% 1154 Mo 60 Mo 452 s. |
198 |
15 15434564 12307878 79,74% 1032 Mo 57 Mo 394 s. |
199 |
16 16802587 12725665 75,74% 1221 Mo 48 Mo 438 s. |
200 |
17 16058417 12513734 77,93% 1192 Mo 63 Mo 422 s. |
201 |
18 16154482 13204331 81,74% 1277 Mo 52 Mo 430 s. |
202 |
19 21013924 17102120 81,38% 1646 Mo 59 Mo 555 s. |
203 |
20 17213114 14433357 83,85% 1389 Mo 53 Mo 459 s. |
204 |
21 17360907 14733001 84,86% 1203 Mo 55 Mo 450 s. |
205 |
22 18136816 15389581 84,85% 1257 Mo 53 Mo 469 s. |
206 |
23 14763678 12173025 82,45% 1140 Mo 56 Mo 393 s. |
207 |
24 15541709 12890345 82,94% 1057 Mo 48 Mo 398 s. |
208 |
25 16433215 13094314 79,68% 1241 Mo 57 Mo 433 s. |
209 |
26 17370850 14264136 82,12% 1347 Mo 51 Mo 466 s. |
210 |
27 14613512 8654495 59,22% 887 Mo 56 Mo 339 s. |
211 |
28 15248545 11367589 74,55% 1166 Mo 67 Mo 405 s. |
212 |
29 14316809 10767926 75,21% 1103 Mo 63 Mo 379 s. |
213 |
30 15178058 12265794 80,81% 1030 Mo 66 Mo 390 s. |
214 |
31 14968579 11876186 79,34% 1009 Mo 63 Mo 387 s. |
215 |
32 16912705 13550508 80,12% 1143 Mo 70 Mo 442 s. |
216 |
33 16782154 12755111 76,00% 1227 Mo 65 Mo 438 s. |
217 |
34 16741443 13168071 78,66% 1260 Mo 71 Mo 442 s. |
218 |
35 13096171 10367041 79,16% 992 Mo 62 Mo 350 s. |
219 |
36 17715118 14092985 79,55% 1404 Mo 68 Mo 483 s. |
220 |
37 17288466 7402082 42,82% 741 Mo 48 Mo 339 s. |
221 |
38 16116394 13178457 81,77% 1101 Mo 63 Mo 420 s. |
222 |
39 14241106 10537228 73,99% 880 Mo 57 Mo 348 s. |
223 |
40 13784738 10598464 76,89% 1005 Mo 64 Mo 358 s. |
224 |
41 12438007 9620975 77,35% 911 Mo 60 Mo 326 s. |
225 |
42 13853959 11031238 79,63% 1045 Mo 64 Mo 365 s. |
226 |
43 12036162 6654780 55,29% 684 Mo 46 Mo 268 s. |
227 |
44 13873129 10251074 73,89% 1048 Mo 61 Mo 365 s. |
228 |
45 19817751 14904502 75,21% 1520 Mo 72 Mo 528 s. |
229 |
46 13368959 10818619 80,92% 912 Mo 63 Mo 350 s. |
230 |
47 7566467 6139001 81,13% 520 Mo 44 Mo 201 s. |
231 |
48 32586928 21191363 65,03% 1816 Mo 82 Mo 766 s. |
232 |
49 30733184 18791373 61,14% 1801 Mo 89 Mo 721 s. |
233 |
50 41287616 30383875 73,59% 2911 Mo 112 Mo 1065 s. |
234 |
51 40439965 31177914 77,10% 2981 Mo 117 Mo 1070 s. |
235 |
53 40876476 33780065 82,64% 3316 Mo 103 Mo 1165 s. |
236 |
55 52424414 47117107 89,88% 3811 Mo 119 Mo 1477 s. |
237 |
== ============== ========================= ====== ================ ================== ================ |
238 |
|
239 |
For some reasons (manipulation efficency, e.g. PCR...), we remove samples 33, 45, 48 and 55. |
240 |
|
241 |
|
242 |
Run TemplateFilter on Mnase Samples |
243 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
244 |
|
245 |
Finally, for each sample we perfome TemplateFilter analysis. |
246 |
|
247 |
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is |
248 |
define by its center and its width. An odd width leads us to considere non |
249 |
interger lower and upper bound. |
250 |
|
251 |
**WARNING** TemplateFilter is not design to deal with replicate. So we choose to |
252 |
keep a maximum of nucleosome and filter in a second time using the benefit of |
253 |
replicate. To do that we set a low correlation threshold parameter (`0.5`) and a |
254 |
particularly high value of overlaping (`300%`). |
255 |
|
256 |
This step is performed by the followinw part of the `wf.py` script: |
257 |
|
258 |
.. literalinclude:: ../../../snep/src/current/wf.py |
259 |
:start-after: # _STARTOF_ step_4 |
260 |
:end-before: # _ENDOF_ step_4 |
261 |
:language: python |
262 |
|
263 |
== ====== ========== ============= ================ |
264 |
id strain found nucs nuc file size process duration |
265 |
== ====== ========== ============= ================ |
266 |
1 BY 96214 68 Mo 1022 s. |
267 |
2 BY 91694 65 Mo 1038 s. |
268 |
3 BY 91205 65 Mo 1036 s. |
269 |
4 RM 88076 62 Mo 984 s. |
270 |
5 RM 90141 64 Mo 967 s. |
271 |
6 RM 87517 62 Mo 980 s. |
272 |
7 YJM 88945 64 Mo 566 s. |
273 |
8 YJM 88689 64 Mo 570 s. |
274 |
9 YJM 88128 63 Mo 565 s. |
275 |
== ====== ========== ============= ================ |
276 |
|
277 |
|
278 |
|
279 |
|
280 |
|
281 |
|
282 |
|
283 |
|
284 |
|
285 |
|
286 |
|
287 |
|
288 |
|
289 |
Inferring Nucleosome Position and Extracting Read Counts |
290 |
-------------------------------------------------------- |
291 |
|
292 |
|
293 |
|
294 |
This preprocessing step consists in the 4 main steps embed in the `wf.py` and |
295 |
described bellow. As a preamble, this script computes `samples` `samples_mnase` |
296 |
and `strains` that will be used along the 4 steps. |
297 |
|
298 |
|
299 |
The second part of the tutoriel use `R` (http://http://www.r-project.org). It |
300 |
consists in the 3 main steps corresponding to 4 R scripts: |
301 |
|
302 |
- compute_rois.R |
303 |
- extract_maps.R |
304 |
- count_reads.R |
305 |
- get_size_factors |
306 |
- launch_deseq.R |
307 |
|
308 |
Computing Common Genome Region Between Strains |
309 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
310 |
|
311 |
.. code:: bash |
312 |
|
313 |
R CMD BATCH src/current/compute_rois.R |
314 |
|
315 |
|
316 |
Extracting Maps for Well Positionned and Fuzzy Nucleosomes |
317 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
318 |
|
319 |
.. code:: bash |
320 |
|
321 |
R CMD BATCH src/current/extract_maps.R |
322 |
|
323 |
|
324 |
Count Reads for Each Nucleosome |
325 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
326 |
|
327 |
.. code:: bash |
328 |
|
329 |
R CMD BATCH src/current/count_reads.R |
330 |
|
331 |
|
332 |
Get Size Factors Using DESeq |
333 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
334 |
|
335 |
.. code:: bash |
336 |
|
337 |
R CMD BATCH src/current/get_size_factors.R |
338 |
|
339 |
|
340 |
Performing DESeq Analysis |
341 |
^^^^^^^^^^^^^^^^^^^^^^^^^ |
342 |
|
343 |
.. code:: bash |
344 |
|
345 |
R CMD BATCH src/current/launch_deseq.R |
346 |
|
347 |
|
348 |
Results |
349 |
------- |
350 |
|
351 |
Output Files Organisation |
352 |
^^^^^^^^^^^^^^^^^^^^^^^^^ |
353 |
Previous steps produce following 45 files. |
354 |
Each filename is under the form |
355 |
|
356 |
.. code:: bash |
357 |
|
358 |
results/current/[combi]_[marker]_[form]_snep.tab |
359 |
|
360 |
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination, marker is |
361 |
in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each post translational |
362 |
histone modification and form is in {wp, fuzzy, wpfuzzy} considering well |
363 |
positionned nucleosomes, fuzzy nucleosomes or both for SNEP computation. |
364 |
|
365 |
|
366 |
|
367 |
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM lower_bound_RM upper_bound_RM index_nuc_RM roi_index form BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4 RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 |
368 |
BY_H3K14ac_37 BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM |
369 |
|
370 |
For each file, there is 1 line per nucleosome and each line is composed of many columns divided into 3 main topics: |
371 |
- nuc information |
372 |
- number opf reads for each sample |
373 |
- DESeq analysis results. |
374 |
|
375 |
For exemple for the file *BY_RM_H3K14ac_wp_snep.tab* informations are: |
376 |
- chr_BY, the BY chr involved |
377 |
- lower_bound_BY, the lower bound of the BY nuc |
378 |
- upper_bound_BY, the upper_bound of the BY nuc |
379 |
- index_nuc_BY, the index of the nuc in the entire list of BY nucs |
380 |
- chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM |
381 |
are the same information for the RM strain |
382 |
- roi_index, the index of the region of interrest involved. |
383 |
|
384 |
Next cols concern indicators for each sample. They are labeled [strain]_[marker]_[sample_id] and each value represents the number of reads for the current nuc for the sample *sample_id*. |
385 |
|
386 |
The 5 final columns concern DESeq analysis: |
387 |
- manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect. |
388 |
- pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or the interaction term *marker:strain* |
389 |
- snep_index, a boolean set to TRUE if the *pvalueGLM* value is under the threshold computed with FDR function with a rate set to 0.01%. |
390 |
|
391 |
It also produces the file that explicts size factor for each involved sample in differents strain combination and nucleosomal region type: |
392 |
|
393 |
TODO: include this file... /home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt |
394 |
|
395 |
|
396 |
.. code:: bash |
397 |
|
398 |
results/current/size_factors.tab |
399 |
|
400 |
|
401 |
|
402 |
|
403 |
Number of SNEPs |
404 |
^^^^^^^^^^^^^^^ |
405 |
|
406 |
Here are the number of computed for each forms. |
407 |
|
408 |
.. code:: bash |
409 |
|
410 |
[1] "wp" |
411 |
#nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac |
412 |
BY-RM 30234 520 798 83 3566 26 |
413 |
BY-YJM 31298 303 619 102 103 128 |
414 |
RM-YJM 29863 129 340 46 3177 18 |
415 |
[1] "fuzzy" |
416 |
#nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac |
417 |
BY-RM 10748 294 308 101 1681 42 |
418 |
BY-YJM 10669 122 176 124 93 87 |
419 |
RM-YJM 11478 54 112 41 1389 20 |
420 |
[1] "wpfuzzy" |
421 |
#nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac |
422 |
BY-RM 40982 770 1136 183 5404 73 |
423 |
BY-YJM 41967 439 804 214 198 199 |
424 |
RM-YJM 41341 184 468 87 4687 37 |
425 |
|
426 |
|
427 |
TODO: |
428 |
- Print/study intra/inter strain LODs. |
429 |
- Check the normality of sample using Shapiro–Wilk (Hypothesis for computing LODs) |
430 |
|
431 |
|