Révision e5603c3f doc/sphinx_doc/tuto.rst
b/doc/sphinx_doc/tuto.rst | ||
---|---|---|
1 | 1 |
Tutorial |
2 | 2 |
======== |
3 | 3 |
|
4 |
This tutorial describes steps allowing to perform quantitave analysis of |
|
5 |
nucleosomal epigenome. We assume that files are organised around a given |
|
6 |
hierarchie and that all command lines are launched from project's root. |
|
4 |
This tutorial describes steps allowing performing quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory. |
|
7 | 5 |
|
8 |
This tutorial is divided into t=wo main parts. First one consists in the python |
|
9 |
script `wf.py` that aligns and convert Illumina reads. Second one is the R |
|
10 |
script `main.r` that extracts information (nucleosome position and indicators) |
|
11 |
from the dataset. |
|
6 |
This tutorial is divided into two main parts. The first part covers the python script `wf.py` that aligns and converts short sequence reads. The second part covers the R scripts that extracts information (nucleosome position and indicators) from the dataset. |
|
12 | 7 |
|
13 | 8 |
|
14 |
Python and R Common Configuration File |
|
15 |
-------------------------------------- |
|
16 |
|
|
17 |
First of all we define in one place some configuration variables that will be launch by python and R scripts. This file is **configurator.py**. The execution of this python script dumps variables into the **nucleo_miner_config.json** that will be launch by both kind of scriopts (R and puython). |
|
18 |
|
|
19 |
To do this launch at the root of your project the following command line: |
|
20 | 9 |
|
21 |
.. code:: bash |
|
22 | 10 |
|
23 |
python src/current/configurator.py
|
|
24 |
|
|
11 |
Experimental Dataset, Working Directory and Configuration File
|
|
12 |
-------------------------------------------------------------- |
|
25 | 13 |
|
26 |
$$$ other python script to describe: |
|
27 |
- libcoverage.py |
|
28 |
- wf.py |
|
14 |
Working Directory Organisation |
|
15 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
29 | 16 |
|
17 |
The working directory... |
|
30 | 18 |
|
19 |
$$$ TODO: Explain how and where retrieve the workdir |
|
31 | 20 |
|
32 | 21 |
|
33 | 22 |
|
34 |
Dataset and Configuration Variables
|
|
35 |
-----------------------------------
|
|
23 |
Retrieving Experimental Dataset
|
|
24 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
36 | 25 |
|
37 |
We want to compare nucleosomes of 3 yeast strains:
|
|
26 |
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.
|
|
38 | 27 |
|
39 |
- BY |
|
40 |
- RM |
|
41 |
- YJM |
|
28 |
$$$ TODO explain how organise Experimental Dataset into the `data` directory of the working directory. |
|
42 | 29 |
|
43 |
For each strain we perform Mnase-Seq and ChIP-Seq using the 5 following |
|
44 |
markers: |
|
45 | 30 |
|
46 |
- H3K4me1 |
|
47 |
- H3K4me3 |
|
48 |
- H3K9ac |
|
49 |
- H3K14ac |
|
50 |
- H4K12ac |
|
31 |
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each strain we performed Mnase-Seq and ChIP-Seq using an antibody recognizing the H3K14ac epigenetic mark. |
|
51 | 32 |
|
52 |
In order to simplify the design of experiment, we considere Mnase as a marker. |
|
53 |
For each couple `(strain, marker)` we perform 3 replicates. So, theoritically |
|
54 |
we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2 |
|
55 |
replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a |
|
56 |
uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information. |
|
33 |
The dataset is composed of 55 files organised as follows: |
|
57 | 34 |
|
58 |
.. autodata:: configurator.CSV_SAMPLE_FILE |
|
59 |
:noindex: |
|
60 |
|
|
61 |
We use a convention to link sample and Illumina fastq outputs. Illumina output |
|
62 |
files of the sample `ID` will be stored in the directory |
|
63 |
`ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be |
|
64 |
stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`. |
|
65 |
|
|
66 |
.. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX |
|
67 |
:noindex: |
|
68 |
|
|
69 |
For BY (resp. RM and YJM) we use following reference genome |
|
70 |
`saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta` |
|
71 |
(resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and |
|
72 |
`saccharomyces_cerevisiae_YJM_789_screencontig.fasta`). |
|
73 |
The index `FASTA_REFERENCE_GENOME_FILES` stores this information. |
|
35 |
- 3 replicates for BY MNase Seq |
|
36 |
|
|
37 |
- sample 1 (5 fastq.gz files) |
|
38 |
- sample 2 (5 fastq.gz files) |
|
39 |
- sample 3 (4 fastq.gz files) |
|
40 |
|
|
41 |
- 3 replicates for RM MNase Seq |
|
42 |
|
|
43 |
- sample 4 (4 fastq.gz files) |
|
44 |
- sample 5 (4 fastq.gz files) |
|
45 |
- sample 6 (5 fastq.gz files) |
|
46 |
|
|
47 |
- 3 replicates for BY ChIP Seq H3K14ac |
|
48 |
|
|
49 |
- sample 36 (5 fastq.gz files) |
|
50 |
- sample 37 (5 fastq.gz files) |
|
51 |
- sample 53 (9 fastq.gz files) |
|
52 |
|
|
53 |
- 2 replicates for RM ChIP Seq H3K14ac |
|
54 |
|
|
55 |
- sample 38 (5 fastq.gz files) |
|
56 |
- sample 39 (4 fastq.gz files) |
|
57 |
|
|
74 | 58 |
|
75 |
.. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES |
|
76 |
:noindex: |
|
77 | 59 |
|
78 |
Each chromosome/contig is identify in the fasta file by an obscure identifier. |
|
79 |
For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when |
|
80 |
TemplateFilter is waiting for an integer. So, we translate it. The index |
|
81 |
`FASTA_INDEXES` stores this translation. |
|
82 | 60 |
|
83 |
.. autodata:: configurator.FASTA_INDEXES
|
|
84 |
:noindex:
|
|
61 |
Python and R Common Configuration File
|
|
62 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
85 | 63 |
|
86 |
From a pragamatical point of view we discard some part of the genome (repeated |
|
87 |
sequence etc...). The list of the black listed area is explicitely detailled in |
|
88 |
`AREA_BLACK_LIST`. |
|
64 |
First of all we define in one place some configuration variables that will be launched by python and R scripts. These variables are contained in file `configurator.py`. The execution of this python script dumps variables into the `nucleominer_config.json` file that will then be used by both R and python scripts. |
|
89 | 65 |
|
90 |
.. autodata:: configurator.AREA_BLACK_LIST |
|
91 |
:noindex: |
|
66 |
To do this, go to the root directory of your project and run the following command: |
|
92 | 67 |
|
93 |
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously |
|
94 |
compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp. |
|
95 |
`BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about |
|
96 |
.c2c files, please read section 5 of the manual of `NucleoMiner`, the old |
|
97 |
version of `NucleoMiner2` |
|
98 |
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf). |
|
68 |
.. code:: bash |
|
99 | 69 |
|
100 |
.. autodata:: configurator.C2C_FILES
|
|
101 |
:noindex: |
|
70 |
python src/current/configurator.py
|
|
71 |
|
|
102 | 72 |
|
103 |
`nucleominer` uses specific directory to work in, these are described in |
|
104 |
`INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`. |
|
105 | 73 |
|
106 |
Finally, `nucleominer` use external ressources, the path to these resspources |
|
107 |
are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`, |
|
108 |
`BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`. |
|
109 | 74 |
|
110 |
All paths, prefixes and indexes could be change in the |
|
111 |
`src/current/nucleominer_config.json` file. |
|
112 | 75 |
|
113 |
.. autodata:: wf.json_conf_file |
|
114 |
:noindex: |
|
115 | 76 |
|
116 | 77 |
|
117 | 78 |
Preprocessing Illumina Fastq Reads for Each Sample |
118 | 79 |
-------------------------------------------------- |
119 | 80 |
|
120 |
This preprocessing step consists in the 4 main steps embed in the `wf.py` and |
|
121 |
described bellow. As a preamble, this script computes `samples` `samples_mnase` |
|
122 |
and `strains` that will be used along the 4 steps. |
|
81 |
This preprocessing step consists of 4 main steps embedded in the `wf.py` script. They are described bellow. As a preamble, this script computes `samples`, `samples_mnase` and `strains` that will be used along the 4 steps. |
|
82 |
|
|
123 | 83 |
|
124 | 84 |
.. autodata:: wf.samples |
125 | 85 |
:noindex: |
... | ... | |
134 | 94 |
Creating Bowtie Index from each Reference Genome |
135 | 95 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
136 | 96 |
|
137 |
For each strain, we need to create bowtie index. Bowtie index of a strain is a |
|
138 |
tree view of the genemoe reference for this strain. It will be used by |
|
139 |
bowtie to align reads. This step is performed by the following part of the |
|
140 |
`wf.py` script: |
|
97 |
For each strain, we need to create bowtie index. Bowtie index of a strain is a tree view of the genome of this strain. It will be used by bowtie to align reads. This step is performed by the following part of the `wf.py` script: |
|
141 | 98 |
|
142 | 99 |
.. literalinclude:: ../../../snep/src/current/wf.py |
143 | 100 |
:start-after: # _STARTOF_ step_1 |
144 | 101 |
:end-before: # _ENDOF_ step_1 |
145 | 102 |
:language: python |
146 | 103 |
|
147 |
The following table sum up involved file sizes and process durations concerning |
|
148 |
this step. |
|
104 |
The following table summarizes the file sizes and process durations concerning this step. |
|
149 | 105 |
|
150 | 106 |
====== ====================== ====================== ================ |
151 | 107 |
strain fasta genome file size bowtie index file size process duration |
152 | 108 |
====== ====================== ====================== ================ |
153 | 109 |
BY 12 Mo 25 Mo 11 s. |
154 | 110 |
RM 12 Mo 24 Mo 9 s. |
155 |
YJM 12 Mo 25 Mo 11 s. |
|
156 | 111 |
====== ====================== ====================== ================ |
157 | 112 |
|
158 | 113 |
|
... | ... | |
162 | 117 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
163 | 118 |
|
164 | 119 |
Next, we launch bowtie to align reads to the reference genome. It produces a |
165 |
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` |
|
166 |
and `bedtools` are wrapped using python `subprocess` class. This step is |
|
167 |
performed by the followinw part of the `wf.py` script: |
|
120 |
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` and `bedtools` are wrapped using python `subprocess` class. This step is performed by the following part of the `wf.py` script: |
|
168 | 121 |
|
169 | 122 |
.. literalinclude:: ../../../snep/src/current/wf.py |
170 | 123 |
:start-after: # _STARTOF_ step_2 |
171 | 124 |
:end-before: # _ENDOF_ step_2 |
172 | 125 |
:language: python |
173 | 126 |
|
174 |
Convert Aligned Reads for TemplateFilter
|
|
175 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
176 |
TemplateFilter use particular input format for reads, so we convert `.bed` file. |
|
177 |
TemplateFilter expect reads as following: `chr coord strand #read` where:
|
|
127 |
Convert Aligned Reads into TemplateFilter Format
|
|
128 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|
|
129 |
|
|
130 |
TemplateFilter uses particular input formats for reads, so it is necessary to convert the `.bed` files. TemplateFilter expect reads as follows: `chr`, `coord`, `strand` and `#read` where:
|
|
178 | 131 |
|
179 |
- chr is the number of the chromosome;
|
|
180 |
- coord is the coordinate of the reads;
|
|
181 |
- strand is `F` for forward and `R` for reverse;
|
|
182 |
- #reads the number of reads for this position.
|
|
132 |
- `chr` is the number of the chromosome;
|
|
133 |
- `coord` is the coordinate of the reads;
|
|
134 |
- `strand` is `F` for forward and `R` for reverse;
|
|
135 |
- `#reads` the number of reads covering this position.
|
|
183 | 136 |
|
184 | 137 |
Each entry is *tab*-separated. |
185 | 138 |
|
186 |
**WARNING** for reverse strand bowtie returns the position of left first |
|
187 |
nucleotid when TemplateFilter is waiting for right one. So this step takes it |
|
188 |
into account and add lenght of reads (in our case 50) to reverse reads |
|
189 |
coordinate. |
|
139 |
**WARNING** for reverse strands, bowtie returns the position of the first nucleotide on the left hand side, whereas TemplateFilter expects the first one on the right hand side. This step takes this into account by adding the read length (in our case 50) to the reverse reads coordinates. |
|
190 | 140 |
|
191 |
This step is performed by the followinw part of the `wf.py` script:
|
|
141 |
This step is performed by the following part of the `wf.py` script:
|
|
192 | 142 |
|
193 | 143 |
.. literalinclude:: ../../../snep/src/current/wf.py |
194 | 144 |
:start-after: # _STARTOF_ step_3 |
195 | 145 |
:end-before: # _ENDOF_ step_3 |
196 | 146 |
:language: python |
197 | 147 |
|
198 |
The following table sum up number of reads, involved file sizes and process durations concerning |
|
199 |
the two last steps. In our case, aligment process have been multuthreaded over over 3 cores. |
|
148 |
The following table summarises the number of reads, the involved file sizes and process durations concerning the two last steps. In our case, alignment process have been multithreaded over 3 cores. |
|
200 | 149 |
|
201 | 150 |
== ============== ========================= ====== ================ ================== ================ |
202 | 151 |
id Illumina reads aligned and filtred reads ratio `.bed` file size TF input file size process duration |
... | ... | |
207 | 156 |
4 13765584 10381903 75,42% 931 Mo 59 Mo 352 s. |
208 | 157 |
5 15168268 11502855 75,83% 1031 Mo 64 Mo 386 s. |
209 | 158 |
6 18850820 14024905 74,40% 1254 Mo 69 Mo 482 s. |
210 |
7 15591124 12126623 77,78% 1163 Mo 72 Mo 405 s. |
|
211 |
8 15659905 12475664 79,67% 1194 Mo 71 Mo 416 s. |
|
212 |
9 14668641 10960565 74,72% 1052 Mo 70 Mo 375 s. |
|
213 |
10 14339179 10454451 72,91% 1049 Mo 51 Mo 363 s. |
|
214 |
11 18019895 13688774 75,96% 1378 Mo 59 Mo 474 s. |
|
215 |
12 13746796 10810022 78,64% 1084 Mo 54 Mo 360 s. |
|
216 |
13 15205065 11766016 77,38% 990 Mo 54 Mo 381 s. |
|
217 |
14 17803097 13838883 77,73% 1154 Mo 60 Mo 452 s. |
|
218 |
15 15434564 12307878 79,74% 1032 Mo 57 Mo 394 s. |
|
219 |
16 16802587 12725665 75,74% 1221 Mo 48 Mo 438 s. |
|
220 |
17 16058417 12513734 77,93% 1192 Mo 63 Mo 422 s. |
|
221 |
18 16154482 13204331 81,74% 1277 Mo 52 Mo 430 s. |
|
222 |
19 21013924 17102120 81,38% 1646 Mo 59 Mo 555 s. |
|
223 |
20 17213114 14433357 83,85% 1389 Mo 53 Mo 459 s. |
|
224 |
21 17360907 14733001 84,86% 1203 Mo 55 Mo 450 s. |
|
225 |
22 18136816 15389581 84,85% 1257 Mo 53 Mo 469 s. |
|
226 |
23 14763678 12173025 82,45% 1140 Mo 56 Mo 393 s. |
|
227 |
24 15541709 12890345 82,94% 1057 Mo 48 Mo 398 s. |
|
228 |
25 16433215 13094314 79,68% 1241 Mo 57 Mo 433 s. |
|
229 |
26 17370850 14264136 82,12% 1347 Mo 51 Mo 466 s. |
|
230 |
27 14613512 8654495 59,22% 887 Mo 56 Mo 339 s. |
|
231 |
28 15248545 11367589 74,55% 1166 Mo 67 Mo 405 s. |
|
232 |
29 14316809 10767926 75,21% 1103 Mo 63 Mo 379 s. |
|
233 |
30 15178058 12265794 80,81% 1030 Mo 66 Mo 390 s. |
|
234 |
31 14968579 11876186 79,34% 1009 Mo 63 Mo 387 s. |
|
235 |
32 16912705 13550508 80,12% 1143 Mo 70 Mo 442 s. |
|
236 |
33 16782154 12755111 76,00% 1227 Mo 65 Mo 438 s. |
|
237 |
34 16741443 13168071 78,66% 1260 Mo 71 Mo 442 s. |
|
238 |
35 13096171 10367041 79,16% 992 Mo 62 Mo 350 s. |
|
239 | 159 |
36 17715118 14092985 79,55% 1404 Mo 68 Mo 483 s. |
240 | 160 |
37 17288466 7402082 42,82% 741 Mo 48 Mo 339 s. |
241 | 161 |
38 16116394 13178457 81,77% 1101 Mo 63 Mo 420 s. |
242 | 162 |
39 14241106 10537228 73,99% 880 Mo 57 Mo 348 s. |
243 |
40 13784738 10598464 76,89% 1005 Mo 64 Mo 358 s. |
|
244 |
41 12438007 9620975 77,35% 911 Mo 60 Mo 326 s. |
|
245 |
42 13853959 11031238 79,63% 1045 Mo 64 Mo 365 s. |
|
246 |
43 12036162 6654780 55,29% 684 Mo 46 Mo 268 s. |
|
247 |
44 13873129 10251074 73,89% 1048 Mo 61 Mo 365 s. |
|
248 |
45 19817751 14904502 75,21% 1520 Mo 72 Mo 528 s. |
|
249 |
46 13368959 10818619 80,92% 912 Mo 63 Mo 350 s. |
|
250 |
47 7566467 6139001 81,13% 520 Mo 44 Mo 201 s. |
|
251 |
48 32586928 21191363 65,03% 1816 Mo 82 Mo 766 s. |
|
252 |
49 30733184 18791373 61,14% 1801 Mo 89 Mo 721 s. |
|
253 |
50 41287616 30383875 73,59% 2911 Mo 112 Mo 1065 s. |
|
254 |
51 40439965 31177914 77,10% 2981 Mo 117 Mo 1070 s. |
|
255 | 163 |
53 40876476 33780065 82,64% 3316 Mo 103 Mo 1165 s. |
256 |
55 52424414 47117107 89,88% 3811 Mo 119 Mo 1477 s. |
|
257 | 164 |
== ============== ========================= ====== ================ ================== ================ |
258 | 165 |
|
259 |
For some reasons (manipulation efficiency, e.g. PCR...), we remove samples 33, 45, 48 and 55. |
|
260 |
|
|
261 |
|
|
262 | 166 |
Run TemplateFilter on Mnase Samples |
263 | 167 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
264 | 168 |
|
265 |
Finally, for each sample we perfome TemplateFilter analysis.
|
|
169 |
Finally, for each sample we perform TemplateFilter analysis.
|
|
266 | 170 |
|
267 | 171 |
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is |
268 |
define by its center and its width. An odd width leads us to considere non
|
|
269 |
interger lower and upper bound.
|
|
172 |
define by its center and its width. An odd width leads us to consider non-
|
|
173 |
integer lower and upper bound. |
|
270 | 174 |
|
271 |
**WARNING** TemplateFilter is not design to deal with replicate. So we choose to |
|
272 |
keep a maximum of nucleosome and filter in a second time using the benefit of |
|
273 |
replicate. To do that we set a low correlation threshold parameter (`0.5`) and a |
|
274 |
particularly high value of overlaping (`300%`). |
|
175 |
**WARNING** TemplateFilter is not designed to deal with replicates. So we recommend to keep a maximum of nucleosomes and filter the aberrant ones afterwards using the benefits of having replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high value of overlap (300%). |
|
275 | 176 |
|
276 |
This step is performed by the followinw part of the `wf.py` script:
|
|
177 |
This step is performed by the following part of the `wf.py` script:
|
|
277 | 178 |
|
278 | 179 |
.. literalinclude:: ../../../snep/src/current/wf.py |
279 | 180 |
:start-after: # _STARTOF_ step_4 |
... | ... | |
289 | 190 |
4 RM 88076 62 Mo 984 s. |
290 | 191 |
5 RM 90141 64 Mo 967 s. |
291 | 192 |
6 RM 87517 62 Mo 980 s. |
292 |
7 YJM 88945 64 Mo 566 s. |
|
293 |
8 YJM 88689 64 Mo 570 s. |
|
294 |
9 YJM 88128 63 Mo 565 s. |
|
295 | 193 |
== ====== ========== ============= ================ |
296 | 194 |
|
297 | 195 |
|
... | ... | |
306 | 204 |
|
307 | 205 |
|
308 | 206 |
|
207 |
|
|
208 |
|
|
209 |
|
|
210 |
|
|
211 |
|
|
212 |
|
|
213 |
|
|
214 |
|
|
215 |
|
|
216 |
|
|
217 |
|
|
218 |
|
|
219 |
|
|
220 |
|
|
221 |
|
|
222 |
|
|
223 |
|
|
224 |
|
|
225 |
|
|
226 |
|
|
227 |
|
|
228 |
|
|
229 |
|
|
230 |
|
|
231 |
|
|
232 |
|
|
233 |
|
|
234 |
|
|
235 |
|
|
236 |
|
|
237 |
|
|
238 |
|
|
239 |
|
|
240 |
|
|
241 |
|
|
242 |
|
|
243 |
|
|
244 |
|
|
245 |
|
|
246 |
|
|
247 |
|
|
248 |
.. |
|
249 |
.. |
|
250 |
.. $$$ other python script to describe: |
|
251 |
.. - libcoverage.py |
|
252 |
.. - wf.py |
|
253 |
.. |
|
254 |
.. |
|
255 |
.. |
|
256 |
.. |
|
257 |
.. |
|
258 |
.. |
|
259 |
.. In order to simplify the design of experiment, we consider Mnase as a marker. |
|
260 |
.. For each couple `(strain, marker)` we perform 3 replicates. So, theoritically |
|
261 |
.. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2 |
|
262 |
.. replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a |
|
263 |
.. uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information. |
|
264 |
.. |
|
265 |
.. .. autodata:: configurator.CSV_SAMPLE_FILE |
|
266 |
.. :noindex: |
|
267 |
.. |
|
268 |
.. We use a convention to link sample and Illumina fastq outputs. Illumina output |
|
269 |
.. files of the sample `ID` will be stored in the directory |
|
270 |
.. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be |
|
271 |
.. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`. |
|
272 |
.. |
|
273 |
.. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX |
|
274 |
.. :noindex: |
|
275 |
.. |
|
276 |
.. For BY (resp. RM and YJM) we use following reference genome |
|
277 |
.. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta` |
|
278 |
.. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and |
|
279 |
.. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`). |
|
280 |
.. The index `FASTA_REFERENCE_GENOME_FILES` stores this information. |
|
281 |
.. |
|
282 |
.. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES |
|
283 |
.. :noindex: |
|
284 |
.. |
|
285 |
.. Each chromosome/contig is identify in the fasta file by an obscure identifier. |
|
286 |
.. For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when |
|
287 |
.. TemplateFilter is waiting for an integer. So, we translate it. The index |
|
288 |
.. `FASTA_INDEXES` stores this translation. |
|
289 |
.. |
|
290 |
.. .. autodata:: configurator.FASTA_INDEXES |
|
291 |
.. :noindex: |
|
292 |
.. |
|
293 |
.. From a pragamatical point of view we discard some part of the genome (repeated |
|
294 |
.. sequence etc...). The list of the black listed area is explicitely detailled in |
|
295 |
.. `AREA_BLACK_LIST`. |
|
296 |
.. |
|
297 |
.. .. autodata:: configurator.AREA_BLACK_LIST |
|
298 |
.. :noindex: |
|
299 |
.. |
|
300 |
.. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously |
|
301 |
.. compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp. |
|
302 |
.. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about |
|
303 |
.. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old |
|
304 |
.. version of `NucleoMiner2` |
|
305 |
.. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf). |
|
306 |
.. |
|
307 |
.. .. autodata:: configurator.C2C_FILES |
|
308 |
.. :noindex: |
|
309 |
.. |
|
310 |
.. `nucleominer` uses specific directory to work in, these are described in |
|
311 |
.. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`. |
|
312 |
.. |
|
313 |
.. Finally, `nucleominer` use external ressources, the path to these resspources |
|
314 |
.. are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`, |
|
315 |
.. `BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`. |
|
316 |
.. |
|
317 |
.. All paths, prefixes and indexes could be change in the |
|
318 |
.. `src/current/nucleominer_config.json` file. |
|
319 |
.. |
|
320 |
.. .. autodata:: wf.json_conf_file |
|
321 |
.. :noindex: |
|
322 |
.. |
|
323 |
|
|
324 |
|
|
325 |
|
|
326 |
|
|
327 |
|
|
328 |
|
|
329 |
|
|
330 |
|
|
331 |
|
|
332 |
|
|
333 |
|
|
334 |
|
|
335 |
|
|
336 |
|
|
337 |
|
|
338 |
|
|
339 |
|
|
340 |
|
|
341 |
|
|
342 |
|
|
343 |
|
|
344 |
|
|
345 |
|
|
346 |
|
|
347 |
|
|
348 |
|
|
349 |
|
|
350 |
|
|
351 |
|
|
352 |
|
|
353 |
|
|
354 |
|
|
355 |
|
|
356 |
|
|
357 |
|
|
358 |
|
|
359 |
|
|
360 |
|
|
361 |
|
|
362 |
|
|
363 |
|
|
364 |
|
|
309 | 365 |
Inferring Nucleosome Position and Extracting Read Counts |
310 | 366 |
-------------------------------------------------------- |
311 | 367 |
|
312 | 368 |
|
313 | 369 |
|
314 |
The second part of the tutorial uses `R` (http://http://www.r-project.org). It consists in a set of R scripts taht will be sourced in an R console launched at the root of your project. the R srcipts are:
|
|
370 |
The second part of the tutorial uses R (http://http://www.r-project.org). It consists of a set of R scripts that will be sourced in an R from a console launched at the root of your project. These scripts are:
|
|
315 | 371 |
|
316 | 372 |
- headers.R |
317 | 373 |
- extract_maps.R |
318 |
- compare_common_wp.R
|
|
374 |
- translate_common_wp.R
|
|
319 | 375 |
- split_samples.R |
320 | 376 |
- count_reads.R |
321 | 377 |
- get_size_factors |
... | ... | |
324 | 380 |
The Script headers.R |
325 | 381 |
^^^^^^^^^^^^^^^^^^^^ |
326 | 382 |
|
327 |
The script header.R is included in each other scripts. It is in charge of: |
|
383 |
The script headers.R is included in each other scripts. It is in charge of:
|
|
328 | 384 |
|
329 |
- launching libraries used in thes scripts
|
|
385 |
- launching libraries used in the scripts |
|
330 | 386 |
- launching configuration (design, strain, marker...) |
331 |
- computing and caching CURs |
|
387 |
- computing and caching CURs (caching means storing the information in the computer's memory) |
|
388 |
|
|
389 |
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks. You may be using NucleoMiner2 to analyse data of a single strain, or of several strains. |
|
390 |
|
|
391 |
- All the data corresponds to the same strain (e.g. treatment/control, or only few mutations): Then in step 1), the regions to use are entire chromosomes. Instep 2) simply use the default translate function which is neutral. |
|
392 |
|
|
393 |
- The data come from two or more strains: In this case, edit a list of regions and customize the translate function which performs the correspondence between the different genomes. How we did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to the Appendice "Generate .c2c Files"), then use it to produce the list of regions and customise “translate”. |
|
394 |
|
|
395 |
|
|
396 |
|
|
332 | 397 |
|
333 | 398 |
In your R console, run the following command line: |
334 | 399 |
|
335 | 400 |
.. code:: bash |
336 | 401 |
|
337 |
R CMD BATCH src/current/header.R
|
|
402 |
source("src/current/headers.R")
|
|
338 | 403 |
|
339 | 404 |
|
340 | 405 |
The Script extract_maps.R |
341 | 406 |
^^^^^^^^^^^^^^^^^^^^^^^^^ |
342 |
This script is in charge of extracting Maps for well positioned and fuzzy nucleosomes. First of all, this script computed intra and inter strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects results and produces well positioned, fuzzy and UNR maps.
|
|
407 |
This script is in charge of extracting Maps for well-positioned and fuzzy nucleosomes. First of all, this script computes intra and inter-strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects results and produces well-positioned, fuzzy and UNR maps.
|
|
343 | 408 |
|
344 |
The well-positioned map for BY is collected in the result directory and is called **BY_wp.tab**. It is composed of following columns:
|
|
409 |
The well-positioned map for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of following columns:
|
|
345 | 410 |
|
346 | 411 |
- chr, the number of the chromosome |
347 | 412 |
- lower_bound, the lower bound of the nucleosome |
348 | 413 |
- upper_bound, the upper bound of the nucleosome |
349 | 414 |
- cur_index, index of the CUR |
350 | 415 |
- index_nuc, the index of the nucleosome in the CUR |
351 |
- wp, 1 if it is a well positioned nucleosome, 0 else
|
|
352 |
- nb_reads, the number of reads that supports this nucleosome
|
|
353 |
- nb_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates if it is a well-positioned nucleosome)
|
|
354 |
- llr_1, for a well-positioned nucleosome, it is the LLR1 between the first and the second TemplateFilter nucleosome.
|
|
355 |
- llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the first TemplateFilter nucleosome.
|
|
356 |
- wp_llr, for a well-positioned nucleosome, it is the LLR2 overall TemplateFilter nucleosomes.
|
|
357 |
- wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained with the LLR2 (**1-pchisq(2.LLR2, df=4)**)
|
|
358 |
- dyad_shift, for a well-positioned nucleosome, it is shift between the two extreme TemplateFilter nucleosome dyad positions. |
|
416 |
- wp, 1 if it is a well positioned nucleosome, 0 otherwise
|
|
417 |
- nb_reads, the number of reads that support this nucleosome |
|
418 |
- nb_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates in which it is a well-positioned nucleosome)
|
|
419 |
- llr_1, for a well-positioned nucleosome, it is the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain.
|
|
420 |
- llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the third TemplateFilter nucleosome on the chain.
|
|
421 |
- wp_llr, for a well-positioned nucleosome, it is the LLR2 that compares consistency of the positioning over all TemplateFilter nucleosomes.
|
|
422 |
- wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained with the LLR2 (`1-pchisq(2.LLR2, df=4)`)
|
|
423 |
- dyad_shift, for a well-positioned nucleosome, it is the shift between the two extreme TemplateFilter nucleosome dyad positions.
|
|
359 | 424 |
|
360 |
The fuzzy map for BY is collected in the result directory and is called **BY_fuzzy.tab**. It is composed of following columns:
|
|
425 |
The fuzzy map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of following columns:
|
|
361 | 426 |
|
362 | 427 |
- chr, the number of the chromosome |
363 | 428 |
- lower_bound, the lower bound of the nucleosome |
364 | 429 |
- upper_bound, the upper bound of the nucleosome |
365 | 430 |
- cur_index, index of the CUR |
366 | 431 |
|
367 |
The common well-position map for BY and RM strains is collected in the result directory and is called **BY_RM_common_wp.tab**. It is composed of following columns:
|
|
432 |
The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of following columns:
|
|
368 | 433 |
|
369 | 434 |
- cur_index, the index of the CUR |
370 | 435 |
- index_nuc_BY, the index of the BY nucleosome in the CUR |
371 |
- index_nuc_RM,the index of the RM nucleosome in the CUR |
|
372 |
- llr_score, the LLR3 score between th eBy and RM nucleosomes |
|
373 |
- common_wp_pval, the p-value chi square test obtained with the LLR3 (**1-pchisq(2.LLR3, df=2)**) |
|
436 |
- index_nuc_RM, the index of the RM nucleosome in the CUR |
|
437 |
- llr_score, , the LLR3 score that estimates conservation between the positions in BY and RM |
|
438 |
- common_wp_pval, the p-value chi square test obtained from LLR3 (`1-pchisq(2.LLR3, df=2)`) |
|
439 |
- diff, the dyads shift between the positions in the two strains |
|
374 | 440 |
|
375 |
The common UNR map for BY and RM strains is collected in the result directory and is called **BY_RM_common_unr.tab**. It is composed of following columns:
|
|
441 |
The common UNR map for BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:
|
|
376 | 442 |
|
377 | 443 |
- cur_index, the index of the CUR |
378 | 444 |
- index_nuc_BY, the index of the BY nucleosome in the CUR |
379 | 445 |
- index_nuc_RM,the index of the RM nucleosome in the CUR |
380 | 446 |
|
381 |
To execute this script, run the following command line in your R console:
|
|
447 |
To execute this script, run the following command in your R console: |
|
382 | 448 |
|
383 | 449 |
.. code:: bash |
384 | 450 |
|
385 | 451 |
source("src/current/extract_maps.R") |
386 | 452 |
|
387 | 453 |
|
388 |
The Script compare_common_wp.R |
|
389 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
390 |
|
|
391 |
This script is used to compare inter strain distances between common well-positioned nucleosomes. |
|
392 |
|
|
393 |
For example, it compute the file **BY_RM_common_wp_diff.tab** that contains dyad shifts between two well-positioned nucleosomes. It is composed of following columns: |
|
394 |
- cur_index, the index of the CUR |
|
395 |
- index_nuc_BY, the index of the BY nucleosome in the CUR |
|
396 |
- index_nuc_RM,the index of the RM nucleosome in the CUR |
|
397 |
- llr_score, the LLR3 score between th eBy and RM nucleosomes |
|
398 |
- common_wp_pval, the p-value chi square test obtained with the LLR3 (**1-pchisq(2.LLR3, df=2)**) |
|
399 |
- diff, the dyad shifts between two well-positioned nucleosomes |
|
454 |
The Script translate_common_wp.R |
|
455 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
400 | 456 |
|
401 |
It also translates well-positioned nucleosome maps from a strain to an other strain and stores it into a table.
|
|
457 |
This script is used to translate common well-positioned nucleosome maps from a strain to another strain and stores it into a table.
|
|
402 | 458 |
|
403 |
For example, the file **results/2014-04/RM_wp_tr_2_BY.tab** contains RM well-positioned nucleosome translated into the BY genome referential. It is composed of following columns:
|
|
459 |
For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosome translated into the BY genome coordinates. It is composed of following columns:
|
|
404 | 460 |
|
405 | 461 |
- strain_ref, the reference genome (in which positioned are defined) |
406 | 462 |
- begin, the translated lower bound of the nucleosome |
407 | 463 |
- end, the translated upper bound of the nucleosome |
408 |
- chr, the number of chromosome for the reference genome (in which positioned are defined) |
|
464 |
- chr, the number of chromosomes for the reference genome (in which positioned are defined)
|
|
409 | 465 |
- length, the length of the nucleosome (could be negative) |
410 | 466 |
- cur_index, the index of the CUR |
411 | 467 |
- index_nuc, the index of the nucleosome in the CUR |
412 | 468 |
|
469 |
To execute this script, run the following command in your R console: |
|
413 | 470 |
|
471 |
.. code:: bash |
|
414 | 472 |
|
415 |
To execute this script, run the following command line in your R console:
|
|
473 |
source("src/current/translate_common_wp.R")
|
|
416 | 474 |
|
417 |
.. code:: bash |
|
418 | 475 |
|
419 |
R CMD BATCH src/current/compare_common_wp.R |
|
476 |
The Script split_samples.R |
|
477 |
^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
420 | 478 |
|
479 |
For memory space usage reasons, we split and compress TemplateFilter input files according to their corresponding chromosome. for example, `sample_1_TF.tab` will be split into : |
|
421 | 480 |
|
422 |
Split and Compress Samples According CURs |
|
423 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
481 |
- sample_1_chr_1_splited_sample.tab.gz |
|
482 |
- sample_1_chr_2_splited_sample.tab.gz |
|
483 |
- ... |
|
484 |
- sample_1_chr_17_splited_sample.tab.gz |
|
485 |
|
|
486 |
|
|
487 |
To execute this script, run the following command in your R console: |
|
424 | 488 |
|
425 | 489 |
.. code:: bash |
426 | 490 |
|
427 |
R CMD BATCH src/current/split_samples.R
|
|
491 |
source("src/current/split_samples.R")
|
|
428 | 492 |
|
429 | 493 |
|
430 |
Count Reads for Each Nucleosome |
|
431 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
494 |
The Script count_reads.R |
|
495 |
^^^^^^^^^^^^^^^^^^^^^^^^ |
|
496 |
|
|
497 |
To associate a number of observations (read) to each nucleosome we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab` `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab` |
|
498 |
for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned nucleosomes and Mnase UNRs respectively. |
|
499 |
|
|
500 |
For example, the file `BY_RM_H3K14ac_unr_and_nbreads.tab` contains counted reads for well-positioned nucleosomes with the experimental condition ChIP H3K14ac. It is composed of the following columns: |
|
501 |
|
|
502 |
- chr_BY, the number of the chromosome for BY |
|
503 |
- lower_bound_BY, the lower bound of the nucleosome for BY |
|
504 |
- upper_bound_BY, the upper bound of the nucleosome for BY |
|
505 |
- index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
|
506 |
- chr_RM, the number of the chromosome for RM |
|
507 |
- lower_bound_RM, the lower bound of the nucleosome for RM |
|
508 |
- upper_bound_RM, the upper bound of the nucleosome for RM |
|
509 |
- index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
|
510 |
- cur_index, index of the CUR |
|
511 |
- BY_H3K14ac_36, the number of reads for the current nucleosome for the sample 36 |
|
512 |
- BY_H3K14ac_37, #reads for sample 37 |
|
513 |
- BY_H3K14ac_53, #reads for sample 53 |
|
514 |
- RM_H3K14ac_38, #reads for sample 38 |
|
515 |
- RM_H3K14ac_39, #reads for sample 39 |
|
516 |
|
|
517 |
To execute this script, run the following command in your R console: |
|
432 | 518 |
|
433 | 519 |
.. code:: bash |
434 | 520 |
|
435 |
R CMD BATCH src/current/count_reads.R |
|
521 |
source("src/current/count_reads.R") |
|
522 |
|
|
523 |
|
|
524 |
The Script get_size_factors.R |
|
525 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
526 |
|
|
527 |
|
|
528 |
This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. It corresponds to normalisation of read counts from sample to sample, as determined by DESeq. When a sample has n reads for a nucleosome or a UNR, |
|
529 |
the normalised count is n/f where f is the factor contained in this file. |
|
530 |
The script dumps computed size factors into the file `size_factors.tab`. This file has the form: |
|
531 |
|
|
532 |
========= ======= ======= ======= |
|
533 |
sample_id wp unr wpunr |
|
534 |
========= ======= ======= ======= |
|
535 |
1 0.87396 0.88097 0.87584 |
|
536 |
2 1.07890 1.07440 1.07760 |
|
537 |
3 1.06400 1.05890 1.06250 |
|
538 |
4 0.85782 0.87948 0.86305 |
|
539 |
5 0.97577 0.96590 0.97307 |
|
540 |
6 1.19630 1.18120 1.19190 |
|
541 |
36 0.93318 0.92762 0.93166 |
|
542 |
37 0.48315 0.48453 0.48350 |
|
543 |
38 1.11240 1.11210 1.11230 |
|
544 |
39 0.89897 0.89917 0.89903 |
|
545 |
53 2.22650 2.22700 2.22660 |
|
546 |
========= ======= ======= ======= |
|
547 |
|
|
548 |
sample_id are given in file samples.csv |
|
436 | 549 |
|
550 |
If you don't know which column to use, we recommend using wpunr. |
|
437 | 551 |
|
438 |
Get Size Factors Using DESeq |
|
439 |
^^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
552 |
If you want the very detailed factors produced by DESeq, here are the information: |
|
553 |
|
|
554 |
- unr: factor computed from data of UNR regions. These regions are defined for every pairs of aligned genomes (e.g. BY_RM) |
|
555 |
- wp: same, but for well-positioned nucleosomes. |
|
556 |
- wpunr: both types of regions. |
|
557 |
|
|
558 |
To execute this script, run the following command in your R console: |
|
440 | 559 |
|
441 | 560 |
.. code:: bash |
442 | 561 |
|
443 |
R CMD BATCH src/current/get_size_factors.R
|
|
562 |
source("src/current/get_size_factors.R")
|
|
444 | 563 |
|
445 | 564 |
|
446 |
Performing DESeq Analysis
|
|
565 |
The Script launch_deseq.R
|
|
447 | 566 |
^^^^^^^^^^^^^^^^^^^^^^^^^ |
448 | 567 |
|
568 |
Finally, the script `launch_deseq.R` perform statistical analysis on each nucleosome using `DESeq`. It produces files: |
|
569 |
|
|
570 |
- results/current/BY_RM_H3K14ac_wp_snep.tab |
|
571 |
- results/current/BY_RM_H3K14ac_unr_snep.tab |
|
572 |
- results/current/BY_RM_H3K14ac_wpunr_snep.tab |
|
573 |
- results/current/BY_RM_H3K14ac_wp_mnase.tab |
|
574 |
- results/current/BY_RM_H3K14ac_unr_mnase.tab |
|
575 |
- results/current/BY_RM_H3K14ac_wpunr_mnase.tab |
|
576 |
|
|
577 |
These files are organised with the following columns (see file `BY_RM_H3K14ac_wp_snep.tab` for an example): |
|
578 |
|
|
579 |
- chr_BY, the number of the chromosome for BY |
|
580 |
- lower_bound_BY, the lower bound of the nucleosome for BY |
|
581 |
- upper_bound_BY, the upper bound of the nucleosome for BY |
|
582 |
- index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
|
583 |
- chr_RM, the number of the chromosome for RM |
|
584 |
- lower_bound_RM, the lower bound of the nucleosome for RM |
|
585 |
- upper_bound_RM, the upper bound of the nucleosome for RM |
|
586 |
- index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
|
587 |
- cur_index, index of the CUR |
|
588 |
- form |
|
589 |
- BY_Mnase_Seq_1, the number of reads for the current nucleosome for the sample 1 |
|
590 |
|
|
591 |
Next columns concern indicators for each sample: |
|
592 |
|
|
593 |
- BY_Mnase_Seq_2, #reads for sample 2 |
|
594 |
- BY_Mnase_Seq_3, #reads for sample 3 |
|
595 |
- RM_Mnase_Seq_4, #reads for sample 4 |
|
596 |
- RM_Mnase_Seq_5, #reads for sample 5 |
|
597 |
- RM_Mnase_Seq_6, #reads for sample 6 |
|
598 |
- BY_H3K14ac_36, #reads for sample 36 |
|
599 |
- BY_H3K14ac_37, #reads for sample 37 |
|
600 |
- BY_H3K14ac_53, #reads for sample 53 |
|
601 |
- RM_H3K14ac_38, #reads for sample 38 |
|
602 |
- RM_H3K14ac_39, #reads for sample 39 |
|
603 |
|
|
604 |
The 5 last columns concern DESeq analysis: |
|
605 |
|
|
606 |
- manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model. |
|
607 |
- pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or not the interaction term marker:strain. This is the statsitcial significance of the interaction term and therefore the statistical significance of the SNEP. |
|
608 |
- snep_index, a boolean set to TRUE if the pvalueGLM value is under the threshold computed with FDR function with a rate set to 0.0001. |
|
609 |
To execute this script, run the following command |
|
610 |
|
|
611 |
To execute this script, run the following command in your R console: |
|
612 |
|
|
449 | 613 |
.. code:: bash |
450 | 614 |
|
451 |
R CMD BATCH src/current/launch_deseq.R
|
|
615 |
source("src/current/launch_deseq.R")
|
|
452 | 616 |
|
453 | 617 |
|
454 |
Results |
|
455 |
------- |
|
618 |
Results: Number of SNEPs
|
|
619 |
------------------------
|
|
456 | 620 |
|
457 |
Output Files Organisation |
|
458 |
^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
459 |
Previous steps produce following 45 files. |
|
460 |
Each filename is under the form |
|
621 |
Here are the number of computed SNEPs for each forms. |
|
461 | 622 |
|
462 |
.. code:: bash |
|
623 |
===== ======= ===== ======= |
|
624 |
form strains #nucs H3K14ac |
|
625 |
===== ======= ===== ======= |
|
626 |
wp BY-RM 30464 3549 |
|
627 |
unr BY-RM 9497 1559 |
|
628 |
wpunr BY-RM 39961 5240 |
|
629 |
===== ======= ===== ======= |
|
630 |
|
|
463 | 631 |
|
464 |
results/current/[combi]_[marker]_[form]_snep.tab |
|
465 | 632 |
|
466 |
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination, marker is |
|
467 |
in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each post translational |
|
468 |
histone modification and form is in {wp, fuzzy, wpfuzzy} considering well |
|
469 |
positioned nucleosomes, fuzzy nucleosomes or both for SNEP computation. |
|
470 | 633 |
|
471 | 634 |
|
472 | 635 |
|
473 |
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM lower_bound_RM upper_bound_RM index_nuc_RM roi_index form BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4 RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36
|
|
474 |
BY_H3K14ac_37 BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM
|
|
636 |
APPENDICE: Generate .c2c Files
|
|
637 |
------------------------------
|
|
475 | 638 |
|
476 |
For each file, there is 1 line per nucleosome and each line is composed of many columns divided into 3 main topics: |
|
477 |
- nuc information |
|
478 |
- number opf reads for each sample |
|
479 |
- DESeq analysis results. |
|
639 |
The `.c2c` files is a simple table that describes how the genome sequence can be aligned. We generate it using NucleoMiner 1.0. |
|
480 | 640 |
|
481 |
For exemple for the file *BY_RM_H3K14ac_wp_snep.tab* informations are: |
|
482 |
- chr_BY, the BY chr involved |
|
483 |
- lower_bound_BY, the lower bound of the BY nuc |
|
484 |
- upper_bound_BY, the upper_bound of the BY nuc |
|
485 |
- index_nuc_BY, the index of the nuc in the entire list of BY nucs |
|
486 |
- chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM |
|
487 |
are the same information for the RM strain |
|
488 |
- roi_index, the index of the region of interrest involved. |
|
489 |
|
|
490 |
Next cols concern indicators for each sample. They are labeled [strain]_[marker]_[sample_id] and each value represents the number of reads for the current nuc for the sample *sample_id*. |
|
641 |
To install NucleoMiner 1.0 on your UNIX/LINUX computer you need first to install the Genetic Data analysis Library (GDL), which is a dynamic library of useful C functions derived from the GNU Scientific Library. |
|
491 | 642 |
|
492 |
The 5 final columns concern DESeq analysis: |
|
493 |
- manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect. |
|
494 |
- pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or the interaction term *marker:strain* |
|
495 |
- snep_index, a boolean set to TRUE if the *pvalueGLM* value is under the threshold computed with FDR function with a rate set to 0.01%. |
|
643 |
Installing the GDL library |
|
644 |
^^^^^^^^^^^^^^^^^^^^^^^^^^ |
|
496 | 645 |
|
497 |
It also produces the file that explicts size factor for each involved sample in differents strain combination and nucleosomal region type:
|
|
646 |
Get the gdl-1.0.tar.gz archive on your computer (in the directory deps of your working directory). Copy it in a dedicated directory. Go into this directory using the cd command, and then unfold the archive by typing:
|
|
498 | 647 |
|
499 |
TODO: include this file... /home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt
|
|
648 |
tar -xvzf gdl-1.0.tar.gz
|
|
500 | 649 |
|
650 |
This creates a directory called gdl-1.0. You now need to go into this directory and compile the library, by typing: |
|
501 | 651 |
|
502 | 652 |
.. code:: bash |
503 | 653 |
|
504 |
results/current/size_factors.tab |
|
654 |
cd gdl-1.0 |
|
655 |
./configure |
|
656 |
make |
|
505 | 657 |
|
658 |
Now you need to install the library on your system. This needs root priviledges: |
|
506 | 659 |
|
660 |
.. code:: bash |
|
661 |
|
|
662 |
sudo make install |
|
663 |
|
|
664 |
Installing NucleoMiner |
|
665 |
^^^^^^^^^^^^^^^^^^^^^^ |
|
666 |
|
|
667 |
Get the nucleominer-1.0.tar.gz archive on your computer. Copy it in a dedicated directory. Go into this directory using the cd command, and then unfold the archive by typing: |
|
668 |
|
|
669 |
tar -xvzf nucleominer-1.0.tar.gz |
|
670 |
This creates a directory called nucleominer-1.0. You now need to go into this directory and compile the library, by typing: |
|
507 | 671 |
|
672 |
.. code:: bash |
|
508 | 673 |
|
509 |
Number of SNEPs |
|
510 |
^^^^^^^^^^^^^^^ |
|
674 |
cd nucleominer-1.0 |
|
675 |
./configure |
|
676 |
make |
|
511 | 677 |
|
512 |
Here are the number of computed for each forms.
|
|
678 |
You can then use the binaries dircetly from this folder (best then is to add the path to this folder in your PATH environment variable). If you want to install nucleominer at the system's level (useful if mutiple users will need it) then type, with root priviledges:
|
|
513 | 679 |
|
514 | 680 |
.. code:: bash |
515 | 681 |
|
516 |
[1] "wp" |
|
517 |
#nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac |
|
518 |
BY-RM 30234 520 798 83 3566 26 |
|
519 |
BY-YJM 31298 303 619 102 103 128 |
|
520 |
RM-YJM 29863 129 340 46 3177 18 |
|
521 |
[1] "fuzzy" |
|
522 |
#nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac |
|
523 |
BY-RM 10748 294 308 101 1681 42 |
|
524 |
BY-YJM 10669 122 176 124 93 87 |
|
525 |
RM-YJM 11478 54 112 41 1389 20 |
|
526 |
[1] "wpfuzzy" |
|
527 |
#nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac |
|
528 |
BY-RM 40982 770 1136 183 5404 73 |
|
529 |
BY-YJM 41967 439 804 214 198 199 |
|
530 |
RM-YJM 41341 184 468 87 4687 37 |
|
531 |
|
|
532 |
|
|
533 |
TODO: |
|
534 |
- Print/study intra/inter strain LODs. |
|
535 |
- Check the normality of sample using Shapiro–Wilk (Hypothesis for computing LODs) |
|
536 |
|
|
682 |
sudo make install |
|
683 |
|
|
684 |
Generate .c2c Files |
|
685 |
^^^^^^^^^^^^^^^^^^^ |
|
686 |
|
|
687 |
To generate .c2c files you need to type the following command in a terminal: |
|
688 |
|
|
689 |
.. code:: bash |
|
537 | 690 |
|
691 |
mkdir dir_4_c2c |
|
692 |
NMgxcomp Data/BY_S288c/Sequence/Genome.fasta \ |
|
693 |
Data/RM_11-1a/Sequence/Genome.fasta \ |
|
694 |
dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log |
|
695 |
|
|
696 |
After execution, the directory dir_4_c2c will hold the .c2c files. |
Formats disponibles : Unified diff