Statistiques
| Branche: | Révision :

root / doc / sphinx_doc / tuto.rst @ dadb6a4d

Historique | Voir | Annoter | Télécharger (23,21 ko)

1 935a568c Florent Chuffart
Tutorial
2 935a568c Florent Chuffart
========
3 935a568c Florent Chuffart
4 935a568c Florent Chuffart
This tutorial describes steps allowing to perform quantitave analysis of 
5 935a568c Florent Chuffart
nucleosomal epigenome. We assume that files are organised around a given 
6 935a568c Florent Chuffart
hierarchie and that all command lines are launched from project's root.
7 935a568c Florent Chuffart
8 935a568c Florent Chuffart
This tutorial is divided into t=wo main parts. First one consists in the python 
9 935a568c Florent Chuffart
script `wf.py` that aligns and convert Illumina reads. Second one is the R 
10 935a568c Florent Chuffart
script `main.r` that extracts information (nucleosome position and indicators) 
11 935a568c Florent Chuffart
from the dataset.
12 935a568c Florent Chuffart
13 935a568c Florent Chuffart
14 dadb6a4d Florent Chuffart
Python and R Common Configuration File
15 dadb6a4d Florent Chuffart
--------------------------------------
16 dadb6a4d Florent Chuffart
17 dadb6a4d Florent Chuffart
First of all we define in one place some configuration variables that will be launch by python and R scripts. This file is **configurator.py**. The execution of this python script dumps variables into the **nucleo_miner_config.json** that will be launch by both kind of scriopts (R and puython).
18 dadb6a4d Florent Chuffart
19 dadb6a4d Florent Chuffart
To do this launch at the root of your project the following command line:
20 dadb6a4d Florent Chuffart
21 dadb6a4d Florent Chuffart
.. code:: bash
22 dadb6a4d Florent Chuffart
23 dadb6a4d Florent Chuffart
  python src/current/configurator.py
24 dadb6a4d Florent Chuffart
  
25 dadb6a4d Florent Chuffart
26 dadb6a4d Florent Chuffart
$$$ other python script to describe:
27 dadb6a4d Florent Chuffart
- libcoverage.py
28 dadb6a4d Florent Chuffart
- wf.py
29 dadb6a4d Florent Chuffart
30 dadb6a4d Florent Chuffart
31 dadb6a4d Florent Chuffart
32 dadb6a4d Florent Chuffart
33 dadb6a4d Florent Chuffart
34 dadb6a4d Florent Chuffart
Dataset and Configuration Variables
35 dadb6a4d Florent Chuffart
-----------------------------------
36 935a568c Florent Chuffart
37 935a568c Florent Chuffart
We want to compare nucleosomes of 3 yeast strains: 
38 935a568c Florent Chuffart
39 935a568c Florent Chuffart
- BY
40 935a568c Florent Chuffart
- RM
41 935a568c Florent Chuffart
- YJM
42 935a568c Florent Chuffart
43 935a568c Florent Chuffart
For each strain we perform Mnase-Seq and ChIP-Seq using the 5 following 
44 935a568c Florent Chuffart
markers: 
45 935a568c Florent Chuffart
46 935a568c Florent Chuffart
- H3K4me1
47 935a568c Florent Chuffart
- H3K4me3
48 935a568c Florent Chuffart
- H3K9ac
49 935a568c Florent Chuffart
- H3K14ac
50 935a568c Florent Chuffart
- H4K12ac
51 935a568c Florent Chuffart
52 dadb6a4d Florent Chuffart
In order to simplify the design of experiment, we considere Mnase as a marker. 
53 935a568c Florent Chuffart
For each couple `(strain, marker)` we perform 3 replicates. So, theoritically 
54 935a568c Florent Chuffart
we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2 
55 935a568c Florent Chuffart
replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a 
56 935a568c Florent Chuffart
uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information.
57 935a568c Florent Chuffart
58 935a568c Florent Chuffart
.. autodata:: configurator.CSV_SAMPLE_FILE
59 935a568c Florent Chuffart
    :noindex: 
60 935a568c Florent Chuffart
		
61 935a568c Florent Chuffart
We use a convention to link sample and Illumina fastq outputs. Illumina output 
62 935a568c Florent Chuffart
files of the sample `ID` will be stored in the directory 
63 935a568c Florent Chuffart
`ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be 
64 935a568c Florent Chuffart
stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
65 935a568c Florent Chuffart
66 935a568c Florent Chuffart
.. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
67 935a568c Florent Chuffart
    :noindex: 
68 935a568c Florent Chuffart
69 935a568c Florent Chuffart
For BY (resp. RM and YJM) we use following reference genome 
70 935a568c Florent Chuffart
`saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
71 935a568c Florent Chuffart
(resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and 
72 935a568c Florent Chuffart
`saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
73 935a568c Florent Chuffart
The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
74 935a568c Florent Chuffart
75 935a568c Florent Chuffart
.. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
76 935a568c Florent Chuffart
    :noindex: 
77 935a568c Florent Chuffart
78 935a568c Florent Chuffart
Each chromosome/contig is identify in the fasta file by an obscure identifier. 
79 935a568c Florent Chuffart
For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when 
80 935a568c Florent Chuffart
TemplateFilter is waiting for an integer. So, we translate it. The index 
81 935a568c Florent Chuffart
`FASTA_INDEXES` stores this translation.
82 935a568c Florent Chuffart
83 935a568c Florent Chuffart
.. autodata:: configurator.FASTA_INDEXES
84 935a568c Florent Chuffart
    :noindex: 
85 935a568c Florent Chuffart
86 935a568c Florent Chuffart
From a pragamatical point of view we discard some part of the genome (repeated 
87 935a568c Florent Chuffart
sequence etc...). The list of the black listed area is explicitely detailled in 
88 935a568c Florent Chuffart
`AREA_BLACK_LIST`.
89 935a568c Florent Chuffart
90 935a568c Florent Chuffart
.. autodata:: configurator.AREA_BLACK_LIST
91 935a568c Florent Chuffart
    :noindex: 
92 935a568c Florent Chuffart
93 935a568c Florent Chuffart
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously 
94 935a568c Florent Chuffart
compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp. 
95 935a568c Florent Chuffart
`BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about 
96 935a568c Florent Chuffart
.c2c files, please read section 5 of the manual of `NucleoMiner`, the old 
97 935a568c Florent Chuffart
version of `NucleoMiner2` 
98 935a568c Florent Chuffart
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
99 935a568c Florent Chuffart
100 935a568c Florent Chuffart
.. autodata:: configurator.C2C_FILES
101 935a568c Florent Chuffart
    :noindex: 
102 935a568c Florent Chuffart
103 935a568c Florent Chuffart
`nucleominer` uses specific directory to work in, these are described in 
104 935a568c Florent Chuffart
`INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
105 935a568c Florent Chuffart
106 935a568c Florent Chuffart
Finally, `nucleominer` use external ressources, the path to these resspources 
107 935a568c Florent Chuffart
are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`, 
108 935a568c Florent Chuffart
`BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`.
109 935a568c Florent Chuffart
110 935a568c Florent Chuffart
All paths, prefixes and indexes could be change in the 
111 8e9facd8 Florent Chuffart
`src/current/nucleominer_config.json` file.
112 935a568c Florent Chuffart
113 935a568c Florent Chuffart
.. autodata:: wf.json_conf_file
114 935a568c Florent Chuffart
    :noindex: 
115 935a568c Florent Chuffart
116 935a568c Florent Chuffart
117 935a568c Florent Chuffart
Preprocessing Illumina Fastq Reads for Each Sample
118 935a568c Florent Chuffart
--------------------------------------------------
119 935a568c Florent Chuffart
120 935a568c Florent Chuffart
This preprocessing step consists in the 4 main steps embed in the `wf.py` and 
121 935a568c Florent Chuffart
described bellow. As a preamble, this script computes `samples` `samples_mnase` 
122 935a568c Florent Chuffart
and `strains` that will be used along the 4 steps.
123 935a568c Florent Chuffart
124 935a568c Florent Chuffart
.. autodata:: wf.samples
125 935a568c Florent Chuffart
    :noindex: 
126 935a568c Florent Chuffart
127 935a568c Florent Chuffart
.. autodata:: wf.samples_mnase
128 935a568c Florent Chuffart
    :noindex: 
129 935a568c Florent Chuffart
130 935a568c Florent Chuffart
.. autodata:: wf.strains
131 935a568c Florent Chuffart
    :noindex: 
132 935a568c Florent Chuffart
133 935a568c Florent Chuffart
134 935a568c Florent Chuffart
Creating Bowtie Index from each Reference Genome
135 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
136 935a568c Florent Chuffart
137 935a568c Florent Chuffart
For each strain, we need to create bowtie index. Bowtie index of a strain is a 
138 935a568c Florent Chuffart
tree view of the genemoe reference for this strain. It will be used by 
139 935a568c Florent Chuffart
bowtie to align reads. This step is performed by the following part of the 
140 935a568c Florent Chuffart
`wf.py` script:
141 935a568c Florent Chuffart
142 8e9facd8 Florent Chuffart
.. literalinclude:: ../../../snep/src/current/wf.py
143 935a568c Florent Chuffart
   :start-after: # _STARTOF_ step_1
144 935a568c Florent Chuffart
   :end-before: # _ENDOF_ step_1
145 935a568c Florent Chuffart
   :language: python
146 935a568c Florent Chuffart
147 935a568c Florent Chuffart
The following table sum up involved file sizes and process durations concerning 
148 935a568c Florent Chuffart
this step.
149 935a568c Florent Chuffart
150 935a568c Florent Chuffart
======  ======================  ======================  ================
151 935a568c Florent Chuffart
strain  fasta genome file size  bowtie index file size  process duration
152 935a568c Florent Chuffart
======  ======================  ======================  ================
153 935a568c Florent Chuffart
BY      12 Mo                          25 Mo                    11 s.
154 935a568c Florent Chuffart
RM      12 Mo                          24 Mo                    9 s.
155 935a568c Florent Chuffart
YJM     12 Mo                          25 Mo                    11 s.
156 935a568c Florent Chuffart
======  ======================  ======================  ================
157 935a568c Florent Chuffart
158 935a568c Florent Chuffart
159 935a568c Florent Chuffart
160 935a568c Florent Chuffart
161 935a568c Florent Chuffart
Aligning Reads to Reference Genome 
162 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
163 935a568c Florent Chuffart
164 935a568c Florent Chuffart
Next, we launch bowtie to align reads to the reference genome. It produces a 
165 935a568c Florent Chuffart
`.sam` file that we convert into a `.bed` file. Binaries for `bowtie`, `samtools` 
166 935a568c Florent Chuffart
and `bedtools` are wrapped using python `subprocess` class. This step is 
167 935a568c Florent Chuffart
performed by the followinw part of the `wf.py` script:
168 935a568c Florent Chuffart
169 8e9facd8 Florent Chuffart
.. literalinclude:: ../../../snep/src/current/wf.py
170 935a568c Florent Chuffart
   :start-after: # _STARTOF_ step_2
171 935a568c Florent Chuffart
   :end-before: # _ENDOF_ step_2
172 935a568c Florent Chuffart
   :language: python
173 935a568c Florent Chuffart
174 935a568c Florent Chuffart
Convert Aligned Reads for TemplateFilter
175 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
176 935a568c Florent Chuffart
TemplateFilter use particular input format for reads, so we convert `.bed` file. 
177 935a568c Florent Chuffart
TemplateFilter expect reads as following: `chr coord strand #read` where:
178 935a568c Florent Chuffart
179 935a568c Florent Chuffart
- chr is the number of the chromosome;
180 935a568c Florent Chuffart
- coord is the coordinate of the reads;
181 935a568c Florent Chuffart
- strand is `F` for forward and `R` for reverse;
182 935a568c Florent Chuffart
- #reads the number of reads for this position.
183 935a568c Florent Chuffart
184 935a568c Florent Chuffart
Each entry is *tab*-separated.
185 935a568c Florent Chuffart
186 935a568c Florent Chuffart
**WARNING** for reverse strand bowtie returns the position of left first 
187 935a568c Florent Chuffart
nucleotid when TemplateFilter is waiting for right one. So this step takes it 
188 935a568c Florent Chuffart
into account and add lenght of reads (in our case 50) to reverse reads 
189 935a568c Florent Chuffart
coordinate.
190 935a568c Florent Chuffart
191 935a568c Florent Chuffart
This step is performed by the followinw part of the `wf.py` script:
192 935a568c Florent Chuffart
193 8e9facd8 Florent Chuffart
.. literalinclude:: ../../../snep/src/current/wf.py
194 935a568c Florent Chuffart
   :start-after: # _STARTOF_ step_3
195 935a568c Florent Chuffart
   :end-before: # _ENDOF_ step_3
196 935a568c Florent Chuffart
   :language: python
197 935a568c Florent Chuffart
198 935a568c Florent Chuffart
The following table sum up number of reads, involved file sizes and process durations concerning 
199 935a568c Florent Chuffart
the two last steps. In our case, aligment process have been multuthreaded over over 3 cores.
200 935a568c Florent Chuffart
201 935a568c Florent Chuffart
==  ==============  =========================  ======  ================  ==================  ================  
202 935a568c Florent Chuffart
id  Illumina reads  aligned and filtred reads  ratio   `.bed` file size  TF input file size  process duration
203 935a568c Florent Chuffart
==  ==============  =========================  ======  ================  ==================  ================
204 935a568c Florent Chuffart
1   16436138        10199695                   62,06%  1064 Mo           60  Mo              383   s.
205 935a568c Florent Chuffart
2   16911132        12512727                   73,99%  1298 Mo           64  Mo              437   s.
206 935a568c Florent Chuffart
3   15946902        12340426                   77,38%  1280 Mo           65  Mo              423   s.
207 935a568c Florent Chuffart
4   13765584        10381903                   75,42%  931  Mo           59  Mo              352   s.
208 935a568c Florent Chuffart
5   15168268        11502855                   75,83%  1031 Mo           64  Mo              386   s.
209 935a568c Florent Chuffart
6   18850820        14024905                   74,40%  1254 Mo           69  Mo              482   s.
210 935a568c Florent Chuffart
7   15591124        12126623                   77,78%  1163 Mo           72  Mo              405   s.
211 935a568c Florent Chuffart
8   15659905        12475664                   79,67%  1194 Mo           71  Mo              416   s.
212 935a568c Florent Chuffart
9   14668641        10960565                   74,72%  1052 Mo           70  Mo              375   s.
213 935a568c Florent Chuffart
10  14339179        10454451                   72,91%  1049 Mo           51  Mo              363   s.
214 935a568c Florent Chuffart
11  18019895        13688774                   75,96%  1378 Mo           59  Mo              474   s.
215 935a568c Florent Chuffart
12  13746796        10810022                   78,64%  1084 Mo           54  Mo              360   s.
216 935a568c Florent Chuffart
13  15205065        11766016                   77,38%  990  Mo           54  Mo              381   s.
217 935a568c Florent Chuffart
14  17803097        13838883                   77,73%  1154 Mo           60  Mo              452   s.
218 935a568c Florent Chuffart
15  15434564        12307878                   79,74%  1032 Mo           57  Mo              394   s.
219 935a568c Florent Chuffart
16  16802587        12725665                   75,74%  1221 Mo           48  Mo              438   s.
220 935a568c Florent Chuffart
17  16058417        12513734                   77,93%  1192 Mo           63  Mo              422   s.
221 935a568c Florent Chuffart
18  16154482        13204331                   81,74%  1277 Mo           52  Mo              430   s.
222 935a568c Florent Chuffart
19  21013924        17102120                   81,38%  1646 Mo           59  Mo              555   s.
223 935a568c Florent Chuffart
20  17213114        14433357                   83,85%  1389 Mo           53  Mo              459   s.
224 935a568c Florent Chuffart
21  17360907        14733001                   84,86%  1203 Mo           55  Mo              450   s.
225 935a568c Florent Chuffart
22  18136816        15389581                   84,85%  1257 Mo           53  Mo              469   s.
226 935a568c Florent Chuffart
23  14763678        12173025                   82,45%  1140 Mo           56  Mo              393   s.
227 935a568c Florent Chuffart
24  15541709        12890345                   82,94%  1057 Mo           48  Mo              398   s.
228 935a568c Florent Chuffart
25  16433215        13094314                   79,68%  1241 Mo           57  Mo              433   s.
229 935a568c Florent Chuffart
26  17370850        14264136                   82,12%  1347 Mo           51  Mo              466   s.
230 935a568c Florent Chuffart
27  14613512        8654495                    59,22%  887  Mo           56  Mo              339   s.
231 935a568c Florent Chuffart
28  15248545        11367589                   74,55%  1166 Mo           67  Mo              405   s.
232 935a568c Florent Chuffart
29  14316809        10767926                   75,21%  1103 Mo           63  Mo              379   s.
233 935a568c Florent Chuffart
30  15178058        12265794                   80,81%  1030 Mo           66  Mo              390   s.
234 935a568c Florent Chuffart
31  14968579        11876186                   79,34%  1009 Mo           63  Mo              387   s.
235 935a568c Florent Chuffart
32  16912705        13550508                   80,12%  1143 Mo           70  Mo              442   s.
236 935a568c Florent Chuffart
33  16782154        12755111                   76,00%  1227 Mo           65  Mo              438   s.
237 935a568c Florent Chuffart
34  16741443        13168071                   78,66%  1260 Mo           71  Mo              442   s.
238 935a568c Florent Chuffart
35  13096171        10367041                   79,16%  992  Mo           62  Mo              350   s.
239 935a568c Florent Chuffart
36  17715118        14092985                   79,55%  1404 Mo           68  Mo              483   s.
240 935a568c Florent Chuffart
37  17288466        7402082                    42,82%  741  Mo           48  Mo              339   s.
241 935a568c Florent Chuffart
38  16116394        13178457                   81,77%  1101 Mo           63  Mo              420   s.
242 935a568c Florent Chuffart
39  14241106        10537228                   73,99%  880  Mo           57  Mo              348   s.
243 935a568c Florent Chuffart
40  13784738        10598464                   76,89%  1005 Mo           64  Mo              358   s.
244 935a568c Florent Chuffart
41  12438007        9620975                    77,35%  911  Mo           60  Mo              326   s.
245 935a568c Florent Chuffart
42  13853959        11031238                   79,63%  1045 Mo           64  Mo              365   s.
246 935a568c Florent Chuffart
43  12036162        6654780                    55,29%  684  Mo           46  Mo              268   s.
247 935a568c Florent Chuffart
44  13873129        10251074                   73,89%  1048 Mo           61  Mo              365   s.
248 935a568c Florent Chuffart
45  19817751        14904502                   75,21%  1520 Mo           72  Mo              528   s.
249 935a568c Florent Chuffart
46  13368959        10818619                   80,92%  912  Mo           63  Mo              350   s.
250 935a568c Florent Chuffart
47  7566467         6139001                    81,13%  520  Mo           44  Mo              201   s.
251 935a568c Florent Chuffart
48  32586928        21191363                   65,03%  1816 Mo           82  Mo              766   s.
252 935a568c Florent Chuffart
49  30733184        18791373                   61,14%  1801 Mo           89  Mo              721   s.
253 935a568c Florent Chuffart
50  41287616        30383875                   73,59%  2911 Mo           112 Mo              1065  s.
254 935a568c Florent Chuffart
51  40439965        31177914                   77,10%  2981 Mo           117 Mo              1070  s.
255 935a568c Florent Chuffart
53  40876476        33780065                   82,64%  3316 Mo           103 Mo              1165  s.
256 935a568c Florent Chuffart
55  52424414        47117107                   89,88%  3811 Mo           119 Mo              1477  s.
257 935a568c Florent Chuffart
==  ==============  =========================  ======  ================  ==================  ================  
258 935a568c Florent Chuffart
259 dadb6a4d Florent Chuffart
For some reasons (manipulation efficiency, e.g. PCR...), we remove samples 33, 45, 48 and 55.
260 935a568c Florent Chuffart
261 935a568c Florent Chuffart
262 935a568c Florent Chuffart
Run TemplateFilter on Mnase Samples
263 dadb6a4d Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
264 935a568c Florent Chuffart
265 935a568c Florent Chuffart
Finally, for each sample we perfome TemplateFilter analysis. 
266 935a568c Florent Chuffart
267 935a568c Florent Chuffart
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is 
268 935a568c Florent Chuffart
define by its center and its width. An odd width leads us to considere non 
269 935a568c Florent Chuffart
interger lower and upper bound.
270 935a568c Florent Chuffart
271 935a568c Florent Chuffart
**WARNING** TemplateFilter is not design to deal with replicate. So we choose to 
272 935a568c Florent Chuffart
keep a maximum of nucleosome and filter in a second time using the benefit of 
273 935a568c Florent Chuffart
replicate. To do that we set a low correlation threshold parameter (`0.5`) and a 
274 935a568c Florent Chuffart
particularly high value of overlaping (`300%`).
275 935a568c Florent Chuffart
276 935a568c Florent Chuffart
This step is performed by the followinw part of the `wf.py` script:
277 935a568c Florent Chuffart
278 8e9facd8 Florent Chuffart
.. literalinclude:: ../../../snep/src/current/wf.py
279 935a568c Florent Chuffart
   :start-after: # _STARTOF_ step_4
280 935a568c Florent Chuffart
   :end-before: # _ENDOF_ step_4
281 935a568c Florent Chuffart
   :language: python
282 935a568c Florent Chuffart
283 935a568c Florent Chuffart
==  ======  ==========  =============  ================
284 935a568c Florent Chuffart
id  strain  found nucs  nuc file size  process duration
285 935a568c Florent Chuffart
==  ======  ==========  =============  ================
286 935a568c Florent Chuffart
1    BY     96214       68 Mo          1022 s.                     
287 935a568c Florent Chuffart
2    BY     91694       65 Mo          1038 s.                      
288 935a568c Florent Chuffart
3    BY     91205       65 Mo          1036 s.                       
289 935a568c Florent Chuffart
4    RM     88076       62 Mo          984 s.                      
290 935a568c Florent Chuffart
5    RM     90141       64 Mo          967 s.                      
291 935a568c Florent Chuffart
6    RM     87517       62 Mo          980 s.                      
292 935a568c Florent Chuffart
7    YJM    88945       64 Mo          566 s.                      
293 935a568c Florent Chuffart
8    YJM    88689       64 Mo          570 s.                      
294 935a568c Florent Chuffart
9    YJM    88128       63 Mo          565 s.                    
295 935a568c Florent Chuffart
==  ======  ==========  =============  ================
296 935a568c Florent Chuffart
297 935a568c Florent Chuffart
298 935a568c Florent Chuffart
299 935a568c Florent Chuffart
300 935a568c Florent Chuffart
301 935a568c Florent Chuffart
302 935a568c Florent Chuffart
303 935a568c Florent Chuffart
304 935a568c Florent Chuffart
305 935a568c Florent Chuffart
306 935a568c Florent Chuffart
307 935a568c Florent Chuffart
308 935a568c Florent Chuffart
309 935a568c Florent Chuffart
Inferring Nucleosome Position and Extracting Read Counts
310 935a568c Florent Chuffart
--------------------------------------------------------
311 935a568c Florent Chuffart
312 935a568c Florent Chuffart
313 935a568c Florent Chuffart
314 dadb6a4d Florent Chuffart
The second part of the tutorial uses `R` (http://http://www.r-project.org). It consists in a set of R scripts taht will be sourced in an R console launched at the root of your project. the R srcipts are:
315 935a568c Florent Chuffart
316 dadb6a4d Florent Chuffart
  - headers.R
317 935a568c Florent Chuffart
  - extract_maps.R
318 b20637ed Florent Chuffart
  - compare_common_wp.R
319 b20637ed Florent Chuffart
  - split_samples.R
320 935a568c Florent Chuffart
  - count_reads.R
321 935a568c Florent Chuffart
  - get_size_factors  
322 935a568c Florent Chuffart
  - launch_deseq.R
323 935a568c Florent Chuffart
324 dadb6a4d Florent Chuffart
The Script headers.R
325 dadb6a4d Florent Chuffart
^^^^^^^^^^^^^^^^^^^^
326 dadb6a4d Florent Chuffart
327 dadb6a4d Florent Chuffart
The script header.R is included in each other scripts. It is in charge of: 
328 dadb6a4d Florent Chuffart
329 dadb6a4d Florent Chuffart
  - launching libraries used in thes scripts
330 dadb6a4d Florent Chuffart
  - launching configuration (design, strain, marker...)
331 dadb6a4d Florent Chuffart
  - computing and caching CURs
332 dadb6a4d Florent Chuffart
333 dadb6a4d Florent Chuffart
In your R console, run the following command line:
334 935a568c Florent Chuffart
335 935a568c Florent Chuffart
.. code:: bash
336 935a568c Florent Chuffart
337 dadb6a4d Florent Chuffart
  R CMD BATCH src/current/header.R
338 935a568c Florent Chuffart
339 935a568c Florent Chuffart
340 dadb6a4d Florent Chuffart
The Script extract_maps.R
341 dadb6a4d Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^
342 dadb6a4d Florent Chuffart
This script is in charge of extracting Maps for well positioned and fuzzy nucleosomes. First of all, this script computed intra and inter strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects results and produces well positioned, fuzzy and UNR maps.
343 dadb6a4d Florent Chuffart
344 dadb6a4d Florent Chuffart
The well-positioned map for BY is collected in the result directory and is called **BY_wp.tab**. It is composed of following columns:
345 dadb6a4d Florent Chuffart
346 dadb6a4d Florent Chuffart
 - chr, the number of the chromosome 
347 dadb6a4d Florent Chuffart
 - lower_bound, the lower bound of the nucleosome
348 dadb6a4d Florent Chuffart
 - upper_bound, the upper bound of the nucleosome 
349 dadb6a4d Florent Chuffart
 - cur_index, index of the CUR
350 dadb6a4d Florent Chuffart
 - index_nuc, the index of the nucleosome in the CUR
351 dadb6a4d Florent Chuffart
 - wp, 1 if it is a well positioned nucleosome, 0 else
352 dadb6a4d Florent Chuffart
 - nb_reads, the number of reads that supports this nucleosome
353 dadb6a4d Florent Chuffart
 - nb_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates if it is a well-positioned nucleosome)
354 dadb6a4d Florent Chuffart
 - llr_1, for a well-positioned nucleosome, it is the LLR1 between the first and the second TemplateFilter nucleosome.
355 dadb6a4d Florent Chuffart
 - llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the first TemplateFilter nucleosome.
356 dadb6a4d Florent Chuffart
 - wp_llr, for a well-positioned nucleosome, it is the LLR2 overall TemplateFilter nucleosomes.
357 dadb6a4d Florent Chuffart
 - wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained with the LLR2 (**1-pchisq(2.LLR2, df=4)**)
358 dadb6a4d Florent Chuffart
 - dyad_shift, for a well-positioned nucleosome, it is shift between the two extreme TemplateFilter nucleosome dyad positions. 
359 dadb6a4d Florent Chuffart
360 dadb6a4d Florent Chuffart
The fuzzy map for BY is collected in the result directory and is called **BY_fuzzy.tab**. It is composed of following columns:
361 dadb6a4d Florent Chuffart
362 dadb6a4d Florent Chuffart
 - chr, the number of the chromosome 
363 dadb6a4d Florent Chuffart
 - lower_bound, the lower bound of the nucleosome
364 dadb6a4d Florent Chuffart
 - upper_bound, the upper bound of the nucleosome 
365 dadb6a4d Florent Chuffart
 - cur_index, index of the CUR
366 dadb6a4d Florent Chuffart
367 dadb6a4d Florent Chuffart
The common well-position map for BY and RM strains is collected in the result directory and is called **BY_RM_common_wp.tab**. It is composed of following columns:
368 dadb6a4d Florent Chuffart
369 dadb6a4d Florent Chuffart
 - cur_index, the index of the CUR
370 dadb6a4d Florent Chuffart
 - index_nuc_BY, the index of the BY nucleosome in the CUR
371 dadb6a4d Florent Chuffart
 - index_nuc_RM,the index of the RM nucleosome in the CUR
372 dadb6a4d Florent Chuffart
 - llr_score, the LLR3 score between th eBy and RM nucleosomes
373 dadb6a4d Florent Chuffart
 - common_wp_pval,  the p-value chi square test obtained with the LLR3 (**1-pchisq(2.LLR3, df=2)**)
374 dadb6a4d Florent Chuffart
375 dadb6a4d Florent Chuffart
The common UNR map for BY and RM strains is collected in the result directory and is called **BY_RM_common_unr.tab**. It is composed of following columns:
376 dadb6a4d Florent Chuffart
377 dadb6a4d Florent Chuffart
 - cur_index, the index of the CUR
378 dadb6a4d Florent Chuffart
 - index_nuc_BY, the index of the BY nucleosome in the CUR
379 dadb6a4d Florent Chuffart
 - index_nuc_RM,the index of the RM nucleosome in the CUR
380 dadb6a4d Florent Chuffart
381 dadb6a4d Florent Chuffart
To execute this script, run the following command line in your R console:
382 935a568c Florent Chuffart
383 935a568c Florent Chuffart
.. code:: bash
384 935a568c Florent Chuffart
385 dadb6a4d Florent Chuffart
  source("src/current/extract_maps.R")
386 dadb6a4d Florent Chuffart
387 dadb6a4d Florent Chuffart
388 dadb6a4d Florent Chuffart
The Script compare_common_wp.R
389 dadb6a4d Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
390 dadb6a4d Florent Chuffart
391 dadb6a4d Florent Chuffart
This script is used to compare inter strain distances between common well-positioned nucleosomes. 
392 dadb6a4d Florent Chuffart
393 dadb6a4d Florent Chuffart
For example, it compute the file **BY_RM_common_wp_diff.tab** that contains dyad shifts between two well-positioned nucleosomes. It is composed of following columns:
394 dadb6a4d Florent Chuffart
 - cur_index, the index of the CUR
395 dadb6a4d Florent Chuffart
 - index_nuc_BY, the index of the BY nucleosome in the CUR
396 dadb6a4d Florent Chuffart
 - index_nuc_RM,the index of the RM nucleosome in the CUR
397 dadb6a4d Florent Chuffart
 - llr_score, the LLR3 score between th eBy and RM nucleosomes
398 dadb6a4d Florent Chuffart
 - common_wp_pval,  the p-value chi square test obtained with the LLR3 (**1-pchisq(2.LLR3, df=2)**)
399 dadb6a4d Florent Chuffart
 - diff, the dyad shifts between two well-positioned nucleosomes
400 dadb6a4d Florent Chuffart
401 dadb6a4d Florent Chuffart
It also translates well-positioned nucleosome maps from a strain to an other strain and stores it into a table. 
402 dadb6a4d Florent Chuffart
403 dadb6a4d Florent Chuffart
For example, the file **results/2014-04/RM_wp_tr_2_BY.tab** contains RM well-positioned nucleosome translated into the BY genome referential. It is composed of following columns:
404 dadb6a4d Florent Chuffart
405 dadb6a4d Florent Chuffart
 - strain_ref, the reference genome (in which positioned are defined)
406 dadb6a4d Florent Chuffart
 - begin, the translated lower bound of the nucleosome
407 dadb6a4d Florent Chuffart
 - end, the translated upper bound of the nucleosome
408 dadb6a4d Florent Chuffart
 - chr, the number of chromosome for the reference genome (in which positioned are defined)
409 dadb6a4d Florent Chuffart
 - length, the length of the nucleosome (could be negative)
410 dadb6a4d Florent Chuffart
 - cur_index, the index of the CUR
411 dadb6a4d Florent Chuffart
 - index_nuc, the index of the nucleosome in the CUR
412 dadb6a4d Florent Chuffart
413 935a568c Florent Chuffart
414 935a568c Florent Chuffart
415 dadb6a4d Florent Chuffart
To execute this script, run the following command line in your R console:
416 b20637ed Florent Chuffart
417 b20637ed Florent Chuffart
.. code:: bash
418 b20637ed Florent Chuffart
419 b20637ed Florent Chuffart
  R CMD BATCH src/current/compare_common_wp.R
420 b20637ed Florent Chuffart
421 b20637ed Florent Chuffart
422 b20637ed Florent Chuffart
Split and Compress Samples According CURs
423 b20637ed Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
424 b20637ed Florent Chuffart
425 b20637ed Florent Chuffart
.. code:: bash
426 b20637ed Florent Chuffart
427 b20637ed Florent Chuffart
  R CMD BATCH src/current/split_samples.R
428 b20637ed Florent Chuffart
429 b20637ed Florent Chuffart
430 935a568c Florent Chuffart
Count Reads for Each Nucleosome
431 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
432 935a568c Florent Chuffart
433 935a568c Florent Chuffart
.. code:: bash
434 935a568c Florent Chuffart
435 8e9facd8 Florent Chuffart
  R CMD BATCH src/current/count_reads.R
436 935a568c Florent Chuffart
437 935a568c Florent Chuffart
438 935a568c Florent Chuffart
Get Size Factors Using DESeq
439 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
440 935a568c Florent Chuffart
441 935a568c Florent Chuffart
.. code:: bash
442 935a568c Florent Chuffart
443 8e9facd8 Florent Chuffart
  R CMD BATCH src/current/get_size_factors.R
444 935a568c Florent Chuffart
445 935a568c Florent Chuffart
446 935a568c Florent Chuffart
Performing DESeq Analysis
447 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^
448 935a568c Florent Chuffart
449 935a568c Florent Chuffart
.. code:: bash
450 935a568c Florent Chuffart
451 8e9facd8 Florent Chuffart
  R CMD BATCH src/current/launch_deseq.R
452 935a568c Florent Chuffart
453 935a568c Florent Chuffart
454 935a568c Florent Chuffart
Results
455 935a568c Florent Chuffart
-------
456 935a568c Florent Chuffart
457 935a568c Florent Chuffart
Output Files Organisation
458 935a568c Florent Chuffart
^^^^^^^^^^^^^^^^^^^^^^^^^
459 935a568c Florent Chuffart
Previous steps produce following 45 files. 
460 935a568c Florent Chuffart
Each filename is under the form 
461 935a568c Florent Chuffart
462 935a568c Florent Chuffart
.. code:: bash
463 935a568c Florent Chuffart
464 8e9facd8 Florent Chuffart
  results/current/[combi]_[marker]_[form]_snep.tab 
465 935a568c Florent Chuffart
466 935a568c Florent Chuffart
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination, marker is 
467 935a568c Florent Chuffart
in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each post translational 
468 935a568c Florent Chuffart
histone modification and form is in {wp, fuzzy, wpfuzzy} considering well 
469 dadb6a4d Florent Chuffart
positioned nucleosomes, fuzzy nucleosomes or both for SNEP computation.
470 935a568c Florent Chuffart
471 935a568c Florent Chuffart
472 935a568c Florent Chuffart
473 935a568c Florent Chuffart
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM lower_bound_RM upper_bound_RM index_nuc_RM roi_index form BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4 RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 
474 935a568c Florent Chuffart
BY_H3K14ac_37 BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM 
475 935a568c Florent Chuffart
476 935a568c Florent Chuffart
For each file, there is 1 line per nucleosome and each line is composed of many columns divided into 3 main topics:
477 935a568c Florent Chuffart
  - nuc information
478 935a568c Florent Chuffart
  - number opf reads for each sample
479 935a568c Florent Chuffart
  - DESeq analysis results.
480 935a568c Florent Chuffart
481 935a568c Florent Chuffart
For exemple for the file *BY_RM_H3K14ac_wp_snep.tab* informations are: 
482 935a568c Florent Chuffart
  - chr_BY, the BY chr involved
483 935a568c Florent Chuffart
  - lower_bound_BY, the lower bound of the BY nuc
484 935a568c Florent Chuffart
  - upper_bound_BY, the upper_bound of the BY nuc
485 935a568c Florent Chuffart
  - index_nuc_BY, the index of the nuc in the entire list of BY nucs
486 935a568c Florent Chuffart
  - chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM 
487 935a568c Florent Chuffart
	are the same information for the RM strain
488 935a568c Florent Chuffart
  - roi_index, the index of the region of interrest involved.
489 935a568c Florent Chuffart
  
490 935a568c Florent Chuffart
Next cols concern indicators for each sample. They are labeled [strain]_[marker]_[sample_id] and each value represents the number of reads for the current nuc for the sample *sample_id*. 
491 935a568c Florent Chuffart
492 935a568c Florent Chuffart
The 5 final columns concern DESeq analysis:
493 935a568c Florent Chuffart
  - manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect.  
494 935a568c Florent Chuffart
  - pvalsGLM, the pvalue resulting of the comparison of the GLM model considering or the interaction term *marker:strain* 
495 935a568c Florent Chuffart
  - snep_index, a boolean set to TRUE if the *pvalueGLM* value is under the threshold computed with FDR function with a rate set to 0.01%. 
496 935a568c Florent Chuffart
497 935a568c Florent Chuffart
It also produces the file that explicts size factor for each involved sample in differents strain combination and nucleosomal region type:
498 935a568c Florent Chuffart
499 8e9facd8 Florent Chuffart
TODO: include this file... /home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt
500 935a568c Florent Chuffart
501 935a568c Florent Chuffart
502 935a568c Florent Chuffart
.. code:: bash
503 935a568c Florent Chuffart
504 8e9facd8 Florent Chuffart
  results/current/size_factors.tab
505 935a568c Florent Chuffart
506 935a568c Florent Chuffart
507 935a568c Florent Chuffart
508 935a568c Florent Chuffart
509 935a568c Florent Chuffart
Number of SNEPs
510 935a568c Florent Chuffart
^^^^^^^^^^^^^^^
511 935a568c Florent Chuffart
512 935a568c Florent Chuffart
Here are the number of computed for each forms.
513 935a568c Florent Chuffart
514 935a568c Florent Chuffart
.. code:: bash
515 935a568c Florent Chuffart
516 935a568c Florent Chuffart
  [1] "wp"
517 935a568c Florent Chuffart
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
518 935a568c Florent Chuffart
  BY-RM  30234     520     798     83    3566      26
519 935a568c Florent Chuffart
  BY-YJM 31298     303     619    102     103     128
520 935a568c Florent Chuffart
  RM-YJM 29863     129     340     46    3177      18
521 935a568c Florent Chuffart
  [1] "fuzzy"
522 935a568c Florent Chuffart
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
523 935a568c Florent Chuffart
  BY-RM  10748     294     308    101    1681      42
524 935a568c Florent Chuffart
  BY-YJM 10669     122     176    124      93      87
525 935a568c Florent Chuffart
  RM-YJM 11478      54     112     41    1389      20
526 935a568c Florent Chuffart
  [1] "wpfuzzy"
527 935a568c Florent Chuffart
         #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
528 935a568c Florent Chuffart
  BY-RM  40982     770    1136    183    5404      73
529 935a568c Florent Chuffart
  BY-YJM 41967     439     804    214     198     199
530 935a568c Florent Chuffart
  RM-YJM 41341     184     468     87    4687      37
531 935a568c Florent Chuffart
532 935a568c Florent Chuffart
533 935a568c Florent Chuffart
TODO: 
534 935a568c Florent Chuffart
  - Print/study intra/inter strain LODs.
535 935a568c Florent Chuffart
  - Check the normality of sample using Shapiro–Wilk (Hypothesis for computing LODs)
536 935a568c Florent Chuffart