Tutorial
********

This tutorial describes the steps needed to perform a quantitative
analysis of the nucleosomal epigenome. We assume that files are
organised according to a given hierarchy and that all command lines
are launched from the project's root.

This tutorial is divided into two main parts. The first one consists
of the Python script *wf.py*, which aligns and converts Illumina
reads. The second one is the R script *main.r*, which extracts
information (nucleosome positions and indicators) from the dataset.


Dataset and Configuration File
==============================

We want to compare the nucleosomes of 3 yeast strains:

* BY

* RM

* YJM

For each strain we perform MNase-Seq and ChIP-Seq using the following
5 markers:

* H3K4me1

* H3K4me3

* H3K9ac

* H3K14ac

* H4K12ac

In order to simplify the design of experiment, we consider MNase as a
marker. For each pair *(strain, marker)* we perform 3 replicates. So,
theoretically, we should have *3 * (1 + 5) * 3 = 54* samples. In
practice we only obtained 2 replicates for *(YJM, H3K4me1)*. Each one
of the 53 samples is identified by a unique identifier. The file
*CSV_SAMPLE_FILE* sums up this information.

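The sample count above can be checked with a short sketch. The names
*planned*, *missing* and *obtained* are illustrative only, and the
replicate number chosen for the missing sample is an arbitrary
assumption (the text only says that one replicate is missing).

```python
from itertools import product

strains = ["BY", "RM", "YJM"]
# MNase is treated as a marker, next to the 5 histone marks.
markers = ["Mnase", "H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
replicates = [1, 2, 3]

# Full design: every (strain, marker, replicate) combination.
planned = list(product(strains, markers, replicates))

# Assumption: which replicate of (YJM, H3K4me1) is missing is not stated.
missing = {("YJM", "H3K4me1", 3)}
obtained = [sample for sample in planned if sample not in missing]

print(len(planned), len(obtained))
```
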
configurator.CSV_SAMPLE_FILE = None

   Path to the CSV file that contains sample information.

We use a convention to link a sample to its Illumina fastq outputs.
The Illumina output files of the sample *ID* are stored in the
directory *ILLUMINA_OUTPUTFILE_PREFIX* + *ID*. For example, the
outputs of sample 41 are stored in the directory
*data/2012-09-05/FASTQ/Sample_Yvert_Bq41/*.

configurator.ILLUMINA_OUTPUTFILE_PREFIX = None

   Prefix for Illumina fastq output files.

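A minimal sketch of this naming convention follows. The helper
*illumina_output_dir* is hypothetical, and the prefix value is only
inferred from the sample 41 example above; the trailing slash is an
assumption.

```python
def illumina_output_dir(prefix, sample_id):
    """Return the directory holding the fastq outputs of one sample."""
    return "%s%s/" % (prefix, sample_id)

# Assumption: prefix inferred from the example directory of sample 41.
prefix = "data/2012-09-05/FASTQ/Sample_Yvert_Bq"
print(illumina_output_dir(prefix, 41))
```
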
For BY (resp. RM and YJM) we use the following reference genome:
*saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta* (resp.
*saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta* and
*saccharomyces_cerevisiae_YJM_789_screencontig.fasta*). The index
*FASTA_REFERENCE_GENOME_FILES* stores this information.

configurator.FASTA_REFERENCE_GENOME_FILES = None

   Dictionary where each fasta reference genome is indexed by the
   reference strain it corresponds to.

Each chromosome/contig is identified in the fasta file by an obscure
identifier. For example, BY chromosome I is identified by
*gi|144228165|ref|NC_001133.7|*, whereas TemplateFilter expects an
integer. So, we translate it. The index *FASTA_INDEXES* stores this
translation.

configurator.FASTA_INDEXES = None

   Dictionary indexed by strain. Each value is a dictionary whose
   keys are chromosome references from the fasta file and whose
   values are the corresponding identifiers for TemplateFilter.

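As an illustration, the translation index could look like the sketch
below. The function *translate_chr* is hypothetical, and mapping BY
chromosome I to the integer 1 is an assumption; the actual values live
in the json configuration file.

```python
# Assumption: chromosome I of BY is translated to the integer 1.
FASTA_INDEXES = {
    "BY": {"gi|144228165|ref|NC_001133.7|": 1},
}

def translate_chr(strain, fasta_id, indexes=FASTA_INDEXES):
    """Translate a fasta identifier into the integer TemplateFilter expects."""
    return indexes[strain][fasta_id]

print(translate_chr("BY", "gi|144228165|ref|NC_001133.7|"))
```
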
From a pragmatic point of view, we discard some parts of the genome
(repeated sequences, etc.). The blacklisted areas are explicitly
listed in *AREA_BLACK_LIST*.

configurator.AREA_BLACK_LIST = None

   Dictionary where keys are strains and values are the blacklisted
   genome regions.

For the BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we
use the previously computed .c2c file
*data/2012-03_primarydata/BY_RM_gxcomp.c2c* (resp.
*BY_YJM_GComp_All.c2c* and *RM_YJM_gxcomp.c2c*). For more information
about .c2c files, please read section 5 of the manual of
*NucleoMiner*, the old version of *NucleoMiner2*
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).

configurator.C2C_FILES = None

   Dictionary where each strain combination indexes a genome
   alignment.

*nucleominer* works in specific directories, which are described in
*INDEX_DIR*, *ALIGN_DIR* and *LOG_DIR*.

Finally, *nucleominer* uses external resources. The paths to these
resources are described in *BOWTIE_BUILD_BIN*, *BOWTIE2_BIN*,
*SAMTOOLS_BIN*, *BEDTOOLS_BIN*, *TF_BIN* and *TF_TEMPLATES_FILE*.

All paths, prefixes and indexes can be changed in the
*src/current/nucleominer_config.json* file.

wf.json_conf_file = 'src/nucleo_miner/nucleo_miner_config.json'

   Path to the json configuration file.


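A minimal sketch of how such a json configuration file can be read is
given here. The function *load_config* and all values in
*example_config* are placeholders chosen for illustration; only the key
names and the read length of 50 come from this tutorial.

```python
import json
import os
import tempfile

# Hypothetical excerpt of the configuration; the paths are placeholders.
example_config = {
    "INDEX_DIR": "data/index",
    "ALIGN_DIR": "data/align",
    "LOG_DIR": "data/log",
    "READ_LENGTH": 50,
}

def load_config(path):
    """Read the json configuration file and return it as a dictionary."""
    with open(path) as handle:
        return json.load(handle)

# Write the excerpt to a temporary file, then read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "nucleominer_config.json")
    with open(path, "w") as handle:
        json.dump(example_config, handle, indent=2)
    config = load_config(path)

print(config["READ_LENGTH"])
```
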
Preprocessing Illumina Fastq Reads for Each Sample
==================================================

This preprocessing consists of the 4 main steps embedded in the
*wf.py* script and described below. As a preamble, this script
computes *samples*, *samples_mnase* and *strains*, which are used
throughout the 4 steps.

wf.samples = []

   List of samples, where a sample is identified by an id (key *id*)
   and a strain name (key *strain*).

wf.samples_mnase = []

   List of MNase samples.

wf.strains = []

   List of reference strains.


Creating Bowtie Index from each Reference Genome
------------------------------------------------

For each strain, we need to create a Bowtie index. The Bowtie index
of a strain is a compressed, searchable representation (based on the
Burrows-Wheeler transform) of the reference genome for this strain.
It is used by bowtie to align reads. This step is performed by the
following part of the *wf.py* script:

   for strain in strains:
       per_strain_stats[strain] = create_bowtie_index(strain,
           config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"],
           config["BOWTIE_BUILD_BIN"])

The following table sums up the file sizes and process durations for
this step.

+--------+------------------------+------------------------+------------------+
| strain | fasta genome file size | bowtie index file size | process duration |
+========+========================+========================+==================+
| BY     | 12 MB                  | 25 MB                  | 11 s.            |
+--------+------------------------+------------------------+------------------+
| RM     | 12 MB                  | 24 MB                  | 9 s.             |
+--------+------------------------+------------------------+------------------+
| YJM    | 12 MB                  | 25 MB                  | 11 s.            |
+--------+------------------------+------------------------+------------------+


Aligning Reads to Reference Genome
----------------------------------

Next, we launch bowtie to align reads to the reference genome. It
produces a *.sam* file that we convert into a *.bed* file. The
binaries for *bowtie*, *samtools* and *bedtools* are wrapped using the
Python *subprocess* module. This step is performed by the following
part of the *wf.py* script:

   for sample in samples:
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample,
           config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"],
           config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"],
           config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])


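The sketch below illustrates how such binaries can be wrapped with
*subprocess*. It is not the body of *align_reads*: the functions
*build_align_commands* and *run_commands*, the file names, the
single-end read assumption and the exact option set are all
assumptions. Only well-known options are used (*bowtie2 -p/-x/-U/-S*,
*samtools view -b -S -o*, *bedtools bamtobed -i*).

```python
import subprocess

def build_align_commands(fastq, index_prefix, out_prefix,
                         bowtie2_bin="bowtie2", samtools_bin="samtools",
                         bedtools_bin="bedtools", threads=3):
    """Return (argv, stdout_path) pairs for the fastq -> sam -> bam -> bed chain."""
    sam = out_prefix + ".sam"
    bam = out_prefix + ".bam"
    bed = out_prefix + ".bed"
    return [
        # Align single-end reads and write a .sam file.
        ([bowtie2_bin, "-p", str(threads), "-x", index_prefix,
          "-U", fastq, "-S", sam], None),
        # Convert .sam to .bam.
        ([samtools_bin, "view", "-b", "-S", "-o", bam, sam], None),
        # bamtobed writes to stdout, so stdout is redirected to the .bed file.
        ([bedtools_bin, "bamtobed", "-i", bam], bed),
    ]

def run_commands(commands):
    """Run each command in order and stop on the first failure."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.check_call(argv)
        else:
            with open(stdout_path, "w") as out:
                subprocess.check_call(argv, stdout=out)

# Only build and display the commands here; nothing is executed.
commands = build_align_commands("reads.fastq", "index/BY", "align/sample_1")
for argv, stdout_path in commands:
    print(" ".join(argv))
```

Calling *run_commands(commands)* would execute the chain, provided
that the three binaries are installed and the input files exist.
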
Convert Aligned Reads for TemplateFilter
----------------------------------------

TemplateFilter uses a particular input format for reads, so we
convert the *.bed* file. TemplateFilter expects reads in the
following format: *chr coord strand #reads*, where:

* chr is the number of the chromosome;

* coord is the coordinate of the read;

* strand is *F* for forward and *R* for reverse;

* #reads is the number of reads at this position.

Fields are *tab*-separated.

**WARNING** For the reverse strand, bowtie returns the position of the
leftmost nucleotide, whereas TemplateFilter expects the rightmost
one. So this step takes it into account and adds the read length (in
our case 50) to the coordinates of reverse reads.

This step is performed by the following part of the *wf.py* script:

   for sample in samples:
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample,
           config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"],
           config["READ_LENGTH"], config["MAPQ_THRES"])

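The conversion described above can be sketched as follows. This is a
simplified illustration, not the code of *split_fr_4_TF*: the function
*bed_to_tf_lines* is hypothetical, it assumes 6-column BED input
(chrom, start, end, name, score, strand), and it ignores the black
list and the mapping quality threshold.

```python
from collections import Counter

def bed_to_tf_lines(bed_lines, chr_index, read_length=50):
    """Count reads per (chr, coord, strand) and format them for TemplateFilter."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        if strand == "+":
            key = (chr_index[chrom], start, "F")
        else:
            # Reverse reads: shift the left position by the read length.
            key = (chr_index[chrom], start + read_length, "R")
        counts[key] += 1
    return ["%d\t%d\t%s\t%d" % (c, pos, s, n)
            for (c, pos, s), n in sorted(counts.items())]

chrom_id = "gi|144228165|ref|NC_001133.7|"
bed = [
    chrom_id + "\t100\t150\tread1\t42\t+",
    chrom_id + "\t100\t150\tread2\t42\t+",
    chrom_id + "\t200\t250\tread3\t42\t-",
]
tf_lines = bed_to_tf_lines(bed, {chrom_id: 1})
print("\n".join(tf_lines))
```
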
The following table sums up the number of reads, the file sizes and
the process durations for the last two steps. In our case, the
alignment process was multithreaded over 3 cores.

+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| id | Illumina reads | aligned and filtered reads | ratio  | *.bed* file size | TF input file size | process duration |
+====+================+============================+========+==================+====================+==================+
| 1  | 16436138       | 10199695                   | 62.06% | 1064 MB          | 60 MB              | 383 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 2  | 16911132       | 12512727                   | 73.99% | 1298 MB          | 64 MB              | 437 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 3  | 15946902       | 12340426                   | 77.38% | 1280 MB          | 65 MB              | 423 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 4  | 13765584       | 10381903                   | 75.42% | 931 MB           | 59 MB              | 352 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 5  | 15168268       | 11502855                   | 75.83% | 1031 MB          | 64 MB              | 386 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 6  | 18850820       | 14024905                   | 74.40% | 1254 MB          | 69 MB              | 482 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 7  | 15591124       | 12126623                   | 77.78% | 1163 MB          | 72 MB              | 405 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 8  | 15659905       | 12475664                   | 79.67% | 1194 MB          | 71 MB              | 416 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 9  | 14668641       | 10960565                   | 74.72% | 1052 MB          | 70 MB              | 375 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 10 | 14339179       | 10454451                   | 72.91% | 1049 MB          | 51 MB              | 363 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 11 | 18019895       | 13688774                   | 75.96% | 1378 MB          | 59 MB              | 474 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 12 | 13746796       | 10810022                   | 78.64% | 1084 MB          | 54 MB              | 360 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 13 | 15205065       | 11766016                   | 77.38% | 990 MB           | 54 MB              | 381 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 14 | 17803097       | 13838883                   | 77.73% | 1154 MB          | 60 MB              | 452 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 15 | 15434564       | 12307878                   | 79.74% | 1032 MB          | 57 MB              | 394 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 16 | 16802587       | 12725665                   | 75.74% | 1221 MB          | 48 MB              | 438 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 17 | 16058417       | 12513734                   | 77.93% | 1192 MB          | 63 MB              | 422 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 18 | 16154482       | 13204331                   | 81.74% | 1277 MB          | 52 MB              | 430 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 19 | 21013924       | 17102120                   | 81.38% | 1646 MB          | 59 MB              | 555 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 20 | 17213114       | 14433357                   | 83.85% | 1389 MB          | 53 MB              | 459 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 21 | 17360907       | 14733001                   | 84.86% | 1203 MB          | 55 MB              | 450 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 22 | 18136816       | 15389581                   | 84.85% | 1257 MB          | 53 MB              | 469 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 23 | 14763678       | 12173025                   | 82.45% | 1140 MB          | 56 MB              | 393 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 24 | 15541709       | 12890345                   | 82.94% | 1057 MB          | 48 MB              | 398 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 25 | 16433215       | 13094314                   | 79.68% | 1241 MB          | 57 MB              | 433 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 26 | 17370850       | 14264136                   | 82.12% | 1347 MB          | 51 MB              | 466 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 27 | 14613512       | 8654495                    | 59.22% | 887 MB           | 56 MB              | 339 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 28 | 15248545       | 11367589                   | 74.55% | 1166 MB          | 67 MB              | 405 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 29 | 14316809       | 10767926                   | 75.21% | 1103 MB          | 63 MB              | 379 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 30 | 15178058       | 12265794                   | 80.81% | 1030 MB          | 66 MB              | 390 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 31 | 14968579       | 11876186                   | 79.34% | 1009 MB          | 63 MB              | 387 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 32 | 16912705       | 13550508                   | 80.12% | 1143 MB          | 70 MB              | 442 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 33 | 16782154       | 12755111                   | 76.00% | 1227 MB          | 65 MB              | 438 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 34 | 16741443       | 13168071                   | 78.66% | 1260 MB          | 71 MB              | 442 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 35 | 13096171       | 10367041                   | 79.16% | 992 MB           | 62 MB              | 350 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 36 | 17715118       | 14092985                   | 79.55% | 1404 MB          | 68 MB              | 483 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 37 | 17288466       | 7402082                    | 42.82% | 741 MB           | 48 MB              | 339 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 38 | 16116394       | 13178457                   | 81.77% | 1101 MB          | 63 MB              | 420 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 39 | 14241106       | 10537228                   | 73.99% | 880 MB           | 57 MB              | 348 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 40 | 13784738       | 10598464                   | 76.89% | 1005 MB          | 64 MB              | 358 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 41 | 12438007       | 9620975                    | 77.35% | 911 MB           | 60 MB              | 326 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 42 | 13853959       | 11031238                   | 79.63% | 1045 MB          | 64 MB              | 365 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 43 | 12036162       | 6654780                    | 55.29% | 684 MB           | 46 MB              | 268 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 44 | 13873129       | 10251074                   | 73.89% | 1048 MB          | 61 MB              | 365 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 45 | 19817751       | 14904502                   | 75.21% | 1520 MB          | 72 MB              | 528 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 46 | 13368959       | 10818619                   | 80.92% | 912 MB           | 63 MB              | 350 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 47 | 7566467        | 6139001                    | 81.13% | 520 MB           | 44 MB              | 201 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 48 | 32586928       | 21191363                   | 65.03% | 1816 MB          | 82 MB              | 766 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 49 | 30733184       | 18791373                   | 61.14% | 1801 MB          | 89 MB              | 721 s.           |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 50 | 41287616       | 30383875                   | 73.59% | 2911 MB          | 112 MB             | 1065 s.          |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 51 | 40439965       | 31177914                   | 77.10% | 2981 MB          | 117 MB             | 1070 s.          |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 53 | 40876476       | 33780065                   | 82.64% | 3316 MB          | 103 MB             | 1165 s.          |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+
| 55 | 52424414       | 47117107                   | 89.88% | 3811 MB          | 119 MB             | 1477 s.          |
+----+----------------+----------------------------+--------+------------------+--------------------+------------------+

For various reasons (experimental efficiency, e.g. PCR), we remove
samples 33, 45, 48 and 55.


Run TemplateFilter on MNase Samples
-----------------------------------

Finally, for each MNase sample we perform the TemplateFilter analysis.

**WARNING** TemplateFilter returns a list of nucleosomes. Each
nucleosome is defined by its center and its width. An odd width leads
us to consider non-integer lower and upper bounds.

**WARNING** TemplateFilter is not designed to deal with replicates.
So we choose to keep a maximum of nucleosomes and to filter them
afterwards, taking advantage of the replicates. To do that, we set a
low correlation threshold parameter (*0.5*) and a particularly high
overlap value (*300%*).

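The first warning can be illustrated with a small sketch. It assumes
that the bounds are computed as center minus and plus half the width;
the function *nucleosome_bounds* and the example values are
hypothetical.

```python
def nucleosome_bounds(center, width):
    """Return (lower_bound, upper_bound), assuming bounds = center +/- width / 2."""
    half = width / 2.0
    return center - half, center + half

# An odd width gives non-integer bounds, an even width gives integer ones.
print(nucleosome_bounds(1000, 147))
print(nucleosome_bounds(1000, 146))
```
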
This step is performed by the following part of the *wf.py* script:

   for sample in samples_mnase:
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample,
           config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"],
           config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"],
           config["TF_MAXW"], config["TF_OL"])

+----+--------+------------+---------------+------------------+
| id | strain | found nucs | nuc file size | process duration |
+====+========+============+===============+==================+
| 1  | BY     | 96214      | 68 MB         | 1022 s.          |
+----+--------+------------+---------------+------------------+
| 2  | BY     | 91694      | 65 MB         | 1038 s.          |
+----+--------+------------+---------------+------------------+
| 3  | BY     | 91205      | 65 MB         | 1036 s.          |
+----+--------+------------+---------------+------------------+
| 4  | RM     | 88076      | 62 MB         | 984 s.           |
+----+--------+------------+---------------+------------------+
| 5  | RM     | 90141      | 64 MB         | 967 s.           |
+----+--------+------------+---------------+------------------+
| 6  | RM     | 87517      | 62 MB         | 980 s.           |
+----+--------+------------+---------------+------------------+
| 7  | YJM    | 88945      | 64 MB         | 566 s.           |
+----+--------+------------+---------------+------------------+
| 8  | YJM    | 88689      | 64 MB         | 570 s.           |
+----+--------+------------+---------------+------------------+
| 9  | YJM    | 88128      | 63 MB         | 565 s.           |
+----+--------+------------+---------------+------------------+


Inferring Nucleosome Position and Extracting Read Counts
========================================================

The second part of the tutorial uses *R*
(http://www.r-project.org). It consists of the following main steps:

* compute_rois.R

* extract_maps.R

* compare_common_wp.R

* split_samples.R

* count_reads.R

* get_size_factors.R

* launch_deseq.R


Computing Common Genome Region Between Strains
----------------------------------------------

   R CMD BATCH src/current/compute_rois.R


Extracting Maps for Well-Positioned and Fuzzy Nucleosomes
---------------------------------------------------------

   R CMD BATCH src/current/extract_maps.R


Compute Distance Between Well-Positioned Nucleosomes
----------------------------------------------------

   R CMD BATCH src/current/compare_common_wp.R


Split and Compress Samples According to CURs
--------------------------------------------

   R CMD BATCH src/current/split_samples.R


Count Reads for Each Nucleosome
-------------------------------

   R CMD BATCH src/current/count_reads.R


Get Size Factors Using DESeq
----------------------------

   R CMD BATCH src/current/get_size_factors.R


Performing DESeq Analysis
-------------------------

   R CMD BATCH src/current/launch_deseq.R


Results
=======


Output Files Organisation
-------------------------

The previous steps produce the following 45 files. Each filename has
the form

   results/current/[combi]_[marker]_[form]_snep.tab

where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination,
marker is in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each
post-translational histone modification and form is in {wp, fuzzy,
wpfuzzy}, considering well-positioned nucleosomes, fuzzy nucleosomes
or both for SNEP computation.

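The naming scheme above can be enumerated with a short sketch, which
also confirms the count of 3 * 5 * 3 = 45 files. The variable names
are illustrative only.

```python
from itertools import product

combis = ["BY_RM", "BY_YJM", "RM_YJM"]
markers = ["H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
forms = ["wp", "fuzzy", "wpfuzzy"]

# One result file per (combi, marker, form) combination.
filenames = ["results/current/%s_%s_%s_snep.tab" % (combi, marker, form)
             for combi, marker, form in product(combis, markers, forms)]

print(len(filenames))
print(filenames[0])
```
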
   chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM
   lower_bound_RM upper_bound_RM index_nuc_RM roi_index form
   BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4
   RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 BY_H3K14ac_37
   BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM

For each file, there is 1 line per nucleosome and each line is
composed of many columns divided into 3 main topics:

* nucleosome information

* number of reads for each sample

* DESeq analysis results.

For example, for the file *BY_RM_H3K14ac_wp_snep.tab* the nucleosome
information is:

* chr_BY, the BY chromosome involved

* lower_bound_BY, the lower bound of the BY nucleosome

* upper_bound_BY, the upper bound of the BY nucleosome

* index_nuc_BY, the index of the nucleosome in the entire list of BY
  nucleosomes

* chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM are the same
  information for the RM strain

* roi_index, the index of the region of interest involved.

The next columns concern indicators for each sample. They are labeled
[strain]_[marker]_[sample_id] and each value represents the number of
reads for the current nucleosome in the sample *sample_id*.

The 5 final columns concern the DESeq analysis:

* manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain],
  the manip (marker) effect, the strain effect and the SNEP effect.

* pvalsGLM, the p-value resulting from the comparison of the GLM
  models with and without the interaction term *marker:strain*

* snep_index, a boolean set to TRUE if the *pvalsGLM* value is under
  the threshold computed with the FDR function with a rate set to
  0.01%.

It also produces a file that gives the size factor of each sample
involved in the different strain combinations and nucleosomal region
types:

TODO: include this file...
/home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt

   results/current/size_factors.tab


Number of SNEPs
---------------

Here is the number of SNEPs computed for each form.

   [1] "wp"
          #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
   BY-RM  30234     520     798     83    3566      26
   BY-YJM 31298     303     619    102     103     128
   RM-YJM 29863     129     340     46    3177      18
   [1] "fuzzy"
          #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
   BY-RM  10748     294     308    101    1681      42
   BY-YJM 10669     122     176    124      93      87
   RM-YJM 11478      54     112     41    1389      20
   [1] "wpfuzzy"
          #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
   BY-RM  40982     770    1136    183    5404      73
   BY-YJM 41967     439     804    214     198     199
   RM-YJM 41341     184     468     87    4687      37

TODO:

* Print/study intra/inter strain LODs.

* Check the normality of samples using Shapiro–Wilk (hypothesis for
  computing LODs).