root / README @ c25275e2
Historique | Voir | Annoter | Télécharger (29,84 ko)
1 |
|
---|---|
2 |
Welcome to *NucleoMiner2* |
3 |
************************* |
4 |
|
5 |
* Readme / Documentation for *NucleoMiner2* |
6 |
|
7 |
* License |
8 |
|
9 |
* Installation Instructions |
10 |
|
11 |
* Tutorial |
12 |
|
13 |
* Experimental Dataset, Working Directory and Configuration File |
14 |
|
15 |
* Preprocessing Illumina Fastq Reads for Each Sample |
16 |
|
17 |
* Inferring Nucleosome Position and Extracting Read Counts |
18 |
|
19 |
* Results: Number of SNEPs |
20 |
|
21 |
* APPENDICE: Generate .c2c Files |
22 |
|
23 |
* References |
24 |
|
25 |
* Python Reference |
26 |
|
27 |
* R Reference |
28 |
|
29 |
Readme / Documentation for *NucleoMiner2* |
30 |
***************************************** |
31 |
|
32 |
*NucleoMiner2* offers Python API and R package allowing to perform |
33 |
quantitative analysis of epigenetic marks on individual nucleosomes. |
34 |
It was developed to detect natural Single-Nucleosome Epi-Polymorphisms |
35 |
(SNEP) from MNase-seq and ChIP-seq data. |
36 |
|
37 |
|
38 |
License |
39 |
======= |
40 |
|
41 |
Copyright CNRS 2012-2013 |
42 |
|
43 |
* Florent CHUFFART |
44 |
|
45 |
* Jean-Baptiste VEYRIERAS |
46 |
|
47 |
* Gael YVERT |
48 |
|
49 |
This software is a computer program which purpose is to perform |
50 |
quanti- tative analysis of epigenetic marks at single nucleosome |
51 |
resolution. |
52 |
|
53 |
This software is governed by the CeCILL license under French law and |
54 |
abiding by the rules of distribution of free software. You can use, |
55 |
modify and/ or redistribute the software under the terms of the CeCILL |
56 |
license as circulated by CEA, CNRS and INRIA at the following URL |
57 |
"http://www.cecill.info". |
58 |
|
59 |
As a counterpart to the access to the source code and rights to copy, |
60 |
modify and redistribute granted by the license, users are provided |
61 |
only with a limited warranty and the software's author, the holder |
62 |
of the economic rights, and the successive licensors have only |
63 |
limited liability. |
64 |
|
65 |
This software is provided with absolutely NO WARRANTY. The authors can |
66 |
not be held responsible, even partially, for any damage, loss, |
67 |
financial loss or any other undesired facts resulting from the use of |
68 |
the software. In this respect, the user's attention is drawn to the |
69 |
risks associated with loading, using, modifying and/or developing or |
70 |
reproducing the software by the user in light of its specific status |
71 |
of free software, that may mean that it is complicated to manipulate, |
72 |
and that also therefore means that it is reserved for developers |
73 |
and experienced professionals having in-depth computer knowledge. |
74 |
Users are therefore encouraged to load and test the software's |
75 |
suitability as regards their requirements in conditions enabling the |
76 |
security of their systems and/or data to be ensured and, more |
77 |
generally, to use and operate it in the same conditions as regards |
78 |
security. |
79 |
|
80 |
The fact that you are presently reading this means that you have had |
81 |
knowledge of the CeCILL license and that you accept its terms. |
82 |
|
83 |
|
84 |
Installation Instructions |
85 |
========================= |
86 |
|
87 |
|
88 |
Links |
89 |
----- |
90 |
|
91 |
*NucleoMiner2* home page and documentation are available here: |
92 |
|
93 |
* https://forge.cbp.ens-lyon.fr/redmine/projects/nucleominer |
94 |
|
95 |
The Yvert lab web page is accessible here: |
96 |
|
97 |
* http://www.ens-lyon.fr/LBMC/gisv/ |
98 |
|
99 |
|
100 |
Installation |
101 |
------------ |
102 |
|
103 |
The first installation step is to retrieve the source code of |
104 |
MyLabStocks. You can do this by typing the following command in a |
105 |
terminal. |
106 |
|
107 |
git clone http://forge.cbp.ens-lyon.fr/git/nucleominer |
108 |
|
109 |
|
110 |
Prerequisites |
111 |
~~~~~~~~~~~~~ |
112 |
|
113 |
To work properly, NucleoMiner2 needs that the following free software |
114 |
are installed and made available on your system: |
115 |
|
116 |
* Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2 |
117 |
|
118 |
* SAMtools http://samtools.sourceforge.net |
119 |
|
120 |
* bedtools http://code.google.com/p/bedtools/ |
121 |
|
122 |
* TemplateFilter |
123 |
http://compbio.cs.huji.ac.il/NucPosition/TemplateFiltering |
124 |
|
125 |
It also requires the following R packages to be installed on your |
126 |
system: |
127 |
|
128 |
* fork |
129 |
|
130 |
* rjson |
131 |
|
132 |
* seqinr |
133 |
|
134 |
* plotrix |
135 |
|
136 |
* DESeq |
137 |
|
138 |
* cachecache https://forge.cbp.ens- |
139 |
lyon.fr/redmine/projects/cachecache |
140 |
|
141 |
* bot https://forge.cbp.ens-lyon.fr/redmine/projects/bot |
142 |
|
143 |
* nucleominer https://forge.cbp.ens- |
144 |
lyon.fr/redmine/projects/nucleominer |
145 |
|
146 |
The first packages could be installed by typing the following command |
147 |
in an R console: |
148 |
|
149 |
install.packages(c("fork", "rjson", "seqinr", "plotrix")) |
150 |
source("http://bioconductor.org/biocLite.R") |
151 |
biocLite("DESeq") |
152 |
|
153 |
The last packages are available in the git repository they could be |
154 |
install by typing the following command in your terminal: |
155 |
|
156 |
cd nucleominer |
157 |
R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\ |
158 |
doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\ |
159 |
build/nucleominer_2.3.46.tar.gz |
160 |
|
161 |
Tutorial |
162 |
******** |
163 |
|
164 |
This tutorial describes steps allowing performing quantitative |
165 |
analysis of epigenetic marks on individual nucleosomes. We assume that |
166 |
files are organised according to a given hierarchy and that all |
167 |
command lines are launched from the project’s root directory. |
168 |
|
169 |
This tutorial is divided into two main parts. The first part covers |
170 |
the python script *wf.py* that aligns and converts short sequence |
171 |
reads. The second part covers the R scripts that extracts information |
172 |
(nucleosome position and indicators) from the dataset. |
173 |
|
174 |
|
175 |
Experimental Dataset, Working Directory and Configuration File |
176 |
============================================================== |
177 |
|
178 |
|
179 |
Working Directory Organisation |
180 |
------------------------------ |
181 |
|
182 |
After having install NucleoMiner2 environment (Previous section), go |
183 |
to the root working directory of the tutorial by typing the following |
184 |
command in a terminal: |
185 |
|
186 |
cd doc/Chuffart_NM2_workdir/ |
187 |
|
188 |
|
189 |
Retrieving Experimental Dataset |
190 |
------------------------------- |
191 |
|
192 |
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress |
193 |
(http://www.ebi.ac.uk/arrayexpress/) under accession number |
194 |
E-MTAB-2671. |
195 |
|
196 |
$$$ TODO explain how organise Experimental Dataset into the *data* |
197 |
directory of the working directory. |
198 |
|
199 |
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each |
200 |
strain we performed Mnase-Seq and ChIP-Seq using an antibody |
201 |
recognizing the H3K14ac epigenetic mark. |
202 |
|
203 |
The dataset is composed of 55 files organised as follows: |
204 |
|
205 |
* 3 replicates for BY MNase Seq |
206 |
|
207 |
* sample 1 (5 fastq.gz files) |
208 |
|
209 |
* sample 2 (5 fastq.gz files) |
210 |
|
211 |
* sample 3 (4 fastq.gz files) |
212 |
|
213 |
* 3 replicates for RM MNase Seq |
214 |
|
215 |
* sample 4 (4 fastq.gz files) |
216 |
|
217 |
* sample 5 (4 fastq.gz files) |
218 |
|
219 |
* sample 6 (5 fastq.gz files) |
220 |
|
221 |
* 3 replicates for BY ChIP Seq H3K14ac |
222 |
|
223 |
* sample 36 (5 fastq.gz files) |
224 |
|
225 |
* sample 37 (5 fastq.gz files) |
226 |
|
227 |
* sample 53 (9 fastq.gz files) |
228 |
|
229 |
* 2 replicates for RM ChIP Seq H3K14ac |
230 |
|
231 |
* sample 38 (5 fastq.gz files) |
232 |
|
233 |
* sample 39 (4 fastq.gz files) |
234 |
|
235 |
|
236 |
Python and R Common Configuration File |
237 |
-------------------------------------- |
238 |
|
239 |
First of all we define in one place some configuration variables that |
240 |
will be launched by python and R scripts. These variables are |
241 |
contained in file *configurator.py*. The execution of this python |
242 |
script dumps variables into the *nucleominer_config.json* file that |
243 |
will then be used by both R and python scripts. |
244 |
|
245 |
To do this, go to the root directory of your project and run the |
246 |
following command: |
247 |
|
248 |
python src/current/configurator.py |
249 |
|
250 |
|
251 |
Preprocessing Illumina Fastq Reads for Each Sample |
252 |
================================================== |
253 |
|
254 |
This preprocessing step consists of 4 main steps embedded in the |
255 |
*wf.py* script. They are described bellow. As a preamble, this script |
256 |
computes *samples*, *samples_mnase* and *strains* that will be used |
257 |
along the 4 steps. |
258 |
|
259 |
wf.samples = [] |
260 |
|
261 |
List of samples where a sample is identified by an id (key: *id*) |
262 |
and a strain name (key *strain*). |
263 |
|
264 |
wf.samples_mnase = [] |
265 |
|
266 |
List of Mnase samples. |
267 |
|
268 |
wf.strains = [] |
269 |
|
270 |
List of reference strains. |
271 |
|
272 |
|
273 |
Creating Bowtie Index from each Reference Genome |
274 |
------------------------------------------------ |
275 |
|
276 |
For each strain, we need to create bowtie index. Bowtie index of a |
277 |
strain is a tree view of the genome of this strain. It will be used by |
278 |
bowtie to align reads. This step is performed by the following part of |
279 |
the *wf.py* script: |
280 |
|
281 |
for strain in strains: |
282 |
per_strain_stats[strain] = create_bowtie_index(strain, |
283 |
config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], |
284 |
config["BOWTIE_BUILD_BIN"]) |
285 |
|
286 |
The following table summarizes the file sizes and process durations |
287 |
concerning this step. |
288 |
|
289 |
+--------+------------------------+------------------------+------------------+ |
290 |
| strain | fasta genome file size | bowtie index file size | process duration | |
291 |
+========+========================+========================+==================+ |
292 |
| BY | 12 Mo | 25 Mo | 11 s. | |
293 |
+--------+------------------------+------------------------+------------------+ |
294 |
| RM | 12 Mo | 24 Mo | 9 s. | |
295 |
+--------+------------------------+------------------------+------------------+ |
296 |
|
297 |
|
298 |
Aligning Reads to Reference Genome |
299 |
---------------------------------- |
300 |
|
301 |
Next, we launch bowtie to align reads to the reference genome. It |
302 |
produces a *.sam* file that we convert into a *.bed* file. Binaries |
303 |
for *bowtie*, *samtools* and *bedtools* are wrapped using python |
304 |
*subprocess* class. This step is performed by the following part of |
305 |
the *wf.py* script: |
306 |
|
307 |
for sample in samples: |
308 |
per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, |
309 |
config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], |
310 |
config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], |
311 |
config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"]) |
312 |
|
313 |
|
314 |
Convert Aligned Reads into TemplateFilter Format |
315 |
------------------------------------------------ |
316 |
|
317 |
TemplateFilter uses particular input formats for reads, so it is |
318 |
necessary to convert the *.bed* files. TemplateFilter expect reads as |
319 |
follows: *chr*, *coord*, *strand* and *#read* where: |
320 |
|
321 |
* *chr* is the number of the chromosome; |
322 |
|
323 |
* *coord* is the coordinate of the reads; |
324 |
|
325 |
* *strand* is *F* for forward and *R* for reverse; |
326 |
|
327 |
* *#reads* the number of reads covering this position. |
328 |
|
329 |
Each entry is *tab*-separated. |
330 |
|
331 |
**WARNING** for reverse strands, bowtie returns the position of the |
332 |
first nucleotide on the left hand side, whereas TemplateFilter expects |
333 |
the first one on the right hand side. This step takes this into |
334 |
account by adding the read length (in our case 50) to the reverse |
335 |
reads coordinates. |
336 |
|
337 |
This step is performed by the following part of the *wf.py* script: |
338 |
|
339 |
for sample in samples: |
340 |
per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, |
341 |
config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], |
342 |
config["READ_LENGTH"],config["MAPQ_THRES"]) |
343 |
|
344 |
The following table summarises the number of reads, the involved file |
345 |
sizes and process durations concerning the two last steps. In our |
346 |
case, alignment process have been multithreaded over 3 cores. |
347 |
|
348 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
349 |
| id | Illumina reads | aligned and filtred reads | ratio | *.bed* file size | TF input file size | process duration | |
350 |
+====+================+===========================+========+==================+====================+==================+ |
351 |
| 1 | 16436138 | 10199695 | 62,06% | 1064 Mo | 60 Mo | 383 s. | |
352 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
353 |
| 2 | 16911132 | 12512727 | 73,99% | 1298 Mo | 64 Mo | 437 s. | |
354 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
355 |
| 3 | 15946902 | 12340426 | 77,38% | 1280 Mo | 65 Mo | 423 s. | |
356 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
357 |
| 4 | 13765584 | 10381903 | 75,42% | 931 Mo | 59 Mo | 352 s. | |
358 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
359 |
| 5 | 15168268 | 11502855 | 75,83% | 1031 Mo | 64 Mo | 386 s. | |
360 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
361 |
| 6 | 18850820 | 14024905 | 74,40% | 1254 Mo | 69 Mo | 482 s. | |
362 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
363 |
| 36 | 17715118 | 14092985 | 79,55% | 1404 Mo | 68 Mo | 483 s. | |
364 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
365 |
| 37 | 17288466 | 7402082 | 42,82% | 741 Mo | 48 Mo | 339 s. | |
366 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
367 |
| 38 | 16116394 | 13178457 | 81,77% | 1101 Mo | 63 Mo | 420 s. | |
368 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
369 |
| 39 | 14241106 | 10537228 | 73,99% | 880 Mo | 57 Mo | 348 s. | |
370 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
371 |
| 53 | 40876476 | 33780065 | 82,64% | 3316 Mo | 103 Mo | 1165 s. | |
372 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
373 |
|
374 |
|
375 |
Run TemplateFilter on Mnase Samples |
376 |
----------------------------------- |
377 |
|
378 |
Finally, for each sample we perform TemplateFilter analysis. |
379 |
|
380 |
**WARNING** TemplateFilter returns a list of nucleosomes. Each |
381 |
nucleosome is define by its center and its width. An odd width leads |
382 |
us to consider non- integer lower and upper bound. |
383 |
|
384 |
**WARNING** TemplateFilter is not designed to deal with replicates. So |
385 |
we recommend to keep a maximum of nucleosomes and filter the aberrant |
386 |
ones afterwards using the benefits of having replicates. To do this, |
387 |
we set a low correlation threshold parameter (0.5) and a particularly |
388 |
high value of overlap (300%). |
389 |
|
390 |
This step is performed by the following part of the *wf.py* script: |
391 |
|
392 |
for sample in samples_mnase: |
393 |
per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, |
394 |
config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], |
395 |
config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], |
396 |
config["TF_MAXW"], config["TF_OL"]) |
397 |
|
398 |
+----+--------+------------+---------------+------------------+ |
399 |
| id | strain | found nucs | nuc file size | process duration | |
400 |
+====+========+============+===============+==================+ |
401 |
| 1 | BY | 96214 | 68 Mo | 1022 s. | |
402 |
+----+--------+------------+---------------+------------------+ |
403 |
| 2 | BY | 91694 | 65 Mo | 1038 s. | |
404 |
+----+--------+------------+---------------+------------------+ |
405 |
| 3 | BY | 91205 | 65 Mo | 1036 s. | |
406 |
+----+--------+------------+---------------+------------------+ |
407 |
| 4 | RM | 88076 | 62 Mo | 984 s. | |
408 |
+----+--------+------------+---------------+------------------+ |
409 |
| 5 | RM | 90141 | 64 Mo | 967 s. | |
410 |
+----+--------+------------+---------------+------------------+ |
411 |
| 6 | RM | 87517 | 62 Mo | 980 s. | |
412 |
+----+--------+------------+---------------+------------------+ |
413 |
|
414 |
|
415 |
Inferring Nucleosome Position and Extracting Read Counts |
416 |
======================================================== |
417 |
|
418 |
The second part of the tutorial uses R |
419 |
(http://http://www.r-project.org). It consists of a set of R scripts |
420 |
that will be sourced in an R from a console launched at the root of |
421 |
your project. These scripts are: |
422 |
|
423 |
* headers.R |
424 |
|
425 |
* extract_maps.R |
426 |
|
427 |
* translate_common_wp.R |
428 |
|
429 |
* split_samples.R |
430 |
|
431 |
* count_reads.R |
432 |
|
433 |
* get_size_factors |
434 |
|
435 |
* launch_deseq.R |
436 |
|
437 |
|
438 |
The Script headers.R |
439 |
-------------------- |
440 |
|
441 |
The script headers.R is included in each other scripts. It is in |
442 |
charge of: |
443 |
|
444 |
* launching libraries used in the scripts |
445 |
|
446 |
* launching configuration (design, strain, marker...) |
447 |
|
448 |
* computing and caching CURs (caching means storing the |
449 |
information in the computer's memory) |
450 |
|
451 |
Note that you can customize the function “translate”. This function |
452 |
allows you to use the alignments between genomes when performing |
453 |
various tasks. You may be using NucleoMiner2 to analyse data of a |
454 |
single strain, or of several strains. |
455 |
|
456 |
* All the data corresponds to the same strain (e.g. |
457 |
treatment/control, or only few mutations): Then in step 1), the |
458 |
regions to use are entire chromosomes. Instep 2) simply use the |
459 |
default translate function which is neutral. |
460 |
|
461 |
* The data come from two or more strains: In this case, edit a |
462 |
list of regions and customize the translate function which |
463 |
performs the correspondence between the different genomes. How we |
464 |
did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to |
465 |
the Appendice "Generate .c2c Files"), then use it to produce the |
466 |
list of regions and customise “translate”. |
467 |
|
468 |
In your R console, run the following command line: |
469 |
|
470 |
source("src/current/headers.R") |
471 |
|
472 |
|
473 |
The Script extract_maps.R |
474 |
------------------------- |
475 |
|
476 |
This script is in charge of extracting Maps for well-positioned and |
477 |
fuzzy nucleosomes. First of all, this script computes intra and inter- |
478 |
strain nucleosome maps for each CUR. This step is executed in parallel |
479 |
on many cores using the BoT library. Next, it collects results and |
480 |
produces well-positioned, fuzzy and UNR maps. |
481 |
|
482 |
The well-positioned map for BY is collected in the result directory |
483 |
and is called *BY_wp.tab*. It is composed of following columns: |
484 |
|
485 |
* chr, the number of the chromosome |
486 |
|
487 |
* lower_bound, the lower bound of the nucleosome |
488 |
|
489 |
* upper_bound, the upper bound of the nucleosome |
490 |
|
491 |
* cur_index, index of the CUR |
492 |
|
493 |
* index_nuc, the index of the nucleosome in the CUR |
494 |
|
495 |
* wp, 1 if it is a well positioned nucleosome, 0 otherwise |
496 |
|
497 |
* nb_reads, the number of reads that support this nucleosome |
498 |
|
499 |
* nb_nucs, the number of TemplateFilter nucleosome across |
500 |
replicates (= the number of replicates in which it is a well- |
501 |
positioned nucleosome) |
502 |
|
503 |
* llr_1, for a well-positioned nucleosome, it is the LLR1 (log- |
504 |
likelihood ratio) between the first and the second TemplateFilter |
505 |
nucleosome on the chain. |
506 |
|
507 |
* llr_2, for a well-positioned nucleosome, it is the LLR1 between |
508 |
the second and the third TemplateFilter nucleosome on the chain. |
509 |
|
510 |
* wp_llr, for a well-positioned nucleosome, it is the LLR2 that |
511 |
compares consistency of the positioning over all TemplateFilter |
512 |
nucleosomes. |
513 |
|
514 |
* wp_pval, for a well-positioned nucleosome, it is the p-value |
515 |
chi square test obtained with the LLR2 (*1-pchisq(2.LLR2, df=4)*) |
516 |
|
517 |
* dyad_shift, for a well-positioned nucleosome, it is the shift |
518 |
between the two extreme TemplateFilter nucleosome dyad positions. |
519 |
|
520 |
The fuzzy map for BY is collected in the result directory and is |
521 |
called *BY_fuzzy.tab*. It is composed of following columns: |
522 |
|
523 |
* chr, the number of the chromosome |
524 |
|
525 |
* lower_bound, the lower bound of the nucleosome |
526 |
|
527 |
* upper_bound, the upper bound of the nucleosome |
528 |
|
529 |
* cur_index, index of the CUR |
530 |
|
531 |
The map of common well-positioned nucleosomes aligned between the BY |
532 |
and RM strains is collected in the result directory and is called |
533 |
*BY_RM_common_wp.tab*. It is composed of following columns: |
534 |
|
535 |
* cur_index, the index of the CUR |
536 |
|
537 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
538 |
|
539 |
* index_nuc_RM, the index of the RM nucleosome in the CUR |
540 |
|
541 |
* llr_score, , the LLR3 score that estimates conservation between |
542 |
the positions in BY and RM |
543 |
|
544 |
* common_wp_pval, the p-value chi square test obtained from LLR3 |
545 |
(*1-pchisq(2.LLR3, df=2)*) |
546 |
|
547 |
* diff, the dyads shift between the positions in the two strains |
548 |
|
549 |
The common UNR map for BY and RM strains is collected in the result |
550 |
directory and is called *BY_RM_common_unr.tab*. It is composed of the |
551 |
following columns: |
552 |
|
553 |
* cur_index, the index of the CUR |
554 |
|
555 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
556 |
|
557 |
* index_nuc_RM,the index of the RM nucleosome in the CUR |
558 |
|
559 |
To execute this script, run the following command in your R console: |
560 |
|
561 |
source("src/current/extract_maps.R") |
562 |
|
563 |
|
564 |
The Script translate_common_wp.R |
565 |
-------------------------------- |
566 |
|
567 |
This script is used to translate common well-positioned nucleosome |
568 |
maps from a strain to another strain and stores it into a table. |
569 |
|
570 |
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM |
571 |
well-positioned nucleosome translated into the BY genome coordinates. |
572 |
It is composed of following columns: |
573 |
|
574 |
* strain_ref, the reference genome (in which positioned are |
575 |
defined) |
576 |
|
577 |
* begin, the translated lower bound of the nucleosome |
578 |
|
579 |
* end, the translated upper bound of the nucleosome |
580 |
|
581 |
* chr, the number of chromosomes for the reference genome (in |
582 |
which positioned are defined) |
583 |
|
584 |
* length, the length of the nucleosome (could be negative) |
585 |
|
586 |
* cur_index, the index of the CUR |
587 |
|
588 |
* index_nuc, the index of the nucleosome in the CUR |
589 |
|
590 |
To execute this script, run the following command in your R console: |
591 |
|
592 |
source("src/current/translate_common_wp.R") |
593 |
|
594 |
|
595 |
The Script split_samples.R |
596 |
-------------------------- |
597 |
|
598 |
For memory space usage reasons, we split and compress TemplateFilter |
599 |
input files according to their corresponding chromosome. for example, |
600 |
*sample_1_TF.tab* will be split into : |
601 |
|
602 |
* sample_1_chr_1_splited_sample.tab.gz |
603 |
|
604 |
* sample_1_chr_2_splited_sample.tab.gz |
605 |
|
606 |
* ... |
607 |
|
608 |
* sample_1_chr_17_splited_sample.tab.gz |
609 |
|
610 |
To execute this script, run the following command in your R console: |
611 |
|
612 |
source("src/current/split_samples.R") |
613 |
|
614 |
|
615 |
The Script count_reads.R |
616 |
------------------------ |
617 |
|
618 |
To associate a number of observations (read) to each nucleosome we run |
619 |
the script *count_reads.R*. It produces the files |
620 |
*BY_RM_H3K14ac_wp_and_nbreads.tab*, |
621 |
*BY_RM_H3K14ac_unr_and_nbreads.tab* |
622 |
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and |
623 |
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well- |
624 |
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned |
625 |
nucleosomes and Mnase UNRs respectively. |
626 |
|
627 |
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains |
628 |
counted reads for well-positioned nucleosomes with the experimental |
629 |
condition ChIP H3K14ac. It is composed of the following columns: |
630 |
|
631 |
* chr_BY, the number of the chromosome for BY |
632 |
|
633 |
* lower_bound_BY, the lower bound of the nucleosome for BY |
634 |
|
635 |
* upper_bound_BY, the upper bound of the nucleosome for BY |
636 |
|
637 |
* index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
638 |
|
639 |
* chr_RM, the number of the chromosome for RM |
640 |
|
641 |
* lower_bound_RM, the lower bound of the nucleosome for RM |
642 |
|
643 |
* upper_bound_RM, the upper bound of the nucleosome for RM |
644 |
|
645 |
* index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
646 |
|
647 |
* cur_index, index of the CUR |
648 |
|
649 |
* BY_H3K14ac_36, the number of reads for the current nucleosome |
650 |
for the sample 36 |
651 |
|
652 |
* BY_H3K14ac_37, #reads for sample 37 |
653 |
|
654 |
* BY_H3K14ac_53, #reads for sample 53 |
655 |
|
656 |
* RM_H3K14ac_38, #reads for sample 38 |
657 |
|
658 |
* RM_H3K14ac_39, #reads for sample 39 |
659 |
|
660 |
To execute this script, run the following command in your R console: |
661 |
|
662 |
source("src/current/count_reads.R") |
663 |
|
664 |
|
665 |
The Script get_size_factors.R |
666 |
----------------------------- |
667 |
|
668 |
This script uses the DESeq function *estimateSizeFactors* to compute |
669 |
the size factor of each sample. It corresponds to normalisation of |
670 |
read counts from sample to sample, as determined by DESeq. When a |
671 |
sample has n reads for a nucleosome or a UNR, the normalised count is |
672 |
n/f where f is the factor contained in this file. The script dumps |
673 |
computed size factors into the file *size_factors.tab*. This file has |
674 |
the form: |
675 |
|
676 |
+-----------+---------+---------+---------+ |
677 |
| sample_id | wp | unr | wpunr | |
678 |
+===========+=========+=========+=========+ |
679 |
| 1 | 0.87396 | 0.88097 | 0.87584 | |
680 |
+-----------+---------+---------+---------+ |
681 |
| 2 | 1.07890 | 1.07440 | 1.07760 | |
682 |
+-----------+---------+---------+---------+ |
683 |
| 3 | 1.06400 | 1.05890 | 1.06250 | |
684 |
+-----------+---------+---------+---------+ |
685 |
| 4 | 0.85782 | 0.87948 | 0.86305 | |
686 |
+-----------+---------+---------+---------+ |
687 |
| 5 | 0.97577 | 0.96590 | 0.97307 | |
688 |
+-----------+---------+---------+---------+ |
689 |
| 6 | 1.19630 | 1.18120 | 1.19190 | |
690 |
+-----------+---------+---------+---------+ |
691 |
| 36 | 0.93318 | 0.92762 | 0.93166 | |
692 |
+-----------+---------+---------+---------+ |
693 |
| 37 | 0.48315 | 0.48453 | 0.48350 | |
694 |
+-----------+---------+---------+---------+ |
695 |
| 38 | 1.11240 | 1.11210 | 1.11230 | |
696 |
+-----------+---------+---------+---------+ |
697 |
| 39 | 0.89897 | 0.89917 | 0.89903 | |
698 |
+-----------+---------+---------+---------+ |
699 |
| 53 | 2.22650 | 2.22700 | 2.22660 | |
700 |
+-----------+---------+---------+---------+ |
701 |
|
702 |
sample_id are given in file samples.csv |
703 |
|
704 |
If you don't know which column to use, we recommend using wpunr. |
705 |
|
706 |
If you want the very detailed factors produced by DESeq, here are the |
707 |
information: |
708 |
|
709 |
* unr: factor computed from data of UNR regions. These regions |
710 |
are defined for every pairs of aligned genomes (e.g. BY_RM) |
711 |
|
712 |
* wp: same, but for well-positioned nucleosomes. |
713 |
|
714 |
* wpunr: both types of regions. |
715 |
|
716 |
To execute this script, run the following command in your R console: |
717 |
|
718 |
source("src/current/get_size_factors.R") |
719 |
|
720 |
|
721 |
The Script launch_deseq.R |
722 |
------------------------- |
723 |
|
724 |
Finally, the script *launch_deseq.R* perform statistical analysis on |
725 |
each nucleosome using *DESeq*. It produces files: |
726 |
|
727 |
* results/current/BY_RM_H3K14ac_wp_snep.tab |
728 |
|
729 |
* results/current/BY_RM_H3K14ac_unr_snep.tab |
730 |
|
731 |
* results/current/BY_RM_H3K14ac_wpunr_snep.tab |
732 |
|
733 |
* results/current/BY_RM_H3K14ac_wp_mnase.tab |
734 |
|
735 |
* results/current/BY_RM_H3K14ac_unr_mnase.tab |
736 |
|
737 |
* results/current/BY_RM_H3K14ac_wpunr_mnase.tab |
738 |
|
739 |
These files are organised with the following columns (see file |
740 |
*BY_RM_H3K14ac_wp_snep.tab* for an example): |
741 |
|
742 |
* chr_BY, the number of the chromosome for BY |
743 |
|
744 |
* lower_bound_BY, the lower bound of the nucleosome for BY |
745 |
|
746 |
* upper_bound_BY, the upper bound of the nucleosome for BY |
747 |
|
748 |
* index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
749 |
|
750 |
* chr_RM, the number of the chromosome for RM |
751 |
|
752 |
* lower_bound_RM, the lower bound of the nucleosome for RM |
753 |
|
754 |
* upper_bound_RM, the upper bound of the nucleosome for RM |
755 |
|
756 |
* index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
757 |
|
758 |
* cur_index, index of the CUR |
759 |
|
760 |
* form |
761 |
|
762 |
* BY_Mnase_Seq_1, the number of reads for the current nucleosome |
763 |
for the sample 1 |
764 |
|
765 |
Next columns concern indicators for each sample: |
766 |
|
767 |
* BY_Mnase_Seq_2, #reads for sample 2 |
768 |
|
769 |
* BY_Mnase_Seq_3, #reads for sample 3 |
770 |
|
771 |
* RM_Mnase_Seq_4, #reads for sample 4 |
772 |
|
773 |
* RM_Mnase_Seq_5, #reads for sample 5 |
774 |
|
775 |
* RM_Mnase_Seq_6, #reads for sample 6 |
776 |
|
777 |
* BY_H3K14ac_36, #reads for sample 36 |
778 |
|
779 |
* BY_H3K14ac_37, #reads for sample 37 |
780 |
|
781 |
* BY_H3K14ac_53, #reads for sample 53 |
782 |
|
783 |
* RM_H3K14ac_38, #reads for sample 38 |
784 |
|
785 |
* RM_H3K14ac_39, #reads for sample 39 |
786 |
|
787 |
The 5 last columns concern DESeq analysis: |
788 |
|
789 |
* manip[a_manip] strain[a_strain] |
790 |
manip[a_strain]:strain[a_strain], the manip (marker) effect, the |
791 |
strain effect and the snep effect. These are the coefficients of |
792 |
the fitted generalized linear model. |
793 |
|
794 |
* pvalsGLM, the pvalue resulting of the comparison of the GLM |
795 |
model considering or not the interaction term marker:strain. This |
796 |
is the statsitcial significance of the interaction term and |
797 |
therefore the statistical significance of the SNEP. |
798 |
|
799 |
* snep_index, a boolean set to TRUE if the pvalueGLM value is |
800 |
under the threshold computed with FDR function with a rate set to |
801 |
0.0001. |
802 |
|
803 |
To execute this script, run the following command |
804 |
|
805 |
To execute this script, run the following command in your R console: |
806 |
|
807 |
source("src/current/launch_deseq.R") |
808 |
|
809 |
|
810 |
Results: Number of SNEPs |
811 |
======================== |
812 |
|
813 |
Here are the number of computed SNEPs for each forms. |
814 |
|
815 |
+-------+---------+-------+---------+ |
816 |
| form | strains | #nucs | H3K14ac | |
817 |
+=======+=========+=======+=========+ |
818 |
| wp | BY-RM | 30464 | 3549 | |
819 |
+-------+---------+-------+---------+ |
820 |
| unr | BY-RM | 9497 | 1559 | |
821 |
+-------+---------+-------+---------+ |
822 |
| wpunr | BY-RM | 39961 | 5240 | |
823 |
+-------+---------+-------+---------+ |
824 |
|
825 |
|
826 |
APPENDICE: Generate .c2c Files |
827 |
============================== |
828 |
|
829 |
The *.c2c* files is a simple table that describes how the genome |
830 |
sequence can be aligned. We generate it using some NucleoMiner 1.0 |
831 |
scripts. |
832 |
|
833 |
To use NucleoMiner 1.0 scripts on your UNIX/LINUX computer you need |
834 |
first to install MUMmer which is a system for rapidly aligning entire |
835 |
genomes, whether in complete or draft form. |
836 |
|
837 |
|
838 |
Installing the MUMmer library |
839 |
----------------------------- |
840 |
|
841 |
Get the last version of MUMmer archive on your computer |
842 |
(MUMmer3.23.tar.gz distributed in the directory deps of your working |
843 |
directory). Copy it in a dedicated directory. Install it locally into |
844 |
the src folder of you working directory by typing (working directory): |
845 |
|
846 |
tar -xvzf gdl-1.0.tar.gz |
847 |
|
848 |
This creates a directory called gdl-1.0. You now need to go into this |
849 |
directory and compile the library, by typing: |
850 |
|
851 |
cd src |
852 |
tar xfvz ../deps/MUMmer3.23.tar.gz |
853 |
cd MUMmer3.23 |
854 |
make check |
855 |
make install |
856 |
|
857 |
|
858 |
Installing NucleoMiner 1.0 scripts |
859 |
---------------------------------- |
860 |
|
861 |
Get the nucleominer-1.0.tar.gz archive on your computer (distributed |
862 |
in the directory deps of your working directory). Install it locally |
863 |
into the src folder of you working directory by typing (working |
864 |
directory): |
865 |
|
866 |
cd src |
867 |
tar xfvz ../deps/nucleominer-1.0.tar.gz |
868 |
cd .. |
869 |
|
870 |
This creates a directory called that contains NucleoMiner 1.0 scripts |
871 |
(src/nucleominer-1.0/scripts). |
872 |
|
873 |
|
874 |
Generate .c2c Files |
875 |
------------------- |
876 |
|
877 |
To generate .c2c files you need to type the following command in a |
878 |
terminal: |
879 |
|
880 |
export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts |
881 |
export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/ |
882 |
NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \ |
883 |
data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \ |
884 |
data/byxrm 2>NMgxcomp.log |
885 |
|
886 |
After execution, the directory *data* will hold the .c2c files. |