root / README @ master
Historique | Voir | Annoter | Télécharger (30,72 ko)
1 |
|
---|---|
2 |
Welcome to *NucleoMiner2* |
3 |
************************* |
4 |
|
5 |
* Readme / Documentation for *NucleoMiner2* |
6 |
|
7 |
* License |
8 |
|
9 |
* Installation Instructions |
10 |
|
11 |
* Tutorial |
12 |
|
13 |
* Experimental Dataset, Working Directory and Configuration File |
14 |
|
15 |
* Preprocessing Illumina Fastq Reads for Each Sample |
16 |
|
17 |
* Inferring Nucleosome Position and Extracting Read Counts |
18 |
|
19 |
* Results: Number of SNEPs |
20 |
|
21 |
* APPENDICE: Generate .c2c Files |
22 |
|
23 |
* References |
24 |
|
25 |
* Python Reference |
26 |
|
27 |
* R Reference |
28 |
|
29 |
Readme / Documentation for *NucleoMiner2* |
30 |
***************************************** |
31 |
|
32 |
*NucleoMiner2* offers Python API and R package allowing to perform |
33 |
quantitative analysis of epigenetic marks on individual nucleosomes. |
34 |
It was developed to detect natural Single-Nucleosome Epi-Polymorphisms |
35 |
(SNEP) from MNase-seq and ChIP-seq data. |
36 |
|
37 |
|
38 |
License |
39 |
======= |
40 |
|
41 |
Copyright CNRS 2012-2013 |
42 |
|
43 |
* Florent CHUFFART |
44 |
|
45 |
* Jean-Baptiste VEYRIERAS |
46 |
|
47 |
* Gael YVERT |
48 |
|
49 |
This software is a computer program which purpose is to perform |
50 |
quanti- tative analysis of epigenetic marks at single nucleosome |
51 |
resolution. |
52 |
|
53 |
This software is governed by the CeCILL license under French law and |
54 |
abiding by the rules of distribution of free software. You can use, |
55 |
modify and/ or redistribute the software under the terms of the CeCILL |
56 |
license as circulated by CEA, CNRS and INRIA at the following URL |
57 |
"http://www.cecill.info". |
58 |
|
59 |
As a counterpart to the access to the source code and rights to copy, |
60 |
modify and redistribute granted by the license, users are provided |
61 |
only with a limited warranty and the software's author, the holder |
62 |
of the economic rights, and the successive licensors have only |
63 |
limited liability. |
64 |
|
65 |
This software is provided with absolutely NO WARRANTY. The authors can |
66 |
not be held responsible, even partially, for any damage, loss, |
67 |
financial loss or any other undesired facts resulting from the use of |
68 |
the software. In this respect, the user's attention is drawn to the |
69 |
risks associated with loading, using, modifying and/or developing or |
70 |
reproducing the software by the user in light of its specific status |
71 |
of free software, that may mean that it is complicated to manipulate, |
72 |
and that also therefore means that it is reserved for developers |
73 |
and experienced professionals having in-depth computer knowledge. |
74 |
Users are therefore encouraged to load and test the software's |
75 |
suitability as regards their requirements in conditions enabling the |
76 |
security of their systems and/or data to be ensured and, more |
77 |
generally, to use and operate it in the same conditions as regards |
78 |
security. |
79 |
|
80 |
The fact that you are presently reading this means that you have had |
81 |
knowledge of the CeCILL license and that you accept its terms. |
82 |
|
83 |
|
84 |
Installation Instructions |
85 |
========================= |
86 |
|
87 |
|
88 |
Links |
89 |
----- |
90 |
|
91 |
*NucleoMiner2* home page and documentation are available here: |
92 |
|
93 |
* https://forge.cbp.ens-lyon.fr/redmine/projects/nucleominer |
94 |
|
95 |
The Yvert lab web page is accessible here: |
96 |
|
97 |
* http://www.ens-lyon.fr/LBMC/gisv/ |
98 |
|
99 |
|
100 |
Installation |
101 |
------------ |
102 |
|
103 |
The first installation step is to retrieve the source code of |
104 |
NucleoMiner2. You can do this by typing the following command in a |
105 |
terminal. |
106 |
|
107 |
git clone http://forge.cbp.ens-lyon.fr/git/nucleominer |
108 |
|
109 |
|
110 |
Prerequisites |
111 |
~~~~~~~~~~~~~ |
112 |
|
113 |
To work properly, NucleoMiner2 needs that the following free software |
114 |
are installed and made available on your system: |
115 |
|
116 |
* Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2 |
117 |
|
118 |
* SAMtools http://samtools.sourceforge.net |
119 |
|
120 |
* bedtools http://code.google.com/p/bedtools/ |
121 |
|
122 |
* TemplateFilter |
123 |
http://compbio.cs.huji.ac.il/NucPosition/TemplateFiltering |
124 |
|
125 |
It also requires the following R packages to be installed on your |
126 |
system: |
127 |
|
128 |
* fork |
129 |
|
130 |
* rjson |
131 |
|
132 |
* seqinr |
133 |
|
134 |
* plotrix |
135 |
|
136 |
* DESeq |
137 |
|
138 |
These packages can be installed by typing the following command in an |
139 |
R console: |
140 |
|
141 |
install.packages(c("fork", "rjson", "seqinr", "plotrix")) |
142 |
source("http://bioconductor.org/biocLite.R") |
143 |
biocLite("DESeq") |
144 |
|
145 |
Finally,by typing the git command above, you downloaded specific R |
146 |
packages provided with NucleoMiner2 that you now need to install: |
147 |
|
148 |
* cachecache https://forge.cbp.ens- |
149 |
lyon.fr/redmine/projects/cachecache |
150 |
|
151 |
* bot https://forge.cbp.ens-lyon.fr/redmine/projects/bot |
152 |
|
153 |
* nucleominer https://forge.cbp.ens- |
154 |
lyon.fr/redmine/projects/nucleominer |
155 |
|
156 |
To do so, type the following command in your terminal: |
157 |
|
158 |
cd nucleominer |
159 |
R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\ |
160 |
doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\ |
161 |
build/nucleominer_2.3.46.tar.gz |
162 |
|
163 |
Tutorial |
164 |
******** |
165 |
|
166 |
This tutorial describes steps allowing to perform quantitative |
167 |
analysis of epigenetic marks on individual nucleosomes. We assume that |
168 |
files are organised according to a given hierarchy and that all |
169 |
command lines are launched from the project’s root directory. |
170 |
|
171 |
This tutorial is divided into two main parts. The first part covers |
172 |
the python script *wf.py* that aligns and converts short sequence |
173 |
reads. The second part covers the R scripts that extracts nucleosome- |
174 |
level information (nucleosome position and indicators) from the |
175 |
dataset. |
176 |
|
177 |
|
178 |
Experimental Dataset, Working Directory and Configuration File |
179 |
============================================================== |
180 |
|
181 |
|
182 |
Working Directory Organisation |
183 |
------------------------------ |
184 |
|
185 |
After having installed NucleoMiner2 environment (Previous section), go |
186 |
to the root working directory of the tutorial by typing the following |
187 |
command in a terminal: |
188 |
|
189 |
cd doc/Chuffart_NM2_workdir/ |
190 |
|
191 |
|
192 |
Retrieving Experimental Dataset |
193 |
------------------------------- |
194 |
|
195 |
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress |
196 |
(http://www.ebi.ac.uk/arrayexpress/) under accession number |
197 |
E-MTAB-2671. |
198 |
|
199 |
$$$ TODO explain how organise Experimental Dataset into the *data* |
200 |
directory of the working directory. |
201 |
|
202 |
In this tutorial, we want to compare nucleosomes of 2 yeast strains: |
203 |
BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq |
204 |
using an antibody recognizing the H3K14ac epigenetic mark. Illumina |
205 |
sequencing was done in single-read of 50 bp long. |
206 |
|
207 |
The dataset is composed of 55 files organised as follows: |
208 |
|
209 |
* 3 replicates for BY MNase Seq |
210 |
|
211 |
* sample 1 (5 fastq.gz files) |
212 |
|
213 |
* sample 2 (5 fastq.gz files) |
214 |
|
215 |
* sample 3 (4 fastq.gz files) |
216 |
|
217 |
* 3 replicates for RM MNase Seq |
218 |
|
219 |
* sample 4 (4 fastq.gz files) |
220 |
|
221 |
* sample 5 (4 fastq.gz files) |
222 |
|
223 |
* sample 6 (5 fastq.gz files) |
224 |
|
225 |
* 3 replicates for BY ChIP Seq H3K14ac |
226 |
|
227 |
* sample 36 (5 fastq.gz files) |
228 |
|
229 |
* sample 37 (5 fastq.gz files) |
230 |
|
231 |
* sample 53 (9 fastq.gz files) |
232 |
|
233 |
* 2 replicates for RM ChIP Seq H3K14ac |
234 |
|
235 |
* sample 38 (5 fastq.gz files) |
236 |
|
237 |
* sample 39 (4 fastq.gz files) |
238 |
|
239 |
|
240 |
Python and R Common Configuration File |
241 |
-------------------------------------- |
242 |
|
243 |
First, we need to define useful configuration variables that will be |
244 |
passed to python and R scripts. These variables are contained in file |
245 |
*configurator.py*. The execution of this python script dumps variables |
246 |
into the *nucleominer_config.json* file that will then be used by both |
247 |
R and python scripts. |
248 |
|
249 |
The initialization of this variables is done in the configurator.py |
250 |
file. If you need to adapt variable values (path, default |
251 |
parameters...) you need to edit this file. Then, go to the root |
252 |
directory of your project and run the following command to dump the |
253 |
configuration file: |
254 |
|
255 |
python src/current/configurator.py |
256 |
|
257 |
|
258 |
Preprocessing Illumina Fastq Reads for Each Sample |
259 |
================================================== |
260 |
|
261 |
Once variables and design have been specified, the script wf.py will |
262 |
automatically run all the analysis. You don't need to do anything. To |
263 |
run the full analysis, run the following command: |
264 |
|
265 |
python src/current/wf.py |
266 |
|
267 |
The details of the steps performed by this script are explained below. |
268 |
This preprocessing consists of 4 steps embedded in the *wf.py* script. |
269 |
They are described bellow. As a preamble, this script computes |
270 |
*samples*, *samples_mnase* and *strains* that will be used along the 4 |
271 |
steps. |
272 |
|
273 |
wf.samples = [] |
274 |
|
275 |
List of samples where a sample is identified by an id (key: *id*) |
276 |
and a strain name (key *strain*). |
277 |
|
278 |
wf.samples_mnase = [] |
279 |
|
280 |
List of Mnase samples. |
281 |
|
282 |
wf.strains = [] |
283 |
|
284 |
List of reference strains. |
285 |
|
286 |
|
287 |
Creating Bowtie Index from each Reference Genome |
288 |
------------------------------------------------ |
289 |
|
290 |
For each strain, the script *wf.py* then creates bowtie index. Bowtie |
291 |
index of a strain is a tree view of the genome of this strain. It will |
292 |
be used by bowtie to align reads. The part of the script performing |
293 |
this is the following: |
294 |
|
295 |
for strain in strains: |
296 |
per_strain_stats[strain] = create_bowtie_index(strain, |
297 |
config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], |
298 |
config["BOWTIE_BUILD_BIN"]) |
299 |
|
300 |
As an indication, the following table summarizes the file sizes and |
301 |
process durations that we experienced when running this step on a |
302 |
Linux server***. |
303 |
|
304 |
+--------+------------------------+------------------------+------------------+ |
305 |
| strain | fasta genome file size | bowtie index file size | process duration | |
306 |
+========+========================+========================+==================+ |
307 |
| BY | 12 Mo | 25 Mo | 11 s. | |
308 |
+--------+------------------------+------------------------+------------------+ |
309 |
| RM | 12 Mo | 24 Mo | 9 s. | |
310 |
+--------+------------------------+------------------------+------------------+ |
311 |
|
312 |
|
313 |
Aligning Reads to Reference Genome |
314 |
---------------------------------- |
315 |
|
316 |
Next, the *wf.py* script launches bowtie to align reads to the |
317 |
reference genome. It produces a *.sam* file that is converted into a |
318 |
*.bed* file. Binaries for *bowtie*, *samtools* and *bedtools* are |
319 |
wrapped using python *subprocess* class. This step is performed by the |
320 |
following part of the script: |
321 |
|
322 |
for sample in samples: |
323 |
per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, |
324 |
config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], |
325 |
config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], |
326 |
config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"]) |
327 |
|
328 |
|
329 |
Convert Aligned Reads into TemplateFilter Format |
330 |
------------------------------------------------ |
331 |
|
332 |
TemplateFilter uses particular input formats for reads, so it is |
333 |
necessary to convert the *.bed* files. TemplateFilter expect reads in |
334 |
the following format: *chr*, *coord*, *strand* and *#read* where: |
335 |
|
336 |
* *chr* is the number of the chromosome; |
337 |
|
338 |
* *coord* is the coordinate of the reads; |
339 |
|
340 |
* *strand* is *F* for forward and *R* for reverse; |
341 |
|
342 |
* *#reads* the number of reads covering this position. |
343 |
|
344 |
Each entry is *tab*-separated. |
345 |
|
346 |
**WARNING** for reverse strands, bowtie returns the position of the |
347 |
first nucleotide on the left hand side, whereas TemplateFilter expects |
348 |
the first one on the right hand side. This is taken into account in |
349 |
NucleoMiner2 by adding the read length (in our case 50) to the reverse |
350 |
reads coordinates. |
351 |
|
352 |
This step is performed by the following part of the *wf.py* script: |
353 |
|
354 |
for sample in samples: |
355 |
per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, |
356 |
config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], |
357 |
config["READ_LENGTH"],config["MAPQ_THRES"]) |
358 |
|
359 |
The following table summarizes the number of reads, the involved file |
360 |
sizes and process durations that we experienced when running the two |
361 |
last steps. In our case, alignment process were multithreaded over 3 |
362 |
cores. |
363 |
|
364 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
365 |
| id | Illumina reads | aligned and filtred reads | ratio | *.bed* file size | TF input file size | process duration | |
366 |
+====+================+===========================+========+==================+====================+==================+ |
367 |
| 1 | 16436138 | 10199695 | 62,06% | 1064 Mo | 60 Mo | 383 s. | |
368 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
369 |
| 2 | 16911132 | 12512727 | 73,99% | 1298 Mo | 64 Mo | 437 s. | |
370 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
371 |
| 3 | 15946902 | 12340426 | 77,38% | 1280 Mo | 65 Mo | 423 s. | |
372 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
373 |
| 4 | 13765584 | 10381903 | 75,42% | 931 Mo | 59 Mo | 352 s. | |
374 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
375 |
| 5 | 15168268 | 11502855 | 75,83% | 1031 Mo | 64 Mo | 386 s. | |
376 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
377 |
| 6 | 18850820 | 14024905 | 74,40% | 1254 Mo | 69 Mo | 482 s. | |
378 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
379 |
| 36 | 17715118 | 14092985 | 79,55% | 1404 Mo | 68 Mo | 483 s. | |
380 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
381 |
| 37 | 17288466 | 7402082 | 42,82% | 741 Mo | 48 Mo | 339 s. | |
382 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
383 |
| 38 | 16116394 | 13178457 | 81,77% | 1101 Mo | 63 Mo | 420 s. | |
384 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
385 |
| 39 | 14241106 | 10537228 | 73,99% | 880 Mo | 57 Mo | 348 s. | |
386 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
387 |
| 53 | 40876476 | 33780065 | 82,64% | 3316 Mo | 103 Mo | 1165 s. | |
388 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
389 |
|
390 |
|
391 |
Run TemplateFilter on Mnase Samples |
392 |
----------------------------------- |
393 |
|
394 |
Finally, for each sample we perform TemplateFilter analysis. |
395 |
|
396 |
**WARNING** TemplateFilter returns a list of nucleosomes. Each |
397 |
nucleosome is defined by its center and its width. An odd width leads |
398 |
us to consider non- integer lower and upper bound. |
399 |
|
400 |
**WARNING** TemplateFilter was not designed to handle replicates. So |
401 |
we recommend to keep a maximum of nucleosomes and filter the aberrant |
402 |
ones afterwards using the benefits of having replicates. To do this, |
403 |
we set a low correlation threshold parameter (0.5) and a particularly |
404 |
high value of overlap (300%). |
405 |
|
406 |
This step is performed by the following part of the *wf.py* script: |
407 |
|
408 |
for sample in samples_mnase: |
409 |
per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, |
410 |
config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], |
411 |
config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], |
412 |
config["TF_MAXW"], config["TF_OL"]) |
413 |
|
414 |
+----+--------+------------+---------------+------------------+ |
415 |
| id | strain | found nucs | nuc file size | process duration | |
416 |
+====+========+============+===============+==================+ |
417 |
| 1 | BY | 96214 | 68 Mo | 1022 s. | |
418 |
+----+--------+------------+---------------+------------------+ |
419 |
| 2 | BY | 91694 | 65 Mo | 1038 s. | |
420 |
+----+--------+------------+---------------+------------------+ |
421 |
| 3 | BY | 91205 | 65 Mo | 1036 s. | |
422 |
+----+--------+------------+---------------+------------------+ |
423 |
| 4 | RM | 88076 | 62 Mo | 984 s. | |
424 |
+----+--------+------------+---------------+------------------+ |
425 |
| 5 | RM | 90141 | 64 Mo | 967 s. | |
426 |
+----+--------+------------+---------------+------------------+ |
427 |
| 6 | RM | 87517 | 62 Mo | 980 s. | |
428 |
+----+--------+------------+---------------+------------------+ |
429 |
|
430 |
|
431 |
Inferring Nucleosome Position and Extracting Read Counts |
432 |
======================================================== |
433 |
|
434 |
The second part of the tutorial uses R |
435 |
(http://http://www.r-project.org). NucleoMiner2 contains a set of R |
436 |
scripts that will be sourced in R from a console launched at the root |
437 |
of your project. These scripts are: |
438 |
|
439 |
* headers.R |
440 |
|
441 |
* extract_maps.R |
442 |
|
443 |
* translate_common_wp.R |
444 |
|
445 |
* split_samples.R |
446 |
|
447 |
* count_reads.R |
448 |
|
449 |
* get_size_factors |
450 |
|
451 |
* launch_deseq.R |
452 |
|
453 |
|
454 |
The Script headers.R |
455 |
-------------------- |
456 |
|
457 |
The script headers.R is included in all other R scripts. It is in |
458 |
charge of: |
459 |
|
460 |
* launching libraries used in the scripts |
461 |
|
462 |
* launching configuration (design, strain, marker...) |
463 |
|
464 |
* computing and caching Common Uinterrupted Regions (CURs). |
465 |
Caching means storing the information in the computer's memory. |
466 |
|
467 |
Note that you can customize the function “translate”. This function |
468 |
allows you to use the alignments between genomes when performing |
469 |
various tasks. |
470 |
|
471 |
* You may want to analyze data of a single strain (e.g. |
472 |
treatment/control, or only few mutations). In this case, the |
473 |
genome is identical across all samples and you do not need to |
474 |
define particular CURs (CURs are chromosomes). Simply use the |
475 |
default translate function which is neutral. |
476 |
|
477 |
* If you are analyzing data from two or more strains (as |
478 |
NucleoMiner2 was designed for), then you need to translate |
479 |
coordinates of one genome into the coordinates of another one. |
480 |
You must do this by aligning the two genomes, which will produce |
481 |
a .c2c file (see Appendice "Generate .c2c Files"). thenuse it to |
482 |
produce the list of regions and customise “translate”. |
483 |
|
484 |
In our tutorial, we are in the second case and to perform all these |
485 |
steps run the following command line in your R console: |
486 |
|
487 |
source("src/current/headers.R") |
488 |
|
489 |
|
490 |
The Script extract_maps.R |
491 |
------------------------- |
492 |
|
493 |
This script is in charge of extracting Maps for well-positioned and |
494 |
sensitive nucleosomes. First of all, this script computes intra and |
495 |
inter-strain matches of nucleosome maps for each CUR. This step can be |
496 |
executed in parallel on many cores using the BoT library. Next, it |
497 |
collects results and produces maps of well-positioned nucleosomes, |
498 |
sensitive nucleosomes and Unaligned Nucleosomal Regions . |
499 |
|
500 |
The map of well-positioned nucleosomes for BY is collected in the |
501 |
result directory and is called *BY_wp.tab*. It is composed of |
502 |
following columns: |
503 |
|
504 |
* chr, the number of the chromosome |
505 |
|
506 |
* lower_bound, the lower bound of the nucleosome |
507 |
|
508 |
* upper_bound, the upper bound of the nucleosome |
509 |
|
510 |
* cur_index, index of the CUR |
511 |
|
512 |
* index_nuc, the index of the nucleosome in the CUR |
513 |
|
514 |
* wp, 1 if it is a well positioned nucleosome, 0 otherwise |
515 |
|
516 |
* nb_reads, the number of reads that support this nucleosome |
517 |
|
518 |
* nb_nucs, the number of TemplateFilter nucleosome across |
519 |
replicates (= the number of replicates in which it is a well- |
520 |
positioned nucleosome) |
521 |
|
522 |
* llr_1, for a well-positioned nucleosome, it is the LLR1 (log- |
523 |
likelihood ratio) between the first and the second TemplateFilter |
524 |
nucleosome on the chain. |
525 |
|
526 |
* llr_2, for a well-positioned nucleosome, it is the LLR1 between |
527 |
the second and the third TemplateFilter nucleosome on the chain. |
528 |
|
529 |
* wp_llr, for a well-positioned nucleosome, it is the LLR2 that |
530 |
compares consistency of the positioning over all TemplateFilter |
531 |
nucleosomes. |
532 |
|
533 |
* wp_pval, for a well-positioned nucleosome, it is the p-value |
534 |
chi square test obtained from LLR2 (*1-pchisq(2.LLR2, df=4)*) |
535 |
|
536 |
* dyad_shift, for a well-positioned nucleosome, it is the shift |
537 |
between the two extreme TemplateFilter nucleosome dyad positions. |
538 |
|
539 |
The sensitive map for BY is collected in the result directory and is |
540 |
called *BY_fuzzy.tab*. It is composed of following columns: |
541 |
|
542 |
* chr, the number of the chromosome |
543 |
|
544 |
* lower_bound, the lower bound of the nucleosome |
545 |
|
546 |
* upper_bound, the upper bound of the nucleosome |
547 |
|
548 |
* cur_index, index of the CUR |
549 |
|
550 |
The map of common well-positioned nucleosomes aligned between the BY |
551 |
and RM strains is collected in the result directory and is called |
552 |
*BY_RM_common_wp.tab*. It is composed of following columns: |
553 |
|
554 |
* cur_index, the index of the CUR |
555 |
|
556 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
557 |
|
558 |
* index_nuc_RM, the index of the RM nucleosome in the CUR |
559 |
|
560 |
* llr_score, , the LLR3 score that estimates conservation between |
561 |
the positions in BY and RM |
562 |
|
563 |
* common_wp_pval, the p-value chi square test obtained from LLR3 |
564 |
(*1-pchisq(2.LLR3, df=2)*) |
565 |
|
566 |
* diff, the dyads shift between the positions in the two strains |
567 |
(in bp) |
568 |
|
569 |
The common UNR map for BY and RM strains is collected in the result |
570 |
directory and is called *BY_RM_common_unr.tab*. It is composed of the |
571 |
following columns: |
572 |
|
573 |
* cur_index, the index of the CUR |
574 |
|
575 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
576 |
|
577 |
* index_nuc_RM,the index of the RM nucleosome in the CUR |
578 |
|
579 |
To execute this script, run the following command in your R console: |
580 |
|
581 |
source("src/current/extract_maps.R") |
582 |
|
583 |
|
584 |
The Script translate_common_wp.R |
585 |
-------------------------------- |
586 |
|
587 |
This script is used to translate common well-positioned nucleosome |
588 |
positions from a strain to another strain and stores it into a table. |
589 |
|
590 |
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM |
591 |
well-positioned nucleosomes translated into the BY genome coordinates. |
592 |
It is composed of following columns: |
593 |
|
594 |
* strain_ref, the reference genome (in which positioned are |
595 |
defined) |
596 |
|
597 |
* begin, the translated lower bound of the nucleosome |
598 |
|
599 |
* end, the translated upper bound of the nucleosome |
600 |
|
601 |
* chr, the number of chromosomes for the reference genome (in |
602 |
which positioned are defined) |
603 |
|
604 |
* length, the length of the nucleosome (could be negative) |
605 |
|
606 |
* cur_index, the index of the CUR |
607 |
|
608 |
* index_nuc, the index of the nucleosome in the CUR |
609 |
|
610 |
To execute this script, run the following command in your R console: |
611 |
|
612 |
source("src/current/translate_common_wp.R") |
613 |
|
614 |
|
615 |
The Script split_samples.R |
616 |
-------------------------- |
617 |
|
618 |
To optimize memory space usage, we split and compress TemplateFilter |
619 |
input files according to their corresponding chromosome. for example, |
620 |
*sample_1_TF.tab* will be split into : |
621 |
|
622 |
* sample_1_chr_1_splited_sample.tab.gz |
623 |
|
624 |
* sample_1_chr_2_splited_sample.tab.gz |
625 |
|
626 |
* ... |
627 |
|
628 |
* sample_1_chr_17_splited_sample.tab.gz |
629 |
|
630 |
To execute this script, run the following command in your R console: |
631 |
|
632 |
source("src/current/split_samples.R") |
633 |
|
634 |
|
635 |
The Script count_reads.R |
636 |
------------------------ |
637 |
|
638 |
To associate a number of observations (read) to each nucleosome we run |
639 |
the script *count_reads.R*. It produces the files |
640 |
*BY_RM_H3K14ac_wp_and_nbreads.tab*, |
641 |
*BY_RM_H3K14ac_unr_and_nbreads.tab* |
642 |
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and |
643 |
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well- |
644 |
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned |
645 |
nucleosomes and Mnase UNRs respectively. |
646 |
|
647 |
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains |
648 |
counted reads for well-positioned nucleosomes with the experimental |
649 |
condition ChIP H3K14ac. It is composed of the following columns: |
650 |
|
651 |
* chr_BY, the number of the chromosome for BY |
652 |
|
653 |
* lower_bound_BY, the lower bound of the nucleosome for BY |
654 |
|
655 |
* upper_bound_BY, the upper bound of the nucleosome for BY |
656 |
|
657 |
* index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
658 |
|
659 |
* chr_RM, the number of the chromosome for RM |
660 |
|
661 |
* lower_bound_RM, the lower bound of the nucleosome for RM |
662 |
|
663 |
* upper_bound_RM, the upper bound of the nucleosome for RM |
664 |
|
665 |
* index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
666 |
|
667 |
* cur_index, index of the CUR |
668 |
|
669 |
* BY_H3K14ac_36, the number of reads for the current nucleosome |
670 |
for the sample 36 |
671 |
|
672 |
* BY_H3K14ac_37, #reads for sample 37 |
673 |
|
674 |
* BY_H3K14ac_53, #reads for sample 53 |
675 |
|
676 |
* RM_H3K14ac_38, #reads for sample 38 |
677 |
|
678 |
* RM_H3K14ac_39, #reads for sample 39 |
679 |
|
680 |
To execute this script, run the following command in your R console: |
681 |
|
682 |
source("src/current/count_reads.R") |
683 |
|
684 |
|
685 |
The Script get_size_factors.R |
686 |
----------------------------- |
687 |
|
688 |
This script uses the DESeq function *estimateSizeFactors* to compute |
689 |
the size factor of each sample. It corresponds to normalisation of |
690 |
read counts from sample to sample, as determined by DESeq. When a |
691 |
sample has n reads for a nucleosome or a UNR, the normalised count is |
692 |
n/f where f is the factor contained in this file. The script dumps |
693 |
computed size factors into the file *size_factors.tab*. This file has |
694 |
the form: |
695 |
|
696 |
+-----------+---------+---------+---------+ |
697 |
| sample_id | wp | unr | wpunr | |
698 |
+===========+=========+=========+=========+ |
699 |
| 1 | 0.87396 | 0.88097 | 0.87584 | |
700 |
+-----------+---------+---------+---------+ |
701 |
| 2 | 1.07890 | 1.07440 | 1.07760 | |
702 |
+-----------+---------+---------+---------+ |
703 |
| 3 | 1.06400 | 1.05890 | 1.06250 | |
704 |
+-----------+---------+---------+---------+ |
705 |
| 4 | 0.85782 | 0.87948 | 0.86305 | |
706 |
+-----------+---------+---------+---------+ |
707 |
| 5 | 0.97577 | 0.96590 | 0.97307 | |
708 |
+-----------+---------+---------+---------+ |
709 |
| 6 | 1.19630 | 1.18120 | 1.19190 | |
710 |
+-----------+---------+---------+---------+ |
711 |
| 36 | 0.93318 | 0.92762 | 0.93166 | |
712 |
+-----------+---------+---------+---------+ |
713 |
| 37 | 0.48315 | 0.48453 | 0.48350 | |
714 |
+-----------+---------+---------+---------+ |
715 |
| 38 | 1.11240 | 1.11210 | 1.11230 | |
716 |
+-----------+---------+---------+---------+ |
717 |
| 39 | 0.89897 | 0.89917 | 0.89903 | |
718 |
+-----------+---------+---------+---------+ |
719 |
| 53 | 2.22650 | 2.22700 | 2.22660 | |
720 |
+-----------+---------+---------+---------+ |
721 |
|
722 |
sample_id are given in file samples.csv |
723 |
|
724 |
If you don't know which column to use for normalization, we recommend |
725 |
using wpunr. |
726 |
|
727 |
Here are the details of the factors produced: |
728 |
|
729 |
* unr: factor computed from data of UNR regions. These regions |
730 |
are defined for every pairs of aligned genomes (e.g. BY_RM) |
731 |
|
732 |
* wp: same, but for well-positioned nucleosomes. |
733 |
|
734 |
* wpunr: both types of regions. |
735 |
|
736 |
To execute this script, run the following command in your R console: |
737 |
|
738 |
source("src/current/get_size_factors.R") |
739 |
|
740 |
|
741 |
The Script launch_deseq.R |
742 |
------------------------- |
743 |
|
744 |
Finally, the script *launch_deseq.R* perform statistical analysis on |
745 |
each nucleosome using *DESeq*. It produces files: |
746 |
|
747 |
* results/current/BY_RM_H3K14ac_wp_snep.tab |
748 |
|
749 |
* results/current/BY_RM_H3K14ac_unr_snep.tab |
750 |
|
751 |
* results/current/BY_RM_H3K14ac_wpunr_snep.tab |
752 |
|
753 |
* results/current/BY_RM_H3K14ac_wp_mnase.tab |
754 |
|
755 |
* results/current/BY_RM_H3K14ac_unr_mnase.tab |
756 |
|
757 |
* results/current/BY_RM_H3K14ac_wpunr_mnase.tab |
758 |
|
759 |
These files are organised with the following columns (see file |
760 |
*BY_RM_H3K14ac_wp_snep.tab* for an example): |
761 |
|
762 |
* chr_BY, the number of the chromosome for BY |
763 |
|
764 |
* lower_bound_BY, the lower bound of the nucleosome for BY |
765 |
|
766 |
* upper_bound_BY, the upper bound of the nucleosome for BY |
767 |
|
768 |
* index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
769 |
|
770 |
* chr_RM, the number of the chromosome for RM |
771 |
|
772 |
* lower_bound_RM, the lower bound of the nucleosome for RM |
773 |
|
774 |
* upper_bound_RM, the upper bound of the nucleosome for RM |
775 |
|
776 |
* index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
777 |
|
778 |
* cur_index, index of the CUR |
779 |
|
780 |
* form |
781 |
|
782 |
* BY_Mnase_Seq_1, the number of reads for the current nucleosome |
783 |
for the sample 1 |
784 |
|
785 |
Next columns concern indicators for each sample: |
786 |
|
787 |
* BY_Mnase_Seq_2, #reads for sample 2 |
788 |
|
789 |
* BY_Mnase_Seq_3, #reads for sample 3 |
790 |
|
791 |
* RM_Mnase_Seq_4, #reads for sample 4 |
792 |
|
793 |
* RM_Mnase_Seq_5, #reads for sample 5 |
794 |
|
795 |
* RM_Mnase_Seq_6, #reads for sample 6 |
796 |
|
797 |
* BY_H3K14ac_36, #reads for sample 36 |
798 |
|
799 |
* BY_H3K14ac_37, #reads for sample 37 |
800 |
|
801 |
* BY_H3K14ac_53, #reads for sample 53 |
802 |
|
803 |
* RM_H3K14ac_38, #reads for sample 38 |
804 |
|
805 |
* RM_H3K14ac_39, #reads for sample 39 |
806 |
|
807 |
The 5 last columns concern DESeq analysis: |
808 |
|
809 |
* manip[a_manip] strain[a_strain] |
810 |
manip[a_strain]:strain[a_strain], the manip (marker) effect, the |
811 |
strain effect and the snep effect. These are the coefficients of |
812 |
the fitted generalized linear model. |
813 |
|
814 |
* pvalsGLM, the pvalue resulting from the comparison of the GLM |
815 |
model considering the interaction term *marker:strain* to the GLM |
816 |
model that does not consider it. This is the statsitcial |
817 |
significance of the interaction term and therefore the |
818 |
statistical significance of the SNEP. |
819 |
|
820 |
* snep_index, a boolean set to TRUE if the pvalueGLM value is |
821 |
under the threshold computed with FDR function with a rate set to |
822 |
0.0001. |
823 |
|
824 |
To execute this script, run the following command in your R console: |
825 |
|
826 |
source("src/current/launch_deseq.R") |
827 |
|
828 |
|
829 |
Results: Number of SNEPs |
830 |
======================== |
831 |
|
832 |
Here are the number of computed SNEPs for each forms. |
833 |
|
834 |
+-------+---------+-------+---------+ |
835 |
| form | strains | #nucs | H3K14ac | |
836 |
+=======+=========+=======+=========+ |
837 |
| wp | BY-RM | 30464 | 3549 | |
838 |
+-------+---------+-------+---------+ |
839 |
| unr | BY-RM | 9497 | 1559 | |
840 |
+-------+---------+-------+---------+ |
841 |
| wpunr | BY-RM | 39961 | 5240 | |
842 |
+-------+---------+-------+---------+ |
843 |
|
844 |
|
845 |
APPENDICE: Generate .c2c Files |
846 |
============================== |
847 |
|
848 |
The *.c2c* files is a simple table that describes how two genome |
849 |
sequences are aligned. This file can be generated by using scripts |
850 |
that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics |
851 |
2010) and which we provide in this release of NucleoMiner2. |
852 |
|
853 |
To use these scripts on your UNIX/LINUX computer you need first to |
854 |
install MUMmer which is designed to rapidly align entire genomes, |
855 |
whether in complete or draft form. |
856 |
|
857 |
|
858 |
Installing MUMmer |
859 |
----------------- |
860 |
|
861 |
Get the last version of MUMmer archive on your computer |
862 |
(MUMmer3.23.tar.gz is provided in the directory deps of your working |
863 |
directory). Copy it in a dedicated directory. Install it locally into |
864 |
the src folder of you working directory by typing (working directory): |
865 |
|
866 |
tar -xvzf MUMmer3.23.tar.gz |
867 |
|
868 |
cd src |
869 |
tar xfvz ../deps/MUMmer3.23.tar.gz |
870 |
cd MUMmer3.23 |
871 |
make check |
872 |
make install |
873 |
|
874 |
|
875 |
Installing NucleoMiner 1.0 scripts |
876 |
---------------------------------- |
877 |
|
878 |
Get the nucleominer-1.0.tar.gz archive on your computer (this archive |
879 |
is provided in the directory deps of your working directory). Install |
880 |
it locally into the src folder of you working directory by typing |
881 |
(working directory): |
882 |
|
883 |
cd src |
884 |
tar xfvz ../deps/nucleominer-1.0.tar.gz |
885 |
cd .. |
886 |
|
887 |
This creates a directory that contains NucleoMiner 1.0 scripts |
888 |
(src/nucleominer-1.0/scripts). |
889 |
|
890 |
|
891 |
Generate .c2c Files |
892 |
------------------- |
893 |
|
894 |
To generate .c2c files you need to type the following command in a |
895 |
terminal: |
896 |
|
897 |
export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts |
898 |
export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/ |
899 |
NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \ |
900 |
data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \ |
901 |
data/byxrm 2>NMgxcomp.log |
902 |
|
903 |
After execution, the directory *data* will hold the .c2c files. |