Statistiques
| Branche: | Révision :

root / README @ master

Historique | Voir | Annoter | Télécharger (30,72 ko)

1

    
2
Welcome to *NucleoMiner2*
3
*************************
4

    
5
* Readme / Documentation for *NucleoMiner2*
6

    
7
  * License
8

    
9
  * Installation Instructions
10

    
11
* Tutorial
12

    
13
  * Experimental Dataset, Working Directory and Configuration File
14

    
15
  * Preprocessing Illumina Fastq Reads for Each Sample
16

    
17
  * Inferring Nucleosome Position and Extracting Read Counts
18

    
19
  * Results: Number of SNEPs
20

    
21
  * APPENDICE: Generate .c2c Files
22

    
23
* References
24

    
25
  * Python Reference
26

    
27
  * R Reference
28

    
29
Readme / Documentation for *NucleoMiner2*
30
*****************************************
31

    
32
*NucleoMiner2* offers Python API and R package allowing to perform
33
quantitative analysis of epigenetic marks on individual nucleosomes.
34
It was developed to detect natural Single-Nucleosome Epi-Polymorphisms
35
(SNEP) from MNase-seq and ChIP-seq data.
36

    
37

    
38
License
39
=======
40

    
41
Copyright CNRS 2012-2013
42

    
43
* Florent CHUFFART
44

    
45
* Jean-Baptiste VEYRIERAS
46

    
47
* Gael YVERT
48

    
49
This software is a computer program which purpose is to perform
50
quanti- tative analysis of epigenetic marks at single nucleosome
51
resolution.
52

    
53
This software is governed by the CeCILL license under French law and
54
abiding by the rules of distribution of free software.  You can  use,
55
modify and/ or redistribute the software under the terms of the CeCILL
56
license as circulated by CEA, CNRS and INRIA at the following URL
57
"http://www.cecill.info".
58

    
59
As a counterpart to the access to the source code and  rights to copy,
60
modify and redistribute granted by the license, users are provided
61
only with a limited warranty  and the software's author,  the holder
62
of the economic rights,  and the successive licensors  have only
63
limited liability.
64

    
65
This software is provided with absolutely NO WARRANTY. The authors can
66
not be held responsible, even partially, for any damage, loss,
67
financial loss or any other undesired facts resulting from the use of
68
the software. In this respect, the user's attention is drawn to the
69
risks associated with loading,  using,  modifying and/or developing or
70
reproducing the software by the user in light of its specific status
71
of free software, that may mean  that it is complicated to manipulate,
72
and  that  also therefore means  that it is reserved for developers
73
and  experienced professionals having in-depth computer knowledge.
74
Users are therefore encouraged to load and test the software's
75
suitability as regards their requirements in conditions enabling the
76
security of their systems and/or data to be ensured and,  more
77
generally, to use and operate it in the same conditions as regards
78
security.
79

    
80
The fact that you are presently reading this means that you have had
81
knowledge of the CeCILL license and that you accept its terms.
82

    
83

    
84
Installation Instructions
85
=========================
86

    
87

    
88
Links
89
-----
90

    
91
*NucleoMiner2* home page and documentation are available here:
92

    
93
   * https://forge.cbp.ens-lyon.fr/redmine/projects/nucleominer
94

    
95
The Yvert lab web page is accessible here:
96

    
97
   * http://www.ens-lyon.fr/LBMC/gisv/
98

    
99

    
100
Installation
101
------------
102

    
103
The first installation step is to retrieve the source code of
104
NucleoMiner2. You can do this by typing the following command in a
105
terminal.
106

    
107
   git clone http://forge.cbp.ens-lyon.fr/git/nucleominer
108

    
109

    
110
Prerequisites
111
~~~~~~~~~~~~~
112

    
113
To work properly, NucleoMiner2 needs that the following free software
114
are installed and made available on your system:
115

    
116
   * Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2
117

    
118
   * SAMtools http://samtools.sourceforge.net
119

    
120
   * bedtools http://code.google.com/p/bedtools/
121

    
122
   * TemplateFilter
123
     http://compbio.cs.huji.ac.il/NucPosition/TemplateFiltering
124

    
125
It also requires the following R packages to be installed on your
126
system:
127

    
128
   * fork
129

    
130
   * rjson
131

    
132
   * seqinr
133

    
134
   * plotrix
135

    
136
   * DESeq
137

    
138
These packages can be installed by typing the following command in an
139
R console:
140

    
141
   install.packages(c("fork", "rjson", "seqinr", "plotrix"))
142
   source("http://bioconductor.org/biocLite.R")
143
   biocLite("DESeq")
144

    
145
Finally,by typing the git command above, you downloaded specific R
146
packages provided with NucleoMiner2 that you now need to install:
147

    
148
   * cachecache https://forge.cbp.ens-
149
     lyon.fr/redmine/projects/cachecache
150

    
151
   * bot https://forge.cbp.ens-lyon.fr/redmine/projects/bot
152

    
153
   * nucleominer https://forge.cbp.ens-
154
     lyon.fr/redmine/projects/nucleominer
155

    
156
To do so, type the following command in your terminal:
157

    
158
   cd nucleominer
159
   R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\
160
       doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\
161
       build/nucleominer_2.3.46.tar.gz
162

    
163
Tutorial
164
********
165

    
166
This tutorial describes steps allowing to perform quantitative
167
analysis of epigenetic marks on individual nucleosomes. We assume that
168
files are organised according to a given hierarchy and that all
169
command lines are launched from the project’s root directory.
170

    
171
This tutorial is divided into two main parts. The first part covers
172
the python script *wf.py* that aligns and converts short sequence
173
reads. The second part covers the R scripts that extracts nucleosome-
174
level information (nucleosome position and indicators) from the
175
dataset.
176

    
177

    
178
Experimental Dataset, Working Directory and Configuration File
179
==============================================================
180

    
181

    
182
Working Directory Organisation
183
------------------------------
184

    
185
After having installed NucleoMiner2 environment (Previous section), go
186
to the root working directory of the tutorial by typing the following
187
command in a terminal:
188

    
189
   cd doc/Chuffart_NM2_workdir/
190

    
191

    
192
Retrieving Experimental Dataset
193
-------------------------------
194

    
195
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress
196
(http://www.ebi.ac.uk/arrayexpress/) under accession number
197
E-MTAB-2671.
198

    
199
$$$ TODO explain how organise Experimental Dataset into the *data*
200
directory of the working directory.
201

    
202
In this tutorial, we want to compare nucleosomes of 2 yeast strains:
203
BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq
204
using an antibody recognizing the H3K14ac epigenetic mark. Illumina
205
sequencing was done in single-read of 50 bp long.
206

    
207
The dataset is composed of 55 files organised as follows:
208

    
209
   * 3 replicates for BY MNase Seq
210

    
211
     * sample 1 (5 fastq.gz files)
212

    
213
     * sample 2 (5 fastq.gz files)
214

    
215
     * sample 3 (4 fastq.gz files)
216

    
217
   * 3 replicates for RM MNase Seq
218

    
219
     * sample 4 (4 fastq.gz files)
220

    
221
     * sample 5 (4 fastq.gz files)
222

    
223
     * sample 6 (5 fastq.gz files)
224

    
225
   * 3 replicates for BY ChIP Seq H3K14ac
226

    
227
     * sample 36 (5 fastq.gz files)
228

    
229
     * sample 37 (5 fastq.gz files)
230

    
231
     * sample 53 (9 fastq.gz files)
232

    
233
   * 2 replicates for RM ChIP Seq H3K14ac
234

    
235
     * sample 38 (5 fastq.gz files)
236

    
237
     * sample 39 (4 fastq.gz files)
238

    
239

    
240
Python and R Common Configuration File
241
--------------------------------------
242

    
243
First, we need to define useful configuration variables that will be
244
passed to python and R scripts. These variables are contained in file
245
*configurator.py*. The execution of this python script dumps variables
246
into the *nucleominer_config.json* file that will then be used by both
247
R and python scripts.
248

    
249
The initialization of this variables is done in the configurator.py
250
file. If you need to adapt variable values (path, default
251
parameters...) you need to edit this file. Then, go to the root
252
directory of your project and run the following command to dump the
253
configuration file:
254

    
255
   python src/current/configurator.py
256

    
257

    
258
Preprocessing Illumina Fastq Reads for Each Sample
259
==================================================
260

    
261
Once variables and design have been specified, the script wf.py will
262
automatically run all the analysis. You don't need to do anything. To
263
run the full analysis, run the following command:
264

    
265
   python src/current/wf.py
266

    
267
The details of the steps performed by this script are explained below.
268
This preprocessing consists of 4 steps embedded in the *wf.py* script.
269
They are described bellow.  As a preamble, this script computes
270
*samples*, *samples_mnase* and *strains* that will be used along the 4
271
steps.
272

    
273
wf.samples = []
274

    
275
   List of samples where a sample is identified by an id (key: *id*)
276
   and a strain name (key *strain*).
277

    
278
wf.samples_mnase = []
279

    
280
   List of Mnase samples.
281

    
282
wf.strains = []
283

    
284
   List of reference strains.
285

    
286

    
287
Creating Bowtie Index from each Reference Genome
288
------------------------------------------------
289

    
290
For each strain, the script *wf.py* then creates bowtie index. Bowtie
291
index of a strain is a tree view of the genome of this strain. It will
292
be used by bowtie to align reads. The part of the script performing
293
this is the following:
294

    
295
     for strain in strains:
296
       per_strain_stats[strain] = create_bowtie_index(strain, 
297
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], 
298
         config["BOWTIE_BUILD_BIN"])
299

    
300
As an indication, the following table summarizes the file sizes and
301
process durations that we experienced when running this step on a
302
Linux server***.
303

    
304
+--------+------------------------+------------------------+------------------+
305
| strain | fasta genome file size | bowtie index file size | process duration |
306
+========+========================+========================+==================+
307
| BY     | 12 Mo                  | 25 Mo                  | 11 s.            |
308
+--------+------------------------+------------------------+------------------+
309
| RM     | 12 Mo                  | 24 Mo                  | 9 s.             |
310
+--------+------------------------+------------------------+------------------+
311

    
312

    
313
Aligning Reads to Reference Genome
314
----------------------------------
315

    
316
Next, the *wf.py* script launches bowtie to align reads to the
317
reference genome. It produces a *.sam* file that is converted into a
318
*.bed* file. Binaries for *bowtie*, *samtools* and *bedtools* are
319
wrapped using python *subprocess* class. This step is performed by the
320
following part of the script:
321

    
322
     for sample in samples:
323
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, 
324
         config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], 
325
         config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], 
326
         config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
327

    
328

    
329
Convert Aligned Reads into TemplateFilter Format
330
------------------------------------------------
331

    
332
TemplateFilter uses particular input formats for reads, so it is
333
necessary to convert the *.bed* files. TemplateFilter expect reads in
334
the following format: *chr*, *coord*, *strand* and *#read* where:
335

    
336
* *chr* is the number of the chromosome;
337

    
338
* *coord* is the coordinate of the reads;
339

    
340
* *strand* is *F* for forward and *R* for reverse;
341

    
342
* *#reads* the number of reads covering this position.
343

    
344
Each entry is *tab*-separated.
345

    
346
**WARNING** for reverse strands, bowtie returns the position of the
347
first nucleotide on the left hand side, whereas TemplateFilter expects
348
the first one on the right hand side.  This is taken into account in
349
NucleoMiner2 by adding the read length (in our case 50) to the reverse
350
reads coordinates.
351

    
352
This step is performed by the following part of the *wf.py* script:
353

    
354
     for sample in samples:
355
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, 
356
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], 
357
         config["READ_LENGTH"],config["MAPQ_THRES"])
358

    
359
The following table summarizes the number of reads, the involved file
360
sizes and process durations that we experienced when running the two
361
last steps. In our case, alignment process were multithreaded over 3
362
cores.
363

    
364
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
365
| id | Illumina reads | aligned and filtred reads | ratio  | *.bed* file size | TF input file size | process duration |
366
+====+================+===========================+========+==================+====================+==================+
367
| 1  | 16436138       | 10199695                  | 62,06% | 1064 Mo          | 60  Mo             | 383   s.         |
368
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
369
| 2  | 16911132       | 12512727                  | 73,99% | 1298 Mo          | 64  Mo             | 437   s.         |
370
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
371
| 3  | 15946902       | 12340426                  | 77,38% | 1280 Mo          | 65  Mo             | 423   s.         |
372
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
373
| 4  | 13765584       | 10381903                  | 75,42% | 931  Mo          | 59  Mo             | 352   s.         |
374
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
375
| 5  | 15168268       | 11502855                  | 75,83% | 1031 Mo          | 64  Mo             | 386   s.         |
376
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
377
| 6  | 18850820       | 14024905                  | 74,40% | 1254 Mo          | 69  Mo             | 482   s.         |
378
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
379
| 36 | 17715118       | 14092985                  | 79,55% | 1404 Mo          | 68  Mo             | 483   s.         |
380
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
381
| 37 | 17288466       | 7402082                   | 42,82% | 741  Mo          | 48  Mo             | 339   s.         |
382
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
383
| 38 | 16116394       | 13178457                  | 81,77% | 1101 Mo          | 63  Mo             | 420   s.         |
384
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
385
| 39 | 14241106       | 10537228                  | 73,99% | 880  Mo          | 57  Mo             | 348   s.         |
386
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
387
| 53 | 40876476       | 33780065                  | 82,64% | 3316 Mo          | 103 Mo             | 1165  s.         |
388
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
389

    
390

    
391
Run TemplateFilter on Mnase Samples
392
-----------------------------------
393

    
394
Finally, for each sample we perform TemplateFilter analysis.
395

    
396
**WARNING** TemplateFilter returns a list of nucleosomes. Each
397
nucleosome is defined by its center and its width. An odd width leads
398
us to consider non- integer lower and upper bound.
399

    
400
**WARNING** TemplateFilter was not designed to handle replicates. So
401
we recommend to keep a maximum of nucleosomes and filter the aberrant
402
ones afterwards using the benefits of having replicates. To do this,
403
we set a low correlation threshold parameter (0.5) and a particularly
404
high value of overlap (300%).
405

    
406
This step is performed by the following part of the *wf.py* script:
407

    
408
     for sample in samples_mnase:
409
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, 
410
         config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], 
411
         config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], 
412
         config["TF_MAXW"], config["TF_OL"])  
413

    
414
+----+--------+------------+---------------+------------------+
415
| id | strain | found nucs | nuc file size | process duration |
416
+====+========+============+===============+==================+
417
| 1  | BY     | 96214      | 68 Mo         | 1022 s.          |
418
+----+--------+------------+---------------+------------------+
419
| 2  | BY     | 91694      | 65 Mo         | 1038 s.          |
420
+----+--------+------------+---------------+------------------+
421
| 3  | BY     | 91205      | 65 Mo         | 1036 s.          |
422
+----+--------+------------+---------------+------------------+
423
| 4  | RM     | 88076      | 62 Mo         | 984 s.           |
424
+----+--------+------------+---------------+------------------+
425
| 5  | RM     | 90141      | 64 Mo         | 967 s.           |
426
+----+--------+------------+---------------+------------------+
427
| 6  | RM     | 87517      | 62 Mo         | 980 s.           |
428
+----+--------+------------+---------------+------------------+
429

    
430

    
431
Inferring Nucleosome Position and Extracting Read Counts
432
========================================================
433

    
434
The second part of the tutorial uses R
435
(http://http://www.r-project.org). NucleoMiner2 contains a set of R
436
scripts that will be sourced in R from a console launched at the root
437
of your project. These scripts are:
438

    
439
   * headers.R
440

    
441
   * extract_maps.R
442

    
443
   * translate_common_wp.R
444

    
445
   * split_samples.R
446

    
447
   * count_reads.R
448

    
449
   * get_size_factors
450

    
451
   * launch_deseq.R
452

    
453

    
454
The Script headers.R
455
--------------------
456

    
457
The script headers.R is included in all other R scripts. It is in
458
charge of:
459

    
460
   * launching libraries used in the scripts
461

    
462
   * launching configuration (design, strain, marker...)
463

    
464
   * computing and caching Common Uinterrupted Regions (CURs).
465
     Caching means storing the information in the computer's memory.
466

    
467
Note that you can customize the function “translate”. This function
468
allows you to use the alignments between genomes when performing
469
various tasks.
470

    
471
   * You may want to analyze data of a single strain (e.g.
472
     treatment/control, or only few mutations). In this case, the
473
     genome is identical across all samples and you do not need to
474
     define particular CURs (CURs are chromosomes). Simply use the
475
     default translate function which is neutral.
476

    
477
   * If you are analyzing data from two or more strains (as
478
     NucleoMiner2 was designed for), then you need to translate
479
     coordinates of one genome into the coordinates of another one.
480
     You must do this by aligning the two genomes, which will produce
481
     a .c2c file (see Appendice "Generate .c2c Files").  thenuse it to
482
     produce the list of regions and customise “translate”.
483

    
484
In our tutorial, we are in the second case and to perform all these
485
steps run the following command line in your R console:
486

    
487
   source("src/current/headers.R")
488

    
489

    
490
The Script extract_maps.R
491
-------------------------
492

    
493
This script is in charge of extracting Maps for well-positioned and
494
sensitive nucleosomes. First of all, this script computes intra and
495
inter-strain matches of nucleosome maps for each CUR. This step can be
496
executed in parallel on many cores using the BoT library. Next, it
497
collects results and produces maps of  well-positioned nucleosomes,
498
sensitive nucleosomes and Unaligned Nucleosomal Regions .
499

    
500
The map of well-positioned nucleosomes for BY is collected in the
501
result directory and is called *BY_wp.tab*. It is composed of
502
following columns:
503

    
504
   * chr, the number of the chromosome
505

    
506
   * lower_bound, the lower bound of the nucleosome
507

    
508
   * upper_bound, the upper bound of the nucleosome
509

    
510
   * cur_index, index of the CUR
511

    
512
   * index_nuc, the index of the nucleosome in the CUR
513

    
514
   * wp, 1 if it is a well positioned nucleosome, 0 otherwise
515

    
516
   * nb_reads, the number of reads that support this nucleosome
517

    
518
   * nb_nucs, the number of TemplateFilter nucleosome across
519
     replicates (= the number of replicates in which it is a well-
520
     positioned nucleosome)
521

    
522
   * llr_1, for a well-positioned nucleosome, it is the LLR1 (log-
523
     likelihood ratio) between the first and the second TemplateFilter
524
     nucleosome on the chain.
525

    
526
   * llr_2, for a well-positioned nucleosome, it is the LLR1 between
527
     the second and the third TemplateFilter nucleosome on the chain.
528

    
529
   * wp_llr, for a well-positioned nucleosome, it is the LLR2 that
530
     compares consistency of the positioning over all TemplateFilter
531
     nucleosomes.
532

    
533
   * wp_pval, for a well-positioned nucleosome, it is the p-value
534
     chi square test obtained from LLR2 (*1-pchisq(2.LLR2, df=4)*)
535

    
536
   * dyad_shift, for a well-positioned nucleosome, it is the shift
537
     between the two extreme TemplateFilter nucleosome dyad positions.
538

    
539
The sensitive map for BY is collected in the result directory and is
540
called *BY_fuzzy.tab*. It is composed of following columns:
541

    
542
   * chr, the number of the chromosome
543

    
544
   * lower_bound, the lower bound of the nucleosome
545

    
546
   * upper_bound, the upper bound of the nucleosome
547

    
548
   * cur_index, index of the CUR
549

    
550
The map of common well-positioned nucleosomes aligned between the BY
551
and RM strains is collected in the result directory and is called
552
*BY_RM_common_wp.tab*. It is composed of following columns:
553

    
554
   * cur_index, the index of the CUR
555

    
556
   * index_nuc_BY, the index of the BY nucleosome in the CUR
557

    
558
   * index_nuc_RM, the index of the RM nucleosome in the CUR
559

    
560
   * llr_score, , the LLR3 score that estimates conservation between
561
     the positions in BY and RM
562

    
563
   * common_wp_pval,  the p-value chi square test obtained from LLR3
564
     (*1-pchisq(2.LLR3, df=2)*)
565

    
566
   * diff, the dyads shift between the positions in the two strains
567
     (in bp)
568

    
569
The common UNR map for BY and RM strains is collected in the result
570
directory and is called *BY_RM_common_unr.tab*. It is composed of the
571
following columns:
572

    
573
   * cur_index, the index of the CUR
574

    
575
   * index_nuc_BY, the index of the BY nucleosome in the CUR
576

    
577
   * index_nuc_RM,the index of the RM nucleosome in the CUR
578

    
579
To execute this script, run the following command in your R console:
580

    
581
   source("src/current/extract_maps.R")
582

    
583

    
584
The Script translate_common_wp.R
585
--------------------------------
586

    
587
This script is used to translate common well-positioned nucleosome
588
positions from a strain to another strain and stores it into a table.
589

    
590
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM
591
well-positioned nucleosomes translated into the BY genome coordinates.
592
It is composed of following columns:
593

    
594
   * strain_ref, the reference genome (in which positioned are
595
     defined)
596

    
597
   * begin, the translated lower bound of the nucleosome
598

    
599
   * end, the translated upper bound of the nucleosome
600

    
601
   * chr, the number of chromosomes for the reference genome (in
602
     which positioned are defined)
603

    
604
   * length, the length of the nucleosome (could be negative)
605

    
606
   * cur_index, the index of the CUR
607

    
608
   * index_nuc, the index of the nucleosome in the CUR
609

    
610
To execute this script, run the following command in your R console:
611

    
612
   source("src/current/translate_common_wp.R")
613

    
614

    
615
The Script split_samples.R
616
--------------------------
617

    
618
To optimize memory space usage, we split and compress TemplateFilter
619
input files according to their corresponding  chromosome. for example,
620
*sample_1_TF.tab* will be split into :
621

    
622
   * sample_1_chr_1_splited_sample.tab.gz
623

    
624
   * sample_1_chr_2_splited_sample.tab.gz
625

    
626
   * ...
627

    
628
   * sample_1_chr_17_splited_sample.tab.gz
629

    
630
To execute this script, run the following command in your R console:
631

    
632
   source("src/current/split_samples.R")
633

    
634

    
635
The Script count_reads.R
636
------------------------
637

    
638
To associate a number of observations (read) to each nucleosome we run
639
the script *count_reads.R*. It produces the files
640
*BY_RM_H3K14ac_wp_and_nbreads.tab*,
641
*BY_RM_H3K14ac_unr_and_nbreads.tab*
642
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and
643
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well-
644
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned
645
nucleosomes and Mnase UNRs respectively.
646

    
647
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains
648
counted reads for well-positioned nucleosomes with the experimental
649
condition ChIP H3K14ac. It is composed of the following columns:
650

    
651
   * chr_BY, the number of the chromosome for BY
652

    
653
   * lower_bound_BY, the lower bound of the nucleosome for BY
654

    
655
   * upper_bound_BY, the upper bound of the nucleosome  for BY
656

    
657
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
658

    
659
   * chr_RM, the number of the chromosome for RM
660

    
661
   * lower_bound_RM, the lower bound of the nucleosome for RM
662

    
663
   * upper_bound_RM, the upper bound of the nucleosome  for RM
664

    
665
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
666

    
667
   * cur_index, index of the CUR
668

    
669
   * BY_H3K14ac_36, the number of reads for the current nucleosome
670
     for the sample 36
671

    
672
   * BY_H3K14ac_37, #reads for sample 37
673

    
674
   * BY_H3K14ac_53, #reads for sample 53
675

    
676
   * RM_H3K14ac_38, #reads for sample 38
677

    
678
   * RM_H3K14ac_39, #reads for sample 39
679

    
680
To execute this script, run the following command in your R console:
681

    
682
   source("src/current/count_reads.R")
683

    
684

    
685
The Script get_size_factors.R
686
-----------------------------
687

    
688
This script uses the DESeq function *estimateSizeFactors* to compute
689
the size factor of each sample. It corresponds to normalisation of
690
read counts from sample to sample, as determined by DESeq. When a
691
sample has n reads for a nucleosome or a UNR, the normalised count is
692
n/f where f is the factor contained in this file. The script dumps
693
computed size factors into the file *size_factors.tab*. This file has
694
the form:
695

    
696
+-----------+---------+---------+---------+
697
| sample_id | wp      | unr     | wpunr   |
698
+===========+=========+=========+=========+
699
| 1         | 0.87396 | 0.88097 | 0.87584 |
700
+-----------+---------+---------+---------+
701
| 2         | 1.07890 | 1.07440 | 1.07760 |
702
+-----------+---------+---------+---------+
703
| 3         | 1.06400 | 1.05890 | 1.06250 |
704
+-----------+---------+---------+---------+
705
| 4         | 0.85782 | 0.87948 | 0.86305 |
706
+-----------+---------+---------+---------+
707
| 5         | 0.97577 | 0.96590 | 0.97307 |
708
+-----------+---------+---------+---------+
709
| 6         | 1.19630 | 1.18120 | 1.19190 |
710
+-----------+---------+---------+---------+
711
| 36        | 0.93318 | 0.92762 | 0.93166 |
712
+-----------+---------+---------+---------+
713
| 37        | 0.48315 | 0.48453 | 0.48350 |
714
+-----------+---------+---------+---------+
715
| 38        | 1.11240 | 1.11210 | 1.11230 |
716
+-----------+---------+---------+---------+
717
| 39        | 0.89897 | 0.89917 | 0.89903 |
718
+-----------+---------+---------+---------+
719
| 53        | 2.22650 | 2.22700 | 2.22660 |
720
+-----------+---------+---------+---------+
721

    
722
sample_id are given in file samples.csv
723

    
724
If you don't know which column to use for normalization, we recommend
725
using wpunr.
726

    
727
Here are the details of the factors produced:
728

    
729
   * unr: factor computed from data of UNR regions. These regions
730
     are defined for every pairs of aligned genomes (e.g. BY_RM)
731

    
732
   * wp: same, but for well-positioned nucleosomes.
733

    
734
   * wpunr: both types of regions.
735

    
736
To execute this script, run the following command in your R console:
737

    
738
   source("src/current/get_size_factors.R")
739

    
740

    
741
The Script launch_deseq.R
742
-------------------------
743

    
744
Finally, the script *launch_deseq.R* perform statistical analysis on
745
each nucleosome using *DESeq*. It produces files:
746

    
747
   * results/current/BY_RM_H3K14ac_wp_snep.tab
748

    
749
   * results/current/BY_RM_H3K14ac_unr_snep.tab
750

    
751
   * results/current/BY_RM_H3K14ac_wpunr_snep.tab
752

    
753
   * results/current/BY_RM_H3K14ac_wp_mnase.tab
754

    
755
   * results/current/BY_RM_H3K14ac_unr_mnase.tab
756

    
757
   * results/current/BY_RM_H3K14ac_wpunr_mnase.tab
758

    
759
These files are organised with the following columns (see file
760
*BY_RM_H3K14ac_wp_snep.tab* for an example):
761

    
762
   * chr_BY, the number of the chromosome for BY
763

    
764
   * lower_bound_BY, the lower bound of the nucleosome for BY
765

    
766
   * upper_bound_BY, the upper bound of the nucleosome  for BY
767

    
768
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
769

    
770
   * chr_RM, the number of the chromosome for RM
771

    
772
   * lower_bound_RM, the lower bound of the nucleosome for RM
773

    
774
   * upper_bound_RM, the upper bound of the nucleosome  for RM
775

    
776
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
777

    
778
   * cur_index, index of the CUR
779

    
780
   * form
781

    
782
   * BY_Mnase_Seq_1, the number of reads for the current nucleosome
783
     for the sample 1
784

    
785
Next columns concern indicators for each sample:
786

    
787
   * BY_Mnase_Seq_2, #reads for sample 2
788

    
789
   * BY_Mnase_Seq_3, #reads for sample 3
790

    
791
   * RM_Mnase_Seq_4, #reads for sample 4
792

    
793
   * RM_Mnase_Seq_5, #reads for sample 5
794

    
795
   * RM_Mnase_Seq_6, #reads for sample 6
796

    
797
   * BY_H3K14ac_36, #reads for sample 36
798

    
799
   * BY_H3K14ac_37, #reads for sample 37
800

    
801
   * BY_H3K14ac_53, #reads for sample 53
802

    
803
   * RM_H3K14ac_38, #reads for sample 38
804

    
805
   * RM_H3K14ac_39, #reads for sample 39
806

    
807
The 5 last columns concern DESeq analysis:
808

    
809
   * manip[a_manip] strain[a_strain]
810
     manip[a_strain]:strain[a_strain], the manip (marker) effect, the
811
     strain effect and the snep effect. These are the coefficients of
812
     the fitted generalized linear model.
813

    
814
   * pvalsGLM, the pvalue resulting from the comparison of the GLM
815
     model considering the interaction term *marker:strain* to the GLM
816
     model that does not consider it. This is the statsitcial
817
     significance of the interaction term and therefore the
818
     statistical significance of the SNEP.
819

    
820
   * snep_index, a boolean set to TRUE if the pvalueGLM value is
821
     under the threshold computed with FDR function with a rate set to
822
     0.0001.
823

    
824
To execute this script, run the following command in your R console:
825

    
826
   source("src/current/launch_deseq.R")
827

    
828

    
829
Results: Number of SNEPs
830
========================
831

    
832
Here are the number of computed SNEPs for each forms.
833

    
834
+-------+---------+-------+---------+
835
| form  | strains | #nucs | H3K14ac |
836
+=======+=========+=======+=========+
837
| wp    | BY-RM   | 30464 | 3549    |
838
+-------+---------+-------+---------+
839
| unr   | BY-RM   | 9497  | 1559    |
840
+-------+---------+-------+---------+
841
| wpunr | BY-RM   | 39961 | 5240    |
842
+-------+---------+-------+---------+
843

    
844

    
845
APPENDICE: Generate .c2c Files
846
==============================
847

    
848
The *.c2c* files is a simple table that describes how two genome
849
sequences are aligned. This file can be generated by using scripts
850
that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics
851
2010) and which we provide in this release of NucleoMiner2.
852

    
853
To use these scripts on your UNIX/LINUX computer you need first to
854
install MUMmer which is designed to rapidly align entire genomes,
855
whether in complete or draft form.
856

    
857

    
858
Installing MUMmer
859
-----------------
860

    
861
Get the last version of MUMmer archive on your computer
862
(MUMmer3.23.tar.gz is provided in the directory deps of your working
863
directory). Copy it in a dedicated directory. Install it locally into
864
the src folder of you working directory by typing (working directory):
865

    
866
tar -xvzf MUMmer3.23.tar.gz
867

    
868
   cd src
869
   tar xfvz ../deps/MUMmer3.23.tar.gz
870
   cd MUMmer3.23
871
   make check
872
   make install
873

    
874

    
875
Installing NucleoMiner 1.0 scripts
876
----------------------------------
877

    
878
Get the nucleominer-1.0.tar.gz archive on your computer (this archive
879
is provided in the directory deps of your working directory). Install
880
it locally into the src folder of you working directory by typing
881
(working directory):
882

    
883
   cd src
884
   tar xfvz ../deps/nucleominer-1.0.tar.gz
885
   cd ..
886

    
887
This creates a directory that contains  NucleoMiner 1.0 scripts
888
(src/nucleominer-1.0/scripts).
889

    
890

    
891
Generate .c2c Files
892
-------------------
893

    
894
To generate .c2c files you need to type the following command in a
895
terminal:
896

    
897
   export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts
898
   export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/
899
   NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
900
     data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
901
     data/byxrm 2>NMgxcomp.log
902

    
903
After execution, the directory *data* will hold the .c2c files.