Statistiques
| Branche: | Révision :

root / README @ 3c88abd0

Historique | Voir | Annoter | Télécharger (30,28 ko)

1

    
2
Welcome to *NucleoMiner2*
3
*************************
4

    
5
* Readme / Documentation for *NucleoMiner2*
6

    
7
  * License
8

    
9
  * Installation Instructions
10

    
11
* Tutorial
12

    
13
  * Experimental Dataset, Working Directory and Configuration File
14

    
15
  * Preprocessing Illumina Fastq Reads for Each Sample
16

    
17
  * Inferring Nucleosome Position and Extracting Read Counts
18

    
19
  * Results: Number of SNEPs
20

    
21
  * APPENDICE: Generate .c2c Files
22

    
23
* References
24

    
25
  * Python Reference
26

    
27
  * R Reference
28

    
29
Readme / Documentation for *NucleoMiner2*
30
*****************************************
31

    
32
*NucleoMiner2* offers Python API and R package allowing to perform
33
quantitative analysis of epigenetic marks on individual nucleosomes.
34
It was developed to detect natural Single-Nucleosome Epi-Polymorphisms
35
(SNEP) from MNase-seq and ChIP-seq data.
36

    
37

    
38
License
39
=======
40

    
41
Copyright CNRS 2012-2013
42

    
43
* Florent CHUFFART
44

    
45
* Jean-Baptiste VEYRIERAS
46

    
47
* Gael YVERT
48

    
49
This software is a computer program which purpose is to perform
50
quanti- tative analysis of epigenetic marks at single nucleosome
51
resolution.
52

    
53
This software is governed by the CeCILL license under French law and
54
abiding by the rules of distribution of free software.  You can  use,
55
modify and/ or redistribute the software under the terms of the CeCILL
56
license as circulated by CEA, CNRS and INRIA at the following URL
57
"http://www.cecill.info".
58

    
59
As a counterpart to the access to the source code and  rights to copy,
60
modify and redistribute granted by the license, users are provided
61
only with a limited warranty  and the software's author,  the holder
62
of the economic rights,  and the successive licensors  have only
63
limited liability.
64

    
65
This software is provided with absolutely NO WARRANTY. The authors can
66
not be held responsible, even partially, for any damage, loss,
67
financial loss or any other undesired facts resulting from the use of
68
the software. In this respect, the user's attention is drawn to the
69
risks associated with loading,  using,  modifying and/or developing or
70
reproducing the software by the user in light of its specific status
71
of free software, that may mean  that it is complicated to manipulate,
72
and  that  also therefore means  that it is reserved for developers
73
and  experienced professionals having in-depth computer knowledge.
74
Users are therefore encouraged to load and test the software's
75
suitability as regards their requirements in conditions enabling the
76
security of their systems and/or data to be ensured and,  more
77
generally, to use and operate it in the same conditions as regards
78
security.
79

    
80
The fact that you are presently reading this means that you have had
81
knowledge of the CeCILL license and that you accept its terms.
82

    
83

    
84
Installation Instructions
85
=========================
86

    
87

    
88
Links
89
-----
90

    
91
*NucleoMiner2* home page and documentation are available here:
92

    
93
   * https://forge.cbp.ens-lyon.fr/redmine/projects/nucleominer
94

    
95
The Yvert lab web page is accessible here:
96

    
97
   * http://www.ens-lyon.fr/LBMC/gisv/
98

    
99

    
100
Installation
101
------------
102

    
103
The first installation step is to retrieve the source code of
104
MyLabStocks. You can do this by typing the following command in a
105
terminal.
106

    
107
   git clone http://forge.cbp.ens-lyon.fr/git/nucleominer
108

    
109

    
110
Prerequisites
111
~~~~~~~~~~~~~
112

    
113
To work properly, NucleoMiner2 needs that the following free software
114
are installed and made available on your system:
115

    
116
   * Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2
117

    
118
   * SAMtools http://samtools.sourceforge.net
119

    
120
   * bedtools http://code.google.com/p/bedtools/
121

    
122
   * TemplateFilter
123
     http://compbio.cs.huji.ac.il/NucPosition/TemplateFiltering
124

    
125
It also requires the following R packages to be installed on your
126
system:
127

    
128
   * fork
129

    
130
   * rjson
131

    
132
   * seqinr
133

    
134
   * plotrix
135

    
136
   * DESeq
137

    
138
   * cachecache https://forge.cbp.ens-
139
     lyon.fr/redmine/projects/cachecache
140

    
141
   * bot https://forge.cbp.ens-lyon.fr/redmine/projects/bot
142

    
143
   * nucleominer https://forge.cbp.ens-
144
     lyon.fr/redmine/projects/nucleominer
145

    
146
The first packages could be installed by typing the following command
147
in an R console:
148

    
149
   install.packages(c("fork", "rjson", "seqinr", "plotrix"))
150
   source("http://bioconductor.org/biocLite.R")
151
   biocLite("DESeq")
152

    
153
The last packages are available in the git repository they could be
154
install by typing the following command in your terminal:
155

    
156
   cd nucleominer
157
   R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\
158
       doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\
159
       build/nucleominer_2.3.46.tar.gz
160

    
161
Tutorial
162
********
163

    
164
This tutorial describes steps allowing performing quantitative
165
analysis of epigenetic marks on individual nucleosomes. We assume that
166
files are organised according to a given hierarchy and that all
167
command lines are launched from the project’s root directory.
168

    
169
This tutorial is divided into two main parts. The first part covers
170
the python script *wf.py* that aligns and converts short sequence
171
reads. The second part covers the R scripts that extracts information
172
(nucleosome position and indicators) from the dataset.
173

    
174

    
175
Experimental Dataset, Working Directory and Configuration File
176
==============================================================
177

    
178

    
179
Working Directory Organisation
180
------------------------------
181

    
182
After having install NucleoMiner2 environment (Previous section), go
183
to the root working directory of the tutorial by typing the following
184
command in a terminal:
185

    
186
   cd doc/Chuffart_NM2_workdir/
187

    
188

    
189
Retrieving Experimental Dataset
190
-------------------------------
191

    
192
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress
193
(http://www.ebi.ac.uk/arrayexpress/) under accession number
194
E-MTAB-2671.
195

    
196
$$$ TODO explain how organise Experimental Dataset into the *data*
197
directory of the working directory.
198

    
199
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each
200
strain we performed Mnase-Seq and ChIP-Seq using an antibody
201
recognizing the H3K14ac epigenetic mark.
202

    
203
The dataset is composed of 55 files organised as follows:
204

    
205
   * 3 replicates for BY MNase Seq
206

    
207
     * sample 1 (5 fastq.gz files)
208

    
209
     * sample 2 (5 fastq.gz files)
210

    
211
     * sample 3 (4 fastq.gz files)
212

    
213
   * 3 replicates for RM MNase Seq
214

    
215
     * sample 4 (4 fastq.gz files)
216

    
217
     * sample 5 (4 fastq.gz files)
218

    
219
     * sample 6 (5 fastq.gz files)
220

    
221
   * 3 replicates for BY ChIP Seq H3K14ac
222

    
223
     * sample 36 (5 fastq.gz files)
224

    
225
     * sample 37 (5 fastq.gz files)
226

    
227
     * sample 53 (9 fastq.gz files)
228

    
229
   * 2 replicates for RM ChIP Seq H3K14ac
230

    
231
     * sample 38 (5 fastq.gz files)
232

    
233
     * sample 39 (4 fastq.gz files)
234

    
235

    
236
Python and R Common Configuration File
237
--------------------------------------
238

    
239
First of all we define in one place some configuration variables that
240
will be launched by python and R scripts. These variables are
241
contained in file *configurator.py*. The execution of this python
242
script dumps variables into the *nucleominer_config.json* file that
243
will then be used by both R and python scripts.
244

    
245
To do this, go to the root directory of your project and run the
246
following command:
247

    
248
   python src/current/configurator.py
249

    
250

    
251
Preprocessing Illumina Fastq Reads for Each Sample
252
==================================================
253

    
254
This preprocessing step consists of 4 main steps embedded in the
255
*wf.py* script. They are described bellow. As a preamble, this script
256
computes *samples*, *samples_mnase* and *strains* that will be used
257
along the 4 steps.
258

    
259
wf.samples = []
260

    
261
   List of samples where a sample is identified by an id (key: *id*)
262
   and a strain name (key *strain*).
263

    
264
wf.samples_mnase = []
265

    
266
   List of Mnase samples.
267

    
268
wf.strains = []
269

    
270
   List of reference strains.
271

    
272

    
273
Creating Bowtie Index from each Reference Genome
274
------------------------------------------------
275

    
276
For each strain, we need to create bowtie index. Bowtie index of a
277
strain is a tree view of the genome of this strain. It will be used by
278
bowtie to align reads. This step is performed by the following part of
279
the *wf.py* script:
280

    
281
     for strain in strains:
282
       per_strain_stats[strain] = create_bowtie_index(strain, 
283
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], 
284
         config["BOWTIE_BUILD_BIN"])
285

    
286
The following table summarizes the file sizes and process durations
287
concerning this step.
288

    
289
+--------+------------------------+------------------------+------------------+
290
| strain | fasta genome file size | bowtie index file size | process duration |
291
+========+========================+========================+==================+
292
| BY     | 12 Mo                  | 25 Mo                  | 11 s.            |
293
+--------+------------------------+------------------------+------------------+
294
| RM     | 12 Mo                  | 24 Mo                  | 9 s.             |
295
+--------+------------------------+------------------------+------------------+
296

    
297

    
298
Aligning Reads to Reference Genome
299
----------------------------------
300

    
301
Next, we launch bowtie to align reads to the reference genome. It
302
produces a *.sam* file that we convert into a *.bed* file. Binaries
303
for *bowtie*, *samtools* and *bedtools* are wrapped using python
304
*subprocess* class. This step is performed by the following part of
305
the *wf.py* script:
306

    
307
     for sample in samples:
308
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, 
309
         config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], 
310
         config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], 
311
         config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
312

    
313

    
314
Convert Aligned Reads into TemplateFilter Format
315
------------------------------------------------
316

    
317
TemplateFilter uses particular input formats for reads, so it is
318
necessary to convert the *.bed* files. TemplateFilter expect reads as
319
follows: *chr*, *coord*, *strand* and *#read* where:
320

    
321
* *chr* is the number of the chromosome;
322

    
323
* *coord* is the coordinate of the reads;
324

    
325
* *strand* is *F* for forward and *R* for reverse;
326

    
327
* *#reads* the number of reads covering this position.
328

    
329
Each entry is *tab*-separated.
330

    
331
**WARNING** for reverse strands, bowtie returns the position of the
332
first nucleotide on the left hand side, whereas TemplateFilter expects
333
the first one on the right hand side.  This step takes this into
334
account by adding the read length (in our case 50) to the reverse
335
reads coordinates.
336

    
337
This step is performed by the following part of the *wf.py* script:
338

    
339
     for sample in samples:
340
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, 
341
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], 
342
         config["READ_LENGTH"],config["MAPQ_THRES"])
343

    
344
The following table summarises the number of reads, the involved file
345
sizes and process durations concerning the two last steps. In our
346
case, alignment process have been multithreaded over 3 cores.
347

    
348
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
349
| id | Illumina reads | aligned and filtred reads | ratio  | *.bed* file size | TF input file size | process duration |
350
+====+================+===========================+========+==================+====================+==================+
351
| 1  | 16436138       | 10199695                  | 62,06% | 1064 Mo          | 60  Mo             | 383   s.         |
352
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
353
| 2  | 16911132       | 12512727                  | 73,99% | 1298 Mo          | 64  Mo             | 437   s.         |
354
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
355
| 3  | 15946902       | 12340426                  | 77,38% | 1280 Mo          | 65  Mo             | 423   s.         |
356
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
357
| 4  | 13765584       | 10381903                  | 75,42% | 931  Mo          | 59  Mo             | 352   s.         |
358
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
359
| 5  | 15168268       | 11502855                  | 75,83% | 1031 Mo          | 64  Mo             | 386   s.         |
360
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
361
| 6  | 18850820       | 14024905                  | 74,40% | 1254 Mo          | 69  Mo             | 482   s.         |
362
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
363
| 36 | 17715118       | 14092985                  | 79,55% | 1404 Mo          | 68  Mo             | 483   s.         |
364
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
365
| 37 | 17288466       | 7402082                   | 42,82% | 741  Mo          | 48  Mo             | 339   s.         |
366
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
367
| 38 | 16116394       | 13178457                  | 81,77% | 1101 Mo          | 63  Mo             | 420   s.         |
368
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
369
| 39 | 14241106       | 10537228                  | 73,99% | 880  Mo          | 57  Mo             | 348   s.         |
370
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
371
| 53 | 40876476       | 33780065                  | 82,64% | 3316 Mo          | 103 Mo             | 1165  s.         |
372
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
373

    
374

    
375
Run TemplateFilter on Mnase Samples
376
-----------------------------------
377

    
378
Finally, for each sample we perform TemplateFilter analysis.
379

    
380
**WARNING** TemplateFilter returns a list of nucleosomes. Each
381
nucleosome is define by its center and its width. An odd width leads
382
us to consider non- integer lower and upper bound.
383

    
384
**WARNING** TemplateFilter is not designed to deal with replicates. So
385
we recommend to keep a maximum of nucleosomes and filter the aberrant
386
ones afterwards using the benefits of having replicates. To do this,
387
we set a low correlation threshold parameter (0.5) and a particularly
388
high value of overlap (300%).
389

    
390
This step is performed by the following part of the *wf.py* script:
391

    
392
     for sample in samples_mnase:
393
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, 
394
         config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], 
395
         config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], 
396
         config["TF_MAXW"], config["TF_OL"])  
397

    
398
+----+--------+------------+---------------+------------------+
399
| id | strain | found nucs | nuc file size | process duration |
400
+====+========+============+===============+==================+
401
| 1  | BY     | 96214      | 68 Mo         | 1022 s.          |
402
+----+--------+------------+---------------+------------------+
403
| 2  | BY     | 91694      | 65 Mo         | 1038 s.          |
404
+----+--------+------------+---------------+------------------+
405
| 3  | BY     | 91205      | 65 Mo         | 1036 s.          |
406
+----+--------+------------+---------------+------------------+
407
| 4  | RM     | 88076      | 62 Mo         | 984 s.           |
408
+----+--------+------------+---------------+------------------+
409
| 5  | RM     | 90141      | 64 Mo         | 967 s.           |
410
+----+--------+------------+---------------+------------------+
411
| 6  | RM     | 87517      | 62 Mo         | 980 s.           |
412
+----+--------+------------+---------------+------------------+
413

    
414

    
415
Inferring Nucleosome Position and Extracting Read Counts
416
========================================================
417

    
418
The second part of the tutorial uses R
419
(http://http://www.r-project.org). It consists of a set of R scripts
420
that will be sourced in an R from a console launched at the root of
421
your project. These scripts are:
422

    
423
   * headers.R
424

    
425
   * extract_maps.R
426

    
427
   * translate_common_wp.R
428

    
429
   * split_samples.R
430

    
431
   * count_reads.R
432

    
433
   * get_size_factors
434

    
435
   * launch_deseq.R
436

    
437

    
438
The Script headers.R
439
--------------------
440

    
441
The script headers.R is included in each other scripts. It is in
442
charge of:
443

    
444
   * launching libraries used in the scripts
445

    
446
   * launching configuration (design, strain, marker...)
447

    
448
   * computing and caching CURs (caching means storing the
449
     information in the computer's memory)
450

    
451
Note that you can customize the function “translate”. This function
452
allows you to use the alignments between genomes when performing
453
various tasks. You may be using NucleoMiner2 to analyse data of a
454
single strain, or of several strains.
455

    
456
   * All the data corresponds to the same strain (e.g.
457
     treatment/control, or only few mutations): Then in step 1), the
458
     regions to use are entire chromosomes. Instep 2) simply use the
459
     default translate function which is neutral.
460

    
461
   * The data come from two or more strains: In this case, edit a
462
     list of regions and customize the translate function which
463
     performs the correspondence between the different genomes. How we
464
     did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to
465
     the Appendice "Generate .c2c Files"), then use it to produce the
466
     list of regions and customise “translate”.
467

    
468
In your R console, run the following command line:
469

    
470
   source("src/current/headers.R")
471

    
472

    
473
The Script extract_maps.R
474
-------------------------
475

    
476
This script is in charge of extracting Maps for well-positioned and
477
fuzzy nucleosomes. First of all, this script computes intra and inter-
478
strain nucleosome maps for each CUR. This step is executed in parallel
479
on many cores using the BoT library. Next, it collects results and
480
produces well-positioned, fuzzy and UNR maps.
481

    
482
The well-positioned map for BY is collected in the result directory
483
and is called *BY_wp.tab*. It is composed of following columns:
484

    
485
   * chr, the number of the chromosome
486

    
487
   * lower_bound, the lower bound of the nucleosome
488

    
489
   * upper_bound, the upper bound of the nucleosome
490

    
491
   * cur_index, index of the CUR
492

    
493
   * index_nuc, the index of the nucleosome in the CUR
494

    
495
   * wp, 1 if it is a well positioned nucleosome, 0 otherwise
496

    
497
   * nb_reads, the number of reads that support this nucleosome
498

    
499
   * nb_nucs, the number of TemplateFilter nucleosome across
500
     replicates (= the number of replicates in which it is a well-
501
     positioned nucleosome)
502

    
503
   * llr_1, for a well-positioned nucleosome, it is the LLR1 (log-
504
     likelihood ratio) between the first and the second TemplateFilter
505
     nucleosome on the chain.
506

    
507
   * llr_2, for a well-positioned nucleosome, it is the LLR1 between
508
     the second and the third TemplateFilter nucleosome on the chain.
509

    
510
   * wp_llr, for a well-positioned nucleosome, it is the LLR2 that
511
     compares consistency of the positioning over all TemplateFilter
512
     nucleosomes.
513

    
514
   * wp_pval, for a well-positioned nucleosome, it is the p-value
515
     chi square test obtained with the LLR2 (*1-pchisq(2.LLR2, df=4)*)
516

    
517
   * dyad_shift, for a well-positioned nucleosome, it is the shift
518
     between the two extreme TemplateFilter nucleosome dyad positions.
519

    
520
The fuzzy map for BY is collected in the result directory and is
521
called *BY_fuzzy.tab*. It is composed of following columns:
522

    
523
   * chr, the number of the chromosome
524

    
525
   * lower_bound, the lower bound of the nucleosome
526

    
527
   * upper_bound, the upper bound of the nucleosome
528

    
529
   * cur_index, index of the CUR
530

    
531
The map of common well-positioned nucleosomes aligned between the BY
532
and RM strains is collected in the result directory and is called
533
*BY_RM_common_wp.tab*. It is composed of following columns:
534

    
535
   * cur_index, the index of the CUR
536

    
537
   * index_nuc_BY, the index of the BY nucleosome in the CUR
538

    
539
   * index_nuc_RM, the index of the RM nucleosome in the CUR
540

    
541
   * llr_score, , the LLR3 score that estimates conservation between
542
     the positions in BY and RM
543

    
544
   * common_wp_pval,  the p-value chi square test obtained from LLR3
545
     (*1-pchisq(2.LLR3, df=2)*)
546

    
547
   * diff, the dyads shift between the positions in the two strains
548

    
549
The common UNR map for BY and RM strains is collected in the result
550
directory and is called *BY_RM_common_unr.tab*. It is composed of the
551
following columns:
552

    
553
   * cur_index, the index of the CUR
554

    
555
   * index_nuc_BY, the index of the BY nucleosome in the CUR
556

    
557
   * index_nuc_RM,the index of the RM nucleosome in the CUR
558

    
559
To execute this script, run the following command in your R console:
560

    
561
   source("src/current/extract_maps.R")
562

    
563

    
564
The Script translate_common_wp.R
565
--------------------------------
566

    
567
This script is used to translate common well-positioned nucleosome
568
maps from a strain to another strain and stores it into a table.
569

    
570
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM
571
well-positioned nucleosome translated into the BY genome coordinates.
572
It is composed of following columns:
573

    
574
   * strain_ref, the reference genome (in which positioned are
575
     defined)
576

    
577
   * begin, the translated lower bound of the nucleosome
578

    
579
   * end, the translated upper bound of the nucleosome
580

    
581
   * chr, the number of chromosomes for the reference genome (in
582
     which positioned are defined)
583

    
584
   * length, the length of the nucleosome (could be negative)
585

    
586
   * cur_index, the index of the CUR
587

    
588
   * index_nuc, the index of the nucleosome in the CUR
589

    
590
To execute this script, run the following command in your R console:
591

    
592
   source("src/current/translate_common_wp.R")
593

    
594

    
595
The Script split_samples.R
596
--------------------------
597

    
598
For memory space usage reasons, we split and compress TemplateFilter
599
input files according to their corresponding  chromosome. for example,
600
*sample_1_TF.tab* will be split into :
601

    
602
   * sample_1_chr_1_splited_sample.tab.gz
603

    
604
   * sample_1_chr_2_splited_sample.tab.gz
605

    
606
   * ...
607

    
608
   * sample_1_chr_17_splited_sample.tab.gz
609

    
610
To execute this script, run the following command in your R console:
611

    
612
   source("src/current/split_samples.R")
613

    
614

    
615
The Script count_reads.R
616
------------------------
617

    
618
To associate a number of observations (read) to each nucleosome we run
619
the script *count_reads.R*. It produces the files
620
*BY_RM_H3K14ac_wp_and_nbreads.tab*,
621
*BY_RM_H3K14ac_unr_and_nbreads.tab*
622
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and
623
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well-
624
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned
625
nucleosomes and Mnase UNRs respectively.
626

    
627
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains
628
counted reads for well-positioned nucleosomes with the experimental
629
condition ChIP H3K14ac. It is composed of the following columns:
630

    
631
   * chr_BY, the number of the chromosome for BY
632

    
633
   * lower_bound_BY, the lower bound of the nucleosome for BY
634

    
635
   * upper_bound_BY, the upper bound of the nucleosome  for BY
636

    
637
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
638

    
639
   * chr_RM, the number of the chromosome for RM
640

    
641
   * lower_bound_RM, the lower bound of the nucleosome for RM
642

    
643
   * upper_bound_RM, the upper bound of the nucleosome  for RM
644

    
645
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
646

    
647
   * cur_index, index of the CUR
648

    
649
   * BY_H3K14ac_36, the number of reads for the current nucleosome
650
     for the sample 36
651

    
652
   * BY_H3K14ac_37, #reads for sample 37
653

    
654
   * BY_H3K14ac_53, #reads for sample 53
655

    
656
   * RM_H3K14ac_38, #reads for sample 38
657

    
658
   * RM_H3K14ac_39, #reads for sample 39
659

    
660
To execute this script, run the following command in your R console:
661

    
662
   source("src/current/count_reads.R")
663

    
664

    
665
The Script get_size_factors.R
666
-----------------------------
667

    
668
This script uses the DESeq function *estimateSizeFactors* to compute
669
the size factor of each sample. It corresponds to normalisation of
670
read counts from sample to sample, as determined by DESeq. When a
671
sample has n reads for a nucleosome or a UNR, the normalised count is
672
n/f where f is the factor contained in this file. The script dumps
673
computed size factors into the file *size_factors.tab*. This file has
674
the form:
675

    
676
+-----------+---------+---------+---------+
677
| sample_id | wp      | unr     | wpunr   |
678
+===========+=========+=========+=========+
679
| 1         | 0.87396 | 0.88097 | 0.87584 |
680
+-----------+---------+---------+---------+
681
| 2         | 1.07890 | 1.07440 | 1.07760 |
682
+-----------+---------+---------+---------+
683
| 3         | 1.06400 | 1.05890 | 1.06250 |
684
+-----------+---------+---------+---------+
685
| 4         | 0.85782 | 0.87948 | 0.86305 |
686
+-----------+---------+---------+---------+
687
| 5         | 0.97577 | 0.96590 | 0.97307 |
688
+-----------+---------+---------+---------+
689
| 6         | 1.19630 | 1.18120 | 1.19190 |
690
+-----------+---------+---------+---------+
691
| 36        | 0.93318 | 0.92762 | 0.93166 |
692
+-----------+---------+---------+---------+
693
| 37        | 0.48315 | 0.48453 | 0.48350 |
694
+-----------+---------+---------+---------+
695
| 38        | 1.11240 | 1.11210 | 1.11230 |
696
+-----------+---------+---------+---------+
697
| 39        | 0.89897 | 0.89917 | 0.89903 |
698
+-----------+---------+---------+---------+
699
| 53        | 2.22650 | 2.22700 | 2.22660 |
700
+-----------+---------+---------+---------+
701

    
702
sample_id are given in file samples.csv
703

    
704
If you don't know which column to use, we recommend using wpunr.
705

    
706
If you want the very detailed factors produced by DESeq, here are the
707
information:
708

    
709
   * unr: factor computed from data of UNR regions. These regions
710
     are defined for every pairs of aligned genomes (e.g. BY_RM)
711

    
712
   * wp: same, but for well-positioned nucleosomes.
713

    
714
   * wpunr: both types of regions.
715

    
716
To execute this script, run the following command in your R console:
717

    
718
   source("src/current/get_size_factors.R")
719

    
720

    
721
The Script launch_deseq.R
722
-------------------------
723

    
724
Finally, the script *launch_deseq.R* perform statistical analysis on
725
each nucleosome using *DESeq*. It produces files:
726

    
727
   * results/current/BY_RM_H3K14ac_wp_snep.tab
728

    
729
   * results/current/BY_RM_H3K14ac_unr_snep.tab
730

    
731
   * results/current/BY_RM_H3K14ac_wpunr_snep.tab
732

    
733
   * results/current/BY_RM_H3K14ac_wp_mnase.tab
734

    
735
   * results/current/BY_RM_H3K14ac_unr_mnase.tab
736

    
737
   * results/current/BY_RM_H3K14ac_wpunr_mnase.tab
738

    
739
These files are organised with the following columns (see file
740
*BY_RM_H3K14ac_wp_snep.tab* for an example):
741

    
742
   * chr_BY, the number of the chromosome for BY
743

    
744
   * lower_bound_BY, the lower bound of the nucleosome for BY
745

    
746
   * upper_bound_BY, the upper bound of the nucleosome  for BY
747

    
748
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
749

    
750
   * chr_RM, the number of the chromosome for RM
751

    
752
   * lower_bound_RM, the lower bound of the nucleosome for RM
753

    
754
   * upper_bound_RM, the upper bound of the nucleosome  for RM
755

    
756
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
757

    
758
   * cur_index, index of the CUR
759

    
760
   * form
761

    
762
   * BY_Mnase_Seq_1, the number of reads for the current nucleosome
763
     for the sample 1
764

    
765
Next columns concern indicators for each sample:
766

    
767
   * BY_Mnase_Seq_2, #reads for sample 2
768

    
769
   * BY_Mnase_Seq_3, #reads for sample 3
770

    
771
   * RM_Mnase_Seq_4, #reads for sample 4
772

    
773
   * RM_Mnase_Seq_5, #reads for sample 5
774

    
775
   * RM_Mnase_Seq_6, #reads for sample 6
776

    
777
   * BY_H3K14ac_36, #reads for sample 36
778

    
779
   * BY_H3K14ac_37, #reads for sample 37
780

    
781
   * BY_H3K14ac_53, #reads for sample 53
782

    
783
   * RM_H3K14ac_38, #reads for sample 38
784

    
785
   * RM_H3K14ac_39, #reads for sample 39
786

    
787
The 5 last columns concern DESeq analysis:
788

    
789
   * manip[a_manip] strain[a_strain]
790
     manip[a_strain]:strain[a_strain], the manip (marker) effect, the
791
     strain effect and the snep effect. These are the coefficients of
792
     the fitted generalized linear model.
793

    
794
   * pvalsGLM, the pvalue resulting of the comparison of the GLM
795
     model considering or not the interaction term marker:strain. This
796
     is the statsitcial significance of the interaction term and
797
     therefore the statistical significance of the SNEP.
798

    
799
   * snep_index, a boolean set to TRUE if the pvalueGLM value is
800
     under the threshold computed with FDR function with a rate set to
801
     0.0001.
802

    
803
To execute this script, run the following command
804

    
805
To execute this script, run the following command in your R console:
806

    
807
   source("src/current/launch_deseq.R")
808

    
809

    
810
Results: Number of SNEPs
811
========================
812

    
813
Here are the number of computed SNEPs for each forms.
814

    
815
+-------+---------+-------+---------+
816
| form  | strains | #nucs | H3K14ac |
817
+=======+=========+=======+=========+
818
| wp    | BY-RM   | 30464 | 3549    |
819
+-------+---------+-------+---------+
820
| unr   | BY-RM   | 9497  | 1559    |
821
+-------+---------+-------+---------+
822
| wpunr | BY-RM   | 39961 | 5240    |
823
+-------+---------+-------+---------+
824

    
825

    
826
APPENDICE: Generate .c2c Files
827
==============================
828

    
829
$$$ TODO make it works properly. working directory.
830

    
831
The *.c2c* files is a simple table that describes how the genome
832
sequence can be aligned. We generate it using NucleoMiner 1.0.
833

    
834
To install NucleoMiner 1.0 on your UNIX/LINUX computer you need first
835
to install the Genetic Data analysis Library (GDL), which is a dynamic
836
library of useful C functions derived from the GNU Scientific Library.
837

    
838

    
839
Installing the GDL library
840
--------------------------
841

    
842
Get the gdl-1.0.tar.gz archive on your computer (in the directory deps
843
of your working directory). Copy it in a dedicated directory. Go into
844
this directory using the cd command, and then unfold the archive by
845
typing:
846

    
847
tar -xvzf gdl-1.0.tar.gz
848

    
849
This creates a directory called gdl-1.0. You now need to go into this
850
directory and compile the library, by typing:
851

    
852
   mkdir tmp_c2c_workdir
853
   cd tmp_c2c_workdir
854
   cp ../deps/gdl-1.0.tar.gz .
855
   tar -xvzf gdl-1.0.tar.gz
856
   cd gdl-1.0
857
   ./configure
858
   make
859

    
860
   cd ..
861

    
862
Now you need to install the library on your system. This needs root
863
priviledges:
864

    
865
   sudo make install
866

    
867

    
868
Installing NucleoMiner 1.0
869
--------------------------
870

    
871
Get the nucleominer-1.0.tar.gz archive on your computer. Copy it in a
872
dedicated directory. Go into this directory using the cd command, and
873
then unfold the archive by typing:
874

    
875
This creates a directory called nucleominer-1.0. You now need to go
876
into this directory and compile the library, by typing:
877

    
878
   cp ../deps/nucleominer-1.0.tar.gz .
879
   tar -xvzf nucleominer-1.0.tar.gz
880
   cd nucleominer-1.0
881
   ln -s ../gdl-1.0/gdl
882
   ./configure
883
   make
884

    
885
You can then use the binaries dircetly from this folder (best then is
886
to add the path to this folder in your PATH environment variable). If
887
you want to install nucleominer at the system's level (useful if
888
mutiple users will need it) then type, with root priviledges:
889

    
890
   sudo make install
891

    
892

    
893
Generate .c2c Files
894
-------------------
895

    
896
To generate .c2c files you need to type the following command in a
897
terminal:
898

    
899
   mkdir dir_4_c2c
900
   NMgxcomp ../data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta\
901
            ../data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta\
902
            dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log
903

    
904
After execution, the directory *dir_4_c2c* will hold the .c2c files.