Statistiques
| Branche: | Révision :

root / doc / sphinx_doc / tuto.rst @ 3961deb6

Historique | Voir | Annoter | Télécharger (25,93 ko)

1
Tutorial
2
========
3

    
4
This tutorial describes steps allowing to perform quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory.
5

    
6
This tutorial is divided into two main parts. The first part covers the python script `wf.py` that aligns and converts short sequence reads. The second part covers the R scripts that extracts nucleosome-level information (nucleosome position and indicators) from the dataset.
7

    
8

    
9

    
10

    
11
Experimental Dataset, Working Directory and Configuration File 
12
--------------------------------------------------------------
13

    
14
Working Directory Organisation
15
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16

    
17
After having installed NucleoMiner2 environment (Previous section), go to the root working directory of the tutorial by typing the following command in a terminal:
18

    
19
.. code:: bash
20

    
21
  cd doc/Chuffart_NM2_workdir/
22

    
23

    
24
Retrieving Experimental Dataset
25
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
26

    
27
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.
28

    
29
$$$ TODO explain how organise Experimental Dataset into the `data` directory of the working directory.
30

    
31

    
32
In this tutorial, we want to compare nucleosomes of 2 yeast strains: BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq using an antibody recognizing the H3K14ac epigenetic mark. Illumina sequencing was done in single-read of 50 bp long.
33

    
34
The dataset is composed of 55 files organised as follows: 
35

    
36
  - 3 replicates for BY MNase Seq
37
  
38
    - sample 1 (5 fastq.gz files)
39
    - sample 2 (5 fastq.gz files)
40
    - sample 3 (4 fastq.gz files)
41
    
42
  - 3 replicates for RM MNase Seq
43
  
44
    - sample 4 (4 fastq.gz files)
45
    - sample 5 (4 fastq.gz files)
46
    - sample 6 (5 fastq.gz files)
47
    
48
  - 3 replicates for BY ChIP Seq H3K14ac
49
  
50
    - sample 36 (5 fastq.gz files)
51
    - sample 37 (5 fastq.gz files)
52
    - sample 53 (9 fastq.gz files)
53
    
54
  - 2 replicates for RM ChIP Seq H3K14ac
55
  
56
    - sample 38 (5 fastq.gz files)
57
    - sample 39 (4 fastq.gz files)
58
    
59

    
60

    
61

    
62
Python and R Common Configuration File
63
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
64

    
65
First, we need to define useful configuration variables that will be passed to python and R scripts. These variables are contained in file `configurator.py`. The execution of this python script dumps variables into the `nucleominer_config.json` file that will then be used by both R and python scripts.
66

    
67
The initialization of this variables is done in the configurator.py file. If you need to adapt variable values (path, default parameters...) you need to edit this file. Then, go to the root directory of your project and run the following command to dump the configuration file:
68

    
69
.. code:: bash
70

    
71
  python src/current/configurator.py
72
  
73

    
74

    
75

    
76

    
77
Preprocessing Illumina Fastq Reads for Each Sample
78
--------------------------------------------------
79

    
80
Once variables and design have been specified, the script wf.py will automatically run all the analysis. You don't need to do anything. 
81
To run the full analysis, run the following command:
82

    
83
.. code:: bash
84

    
85
  python src/current/wf.py
86

    
87
The details of the steps performed by this script are explained below.
88
This preprocessing consists of 4 steps embedded in the `wf.py` script. They are described bellow.  As a preamble, this script computes `samples`, `samples_mnase` and `strains` that will be used along the 4 steps.
89

    
90

    
91
.. autodata:: wf.samples
92
    :noindex: 
93

    
94
.. autodata:: wf.samples_mnase
95
    :noindex: 
96

    
97
.. autodata:: wf.strains
98
    :noindex: 
99

    
100

    
101
Creating Bowtie Index from each Reference Genome
102
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
103

    
104
For each strain, the script *wf.py* then creates bowtie index. Bowtie index of a strain is a tree view of the genome of this strain. It will be used by bowtie to align reads. The part of the script performing this is the following:
105

    
106
.. literalinclude:: ../../../snep/src/current/wf.py
107
   :start-after: # _STARTOF_ step_1
108
   :end-before: # _ENDOF_ step_1
109
   :language: python
110

    
111
As an indication, the following table summarizes the file sizes and process durations that we experienced when running this step on a Linux server***.
112

    
113
======  ======================  ======================  ================
114
strain  fasta genome file size  bowtie index file size  process duration
115
======  ======================  ======================  ================
116
BY      12 Mo                          25 Mo                    11 s.
117
RM      12 Mo                          24 Mo                    9 s.
118
======  ======================  ======================  ================
119

    
120

    
121

    
122

    
123
Aligning Reads to Reference Genome 
124
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
125

    
126
Next, the *wf.py* script launches bowtie to align reads to the reference genome. It produces a `.sam` file that is converted into a `.bed` file. Binaries for `bowtie`, `samtools` and `bedtools` are wrapped using python `subprocess` class. This step is performed by the following part of the script:
127

    
128
.. literalinclude:: ../../../snep/src/current/wf.py
129
   :start-after: # _STARTOF_ step_2
130
   :end-before: # _ENDOF_ step_2
131
   :language: python
132

    
133
Convert Aligned Reads into TemplateFilter Format
134
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
135

    
136
TemplateFilter uses particular input formats for reads, so it is necessary to convert the `.bed` files. TemplateFilter expect reads in the following format: `chr`, `coord`, `strand` and `#read` where:
137

    
138
- `chr` is the number of the chromosome;
139
- `coord` is the coordinate of the reads;
140
- `strand` is `F` for forward and `R` for reverse;
141
- `#reads` the number of reads covering this position.
142

    
143
Each entry is *tab*-separated.
144

    
145
**WARNING** for reverse strands, bowtie returns the position of the first nucleotide on the left hand side, whereas TemplateFilter expects the first one on the right hand side.  This is taken into account in NucleoMiner2 by adding the read length (in our case 50) to the reverse reads coordinates.
146

    
147
This step is performed by the following part of the *wf.py* script:
148

    
149
.. literalinclude:: ../../../snep/src/current/wf.py
150
   :start-after: # _STARTOF_ step_3
151
   :end-before: # _ENDOF_ step_3
152
   :language: python
153

    
154
The following table summarizes the number of reads, the involved file sizes and process durations that we experienced when running the two last steps. In our case, alignment process were multithreaded over 3 cores.
155

    
156
==  ==============  =========================  ======  ================  ==================  ================  
157
id  Illumina reads  aligned and filtred reads  ratio   `.bed` file size  TF input file size  process duration
158
==  ==============  =========================  ======  ================  ==================  ================
159
1   16436138        10199695                   62,06%  1064 Mo           60  Mo              383   s.
160
2   16911132        12512727                   73,99%  1298 Mo           64  Mo              437   s.
161
3   15946902        12340426                   77,38%  1280 Mo           65  Mo              423   s.
162
4   13765584        10381903                   75,42%  931  Mo           59  Mo              352   s.
163
5   15168268        11502855                   75,83%  1031 Mo           64  Mo              386   s.
164
6   18850820        14024905                   74,40%  1254 Mo           69  Mo              482   s.
165
36  17715118        14092985                   79,55%  1404 Mo           68  Mo              483   s.
166
37  17288466        7402082                    42,82%  741  Mo           48  Mo              339   s.
167
38  16116394        13178457                   81,77%  1101 Mo           63  Mo              420   s.
168
39  14241106        10537228                   73,99%  880  Mo           57  Mo              348   s.
169
53  40876476        33780065                   82,64%  3316 Mo           103 Mo              1165  s.
170
==  ==============  =========================  ======  ================  ==================  ================  
171

    
172
Run TemplateFilter on Mnase Samples
173
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
174

    
175
Finally, for each sample we perform TemplateFilter analysis. 
176

    
177
**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is 
178
defined by its center and its width. An odd width leads us to consider non-
179
integer lower and upper bound.
180

    
181
**WARNING** TemplateFilter was not designed to handle replicates. So we recommend to keep a maximum of nucleosomes and filter the aberrant ones afterwards using the benefits of having replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high value of overlap (300%).
182

    
183
This step is performed by the following part of the `wf.py` script:
184

    
185
.. literalinclude:: ../../../snep/src/current/wf.py
186
   :start-after: # _STARTOF_ step_4
187
   :end-before: # _ENDOF_ step_4
188
   :language: python
189

    
190
==  ======  ==========  =============  ================
191
id  strain  found nucs  nuc file size  process duration
192
==  ======  ==========  =============  ================
193
1    BY     96214       68 Mo          1022 s.                     
194
2    BY     91694       65 Mo          1038 s.                      
195
3    BY     91205       65 Mo          1036 s.                       
196
4    RM     88076       62 Mo          984 s.                      
197
5    RM     90141       64 Mo          967 s.                      
198
6    RM     87517       62 Mo          980 s.                      
199
==  ======  ==========  =============  ================
200

    
201

    
202

    
203

    
204

    
205

    
206

    
207

    
208

    
209

    
210

    
211

    
212

    
213

    
214

    
215

    
216

    
217

    
218

    
219

    
220

    
221

    
222

    
223

    
224

    
225

    
226

    
227

    
228

    
229

    
230

    
231

    
232

    
233

    
234

    
235

    
236

    
237

    
238

    
239

    
240

    
241

    
242

    
243

    
244

    
245

    
246

    
247

    
248

    
249

    
250

    
251

    
252

    
253

    
254
..
255
..
256
.. - libcoverage.py
257
.. - wf.py
258
..
259
..
260
..
261
..
262
..
263
..
264
.. In order to simplify the design of experiment, we consider Mnase as a marker.
265
.. For each couple `(strain, marker)` we perform 3 replicates. So, theoritically
266
.. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2
267
.. replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a
268
.. uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information.
269
..
270
.. .. autodata:: configurator.CSV_SAMPLE_FILE
271
..     :noindex:
272
..
273
.. We use a convention to link sample and Illumina fastq outputs. Illumina output
274
.. files of the sample `ID` will be stored in the directory
275
.. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be
276
.. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
277
..
278
.. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
279
..     :noindex:
280
..
281
.. For BY (resp. RM and YJM) we use following reference genome
282
.. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
283
.. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and
284
.. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
285
.. The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
286
..
287
.. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
288
..     :noindex:
289
..
290
.. Each chromosome/contig is identify in the fasta file by an obscure identifier.
291
.. For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when
292
.. TemplateFilter is waiting for an integer. So, we translate it. The index
293
.. `FASTA_INDEXES` stores this translation.
294
..
295
.. .. autodata:: configurator.FASTA_INDEXES
296
..     :noindex:
297
..
298
.. From a pragamatical point of view we discard some part of the genome (repeated
299
.. sequence etc...). The list of the black listed area is explicitely detailled in
300
.. `AREA_BLACK_LIST`.
301
..
302
.. .. autodata:: configurator.AREA_BLACK_LIST
303
..     :noindex:
304
..
305
.. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously
306
.. compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp.
307
.. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about
308
.. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old
309
.. version of `NucleoMiner2`
310
.. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
311
..
312
.. .. autodata:: configurator.C2C_FILES
313
..     :noindex:
314
..
315
.. `nucleominer` uses specific directory to work in, these are described in
316
.. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
317
..
318
.. Finally, `nucleominer` use external ressources, the path to these resspources
319
.. are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`,
320
.. `BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`.
321
..
322
.. All paths, prefixes and indexes could be change in the
323
.. `src/current/nucleominer_config.json` file.
324
..
325
.. .. autodata:: wf.json_conf_file
326
..     :noindex:
327
..
328

    
329

    
330

    
331

    
332

    
333

    
334

    
335

    
336

    
337

    
338

    
339

    
340

    
341

    
342

    
343

    
344

    
345

    
346

    
347

    
348

    
349

    
350

    
351

    
352

    
353

    
354

    
355

    
356

    
357

    
358

    
359

    
360

    
361

    
362

    
363

    
364

    
365

    
366

    
367

    
368

    
369

    
370
Inferring Nucleosome Position and Extracting Read Counts
371
--------------------------------------------------------
372

    
373

    
374

    
375
The second part of the tutorial uses R (http://http://www.r-project.org). NucleoMiner2 contains a set of R scripts that will be sourced in R from a console launched at the root of your project. These scripts are:
376

    
377
  - headers.R
378
  - extract_maps.R
379
  - translate_common_wp.R
380
  - split_samples.R
381
  - count_reads.R
382
  - get_size_factors  
383
  - launch_deseq.R
384

    
385
The Script headers.R
386
^^^^^^^^^^^^^^^^^^^^
387

    
388
The script headers.R is included in all other R scripts. It is in charge of: 
389

    
390
  - launching libraries used in the scripts
391
  - launching configuration (design, strain, marker...)
392
  - computing and caching Common Uinterrupted Regions (CURs). Caching means storing the information in the computer's memory.
393

    
394
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks. 
395

    
396
  -  You may want to analyze data of a single strain (e.g. treatment/control, or only few mutations). In this case, the genome is identical across all samples and you do not need to define particular CURs (CURs are chromosomes). Simply use the default translate function which is neutral.
397

    
398
  - If you are analyzing data from two or more strains (as NucleoMiner2 was designed for), then you need to translate coordinates of one genome into the coordinates of another one. You must do this by aligning the two genomes, which will produce a .c2c file (see Appendice "Generate .c2c Files").  thenuse it to produce the list of regions and customise “translate”.
399

    
400
In our tutorial, we are in the second case and to perform all these steps run the following command line in your R console:
401

    
402
.. code:: bash
403

    
404
  source("src/current/headers.R")
405

    
406

    
407
The Script extract_maps.R
408
^^^^^^^^^^^^^^^^^^^^^^^^^
409
This script is in charge of extracting Maps for well-positioned and sensitive nucleosomes. First of all, this script computes intra and inter-strain matches of nucleosome maps for each CUR. This step can be executed in parallel on many cores using the BoT library. Next, it collects results and produces maps of  well-positioned nucleosomes, sensitive nucleosomes and Unaligned Nucleosomal Regions .
410

    
411
The map of well-positioned nucleosomes for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of following columns:
412

    
413
 - chr, the number of the chromosome 
414
 - lower_bound, the lower bound of the nucleosome
415
 - upper_bound, the upper bound of the nucleosome 
416
 - cur_index, index of the CUR
417
 - index_nuc, the index of the nucleosome in the CUR
418
 - wp, 1 if it is a well positioned nucleosome, 0 otherwise
419
 - nb_reads, the number of reads that support this nucleosome
420
 - nb_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates in which it is a well-positioned nucleosome)
421
 - llr_1, for a well-positioned nucleosome, it is the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain.
422
 - llr_2, for a well-positioned nucleosome, it is the LLR1 between the second and the third TemplateFilter nucleosome on the chain.
423
 - wp_llr, for a well-positioned nucleosome, it is the LLR2 that compares consistency of the positioning over all TemplateFilter nucleosomes.
424
 - wp_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained from LLR2 (`1-pchisq(2.LLR2, df=4)`)
425
 - dyad_shift, for a well-positioned nucleosome, it is the shift between the two extreme TemplateFilter nucleosome dyad positions. 
426

    
427
The sensitive map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of following columns:
428

    
429
 - chr, the number of the chromosome 
430
 - lower_bound, the lower bound of the nucleosome
431
 - upper_bound, the upper bound of the nucleosome 
432
 - cur_index, index of the CUR
433

    
434
The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of following columns:
435

    
436
 - cur_index, the index of the CUR
437
 - index_nuc_BY, the index of the BY nucleosome in the CUR
438
 - index_nuc_RM, the index of the RM nucleosome in the CUR
439
 - llr_score, , the LLR3 score that estimates conservation between the positions in BY and RM 
440
 - common_wp_pval,  the p-value chi square test obtained from LLR3 (`1-pchisq(2.LLR3, df=2)`)
441
 - diff, the dyads shift between the positions in the two strains (in bp)
442

    
443
The common UNR map for BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:
444

    
445
 - cur_index, the index of the CUR
446
 - index_nuc_BY, the index of the BY nucleosome in the CUR
447
 - index_nuc_RM,the index of the RM nucleosome in the CUR
448

    
449
To execute this script, run the following command in your R console:
450

    
451
.. code:: bash
452

    
453
  source("src/current/extract_maps.R")
454

    
455

    
456
The Script translate_common_wp.R
457
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
458

    
459
This script is used to translate common well-positioned nucleosome positions from a strain to another strain and stores it into a table. 
460

    
461
For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosomes translated into the BY genome coordinates. It is composed of following columns:
462

    
463
 - strain_ref, the reference genome (in which positioned are defined)
464
 - begin, the translated lower bound of the nucleosome
465
 - end, the translated upper bound of the nucleosome
466
 - chr, the number of chromosomes for the reference genome (in which positioned are defined)
467
 - length, the length of the nucleosome (could be negative)
468
 - cur_index, the index of the CUR
469
 - index_nuc, the index of the nucleosome in the CUR
470

    
471
To execute this script, run the following command in your R console:
472

    
473
.. code:: bash
474

    
475
  source("src/current/translate_common_wp.R")
476

    
477

    
478
The Script split_samples.R
479
^^^^^^^^^^^^^^^^^^^^^^^^^^
480

    
481
To optimize memory space usage, we split and compress TemplateFilter input files according to their corresponding  chromosome. for example, `sample_1_TF.tab` will be split into :
482

    
483
  - sample_1_chr_1_splited_sample.tab.gz
484
  - sample_1_chr_2_splited_sample.tab.gz
485
  - ...
486
  - sample_1_chr_17_splited_sample.tab.gz
487
  
488

    
489
To execute this script, run the following command in your R console:
490

    
491
.. code:: bash
492

    
493
  source("src/current/split_samples.R")
494

    
495

    
496
The Script count_reads.R
497
^^^^^^^^^^^^^^^^^^^^^^^^
498

    
499
To associate a number of observations (read) to each nucleosome we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab` `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab`  
500
for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned nucleosomes and Mnase UNRs respectively. 
501

    
502
For example, the file `BY_RM_H3K14ac_unr_and_nbreads.tab` contains counted reads for well-positioned nucleosomes with the experimental condition ChIP H3K14ac. It is composed of the following columns:
503

    
504
  - chr_BY, the number of the chromosome for BY
505
  - lower_bound_BY, the lower bound of the nucleosome for BY
506
  - upper_bound_BY, the upper bound of the nucleosome  for BY
507
  - index_nuc_BY, the index of the BY nucleosome in the CUR for BY
508
  - chr_RM, the number of the chromosome for RM
509
  - lower_bound_RM, the lower bound of the nucleosome for RM
510
  - upper_bound_RM, the upper bound of the nucleosome  for RM
511
  - index_nuc_RM,the index of the RM nucleosome in the CUR for RM
512
  - cur_index, index of the CUR
513
  - BY_H3K14ac_36, the number of reads for the current nucleosome for the sample 36
514
  - BY_H3K14ac_37, #reads for sample 37
515
  - BY_H3K14ac_53, #reads for sample 53
516
  - RM_H3K14ac_38, #reads for sample 38
517
  - RM_H3K14ac_39, #reads for sample 39
518

    
519
To execute this script, run the following command in your R console:
520

    
521
.. code:: bash
522

    
523
  source("src/current/count_reads.R")
524

    
525

    
526
The Script get_size_factors.R
527
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
528

    
529

    
530
This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. It corresponds to normalisation of read counts from sample to sample, as determined by DESeq. When a sample has n reads for a nucleosome or a UNR,
531
the normalised count is n/f where f is the factor contained in this file.
532
The script dumps computed size factors into the file `size_factors.tab`. This file has the form:
533

    
534
========= ======= ======= =======
535
sample_id      wp     unr   wpunr
536
========= ======= ======= =======
537
        1 0.87396 0.88097 0.87584
538
        2 1.07890 1.07440 1.07760
539
        3 1.06400 1.05890 1.06250
540
        4 0.85782 0.87948 0.86305
541
        5 0.97577 0.96590 0.97307
542
        6 1.19630 1.18120 1.19190
543
       36 0.93318 0.92762 0.93166
544
       37 0.48315 0.48453 0.48350
545
       38 1.11240 1.11210 1.11230
546
       39 0.89897 0.89917 0.89903
547
       53 2.22650 2.22700 2.22660
548
========= ======= ======= =======
549

    
550
sample_id are given in file samples.csv
551

    
552
If you don't know which column to use for normalization, we recommend using wpunr.
553

    
554
Here are the details of the factors produced:
555

    
556
  - unr: factor computed from data of UNR regions. These regions are defined for every pairs of aligned genomes (e.g. BY_RM)
557
  - wp: same, but for well-positioned nucleosomes.
558
  - wpunr: both types of regions.
559

    
560
To execute this script, run the following command in your R console:
561

    
562
.. code:: bash
563

    
564
  source("src/current/get_size_factors.R")
565

    
566

    
567
The Script launch_deseq.R
568
^^^^^^^^^^^^^^^^^^^^^^^^^
569

    
570
Finally, the script `launch_deseq.R` perform statistical analysis on each nucleosome using `DESeq`. It produces files:
571
 
572
  - results/current/BY_RM_H3K14ac_wp_snep.tab
573
  - results/current/BY_RM_H3K14ac_unr_snep.tab
574
  - results/current/BY_RM_H3K14ac_wpunr_snep.tab
575
  - results/current/BY_RM_H3K14ac_wp_mnase.tab
576
  - results/current/BY_RM_H3K14ac_unr_mnase.tab
577
  - results/current/BY_RM_H3K14ac_wpunr_mnase.tab
578

    
579
These files are organised with the following columns (see file `BY_RM_H3K14ac_wp_snep.tab` for an example):
580

    
581
  - chr_BY, the number of the chromosome for BY
582
  - lower_bound_BY, the lower bound of the nucleosome for BY
583
  - upper_bound_BY, the upper bound of the nucleosome  for BY
584
  - index_nuc_BY, the index of the BY nucleosome in the CUR for BY
585
  - chr_RM, the number of the chromosome for RM
586
  - lower_bound_RM, the lower bound of the nucleosome for RM
587
  - upper_bound_RM, the upper bound of the nucleosome  for RM
588
  - index_nuc_RM,the index of the RM nucleosome in the CUR for RM
589
  - cur_index, index of the CUR
590
  - form 
591
  - BY_Mnase_Seq_1, the number of reads for the current nucleosome for the sample 1  
592

    
593
Next columns concern indicators for each sample:
594

    
595
  - BY_Mnase_Seq_2, #reads for sample 2
596
  - BY_Mnase_Seq_3, #reads for sample 3  
597
  - RM_Mnase_Seq_4, #reads for sample 4  
598
  - RM_Mnase_Seq_5, #reads for sample 5
599
  - RM_Mnase_Seq_6, #reads for sample 6
600
  - BY_H3K14ac_36, #reads for sample 36
601
  - BY_H3K14ac_37, #reads for sample 37
602
  - BY_H3K14ac_53, #reads for sample 53
603
  - RM_H3K14ac_38, #reads for sample 38
604
  - RM_H3K14ac_39, #reads for sample 39
605
  
606
The 5 last columns concern DESeq analysis:
607

    
608
  - manip[a_manip] strain[a_strain] manip[a_strain]:strain[a_strain], the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model.
609
  - pvalsGLM, the pvalue resulting from the comparison of the GLM model considering the interaction term *marker:strain* to the GLM model that does not consider it. This is the statsitcial significance of the interaction term and therefore the statistical significance of the SNEP.
610
  - snep_index, a boolean set to TRUE if the pvalueGLM value is under the threshold computed with FDR function with a rate set to 0.0001.
611

    
612
To execute this script, run the following command in your R console:
613

    
614
.. code:: bash
615

    
616
  source("src/current/launch_deseq.R")
617

    
618

    
619
Results: Number of SNEPs
620
------------------------
621

    
622
Here are the number of computed SNEPs for each forms.
623

    
624
===== ======= ===== =======
625
 form strains #nucs H3K14ac
626
===== ======= ===== =======
627
   wp   BY-RM 30464    3549
628
  unr   BY-RM  9497    1559
629
wpunr   BY-RM 39961    5240
630
===== ======= ===== =======
631
  
632

    
633

    
634

    
635

    
636

    
637
APPENDICE: Generate .c2c Files
638
------------------------------
639

    
640
The `.c2c` files is a simple table that describes how two genome
641
sequences are aligned. This file can be generated by using scripts that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics 2010) and which we provide in this release of NucleoMiner2.
642

    
643

    
644
To use these scripts on your UNIX/LINUX computer you need first to install MUMmer which is designed to rapidly align entire genomes, whether in complete or draft form.
645

    
646
Installing MUMmer
647
^^^^^^^^^^^^^^^^^
648

    
649
Get the last version of MUMmer archive on your computer (MUMmer3.23.tar.gz is provided in the directory deps of your working directory). Copy it in a dedicated directory. Install it locally into the src folder of you working directory by typing (working directory):
650

    
651
tar -xvzf MUMmer3.23.tar.gz
652

    
653

    
654
.. code:: bash
655

    
656
  cd src 
657
  tar xfvz ../deps/MUMmer3.23.tar.gz
658
  cd MUMmer3.23
659
  make check
660
  make install
661

    
662
Installing NucleoMiner 1.0 scripts
663
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
664

    
665
Get the nucleominer-1.0.tar.gz archive on your computer (this archive is provided in the directory deps of your working directory). Install it locally into the src folder of you working directory by typing (working directory):
666

    
667

    
668
.. code:: bash
669

    
670
  cd src 
671
  tar xfvz ../deps/nucleominer-1.0.tar.gz 
672
  cd ..
673

    
674
This creates a directory that contains  NucleoMiner 1.0 scripts (src/nucleominer-1.0/scripts). 
675

    
676

    
677
Generate .c2c Files
678
^^^^^^^^^^^^^^^^^^^
679

    
680
To generate .c2c files you need to type the following command in a terminal:
681

    
682
.. code:: bash
683

    
684
  export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts
685
  export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/
686
  NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
687
    data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
688
    data/byxrm 2>NMgxcomp.log
689
            
690
After execution, the directory `data` will hold the .c2c files.