Tutorial
========

This tutorial describes the steps required to perform a quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project's root directory.

This tutorial is divided into two main parts. The first part covers the Python script `wf.py`, which aligns and converts short sequence reads. The second part covers the R scripts, which extract information (nucleosome positions and indicators) from the dataset.

Experimental Dataset, Working Directory and Configuration File
--------------------------------------------------------------

Working Directory Organisation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After having installed the NucleoMiner2 environment (see the previous section), go to the root working directory of the tutorial by typing the following command in a terminal:

.. code:: bash

  cd doc/Chuffart_NM2_workdir/

Retrieving the Experimental Dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.

$$$ TODO explain how to organise the experimental dataset into the `data` directory of the working directory.

We want to compare the nucleosomes of 2 yeast strains: BY and RM. For each strain we performed MNase-seq and ChIP-seq using an antibody recognizing the H3K14ac epigenetic mark.

The dataset is composed of 55 files organised as follows:

  - 3 replicates for BY MNase-seq

    - sample 1 (5 fastq.gz files)
    - sample 2 (5 fastq.gz files)
    - sample 3 (4 fastq.gz files)

  - 3 replicates for RM MNase-seq

    - sample 4 (4 fastq.gz files)
    - sample 5 (4 fastq.gz files)
    - sample 6 (5 fastq.gz files)

  - 3 replicates for BY ChIP-seq H3K14ac

    - sample 36 (5 fastq.gz files)
    - sample 37 (5 fastq.gz files)
    - sample 53 (9 fastq.gz files)

  - 2 replicates for RM ChIP-seq H3K14ac

    - sample 38 (5 fastq.gz files)
    - sample 39 (4 fastq.gz files)

Python and R Common Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First of all, we define in one place some configuration variables that will be used by both the Python and R scripts. These variables are contained in the file `configurator.py`. Executing this Python script dumps the variables into the `nucleominer_config.json` file, which is then read by both the R and Python scripts.

To do this, go to the root directory of your project and run the following command:

.. code:: bash

  python src/current/configurator.py
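
As a quick check that the configuration was dumped correctly, you can inspect the JSON file from Python. This is only a small illustration; the available keys are defined by `configurator.py`:

.. code:: python

  import json

  # Path of the configuration dumped by configurator.py.
  with open("src/current/nucleominer_config.json") as f:
      config = json.load(f)

  # List the configuration variables that were dumped.
  print(sorted(config.keys()))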


Preprocessing Illumina Fastq Reads for Each Sample
--------------------------------------------------

The preprocessing consists of 4 main steps, all embedded in the `wf.py` script; they are described below. As a preamble, the script computes `samples`, `samples_mnase` and `strains`, which are used throughout the 4 steps.

.. autodata:: wf.samples
    :noindex:

.. autodata:: wf.samples_mnase
    :noindex:

.. autodata:: wf.strains
    :noindex:

Creating a Bowtie Index from Each Reference Genome
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each strain, we need to create a bowtie index: an indexed representation of the strain's genome that bowtie uses to align reads. This step is performed by the following part of the `wf.py` script:

.. literalinclude:: ../../../snep/src/current/wf.py
   :start-after: # _STARTOF_ step_1
   :end-before: # _ENDOF_ step_1
   :language: python
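
If you want to reproduce this step outside `wf.py`, it essentially boils down to a `bowtie-build` call wrapped with `subprocess`. The sketch below is only an illustration: the fasta path and the index prefix are hypothetical, and the real values come from the configuration file:

.. code:: python

  import subprocess

  # Hypothetical paths; the actual ones are taken from nucleominer_config.json.
  fasta = "data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta"
  index_prefix = "index/BY"

  # bowtie-build <reference fasta> <index prefix>
  subprocess.check_call(["bowtie-build", fasta, index_prefix])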

The following table summarises the file sizes and process durations for this step.

======  ======================  ======================  ================
strain  fasta genome file size  bowtie index file size  process duration
======  ======================  ======================  ================
BY      12 MB                   25 MB                   11 s
RM      12 MB                   24 MB                   9 s
======  ======================  ======================  ================

Aligning Reads to the Reference Genome
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, we launch bowtie to align the reads to the reference genome. It produces a `.sam` file that we convert into a `.bed` file. The `bowtie`, `samtools` and `bedtools` binaries are wrapped using the Python `subprocess` module. This step is performed by the following part of the `wf.py` script:

.. literalinclude:: ../../../snep/src/current/wf.py
   :start-after: # _STARTOF_ step_2
   :end-before: # _ENDOF_ step_2
   :language: python
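
As a rough illustration of this chain of external calls (hypothetical file names, default options only; the actual options are set in `wf.py`):

.. code:: python

  import subprocess

  index_prefix = "index/BY"   # bowtie index built at the previous step
  fastq = "sample_1.fastq"    # one read file
  sam, bam, bed = "sample_1.sam", "sample_1.bam", "sample_1.bed"

  # 1) align the reads; -S asks bowtie for SAM output
  subprocess.check_call(["bowtie", "-S", index_prefix, fastq, sam])

  # 2) convert the SAM file to BAM with samtools
  with open(bam, "wb") as f:
      subprocess.check_call(["samtools", "view", "-bS", sam], stdout=f)

  # 3) convert the BAM file to BED with bedtools
  with open(bed, "w") as f:
      subprocess.check_call(["bedtools", "bamtobed", "-i", bam], stdout=f)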

Converting Aligned Reads into the TemplateFilter Format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TemplateFilter uses a particular input format for reads, so it is necessary to convert the `.bed` files. TemplateFilter expects reads described by the fields `chr`, `coord`, `strand` and `#reads`, where:

- `chr` is the number of the chromosome;
- `coord` is the coordinate of the read;
- `strand` is `F` for forward and `R` for reverse;
- `#reads` is the number of reads covering this position.

Fields are *tab*-separated.

**WARNING** For reverse strands, bowtie returns the position of the first nucleotide on the left-hand side, whereas TemplateFilter expects the first one on the right-hand side. This step takes this into account by adding the read length (in our case 50) to the coordinates of reverse reads.
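
To make this coordinate convention concrete, here is a minimal sketch (not the actual `wf.py` code) of the shift applied to a single read, assuming 50 bp reads:

.. code:: python

  READ_LENGTH = 50  # read length used in this tutorial

  def tf_coordinate(leftmost_position, strand):
      """Coordinate expected by TemplateFilter for one aligned read.

      bowtie reports the left-hand side of the read; for reverse reads
      TemplateFilter expects the right-hand side, hence the shift.
      """
      if strand == "R":
          return leftmost_position + READ_LENGTH
      return leftmost_position

  # a forward read at 1000 stays at 1000, a reverse read at 1000 becomes 1050
  print(tf_coordinate(1000, "F"), tf_coordinate(1000, "R"))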

This step is performed by the following part of the `wf.py` script:

.. literalinclude:: ../../../snep/src/current/wf.py
   :start-after: # _STARTOF_ step_3
   :end-before: # _ENDOF_ step_3
   :language: python

The following table summarises the number of reads, the file sizes involved and the process durations for the last two steps. In our case, the alignment process was multithreaded over 3 cores.

==  ==============  ==========================  ======  ================  ==================  ================
id  Illumina reads  aligned and filtered reads  ratio   `.bed` file size  TF input file size  process duration
==  ==============  ==========================  ======  ================  ==================  ================
1   16436138        10199695                    62.06%  1064 MB           60 MB               383 s
2   16911132        12512727                    73.99%  1298 MB           64 MB               437 s
3   15946902        12340426                    77.38%  1280 MB           65 MB               423 s
4   13765584        10381903                    75.42%  931 MB            59 MB               352 s
5   15168268        11502855                    75.83%  1031 MB           64 MB               386 s
6   18850820        14024905                    74.40%  1254 MB           69 MB               482 s
36  17715118        14092985                    79.55%  1404 MB           68 MB               483 s
37  17288466        7402082                     42.82%  741 MB            48 MB               339 s
38  16116394        13178457                    81.77%  1101 MB           63 MB               420 s
39  14241106        10537228                    73.99%  880 MB            57 MB               348 s
53  40876476        33780065                    82.64%  3316 MB           103 MB              1165 s
==  ==============  ==========================  ======  ================  ==================  ================

Running TemplateFilter on MNase Samples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, for each MNase sample we perform the TemplateFilter analysis.

**WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is defined by its center and its width. An odd width leads us to consider non-integer lower and upper bounds.
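
For example (a small illustration, not code from `wf.py`), a nucleosome centred at position 1000 with an odd width of 147 gets non-integer bounds:

.. code:: python

  # bounds derived from the center and the width reported by TemplateFilter
  center, width = 1000, 147
  lower_bound = center - width / 2.0  # 926.5
  upper_bound = center + width / 2.0  # 1073.5
  print(lower_bound, upper_bound)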

**WARNING** TemplateFilter is not designed to deal with replicates. We therefore recommend keeping as many nucleosomes as possible and filtering out the aberrant ones afterwards, taking advantage of the replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high overlap value (300%).

This step is performed by the following part of the `wf.py` script:

.. literalinclude:: ../../../snep/src/current/wf.py
   :start-after: # _STARTOF_ step_4
   :end-before: # _ENDOF_ step_4
   :language: python

==  ======  ==========  =============  ================
id  strain  found nucs  nuc file size  process duration
==  ======  ==========  =============  ================
1   BY      96214       68 MB          1022 s
2   BY      91694       65 MB          1038 s
3   BY      91205       65 MB          1036 s
4   RM      88076       62 MB          984 s
5   RM      90141       64 MB          967 s
6   RM      87517       62 MB          980 s
==  ======  ==========  =============  ================

..
..
.. - libcoverage.py
.. - wf.py
..
..
..
..
..
..
.. In order to simplify the design of experiment, we consider Mnase as a marker.
.. For each couple `(strain, marker)` we perform 3 replicates. So, theoritically
.. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2
.. replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is indentify by a
.. uniq identifier. The file `CSV_SAMPLE_FILE` sums up this information.
..
.. .. autodata:: configurator.CSV_SAMPLE_FILE
..     :noindex:
..
.. We use a convention to link sample and Illumina fastq outputs. Illumina output
.. files of the sample `ID` will be stored in the directory
.. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be
.. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
..
.. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
..     :noindex:
..
.. For BY (resp. RM and YJM) we use following reference genome
.. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
.. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and
.. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
.. The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
..
.. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
..     :noindex:
..
.. Each chromosome/contig is identify in the fasta file by an obscure identifier.
.. For example, BY chromosome I is identify by `gi|144228165|ref|NC_001133.7|` when
.. TemplateFilter is waiting for an integer. So, we translate it. The index
.. `FASTA_INDEXES` stores this translation.
..
.. .. autodata:: configurator.FASTA_INDEXES
..     :noindex:
..
.. From a pragamatical point of view we discard some part of the genome (repeated
.. sequence etc...). The list of the black listed area is explicitely detailled in
.. `AREA_BLACK_LIST`.
..
.. .. autodata:: configurator.AREA_BLACK_LIST
..     :noindex:
..
.. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use previously
.. compute .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp.
.. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about
.. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old
.. version of `NucleoMiner2`
.. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
..
.. .. autodata:: configurator.C2C_FILES
..     :noindex:
..
.. `nucleominer` uses specific directory to work in, these are described in
.. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
..
.. Finally, `nucleominer` use external ressources, the path to these resspources
.. are describe in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`,
.. `BEDTOOLS_BIN` and `TF_BIN` and `TF_TEMPLATES_FILE`.
..
.. All paths, prefixes and indexes could be change in the
.. `src/current/nucleominer_config.json` file.
..
.. .. autodata:: wf.json_conf_file
..     :noindex:
..

Inferring Nucleosome Position and Extracting Read Counts
--------------------------------------------------------

The second part of the tutorial uses R (http://www.r-project.org). It consists of a set of R scripts that are sourced in an R console launched at the root of your project. These scripts are:

  - headers.R
  - extract_maps.R
  - translate_common_wp.R
  - split_samples.R
  - count_reads.R
  - get_size_factors.R
  - launch_deseq.R

The Script headers.R
^^^^^^^^^^^^^^^^^^^^

The script headers.R is included in each of the other scripts. It is in charge of:

  - loading the libraries used in the scripts
  - loading the configuration (design, strain, marker...)
  - computing and caching CURs (caching means storing the information in the computer's memory)

Note that you can customise the function `translate`. This function allows you to use the alignments between genomes when performing various tasks. You may be using NucleoMiner2 to analyse data of a single strain, or of several strains.

  - All the data correspond to the same strain (e.g. treatment/control, or only a few mutations): in step 1), the regions to use are entire chromosomes; in step 2), simply use the default translate function, which is neutral.

  - The data come from two or more strains: in this case, edit a list of regions and customise the translate function, which performs the correspondence between the different genomes. How we did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to the appendix "Generate .c2c Files"), then used to produce the list of regions and to customise `translate`.

In your R console, run the following command:

.. code:: r

  source("src/current/headers.R")


The Script extract_maps.R
^^^^^^^^^^^^^^^^^^^^^^^^^

This script is in charge of extracting maps of well-positioned and fuzzy nucleosomes. First of all, it computes intra- and inter-strain nucleosome maps for each CUR. This step is executed in parallel on many cores using the BoT library. Next, it collects the results and produces the well-positioned, fuzzy and UNR maps.

The well-positioned map for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of the following columns:

 - chr, the number of the chromosome
 - lower_bound, the lower bound of the nucleosome
 - upper_bound, the upper bound of the nucleosome
 - cur_index, the index of the CUR
 - index_nuc, the index of the nucleosome in the CUR
 - wp, 1 if it is a well-positioned nucleosome, 0 otherwise
 - nb_reads, the number of reads that support this nucleosome
 - nb_nucs, the number of TemplateFilter nucleosomes across replicates (= the number of replicates in which it is a well-positioned nucleosome)
 - llr_1, for a well-positioned nucleosome, the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain
 - llr_2, for a well-positioned nucleosome, the LLR1 between the second and the third TemplateFilter nucleosome on the chain
 - wp_llr, for a well-positioned nucleosome, the LLR2 that compares the consistency of the positioning over all TemplateFilter nucleosomes
 - wp_pval, for a well-positioned nucleosome, the p-value of the chi-square test obtained with the LLR2 (`1-pchisq(2*LLR2, df=4)`; see the sketch below)
 - dyad_shift, for a well-positioned nucleosome, the shift between the two extreme TemplateFilter nucleosome dyad positions

The fuzzy map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of the following columns:

 - chr, the number of the chromosome
 - lower_bound, the lower bound of the nucleosome
 - upper_bound, the upper bound of the nucleosome
 - cur_index, the index of the CUR

The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of the following columns:

 - cur_index, the index of the CUR
 - index_nuc_BY, the index of the BY nucleosome in the CUR
 - index_nuc_RM, the index of the RM nucleosome in the CUR
 - llr_score, the LLR3 score that estimates the conservation of the positions between BY and RM
 - common_wp_pval, the p-value of the chi-square test obtained from the LLR3 (`1-pchisq(2*LLR3, df=2)`)
 - diff, the dyad shift between the positions in the two strains
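
For reference, the two chi-square p-values above can be recomputed outside R from the corresponding LLR values; a minimal Python sketch (hypothetical LLR values, SciPy assumed available):

.. code:: python

  from scipy.stats import chi2

  llr2 = 12.3  # hypothetical wp_llr (LLR2) value read from BY_wp.tab
  llr3 = 4.7   # hypothetical llr_score (LLR3) value read from BY_RM_common_wp.tab

  wp_pval = chi2.sf(2 * llr2, df=4)         # same as 1 - pchisq(2 * LLR2, df = 4) in R
  common_wp_pval = chi2.sf(2 * llr3, df=2)  # same as 1 - pchisq(2 * LLR3, df = 2) in R
  print(wp_pval, common_wp_pval)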

The common UNR map for the BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:

 - cur_index, the index of the CUR
 - index_nuc_BY, the index of the BY nucleosome in the CUR
 - index_nuc_RM, the index of the RM nucleosome in the CUR

To execute this script, run the following command in your R console:

.. code:: r

  source("src/current/extract_maps.R")


The Script translate_common_wp.R
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This script translates common well-positioned nucleosome maps from one strain to another and stores them in a table.

For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains the RM well-positioned nucleosomes translated into BY genome coordinates. It is composed of the following columns:

 - strain_ref, the reference genome (in which the positions are defined)
 - begin, the translated lower bound of the nucleosome
 - end, the translated upper bound of the nucleosome
 - chr, the chromosome number in the reference genome (in which the positions are defined)
 - length, the length of the nucleosome (can be negative)
 - cur_index, the index of the CUR
 - index_nuc, the index of the nucleosome in the CUR

To execute this script, run the following command in your R console:

.. code:: r

  source("src/current/translate_common_wp.R")


The Script split_samples.R
^^^^^^^^^^^^^^^^^^^^^^^^^^

For memory usage reasons, we split and compress the TemplateFilter input files according to their chromosome. For example, `sample_1_TF.tab` will be split into (see the sketch after this list):

  - sample_1_chr_1_splited_sample.tab.gz
  - sample_1_chr_2_splited_sample.tab.gz
  - ...
  - sample_1_chr_17_splited_sample.tab.gz
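
For illustration only, the splitting amounts to grouping the TemplateFilter input of a sample by its first column (the chromosome) and writing one gzipped file per chromosome; the real implementation is in `split_samples.R` and may differ in its details:

.. code:: python

  import csv
  import gzip
  from collections import defaultdict

  # Group the entries of one TemplateFilter input file by chromosome.
  rows_by_chr = defaultdict(list)
  with open("sample_1_TF.tab") as f:
      for row in csv.reader(f, delimiter="\t"):
          rows_by_chr[row[0]].append(row)

  # Write one compressed file per chromosome, following the naming convention above.
  for chrom, rows in rows_by_chr.items():
      out_name = "sample_1_chr_%s_splited_sample.tab.gz" % chrom
      with gzip.open(out_name, "wt") as out:
          for row in rows:
              out.write("\t".join(row) + "\n")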

To execute this script, run the following command in your R console:

.. code:: r

  source("src/current/split_samples.R")


The Script count_reads.R
^^^^^^^^^^^^^^^^^^^^^^^^

To associate a number of observations (reads) to each nucleosome, we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab`, `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab` for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, MNase common well-positioned nucleosomes and MNase UNRs, respectively.

For example, the file `BY_RM_H3K14ac_unr_and_nbreads.tab` contains the counted reads for the experimental condition ChIP H3K14ac. It is composed of the following columns (a sketch of the counting idea follows this list):

  - chr_BY, the number of the chromosome for BY
  - lower_bound_BY, the lower bound of the nucleosome for BY
  - upper_bound_BY, the upper bound of the nucleosome for BY
  - index_nuc_BY, the index of the BY nucleosome in the CUR
  - chr_RM, the number of the chromosome for RM
  - lower_bound_RM, the lower bound of the nucleosome for RM
  - upper_bound_RM, the upper bound of the nucleosome for RM
  - index_nuc_RM, the index of the RM nucleosome in the CUR
  - cur_index, the index of the CUR
  - BY_H3K14ac_36, the number of reads for the current nucleosome for sample 36
  - BY_H3K14ac_37, #reads for sample 37
  - BY_H3K14ac_53, #reads for sample 53
  - RM_H3K14ac_38, #reads for sample 38
  - RM_H3K14ac_39, #reads for sample 39
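
The counting idea is simple: for each sample, sum the `#reads` of the TemplateFilter entries whose coordinate falls inside the region. This sketch is only meant to convey the idea; the exact rule implemented in `count_reads.R` (boundary conventions, strand handling) may differ:

.. code:: python

  def count_reads(region, entries):
      """Sum the reads of one sample falling inside a nucleosome or UNR.

      `region` is a (lower_bound, upper_bound) pair and `entries` is a list
      of (coordinate, nb_reads) pairs taken from the split sample files.
      """
      lower, upper = region
      return sum(nb for coord, nb in entries if lower <= coord <= upper)

  # hypothetical nucleosome and entries: only the reads at 1005 and 1100 count
  print(count_reads((1000, 1147), [(990, 2), (1005, 3), (1100, 1), (1200, 4)]))  # 4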

To execute this script, run the following command in your R console:

.. code:: r

  source("src/current/count_reads.R")


The Script get_size_factors.R
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. These factors correspond to the sample-to-sample normalisation of read counts, as determined by DESeq: when a sample has `n` reads for a nucleosome or a UNR, the normalised count is `n/f`, where `f` is the factor contained in this file (a small numeric illustration is given after the column descriptions below). The script dumps the computed size factors into the file `size_factors.tab`. This file has the form:

========= ======= ======= =======
sample_id      wp     unr   wpunr
========= ======= ======= =======
        1 0.87396 0.88097 0.87584
        2 1.07890 1.07440 1.07760
        3 1.06400 1.05890 1.06250
        4 0.85782 0.87948 0.86305
        5 0.97577 0.96590 0.97307
        6 1.19630 1.18120 1.19190
       36 0.93318 0.92762 0.93166
       37 0.48315 0.48453 0.48350
       38 1.11240 1.11210 1.11230
       39 0.89897 0.89917 0.89903
       53 2.22650 2.22700 2.22660
========= ======= ======= =======

The sample identifiers (`sample_id`) are those given in the file `samples.csv`.

If you don't know which column to use, we recommend using `wpunr`.

If you want the detailed factors produced by DESeq, the three columns are defined as follows:

  - unr: factor computed from the data of UNR regions. These regions are defined for every pair of aligned genomes (e.g. BY-RM)
  - wp: same, but for well-positioned nucleosomes.
  - wpunr: both types of regions.
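
As a small numeric illustration of the `n/f` normalisation (values taken from the `wpunr` column of the table above), the same raw count does not mean the same thing in sample 37 and in the much deeper sample 53:

.. code:: python

  # wpunr size factors of samples 37 and 53, from the table above
  f_37, f_53 = 0.48350, 2.22660

  n = 1000  # same raw count observed in both samples
  print(n / f_37)  # ~2068.3 normalised reads in sample 37
  print(n / f_53)  # ~449.1 normalised reads in sample 53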

To execute this script, run the following command in your R console:

.. code:: r

  source("src/current/get_size_factors.R")


The Script launch_deseq.R
^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, the script `launch_deseq.R` performs a statistical analysis on each nucleosome using `DESeq`. It produces the following files:

  - results/current/BY_RM_H3K14ac_wp_snep.tab
  - results/current/BY_RM_H3K14ac_unr_snep.tab
  - results/current/BY_RM_H3K14ac_wpunr_snep.tab
  - results/current/BY_RM_H3K14ac_wp_mnase.tab
  - results/current/BY_RM_H3K14ac_unr_mnase.tab
  - results/current/BY_RM_H3K14ac_wpunr_mnase.tab

These files are organised with the following columns (see the file `BY_RM_H3K14ac_wp_snep.tab` for an example):

  - chr_BY, the number of the chromosome for BY
  - lower_bound_BY, the lower bound of the nucleosome for BY
  - upper_bound_BY, the upper bound of the nucleosome for BY
  - index_nuc_BY, the index of the BY nucleosome in the CUR
  - chr_RM, the number of the chromosome for RM
  - lower_bound_RM, the lower bound of the nucleosome for RM
  - upper_bound_RM, the upper bound of the nucleosome for RM
  - index_nuc_RM, the index of the RM nucleosome in the CUR
  - cur_index, the index of the CUR
  - form
  - BY_Mnase_Seq_1, the number of reads for the current nucleosome for sample 1

The next columns give the read counts for the remaining samples:

  - BY_Mnase_Seq_2, #reads for sample 2
  - BY_Mnase_Seq_3, #reads for sample 3
  - RM_Mnase_Seq_4, #reads for sample 4
  - RM_Mnase_Seq_5, #reads for sample 5
  - RM_Mnase_Seq_6, #reads for sample 6
  - BY_H3K14ac_36, #reads for sample 36
  - BY_H3K14ac_37, #reads for sample 37
  - BY_H3K14ac_53, #reads for sample 53
  - RM_H3K14ac_38, #reads for sample 38
  - RM_H3K14ac_39, #reads for sample 39

The last 5 columns concern the DESeq analysis:

  - manip[a_manip] strain[a_strain] manip[a_manip]:strain[a_strain], the manip (marker) effect, the strain effect and the SNEP effect. These are the coefficients of the fitted generalised linear model.
  - pvalsGLM, the p-value resulting from the comparison of the GLM models with and without the interaction term marker:strain. This is the statistical significance of the interaction term and therefore of the SNEP.
  - snep_index, a boolean set to TRUE if the pvalsGLM value is below the threshold computed with the FDR function at a rate of 0.0001 (see the sketch below).
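
The FDR computation itself is done in the R code. For intuition only, a Benjamini-Hochberg style cutoff at rate 0.0001 (which may differ from the exact FDR procedure used by `launch_deseq.R`) can be sketched as:

.. code:: python

  def fdr_threshold(pvalues, rate=0.0001):
      """Benjamini-Hochberg p-value cutoff for the given FDR rate."""
      p = sorted(pvalues)
      m = len(p)
      # keep the largest p_(k) such that p_(k) <= (k / m) * rate
      threshold = 0.0
      for k, pval in enumerate(p, start=1):
          if pval <= rate * k / m:
              threshold = pval
      return threshold

  # snep_index is then: pvalsGLM <= fdr_threshold(all pvalsGLM values, 0.0001)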

To execute this script, run the following command in your R console:

.. code:: r

  source("src/current/launch_deseq.R")


Results: Number of SNEPs
------------------------

Here is the number of computed SNEPs for each form.

===== ======= ===== =======
 form strains #nucs H3K14ac
===== ======= ===== =======
   wp   BY-RM 30464    3549
  unr   BY-RM  9497    1559
wpunr   BY-RM 39961    5240
===== ======= ===== =======

APPENDIX: Generate .c2c Files
-----------------------------

$$$ TODO make it work properly.

A `.c2c` file is a simple table that describes how the genome sequences can be aligned. We generate it using NucleoMiner 1.0.

To install NucleoMiner 1.0 on your UNIX/LINUX computer you first need to install the Genetic Data analysis Library (GDL), which is a dynamic library of useful C functions derived from the GNU Scientific Library.

Installing the GDL library
^^^^^^^^^^^^^^^^^^^^^^^^^^

The gdl-1.0.tar.gz archive is provided in the `deps` directory of your working directory. Create a dedicated directory, copy the archive into it, unpack it (this creates a directory called gdl-1.0), and compile the library by typing:

.. code:: bash

  mkdir tmp_c2c_workdir
  cd tmp_c2c_workdir
  cp ../deps/gdl-1.0.tar.gz .
  tar -xvzf gdl-1.0.tar.gz
  cd gdl-1.0
  ./configure
  make

Now you need to install the library on your system (this requires root privileges) and go back to `tmp_c2c_workdir`:

.. code:: bash

  sudo make install
  cd ..

Installing NucleoMiner 1.0
^^^^^^^^^^^^^^^^^^^^^^^^^^

The nucleominer-1.0.tar.gz archive is also provided in the `deps` directory. From the `tmp_c2c_workdir` directory, copy the archive, unpack it (this creates a directory called nucleominer-1.0) and compile it by typing:

.. code:: bash

  cp ../deps/nucleominer-1.0.tar.gz .
  tar -xvzf nucleominer-1.0.tar.gz
  cd nucleominer-1.0
  ln -s ../gdl-1.0/gdl
  ./configure
  make

You can then use the binaries directly from this folder (the best option is then to add this folder to your PATH environment variable). If you want to install nucleominer at the system level (useful if multiple users will need it), then type, with root privileges:

.. code:: bash

  sudo make install

Generate .c2c Files
^^^^^^^^^^^^^^^^^^^

To generate the .c2c files, go back to the `tmp_c2c_workdir` directory and type the following commands in a terminal:

.. code:: bash

  mkdir dir_4_c2c
  NMgxcomp ../data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
           ../data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
           dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log

After execution, the directory `dir_4_c2c` will hold the .c2c files.