Révision 3c88abd0

b/README
1 1

  
2
Welcome to *NucleoMiner2*
3
*************************
4

  
5
* Readme / Documentation for *NucleoMiner2*
6

  
7
  * License
8

  
9
  * Installation Instructions
10

  
11
* Tutorial
12

  
13
  * Experimental Dataset, Working Directory and Configuration File
14

  
15
  * Preprocessing Illumina Fastq Reads for Each Sample
16

  
17
  * Inferring Nucleosome Position and Extracting Read Counts
18

  
19
  * Results: Number of SNEPs
20

  
21
  * APPENDICE: Generate .c2c Files
22

  
23
* References
24

  
25
  * Python Reference
26

  
27
  * R Reference
28

  
2 29
Readme / Documentation for *NucleoMiner2*
3 30
*****************************************
4 31

  
......
130 157
   R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\
131 158
       doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\
132 159
       build/nucleominer_2.3.46.tar.gz
160

  
161
Tutorial
162
********
163

  
164
This tutorial describes steps allowing performing quantitative
165
analysis of epigenetic marks on individual nucleosomes. We assume that
166
files are organised according to a given hierarchy and that all
167
command lines are launched from the project’s root directory.
168

  
169
This tutorial is divided into two main parts. The first part covers
170
the python script *wf.py* that aligns and converts short sequence
171
reads. The second part covers the R scripts that extracts information
172
(nucleosome position and indicators) from the dataset.
173

  
174

  
175
Experimental Dataset, Working Directory and Configuration File
176
==============================================================
177

  
178

  
179
Working Directory Organisation
180
------------------------------
181

  
182
After having install NucleoMiner2 environment (Previous section), go
183
to the root working directory of the tutorial by typing the following
184
command in a terminal:
185

  
186
   cd doc/Chuffart_NM2_workdir/
187

  
188

  
189
Retrieving Experimental Dataset
190
-------------------------------
191

  
192
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress
193
(http://www.ebi.ac.uk/arrayexpress/) under accession number
194
E-MTAB-2671.
195

  
196
$$$ TODO explain how organise Experimental Dataset into the *data*
197
directory of the working directory.
198

  
199
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each
200
strain we performed Mnase-Seq and ChIP-Seq using an antibody
201
recognizing the H3K14ac epigenetic mark.
202

  
203
The dataset is composed of 55 files organised as follows:
204

  
205
   * 3 replicates for BY MNase Seq
206

  
207
     * sample 1 (5 fastq.gz files)
208

  
209
     * sample 2 (5 fastq.gz files)
210

  
211
     * sample 3 (4 fastq.gz files)
212

  
213
   * 3 replicates for RM MNase Seq
214

  
215
     * sample 4 (4 fastq.gz files)
216

  
217
     * sample 5 (4 fastq.gz files)
218

  
219
     * sample 6 (5 fastq.gz files)
220

  
221
   * 3 replicates for BY ChIP Seq H3K14ac
222

  
223
     * sample 36 (5 fastq.gz files)
224

  
225
     * sample 37 (5 fastq.gz files)
226

  
227
     * sample 53 (9 fastq.gz files)
228

  
229
   * 2 replicates for RM ChIP Seq H3K14ac
230

  
231
     * sample 38 (5 fastq.gz files)
232

  
233
     * sample 39 (4 fastq.gz files)
234

  
235

  
236
Python and R Common Configuration File
237
--------------------------------------
238

  
239
First of all we define in one place some configuration variables that
240
will be launched by python and R scripts. These variables are
241
contained in file *configurator.py*. The execution of this python
242
script dumps variables into the *nucleominer_config.json* file that
243
will then be used by both R and python scripts.
244

  
245
To do this, go to the root directory of your project and run the
246
following command:
247

  
248
   python src/current/configurator.py
249

  
250

  
251
Preprocessing Illumina Fastq Reads for Each Sample
252
==================================================
253

  
254
This preprocessing step consists of 4 main steps embedded in the
255
*wf.py* script. They are described bellow. As a preamble, this script
256
computes *samples*, *samples_mnase* and *strains* that will be used
257
along the 4 steps.
258

  
259
wf.samples = []
260

  
261
   List of samples where a sample is identified by an id (key: *id*)
262
   and a strain name (key *strain*).
263

  
264
wf.samples_mnase = []
265

  
266
   List of Mnase samples.
267

  
268
wf.strains = []
269

  
270
   List of reference strains.
271

  
272

  
273
Creating Bowtie Index from each Reference Genome
274
------------------------------------------------
275

  
276
For each strain, we need to create bowtie index. Bowtie index of a
277
strain is a tree view of the genome of this strain. It will be used by
278
bowtie to align reads. This step is performed by the following part of
279
the *wf.py* script:
280

  
281
     for strain in strains:
282
       per_strain_stats[strain] = create_bowtie_index(strain, 
283
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], 
284
         config["BOWTIE_BUILD_BIN"])
285

  
286
The following table summarizes the file sizes and process durations
287
concerning this step.
288

  
289
+--------+------------------------+------------------------+------------------+
290
| strain | fasta genome file size | bowtie index file size | process duration |
291
+========+========================+========================+==================+
292
| BY     | 12 Mo                  | 25 Mo                  | 11 s.            |
293
+--------+------------------------+------------------------+------------------+
294
| RM     | 12 Mo                  | 24 Mo                  | 9 s.             |
295
+--------+------------------------+------------------------+------------------+
296

  
297

  
298
Aligning Reads to Reference Genome
299
----------------------------------
300

  
301
Next, we launch bowtie to align reads to the reference genome. It
302
produces a *.sam* file that we convert into a *.bed* file. Binaries
303
for *bowtie*, *samtools* and *bedtools* are wrapped using python
304
*subprocess* class. This step is performed by the following part of
305
the *wf.py* script:
306

  
307
     for sample in samples:
308
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, 
309
         config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], 
310
         config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], 
311
         config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
312

  
313

  
314
Convert Aligned Reads into TemplateFilter Format
315
------------------------------------------------
316

  
317
TemplateFilter uses particular input formats for reads, so it is
318
necessary to convert the *.bed* files. TemplateFilter expect reads as
319
follows: *chr*, *coord*, *strand* and *#read* where:
320

  
321
* *chr* is the number of the chromosome;
322

  
323
* *coord* is the coordinate of the reads;
324

  
325
* *strand* is *F* for forward and *R* for reverse;
326

  
327
* *#reads* the number of reads covering this position.
328

  
329
Each entry is *tab*-separated.
330

  
331
**WARNING** for reverse strands, bowtie returns the position of the
332
first nucleotide on the left hand side, whereas TemplateFilter expects
333
the first one on the right hand side.  This step takes this into
334
account by adding the read length (in our case 50) to the reverse
335
reads coordinates.
336

  
337
This step is performed by the following part of the *wf.py* script:
338

  
339
     for sample in samples:
340
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, 
341
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], 
342
         config["READ_LENGTH"],config["MAPQ_THRES"])
343

  
344
The following table summarises the number of reads, the involved file
345
sizes and process durations concerning the two last steps. In our
346
case, alignment process have been multithreaded over 3 cores.
347

  
348
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
349
| id | Illumina reads | aligned and filtred reads | ratio  | *.bed* file size | TF input file size | process duration |
350
+====+================+===========================+========+==================+====================+==================+
351
| 1  | 16436138       | 10199695                  | 62,06% | 1064 Mo          | 60  Mo             | 383   s.         |
352
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
353
| 2  | 16911132       | 12512727                  | 73,99% | 1298 Mo          | 64  Mo             | 437   s.         |
354
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
355
| 3  | 15946902       | 12340426                  | 77,38% | 1280 Mo          | 65  Mo             | 423   s.         |
356
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
357
| 4  | 13765584       | 10381903                  | 75,42% | 931  Mo          | 59  Mo             | 352   s.         |
358
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
359
| 5  | 15168268       | 11502855                  | 75,83% | 1031 Mo          | 64  Mo             | 386   s.         |
360
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
361
| 6  | 18850820       | 14024905                  | 74,40% | 1254 Mo          | 69  Mo             | 482   s.         |
362
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
363
| 36 | 17715118       | 14092985                  | 79,55% | 1404 Mo          | 68  Mo             | 483   s.         |
364
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
365
| 37 | 17288466       | 7402082                   | 42,82% | 741  Mo          | 48  Mo             | 339   s.         |
366
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
367
| 38 | 16116394       | 13178457                  | 81,77% | 1101 Mo          | 63  Mo             | 420   s.         |
368
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
369
| 39 | 14241106       | 10537228                  | 73,99% | 880  Mo          | 57  Mo             | 348   s.         |
370
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
371
| 53 | 40876476       | 33780065                  | 82,64% | 3316 Mo          | 103 Mo             | 1165  s.         |
372
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
373

  
374

  
375
Run TemplateFilter on Mnase Samples
376
-----------------------------------
377

  
378
Finally, for each sample we perform TemplateFilter analysis.
379

  
380
**WARNING** TemplateFilter returns a list of nucleosomes. Each
381
nucleosome is define by its center and its width. An odd width leads
382
us to consider non- integer lower and upper bound.
383

  
384
**WARNING** TemplateFilter is not designed to deal with replicates. So
385
we recommend to keep a maximum of nucleosomes and filter the aberrant
386
ones afterwards using the benefits of having replicates. To do this,
387
we set a low correlation threshold parameter (0.5) and a particularly
388
high value of overlap (300%).
389

  
390
This step is performed by the following part of the *wf.py* script:
391

  
392
     for sample in samples_mnase:
393
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, 
394
         config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], 
395
         config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], 
396
         config["TF_MAXW"], config["TF_OL"])  
397

  
398
+----+--------+------------+---------------+------------------+
399
| id | strain | found nucs | nuc file size | process duration |
400
+====+========+============+===============+==================+
401
| 1  | BY     | 96214      | 68 Mo         | 1022 s.          |
402
+----+--------+------------+---------------+------------------+
403
| 2  | BY     | 91694      | 65 Mo         | 1038 s.          |
404
+----+--------+------------+---------------+------------------+
405
| 3  | BY     | 91205      | 65 Mo         | 1036 s.          |
406
+----+--------+------------+---------------+------------------+
407
| 4  | RM     | 88076      | 62 Mo         | 984 s.           |
408
+----+--------+------------+---------------+------------------+
409
| 5  | RM     | 90141      | 64 Mo         | 967 s.           |
410
+----+--------+------------+---------------+------------------+
411
| 6  | RM     | 87517      | 62 Mo         | 980 s.           |
412
+----+--------+------------+---------------+------------------+
413

  
414

  
415
Inferring Nucleosome Position and Extracting Read Counts
416
========================================================
417

  
418
The second part of the tutorial uses R
419
(http://http://www.r-project.org). It consists of a set of R scripts
420
that will be sourced in an R from a console launched at the root of
421
your project. These scripts are:
422

  
423
   * headers.R
424

  
425
   * extract_maps.R
426

  
427
   * translate_common_wp.R
428

  
429
   * split_samples.R
430

  
431
   * count_reads.R
432

  
433
   * get_size_factors
434

  
435
   * launch_deseq.R
436

  
437

  
438
The Script headers.R
439
--------------------
440

  
441
The script headers.R is included in each other scripts. It is in
442
charge of:
443

  
444
   * launching libraries used in the scripts
445

  
446
   * launching configuration (design, strain, marker...)
447

  
448
   * computing and caching CURs (caching means storing the
449
     information in the computer's memory)
450

  
451
Note that you can customize the function “translate”. This function
452
allows you to use the alignments between genomes when performing
453
various tasks. You may be using NucleoMiner2 to analyse data of a
454
single strain, or of several strains.
455

  
456
   * All the data corresponds to the same strain (e.g.
457
     treatment/control, or only few mutations): Then in step 1), the
458
     regions to use are entire chromosomes. Instep 2) simply use the
459
     default translate function which is neutral.
460

  
461
   * The data come from two or more strains: In this case, edit a
462
     list of regions and customize the translate function which
463
     performs the correspondence between the different genomes. How we
464
     did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to
465
     the Appendice "Generate .c2c Files"), then use it to produce the
466
     list of regions and customise “translate”.
467

  
468
In your R console, run the following command line:
469

  
470
   source("src/current/headers.R")
471

  
472

  
473
The Script extract_maps.R
474
-------------------------
475

  
476
This script is in charge of extracting Maps for well-positioned and
477
fuzzy nucleosomes. First of all, this script computes intra and inter-
478
strain nucleosome maps for each CUR. This step is executed in parallel
479
on many cores using the BoT library. Next, it collects results and
480
produces well-positioned, fuzzy and UNR maps.
481

  
482
The well-positioned map for BY is collected in the result directory
483
and is called *BY_wp.tab*. It is composed of following columns:
484

  
485
   * chr, the number of the chromosome
486

  
487
   * lower_bound, the lower bound of the nucleosome
488

  
489
   * upper_bound, the upper bound of the nucleosome
490

  
491
   * cur_index, index of the CUR
492

  
493
   * index_nuc, the index of the nucleosome in the CUR
494

  
495
   * wp, 1 if it is a well positioned nucleosome, 0 otherwise
496

  
497
   * nb_reads, the number of reads that support this nucleosome
498

  
499
   * nb_nucs, the number of TemplateFilter nucleosome across
500
     replicates (= the number of replicates in which it is a well-
501
     positioned nucleosome)
502

  
503
   * llr_1, for a well-positioned nucleosome, it is the LLR1 (log-
504
     likelihood ratio) between the first and the second TemplateFilter
505
     nucleosome on the chain.
506

  
507
   * llr_2, for a well-positioned nucleosome, it is the LLR1 between
508
     the second and the third TemplateFilter nucleosome on the chain.
509

  
510
   * wp_llr, for a well-positioned nucleosome, it is the LLR2 that
511
     compares consistency of the positioning over all TemplateFilter
512
     nucleosomes.
513

  
514
   * wp_pval, for a well-positioned nucleosome, it is the p-value
515
     chi square test obtained with the LLR2 (*1-pchisq(2.LLR2, df=4)*)
516

  
517
   * dyad_shift, for a well-positioned nucleosome, it is the shift
518
     between the two extreme TemplateFilter nucleosome dyad positions.
519

  
520
The fuzzy map for BY is collected in the result directory and is
521
called *BY_fuzzy.tab*. It is composed of following columns:
522

  
523
   * chr, the number of the chromosome
524

  
525
   * lower_bound, the lower bound of the nucleosome
526

  
527
   * upper_bound, the upper bound of the nucleosome
528

  
529
   * cur_index, index of the CUR
530

  
531
The map of common well-positioned nucleosomes aligned between the BY
532
and RM strains is collected in the result directory and is called
533
*BY_RM_common_wp.tab*. It is composed of following columns:
534

  
535
   * cur_index, the index of the CUR
536

  
537
   * index_nuc_BY, the index of the BY nucleosome in the CUR
538

  
539
   * index_nuc_RM, the index of the RM nucleosome in the CUR
540

  
541
   * llr_score, , the LLR3 score that estimates conservation between
542
     the positions in BY and RM
543

  
544
   * common_wp_pval,  the p-value chi square test obtained from LLR3
545
     (*1-pchisq(2.LLR3, df=2)*)
546

  
547
   * diff, the dyads shift between the positions in the two strains
548

  
549
The common UNR map for BY and RM strains is collected in the result
550
directory and is called *BY_RM_common_unr.tab*. It is composed of the
551
following columns:
552

  
553
   * cur_index, the index of the CUR
554

  
555
   * index_nuc_BY, the index of the BY nucleosome in the CUR
556

  
557
   * index_nuc_RM,the index of the RM nucleosome in the CUR
558

  
559
To execute this script, run the following command in your R console:
560

  
561
   source("src/current/extract_maps.R")
562

  
563

  
564
The Script translate_common_wp.R
565
--------------------------------
566

  
567
This script is used to translate common well-positioned nucleosome
568
maps from a strain to another strain and stores it into a table.
569

  
570
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM
571
well-positioned nucleosome translated into the BY genome coordinates.
572
It is composed of following columns:
573

  
574
   * strain_ref, the reference genome (in which positioned are
575
     defined)
576

  
577
   * begin, the translated lower bound of the nucleosome
578

  
579
   * end, the translated upper bound of the nucleosome
580

  
581
   * chr, the number of chromosomes for the reference genome (in
582
     which positioned are defined)
583

  
584
   * length, the length of the nucleosome (could be negative)
585

  
586
   * cur_index, the index of the CUR
587

  
588
   * index_nuc, the index of the nucleosome in the CUR
589

  
590
To execute this script, run the following command in your R console:
591

  
592
   source("src/current/translate_common_wp.R")
593

  
594

  
595
The Script split_samples.R
596
--------------------------
597

  
598
For memory space usage reasons, we split and compress TemplateFilter
599
input files according to their corresponding  chromosome. for example,
600
*sample_1_TF.tab* will be split into :
601

  
602
   * sample_1_chr_1_splited_sample.tab.gz
603

  
604
   * sample_1_chr_2_splited_sample.tab.gz
605

  
606
   * ...
607

  
608
   * sample_1_chr_17_splited_sample.tab.gz
609

  
610
To execute this script, run the following command in your R console:
611

  
612
   source("src/current/split_samples.R")
613

  
614

  
615
The Script count_reads.R
616
------------------------
617

  
618
To associate a number of observations (read) to each nucleosome we run
619
the script *count_reads.R*. It produces the files
620
*BY_RM_H3K14ac_wp_and_nbreads.tab*,
621
*BY_RM_H3K14ac_unr_and_nbreads.tab*
622
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and
623
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well-
624
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned
625
nucleosomes and Mnase UNRs respectively.
626

  
627
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains
628
counted reads for well-positioned nucleosomes with the experimental
629
condition ChIP H3K14ac. It is composed of the following columns:
630

  
631
   * chr_BY, the number of the chromosome for BY
632

  
633
   * lower_bound_BY, the lower bound of the nucleosome for BY
634

  
635
   * upper_bound_BY, the upper bound of the nucleosome  for BY
636

  
637
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
638

  
639
   * chr_RM, the number of the chromosome for RM
640

  
641
   * lower_bound_RM, the lower bound of the nucleosome for RM
642

  
643
   * upper_bound_RM, the upper bound of the nucleosome  for RM
644

  
645
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
646

  
647
   * cur_index, index of the CUR
648

  
649
   * BY_H3K14ac_36, the number of reads for the current nucleosome
650
     for the sample 36
651

  
652
   * BY_H3K14ac_37, #reads for sample 37
653

  
654
   * BY_H3K14ac_53, #reads for sample 53
655

  
656
   * RM_H3K14ac_38, #reads for sample 38
657

  
658
   * RM_H3K14ac_39, #reads for sample 39
659

  
660
To execute this script, run the following command in your R console:
661

  
662
   source("src/current/count_reads.R")
663

  
664

  
665
The Script get_size_factors.R
666
-----------------------------
667

  
668
This script uses the DESeq function *estimateSizeFactors* to compute
669
the size factor of each sample. It corresponds to normalisation of
670
read counts from sample to sample, as determined by DESeq. When a
671
sample has n reads for a nucleosome or a UNR, the normalised count is
672
n/f where f is the factor contained in this file. The script dumps
673
computed size factors into the file *size_factors.tab*. This file has
674
the form:
675

  
676
+-----------+---------+---------+---------+
677
| sample_id | wp      | unr     | wpunr   |
678
+===========+=========+=========+=========+
679
| 1         | 0.87396 | 0.88097 | 0.87584 |
680
+-----------+---------+---------+---------+
681
| 2         | 1.07890 | 1.07440 | 1.07760 |
682
+-----------+---------+---------+---------+
683
| 3         | 1.06400 | 1.05890 | 1.06250 |
684
+-----------+---------+---------+---------+
685
| 4         | 0.85782 | 0.87948 | 0.86305 |
686
+-----------+---------+---------+---------+
687
| 5         | 0.97577 | 0.96590 | 0.97307 |
688
+-----------+---------+---------+---------+
689
| 6         | 1.19630 | 1.18120 | 1.19190 |
690
+-----------+---------+---------+---------+
691
| 36        | 0.93318 | 0.92762 | 0.93166 |
692
+-----------+---------+---------+---------+
693
| 37        | 0.48315 | 0.48453 | 0.48350 |
694
+-----------+---------+---------+---------+
695
| 38        | 1.11240 | 1.11210 | 1.11230 |
696
+-----------+---------+---------+---------+
697
| 39        | 0.89897 | 0.89917 | 0.89903 |
698
+-----------+---------+---------+---------+
699
| 53        | 2.22650 | 2.22700 | 2.22660 |
700
+-----------+---------+---------+---------+
701

  
702
sample_id are given in file samples.csv
703

  
704
If you don't know which column to use, we recommend using wpunr.
705

  
706
If you want the very detailed factors produced by DESeq, here are the
707
information:
708

  
709
   * unr: factor computed from data of UNR regions. These regions
710
     are defined for every pairs of aligned genomes (e.g. BY_RM)
711

  
712
   * wp: same, but for well-positioned nucleosomes.
713

  
714
   * wpunr: both types of regions.
715

  
716
To execute this script, run the following command in your R console:
717

  
718
   source("src/current/get_size_factors.R")
719

  
720

  
721
The Script launch_deseq.R
722
-------------------------
723

  
724
Finally, the script *launch_deseq.R* perform statistical analysis on
725
each nucleosome using *DESeq*. It produces files:
726

  
727
   * results/current/BY_RM_H3K14ac_wp_snep.tab
728

  
729
   * results/current/BY_RM_H3K14ac_unr_snep.tab
730

  
731
   * results/current/BY_RM_H3K14ac_wpunr_snep.tab
732

  
733
   * results/current/BY_RM_H3K14ac_wp_mnase.tab
734

  
735
   * results/current/BY_RM_H3K14ac_unr_mnase.tab
736

  
737
   * results/current/BY_RM_H3K14ac_wpunr_mnase.tab
738

  
739
These files are organised with the following columns (see file
740
*BY_RM_H3K14ac_wp_snep.tab* for an example):
741

  
742
   * chr_BY, the number of the chromosome for BY
743

  
744
   * lower_bound_BY, the lower bound of the nucleosome for BY
745

  
746
   * upper_bound_BY, the upper bound of the nucleosome  for BY
747

  
748
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
749

  
750
   * chr_RM, the number of the chromosome for RM
751

  
752
   * lower_bound_RM, the lower bound of the nucleosome for RM
753

  
754
   * upper_bound_RM, the upper bound of the nucleosome  for RM
755

  
756
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
757

  
758
   * cur_index, index of the CUR
759

  
760
   * form
761

  
762
   * BY_Mnase_Seq_1, the number of reads for the current nucleosome
763
     for the sample 1
764

  
765
Next columns concern indicators for each sample:
766

  
767
   * BY_Mnase_Seq_2, #reads for sample 2
768

  
769
   * BY_Mnase_Seq_3, #reads for sample 3
770

  
771
   * RM_Mnase_Seq_4, #reads for sample 4
772

  
773
   * RM_Mnase_Seq_5, #reads for sample 5
774

  
775
   * RM_Mnase_Seq_6, #reads for sample 6
776

  
777
   * BY_H3K14ac_36, #reads for sample 36
778

  
779
   * BY_H3K14ac_37, #reads for sample 37
780

  
781
   * BY_H3K14ac_53, #reads for sample 53
782

  
783
   * RM_H3K14ac_38, #reads for sample 38
784

  
785
   * RM_H3K14ac_39, #reads for sample 39
786

  
787
The 5 last columns concern DESeq analysis:
788

  
789
   * manip[a_manip] strain[a_strain]
790
     manip[a_strain]:strain[a_strain], the manip (marker) effect, the
791
     strain effect and the snep effect. These are the coefficients of
792
     the fitted generalized linear model.
793

  
794
   * pvalsGLM, the pvalue resulting of the comparison of the GLM
795
     model considering or not the interaction term marker:strain. This
796
     is the statsitcial significance of the interaction term and
797
     therefore the statistical significance of the SNEP.
798

  
799
   * snep_index, a boolean set to TRUE if the pvalueGLM value is
800
     under the threshold computed with FDR function with a rate set to
801
     0.0001.
802

  
803
To execute this script, run the following command
804

  
805
To execute this script, run the following command in your R console:
806

  
807
   source("src/current/launch_deseq.R")
808

  
809

  
810
Results: Number of SNEPs
811
========================
812

  
813
Here are the number of computed SNEPs for each forms.
814

  
815
+-------+---------+-------+---------+
816
| form  | strains | #nucs | H3K14ac |
817
+=======+=========+=======+=========+
818
| wp    | BY-RM   | 30464 | 3549    |
819
+-------+---------+-------+---------+
820
| unr   | BY-RM   | 9497  | 1559    |
821
+-------+---------+-------+---------+
822
| wpunr | BY-RM   | 39961 | 5240    |
823
+-------+---------+-------+---------+
824

  
825

  
826
APPENDICE: Generate .c2c Files
827
==============================
828

  
829
$$$ TODO make it works properly. working directory.
830

  
831
The *.c2c* files is a simple table that describes how the genome
832
sequence can be aligned. We generate it using NucleoMiner 1.0.
833

  
834
To install NucleoMiner 1.0 on your UNIX/LINUX computer you need first
835
to install the Genetic Data analysis Library (GDL), which is a dynamic
836
library of useful C functions derived from the GNU Scientific Library.
837

  
838

  
839
Installing the GDL library
840
--------------------------
841

  
842
Get the gdl-1.0.tar.gz archive on your computer (in the directory deps
843
of your working directory). Copy it in a dedicated directory. Go into
844
this directory using the cd command, and then unfold the archive by
845
typing:
846

  
847
tar -xvzf gdl-1.0.tar.gz
848

  
849
This creates a directory called gdl-1.0. You now need to go into this
850
directory and compile the library, by typing:
851

  
852
   mkdir tmp_c2c_workdir
853
   cd tmp_c2c_workdir
854
   cp ../deps/gdl-1.0.tar.gz .
855
   tar -xvzf gdl-1.0.tar.gz
856
   cd gdl-1.0
857
   ./configure
858
   make
859

  
860
   cd ..
861

  
862
Now you need to install the library on your system. This needs root
863
priviledges:
864

  
865
   sudo make install
866

  
867

  
868
Installing NucleoMiner 1.0
869
--------------------------
870

  
871
Get the nucleominer-1.0.tar.gz archive on your computer. Copy it in a
872
dedicated directory. Go into this directory using the cd command, and
873
then unfold the archive by typing:
874

  
875
This creates a directory called nucleominer-1.0. You now need to go
876
into this directory and compile the library, by typing:
877

  
878
   cp ../deps/nucleominer-1.0.tar.gz .
879
   tar -xvzf nucleominer-1.0.tar.gz
880
   cd nucleominer-1.0
881
   ln -s ../gdl-1.0/gdl
882
   ./configure
883
   make
884

  
885
You can then use the binaries dircetly from this folder (best then is
886
to add the path to this folder in your PATH environment variable). If
887
you want to install nucleominer at the system's level (useful if
888
mutiple users will need it) then type, with root priviledges:
889

  
890
   sudo make install
891

  
892

  
893
Generate .c2c Files
894
-------------------
895

  
896
To generate .c2c files you need to type the following command in a
897
terminal:
898

  
899
   mkdir dir_4_c2c
900
   NMgxcomp ../data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta\
901
            ../data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta\
902
            dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log
903

  
904
After execution, the directory *dir_4_c2c* will hold the .c2c files.

Formats disponibles : Unified diff