Révision dadb6a4d doc/sphinx_doc/build/text/tuto.txt

b/doc/sphinx_doc/build/text/tuto.txt
13 13
(nucleosome position and indicators) from the dataset.
14 14

  
15 15

  
16
Dataset and Configuration File
17
==============================
16
Python and R Common Configuration File
17
======================================
18

  
19
First of all we define in one place some configuration variables that
20
will be launch by python and R scripts. This file is
21
**configurator.py**. The execution of this python script dumps
22
variables into the **nucleo_miner_config.json** that will be launch by
23
both kind of scriopts (R and puython).
24

  
25
To do this launch at the root of your project the following command
26
line:
27

  
28
   python src/current/configurator.py
29

  
30
$$$ other python script to describe: - libcoverage.py - wf.py
31

  
32

  
33
Dataset and Configuration Variables
34
===================================
18 35

  
19 36
We want to compare nucleosomes of 3 yeast strains:
20 37

  
......
37 54

  
38 55
* H4K12ac
39 56

  
40
In order to simplify the design of exeriment, we considere Mnase as a
57
In order to simplify the design of experiment, we considere Mnase as a
41 58
marker. For each couple *(strain, marker)* we perform 3 replicates.
42 59
So, theoritically we should have *3 * (1 + 5) * 3 = 54* samples. In
43 60
practice we only obtain 2 replicates for *(YJM, H3K4me1)*. Each one of
44 61
the 53 samples is indentify by a uniq identifier. The file
45 62
*CSV_SAMPLE_FILE* sums up this information.
46 63

  
64
configurator.CSV_SAMPLE_FILE = None
65

  
66
   Path to cvs file that contains sample information.
67

  
47 68
We use a convention to link sample and Illumina fastq outputs.
48 69
Illumina output files of the sample *ID* will be stored in the
49 70
directory *ILLUMINA_OUTPUTFILE_PREFIX* + *ID*. For example, sample 41
50 71
outputs will be stored in the directory
51 72
*data/2012-09-05/FASTQ/Sample_Yvert_Bq41/*.
52 73

  
74
configurator.ILLUMINA_OUTPUTFILE_PREFIX = None
75

  
76
   Prefix for Illumina fastq output files.
77

  
53 78
For BY (resp. RM and YJM) we use following reference genome
54 79
*saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta* (resp.
55 80
*saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta* and
56 81
*saccharomyces_cerevisiae_YJM_789_screencontig.fasta*). The index
57 82
*FASTA_REFERENCE_GENOME_FILES* stores this information.
58 83

  
84
configurator.FASTA_REFERENCE_GENOME_FILES = None
85

  
86
   Dictionary where each fasta reference genomes is indexed by
87
   reference strain that it corresponds.
88

  
59 89
Each chromosome/contig is identify in the fasta file by an obscure
60 90
identifier. For example, BY chromosome I is identify by
61 91
*gi|144228165|ref|NC_001133.7|* when TemplateFilter is waiting for an
62 92
integer. So, we translate it. The index *FASTA_INDEXES* stores this
63 93
translation.
64 94

  
95
configurator.FASTA_INDEXES = None
96

  
97
   Dictionary of strain that indexes dictionaries where keys are
98
   chromosome reference from Fastq file and value are its
99
   correspondance for Templatefilter.
100

  
65 101
From a pragamatical point of view we discard some part of the genome
66 102
(repeated sequence etc...). The list of the black listed area is
67 103
explicitely detailled in *AREA_BLACK_LIST*.
68 104

  
105
configurator.AREA_BLACK_LIST = None
106

  
107
   Dictionary where keys are strain and values are black listed of
108
   geneome region.
109

  
69 110
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use
70 111
previously compute .c2c file
71 112
*data/2012-03_primarydata/BY_RM_gxcomp.c2c* (resp.
......
74 115
*NucleoMiner*, the old version of *NucleoMiner2* (http://www.ens-
75 116
lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
76 117

  
118
configurator.C2C_FILES = None
119

  
120
   Dictionary where each strain combination indexes genome aligment.
121

  
77 122
*nucleominer* uses specific directory to work in, these are described
78 123
in *INDEX_DIR*, *ALIGN_DIR* and *LOG_DIR*.
79 124

  
......
84 129
All paths, prefixes and indexes could be change in the
85 130
*src/current/nucleominer_config.json* file.
86 131

  
132
wf.json_conf_file = 'src/current/nucleominer_config.json'
133

  
134
   Path to the json configuration file.
135

  
87 136

  
88 137
Preprocessing Illumina Fastq Reads for Each Sample
89 138
==================================================
......
93 142
*samples* *samples_mnase* and *strains* that will be used along the 4
94 143
steps.
95 144

  
145
wf.samples = []
146

  
147
   List of samples where a sample is identify by an id (key: *id*) and
148
   a strain name (key *strain*).
149

  
150
wf.samples_mnase = []
151

  
152
   List of Mnase samples.
153

  
154
wf.strains = []
155

  
156
   List of reference strains.
157

  
96 158

  
97 159
Creating Bowtie Index from each Reference Genome
98 160
------------------------------------------------
......
102 164
will be used by bowtie to align reads. This step is performed by the
103 165
following part of the *wf.py* script:
104 166

  
167
     for strain in strains:
168
       per_strain_stats[strain] = create_bowtie_index(strain, 
169
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], 
170
         config["BOWTIE_BUILD_BIN"])
171

  
105 172
The following table sum up involved file sizes and process durations
106 173
concerning this step.
107 174

  
......
125 192
*subprocess* class. This step is performed by the followinw part of
126 193
the *wf.py* script:
127 194

  
195
     for sample in samples:
196
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, 
197
         config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], 
198
         config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], 
199
         config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
200

  
128 201

  
129 202
Convert Aligned Reads for TemplateFilter
130 203
----------------------------------------
......
150 223

  
151 224
This step is performed by the followinw part of the *wf.py* script:
152 225

  
226
     for sample in samples:
227
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, 
228
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], 
229
         config["READ_LENGTH"],config["MAPQ_THRES"])
230

  
153 231
The following table sum up number of reads, involved file sizes and
154 232
process durations concerning the two last steps. In our case, aligment
155 233
process have been multuthreaded over over 3 cores.
......
264 342
| 55 | 52424414       | 47117107                  | 89,88% | 3811 Mo          | 119 Mo             | 1477  s.         |
265 343
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
266 344

  
267
For some reasons (manipulation efficency, e.g. PCR...), we remove
345
For some reasons (manipulation efficiency, e.g. PCR...), we remove
268 346
samples 33, 45, 48 and 55.
269 347

  
270 348

  
......
285 363

  
286 364
This step is performed by the followinw part of the *wf.py* script:
287 365

  
366
     for sample in samples_mnase:
367
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, 
368
         config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], 
369
         config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], 
370
         config["TF_MAXW"], config["TF_OL"])  
371

  
288 372
+----+--------+------------+---------------+------------------+
289 373
| id | strain | found nucs | nuc file size | process duration |
290 374
+====+========+============+===============+==================+
......
311 395
Inferring Nucleosome Position and Extracting Read Counts
312 396
========================================================
313 397

  
314
This preprocessing step consists in the 4 main steps embed in the
315
*wf.py* and described bellow. As a preamble, this script computes
316
*samples* *samples_mnase* and *strains* that will be used along the 4
317
steps.
318

  
319
The second part of the tutoriel use *R*
320
(http://http://www.r-project.org). It consists in the following main
321
steps:
398
The second part of the tutorial uses *R*
399
(http://http://www.r-project.org). It consists in a set of R scripts
400
taht will be sourced in an R console launched at the root of your
401
project. the R srcipts are:
322 402

  
323
   * compute_rois.R
403
   * headers.R
324 404

  
325 405
   * extract_maps.R
326 406

  
......
335 415
   * launch_deseq.R
336 416

  
337 417

  
338
Computing Common Genome Region Between Strains
339
----------------------------------------------
418
The Script headers.R
419
--------------------
420

  
421
The script header.R is included in each other scripts. It is in charge
422
of:
423

  
424
   * launching libraries used in thes scripts
425

  
426
   * launching configuration (design, strain, marker...)
427

  
428
   * computing and caching CURs
429

  
430
In your R console, run the following command line:
431

  
432
   R CMD BATCH src/current/header.R
433

  
434

  
435
The Script extract_maps.R
436
-------------------------
437

  
438
This script is in charge of extracting Maps for well positioned and
439
fuzzy nucleosomes. First of all, this script computed intra and inter
440
strain nucleosome maps for each CUR. This step is executed in parallel
441
on many cores using the BoT library. Next, it collects results and
442
produces well positioned, fuzzy and UNR maps.
443

  
444
The well-positioned map for BY is collected in the result directory
445
and is called **BY_wp.tab**. It is composed of following columns:
446

  
447
   * chr, the number of the chromosome
448

  
449
   * lower_bound, the lower bound of the nucleosome
450

  
451
   * upper_bound, the upper bound of the nucleosome
452

  
453
   * cur_index, index of the CUR
454

  
455
   * index_nuc, the index of the nucleosome in the CUR
456

  
457
   * wp, 1 if it is a well positioned nucleosome, 0 else
458

  
459
   * nb_reads, the number of reads that supports this nucleosome
460

  
461
   * nb_nucs, the number of TemplateFilter nucleosome across
462
     replicates (= the number of replicates if it is a well-positioned
463
     nucleosome)
464

  
465
   * llr_1, for a well-positioned nucleosome, it is the LLR1 between
466
     the first and the second TemplateFilter nucleosome.
467

  
468
   * llr_2, for a well-positioned nucleosome, it is the LLR1 between
469
     the second and the first TemplateFilter nucleosome.
470

  
471
   * wp_llr, for a well-positioned nucleosome, it is the LLR2
472
     overall TemplateFilter nucleosomes.
473

  
474
   * wp_pval, for a well-positioned nucleosome, it is the p-value
475
     chi square test obtained with the LLR2 (**1-pchisq(2.LLR2,
476
     df=4)**)
477

  
478
   * dyad_shift, for a well-positioned nucleosome, it is shift
479
     between the two extreme TemplateFilter nucleosome dyad positions.
480

  
481
The fuzzy map for BY is collected in the result directory and is
482
called **BY_fuzzy.tab**. It is composed of following columns:
483

  
484
   * chr, the number of the chromosome
485

  
486
   * lower_bound, the lower bound of the nucleosome
487

  
488
   * upper_bound, the upper bound of the nucleosome
489

  
490
   * cur_index, index of the CUR
491

  
492
The common well-position map for BY and RM strains is collected in the
493
result directory and is called **BY_RM_common_wp.tab**. It is composed
494
of following columns:
495

  
496
   * cur_index, the index of the CUR
497

  
498
   * index_nuc_BY, the index of the BY nucleosome in the CUR
499

  
500
   * index_nuc_RM,the index of the RM nucleosome in the CUR
501

  
502
   * llr_score, the LLR3 score between th eBy and RM nucleosomes
503

  
504
   * common_wp_pval,  the p-value chi square test obtained with the
505
     LLR3 (**1-pchisq(2.LLR3, df=2)**)
506

  
507
The common UNR map for BY and RM strains is collected in the result
508
directory and is called **BY_RM_common_unr.tab**. It is composed of
509
following columns:
510

  
511
   * cur_index, the index of the CUR
512

  
513
   * index_nuc_BY, the index of the BY nucleosome in the CUR
514

  
515
   * index_nuc_RM,the index of the RM nucleosome in the CUR
516

  
517
To execute this script, run the following command line in your R
518
console:
519

  
520
   source("src/current/extract_maps.R")
521

  
522

  
523
The Script compare_common_wp.R
524
------------------------------
525

  
526
This script is used to compare inter strain distances between common
527
well-positioned nucleosomes.
528

  
529
For example, it compute the file **BY_RM_common_wp_diff.tab** that
530
contains dyad shifts between two well-positioned nucleosomes. It is
531
composed of following columns:
532
   * cur_index, the index of the CUR
533

  
534
   * index_nuc_BY, the index of the BY nucleosome in the CUR
535

  
536
   * index_nuc_RM,the index of the RM nucleosome in the CUR
537

  
538
   * llr_score, the LLR3 score between th eBy and RM nucleosomes
539

  
540
   * common_wp_pval,  the p-value chi square test obtained with the
541
     LLR3 (**1-pchisq(2.LLR3, df=2)**)
542

  
543
   * diff, the dyad shifts between two well-positioned nucleosomes
544

  
545
It also translates well-positioned nucleosome maps from a strain to an
546
other strain and stores it into a table.
547

  
548
For example, the file **results/2014-04/RM_wp_tr_2_BY.tab** contains
549
RM well-positioned nucleosome translated into the BY genome
550
referential. It is composed of following columns:
551

  
552
   * strain_ref, the reference genome (in which positioned are
553
     defined)
554

  
555
   * begin, the translated lower bound of the nucleosome
340 556

  
341
   R CMD BATCH src/current/compute_rois.R
557
   * end, the translated upper bound of the nucleosome
342 558

  
559
   * chr, the number of chromosome for the reference genome (in
560
     which positioned are defined)
343 561

  
344
Extracting Maps for Well Positionned and Fuzzy Nucleosomes
345
----------------------------------------------------------
562
   * length, the length of the nucleosome (could be negative)
346 563

  
347
   R CMD BATCH src/current/extract_maps.R
564
   * cur_index, the index of the CUR
348 565

  
566
   * index_nuc, the index of the nucleosome in the CUR
349 567

  
350
Compute Distance Between Well Positionned Nucleosomes
351
-----------------------------------------------------
568
To execute this script, run the following command line in your R
569
console:
352 570

  
353 571
   R CMD BATCH src/current/compare_common_wp.R
354 572

  
......
392 610
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination,
393 611
marker is in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each
394 612
post translational histone modification and form is in {wp, fuzzy,
395
wpfuzzy} considering well positionned nucleosomes, fuzzy nucleosomes
396
or both for SNEP computation.
613
wpfuzzy} considering well positioned nucleosomes, fuzzy nucleosomes or
614
both for SNEP computation.
397 615

  
398 616
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM
399 617
lower_bound_RM upper_bound_RM index_nuc_RM roi_index form

Formats disponibles : Unified diff