Révision dadb6a4d doc/sphinx_doc/build/text/tuto.txt
b/doc/sphinx_doc/build/text/tuto.txt | ||
---|---|---|
13 | 13 |
(nucleosome position and indicators) from the dataset. |
14 | 14 |
|
15 | 15 |
|
16 |
Dataset and Configuration File |
|
17 |
============================== |
|
16 |
Python and R Common Configuration File |
|
17 |
====================================== |
|
18 |
|
|
19 |
First of all we define in one place some configuration variables that |
|
20 |
will be launch by python and R scripts. This file is |
|
21 |
**configurator.py**. The execution of this python script dumps |
|
22 |
variables into the **nucleo_miner_config.json** that will be launch by |
|
23 |
both kind of scriopts (R and puython). |
|
24 |
|
|
25 |
To do this launch at the root of your project the following command |
|
26 |
line: |
|
27 |
|
|
28 |
python src/current/configurator.py |
|
29 |
|
|
30 |
$$$ other python script to describe: - libcoverage.py - wf.py |
|
31 |
|
|
32 |
|
|
33 |
Dataset and Configuration Variables |
|
34 |
=================================== |
|
18 | 35 |
|
19 | 36 |
We want to compare nucleosomes of 3 yeast strains: |
20 | 37 |
|
... | ... | |
37 | 54 |
|
38 | 55 |
* H4K12ac |
39 | 56 |
|
40 |
In order to simplify the design of exeriment, we considere Mnase as a |
|
57 |
In order to simplify the design of experiment, we considere Mnase as a
|
|
41 | 58 |
marker. For each couple *(strain, marker)* we perform 3 replicates. |
42 | 59 |
So, theoritically we should have *3 * (1 + 5) * 3 = 54* samples. In |
43 | 60 |
practice we only obtain 2 replicates for *(YJM, H3K4me1)*. Each one of |
44 | 61 |
the 53 samples is indentify by a uniq identifier. The file |
45 | 62 |
*CSV_SAMPLE_FILE* sums up this information. |
46 | 63 |
|
64 |
configurator.CSV_SAMPLE_FILE = None |
|
65 |
|
|
66 |
Path to cvs file that contains sample information. |
|
67 |
|
|
47 | 68 |
We use a convention to link sample and Illumina fastq outputs. |
48 | 69 |
Illumina output files of the sample *ID* will be stored in the |
49 | 70 |
directory *ILLUMINA_OUTPUTFILE_PREFIX* + *ID*. For example, sample 41 |
50 | 71 |
outputs will be stored in the directory |
51 | 72 |
*data/2012-09-05/FASTQ/Sample_Yvert_Bq41/*. |
52 | 73 |
|
74 |
configurator.ILLUMINA_OUTPUTFILE_PREFIX = None |
|
75 |
|
|
76 |
Prefix for Illumina fastq output files. |
|
77 |
|
|
53 | 78 |
For BY (resp. RM and YJM) we use following reference genome |
54 | 79 |
*saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta* (resp. |
55 | 80 |
*saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta* and |
56 | 81 |
*saccharomyces_cerevisiae_YJM_789_screencontig.fasta*). The index |
57 | 82 |
*FASTA_REFERENCE_GENOME_FILES* stores this information. |
58 | 83 |
|
84 |
configurator.FASTA_REFERENCE_GENOME_FILES = None |
|
85 |
|
|
86 |
Dictionary where each fasta reference genomes is indexed by |
|
87 |
reference strain that it corresponds. |
|
88 |
|
|
59 | 89 |
Each chromosome/contig is identify in the fasta file by an obscure |
60 | 90 |
identifier. For example, BY chromosome I is identify by |
61 | 91 |
*gi|144228165|ref|NC_001133.7|* when TemplateFilter is waiting for an |
62 | 92 |
integer. So, we translate it. The index *FASTA_INDEXES* stores this |
63 | 93 |
translation. |
64 | 94 |
|
95 |
configurator.FASTA_INDEXES = None |
|
96 |
|
|
97 |
Dictionary of strain that indexes dictionaries where keys are |
|
98 |
chromosome reference from Fastq file and value are its |
|
99 |
correspondance for Templatefilter. |
|
100 |
|
|
65 | 101 |
From a pragamatical point of view we discard some part of the genome |
66 | 102 |
(repeated sequence etc...). The list of the black listed area is |
67 | 103 |
explicitely detailled in *AREA_BLACK_LIST*. |
68 | 104 |
|
105 |
configurator.AREA_BLACK_LIST = None |
|
106 |
|
|
107 |
Dictionary where keys are strain and values are black listed of |
|
108 |
geneome region. |
|
109 |
|
|
69 | 110 |
For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use |
70 | 111 |
previously compute .c2c file |
71 | 112 |
*data/2012-03_primarydata/BY_RM_gxcomp.c2c* (resp. |
... | ... | |
74 | 115 |
*NucleoMiner*, the old version of *NucleoMiner2* (http://www.ens- |
75 | 116 |
lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf). |
76 | 117 |
|
118 |
configurator.C2C_FILES = None |
|
119 |
|
|
120 |
Dictionary where each strain combination indexes genome aligment. |
|
121 |
|
|
77 | 122 |
*nucleominer* uses specific directory to work in, these are described |
78 | 123 |
in *INDEX_DIR*, *ALIGN_DIR* and *LOG_DIR*. |
79 | 124 |
|
... | ... | |
84 | 129 |
All paths, prefixes and indexes could be change in the |
85 | 130 |
*src/current/nucleominer_config.json* file. |
86 | 131 |
|
132 |
wf.json_conf_file = 'src/current/nucleominer_config.json' |
|
133 |
|
|
134 |
Path to the json configuration file. |
|
135 |
|
|
87 | 136 |
|
88 | 137 |
Preprocessing Illumina Fastq Reads for Each Sample |
89 | 138 |
================================================== |
... | ... | |
93 | 142 |
*samples* *samples_mnase* and *strains* that will be used along the 4 |
94 | 143 |
steps. |
95 | 144 |
|
145 |
wf.samples = [] |
|
146 |
|
|
147 |
List of samples where a sample is identify by an id (key: *id*) and |
|
148 |
a strain name (key *strain*). |
|
149 |
|
|
150 |
wf.samples_mnase = [] |
|
151 |
|
|
152 |
List of Mnase samples. |
|
153 |
|
|
154 |
wf.strains = [] |
|
155 |
|
|
156 |
List of reference strains. |
|
157 |
|
|
96 | 158 |
|
97 | 159 |
Creating Bowtie Index from each Reference Genome |
98 | 160 |
------------------------------------------------ |
... | ... | |
102 | 164 |
will be used by bowtie to align reads. This step is performed by the |
103 | 165 |
following part of the *wf.py* script: |
104 | 166 |
|
167 |
for strain in strains: |
|
168 |
per_strain_stats[strain] = create_bowtie_index(strain, |
|
169 |
config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], |
|
170 |
config["BOWTIE_BUILD_BIN"]) |
|
171 |
|
|
105 | 172 |
The following table sum up involved file sizes and process durations |
106 | 173 |
concerning this step. |
107 | 174 |
|
... | ... | |
125 | 192 |
*subprocess* class. This step is performed by the followinw part of |
126 | 193 |
the *wf.py* script: |
127 | 194 |
|
195 |
for sample in samples: |
|
196 |
per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, |
|
197 |
config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], |
|
198 |
config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], |
|
199 |
config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"]) |
|
200 |
|
|
128 | 201 |
|
129 | 202 |
Convert Aligned Reads for TemplateFilter |
130 | 203 |
---------------------------------------- |
... | ... | |
150 | 223 |
|
151 | 224 |
This step is performed by the followinw part of the *wf.py* script: |
152 | 225 |
|
226 |
for sample in samples: |
|
227 |
per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, |
|
228 |
config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], |
|
229 |
config["READ_LENGTH"],config["MAPQ_THRES"]) |
|
230 |
|
|
153 | 231 |
The following table sum up number of reads, involved file sizes and |
154 | 232 |
process durations concerning the two last steps. In our case, aligment |
155 | 233 |
process have been multuthreaded over over 3 cores. |
... | ... | |
264 | 342 |
| 55 | 52424414 | 47117107 | 89,88% | 3811 Mo | 119 Mo | 1477 s. | |
265 | 343 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
266 | 344 |
|
267 |
For some reasons (manipulation efficency, e.g. PCR...), we remove |
|
345 |
For some reasons (manipulation efficiency, e.g. PCR...), we remove
|
|
268 | 346 |
samples 33, 45, 48 and 55. |
269 | 347 |
|
270 | 348 |
|
... | ... | |
285 | 363 |
|
286 | 364 |
This step is performed by the followinw part of the *wf.py* script: |
287 | 365 |
|
366 |
for sample in samples_mnase: |
|
367 |
per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, |
|
368 |
config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], |
|
369 |
config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], |
|
370 |
config["TF_MAXW"], config["TF_OL"]) |
|
371 |
|
|
288 | 372 |
+----+--------+------------+---------------+------------------+ |
289 | 373 |
| id | strain | found nucs | nuc file size | process duration | |
290 | 374 |
+====+========+============+===============+==================+ |
... | ... | |
311 | 395 |
Inferring Nucleosome Position and Extracting Read Counts |
312 | 396 |
======================================================== |
313 | 397 |
|
314 |
This preprocessing step consists in the 4 main steps embed in the |
|
315 |
*wf.py* and described bellow. As a preamble, this script computes |
|
316 |
*samples* *samples_mnase* and *strains* that will be used along the 4 |
|
317 |
steps. |
|
318 |
|
|
319 |
The second part of the tutoriel use *R* |
|
320 |
(http://http://www.r-project.org). It consists in the following main |
|
321 |
steps: |
|
398 |
The second part of the tutorial uses *R* |
|
399 |
(http://http://www.r-project.org). It consists in a set of R scripts |
|
400 |
taht will be sourced in an R console launched at the root of your |
|
401 |
project. the R srcipts are: |
|
322 | 402 |
|
323 |
* compute_rois.R
|
|
403 |
* headers.R
|
|
324 | 404 |
|
325 | 405 |
* extract_maps.R |
326 | 406 |
|
... | ... | |
335 | 415 |
* launch_deseq.R |
336 | 416 |
|
337 | 417 |
|
338 |
Computing Common Genome Region Between Strains |
|
339 |
---------------------------------------------- |
|
418 |
The Script headers.R |
|
419 |
-------------------- |
|
420 |
|
|
421 |
The script header.R is included in each other scripts. It is in charge |
|
422 |
of: |
|
423 |
|
|
424 |
* launching libraries used in thes scripts |
|
425 |
|
|
426 |
* launching configuration (design, strain, marker...) |
|
427 |
|
|
428 |
* computing and caching CURs |
|
429 |
|
|
430 |
In your R console, run the following command line: |
|
431 |
|
|
432 |
R CMD BATCH src/current/header.R |
|
433 |
|
|
434 |
|
|
435 |
The Script extract_maps.R |
|
436 |
------------------------- |
|
437 |
|
|
438 |
This script is in charge of extracting Maps for well positioned and |
|
439 |
fuzzy nucleosomes. First of all, this script computed intra and inter |
|
440 |
strain nucleosome maps for each CUR. This step is executed in parallel |
|
441 |
on many cores using the BoT library. Next, it collects results and |
|
442 |
produces well positioned, fuzzy and UNR maps. |
|
443 |
|
|
444 |
The well-positioned map for BY is collected in the result directory |
|
445 |
and is called **BY_wp.tab**. It is composed of following columns: |
|
446 |
|
|
447 |
* chr, the number of the chromosome |
|
448 |
|
|
449 |
* lower_bound, the lower bound of the nucleosome |
|
450 |
|
|
451 |
* upper_bound, the upper bound of the nucleosome |
|
452 |
|
|
453 |
* cur_index, index of the CUR |
|
454 |
|
|
455 |
* index_nuc, the index of the nucleosome in the CUR |
|
456 |
|
|
457 |
* wp, 1 if it is a well positioned nucleosome, 0 else |
|
458 |
|
|
459 |
* nb_reads, the number of reads that supports this nucleosome |
|
460 |
|
|
461 |
* nb_nucs, the number of TemplateFilter nucleosome across |
|
462 |
replicates (= the number of replicates if it is a well-positioned |
|
463 |
nucleosome) |
|
464 |
|
|
465 |
* llr_1, for a well-positioned nucleosome, it is the LLR1 between |
|
466 |
the first and the second TemplateFilter nucleosome. |
|
467 |
|
|
468 |
* llr_2, for a well-positioned nucleosome, it is the LLR1 between |
|
469 |
the second and the first TemplateFilter nucleosome. |
|
470 |
|
|
471 |
* wp_llr, for a well-positioned nucleosome, it is the LLR2 |
|
472 |
overall TemplateFilter nucleosomes. |
|
473 |
|
|
474 |
* wp_pval, for a well-positioned nucleosome, it is the p-value |
|
475 |
chi square test obtained with the LLR2 (**1-pchisq(2.LLR2, |
|
476 |
df=4)**) |
|
477 |
|
|
478 |
* dyad_shift, for a well-positioned nucleosome, it is shift |
|
479 |
between the two extreme TemplateFilter nucleosome dyad positions. |
|
480 |
|
|
481 |
The fuzzy map for BY is collected in the result directory and is |
|
482 |
called **BY_fuzzy.tab**. It is composed of following columns: |
|
483 |
|
|
484 |
* chr, the number of the chromosome |
|
485 |
|
|
486 |
* lower_bound, the lower bound of the nucleosome |
|
487 |
|
|
488 |
* upper_bound, the upper bound of the nucleosome |
|
489 |
|
|
490 |
* cur_index, index of the CUR |
|
491 |
|
|
492 |
The common well-position map for BY and RM strains is collected in the |
|
493 |
result directory and is called **BY_RM_common_wp.tab**. It is composed |
|
494 |
of following columns: |
|
495 |
|
|
496 |
* cur_index, the index of the CUR |
|
497 |
|
|
498 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
|
499 |
|
|
500 |
* index_nuc_RM,the index of the RM nucleosome in the CUR |
|
501 |
|
|
502 |
* llr_score, the LLR3 score between th eBy and RM nucleosomes |
|
503 |
|
|
504 |
* common_wp_pval, the p-value chi square test obtained with the |
|
505 |
LLR3 (**1-pchisq(2.LLR3, df=2)**) |
|
506 |
|
|
507 |
The common UNR map for BY and RM strains is collected in the result |
|
508 |
directory and is called **BY_RM_common_unr.tab**. It is composed of |
|
509 |
following columns: |
|
510 |
|
|
511 |
* cur_index, the index of the CUR |
|
512 |
|
|
513 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
|
514 |
|
|
515 |
* index_nuc_RM,the index of the RM nucleosome in the CUR |
|
516 |
|
|
517 |
To execute this script, run the following command line in your R |
|
518 |
console: |
|
519 |
|
|
520 |
source("src/current/extract_maps.R") |
|
521 |
|
|
522 |
|
|
523 |
The Script compare_common_wp.R |
|
524 |
------------------------------ |
|
525 |
|
|
526 |
This script is used to compare inter strain distances between common |
|
527 |
well-positioned nucleosomes. |
|
528 |
|
|
529 |
For example, it compute the file **BY_RM_common_wp_diff.tab** that |
|
530 |
contains dyad shifts between two well-positioned nucleosomes. It is |
|
531 |
composed of following columns: |
|
532 |
* cur_index, the index of the CUR |
|
533 |
|
|
534 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
|
535 |
|
|
536 |
* index_nuc_RM,the index of the RM nucleosome in the CUR |
|
537 |
|
|
538 |
* llr_score, the LLR3 score between th eBy and RM nucleosomes |
|
539 |
|
|
540 |
* common_wp_pval, the p-value chi square test obtained with the |
|
541 |
LLR3 (**1-pchisq(2.LLR3, df=2)**) |
|
542 |
|
|
543 |
* diff, the dyad shifts between two well-positioned nucleosomes |
|
544 |
|
|
545 |
It also translates well-positioned nucleosome maps from a strain to an |
|
546 |
other strain and stores it into a table. |
|
547 |
|
|
548 |
For example, the file **results/2014-04/RM_wp_tr_2_BY.tab** contains |
|
549 |
RM well-positioned nucleosome translated into the BY genome |
|
550 |
referential. It is composed of following columns: |
|
551 |
|
|
552 |
* strain_ref, the reference genome (in which positioned are |
|
553 |
defined) |
|
554 |
|
|
555 |
* begin, the translated lower bound of the nucleosome |
|
340 | 556 |
|
341 |
R CMD BATCH src/current/compute_rois.R
|
|
557 |
* end, the translated upper bound of the nucleosome
|
|
342 | 558 |
|
559 |
* chr, the number of chromosome for the reference genome (in |
|
560 |
which positioned are defined) |
|
343 | 561 |
|
344 |
Extracting Maps for Well Positionned and Fuzzy Nucleosomes |
|
345 |
---------------------------------------------------------- |
|
562 |
* length, the length of the nucleosome (could be negative) |
|
346 | 563 |
|
347 |
R CMD BATCH src/current/extract_maps.R
|
|
564 |
* cur_index, the index of the CUR
|
|
348 | 565 |
|
566 |
* index_nuc, the index of the nucleosome in the CUR |
|
349 | 567 |
|
350 |
Compute Distance Between Well Positionned Nucleosomes
|
|
351 |
-----------------------------------------------------
|
|
568 |
To execute this script, run the following command line in your R
|
|
569 |
console:
|
|
352 | 570 |
|
353 | 571 |
R CMD BATCH src/current/compare_common_wp.R |
354 | 572 |
|
... | ... | |
392 | 610 |
Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination, |
393 | 611 |
marker is in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each |
394 | 612 |
post translational histone modification and form is in {wp, fuzzy, |
395 |
wpfuzzy} considering well positionned nucleosomes, fuzzy nucleosomes
|
|
396 |
or both for SNEP computation.
|
|
613 |
wpfuzzy} considering well positioned nucleosomes, fuzzy nucleosomes or
|
|
614 |
both for SNEP computation. |
|
397 | 615 |
|
398 | 616 |
chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM |
399 | 617 |
lower_bound_RM upper_bound_RM index_nuc_RM roi_index form |
Formats disponibles : Unified diff