Révision 3961deb6 README

b/README
101 101
------------
102 102

  
103 103
The first installation step is to retrieve the source code of
104
MyLabStocks. You can do this by typing the following command in a
104
NucleoMiner2. You can do this by typing the following command in a
105 105
terminal.
106 106

  
107 107
   git clone http://forge.cbp.ens-lyon.fr/git/nucleominer
......
135 135

  
136 136
   * DESeq
137 137

  
138
These packages can be installed by typing the following command in an
139
R console:
140

  
141
   install.packages(c("fork", "rjson", "seqinr", "plotrix"))
142
   source("http://bioconductor.org/biocLite.R")
143
   biocLite("DESeq")
144

  
145
Finally,by typing the git command above, you downloaded specific R
146
packages provided with NucleoMiner2 that you now need to install:
147

  
138 148
   * cachecache https://forge.cbp.ens-
139 149
     lyon.fr/redmine/projects/cachecache
140 150

  
......
143 153
   * nucleominer https://forge.cbp.ens-
144 154
     lyon.fr/redmine/projects/nucleominer
145 155

  
146
The first packages could be installed by typing the following command
147
in an R console:
148

  
149
   install.packages(c("fork", "rjson", "seqinr", "plotrix"))
150
   source("http://bioconductor.org/biocLite.R")
151
   biocLite("DESeq")
152

  
153
The last packages are available in the git repository they could be
154
install by typing the following command in your terminal:
156
To do so, type the following command in your terminal:
155 157

  
156 158
   cd nucleominer
157 159
   R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\
......
161 163
Tutorial
162 164
********
163 165

  
164
This tutorial describes steps allowing performing quantitative
166
This tutorial describes steps allowing to perform quantitative
165 167
analysis of epigenetic marks on individual nucleosomes. We assume that
166 168
files are organised according to a given hierarchy and that all
167 169
command lines are launched from the project’s root directory.
168 170

  
169 171
This tutorial is divided into two main parts. The first part covers
170 172
the python script *wf.py* that aligns and converts short sequence
171
reads. The second part covers the R scripts that extracts information
172
(nucleosome position and indicators) from the dataset.
173
reads. The second part covers the R scripts that extracts nucleosome-
174
level information (nucleosome position and indicators) from the
175
dataset.
173 176

  
174 177

  
175 178
Experimental Dataset, Working Directory and Configuration File
......
179 182
Working Directory Organisation
180 183
------------------------------
181 184

  
182
After having install NucleoMiner2 environment (Previous section), go
185
After having installed NucleoMiner2 environment (Previous section), go
183 186
to the root working directory of the tutorial by typing the following
184 187
command in a terminal:
185 188

  
......
196 199
$$$ TODO explain how organise Experimental Dataset into the *data*
197 200
directory of the working directory.
198 201

  
199
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each
200
strain we performed Mnase-Seq and ChIP-Seq using an antibody
201
recognizing the H3K14ac epigenetic mark.
202
In this tutorial, we want to compare nucleosomes of 2 yeast strains:
203
BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq
204
using an antibody recognizing the H3K14ac epigenetic mark. Illumina
205
sequencing was done in single-read of 50 bp long.
202 206

  
203 207
The dataset is composed of 55 files organised as follows:
204 208

  
......
236 240
Python and R Common Configuration File
237 241
--------------------------------------
238 242

  
239
First of all we define in one place some configuration variables that
240
will be launched by python and R scripts. These variables are
241
contained in file *configurator.py*. The execution of this python
242
script dumps variables into the *nucleominer_config.json* file that
243
will then be used by both R and python scripts.
243
First, we need to define useful configuration variables that will be
244
passed to python and R scripts. These variables are contained in file
245
*configurator.py*. The execution of this python script dumps variables
246
into the *nucleominer_config.json* file that will then be used by both
247
R and python scripts.
244 248

  
245
To do this, go to the root directory of your project and run the
246
following command:
249
The initialization of this variables is done in the configurator.py
250
file. If you need to adapt variable values (path, default
251
parameters...) you need to edit this file. Then, go to the root
252
directory of your project and run the following command to dump the
253
configuration file:
247 254

  
248 255
   python src/current/configurator.py
249 256

  
......
251 258
Preprocessing Illumina Fastq Reads for Each Sample
252 259
==================================================
253 260

  
254
This preprocessing step consists of 4 main steps embedded in the
255
*wf.py* script. They are described bellow. As a preamble, this script
256
computes *samples*, *samples_mnase* and *strains* that will be used
257
along the 4 steps.
261
Once variables and design have been specified, the script wf.py will
262
automatically run all the analysis. You don't need to do anything. To
263
run the full analysis, run the following command:
264

  
265
   python src/current/wf.py
266

  
267
The details of the steps performed by this script are explained below.
268
This preprocessing consists of 4 steps embedded in the *wf.py* script.
269
They are described bellow.  As a preamble, this script computes
270
*samples*, *samples_mnase* and *strains* that will be used along the 4
271
steps.
258 272

  
259 273
wf.samples = []
260 274

  
......
273 287
Creating Bowtie Index from each Reference Genome
274 288
------------------------------------------------
275 289

  
276
For each strain, we need to create bowtie index. Bowtie index of a
277
strain is a tree view of the genome of this strain. It will be used by
278
bowtie to align reads. This step is performed by the following part of
279
the *wf.py* script:
290
For each strain, the script *wf.py* then creates bowtie index. Bowtie
291
index of a strain is a tree view of the genome of this strain. It will
292
be used by bowtie to align reads. The part of the script performing
293
this is the following:
280 294

  
281 295
     for strain in strains:
282 296
       per_strain_stats[strain] = create_bowtie_index(strain, 
283 297
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], 
284 298
         config["BOWTIE_BUILD_BIN"])
285 299

  
286
The following table summarizes the file sizes and process durations
287
concerning this step.
300
As an indication, the following table summarizes the file sizes and
301
process durations that we experienced when running this step on a
302
Linux server***.
288 303

  
289 304
+--------+------------------------+------------------------+------------------+
290 305
| strain | fasta genome file size | bowtie index file size | process duration |
......
298 313
Aligning Reads to Reference Genome
299 314
----------------------------------
300 315

  
301
Next, we launch bowtie to align reads to the reference genome. It
302
produces a *.sam* file that we convert into a *.bed* file. Binaries
303
for *bowtie*, *samtools* and *bedtools* are wrapped using python
304
*subprocess* class. This step is performed by the following part of
305
the *wf.py* script:
316
Next, the *wf.py* script launches bowtie to align reads to the
317
reference genome. It produces a *.sam* file that is converted into a
318
*.bed* file. Binaries for *bowtie*, *samtools* and *bedtools* are
319
wrapped using python *subprocess* class. This step is performed by the
320
following part of the script:
306 321

  
307 322
     for sample in samples:
308 323
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, 
......
315 330
------------------------------------------------
316 331

  
317 332
TemplateFilter uses particular input formats for reads, so it is
318
necessary to convert the *.bed* files. TemplateFilter expect reads as
319
follows: *chr*, *coord*, *strand* and *#read* where:
333
necessary to convert the *.bed* files. TemplateFilter expect reads in
334
the following format: *chr*, *coord*, *strand* and *#read* where:
320 335

  
321 336
* *chr* is the number of the chromosome;
322 337

  
......
330 345

  
331 346
**WARNING** for reverse strands, bowtie returns the position of the
332 347
first nucleotide on the left hand side, whereas TemplateFilter expects
333
the first one on the right hand side.  This step takes this into
334
account by adding the read length (in our case 50) to the reverse
348
the first one on the right hand side.  This is taken into account in
349
NucleoMiner2 by adding the read length (in our case 50) to the reverse
335 350
reads coordinates.
336 351

  
337 352
This step is performed by the following part of the *wf.py* script:
......
341 356
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], 
342 357
         config["READ_LENGTH"],config["MAPQ_THRES"])
343 358

  
344
The following table summarises the number of reads, the involved file
345
sizes and process durations concerning the two last steps. In our
346
case, alignment process have been multithreaded over 3 cores.
359
The following table summarizes the number of reads, the involved file
360
sizes and process durations that we experienced when running the two
361
last steps. In our case, alignment process were multithreaded over 3
362
cores.
347 363

  
348 364
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
349 365
| id | Illumina reads | aligned and filtred reads | ratio  | *.bed* file size | TF input file size | process duration |
......
378 394
Finally, for each sample we perform TemplateFilter analysis.
379 395

  
380 396
**WARNING** TemplateFilter returns a list of nucleosomes. Each
381
nucleosome is define by its center and its width. An odd width leads
397
nucleosome is defined by its center and its width. An odd width leads
382 398
us to consider non- integer lower and upper bound.
383 399

  
384
**WARNING** TemplateFilter is not designed to deal with replicates. So
400
**WARNING** TemplateFilter was not designed to handle replicates. So
385 401
we recommend to keep a maximum of nucleosomes and filter the aberrant
386 402
ones afterwards using the benefits of having replicates. To do this,
387 403
we set a low correlation threshold parameter (0.5) and a particularly
......
416 432
========================================================
417 433

  
418 434
The second part of the tutorial uses R
419
(http://http://www.r-project.org). It consists of a set of R scripts
420
that will be sourced in an R from a console launched at the root of
421
your project. These scripts are:
435
(http://http://www.r-project.org). NucleoMiner2 contains a set of R
436
scripts that will be sourced in R from a console launched at the root
437
of your project. These scripts are:
422 438

  
423 439
   * headers.R
424 440

  
......
438 454
The Script headers.R
439 455
--------------------
440 456

  
441
The script headers.R is included in each other scripts. It is in
457
The script headers.R is included in all other R scripts. It is in
442 458
charge of:
443 459

  
444 460
   * launching libraries used in the scripts
445 461

  
446 462
   * launching configuration (design, strain, marker...)
447 463

  
448
   * computing and caching CURs (caching means storing the
449
     information in the computer's memory)
464
   * computing and caching Common Uinterrupted Regions (CURs).
465
     Caching means storing the information in the computer's memory.
450 466

  
451 467
Note that you can customize the function “translate”. This function
452 468
allows you to use the alignments between genomes when performing
453
various tasks. You may be using NucleoMiner2 to analyse data of a
454
single strain, or of several strains.
469
various tasks.
455 470

  
456
   * All the data corresponds to the same strain (e.g.
457
     treatment/control, or only few mutations): Then in step 1), the
458
     regions to use are entire chromosomes. Instep 2) simply use the
471
   * You may want to analyze data of a single strain (e.g.
472
     treatment/control, or only few mutations). In this case, the
473
     genome is identical across all samples and you do not need to
474
     define particular CURs (CURs are chromosomes). Simply use the
459 475
     default translate function which is neutral.
460 476

  
461
   * The data come from two or more strains: In this case, edit a
462
     list of regions and customize the translate function which
463
     performs the correspondence between the different genomes. How we
464
     did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to
465
     the Appendice "Generate .c2c Files"), then use it to produce the
466
     list of regions and customise “translate”.
477
   * If you are analyzing data from two or more strains (as
478
     NucleoMiner2 was designed for), then you need to translate
479
     coordinates of one genome into the coordinates of another one.
480
     You must do this by aligning the two genomes, which will produce
481
     a .c2c file (see Appendice "Generate .c2c Files").  thenuse it to
482
     produce the list of regions and customise “translate”.
467 483

  
468
In your R console, run the following command line:
484
In our tutorial, we are in the second case and to perform all these
485
steps run the following command line in your R console:
469 486

  
470 487
   source("src/current/headers.R")
471 488

  
......
474 491
-------------------------
475 492

  
476 493
This script is in charge of extracting Maps for well-positioned and
477
fuzzy nucleosomes. First of all, this script computes intra and inter-
478
strain nucleosome maps for each CUR. This step is executed in parallel
479
on many cores using the BoT library. Next, it collects results and
480
produces well-positioned, fuzzy and UNR maps.
481

  
482
The well-positioned map for BY is collected in the result directory
483
and is called *BY_wp.tab*. It is composed of following columns:
494
sensitive nucleosomes. First of all, this script computes intra and
495
inter-strain matches of nucleosome maps for each CUR. This step can be
496
executed in parallel on many cores using the BoT library. Next, it
497
collects results and produces maps of  well-positioned nucleosomes,
498
sensitive nucleosomes and Unaligned Nucleosomal Regions .
499

  
500
The map of well-positioned nucleosomes for BY is collected in the
501
result directory and is called *BY_wp.tab*. It is composed of
502
following columns:
484 503

  
485 504
   * chr, the number of the chromosome
486 505

  
......
512 531
     nucleosomes.
513 532

  
514 533
   * wp_pval, for a well-positioned nucleosome, it is the p-value
515
     chi square test obtained with the LLR2 (*1-pchisq(2.LLR2, df=4)*)
534
     chi square test obtained from LLR2 (*1-pchisq(2.LLR2, df=4)*)
516 535

  
517 536
   * dyad_shift, for a well-positioned nucleosome, it is the shift
518 537
     between the two extreme TemplateFilter nucleosome dyad positions.
519 538

  
520
The fuzzy map for BY is collected in the result directory and is
539
The sensitive map for BY is collected in the result directory and is
521 540
called *BY_fuzzy.tab*. It is composed of following columns:
522 541

  
523 542
   * chr, the number of the chromosome
......
545 564
     (*1-pchisq(2.LLR3, df=2)*)
546 565

  
547 566
   * diff, the dyads shift between the positions in the two strains
567
     (in bp)
548 568

  
549 569
The common UNR map for BY and RM strains is collected in the result
550 570
directory and is called *BY_RM_common_unr.tab*. It is composed of the
......
565 585
--------------------------------
566 586

  
567 587
This script is used to translate common well-positioned nucleosome
568
maps from a strain to another strain and stores it into a table.
588
positions from a strain to another strain and stores it into a table.
569 589

  
570 590
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM
571
well-positioned nucleosome translated into the BY genome coordinates.
591
well-positioned nucleosomes translated into the BY genome coordinates.
572 592
It is composed of following columns:
573 593

  
574 594
   * strain_ref, the reference genome (in which positioned are
......
595 615
The Script split_samples.R
596 616
--------------------------
597 617

  
598
For memory space usage reasons, we split and compress TemplateFilter
618
To optimize memory space usage, we split and compress TemplateFilter
599 619
input files according to their corresponding  chromosome. for example,
600 620
*sample_1_TF.tab* will be split into :
601 621

  
......
701 721

  
702 722
sample_id are given in file samples.csv
703 723

  
704
If you don't know which column to use, we recommend using wpunr.
724
If you don't know which column to use for normalization, we recommend
725
using wpunr.
705 726

  
706
If you want the very detailed factors produced by DESeq, here are the
707
information:
727
Here are the details of the factors produced:
708 728

  
709 729
   * unr: factor computed from data of UNR regions. These regions
710 730
     are defined for every pairs of aligned genomes (e.g. BY_RM)
......
791 811
     strain effect and the snep effect. These are the coefficients of
792 812
     the fitted generalized linear model.
793 813

  
794
   * pvalsGLM, the pvalue resulting of the comparison of the GLM
795
     model considering or not the interaction term marker:strain. This
796
     is the statsitcial significance of the interaction term and
797
     therefore the statistical significance of the SNEP.
814
   * pvalsGLM, the pvalue resulting from the comparison of the GLM
815
     model considering the interaction term *marker:strain* to the GLM
816
     model that does not consider it. This is the statsitcial
817
     significance of the interaction term and therefore the
818
     statistical significance of the SNEP.
798 819

  
799 820
   * snep_index, a boolean set to TRUE if the pvalueGLM value is
800 821
     under the threshold computed with FDR function with a rate set to
801 822
     0.0001.
802 823

  
803
To execute this script, run the following command
804

  
805 824
To execute this script, run the following command in your R console:
806 825

  
807 826
   source("src/current/launch_deseq.R")
......
826 845
APPENDICE: Generate .c2c Files
827 846
==============================
828 847

  
829
The *.c2c* files is a simple table that describes how the genome
830
sequence can be aligned. We generate it using some NucleoMiner 1.0
831
scripts.
848
The *.c2c* files is a simple table that describes how two genome
849
sequences are aligned. This file can be generated by using scripts
850
that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics
851
2010) and which we provide in this release of NucleoMiner2.
832 852

  
833
To use NucleoMiner 1.0 scripts on your UNIX/LINUX computer you need
834
first to install MUMmer which is a system for rapidly aligning entire
835
genomes, whether in complete or draft form.
853
To use these scripts on your UNIX/LINUX computer you need first to
854
install MUMmer which is designed to rapidly align entire genomes,
855
whether in complete or draft form.
836 856

  
837 857

  
838
Installing the MUMmer library
839
-----------------------------
858
Installing MUMmer
859
-----------------
840 860

  
841 861
Get the last version of MUMmer archive on your computer
842
(MUMmer3.23.tar.gz distributed in the directory deps of your working
862
(MUMmer3.23.tar.gz is provided in the directory deps of your working
843 863
directory). Copy it in a dedicated directory. Install it locally into
844 864
the src folder of you working directory by typing (working directory):
845 865

  
846
tar -xvzf gdl-1.0.tar.gz
847

  
848
This creates a directory called gdl-1.0. You now need to go into this
849
directory and compile the library, by typing:
866
tar -xvzf MUMmer3.23.tar.gz
850 867

  
851 868
   cd src
852 869
   tar xfvz ../deps/MUMmer3.23.tar.gz
......
858 875
Installing NucleoMiner 1.0 scripts
859 876
----------------------------------
860 877

  
861
Get the nucleominer-1.0.tar.gz archive on your computer (distributed
862
in the directory deps of your working directory). Install it locally
863
into the src folder of you working directory by typing (working
864
directory):
878
Get the nucleominer-1.0.tar.gz archive on your computer (this archive
879
is provided in the directory deps of your working directory). Install
880
it locally into the src folder of you working directory by typing
881
(working directory):
865 882

  
866 883
   cd src
867 884
   tar xfvz ../deps/nucleominer-1.0.tar.gz
868 885
   cd ..
869 886

  
870
This creates a directory called that contains  NucleoMiner 1.0 scripts
887
This creates a directory that contains  NucleoMiner 1.0 scripts
871 888
(src/nucleominer-1.0/scripts).
872 889

  
873 890

  

Formats disponibles : Unified diff