Révision 3961deb6 README
b/README | ||
---|---|---|
101 | 101 |
------------ |
102 | 102 |
|
103 | 103 |
The first installation step is to retrieve the source code of |
104 |
MyLabStocks. You can do this by typing the following command in a
|
|
104 |
NucleoMiner2. You can do this by typing the following command in a
|
|
105 | 105 |
terminal. |
106 | 106 |
|
107 | 107 |
git clone http://forge.cbp.ens-lyon.fr/git/nucleominer |
... | ... | |
135 | 135 |
|
136 | 136 |
* DESeq |
137 | 137 |
|
138 |
These packages can be installed by typing the following command in an |
|
139 |
R console: |
|
140 |
|
|
141 |
install.packages(c("fork", "rjson", "seqinr", "plotrix")) |
|
142 |
source("http://bioconductor.org/biocLite.R") |
|
143 |
biocLite("DESeq") |
|
144 |
|
|
145 |
Finally,by typing the git command above, you downloaded specific R |
|
146 |
packages provided with NucleoMiner2 that you now need to install: |
|
147 |
|
|
138 | 148 |
* cachecache https://forge.cbp.ens- |
139 | 149 |
lyon.fr/redmine/projects/cachecache |
140 | 150 |
|
... | ... | |
143 | 153 |
* nucleominer https://forge.cbp.ens- |
144 | 154 |
lyon.fr/redmine/projects/nucleominer |
145 | 155 |
|
146 |
The first packages could be installed by typing the following command |
|
147 |
in an R console: |
|
148 |
|
|
149 |
install.packages(c("fork", "rjson", "seqinr", "plotrix")) |
|
150 |
source("http://bioconductor.org/biocLite.R") |
|
151 |
biocLite("DESeq") |
|
152 |
|
|
153 |
The last packages are available in the git repository they could be |
|
154 |
install by typing the following command in your terminal: |
|
156 |
To do so, type the following command in your terminal: |
|
155 | 157 |
|
156 | 158 |
cd nucleominer |
157 | 159 |
R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\ |
... | ... | |
161 | 163 |
Tutorial |
162 | 164 |
******** |
163 | 165 |
|
164 |
This tutorial describes steps allowing performing quantitative
|
|
166 |
This tutorial describes steps allowing to perform quantitative
|
|
165 | 167 |
analysis of epigenetic marks on individual nucleosomes. We assume that |
166 | 168 |
files are organised according to a given hierarchy and that all |
167 | 169 |
command lines are launched from the project’s root directory. |
168 | 170 |
|
169 | 171 |
This tutorial is divided into two main parts. The first part covers |
170 | 172 |
the python script *wf.py* that aligns and converts short sequence |
171 |
reads. The second part covers the R scripts that extracts information |
|
172 |
(nucleosome position and indicators) from the dataset. |
|
173 |
reads. The second part covers the R scripts that extracts nucleosome- |
|
174 |
level information (nucleosome position and indicators) from the |
|
175 |
dataset. |
|
173 | 176 |
|
174 | 177 |
|
175 | 178 |
Experimental Dataset, Working Directory and Configuration File |
... | ... | |
179 | 182 |
Working Directory Organisation |
180 | 183 |
------------------------------ |
181 | 184 |
|
182 |
After having install NucleoMiner2 environment (Previous section), go |
|
185 |
After having installed NucleoMiner2 environment (Previous section), go
|
|
183 | 186 |
to the root working directory of the tutorial by typing the following |
184 | 187 |
command in a terminal: |
185 | 188 |
|
... | ... | |
196 | 199 |
$$$ TODO explain how organise Experimental Dataset into the *data* |
197 | 200 |
directory of the working directory. |
198 | 201 |
|
199 |
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each |
|
200 |
strain we performed Mnase-Seq and ChIP-Seq using an antibody |
|
201 |
recognizing the H3K14ac epigenetic mark. |
|
202 |
In this tutorial, we want to compare nucleosomes of 2 yeast strains: |
|
203 |
BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq |
|
204 |
using an antibody recognizing the H3K14ac epigenetic mark. Illumina |
|
205 |
sequencing was done in single-read of 50 bp long. |
|
202 | 206 |
|
203 | 207 |
The dataset is composed of 55 files organised as follows: |
204 | 208 |
|
... | ... | |
236 | 240 |
Python and R Common Configuration File |
237 | 241 |
-------------------------------------- |
238 | 242 |
|
239 |
First of all we define in one place some configuration variables that
|
|
240 |
will be launched by python and R scripts. These variables are
|
|
241 |
contained in file *configurator.py*. The execution of this python
|
|
242 |
script dumps variables into the *nucleominer_config.json* file that
|
|
243 |
will then be used by both R and python scripts.
|
|
243 |
First, we need to define useful configuration variables that will be
|
|
244 |
passed to python and R scripts. These variables are contained in file
|
|
245 |
*configurator.py*. The execution of this python script dumps variables
|
|
246 |
into the *nucleominer_config.json* file that will then be used by both
|
|
247 |
R and python scripts. |
|
244 | 248 |
|
245 |
To do this, go to the root directory of your project and run the |
|
246 |
following command: |
|
249 |
The initialization of this variables is done in the configurator.py |
|
250 |
file. If you need to adapt variable values (path, default |
|
251 |
parameters...) you need to edit this file. Then, go to the root |
|
252 |
directory of your project and run the following command to dump the |
|
253 |
configuration file: |
|
247 | 254 |
|
248 | 255 |
python src/current/configurator.py |
249 | 256 |
|
... | ... | |
251 | 258 |
Preprocessing Illumina Fastq Reads for Each Sample |
252 | 259 |
================================================== |
253 | 260 |
|
254 |
This preprocessing step consists of 4 main steps embedded in the |
|
255 |
*wf.py* script. They are described bellow. As a preamble, this script |
|
256 |
computes *samples*, *samples_mnase* and *strains* that will be used |
|
257 |
along the 4 steps. |
|
261 |
Once variables and design have been specified, the script wf.py will |
|
262 |
automatically run all the analysis. You don't need to do anything. To |
|
263 |
run the full analysis, run the following command: |
|
264 |
|
|
265 |
python src/current/wf.py |
|
266 |
|
|
267 |
The details of the steps performed by this script are explained below. |
|
268 |
This preprocessing consists of 4 steps embedded in the *wf.py* script. |
|
269 |
They are described bellow. As a preamble, this script computes |
|
270 |
*samples*, *samples_mnase* and *strains* that will be used along the 4 |
|
271 |
steps. |
|
258 | 272 |
|
259 | 273 |
wf.samples = [] |
260 | 274 |
|
... | ... | |
273 | 287 |
Creating Bowtie Index from each Reference Genome |
274 | 288 |
------------------------------------------------ |
275 | 289 |
|
276 |
For each strain, we need to create bowtie index. Bowtie index of a
|
|
277 |
strain is a tree view of the genome of this strain. It will be used by
|
|
278 |
bowtie to align reads. This step is performed by the following part of
|
|
279 |
the *wf.py* script:
|
|
290 |
For each strain, the script *wf.py* then creates bowtie index. Bowtie
|
|
291 |
index of a strain is a tree view of the genome of this strain. It will
|
|
292 |
be used by bowtie to align reads. The part of the script performing
|
|
293 |
this is the following:
|
|
280 | 294 |
|
281 | 295 |
for strain in strains: |
282 | 296 |
per_strain_stats[strain] = create_bowtie_index(strain, |
283 | 297 |
config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], |
284 | 298 |
config["BOWTIE_BUILD_BIN"]) |
285 | 299 |
|
286 |
The following table summarizes the file sizes and process durations |
|
287 |
concerning this step. |
|
300 |
As an indication, the following table summarizes the file sizes and |
|
301 |
process durations that we experienced when running this step on a |
|
302 |
Linux server***. |
|
288 | 303 |
|
289 | 304 |
+--------+------------------------+------------------------+------------------+ |
290 | 305 |
| strain | fasta genome file size | bowtie index file size | process duration | |
... | ... | |
298 | 313 |
Aligning Reads to Reference Genome |
299 | 314 |
---------------------------------- |
300 | 315 |
|
301 |
Next, we launch bowtie to align reads to the reference genome. It
|
|
302 |
produces a *.sam* file that we convert into a *.bed* file. Binaries
|
|
303 |
for *bowtie*, *samtools* and *bedtools* are wrapped using python
|
|
304 |
*subprocess* class. This step is performed by the following part of
|
|
305 |
the *wf.py* script:
|
|
316 |
Next, the *wf.py* script launches bowtie to align reads to the
|
|
317 |
reference genome. It produces a *.sam* file that is converted into a
|
|
318 |
*.bed* file. Binaries for *bowtie*, *samtools* and *bedtools* are
|
|
319 |
wrapped using python *subprocess* class. This step is performed by the
|
|
320 |
following part of the script:
|
|
306 | 321 |
|
307 | 322 |
for sample in samples: |
308 | 323 |
per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, |
... | ... | |
315 | 330 |
------------------------------------------------ |
316 | 331 |
|
317 | 332 |
TemplateFilter uses particular input formats for reads, so it is |
318 |
necessary to convert the *.bed* files. TemplateFilter expect reads as
|
|
319 |
follows: *chr*, *coord*, *strand* and *#read* where:
|
|
333 |
necessary to convert the *.bed* files. TemplateFilter expect reads in
|
|
334 |
the following format: *chr*, *coord*, *strand* and *#read* where:
|
|
320 | 335 |
|
321 | 336 |
* *chr* is the number of the chromosome; |
322 | 337 |
|
... | ... | |
330 | 345 |
|
331 | 346 |
**WARNING** for reverse strands, bowtie returns the position of the |
332 | 347 |
first nucleotide on the left hand side, whereas TemplateFilter expects |
333 |
the first one on the right hand side. This step takes this into
|
|
334 |
account by adding the read length (in our case 50) to the reverse
|
|
348 |
the first one on the right hand side. This is taken into account in
|
|
349 |
NucleoMiner2 by adding the read length (in our case 50) to the reverse
|
|
335 | 350 |
reads coordinates. |
336 | 351 |
|
337 | 352 |
This step is performed by the following part of the *wf.py* script: |
... | ... | |
341 | 356 |
config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], |
342 | 357 |
config["READ_LENGTH"],config["MAPQ_THRES"]) |
343 | 358 |
|
344 |
The following table summarises the number of reads, the involved file |
|
345 |
sizes and process durations concerning the two last steps. In our |
|
346 |
case, alignment process have been multithreaded over 3 cores. |
|
359 |
The following table summarizes the number of reads, the involved file |
|
360 |
sizes and process durations that we experienced when running the two |
|
361 |
last steps. In our case, alignment process were multithreaded over 3 |
|
362 |
cores. |
|
347 | 363 |
|
348 | 364 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
349 | 365 |
| id | Illumina reads | aligned and filtred reads | ratio | *.bed* file size | TF input file size | process duration | |
... | ... | |
378 | 394 |
Finally, for each sample we perform TemplateFilter analysis. |
379 | 395 |
|
380 | 396 |
**WARNING** TemplateFilter returns a list of nucleosomes. Each |
381 |
nucleosome is define by its center and its width. An odd width leads |
|
397 |
nucleosome is defined by its center and its width. An odd width leads
|
|
382 | 398 |
us to consider non- integer lower and upper bound. |
383 | 399 |
|
384 |
**WARNING** TemplateFilter is not designed to deal with replicates. So
|
|
400 |
**WARNING** TemplateFilter was not designed to handle replicates. So
|
|
385 | 401 |
we recommend to keep a maximum of nucleosomes and filter the aberrant |
386 | 402 |
ones afterwards using the benefits of having replicates. To do this, |
387 | 403 |
we set a low correlation threshold parameter (0.5) and a particularly |
... | ... | |
416 | 432 |
======================================================== |
417 | 433 |
|
418 | 434 |
The second part of the tutorial uses R |
419 |
(http://http://www.r-project.org). It consists of a set of R scripts
|
|
420 |
that will be sourced in an R from a console launched at the root of
|
|
421 |
your project. These scripts are: |
|
435 |
(http://http://www.r-project.org). NucleoMiner2 contains a set of R
|
|
436 |
scripts that will be sourced in R from a console launched at the root
|
|
437 |
of your project. These scripts are:
|
|
422 | 438 |
|
423 | 439 |
* headers.R |
424 | 440 |
|
... | ... | |
438 | 454 |
The Script headers.R |
439 | 455 |
-------------------- |
440 | 456 |
|
441 |
The script headers.R is included in each other scripts. It is in
|
|
457 |
The script headers.R is included in all other R scripts. It is in
|
|
442 | 458 |
charge of: |
443 | 459 |
|
444 | 460 |
* launching libraries used in the scripts |
445 | 461 |
|
446 | 462 |
* launching configuration (design, strain, marker...) |
447 | 463 |
|
448 |
* computing and caching CURs (caching means storing the
|
|
449 |
information in the computer's memory)
|
|
464 |
* computing and caching Common Uinterrupted Regions (CURs).
|
|
465 |
Caching means storing the information in the computer's memory.
|
|
450 | 466 |
|
451 | 467 |
Note that you can customize the function “translate”. This function |
452 | 468 |
allows you to use the alignments between genomes when performing |
453 |
various tasks. You may be using NucleoMiner2 to analyse data of a |
|
454 |
single strain, or of several strains. |
|
469 |
various tasks. |
|
455 | 470 |
|
456 |
* All the data corresponds to the same strain (e.g. |
|
457 |
treatment/control, or only few mutations): Then in step 1), the |
|
458 |
regions to use are entire chromosomes. Instep 2) simply use the |
|
471 |
* You may want to analyze data of a single strain (e.g. |
|
472 |
treatment/control, or only few mutations). In this case, the |
|
473 |
genome is identical across all samples and you do not need to |
|
474 |
define particular CURs (CURs are chromosomes). Simply use the |
|
459 | 475 |
default translate function which is neutral. |
460 | 476 |
|
461 |
* The data come from two or more strains: In this case, edit a
|
|
462 |
list of regions and customize the translate function which
|
|
463 |
performs the correspondence between the different genomes. How we
|
|
464 |
did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to
|
|
465 |
the Appendice "Generate .c2c Files"), then use it to produce the
|
|
466 |
list of regions and customise “translate”. |
|
477 |
* If you are analyzing data from two or more strains (as
|
|
478 |
NucleoMiner2 was designed for), then you need to translate
|
|
479 |
coordinates of one genome into the coordinates of another one.
|
|
480 |
You must do this by aligning the two genomes, which will produce
|
|
481 |
a .c2c file (see Appendice "Generate .c2c Files"). thenuse it to
|
|
482 |
produce the list of regions and customise “translate”.
|
|
467 | 483 |
|
468 |
In your R console, run the following command line: |
|
484 |
In our tutorial, we are in the second case and to perform all these |
|
485 |
steps run the following command line in your R console: |
|
469 | 486 |
|
470 | 487 |
source("src/current/headers.R") |
471 | 488 |
|
... | ... | |
474 | 491 |
------------------------- |
475 | 492 |
|
476 | 493 |
This script is in charge of extracting Maps for well-positioned and |
477 |
fuzzy nucleosomes. First of all, this script computes intra and inter- |
|
478 |
strain nucleosome maps for each CUR. This step is executed in parallel |
|
479 |
on many cores using the BoT library. Next, it collects results and |
|
480 |
produces well-positioned, fuzzy and UNR maps. |
|
481 |
|
|
482 |
The well-positioned map for BY is collected in the result directory |
|
483 |
and is called *BY_wp.tab*. It is composed of following columns: |
|
494 |
sensitive nucleosomes. First of all, this script computes intra and |
|
495 |
inter-strain matches of nucleosome maps for each CUR. This step can be |
|
496 |
executed in parallel on many cores using the BoT library. Next, it |
|
497 |
collects results and produces maps of well-positioned nucleosomes, |
|
498 |
sensitive nucleosomes and Unaligned Nucleosomal Regions . |
|
499 |
|
|
500 |
The map of well-positioned nucleosomes for BY is collected in the |
|
501 |
result directory and is called *BY_wp.tab*. It is composed of |
|
502 |
following columns: |
|
484 | 503 |
|
485 | 504 |
* chr, the number of the chromosome |
486 | 505 |
|
... | ... | |
512 | 531 |
nucleosomes. |
513 | 532 |
|
514 | 533 |
* wp_pval, for a well-positioned nucleosome, it is the p-value |
515 |
chi square test obtained with the LLR2 (*1-pchisq(2.LLR2, df=4)*)
|
|
534 |
chi square test obtained from LLR2 (*1-pchisq(2.LLR2, df=4)*)
|
|
516 | 535 |
|
517 | 536 |
* dyad_shift, for a well-positioned nucleosome, it is the shift |
518 | 537 |
between the two extreme TemplateFilter nucleosome dyad positions. |
519 | 538 |
|
520 |
The fuzzy map for BY is collected in the result directory and is
|
|
539 |
The sensitive map for BY is collected in the result directory and is
|
|
521 | 540 |
called *BY_fuzzy.tab*. It is composed of following columns: |
522 | 541 |
|
523 | 542 |
* chr, the number of the chromosome |
... | ... | |
545 | 564 |
(*1-pchisq(2.LLR3, df=2)*) |
546 | 565 |
|
547 | 566 |
* diff, the dyads shift between the positions in the two strains |
567 |
(in bp) |
|
548 | 568 |
|
549 | 569 |
The common UNR map for BY and RM strains is collected in the result |
550 | 570 |
directory and is called *BY_RM_common_unr.tab*. It is composed of the |
... | ... | |
565 | 585 |
-------------------------------- |
566 | 586 |
|
567 | 587 |
This script is used to translate common well-positioned nucleosome |
568 |
maps from a strain to another strain and stores it into a table.
|
|
588 |
positions from a strain to another strain and stores it into a table.
|
|
569 | 589 |
|
570 | 590 |
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM |
571 |
well-positioned nucleosome translated into the BY genome coordinates. |
|
591 |
well-positioned nucleosomes translated into the BY genome coordinates.
|
|
572 | 592 |
It is composed of following columns: |
573 | 593 |
|
574 | 594 |
* strain_ref, the reference genome (in which positioned are |
... | ... | |
595 | 615 |
The Script split_samples.R |
596 | 616 |
-------------------------- |
597 | 617 |
|
598 |
For memory space usage reasons, we split and compress TemplateFilter
|
|
618 |
To optimize memory space usage, we split and compress TemplateFilter
|
|
599 | 619 |
input files according to their corresponding chromosome. for example, |
600 | 620 |
*sample_1_TF.tab* will be split into : |
601 | 621 |
|
... | ... | |
701 | 721 |
|
702 | 722 |
sample_id are given in file samples.csv |
703 | 723 |
|
704 |
If you don't know which column to use, we recommend using wpunr. |
|
724 |
If you don't know which column to use for normalization, we recommend |
|
725 |
using wpunr. |
|
705 | 726 |
|
706 |
If you want the very detailed factors produced by DESeq, here are the |
|
707 |
information: |
|
727 |
Here are the details of the factors produced: |
|
708 | 728 |
|
709 | 729 |
* unr: factor computed from data of UNR regions. These regions |
710 | 730 |
are defined for every pairs of aligned genomes (e.g. BY_RM) |
... | ... | |
791 | 811 |
strain effect and the snep effect. These are the coefficients of |
792 | 812 |
the fitted generalized linear model. |
793 | 813 |
|
794 |
* pvalsGLM, the pvalue resulting of the comparison of the GLM |
|
795 |
model considering or not the interaction term marker:strain. This |
|
796 |
is the statsitcial significance of the interaction term and |
|
797 |
therefore the statistical significance of the SNEP. |
|
814 |
* pvalsGLM, the pvalue resulting from the comparison of the GLM |
|
815 |
model considering the interaction term *marker:strain* to the GLM |
|
816 |
model that does not consider it. This is the statsitcial |
|
817 |
significance of the interaction term and therefore the |
|
818 |
statistical significance of the SNEP. |
|
798 | 819 |
|
799 | 820 |
* snep_index, a boolean set to TRUE if the pvalueGLM value is |
800 | 821 |
under the threshold computed with FDR function with a rate set to |
801 | 822 |
0.0001. |
802 | 823 |
|
803 |
To execute this script, run the following command |
|
804 |
|
|
805 | 824 |
To execute this script, run the following command in your R console: |
806 | 825 |
|
807 | 826 |
source("src/current/launch_deseq.R") |
... | ... | |
826 | 845 |
APPENDICE: Generate .c2c Files |
827 | 846 |
============================== |
828 | 847 |
|
829 |
The *.c2c* files is a simple table that describes how the genome |
|
830 |
sequence can be aligned. We generate it using some NucleoMiner 1.0 |
|
831 |
scripts. |
|
848 |
The *.c2c* files is a simple table that describes how two genome |
|
849 |
sequences are aligned. This file can be generated by using scripts |
|
850 |
that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics |
|
851 |
2010) and which we provide in this release of NucleoMiner2. |
|
832 | 852 |
|
833 |
To use NucleoMiner 1.0 scripts on your UNIX/LINUX computer you need
|
|
834 |
first to install MUMmer which is a system for rapidly aligning entire
|
|
835 |
genomes, whether in complete or draft form.
|
|
853 |
To use these scripts on your UNIX/LINUX computer you need first to
|
|
854 |
install MUMmer which is designed to rapidly align entire genomes,
|
|
855 |
whether in complete or draft form. |
|
836 | 856 |
|
837 | 857 |
|
838 |
Installing the MUMmer library
|
|
839 |
-----------------------------
|
|
858 |
Installing MUMmer
|
|
859 |
----------------- |
|
840 | 860 |
|
841 | 861 |
Get the last version of MUMmer archive on your computer |
842 |
(MUMmer3.23.tar.gz distributed in the directory deps of your working
|
|
862 |
(MUMmer3.23.tar.gz is provided in the directory deps of your working
|
|
843 | 863 |
directory). Copy it in a dedicated directory. Install it locally into |
844 | 864 |
the src folder of you working directory by typing (working directory): |
845 | 865 |
|
846 |
tar -xvzf gdl-1.0.tar.gz |
|
847 |
|
|
848 |
This creates a directory called gdl-1.0. You now need to go into this |
|
849 |
directory and compile the library, by typing: |
|
866 |
tar -xvzf MUMmer3.23.tar.gz |
|
850 | 867 |
|
851 | 868 |
cd src |
852 | 869 |
tar xfvz ../deps/MUMmer3.23.tar.gz |
... | ... | |
858 | 875 |
Installing NucleoMiner 1.0 scripts |
859 | 876 |
---------------------------------- |
860 | 877 |
|
861 |
Get the nucleominer-1.0.tar.gz archive on your computer (distributed
|
|
862 |
in the directory deps of your working directory). Install it locally
|
|
863 |
into the src folder of you working directory by typing (working
|
|
864 |
directory): |
|
878 |
Get the nucleominer-1.0.tar.gz archive on your computer (this archive
|
|
879 |
is provided in the directory deps of your working directory). Install
|
|
880 |
it locally into the src folder of you working directory by typing
|
|
881 |
(working directory):
|
|
865 | 882 |
|
866 | 883 |
cd src |
867 | 884 |
tar xfvz ../deps/nucleominer-1.0.tar.gz |
868 | 885 |
cd .. |
869 | 886 |
|
870 |
This creates a directory called that contains NucleoMiner 1.0 scripts
|
|
887 |
This creates a directory that contains NucleoMiner 1.0 scripts |
|
871 | 888 |
(src/nucleominer-1.0/scripts). |
872 | 889 |
|
873 | 890 |
|
Formats disponibles : Unified diff