Statistiques
| Branche: | Révision :

root / README @ master

Historique | Voir | Annoter | Télécharger (30,72 ko)

1 935a568c Florent Chuffart
2 3c88abd0 Florent Chuffart
Welcome to *NucleoMiner2*
3 3c88abd0 Florent Chuffart
*************************
4 3c88abd0 Florent Chuffart
5 3c88abd0 Florent Chuffart
* Readme / Documentation for *NucleoMiner2*
6 3c88abd0 Florent Chuffart
7 3c88abd0 Florent Chuffart
  * License
8 3c88abd0 Florent Chuffart
9 3c88abd0 Florent Chuffart
  * Installation Instructions
10 3c88abd0 Florent Chuffart
11 3c88abd0 Florent Chuffart
* Tutorial
12 3c88abd0 Florent Chuffart
13 3c88abd0 Florent Chuffart
  * Experimental Dataset, Working Directory and Configuration File
14 3c88abd0 Florent Chuffart
15 3c88abd0 Florent Chuffart
  * Preprocessing Illumina Fastq Reads for Each Sample
16 3c88abd0 Florent Chuffart
17 3c88abd0 Florent Chuffart
  * Inferring Nucleosome Position and Extracting Read Counts
18 3c88abd0 Florent Chuffart
19 3c88abd0 Florent Chuffart
  * Results: Number of SNEPs
20 3c88abd0 Florent Chuffart
21 3c88abd0 Florent Chuffart
  * APPENDICE: Generate .c2c Files
22 3c88abd0 Florent Chuffart
23 3c88abd0 Florent Chuffart
* References
24 3c88abd0 Florent Chuffart
25 3c88abd0 Florent Chuffart
  * Python Reference
26 3c88abd0 Florent Chuffart
27 3c88abd0 Florent Chuffart
  * R Reference
28 3c88abd0 Florent Chuffart
29 935a568c Florent Chuffart
Readme / Documentation for *NucleoMiner2*
30 935a568c Florent Chuffart
*****************************************
31 935a568c Florent Chuffart
32 935a568c Florent Chuffart
*NucleoMiner2* offers Python API and R package allowing to perform
33 e5603c3f Florent Chuffart
quantitative analysis of epigenetic marks on individual nucleosomes.
34 e5603c3f Florent Chuffart
It was developed to detect natural Single-Nucleosome Epi-Polymorphisms
35 e5603c3f Florent Chuffart
(SNEP) from MNase-seq and ChIP-seq data.
36 935a568c Florent Chuffart
37 935a568c Florent Chuffart
38 935a568c Florent Chuffart
License
39 935a568c Florent Chuffart
=======
40 935a568c Florent Chuffart
41 935a568c Florent Chuffart
Copyright CNRS 2012-2013
42 935a568c Florent Chuffart
43 935a568c Florent Chuffart
* Florent CHUFFART
44 935a568c Florent Chuffart
45 935a568c Florent Chuffart
* Jean-Baptiste VEYRIERAS
46 935a568c Florent Chuffart
47 935a568c Florent Chuffart
* Gael YVERT
48 935a568c Florent Chuffart
49 935a568c Florent Chuffart
This software is a computer program which purpose is to perform
50 935a568c Florent Chuffart
quanti- tative analysis of epigenetic marks at single nucleosome
51 935a568c Florent Chuffart
resolution.
52 935a568c Florent Chuffart
53 935a568c Florent Chuffart
This software is governed by the CeCILL license under French law and
54 935a568c Florent Chuffart
abiding by the rules of distribution of free software.  You can  use,
55 935a568c Florent Chuffart
modify and/ or redistribute the software under the terms of the CeCILL
56 935a568c Florent Chuffart
license as circulated by CEA, CNRS and INRIA at the following URL
57 935a568c Florent Chuffart
"http://www.cecill.info".
58 935a568c Florent Chuffart
59 935a568c Florent Chuffart
As a counterpart to the access to the source code and  rights to copy,
60 935a568c Florent Chuffart
modify and redistribute granted by the license, users are provided
61 935a568c Florent Chuffart
only with a limited warranty  and the software's author,  the holder
62 935a568c Florent Chuffart
of the economic rights,  and the successive licensors  have only
63 935a568c Florent Chuffart
limited liability.
64 935a568c Florent Chuffart
65 e5603c3f Florent Chuffart
This software is provided with absolutely NO WARRANTY. The authors can
66 e5603c3f Florent Chuffart
not be held responsible, even partially, for any damage, loss,
67 e5603c3f Florent Chuffart
financial loss or any other undesired facts resulting from the use of
68 e5603c3f Florent Chuffart
the software. In this respect, the user's attention is drawn to the
69 e5603c3f Florent Chuffart
risks associated with loading,  using,  modifying and/or developing or
70 e5603c3f Florent Chuffart
reproducing the software by the user in light of its specific status
71 e5603c3f Florent Chuffart
of free software, that may mean  that it is complicated to manipulate,
72 e5603c3f Florent Chuffart
and  that  also therefore means  that it is reserved for developers
73 e5603c3f Florent Chuffart
and  experienced professionals having in-depth computer knowledge.
74 e5603c3f Florent Chuffart
Users are therefore encouraged to load and test the software's
75 e5603c3f Florent Chuffart
suitability as regards their requirements in conditions enabling the
76 e5603c3f Florent Chuffart
security of their systems and/or data to be ensured and,  more
77 e5603c3f Florent Chuffart
generally, to use and operate it in the same conditions as regards
78 e5603c3f Florent Chuffart
security.
79 935a568c Florent Chuffart
80 935a568c Florent Chuffart
The fact that you are presently reading this means that you have had
81 935a568c Florent Chuffart
knowledge of the CeCILL license and that you accept its terms.
82 935a568c Florent Chuffart
83 935a568c Florent Chuffart
84 935a568c Florent Chuffart
Installation Instructions
85 935a568c Florent Chuffart
=========================
86 935a568c Florent Chuffart
87 935a568c Florent Chuffart
88 935a568c Florent Chuffart
Links
89 935a568c Florent Chuffart
-----
90 935a568c Florent Chuffart
91 e5603c3f Florent Chuffart
*NucleoMiner2* home page and documentation are available here:
92 935a568c Florent Chuffart
93 e5603c3f Florent Chuffart
   * https://forge.cbp.ens-lyon.fr/redmine/projects/nucleominer
94 e5603c3f Florent Chuffart
95 e5603c3f Florent Chuffart
The Yvert lab web page is accessible here:
96 e5603c3f Florent Chuffart
97 e5603c3f Florent Chuffart
   * http://www.ens-lyon.fr/LBMC/gisv/
98 935a568c Florent Chuffart
99 935a568c Florent Chuffart
100 935a568c Florent Chuffart
Installation
101 935a568c Florent Chuffart
------------
102 935a568c Florent Chuffart
103 5badc2fd Florent Chuffart
The first installation step is to retrieve the source code of
104 3961deb6 Florent Chuffart
NucleoMiner2. You can do this by typing the following command in a
105 5badc2fd Florent Chuffart
terminal.
106 5badc2fd Florent Chuffart
107 5badc2fd Florent Chuffart
   git clone http://forge.cbp.ens-lyon.fr/git/nucleominer
108 5badc2fd Florent Chuffart
109 935a568c Florent Chuffart
110 e5603c3f Florent Chuffart
Prerequisites
111 e5603c3f Florent Chuffart
~~~~~~~~~~~~~
112 e5603c3f Florent Chuffart
113 e5603c3f Florent Chuffart
To work properly, NucleoMiner2 needs that the following free software
114 e5603c3f Florent Chuffart
are installed and made available on your system:
115 e5603c3f Florent Chuffart
116 e5603c3f Florent Chuffart
   * Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2
117 935a568c Florent Chuffart
118 e5603c3f Florent Chuffart
   * SAMtools http://samtools.sourceforge.net
119 935a568c Florent Chuffart
120 e5603c3f Florent Chuffart
   * bedtools http://code.google.com/p/bedtools/
121 935a568c Florent Chuffart
122 e5603c3f Florent Chuffart
   * TemplateFilter
123 e5603c3f Florent Chuffart
     http://compbio.cs.huji.ac.il/NucPosition/TemplateFiltering
124 935a568c Florent Chuffart
125 e5603c3f Florent Chuffart
It also requires the following R packages to be installed on your
126 e5603c3f Florent Chuffart
system:
127 dadb6a4d Florent Chuffart
128 dadb6a4d Florent Chuffart
   * fork
129 935a568c Florent Chuffart
130 935a568c Florent Chuffart
   * rjson
131 935a568c Florent Chuffart
132 935a568c Florent Chuffart
   * seqinr
133 935a568c Florent Chuffart
134 5badc2fd Florent Chuffart
   * plotrix
135 5badc2fd Florent Chuffart
136 5badc2fd Florent Chuffart
   * DESeq
137 5badc2fd Florent Chuffart
138 3961deb6 Florent Chuffart
These packages can be installed by typing the following command in an
139 3961deb6 Florent Chuffart
R console:
140 3961deb6 Florent Chuffart
141 3961deb6 Florent Chuffart
   install.packages(c("fork", "rjson", "seqinr", "plotrix"))
142 3961deb6 Florent Chuffart
   source("http://bioconductor.org/biocLite.R")
143 3961deb6 Florent Chuffart
   biocLite("DESeq")
144 3961deb6 Florent Chuffart
145 3961deb6 Florent Chuffart
Finally,by typing the git command above, you downloaded specific R
146 3961deb6 Florent Chuffart
packages provided with NucleoMiner2 that you now need to install:
147 3961deb6 Florent Chuffart
148 e5603c3f Florent Chuffart
   * cachecache https://forge.cbp.ens-
149 e5603c3f Florent Chuffart
     lyon.fr/redmine/projects/cachecache
150 935a568c Florent Chuffart
151 e5603c3f Florent Chuffart
   * bot https://forge.cbp.ens-lyon.fr/redmine/projects/bot
152 935a568c Florent Chuffart
153 e5603c3f Florent Chuffart
   * nucleominer https://forge.cbp.ens-
154 e5603c3f Florent Chuffart
     lyon.fr/redmine/projects/nucleominer
155 5badc2fd Florent Chuffart
156 3961deb6 Florent Chuffart
To do so, type the following command in your terminal:
157 5badc2fd Florent Chuffart
158 5badc2fd Florent Chuffart
   cd nucleominer
159 5badc2fd Florent Chuffart
   R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\
160 5badc2fd Florent Chuffart
       doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\
161 5badc2fd Florent Chuffart
       build/nucleominer_2.3.46.tar.gz
162 3c88abd0 Florent Chuffart
163 3c88abd0 Florent Chuffart
Tutorial
164 3c88abd0 Florent Chuffart
********
165 3c88abd0 Florent Chuffart
166 3961deb6 Florent Chuffart
This tutorial describes steps allowing to perform quantitative
167 3c88abd0 Florent Chuffart
analysis of epigenetic marks on individual nucleosomes. We assume that
168 3c88abd0 Florent Chuffart
files are organised according to a given hierarchy and that all
169 3c88abd0 Florent Chuffart
command lines are launched from the project’s root directory.
170 3c88abd0 Florent Chuffart
171 3c88abd0 Florent Chuffart
This tutorial is divided into two main parts. The first part covers
172 3c88abd0 Florent Chuffart
the python script *wf.py* that aligns and converts short sequence
173 3961deb6 Florent Chuffart
reads. The second part covers the R scripts that extracts nucleosome-
174 3961deb6 Florent Chuffart
level information (nucleosome position and indicators) from the
175 3961deb6 Florent Chuffart
dataset.
176 3c88abd0 Florent Chuffart
177 3c88abd0 Florent Chuffart
178 3c88abd0 Florent Chuffart
Experimental Dataset, Working Directory and Configuration File
179 3c88abd0 Florent Chuffart
==============================================================
180 3c88abd0 Florent Chuffart
181 3c88abd0 Florent Chuffart
182 3c88abd0 Florent Chuffart
Working Directory Organisation
183 3c88abd0 Florent Chuffart
------------------------------
184 3c88abd0 Florent Chuffart
185 3961deb6 Florent Chuffart
After having installed NucleoMiner2 environment (Previous section), go
186 3c88abd0 Florent Chuffart
to the root working directory of the tutorial by typing the following
187 3c88abd0 Florent Chuffart
command in a terminal:
188 3c88abd0 Florent Chuffart
189 3c88abd0 Florent Chuffart
   cd doc/Chuffart_NM2_workdir/
190 3c88abd0 Florent Chuffart
191 3c88abd0 Florent Chuffart
192 3c88abd0 Florent Chuffart
Retrieving Experimental Dataset
193 3c88abd0 Florent Chuffart
-------------------------------
194 3c88abd0 Florent Chuffart
195 3c88abd0 Florent Chuffart
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress
196 3c88abd0 Florent Chuffart
(http://www.ebi.ac.uk/arrayexpress/) under accession number
197 3c88abd0 Florent Chuffart
E-MTAB-2671.
198 3c88abd0 Florent Chuffart
199 3c88abd0 Florent Chuffart
$$$ TODO explain how organise Experimental Dataset into the *data*
200 3c88abd0 Florent Chuffart
directory of the working directory.
201 3c88abd0 Florent Chuffart
202 3961deb6 Florent Chuffart
In this tutorial, we want to compare nucleosomes of 2 yeast strains:
203 3961deb6 Florent Chuffart
BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq
204 3961deb6 Florent Chuffart
using an antibody recognizing the H3K14ac epigenetic mark. Illumina
205 3961deb6 Florent Chuffart
sequencing was done in single-read of 50 bp long.
206 3c88abd0 Florent Chuffart
207 3c88abd0 Florent Chuffart
The dataset is composed of 55 files organised as follows:
208 3c88abd0 Florent Chuffart
209 3c88abd0 Florent Chuffart
   * 3 replicates for BY MNase Seq
210 3c88abd0 Florent Chuffart
211 3c88abd0 Florent Chuffart
     * sample 1 (5 fastq.gz files)
212 3c88abd0 Florent Chuffart
213 3c88abd0 Florent Chuffart
     * sample 2 (5 fastq.gz files)
214 3c88abd0 Florent Chuffart
215 3c88abd0 Florent Chuffart
     * sample 3 (4 fastq.gz files)
216 3c88abd0 Florent Chuffart
217 3c88abd0 Florent Chuffart
   * 3 replicates for RM MNase Seq
218 3c88abd0 Florent Chuffart
219 3c88abd0 Florent Chuffart
     * sample 4 (4 fastq.gz files)
220 3c88abd0 Florent Chuffart
221 3c88abd0 Florent Chuffart
     * sample 5 (4 fastq.gz files)
222 3c88abd0 Florent Chuffart
223 3c88abd0 Florent Chuffart
     * sample 6 (5 fastq.gz files)
224 3c88abd0 Florent Chuffart
225 3c88abd0 Florent Chuffart
   * 3 replicates for BY ChIP Seq H3K14ac
226 3c88abd0 Florent Chuffart
227 3c88abd0 Florent Chuffart
     * sample 36 (5 fastq.gz files)
228 3c88abd0 Florent Chuffart
229 3c88abd0 Florent Chuffart
     * sample 37 (5 fastq.gz files)
230 3c88abd0 Florent Chuffart
231 3c88abd0 Florent Chuffart
     * sample 53 (9 fastq.gz files)
232 3c88abd0 Florent Chuffart
233 3c88abd0 Florent Chuffart
   * 2 replicates for RM ChIP Seq H3K14ac
234 3c88abd0 Florent Chuffart
235 3c88abd0 Florent Chuffart
     * sample 38 (5 fastq.gz files)
236 3c88abd0 Florent Chuffart
237 3c88abd0 Florent Chuffart
     * sample 39 (4 fastq.gz files)
238 3c88abd0 Florent Chuffart
239 3c88abd0 Florent Chuffart
240 3c88abd0 Florent Chuffart
Python and R Common Configuration File
241 3c88abd0 Florent Chuffart
--------------------------------------
242 3c88abd0 Florent Chuffart
243 3961deb6 Florent Chuffart
First, we need to define useful configuration variables that will be
244 3961deb6 Florent Chuffart
passed to python and R scripts. These variables are contained in file
245 3961deb6 Florent Chuffart
*configurator.py*. The execution of this python script dumps variables
246 3961deb6 Florent Chuffart
into the *nucleominer_config.json* file that will then be used by both
247 3961deb6 Florent Chuffart
R and python scripts.
248 3c88abd0 Florent Chuffart
249 3961deb6 Florent Chuffart
The initialization of this variables is done in the configurator.py
250 3961deb6 Florent Chuffart
file. If you need to adapt variable values (path, default
251 3961deb6 Florent Chuffart
parameters...) you need to edit this file. Then, go to the root
252 3961deb6 Florent Chuffart
directory of your project and run the following command to dump the
253 3961deb6 Florent Chuffart
configuration file:
254 3c88abd0 Florent Chuffart
255 3c88abd0 Florent Chuffart
   python src/current/configurator.py
256 3c88abd0 Florent Chuffart
257 3c88abd0 Florent Chuffart
258 3c88abd0 Florent Chuffart
Preprocessing Illumina Fastq Reads for Each Sample
259 3c88abd0 Florent Chuffart
==================================================
260 3c88abd0 Florent Chuffart
261 3961deb6 Florent Chuffart
Once variables and design have been specified, the script wf.py will
262 3961deb6 Florent Chuffart
automatically run all the analysis. You don't need to do anything. To
263 3961deb6 Florent Chuffart
run the full analysis, run the following command:
264 3961deb6 Florent Chuffart
265 3961deb6 Florent Chuffart
   python src/current/wf.py
266 3961deb6 Florent Chuffart
267 3961deb6 Florent Chuffart
The details of the steps performed by this script are explained below.
268 3961deb6 Florent Chuffart
This preprocessing consists of 4 steps embedded in the *wf.py* script.
269 3961deb6 Florent Chuffart
They are described bellow.  As a preamble, this script computes
270 3961deb6 Florent Chuffart
*samples*, *samples_mnase* and *strains* that will be used along the 4
271 3961deb6 Florent Chuffart
steps.
272 3c88abd0 Florent Chuffart
273 3c88abd0 Florent Chuffart
wf.samples = []
274 3c88abd0 Florent Chuffart
275 3c88abd0 Florent Chuffart
   List of samples where a sample is identified by an id (key: *id*)
276 3c88abd0 Florent Chuffart
   and a strain name (key *strain*).
277 3c88abd0 Florent Chuffart
278 3c88abd0 Florent Chuffart
wf.samples_mnase = []
279 3c88abd0 Florent Chuffart
280 3c88abd0 Florent Chuffart
   List of Mnase samples.
281 3c88abd0 Florent Chuffart
282 3c88abd0 Florent Chuffart
wf.strains = []
283 3c88abd0 Florent Chuffart
284 3c88abd0 Florent Chuffart
   List of reference strains.
285 3c88abd0 Florent Chuffart
286 3c88abd0 Florent Chuffart
287 3c88abd0 Florent Chuffart
Creating Bowtie Index from each Reference Genome
288 3c88abd0 Florent Chuffart
------------------------------------------------
289 3c88abd0 Florent Chuffart
290 3961deb6 Florent Chuffart
For each strain, the script *wf.py* then creates bowtie index. Bowtie
291 3961deb6 Florent Chuffart
index of a strain is a tree view of the genome of this strain. It will
292 3961deb6 Florent Chuffart
be used by bowtie to align reads. The part of the script performing
293 3961deb6 Florent Chuffart
this is the following:
294 3c88abd0 Florent Chuffart
295 3c88abd0 Florent Chuffart
     for strain in strains:
296 3c88abd0 Florent Chuffart
       per_strain_stats[strain] = create_bowtie_index(strain, 
297 3c88abd0 Florent Chuffart
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], 
298 3c88abd0 Florent Chuffart
         config["BOWTIE_BUILD_BIN"])
299 3c88abd0 Florent Chuffart
300 3961deb6 Florent Chuffart
As an indication, the following table summarizes the file sizes and
301 3961deb6 Florent Chuffart
process durations that we experienced when running this step on a
302 3961deb6 Florent Chuffart
Linux server***.
303 3c88abd0 Florent Chuffart
304 3c88abd0 Florent Chuffart
+--------+------------------------+------------------------+------------------+
305 3c88abd0 Florent Chuffart
| strain | fasta genome file size | bowtie index file size | process duration |
306 3c88abd0 Florent Chuffart
+========+========================+========================+==================+
307 3c88abd0 Florent Chuffart
| BY     | 12 Mo                  | 25 Mo                  | 11 s.            |
308 3c88abd0 Florent Chuffart
+--------+------------------------+------------------------+------------------+
309 3c88abd0 Florent Chuffart
| RM     | 12 Mo                  | 24 Mo                  | 9 s.             |
310 3c88abd0 Florent Chuffart
+--------+------------------------+------------------------+------------------+
311 3c88abd0 Florent Chuffart
312 3c88abd0 Florent Chuffart
313 3c88abd0 Florent Chuffart
Aligning Reads to Reference Genome
314 3c88abd0 Florent Chuffart
----------------------------------
315 3c88abd0 Florent Chuffart
316 3961deb6 Florent Chuffart
Next, the *wf.py* script launches bowtie to align reads to the
317 3961deb6 Florent Chuffart
reference genome. It produces a *.sam* file that is converted into a
318 3961deb6 Florent Chuffart
*.bed* file. Binaries for *bowtie*, *samtools* and *bedtools* are
319 3961deb6 Florent Chuffart
wrapped using python *subprocess* class. This step is performed by the
320 3961deb6 Florent Chuffart
following part of the script:
321 3c88abd0 Florent Chuffart
322 3c88abd0 Florent Chuffart
     for sample in samples:
323 3c88abd0 Florent Chuffart
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, 
324 3c88abd0 Florent Chuffart
         config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], 
325 3c88abd0 Florent Chuffart
         config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], 
326 3c88abd0 Florent Chuffart
         config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
327 3c88abd0 Florent Chuffart
328 3c88abd0 Florent Chuffart
329 3c88abd0 Florent Chuffart
Convert Aligned Reads into TemplateFilter Format
330 3c88abd0 Florent Chuffart
------------------------------------------------
331 3c88abd0 Florent Chuffart
332 3c88abd0 Florent Chuffart
TemplateFilter uses particular input formats for reads, so it is
333 3961deb6 Florent Chuffart
necessary to convert the *.bed* files. TemplateFilter expect reads in
334 3961deb6 Florent Chuffart
the following format: *chr*, *coord*, *strand* and *#read* where:
335 3c88abd0 Florent Chuffart
336 3c88abd0 Florent Chuffart
* *chr* is the number of the chromosome;
337 3c88abd0 Florent Chuffart
338 3c88abd0 Florent Chuffart
* *coord* is the coordinate of the reads;
339 3c88abd0 Florent Chuffart
340 3c88abd0 Florent Chuffart
* *strand* is *F* for forward and *R* for reverse;
341 3c88abd0 Florent Chuffart
342 3c88abd0 Florent Chuffart
* *#reads* the number of reads covering this position.
343 3c88abd0 Florent Chuffart
344 3c88abd0 Florent Chuffart
Each entry is *tab*-separated.
345 3c88abd0 Florent Chuffart
346 3c88abd0 Florent Chuffart
**WARNING** for reverse strands, bowtie returns the position of the
347 3c88abd0 Florent Chuffart
first nucleotide on the left hand side, whereas TemplateFilter expects
348 3961deb6 Florent Chuffart
the first one on the right hand side.  This is taken into account in
349 3961deb6 Florent Chuffart
NucleoMiner2 by adding the read length (in our case 50) to the reverse
350 3c88abd0 Florent Chuffart
reads coordinates.
351 3c88abd0 Florent Chuffart
352 3c88abd0 Florent Chuffart
This step is performed by the following part of the *wf.py* script:
353 3c88abd0 Florent Chuffart
354 3c88abd0 Florent Chuffart
     for sample in samples:
355 3c88abd0 Florent Chuffart
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, 
356 3c88abd0 Florent Chuffart
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], 
357 3c88abd0 Florent Chuffart
         config["READ_LENGTH"],config["MAPQ_THRES"])
358 3c88abd0 Florent Chuffart
359 3961deb6 Florent Chuffart
The following table summarizes the number of reads, the involved file
360 3961deb6 Florent Chuffart
sizes and process durations that we experienced when running the two
361 3961deb6 Florent Chuffart
last steps. In our case, alignment process were multithreaded over 3
362 3961deb6 Florent Chuffart
cores.
363 3c88abd0 Florent Chuffart
364 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
365 3c88abd0 Florent Chuffart
| id | Illumina reads | aligned and filtred reads | ratio  | *.bed* file size | TF input file size | process duration |
366 3c88abd0 Florent Chuffart
+====+================+===========================+========+==================+====================+==================+
367 3c88abd0 Florent Chuffart
| 1  | 16436138       | 10199695                  | 62,06% | 1064 Mo          | 60  Mo             | 383   s.         |
368 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
369 3c88abd0 Florent Chuffart
| 2  | 16911132       | 12512727                  | 73,99% | 1298 Mo          | 64  Mo             | 437   s.         |
370 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
371 3c88abd0 Florent Chuffart
| 3  | 15946902       | 12340426                  | 77,38% | 1280 Mo          | 65  Mo             | 423   s.         |
372 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
373 3c88abd0 Florent Chuffart
| 4  | 13765584       | 10381903                  | 75,42% | 931  Mo          | 59  Mo             | 352   s.         |
374 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
375 3c88abd0 Florent Chuffart
| 5  | 15168268       | 11502855                  | 75,83% | 1031 Mo          | 64  Mo             | 386   s.         |
376 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
377 3c88abd0 Florent Chuffart
| 6  | 18850820       | 14024905                  | 74,40% | 1254 Mo          | 69  Mo             | 482   s.         |
378 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
379 3c88abd0 Florent Chuffart
| 36 | 17715118       | 14092985                  | 79,55% | 1404 Mo          | 68  Mo             | 483   s.         |
380 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
381 3c88abd0 Florent Chuffart
| 37 | 17288466       | 7402082                   | 42,82% | 741  Mo          | 48  Mo             | 339   s.         |
382 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
383 3c88abd0 Florent Chuffart
| 38 | 16116394       | 13178457                  | 81,77% | 1101 Mo          | 63  Mo             | 420   s.         |
384 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
385 3c88abd0 Florent Chuffart
| 39 | 14241106       | 10537228                  | 73,99% | 880  Mo          | 57  Mo             | 348   s.         |
386 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
387 3c88abd0 Florent Chuffart
| 53 | 40876476       | 33780065                  | 82,64% | 3316 Mo          | 103 Mo             | 1165  s.         |
388 3c88abd0 Florent Chuffart
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
389 3c88abd0 Florent Chuffart
390 3c88abd0 Florent Chuffart
391 3c88abd0 Florent Chuffart
Run TemplateFilter on Mnase Samples
392 3c88abd0 Florent Chuffart
-----------------------------------
393 3c88abd0 Florent Chuffart
394 3c88abd0 Florent Chuffart
Finally, for each sample we perform TemplateFilter analysis.
395 3c88abd0 Florent Chuffart
396 3c88abd0 Florent Chuffart
**WARNING** TemplateFilter returns a list of nucleosomes. Each
397 3961deb6 Florent Chuffart
nucleosome is defined by its center and its width. An odd width leads
398 3c88abd0 Florent Chuffart
us to consider non- integer lower and upper bound.
399 3c88abd0 Florent Chuffart
400 3961deb6 Florent Chuffart
**WARNING** TemplateFilter was not designed to handle replicates. So
401 3c88abd0 Florent Chuffart
we recommend to keep a maximum of nucleosomes and filter the aberrant
402 3c88abd0 Florent Chuffart
ones afterwards using the benefits of having replicates. To do this,
403 3c88abd0 Florent Chuffart
we set a low correlation threshold parameter (0.5) and a particularly
404 3c88abd0 Florent Chuffart
high value of overlap (300%).
405 3c88abd0 Florent Chuffart
406 3c88abd0 Florent Chuffart
This step is performed by the following part of the *wf.py* script:
407 3c88abd0 Florent Chuffart
408 3c88abd0 Florent Chuffart
     for sample in samples_mnase:
409 3c88abd0 Florent Chuffart
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, 
410 3c88abd0 Florent Chuffart
         config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], 
411 3c88abd0 Florent Chuffart
         config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], 
412 3c88abd0 Florent Chuffart
         config["TF_MAXW"], config["TF_OL"])  
413 3c88abd0 Florent Chuffart
414 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
415 3c88abd0 Florent Chuffart
| id | strain | found nucs | nuc file size | process duration |
416 3c88abd0 Florent Chuffart
+====+========+============+===============+==================+
417 3c88abd0 Florent Chuffart
| 1  | BY     | 96214      | 68 Mo         | 1022 s.          |
418 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
419 3c88abd0 Florent Chuffart
| 2  | BY     | 91694      | 65 Mo         | 1038 s.          |
420 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
421 3c88abd0 Florent Chuffart
| 3  | BY     | 91205      | 65 Mo         | 1036 s.          |
422 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
423 3c88abd0 Florent Chuffart
| 4  | RM     | 88076      | 62 Mo         | 984 s.           |
424 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
425 3c88abd0 Florent Chuffart
| 5  | RM     | 90141      | 64 Mo         | 967 s.           |
426 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
427 3c88abd0 Florent Chuffart
| 6  | RM     | 87517      | 62 Mo         | 980 s.           |
428 3c88abd0 Florent Chuffart
+----+--------+------------+---------------+------------------+
429 3c88abd0 Florent Chuffart
430 3c88abd0 Florent Chuffart
431 3c88abd0 Florent Chuffart
Inferring Nucleosome Position and Extracting Read Counts
432 3c88abd0 Florent Chuffart
========================================================
433 3c88abd0 Florent Chuffart
434 3c88abd0 Florent Chuffart
The second part of the tutorial uses R
435 3961deb6 Florent Chuffart
(http://http://www.r-project.org). NucleoMiner2 contains a set of R
436 3961deb6 Florent Chuffart
scripts that will be sourced in R from a console launched at the root
437 3961deb6 Florent Chuffart
of your project. These scripts are:
438 3c88abd0 Florent Chuffart
439 3c88abd0 Florent Chuffart
   * headers.R
440 3c88abd0 Florent Chuffart
441 3c88abd0 Florent Chuffart
   * extract_maps.R
442 3c88abd0 Florent Chuffart
443 3c88abd0 Florent Chuffart
   * translate_common_wp.R
444 3c88abd0 Florent Chuffart
445 3c88abd0 Florent Chuffart
   * split_samples.R
446 3c88abd0 Florent Chuffart
447 3c88abd0 Florent Chuffart
   * count_reads.R
448 3c88abd0 Florent Chuffart
449 3c88abd0 Florent Chuffart
   * get_size_factors
450 3c88abd0 Florent Chuffart
451 3c88abd0 Florent Chuffart
   * launch_deseq.R
452 3c88abd0 Florent Chuffart
453 3c88abd0 Florent Chuffart
454 3c88abd0 Florent Chuffart
The Script headers.R
455 3c88abd0 Florent Chuffart
--------------------
456 3c88abd0 Florent Chuffart
457 3961deb6 Florent Chuffart
The script headers.R is included in all other R scripts. It is in
458 3c88abd0 Florent Chuffart
charge of:
459 3c88abd0 Florent Chuffart
460 3c88abd0 Florent Chuffart
   * launching libraries used in the scripts
461 3c88abd0 Florent Chuffart
462 3c88abd0 Florent Chuffart
   * launching configuration (design, strain, marker...)
463 3c88abd0 Florent Chuffart
464 3961deb6 Florent Chuffart
   * computing and caching Common Uinterrupted Regions (CURs).
465 3961deb6 Florent Chuffart
     Caching means storing the information in the computer's memory.
466 3c88abd0 Florent Chuffart
467 3c88abd0 Florent Chuffart
Note that you can customize the function “translate”. This function
468 3c88abd0 Florent Chuffart
allows you to use the alignments between genomes when performing
469 3961deb6 Florent Chuffart
various tasks.
470 3c88abd0 Florent Chuffart
471 3961deb6 Florent Chuffart
   * You may want to analyze data of a single strain (e.g.
472 3961deb6 Florent Chuffart
     treatment/control, or only few mutations). In this case, the
473 3961deb6 Florent Chuffart
     genome is identical across all samples and you do not need to
474 3961deb6 Florent Chuffart
     define particular CURs (CURs are chromosomes). Simply use the
475 3c88abd0 Florent Chuffart
     default translate function which is neutral.
476 3c88abd0 Florent Chuffart
477 3961deb6 Florent Chuffart
   * If you are analyzing data from two or more strains (as
478 3961deb6 Florent Chuffart
     NucleoMiner2 was designed for), then you need to translate
479 3961deb6 Florent Chuffart
     coordinates of one genome into the coordinates of another one.
480 3961deb6 Florent Chuffart
     You must do this by aligning the two genomes, which will produce
481 3961deb6 Florent Chuffart
     a .c2c file (see Appendice "Generate .c2c Files").  thenuse it to
482 3961deb6 Florent Chuffart
     produce the list of regions and customise “translate”.
483 3c88abd0 Florent Chuffart
484 3961deb6 Florent Chuffart
In our tutorial, we are in the second case and to perform all these
485 3961deb6 Florent Chuffart
steps run the following command line in your R console:
486 3c88abd0 Florent Chuffart
487 3c88abd0 Florent Chuffart
   source("src/current/headers.R")
488 3c88abd0 Florent Chuffart
489 3c88abd0 Florent Chuffart
490 3c88abd0 Florent Chuffart
The Script extract_maps.R
491 3c88abd0 Florent Chuffart
-------------------------
492 3c88abd0 Florent Chuffart
493 3c88abd0 Florent Chuffart
This script is in charge of extracting Maps for well-positioned and
494 3961deb6 Florent Chuffart
sensitive nucleosomes. First of all, this script computes intra and
495 3961deb6 Florent Chuffart
inter-strain matches of nucleosome maps for each CUR. This step can be
496 3961deb6 Florent Chuffart
executed in parallel on many cores using the BoT library. Next, it
497 3961deb6 Florent Chuffart
collects results and produces maps of  well-positioned nucleosomes,
498 3961deb6 Florent Chuffart
sensitive nucleosomes and Unaligned Nucleosomal Regions .
499 3961deb6 Florent Chuffart
500 3961deb6 Florent Chuffart
The map of well-positioned nucleosomes for BY is collected in the
501 3961deb6 Florent Chuffart
result directory and is called *BY_wp.tab*. It is composed of
502 3961deb6 Florent Chuffart
following columns:
503 3c88abd0 Florent Chuffart
504 3c88abd0 Florent Chuffart
   * chr, the number of the chromosome
505 3c88abd0 Florent Chuffart
506 3c88abd0 Florent Chuffart
   * lower_bound, the lower bound of the nucleosome
507 3c88abd0 Florent Chuffart
508 3c88abd0 Florent Chuffart
   * upper_bound, the upper bound of the nucleosome
509 3c88abd0 Florent Chuffart
510 3c88abd0 Florent Chuffart
   * cur_index, index of the CUR
511 3c88abd0 Florent Chuffart
512 3c88abd0 Florent Chuffart
   * index_nuc, the index of the nucleosome in the CUR
513 3c88abd0 Florent Chuffart
514 3c88abd0 Florent Chuffart
   * wp, 1 if it is a well positioned nucleosome, 0 otherwise
515 3c88abd0 Florent Chuffart
516 3c88abd0 Florent Chuffart
   * nb_reads, the number of reads that support this nucleosome
517 3c88abd0 Florent Chuffart
518 3c88abd0 Florent Chuffart
   * nb_nucs, the number of TemplateFilter nucleosome across
519 3c88abd0 Florent Chuffart
     replicates (= the number of replicates in which it is a well-
520 3c88abd0 Florent Chuffart
     positioned nucleosome)
521 3c88abd0 Florent Chuffart
522 3c88abd0 Florent Chuffart
   * llr_1, for a well-positioned nucleosome, it is the LLR1 (log-
523 3c88abd0 Florent Chuffart
     likelihood ratio) between the first and the second TemplateFilter
524 3c88abd0 Florent Chuffart
     nucleosome on the chain.
525 3c88abd0 Florent Chuffart
526 3c88abd0 Florent Chuffart
   * llr_2, for a well-positioned nucleosome, it is the LLR1 between
527 3c88abd0 Florent Chuffart
     the second and the third TemplateFilter nucleosome on the chain.
528 3c88abd0 Florent Chuffart
529 3c88abd0 Florent Chuffart
   * wp_llr, for a well-positioned nucleosome, it is the LLR2 that
530 3c88abd0 Florent Chuffart
     compares consistency of the positioning over all TemplateFilter
531 3c88abd0 Florent Chuffart
     nucleosomes.
532 3c88abd0 Florent Chuffart
533 3c88abd0 Florent Chuffart
   * wp_pval, for a well-positioned nucleosome, it is the p-value
534 3961deb6 Florent Chuffart
     chi square test obtained from LLR2 (*1-pchisq(2.LLR2, df=4)*)
535 3c88abd0 Florent Chuffart
536 3c88abd0 Florent Chuffart
   * dyad_shift, for a well-positioned nucleosome, it is the shift
537 3c88abd0 Florent Chuffart
     between the two extreme TemplateFilter nucleosome dyad positions.
538 3c88abd0 Florent Chuffart
539 3961deb6 Florent Chuffart
The sensitive map for BY is collected in the result directory and is
540 3c88abd0 Florent Chuffart
called *BY_fuzzy.tab*. It is composed of following columns:
541 3c88abd0 Florent Chuffart
542 3c88abd0 Florent Chuffart
   * chr, the number of the chromosome
543 3c88abd0 Florent Chuffart
544 3c88abd0 Florent Chuffart
   * lower_bound, the lower bound of the nucleosome
545 3c88abd0 Florent Chuffart
546 3c88abd0 Florent Chuffart
   * upper_bound, the upper bound of the nucleosome
547 3c88abd0 Florent Chuffart
548 3c88abd0 Florent Chuffart
   * cur_index, index of the CUR
549 3c88abd0 Florent Chuffart
550 3c88abd0 Florent Chuffart
The map of common well-positioned nucleosomes aligned between the BY
551 3c88abd0 Florent Chuffart
and RM strains is collected in the result directory and is called
552 3c88abd0 Florent Chuffart
*BY_RM_common_wp.tab*. It is composed of following columns:
553 3c88abd0 Florent Chuffart
554 3c88abd0 Florent Chuffart
   * cur_index, the index of the CUR
555 3c88abd0 Florent Chuffart
556 3c88abd0 Florent Chuffart
   * index_nuc_BY, the index of the BY nucleosome in the CUR
557 3c88abd0 Florent Chuffart
558 3c88abd0 Florent Chuffart
   * index_nuc_RM, the index of the RM nucleosome in the CUR
559 3c88abd0 Florent Chuffart
560 3c88abd0 Florent Chuffart
   * llr_score, , the LLR3 score that estimates conservation between
561 3c88abd0 Florent Chuffart
     the positions in BY and RM
562 3c88abd0 Florent Chuffart
563 3c88abd0 Florent Chuffart
   * common_wp_pval,  the p-value chi square test obtained from LLR3
564 3c88abd0 Florent Chuffart
     (*1-pchisq(2.LLR3, df=2)*)
565 3c88abd0 Florent Chuffart
566 3c88abd0 Florent Chuffart
   * diff, the dyads shift between the positions in the two strains
567 3961deb6 Florent Chuffart
     (in bp)
568 3c88abd0 Florent Chuffart
569 3c88abd0 Florent Chuffart
The common UNR map for BY and RM strains is collected in the result
570 3c88abd0 Florent Chuffart
directory and is called *BY_RM_common_unr.tab*. It is composed of the
571 3c88abd0 Florent Chuffart
following columns:
572 3c88abd0 Florent Chuffart
573 3c88abd0 Florent Chuffart
   * cur_index, the index of the CUR
574 3c88abd0 Florent Chuffart
575 3c88abd0 Florent Chuffart
   * index_nuc_BY, the index of the BY nucleosome in the CUR
576 3c88abd0 Florent Chuffart
577 3c88abd0 Florent Chuffart
   * index_nuc_RM,the index of the RM nucleosome in the CUR
578 3c88abd0 Florent Chuffart
579 3c88abd0 Florent Chuffart
To execute this script, run the following command in your R console:
580 3c88abd0 Florent Chuffart
581 3c88abd0 Florent Chuffart
   source("src/current/extract_maps.R")
582 3c88abd0 Florent Chuffart
583 3c88abd0 Florent Chuffart
584 3c88abd0 Florent Chuffart
The Script translate_common_wp.R
585 3c88abd0 Florent Chuffart
--------------------------------
586 3c88abd0 Florent Chuffart
587 3c88abd0 Florent Chuffart
This script is used to translate common well-positioned nucleosome
588 3961deb6 Florent Chuffart
positions from a strain to another strain and stores it into a table.
589 3c88abd0 Florent Chuffart
590 3c88abd0 Florent Chuffart
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM
591 3961deb6 Florent Chuffart
well-positioned nucleosomes translated into the BY genome coordinates.
592 3c88abd0 Florent Chuffart
It is composed of following columns:
593 3c88abd0 Florent Chuffart
594 3c88abd0 Florent Chuffart
   * strain_ref, the reference genome (in which positioned are
595 3c88abd0 Florent Chuffart
     defined)
596 3c88abd0 Florent Chuffart
597 3c88abd0 Florent Chuffart
   * begin, the translated lower bound of the nucleosome
598 3c88abd0 Florent Chuffart
599 3c88abd0 Florent Chuffart
   * end, the translated upper bound of the nucleosome
600 3c88abd0 Florent Chuffart
601 3c88abd0 Florent Chuffart
   * chr, the number of chromosomes for the reference genome (in
602 3c88abd0 Florent Chuffart
     which positioned are defined)
603 3c88abd0 Florent Chuffart
604 3c88abd0 Florent Chuffart
   * length, the length of the nucleosome (could be negative)
605 3c88abd0 Florent Chuffart
606 3c88abd0 Florent Chuffart
   * cur_index, the index of the CUR
607 3c88abd0 Florent Chuffart
608 3c88abd0 Florent Chuffart
   * index_nuc, the index of the nucleosome in the CUR
609 3c88abd0 Florent Chuffart
610 3c88abd0 Florent Chuffart
To execute this script, run the following command in your R console:
611 3c88abd0 Florent Chuffart
612 3c88abd0 Florent Chuffart
   source("src/current/translate_common_wp.R")
613 3c88abd0 Florent Chuffart
614 3c88abd0 Florent Chuffart
615 3c88abd0 Florent Chuffart
The Script split_samples.R
616 3c88abd0 Florent Chuffart
--------------------------
617 3c88abd0 Florent Chuffart
618 3961deb6 Florent Chuffart
To optimize memory space usage, we split and compress TemplateFilter
619 3c88abd0 Florent Chuffart
input files according to their corresponding  chromosome. for example,
620 3c88abd0 Florent Chuffart
*sample_1_TF.tab* will be split into :
621 3c88abd0 Florent Chuffart
622 3c88abd0 Florent Chuffart
   * sample_1_chr_1_splited_sample.tab.gz
623 3c88abd0 Florent Chuffart
624 3c88abd0 Florent Chuffart
   * sample_1_chr_2_splited_sample.tab.gz
625 3c88abd0 Florent Chuffart
626 3c88abd0 Florent Chuffart
   * ...
627 3c88abd0 Florent Chuffart
628 3c88abd0 Florent Chuffart
   * sample_1_chr_17_splited_sample.tab.gz
629 3c88abd0 Florent Chuffart
630 3c88abd0 Florent Chuffart
To execute this script, run the following command in your R console:
631 3c88abd0 Florent Chuffart
632 3c88abd0 Florent Chuffart
   source("src/current/split_samples.R")
633 3c88abd0 Florent Chuffart
634 3c88abd0 Florent Chuffart
635 3c88abd0 Florent Chuffart
The Script count_reads.R
636 3c88abd0 Florent Chuffart
------------------------
637 3c88abd0 Florent Chuffart
638 3c88abd0 Florent Chuffart
To associate a number of observations (read) to each nucleosome we run
639 3c88abd0 Florent Chuffart
the script *count_reads.R*. It produces the files
640 3c88abd0 Florent Chuffart
*BY_RM_H3K14ac_wp_and_nbreads.tab*,
641 3c88abd0 Florent Chuffart
*BY_RM_H3K14ac_unr_and_nbreads.tab*
642 3c88abd0 Florent Chuffart
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and
643 3c88abd0 Florent Chuffart
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well-
644 3c88abd0 Florent Chuffart
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned
645 3c88abd0 Florent Chuffart
nucleosomes and Mnase UNRs respectively.
646 3c88abd0 Florent Chuffart
647 3c88abd0 Florent Chuffart
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains
648 3c88abd0 Florent Chuffart
counted reads for well-positioned nucleosomes with the experimental
649 3c88abd0 Florent Chuffart
condition ChIP H3K14ac. It is composed of the following columns:
650 3c88abd0 Florent Chuffart
651 3c88abd0 Florent Chuffart
   * chr_BY, the number of the chromosome for BY
652 3c88abd0 Florent Chuffart
653 3c88abd0 Florent Chuffart
   * lower_bound_BY, the lower bound of the nucleosome for BY
654 3c88abd0 Florent Chuffart
655 3c88abd0 Florent Chuffart
   * upper_bound_BY, the upper bound of the nucleosome  for BY
656 3c88abd0 Florent Chuffart
657 3c88abd0 Florent Chuffart
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
658 3c88abd0 Florent Chuffart
659 3c88abd0 Florent Chuffart
   * chr_RM, the number of the chromosome for RM
660 3c88abd0 Florent Chuffart
661 3c88abd0 Florent Chuffart
   * lower_bound_RM, the lower bound of the nucleosome for RM
662 3c88abd0 Florent Chuffart
663 3c88abd0 Florent Chuffart
   * upper_bound_RM, the upper bound of the nucleosome  for RM
664 3c88abd0 Florent Chuffart
665 3c88abd0 Florent Chuffart
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
666 3c88abd0 Florent Chuffart
667 3c88abd0 Florent Chuffart
   * cur_index, index of the CUR
668 3c88abd0 Florent Chuffart
669 3c88abd0 Florent Chuffart
   * BY_H3K14ac_36, the number of reads for the current nucleosome
670 3c88abd0 Florent Chuffart
     for the sample 36
671 3c88abd0 Florent Chuffart
672 3c88abd0 Florent Chuffart
   * BY_H3K14ac_37, #reads for sample 37
673 3c88abd0 Florent Chuffart
674 3c88abd0 Florent Chuffart
   * BY_H3K14ac_53, #reads for sample 53
675 3c88abd0 Florent Chuffart
676 3c88abd0 Florent Chuffart
   * RM_H3K14ac_38, #reads for sample 38
677 3c88abd0 Florent Chuffart
678 3c88abd0 Florent Chuffart
   * RM_H3K14ac_39, #reads for sample 39
679 3c88abd0 Florent Chuffart
680 3c88abd0 Florent Chuffart
To execute this script, run the following command in your R console:
681 3c88abd0 Florent Chuffart
682 3c88abd0 Florent Chuffart
   source("src/current/count_reads.R")
683 3c88abd0 Florent Chuffart
684 3c88abd0 Florent Chuffart
685 3c88abd0 Florent Chuffart
The Script get_size_factors.R
686 3c88abd0 Florent Chuffart
-----------------------------
687 3c88abd0 Florent Chuffart
688 3c88abd0 Florent Chuffart
This script uses the DESeq function *estimateSizeFactors* to compute
689 3c88abd0 Florent Chuffart
the size factor of each sample. It corresponds to normalisation of
690 3c88abd0 Florent Chuffart
read counts from sample to sample, as determined by DESeq. When a
691 3c88abd0 Florent Chuffart
sample has n reads for a nucleosome or a UNR, the normalised count is
692 3c88abd0 Florent Chuffart
n/f where f is the factor contained in this file. The script dumps
693 3c88abd0 Florent Chuffart
computed size factors into the file *size_factors.tab*. This file has
694 3c88abd0 Florent Chuffart
the form:
695 3c88abd0 Florent Chuffart
696 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
697 3c88abd0 Florent Chuffart
| sample_id | wp      | unr     | wpunr   |
698 3c88abd0 Florent Chuffart
+===========+=========+=========+=========+
699 3c88abd0 Florent Chuffart
| 1         | 0.87396 | 0.88097 | 0.87584 |
700 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
701 3c88abd0 Florent Chuffart
| 2         | 1.07890 | 1.07440 | 1.07760 |
702 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
703 3c88abd0 Florent Chuffart
| 3         | 1.06400 | 1.05890 | 1.06250 |
704 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
705 3c88abd0 Florent Chuffart
| 4         | 0.85782 | 0.87948 | 0.86305 |
706 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
707 3c88abd0 Florent Chuffart
| 5         | 0.97577 | 0.96590 | 0.97307 |
708 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
709 3c88abd0 Florent Chuffart
| 6         | 1.19630 | 1.18120 | 1.19190 |
710 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
711 3c88abd0 Florent Chuffart
| 36        | 0.93318 | 0.92762 | 0.93166 |
712 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
713 3c88abd0 Florent Chuffart
| 37        | 0.48315 | 0.48453 | 0.48350 |
714 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
715 3c88abd0 Florent Chuffart
| 38        | 1.11240 | 1.11210 | 1.11230 |
716 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
717 3c88abd0 Florent Chuffart
| 39        | 0.89897 | 0.89917 | 0.89903 |
718 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
719 3c88abd0 Florent Chuffart
| 53        | 2.22650 | 2.22700 | 2.22660 |
720 3c88abd0 Florent Chuffart
+-----------+---------+---------+---------+
721 3c88abd0 Florent Chuffart
722 3c88abd0 Florent Chuffart
sample_id are given in file samples.csv
723 3c88abd0 Florent Chuffart
724 3961deb6 Florent Chuffart
If you don't know which column to use for normalization, we recommend
725 3961deb6 Florent Chuffart
using wpunr.
726 3c88abd0 Florent Chuffart
727 3961deb6 Florent Chuffart
Here are the details of the factors produced:
728 3c88abd0 Florent Chuffart
729 3c88abd0 Florent Chuffart
   * unr: factor computed from data of UNR regions. These regions
730 3c88abd0 Florent Chuffart
     are defined for every pairs of aligned genomes (e.g. BY_RM)
731 3c88abd0 Florent Chuffart
732 3c88abd0 Florent Chuffart
   * wp: same, but for well-positioned nucleosomes.
733 3c88abd0 Florent Chuffart
734 3c88abd0 Florent Chuffart
   * wpunr: both types of regions.
735 3c88abd0 Florent Chuffart
736 3c88abd0 Florent Chuffart
To execute this script, run the following command in your R console:
737 3c88abd0 Florent Chuffart
738 3c88abd0 Florent Chuffart
   source("src/current/get_size_factors.R")
739 3c88abd0 Florent Chuffart
740 3c88abd0 Florent Chuffart
741 3c88abd0 Florent Chuffart
The Script launch_deseq.R
742 3c88abd0 Florent Chuffart
-------------------------
743 3c88abd0 Florent Chuffart
744 3c88abd0 Florent Chuffart
Finally, the script *launch_deseq.R* perform statistical analysis on
745 3c88abd0 Florent Chuffart
each nucleosome using *DESeq*. It produces files:
746 3c88abd0 Florent Chuffart
747 3c88abd0 Florent Chuffart
   * results/current/BY_RM_H3K14ac_wp_snep.tab
748 3c88abd0 Florent Chuffart
749 3c88abd0 Florent Chuffart
   * results/current/BY_RM_H3K14ac_unr_snep.tab
750 3c88abd0 Florent Chuffart
751 3c88abd0 Florent Chuffart
   * results/current/BY_RM_H3K14ac_wpunr_snep.tab
752 3c88abd0 Florent Chuffart
753 3c88abd0 Florent Chuffart
   * results/current/BY_RM_H3K14ac_wp_mnase.tab
754 3c88abd0 Florent Chuffart
755 3c88abd0 Florent Chuffart
   * results/current/BY_RM_H3K14ac_unr_mnase.tab
756 3c88abd0 Florent Chuffart
757 3c88abd0 Florent Chuffart
   * results/current/BY_RM_H3K14ac_wpunr_mnase.tab
758 3c88abd0 Florent Chuffart
759 3c88abd0 Florent Chuffart
These files are organised with the following columns (see file
760 3c88abd0 Florent Chuffart
*BY_RM_H3K14ac_wp_snep.tab* for an example):
761 3c88abd0 Florent Chuffart
762 3c88abd0 Florent Chuffart
   * chr_BY, the number of the chromosome for BY
763 3c88abd0 Florent Chuffart
764 3c88abd0 Florent Chuffart
   * lower_bound_BY, the lower bound of the nucleosome for BY
765 3c88abd0 Florent Chuffart
766 3c88abd0 Florent Chuffart
   * upper_bound_BY, the upper bound of the nucleosome  for BY
767 3c88abd0 Florent Chuffart
768 3c88abd0 Florent Chuffart
   * index_nuc_BY, the index of the BY nucleosome in the CUR for BY
769 3c88abd0 Florent Chuffart
770 3c88abd0 Florent Chuffart
   * chr_RM, the number of the chromosome for RM
771 3c88abd0 Florent Chuffart
772 3c88abd0 Florent Chuffart
   * lower_bound_RM, the lower bound of the nucleosome for RM
773 3c88abd0 Florent Chuffart
774 3c88abd0 Florent Chuffart
   * upper_bound_RM, the upper bound of the nucleosome  for RM
775 3c88abd0 Florent Chuffart
776 3c88abd0 Florent Chuffart
   * index_nuc_RM,the index of the RM nucleosome in the CUR for RM
777 3c88abd0 Florent Chuffart
778 3c88abd0 Florent Chuffart
   * cur_index, index of the CUR
779 3c88abd0 Florent Chuffart
780 3c88abd0 Florent Chuffart
   * form
781 3c88abd0 Florent Chuffart
782 3c88abd0 Florent Chuffart
   * BY_Mnase_Seq_1, the number of reads for the current nucleosome
783 3c88abd0 Florent Chuffart
     for the sample 1
784 3c88abd0 Florent Chuffart
785 3c88abd0 Florent Chuffart
Next columns concern indicators for each sample:
786 3c88abd0 Florent Chuffart
787 3c88abd0 Florent Chuffart
   * BY_Mnase_Seq_2, #reads for sample 2
788 3c88abd0 Florent Chuffart
789 3c88abd0 Florent Chuffart
   * BY_Mnase_Seq_3, #reads for sample 3
790 3c88abd0 Florent Chuffart
791 3c88abd0 Florent Chuffart
   * RM_Mnase_Seq_4, #reads for sample 4
792 3c88abd0 Florent Chuffart
793 3c88abd0 Florent Chuffart
   * RM_Mnase_Seq_5, #reads for sample 5
794 3c88abd0 Florent Chuffart
795 3c88abd0 Florent Chuffart
   * RM_Mnase_Seq_6, #reads for sample 6
796 3c88abd0 Florent Chuffart
797 3c88abd0 Florent Chuffart
   * BY_H3K14ac_36, #reads for sample 36
798 3c88abd0 Florent Chuffart
799 3c88abd0 Florent Chuffart
   * BY_H3K14ac_37, #reads for sample 37
800 3c88abd0 Florent Chuffart
801 3c88abd0 Florent Chuffart
   * BY_H3K14ac_53, #reads for sample 53
802 3c88abd0 Florent Chuffart
803 3c88abd0 Florent Chuffart
   * RM_H3K14ac_38, #reads for sample 38
804 3c88abd0 Florent Chuffart
805 3c88abd0 Florent Chuffart
   * RM_H3K14ac_39, #reads for sample 39
806 3c88abd0 Florent Chuffart
807 3c88abd0 Florent Chuffart
The 5 last columns concern DESeq analysis:
808 3c88abd0 Florent Chuffart
809 3c88abd0 Florent Chuffart
   * manip[a_manip] strain[a_strain]
810 3c88abd0 Florent Chuffart
     manip[a_strain]:strain[a_strain], the manip (marker) effect, the
811 3c88abd0 Florent Chuffart
     strain effect and the snep effect. These are the coefficients of
812 3c88abd0 Florent Chuffart
     the fitted generalized linear model.
813 3c88abd0 Florent Chuffart
814 3961deb6 Florent Chuffart
   * pvalsGLM, the pvalue resulting from the comparison of the GLM
815 3961deb6 Florent Chuffart
     model considering the interaction term *marker:strain* to the GLM
816 3961deb6 Florent Chuffart
     model that does not consider it. This is the statsitcial
817 3961deb6 Florent Chuffart
     significance of the interaction term and therefore the
818 3961deb6 Florent Chuffart
     statistical significance of the SNEP.
819 3c88abd0 Florent Chuffart
820 3c88abd0 Florent Chuffart
   * snep_index, a boolean set to TRUE if the pvalueGLM value is
821 3c88abd0 Florent Chuffart
     under the threshold computed with FDR function with a rate set to
822 3c88abd0 Florent Chuffart
     0.0001.
823 3c88abd0 Florent Chuffart
824 3c88abd0 Florent Chuffart
To execute this script, run the following command in your R console:
825 3c88abd0 Florent Chuffart
826 3c88abd0 Florent Chuffart
   source("src/current/launch_deseq.R")
827 3c88abd0 Florent Chuffart
828 3c88abd0 Florent Chuffart
829 3c88abd0 Florent Chuffart
Results: Number of SNEPs
830 3c88abd0 Florent Chuffart
========================
831 3c88abd0 Florent Chuffart
832 3c88abd0 Florent Chuffart
Here are the number of computed SNEPs for each forms.
833 3c88abd0 Florent Chuffart
834 3c88abd0 Florent Chuffart
+-------+---------+-------+---------+
835 3c88abd0 Florent Chuffart
| form  | strains | #nucs | H3K14ac |
836 3c88abd0 Florent Chuffart
+=======+=========+=======+=========+
837 3c88abd0 Florent Chuffart
| wp    | BY-RM   | 30464 | 3549    |
838 3c88abd0 Florent Chuffart
+-------+---------+-------+---------+
839 3c88abd0 Florent Chuffart
| unr   | BY-RM   | 9497  | 1559    |
840 3c88abd0 Florent Chuffart
+-------+---------+-------+---------+
841 3c88abd0 Florent Chuffart
| wpunr | BY-RM   | 39961 | 5240    |
842 3c88abd0 Florent Chuffart
+-------+---------+-------+---------+
843 3c88abd0 Florent Chuffart
844 3c88abd0 Florent Chuffart
845 3c88abd0 Florent Chuffart
APPENDICE: Generate .c2c Files
846 3c88abd0 Florent Chuffart
==============================
847 3c88abd0 Florent Chuffart
848 3961deb6 Florent Chuffart
The *.c2c* files is a simple table that describes how two genome
849 3961deb6 Florent Chuffart
sequences are aligned. This file can be generated by using scripts
850 3961deb6 Florent Chuffart
that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics
851 3961deb6 Florent Chuffart
2010) and which we provide in this release of NucleoMiner2.
852 3c88abd0 Florent Chuffart
853 3961deb6 Florent Chuffart
To use these scripts on your UNIX/LINUX computer you need first to
854 3961deb6 Florent Chuffart
install MUMmer which is designed to rapidly align entire genomes,
855 3961deb6 Florent Chuffart
whether in complete or draft form.
856 3c88abd0 Florent Chuffart
857 3c88abd0 Florent Chuffart
858 3961deb6 Florent Chuffart
Installing MUMmer
859 3961deb6 Florent Chuffart
-----------------
860 3c88abd0 Florent Chuffart
861 c25275e2 Florent Chuffart
Get the last version of MUMmer archive on your computer
862 3961deb6 Florent Chuffart
(MUMmer3.23.tar.gz is provided in the directory deps of your working
863 c25275e2 Florent Chuffart
directory). Copy it in a dedicated directory. Install it locally into
864 c25275e2 Florent Chuffart
the src folder of you working directory by typing (working directory):
865 3c88abd0 Florent Chuffart
866 3961deb6 Florent Chuffart
tar -xvzf MUMmer3.23.tar.gz
867 3c88abd0 Florent Chuffart
868 c25275e2 Florent Chuffart
   cd src
869 c25275e2 Florent Chuffart
   tar xfvz ../deps/MUMmer3.23.tar.gz
870 c25275e2 Florent Chuffart
   cd MUMmer3.23
871 c25275e2 Florent Chuffart
   make check
872 c25275e2 Florent Chuffart
   make install
873 3c88abd0 Florent Chuffart
874 3c88abd0 Florent Chuffart
875 c25275e2 Florent Chuffart
Installing NucleoMiner 1.0 scripts
876 c25275e2 Florent Chuffart
----------------------------------
877 3c88abd0 Florent Chuffart
878 3961deb6 Florent Chuffart
Get the nucleominer-1.0.tar.gz archive on your computer (this archive
879 3961deb6 Florent Chuffart
is provided in the directory deps of your working directory). Install
880 3961deb6 Florent Chuffart
it locally into the src folder of you working directory by typing
881 3961deb6 Florent Chuffart
(working directory):
882 3c88abd0 Florent Chuffart
883 c25275e2 Florent Chuffart
   cd src
884 c25275e2 Florent Chuffart
   tar xfvz ../deps/nucleominer-1.0.tar.gz
885 c25275e2 Florent Chuffart
   cd ..
886 3c88abd0 Florent Chuffart
887 3961deb6 Florent Chuffart
This creates a directory that contains  NucleoMiner 1.0 scripts
888 c25275e2 Florent Chuffart
(src/nucleominer-1.0/scripts).
889 3c88abd0 Florent Chuffart
890 3c88abd0 Florent Chuffart
891 3c88abd0 Florent Chuffart
Generate .c2c Files
892 3c88abd0 Florent Chuffart
-------------------
893 3c88abd0 Florent Chuffart
894 3c88abd0 Florent Chuffart
To generate .c2c files you need to type the following command in a
895 3c88abd0 Florent Chuffart
terminal:
896 3c88abd0 Florent Chuffart
897 c25275e2 Florent Chuffart
   export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts
898 c25275e2 Florent Chuffart
   export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/
899 c25275e2 Florent Chuffart
   NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
900 c25275e2 Florent Chuffart
     data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
901 c25275e2 Florent Chuffart
     data/byxrm 2>NMgxcomp.log
902 3c88abd0 Florent Chuffart
903 c25275e2 Florent Chuffart
After execution, the directory *data* will hold the .c2c files.