Statistiques
| Branche: | Révision :

root / doc / R_tutorial / NucleoMiner2.Rnw @ 780f632a

Historique | Voir | Annoter | Télécharger (12,5 ko)

1
% Generated by Sphinx.
2
% \def\sphinxdocclass{report}
3
\documentclass[letterpaper,10pt,english]{article}
4
\usepackage[utf8]{inputenc}
5
\DeclareUnicodeCharacter{00A0}{\nobreakspace}
6
\usepackage{cmap}
7
\usepackage[T1]{fontenc}
8
\usepackage{babel}
9
\usepackage{times}
10
\usepackage[Bjarne]{fncychap}
11
\usepackage{longtable}
12
\usepackage{multirow}
13
\usepackage{url}
14

    
15
\addto\captionsenglish{\renewcommand{\figurename}{Fig. }}
16
\addto\captionsenglish{\renewcommand{\tablename}{Table }}
17

    
18

    
19

    
20
\title{NucleoMiner2 Documentation}
21
\date{\today}
22
\author{Florent CHUFFART, Jean-Baptiste VEYRIERAS, Gael YVERT}
23
\newcommand{\sphinxlogo}{}
24
\makeindex
25

    
26
\def\PYGZus{\char`\_}
27

    
28

    
29
\begin{document}
30

    
31
\maketitle
32
\tableofcontents
33
\textbf{}\label{index::doc}
34

    
35

    
36
\section{Inferring Nucleosome Position and Extracting Read Counts}
37
\label{tuto:inferring-nucleosome-position-and-extracting-read-counts}
38
The second part of the tutorial uses R (\url{http://http://www.r-project.org}{http://http://www.r-project.org}).
39
It consists in 5 main steps: 
40
\begin{itemize}
41

    
42
\item {} 
43
Seting Up Configuration
44

    
45
\item {} 
46
Computing CURs
47

    
48
\item {} 
49
Extracting Nucleosome Maps from TemplateFilter Outputs
50

    
51
\item {} 
52
Building Count Tables from TemplateFilter Inputs
53

    
54
\item {} 
55
Analysing Count Tables using DESeq
56

    
57
\end{itemize}
58

    
59

    
60

    
61

    
62
\subsection{Seting Up Configuration}
63

    
64
This first step is in charge of:
65
\begin{itemize}
66
\item {} 
67
launching libraries used in the scripts
68

    
69
\item {} 
70
launching configuration (design, strain, marker...)
71

    
72
{\footnotesize
73
<<echo=TRUE, results=hide>>=
74
library(nucleominer)
75
@
76
<<echo=TRUE, results=verbatim>>=
77
package.version("nucleominer")
78

    
79
library(rjson)
80
json_conf_file = "nucleominer_config.json"
81
config = fromJSON(paste(readLines(json_conf_file), collapse=""))
82

    
83
all_samples = read.csv(config$CSV_SAMPLE_FILE, sep=";", head=TRUE, stringsAsFactors=FALSE)
84
head(all_samples)
85

    
86
markers = unique(all_samples$marker)
87
print(markers)
88

    
89
strains = unique(all_samples$strain)
90
print(strains)
91

    
92
combis = combn(strains, 2, simplify=FALSE)
93
print(combis)
94
@
95
}
96

    
97

    
98
\subsection{Computing CURs}
99

    
100
This second step is in charge of computing and caching Common Uinterrupted Regions (CURs). Caching means storing the information in the computer's memory.
101

    
102
\end{itemize}
103

    
104
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks.
105
\begin{itemize}
106
\item {} 
107
You may want to analyze data of a single strain (e.g. treatment/control, or only few mutations). In this case, the genome is identical across all samples and you do not need to define particular CURs (CURs are chromosomes). Simply use the default translate function which is neutral.
108

    
109
\item {} 
110
If you are analyzing data from two or more strains (as NucleoMiner2 was designed for), then you need to translate coordinates of one genome into the coordinates of another one. You must do this by aligning the two genomes, which will produce a .c2c file (see Appendice ``Generate .c2c Files'').  thenuse it to produce the list of regions and customise “translate”.
111

    
112
\end{itemize}
113

    
114
In our tutorial, we are in the second case and to perform all these steps run the following command line in your R console:
115

    
116

    
117
{\footnotesize
118
<<echo=TRUE, results=hide>>=
119
curs = compute_curs(diff_allowed=30, min_cur_width=4000, combis, config)
120
@
121
<<echo=TRUE, results=verbatim>>=
122
head(curs)
123
# Reduce first CUR length to accelerate demo, TODO: remove it before submission.
124
curs [1,3] = curs[1,3] - 110000
125
curs [1,5] = curs[1,5] - 110000
126
@
127
}
128

    
129

    
130
\subsection{Extracting Nucleosome Maps from TemplateFilter Outputs}
131

    
132
This third step is in charge of extracting Maps for well-positioned and sensitive nucleosomes. First of all, this script computes intra and inter-strain matches of nucleosome maps for each CUR. This step can be executed in parallel on many cores using the BoT library. Next, it collects results and produces maps of  well-positioned nucleosomes, sensitive nucleosomes and Unaligned Nucleosomal Regions .
133

    
134
The map of well-positioned nucleosomes for BY is collected in the result directory and is called \emph{BY\_wp.tab}. It is composed of following columns:
135
\begin{itemize}
136
\item {} 
137
chr, the number of the chromosome
138

    
139
\item {} 
140
lower\_bound, the lower bound of the nucleosome
141

    
142
\item {} 
143
upper\_bound, the upper bound of the nucleosome
144

    
145
\item {} 
146
cur\_index, index of the CUR
147

    
148
\item {} 
149
index\_nuc, the index of the nucleosome in the CUR
150

    
151
\item {} 
152
wp, 1 if it is a well positioned nucleosome, 0 otherwise
153

    
154
\item {} 
155
nb\_reads, the number of reads that support this nucleosome
156

    
157
\item {} 
158
nb\_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates in which it is a well-positioned nucleosome)
159

    
160
\item {} 
161
llr\_1, for a well-positioned nucleosome, it is the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain.
162

    
163
\item {} 
164
llr\_2, for a well-positioned nucleosome, it is the LLR1 between the second and the third TemplateFilter nucleosome on the chain.
165

    
166
\item {} 
167
wp\_llr, for a well-positioned nucleosome, it is the LLR2 that compares consistency of the positioning over all TemplateFilter nucleosomes.
168

    
169
\item {} 
170
wp\_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained from LLR2 (\emph{1-pchisq(2.LLR2, df=4)})
171

    
172
\item {} 
173
dyad\_shift, for a well-positioned nucleosome, it is the shift between the two extreme TemplateFilter nucleosome dyad positions.
174

    
175
\end{itemize}
176

    
177
The sensitive map for BY is collected in the result directory and is called \emph{BY\_fuzzy.tab}. It is composed of following columns:
178
\begin{itemize}
179
\item {} 
180
chr, the number of the chromosome
181

    
182
\item {} 
183
lower\_bound, the lower bound of the nucleosome
184

    
185
\item {} 
186
upper\_bound, the upper bound of the nucleosome
187

    
188
\item {} 
189
cur\_index, index of the CUR
190

    
191
\end{itemize}
192

    
193
The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called \emph{BY\_RM\_common\_wp.tab}. It is composed of following columns:
194
\begin{itemize}
195
\item {} 
196
cur\_index, the index of the CUR
197

    
198
\item {} 
199
index\_nuc\_BY, the index of the BY nucleosome in the CUR
200

    
201
\item {} 
202
index\_nuc\_RM, the index of the RM nucleosome in the CUR
203

    
204
\item {} 
205
llr\_score, , the LLR3 score that estimates conservation between the positions in BY and RM
206

    
207
\item {} 
208
common\_wp\_pval,  the p-value chi square test obtained from LLR3 (\emph{1-pchisq(2.LLR3, df=2)})
209

    
210
\item {} 
211
diff, the dyads shift between the positions in the two strains (in bp)
212

    
213
\end{itemize}
214

    
215
The common UNR map for BY and RM strains is collected in the result directory and is called \emph{BY\_RM\_common\_unr.tab}. It is composed of the following columns:
216
\begin{itemize}
217
\item {} 
218
cur\_index, the index of the CUR
219

    
220
\item {} 
221
index\_nuc\_BY, the index of the BY nucleosome in the CUR
222

    
223
\item {} 
224
index\_nuc\_RM,the index of the RM nucleosome in the CUR
225

    
226
\end{itemize}
227

    
228
To execute this script, run the following command in your R console:
229

    
230

    
231
{\footnotesize
232
<<echo=TRUE, results=hide>>=
233
build_maps(strains, combis, all_samples, curs, config)
234
@
235
<<echo=TRUE, results=verbatim>>=
236
BY_wp = read.table("outputs/BY_wp.tab", header=TRUE)
237
head(BY_wp)
238
RM_wp = read.table("outputs/RM_wp.tab", header=TRUE)
239
head(RM_wp)
240
BY_RM_common_wp = read.table("outputs/BY_RM_common_wp.tab", header=TRUE)
241
head(BY_RM_common_wp)
242
@
243
}
244

    
245
\subsection{Building Count Tables from TemplateFilter Inputs
246
}
247
\label{tuto:the-script-count-reads-r}
248
To associate a number of observations (read) to each nucleosome we run the script \emph{count\_reads.R}. It produces the files \emph{BY\_RM\_H3K14ac\_wp\_and\_nbreads.tab}, \emph{BY\_RM\_H3K14ac\_unr\_and\_nbreads.tab} \emph{BY\_RM\_Mnase\_Seq\_wp\_and\_nbreads.tab} and \emph{BY\_RM\_Mnase\_Seq\_unr\_and\_nbreads.tab}
249
for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned nucleosomes and Mnase UNRs respectively.
250

    
251
For example, the file \emph{BY\_RM\_H3K14ac\_unr\_and\_nbreads.tab} contains counted reads for well-positioned nucleosomes with the experimental condition ChIP H3K14ac. It is composed of the following columns:
252
\begin{itemize}
253
\item {} 
254
chr\_BY, the number of the chromosome for BY
255

    
256
\item {} 
257
lower\_bound\_BY, the lower bound of the nucleosome for BY
258

    
259
\item {} 
260
upper\_bound\_BY, the upper bound of the nucleosome  for BY
261

    
262
\item {} 
263
index\_nuc\_BY, the index of the BY nucleosome in the CUR for BY
264

    
265
\item {} 
266
chr\_RM, the number of the chromosome for RM
267

    
268
\item {} 
269
lower\_bound\_RM, the lower bound of the nucleosome for RM
270

    
271
\item {} 
272
upper\_bound\_RM, the upper bound of the nucleosome  for RM
273

    
274
\item {} 
275
index\_nuc\_RM,the index of the RM nucleosome in the CUR for RM
276

    
277
\item {} 
278
cur\_index, index of the CUR
279

    
280
\item {} 
281
BY\_H3K14ac\_36, the number of reads for the current nucleosome for the sample 36
282

    
283
\item {} 
284
BY\_H3K14ac\_37, \#reads for sample 37
285

    
286
\item {} 
287
BY\_H3K14ac\_53, \#reads for sample 53
288

    
289
\item {} 
290
RM\_H3K14ac\_38, \#reads for sample 38
291

    
292
\item {} 
293
RM\_H3K14ac\_39, \#reads for sample 39
294

    
295
\end{itemize}
296

    
297
To execute this script, run the following command in your R console:
298

    
299

    
300
{\footnotesize
301
<<echo=TRUE, results=hide>>=
302
for (form in c("wp", "unr")) {
303
	for (marker in markers) {
304
    build_count_table(marker, combis[[1]], form, curs, all_samples, config)
305
  }
306
}			
307
@
308
<<echo=TRUE, results=verbatim>>=
309
a_count_table = read.table("outputs/BY_RM_Mnase_Seq_wp_and_nbreads.tab", header=TRUE)
310
head(a_count_table)
311
@
312
}
313

    
314
\subsection{Analysing Count Tables using DESeq}
315

    
316
Finally, the script \emph{launch\_deseq.R} perform statistical analysis on each nucleosome using \emph{DESeq}. It produces files:
317
\begin{itemize}
318
\item {} 
319
results/current/BY\_RM\_H3K14ac\_wp\_snep.tab
320

    
321
\item {} 
322
results/current/BY\_RM\_H3K14ac\_unr\_snep.tab
323

    
324
\item {} 
325
results/current/BY\_RM\_H3K14ac\_wpunr\_snep.tab
326

    
327
\end{itemize}
328

    
329
These files are organised with the following columns (see file \emph{BY\_RM\_H3K14ac\_wp\_snep.tab} for an example):
330
\begin{itemize}
331
\item {} 
332
chr\_BY, the number of the chromosome for BY
333

    
334
\item {} 
335
lower\_bound\_BY, the lower bound of the nucleosome for BY
336

    
337
\item {} 
338
upper\_bound\_BY, the upper bound of the nucleosome  for BY
339

    
340
\item {} 
341
index\_nuc\_BY, the index of the BY nucleosome in the CUR for BY
342

    
343
\item {} 
344
chr\_RM, the number of the chromosome for RM
345

    
346
\item {} 
347
lower\_bound\_RM, the lower bound of the nucleosome for RM
348

    
349
\item {} 
350
upper\_bound\_RM, the upper bound of the nucleosome  for RM
351

    
352
\item {} 
353
index\_nuc\_RM,the index of the RM nucleosome in the CUR for RM
354

    
355
\item {} 
356
cur\_index, index of the CUR
357

    
358
\item {} 
359
form
360

    
361
\item {} 
362
BY\_Mnase\_Seq\_1, the number of reads for the current nucleosome for the sample 1
363

    
364
\end{itemize}
365

    
366
Next columns concern indicators for each sample:
367
\begin{itemize}
368
\item {} 
369
BY\_Mnase\_Seq\_2, \#reads for sample 2
370

    
371
\item {} 
372
BY\_Mnase\_Seq\_3, \#reads for sample 3
373

    
374
\item {} 
375
RM\_Mnase\_Seq\_4, \#reads for sample 4
376

    
377
\item {} 
378
RM\_Mnase\_Seq\_5, \#reads for sample 5
379

    
380
\item {} 
381
RM\_Mnase\_Seq\_6, \#reads for sample 6
382

    
383
\item {} 
384
BY\_H3K14ac\_36, \#reads for sample 36
385

    
386
\item {} 
387
BY\_H3K14ac\_37, \#reads for sample 37
388

    
389
\item {} 
390
BY\_H3K14ac\_53, \#reads for sample 53
391

    
392
\item {} 
393
RM\_H3K14ac\_38, \#reads for sample 38
394

    
395
\item {} 
396
RM\_H3K14ac\_39, \#reads for sample 39
397

    
398
\end{itemize}
399

    
400
The 7 last columns concern DESeq analysis:
401
\begin{itemize}
402
\item {} 
403
manip{[}a\_manip{]} strain{[}a\_strain{]} manip{[}a\_strain{]}:strain{[}a\_strain{]}, the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model.
404

    
405
\item {} 
406
pvalsGLM, the pvalue resulting from the comparison of the GLM model considering the interaction term \emph{marker:strain} to the GLM model that does not consider it. This is the statsitcial significance of the interaction term and therefore the statistical significance of the SNEP.
407

    
408
\item {} 
409
snep\_index, a boolean set to TRUE if the pvalueGLM value is under the threshold computed with FDR function with a rate set to 0.0001.
410

    
411
\item {} 
412
mnase\_l2fc, this is the coefficients of the fitted generalized linear model based on MNASE samples only.
413

    
414
\item {} 
415
mnase\_l2fc\_pval, the pvalue resulting from the analysis of the simple model based on MNASE samples only. This is the statsitcial significance of the \emph{strain} term and therefore the statistical significance of the strain effect.
416

    
417
\end{itemize}
418

    
419
To execute this script, run the following command in your R console:
420

    
421

    
422
{\footnotesize
423
<<echo=TRUE, results=hide>>=
424
for (form in c("wp", "unr", "wpunr")) {
425
  sneps = analyse_count_table(markers[2], combis[[1]], form, all_samples, config=config)
426
}
427
@
428
<<echo=TRUE, results=verbatim>>=
429
a_sneps_table = read.table("outputs/BY_RM_H3K14ac_wp_snep.tab", header=TRUE)
430
head(a_sneps_table)
431

    
432
@
433
}
434

    
435
Here are the number of computed SNEPs for each forms.
436

    
437

    
438

    
439

    
440
\end{document}