root / doc / R_tutorial / NucleoMiner2.Rnw @ 780f632a
Historique | Voir | Annoter | Télécharger (12,5 ko)
1 |
% Generated by Sphinx. |
---|---|
2 |
% \def\sphinxdocclass{report} |
3 |
\documentclass[letterpaper,10pt,english]{article} |
4 |
\usepackage[utf8]{inputenc} |
5 |
\DeclareUnicodeCharacter{00A0}{\nobreakspace} |
6 |
\usepackage{cmap} |
7 |
\usepackage[T1]{fontenc} |
8 |
\usepackage{babel} |
9 |
\usepackage{times} |
10 |
\usepackage[Bjarne]{fncychap} |
11 |
\usepackage{longtable} |
12 |
\usepackage{multirow} |
13 |
\usepackage{url} |
14 |
|
15 |
\addto\captionsenglish{\renewcommand{\figurename}{Fig. }} |
16 |
\addto\captionsenglish{\renewcommand{\tablename}{Table }} |
17 |
|
18 |
|
19 |
|
20 |
\title{NucleoMiner2 Documentation} |
21 |
\date{\today} |
22 |
\author{Florent CHUFFART, Jean-Baptiste VEYRIERAS, Gael YVERT} |
23 |
\newcommand{\sphinxlogo}{} |
24 |
\makeindex |
25 |
|
26 |
\def\PYGZus{\char`\_} |
27 |
|
28 |
|
29 |
\begin{document} |
30 |
|
31 |
\maketitle |
32 |
\tableofcontents |
33 |
\textbf{}\label{index::doc} |
34 |
|
35 |
|
36 |
\section{Inferring Nucleosome Position and Extracting Read Counts} |
37 |
\label{tuto:inferring-nucleosome-position-and-extracting-read-counts} |
38 |
The second part of the tutorial uses R (\url{http://http://www.r-project.org}{http://http://www.r-project.org}). |
39 |
It consists in 5 main steps: |
40 |
\begin{itemize} |
41 |
|
42 |
\item {} |
43 |
Seting Up Configuration |
44 |
|
45 |
\item {} |
46 |
Computing CURs |
47 |
|
48 |
\item {} |
49 |
Extracting Nucleosome Maps from TemplateFilter Outputs |
50 |
|
51 |
\item {} |
52 |
Building Count Tables from TemplateFilter Inputs |
53 |
|
54 |
\item {} |
55 |
Analysing Count Tables using DESeq |
56 |
|
57 |
\end{itemize} |
58 |
|
59 |
|
60 |
|
61 |
|
62 |
\subsection{Seting Up Configuration} |
63 |
|
64 |
This first step is in charge of: |
65 |
\begin{itemize} |
66 |
\item {} |
67 |
launching libraries used in the scripts |
68 |
|
69 |
\item {} |
70 |
launching configuration (design, strain, marker...) |
71 |
|
72 |
{\footnotesize |
73 |
<<echo=TRUE, results=hide>>= |
74 |
library(nucleominer) |
75 |
@ |
76 |
<<echo=TRUE, results=verbatim>>= |
77 |
package.version("nucleominer") |
78 |
|
79 |
library(rjson) |
80 |
json_conf_file = "nucleominer_config.json" |
81 |
config = fromJSON(paste(readLines(json_conf_file), collapse="")) |
82 |
|
83 |
all_samples = read.csv(config$CSV_SAMPLE_FILE, sep=";", head=TRUE, stringsAsFactors=FALSE) |
84 |
head(all_samples) |
85 |
|
86 |
markers = unique(all_samples$marker) |
87 |
print(markers) |
88 |
|
89 |
strains = unique(all_samples$strain) |
90 |
print(strains) |
91 |
|
92 |
combis = combn(strains, 2, simplify=FALSE) |
93 |
print(combis) |
94 |
@ |
95 |
} |
96 |
|
97 |
|
98 |
\subsection{Computing CURs} |
99 |
|
100 |
This second step is in charge of computing and caching Common Uinterrupted Regions (CURs). Caching means storing the information in the computer's memory. |
101 |
|
102 |
\end{itemize} |
103 |
|
104 |
Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks. |
105 |
\begin{itemize} |
106 |
\item {} |
107 |
You may want to analyze data of a single strain (e.g. treatment/control, or only few mutations). In this case, the genome is identical across all samples and you do not need to define particular CURs (CURs are chromosomes). Simply use the default translate function which is neutral. |
108 |
|
109 |
\item {} |
110 |
If you are analyzing data from two or more strains (as NucleoMiner2 was designed for), then you need to translate coordinates of one genome into the coordinates of another one. You must do this by aligning the two genomes, which will produce a .c2c file (see Appendice ``Generate .c2c Files''). thenuse it to produce the list of regions and customise “translate”. |
111 |
|
112 |
\end{itemize} |
113 |
|
114 |
In our tutorial, we are in the second case and to perform all these steps run the following command line in your R console: |
115 |
|
116 |
|
117 |
{\footnotesize |
118 |
<<echo=TRUE, results=hide>>= |
119 |
curs = compute_curs(diff_allowed=30, min_cur_width=4000, combis, config) |
120 |
@ |
121 |
<<echo=TRUE, results=verbatim>>= |
122 |
head(curs) |
123 |
# Reduce first CUR length to accelerate demo, TODO: remove it before submission. |
124 |
curs [1,3] = curs[1,3] - 110000 |
125 |
curs [1,5] = curs[1,5] - 110000 |
126 |
@ |
127 |
} |
128 |
|
129 |
|
130 |
\subsection{Extracting Nucleosome Maps from TemplateFilter Outputs} |
131 |
|
132 |
This third step is in charge of extracting Maps for well-positioned and sensitive nucleosomes. First of all, this script computes intra and inter-strain matches of nucleosome maps for each CUR. This step can be executed in parallel on many cores using the BoT library. Next, it collects results and produces maps of well-positioned nucleosomes, sensitive nucleosomes and Unaligned Nucleosomal Regions . |
133 |
|
134 |
The map of well-positioned nucleosomes for BY is collected in the result directory and is called \emph{BY\_wp.tab}. It is composed of following columns: |
135 |
\begin{itemize} |
136 |
\item {} |
137 |
chr, the number of the chromosome |
138 |
|
139 |
\item {} |
140 |
lower\_bound, the lower bound of the nucleosome |
141 |
|
142 |
\item {} |
143 |
upper\_bound, the upper bound of the nucleosome |
144 |
|
145 |
\item {} |
146 |
cur\_index, index of the CUR |
147 |
|
148 |
\item {} |
149 |
index\_nuc, the index of the nucleosome in the CUR |
150 |
|
151 |
\item {} |
152 |
wp, 1 if it is a well positioned nucleosome, 0 otherwise |
153 |
|
154 |
\item {} |
155 |
nb\_reads, the number of reads that support this nucleosome |
156 |
|
157 |
\item {} |
158 |
nb\_nucs, the number of TemplateFilter nucleosome across replicates (= the number of replicates in which it is a well-positioned nucleosome) |
159 |
|
160 |
\item {} |
161 |
llr\_1, for a well-positioned nucleosome, it is the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain. |
162 |
|
163 |
\item {} |
164 |
llr\_2, for a well-positioned nucleosome, it is the LLR1 between the second and the third TemplateFilter nucleosome on the chain. |
165 |
|
166 |
\item {} |
167 |
wp\_llr, for a well-positioned nucleosome, it is the LLR2 that compares consistency of the positioning over all TemplateFilter nucleosomes. |
168 |
|
169 |
\item {} |
170 |
wp\_pval, for a well-positioned nucleosome, it is the p-value chi square test obtained from LLR2 (\emph{1-pchisq(2.LLR2, df=4)}) |
171 |
|
172 |
\item {} |
173 |
dyad\_shift, for a well-positioned nucleosome, it is the shift between the two extreme TemplateFilter nucleosome dyad positions. |
174 |
|
175 |
\end{itemize} |
176 |
|
177 |
The sensitive map for BY is collected in the result directory and is called \emph{BY\_fuzzy.tab}. It is composed of following columns: |
178 |
\begin{itemize} |
179 |
\item {} |
180 |
chr, the number of the chromosome |
181 |
|
182 |
\item {} |
183 |
lower\_bound, the lower bound of the nucleosome |
184 |
|
185 |
\item {} |
186 |
upper\_bound, the upper bound of the nucleosome |
187 |
|
188 |
\item {} |
189 |
cur\_index, index of the CUR |
190 |
|
191 |
\end{itemize} |
192 |
|
193 |
The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called \emph{BY\_RM\_common\_wp.tab}. It is composed of following columns: |
194 |
\begin{itemize} |
195 |
\item {} |
196 |
cur\_index, the index of the CUR |
197 |
|
198 |
\item {} |
199 |
index\_nuc\_BY, the index of the BY nucleosome in the CUR |
200 |
|
201 |
\item {} |
202 |
index\_nuc\_RM, the index of the RM nucleosome in the CUR |
203 |
|
204 |
\item {} |
205 |
llr\_score, , the LLR3 score that estimates conservation between the positions in BY and RM |
206 |
|
207 |
\item {} |
208 |
common\_wp\_pval, the p-value chi square test obtained from LLR3 (\emph{1-pchisq(2.LLR3, df=2)}) |
209 |
|
210 |
\item {} |
211 |
diff, the dyads shift between the positions in the two strains (in bp) |
212 |
|
213 |
\end{itemize} |
214 |
|
215 |
The common UNR map for BY and RM strains is collected in the result directory and is called \emph{BY\_RM\_common\_unr.tab}. It is composed of the following columns: |
216 |
\begin{itemize} |
217 |
\item {} |
218 |
cur\_index, the index of the CUR |
219 |
|
220 |
\item {} |
221 |
index\_nuc\_BY, the index of the BY nucleosome in the CUR |
222 |
|
223 |
\item {} |
224 |
index\_nuc\_RM,the index of the RM nucleosome in the CUR |
225 |
|
226 |
\end{itemize} |
227 |
|
228 |
To execute this script, run the following command in your R console: |
229 |
|
230 |
|
231 |
{\footnotesize |
232 |
<<echo=TRUE, results=hide>>= |
233 |
build_maps(strains, combis, all_samples, curs, config) |
234 |
@ |
235 |
<<echo=TRUE, results=verbatim>>= |
236 |
BY_wp = read.table("outputs/BY_wp.tab", header=TRUE) |
237 |
head(BY_wp) |
238 |
RM_wp = read.table("outputs/RM_wp.tab", header=TRUE) |
239 |
head(RM_wp) |
240 |
BY_RM_common_wp = read.table("outputs/BY_RM_common_wp.tab", header=TRUE) |
241 |
head(BY_RM_common_wp) |
242 |
@ |
243 |
} |
244 |
|
245 |
\subsection{Building Count Tables from TemplateFilter Inputs |
246 |
} |
247 |
\label{tuto:the-script-count-reads-r} |
248 |
To associate a number of observations (read) to each nucleosome we run the script \emph{count\_reads.R}. It produces the files \emph{BY\_RM\_H3K14ac\_wp\_and\_nbreads.tab}, \emph{BY\_RM\_H3K14ac\_unr\_and\_nbreads.tab} \emph{BY\_RM\_Mnase\_Seq\_wp\_and\_nbreads.tab} and \emph{BY\_RM\_Mnase\_Seq\_unr\_and\_nbreads.tab} |
249 |
for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned nucleosomes and Mnase UNRs respectively. |
250 |
|
251 |
For example, the file \emph{BY\_RM\_H3K14ac\_unr\_and\_nbreads.tab} contains counted reads for well-positioned nucleosomes with the experimental condition ChIP H3K14ac. It is composed of the following columns: |
252 |
\begin{itemize} |
253 |
\item {} |
254 |
chr\_BY, the number of the chromosome for BY |
255 |
|
256 |
\item {} |
257 |
lower\_bound\_BY, the lower bound of the nucleosome for BY |
258 |
|
259 |
\item {} |
260 |
upper\_bound\_BY, the upper bound of the nucleosome for BY |
261 |
|
262 |
\item {} |
263 |
index\_nuc\_BY, the index of the BY nucleosome in the CUR for BY |
264 |
|
265 |
\item {} |
266 |
chr\_RM, the number of the chromosome for RM |
267 |
|
268 |
\item {} |
269 |
lower\_bound\_RM, the lower bound of the nucleosome for RM |
270 |
|
271 |
\item {} |
272 |
upper\_bound\_RM, the upper bound of the nucleosome for RM |
273 |
|
274 |
\item {} |
275 |
index\_nuc\_RM,the index of the RM nucleosome in the CUR for RM |
276 |
|
277 |
\item {} |
278 |
cur\_index, index of the CUR |
279 |
|
280 |
\item {} |
281 |
BY\_H3K14ac\_36, the number of reads for the current nucleosome for the sample 36 |
282 |
|
283 |
\item {} |
284 |
BY\_H3K14ac\_37, \#reads for sample 37 |
285 |
|
286 |
\item {} |
287 |
BY\_H3K14ac\_53, \#reads for sample 53 |
288 |
|
289 |
\item {} |
290 |
RM\_H3K14ac\_38, \#reads for sample 38 |
291 |
|
292 |
\item {} |
293 |
RM\_H3K14ac\_39, \#reads for sample 39 |
294 |
|
295 |
\end{itemize} |
296 |
|
297 |
To execute this script, run the following command in your R console: |
298 |
|
299 |
|
300 |
{\footnotesize |
301 |
<<echo=TRUE, results=hide>>= |
302 |
for (form in c("wp", "unr")) { |
303 |
for (marker in markers) { |
304 |
build_count_table(marker, combis[[1]], form, curs, all_samples, config) |
305 |
} |
306 |
} |
307 |
@ |
308 |
<<echo=TRUE, results=verbatim>>= |
309 |
a_count_table = read.table("outputs/BY_RM_Mnase_Seq_wp_and_nbreads.tab", header=TRUE) |
310 |
head(a_count_table) |
311 |
@ |
312 |
} |
313 |
|
314 |
\subsection{Analysing Count Tables using DESeq} |
315 |
|
316 |
Finally, the script \emph{launch\_deseq.R} perform statistical analysis on each nucleosome using \emph{DESeq}. It produces files: |
317 |
\begin{itemize} |
318 |
\item {} |
319 |
results/current/BY\_RM\_H3K14ac\_wp\_snep.tab |
320 |
|
321 |
\item {} |
322 |
results/current/BY\_RM\_H3K14ac\_unr\_snep.tab |
323 |
|
324 |
\item {} |
325 |
results/current/BY\_RM\_H3K14ac\_wpunr\_snep.tab |
326 |
|
327 |
\end{itemize} |
328 |
|
329 |
These files are organised with the following columns (see file \emph{BY\_RM\_H3K14ac\_wp\_snep.tab} for an example): |
330 |
\begin{itemize} |
331 |
\item {} |
332 |
chr\_BY, the number of the chromosome for BY |
333 |
|
334 |
\item {} |
335 |
lower\_bound\_BY, the lower bound of the nucleosome for BY |
336 |
|
337 |
\item {} |
338 |
upper\_bound\_BY, the upper bound of the nucleosome for BY |
339 |
|
340 |
\item {} |
341 |
index\_nuc\_BY, the index of the BY nucleosome in the CUR for BY |
342 |
|
343 |
\item {} |
344 |
chr\_RM, the number of the chromosome for RM |
345 |
|
346 |
\item {} |
347 |
lower\_bound\_RM, the lower bound of the nucleosome for RM |
348 |
|
349 |
\item {} |
350 |
upper\_bound\_RM, the upper bound of the nucleosome for RM |
351 |
|
352 |
\item {} |
353 |
index\_nuc\_RM,the index of the RM nucleosome in the CUR for RM |
354 |
|
355 |
\item {} |
356 |
cur\_index, index of the CUR |
357 |
|
358 |
\item {} |
359 |
form |
360 |
|
361 |
\item {} |
362 |
BY\_Mnase\_Seq\_1, the number of reads for the current nucleosome for the sample 1 |
363 |
|
364 |
\end{itemize} |
365 |
|
366 |
Next columns concern indicators for each sample: |
367 |
\begin{itemize} |
368 |
\item {} |
369 |
BY\_Mnase\_Seq\_2, \#reads for sample 2 |
370 |
|
371 |
\item {} |
372 |
BY\_Mnase\_Seq\_3, \#reads for sample 3 |
373 |
|
374 |
\item {} |
375 |
RM\_Mnase\_Seq\_4, \#reads for sample 4 |
376 |
|
377 |
\item {} |
378 |
RM\_Mnase\_Seq\_5, \#reads for sample 5 |
379 |
|
380 |
\item {} |
381 |
RM\_Mnase\_Seq\_6, \#reads for sample 6 |
382 |
|
383 |
\item {} |
384 |
BY\_H3K14ac\_36, \#reads for sample 36 |
385 |
|
386 |
\item {} |
387 |
BY\_H3K14ac\_37, \#reads for sample 37 |
388 |
|
389 |
\item {} |
390 |
BY\_H3K14ac\_53, \#reads for sample 53 |
391 |
|
392 |
\item {} |
393 |
RM\_H3K14ac\_38, \#reads for sample 38 |
394 |
|
395 |
\item {} |
396 |
RM\_H3K14ac\_39, \#reads for sample 39 |
397 |
|
398 |
\end{itemize} |
399 |
|
400 |
The 7 last columns concern DESeq analysis: |
401 |
\begin{itemize} |
402 |
\item {} |
403 |
manip{[}a\_manip{]} strain{[}a\_strain{]} manip{[}a\_strain{]}:strain{[}a\_strain{]}, the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model. |
404 |
|
405 |
\item {} |
406 |
pvalsGLM, the pvalue resulting from the comparison of the GLM model considering the interaction term \emph{marker:strain} to the GLM model that does not consider it. This is the statsitcial significance of the interaction term and therefore the statistical significance of the SNEP. |
407 |
|
408 |
\item {} |
409 |
snep\_index, a boolean set to TRUE if the pvalueGLM value is under the threshold computed with FDR function with a rate set to 0.0001. |
410 |
|
411 |
\item {} |
412 |
mnase\_l2fc, this is the coefficients of the fitted generalized linear model based on MNASE samples only. |
413 |
|
414 |
\item {} |
415 |
mnase\_l2fc\_pval, the pvalue resulting from the analysis of the simple model based on MNASE samples only. This is the statsitcial significance of the \emph{strain} term and therefore the statistical significance of the strain effect. |
416 |
|
417 |
\end{itemize} |
418 |
|
419 |
To execute this script, run the following command in your R console: |
420 |
|
421 |
|
422 |
{\footnotesize |
423 |
<<echo=TRUE, results=hide>>= |
424 |
for (form in c("wp", "unr", "wpunr")) { |
425 |
sneps = analyse_count_table(markers[2], combis[[1]], form, all_samples, config=config) |
426 |
} |
427 |
@ |
428 |
<<echo=TRUE, results=verbatim>>= |
429 |
a_sneps_table = read.table("outputs/BY_RM_H3K14ac_wp_snep.tab", header=TRUE) |
430 |
head(a_sneps_table) |
431 |
|
432 |
@ |
433 |
} |
434 |
|
435 |
Here are the number of computed SNEPs for each forms. |
436 |
|
437 |
|
438 |
|
439 |
|
440 |
\end{document} |