/README - Annoter - NucleoMiner - Forge du Centre Blaise Pascal

root / README @ master

Historique | Voir | Annoter | Télécharger (30,72 ko)

1	935a568c	Florent Chuffart
2	3c88abd0	Florent Chuffart	Welcome to NucleoMiner2
3	3c88abd0	Florent Chuffart	*************************
4	3c88abd0	Florent Chuffart
5	3c88abd0	Florent Chuffart	* Readme / Documentation for NucleoMiner2
6	3c88abd0	Florent Chuffart
7	3c88abd0	Florent Chuffart	* License
8	3c88abd0	Florent Chuffart
9	3c88abd0	Florent Chuffart	* Installation Instructions
10	3c88abd0	Florent Chuffart
11	3c88abd0	Florent Chuffart	* Tutorial
12	3c88abd0	Florent Chuffart
13	3c88abd0	Florent Chuffart	* Experimental Dataset, Working Directory and Configuration File
14	3c88abd0	Florent Chuffart
15	3c88abd0	Florent Chuffart	* Preprocessing Illumina Fastq Reads for Each Sample
16	3c88abd0	Florent Chuffart
17	3c88abd0	Florent Chuffart	* Inferring Nucleosome Position and Extracting Read Counts
18	3c88abd0	Florent Chuffart
19	3c88abd0	Florent Chuffart	* Results: Number of SNEPs
20	3c88abd0	Florent Chuffart
21	3c88abd0	Florent Chuffart	* APPENDICE: Generate .c2c Files
22	3c88abd0	Florent Chuffart
23	3c88abd0	Florent Chuffart	* References
24	3c88abd0	Florent Chuffart
25	3c88abd0	Florent Chuffart	* Python Reference
26	3c88abd0	Florent Chuffart
27	3c88abd0	Florent Chuffart	* R Reference
28	3c88abd0	Florent Chuffart
29	935a568c	Florent Chuffart	Readme / Documentation for NucleoMiner2
30	935a568c	Florent Chuffart	*****************************************
31	935a568c	Florent Chuffart
32	935a568c	Florent Chuffart	NucleoMiner2 offers Python API and R package allowing to perform
33	e5603c3f	Florent Chuffart	quantitative analysis of epigenetic marks on individual nucleosomes.
34	e5603c3f	Florent Chuffart	It was developed to detect natural Single-Nucleosome Epi-Polymorphisms
35	e5603c3f	Florent Chuffart	(SNEP) from MNase-seq and ChIP-seq data.
36	935a568c	Florent Chuffart
37	935a568c	Florent Chuffart
38	935a568c	Florent Chuffart	License
39	935a568c	Florent Chuffart	=======
40	935a568c	Florent Chuffart
41	935a568c	Florent Chuffart	Copyright CNRS 2012-2013
42	935a568c	Florent Chuffart
43	935a568c	Florent Chuffart	* Florent CHUFFART
44	935a568c	Florent Chuffart
45	935a568c	Florent Chuffart	* Jean-Baptiste VEYRIERAS
46	935a568c	Florent Chuffart
47	935a568c	Florent Chuffart	* Gael YVERT
48	935a568c	Florent Chuffart
49	935a568c	Florent Chuffart	This software is a computer program which purpose is to perform
50	935a568c	Florent Chuffart	quanti- tative analysis of epigenetic marks at single nucleosome
51	935a568c	Florent Chuffart	resolution.
52	935a568c	Florent Chuffart
53	935a568c	Florent Chuffart	This software is governed by the CeCILL license under French law and
54	935a568c	Florent Chuffart	abiding by the rules of distribution of free software. You can use,
55	935a568c	Florent Chuffart	modify and/ or redistribute the software under the terms of the CeCILL
56	935a568c	Florent Chuffart	license as circulated by CEA, CNRS and INRIA at the following URL
57	935a568c	Florent Chuffart	"http://www.cecill.info".
58	935a568c	Florent Chuffart
59	935a568c	Florent Chuffart	As a counterpart to the access to the source code and rights to copy,
60	935a568c	Florent Chuffart	modify and redistribute granted by the license, users are provided
61	935a568c	Florent Chuffart	only with a limited warranty and the software's author, the holder
62	935a568c	Florent Chuffart	of the economic rights, and the successive licensors have only
63	935a568c	Florent Chuffart	limited liability.
64	935a568c	Florent Chuffart
65	e5603c3f	Florent Chuffart	This software is provided with absolutely NO WARRANTY. The authors can
66	e5603c3f	Florent Chuffart	not be held responsible, even partially, for any damage, loss,
67	e5603c3f	Florent Chuffart	financial loss or any other undesired facts resulting from the use of
68	e5603c3f	Florent Chuffart	the software. In this respect, the user's attention is drawn to the
69	e5603c3f	Florent Chuffart	risks associated with loading, using, modifying and/or developing or
70	e5603c3f	Florent Chuffart	reproducing the software by the user in light of its specific status
71	e5603c3f	Florent Chuffart	of free software, that may mean that it is complicated to manipulate,
72	e5603c3f	Florent Chuffart	and that also therefore means that it is reserved for developers
73	e5603c3f	Florent Chuffart	and experienced professionals having in-depth computer knowledge.
74	e5603c3f	Florent Chuffart	Users are therefore encouraged to load and test the software's
75	e5603c3f	Florent Chuffart	suitability as regards their requirements in conditions enabling the
76	e5603c3f	Florent Chuffart	security of their systems and/or data to be ensured and, more
77	e5603c3f	Florent Chuffart	generally, to use and operate it in the same conditions as regards
78	e5603c3f	Florent Chuffart	security.
79	935a568c	Florent Chuffart
80	935a568c	Florent Chuffart	The fact that you are presently reading this means that you have had
81	935a568c	Florent Chuffart	knowledge of the CeCILL license and that you accept its terms.
82	935a568c	Florent Chuffart
83	935a568c	Florent Chuffart
84	935a568c	Florent Chuffart	Installation Instructions
85	935a568c	Florent Chuffart	=========================
86	935a568c	Florent Chuffart
87	935a568c	Florent Chuffart
88	935a568c	Florent Chuffart	Links
89	935a568c	Florent Chuffart	-----
90	935a568c	Florent Chuffart
91	e5603c3f	Florent Chuffart	NucleoMiner2 home page and documentation are available here:
92	935a568c	Florent Chuffart
93	e5603c3f	Florent Chuffart	* https://forge.cbp.ens-lyon.fr/redmine/projects/nucleominer
94	e5603c3f	Florent Chuffart
95	e5603c3f	Florent Chuffart	The Yvert lab web page is accessible here:
96	e5603c3f	Florent Chuffart
97	e5603c3f	Florent Chuffart	* http://www.ens-lyon.fr/LBMC/gisv/
98	935a568c	Florent Chuffart
99	935a568c	Florent Chuffart
100	935a568c	Florent Chuffart	Installation
101	935a568c	Florent Chuffart	------------
102	935a568c	Florent Chuffart
103	5badc2fd	Florent Chuffart	The first installation step is to retrieve the source code of
104	3961deb6	Florent Chuffart	NucleoMiner2. You can do this by typing the following command in a
105	5badc2fd	Florent Chuffart	terminal.
106	5badc2fd	Florent Chuffart
107	5badc2fd	Florent Chuffart	git clone http://forge.cbp.ens-lyon.fr/git/nucleominer
108	5badc2fd	Florent Chuffart
109	935a568c	Florent Chuffart
110	e5603c3f	Florent Chuffart	Prerequisites
111	e5603c3f	Florent Chuffart	~~~~~~~~~~~~~
112	e5603c3f	Florent Chuffart
113	e5603c3f	Florent Chuffart	To work properly, NucleoMiner2 needs that the following free software
114	e5603c3f	Florent Chuffart	are installed and made available on your system:
115	e5603c3f	Florent Chuffart
116	e5603c3f	Florent Chuffart	* Bowtie2 http://bowtie-bio.sourceforge.net/bowtie2
117	935a568c	Florent Chuffart
118	e5603c3f	Florent Chuffart	* SAMtools http://samtools.sourceforge.net
119	935a568c	Florent Chuffart
120	e5603c3f	Florent Chuffart	* bedtools http://code.google.com/p/bedtools/
121	935a568c	Florent Chuffart
122	e5603c3f	Florent Chuffart	* TemplateFilter
123	e5603c3f	Florent Chuffart	http://compbio.cs.huji.ac.il/NucPosition/TemplateFiltering
124	935a568c	Florent Chuffart
125	e5603c3f	Florent Chuffart	It also requires the following R packages to be installed on your
126	e5603c3f	Florent Chuffart	system:
127	dadb6a4d	Florent Chuffart
128	dadb6a4d	Florent Chuffart	* fork
129	935a568c	Florent Chuffart
130	935a568c	Florent Chuffart	* rjson
131	935a568c	Florent Chuffart
132	935a568c	Florent Chuffart	* seqinr
133	935a568c	Florent Chuffart
134	5badc2fd	Florent Chuffart	* plotrix
135	5badc2fd	Florent Chuffart
136	5badc2fd	Florent Chuffart	* DESeq
137	5badc2fd	Florent Chuffart
138	3961deb6	Florent Chuffart	These packages can be installed by typing the following command in an
139	3961deb6	Florent Chuffart	R console:
140	3961deb6	Florent Chuffart
141	3961deb6	Florent Chuffart	install.packages(c("fork", "rjson", "seqinr", "plotrix"))
142	3961deb6	Florent Chuffart	source("http://bioconductor.org/biocLite.R")
143	3961deb6	Florent Chuffart	biocLite("DESeq")
144	3961deb6	Florent Chuffart
145	3961deb6	Florent Chuffart	Finally,by typing the git command above, you downloaded specific R
146	3961deb6	Florent Chuffart	packages provided with NucleoMiner2 that you now need to install:
147	3961deb6	Florent Chuffart
148	e5603c3f	Florent Chuffart	* cachecache https://forge.cbp.ens-
149	e5603c3f	Florent Chuffart	lyon.fr/redmine/projects/cachecache
150	935a568c	Florent Chuffart
151	e5603c3f	Florent Chuffart	* bot https://forge.cbp.ens-lyon.fr/redmine/projects/bot
152	935a568c	Florent Chuffart
153	e5603c3f	Florent Chuffart	* nucleominer https://forge.cbp.ens-
154	e5603c3f	Florent Chuffart	lyon.fr/redmine/projects/nucleominer
155	5badc2fd	Florent Chuffart
156	3961deb6	Florent Chuffart	To do so, type the following command in your terminal:
157	5badc2fd	Florent Chuffart
158	5badc2fd	Florent Chuffart	cd nucleominer
159	5badc2fd	Florent Chuffart	R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\
160	5badc2fd	Florent Chuffart	doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\
161	5badc2fd	Florent Chuffart	build/nucleominer_2.3.46.tar.gz
162	3c88abd0	Florent Chuffart
163	3c88abd0	Florent Chuffart	Tutorial
164	3c88abd0	Florent Chuffart	********
165	3c88abd0	Florent Chuffart
166	3961deb6	Florent Chuffart	This tutorial describes steps allowing to perform quantitative
167	3c88abd0	Florent Chuffart	analysis of epigenetic marks on individual nucleosomes. We assume that
168	3c88abd0	Florent Chuffart	files are organised according to a given hierarchy and that all
169	3c88abd0	Florent Chuffart	command lines are launched from the project’s root directory.
170	3c88abd0	Florent Chuffart
171	3c88abd0	Florent Chuffart	This tutorial is divided into two main parts. The first part covers
172	3c88abd0	Florent Chuffart	the python script wf.py that aligns and converts short sequence
173	3961deb6	Florent Chuffart	reads. The second part covers the R scripts that extracts nucleosome-
174	3961deb6	Florent Chuffart	level information (nucleosome position and indicators) from the
175	3961deb6	Florent Chuffart	dataset.
176	3c88abd0	Florent Chuffart
177	3c88abd0	Florent Chuffart
178	3c88abd0	Florent Chuffart	Experimental Dataset, Working Directory and Configuration File
179	3c88abd0	Florent Chuffart	==============================================================
180	3c88abd0	Florent Chuffart
181	3c88abd0	Florent Chuffart
182	3c88abd0	Florent Chuffart	Working Directory Organisation
183	3c88abd0	Florent Chuffart	------------------------------
184	3c88abd0	Florent Chuffart
185	3961deb6	Florent Chuffart	After having installed NucleoMiner2 environment (Previous section), go
186	3c88abd0	Florent Chuffart	to the root working directory of the tutorial by typing the following
187	3c88abd0	Florent Chuffart	command in a terminal:
188	3c88abd0	Florent Chuffart
189	3c88abd0	Florent Chuffart	cd doc/Chuffart_NM2_workdir/
190	3c88abd0	Florent Chuffart
191	3c88abd0	Florent Chuffart
192	3c88abd0	Florent Chuffart	Retrieving Experimental Dataset
193	3c88abd0	Florent Chuffart	-------------------------------
194	3c88abd0	Florent Chuffart
195	3c88abd0	Florent Chuffart	The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress
196	3c88abd0	Florent Chuffart	(http://www.ebi.ac.uk/arrayexpress/) under accession number
197	3c88abd0	Florent Chuffart	E-MTAB-2671.
198	3c88abd0	Florent Chuffart
199	3c88abd0	Florent Chuffart	$$$ TODO explain how organise Experimental Dataset into the data
200	3c88abd0	Florent Chuffart	directory of the working directory.
201	3c88abd0	Florent Chuffart
202	3961deb6	Florent Chuffart	In this tutorial, we want to compare nucleosomes of 2 yeast strains:
203	3961deb6	Florent Chuffart	BY and RM. For each strain Mnase-Seq was performed as well as ChIP-Seq
204	3961deb6	Florent Chuffart	using an antibody recognizing the H3K14ac epigenetic mark. Illumina
205	3961deb6	Florent Chuffart	sequencing was done in single-read of 50 bp long.
206	3c88abd0	Florent Chuffart
207	3c88abd0	Florent Chuffart	The dataset is composed of 55 files organised as follows:
208	3c88abd0	Florent Chuffart
209	3c88abd0	Florent Chuffart	* 3 replicates for BY MNase Seq
210	3c88abd0	Florent Chuffart
211	3c88abd0	Florent Chuffart	* sample 1 (5 fastq.gz files)
212	3c88abd0	Florent Chuffart
213	3c88abd0	Florent Chuffart	* sample 2 (5 fastq.gz files)
214	3c88abd0	Florent Chuffart
215	3c88abd0	Florent Chuffart	* sample 3 (4 fastq.gz files)
216	3c88abd0	Florent Chuffart
217	3c88abd0	Florent Chuffart	* 3 replicates for RM MNase Seq
218	3c88abd0	Florent Chuffart
219	3c88abd0	Florent Chuffart	* sample 4 (4 fastq.gz files)
220	3c88abd0	Florent Chuffart
221	3c88abd0	Florent Chuffart	* sample 5 (4 fastq.gz files)
222	3c88abd0	Florent Chuffart
223	3c88abd0	Florent Chuffart	* sample 6 (5 fastq.gz files)
224	3c88abd0	Florent Chuffart
225	3c88abd0	Florent Chuffart	* 3 replicates for BY ChIP Seq H3K14ac
226	3c88abd0	Florent Chuffart
227	3c88abd0	Florent Chuffart	* sample 36 (5 fastq.gz files)
228	3c88abd0	Florent Chuffart
229	3c88abd0	Florent Chuffart	* sample 37 (5 fastq.gz files)
230	3c88abd0	Florent Chuffart
231	3c88abd0	Florent Chuffart	* sample 53 (9 fastq.gz files)
232	3c88abd0	Florent Chuffart
233	3c88abd0	Florent Chuffart	* 2 replicates for RM ChIP Seq H3K14ac
234	3c88abd0	Florent Chuffart
235	3c88abd0	Florent Chuffart	* sample 38 (5 fastq.gz files)
236	3c88abd0	Florent Chuffart
237	3c88abd0	Florent Chuffart	* sample 39 (4 fastq.gz files)
238	3c88abd0	Florent Chuffart
239	3c88abd0	Florent Chuffart
240	3c88abd0	Florent Chuffart	Python and R Common Configuration File
241	3c88abd0	Florent Chuffart	--------------------------------------
242	3c88abd0	Florent Chuffart
243	3961deb6	Florent Chuffart	First, we need to define useful configuration variables that will be
244	3961deb6	Florent Chuffart	passed to python and R scripts. These variables are contained in file
245	3961deb6	Florent Chuffart	configurator.py. The execution of this python script dumps variables
246	3961deb6	Florent Chuffart	into the nucleominer_config.json file that will then be used by both
247	3961deb6	Florent Chuffart	R and python scripts.
248	3c88abd0	Florent Chuffart
249	3961deb6	Florent Chuffart	The initialization of this variables is done in the configurator.py
250	3961deb6	Florent Chuffart	file. If you need to adapt variable values (path, default
251	3961deb6	Florent Chuffart	parameters...) you need to edit this file. Then, go to the root
252	3961deb6	Florent Chuffart	directory of your project and run the following command to dump the
253	3961deb6	Florent Chuffart	configuration file:
254	3c88abd0	Florent Chuffart
255	3c88abd0	Florent Chuffart	python src/current/configurator.py
256	3c88abd0	Florent Chuffart
257	3c88abd0	Florent Chuffart
258	3c88abd0	Florent Chuffart	Preprocessing Illumina Fastq Reads for Each Sample
259	3c88abd0	Florent Chuffart	==================================================
260	3c88abd0	Florent Chuffart
261	3961deb6	Florent Chuffart	Once variables and design have been specified, the script wf.py will
262	3961deb6	Florent Chuffart	automatically run all the analysis. You don't need to do anything. To
263	3961deb6	Florent Chuffart	run the full analysis, run the following command:
264	3961deb6	Florent Chuffart
265	3961deb6	Florent Chuffart	python src/current/wf.py
266	3961deb6	Florent Chuffart
267	3961deb6	Florent Chuffart	The details of the steps performed by this script are explained below.
268	3961deb6	Florent Chuffart	This preprocessing consists of 4 steps embedded in the wf.py script.
269	3961deb6	Florent Chuffart	They are described bellow. As a preamble, this script computes
270	3961deb6	Florent Chuffart	samples, samples_mnase and strains that will be used along the 4
271	3961deb6	Florent Chuffart	steps.
272	3c88abd0	Florent Chuffart
273	3c88abd0	Florent Chuffart	wf.samples = []
274	3c88abd0	Florent Chuffart
275	3c88abd0	Florent Chuffart	List of samples where a sample is identified by an id (key: id)
276	3c88abd0	Florent Chuffart	and a strain name (key strain).
277	3c88abd0	Florent Chuffart
278	3c88abd0	Florent Chuffart	wf.samples_mnase = []
279	3c88abd0	Florent Chuffart
280	3c88abd0	Florent Chuffart	List of Mnase samples.
281	3c88abd0	Florent Chuffart
282	3c88abd0	Florent Chuffart	wf.strains = []
283	3c88abd0	Florent Chuffart
284	3c88abd0	Florent Chuffart	List of reference strains.
285	3c88abd0	Florent Chuffart
286	3c88abd0	Florent Chuffart
287	3c88abd0	Florent Chuffart	Creating Bowtie Index from each Reference Genome
288	3c88abd0	Florent Chuffart	------------------------------------------------
289	3c88abd0	Florent Chuffart
290	3961deb6	Florent Chuffart	For each strain, the script wf.py then creates bowtie index. Bowtie
291	3961deb6	Florent Chuffart	index of a strain is a tree view of the genome of this strain. It will
292	3961deb6	Florent Chuffart	be used by bowtie to align reads. The part of the script performing
293	3961deb6	Florent Chuffart	this is the following:
294	3c88abd0	Florent Chuffart
295	3c88abd0	Florent Chuffart	for strain in strains:
296	3c88abd0	Florent Chuffart	per_strain_stats[strain] = create_bowtie_index(strain,
297	3c88abd0	Florent Chuffart	config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"],
298	3c88abd0	Florent Chuffart	config["BOWTIE_BUILD_BIN"])
299	3c88abd0	Florent Chuffart
300	3961deb6	Florent Chuffart	As an indication, the following table summarizes the file sizes and
301	3961deb6	Florent Chuffart	process durations that we experienced when running this step on a
302	3961deb6	Florent Chuffart	Linux server***.
303	3c88abd0	Florent Chuffart
304	3c88abd0	Florent Chuffart	+--------+------------------------+------------------------+------------------+
305	3c88abd0	Florent Chuffart	\| strain \| fasta genome file size \| bowtie index file size \| process duration \|
306	3c88abd0	Florent Chuffart	+========+========================+========================+==================+
307	3c88abd0	Florent Chuffart	\| BY \| 12 Mo \| 25 Mo \| 11 s. \|
308	3c88abd0	Florent Chuffart	+--------+------------------------+------------------------+------------------+
309	3c88abd0	Florent Chuffart	\| RM \| 12 Mo \| 24 Mo \| 9 s. \|
310	3c88abd0	Florent Chuffart	+--------+------------------------+------------------------+------------------+
311	3c88abd0	Florent Chuffart
312	3c88abd0	Florent Chuffart
313	3c88abd0	Florent Chuffart	Aligning Reads to Reference Genome
314	3c88abd0	Florent Chuffart	----------------------------------
315	3c88abd0	Florent Chuffart
316	3961deb6	Florent Chuffart	Next, the wf.py script launches bowtie to align reads to the
317	3961deb6	Florent Chuffart	reference genome. It produces a .sam file that is converted into a
318	3961deb6	Florent Chuffart	.bed file. Binaries for bowtie, samtools and bedtools are
319	3961deb6	Florent Chuffart	wrapped using python subprocess class. This step is performed by the
320	3961deb6	Florent Chuffart	following part of the script:
321	3c88abd0	Florent Chuffart
322	3c88abd0	Florent Chuffart	for sample in samples:
323	3c88abd0	Florent Chuffart	per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample,
324	3c88abd0	Florent Chuffart	config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"],
325	3c88abd0	Florent Chuffart	config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"],
326	3c88abd0	Florent Chuffart	config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
327	3c88abd0	Florent Chuffart
328	3c88abd0	Florent Chuffart
329	3c88abd0	Florent Chuffart	Convert Aligned Reads into TemplateFilter Format
330	3c88abd0	Florent Chuffart	------------------------------------------------
331	3c88abd0	Florent Chuffart
332	3c88abd0	Florent Chuffart	TemplateFilter uses particular input formats for reads, so it is
333	3961deb6	Florent Chuffart	necessary to convert the .bed files. TemplateFilter expect reads in
334	3961deb6	Florent Chuffart	the following format: chr, coord, strand and #read where:
335	3c88abd0	Florent Chuffart
336	3c88abd0	Florent Chuffart	* chr is the number of the chromosome;
337	3c88abd0	Florent Chuffart
338	3c88abd0	Florent Chuffart	* coord is the coordinate of the reads;
339	3c88abd0	Florent Chuffart
340	3c88abd0	Florent Chuffart	* strand is F for forward and R for reverse;
341	3c88abd0	Florent Chuffart
342	3c88abd0	Florent Chuffart	* #reads the number of reads covering this position.
343	3c88abd0	Florent Chuffart
344	3c88abd0	Florent Chuffart	Each entry is tab-separated.
345	3c88abd0	Florent Chuffart
346	3c88abd0	Florent Chuffart	WARNING for reverse strands, bowtie returns the position of the
347	3c88abd0	Florent Chuffart	first nucleotide on the left hand side, whereas TemplateFilter expects
348	3961deb6	Florent Chuffart	the first one on the right hand side. This is taken into account in
349	3961deb6	Florent Chuffart	NucleoMiner2 by adding the read length (in our case 50) to the reverse
350	3c88abd0	Florent Chuffart	reads coordinates.
351	3c88abd0	Florent Chuffart
352	3c88abd0	Florent Chuffart	This step is performed by the following part of the wf.py script:
353	3c88abd0	Florent Chuffart
354	3c88abd0	Florent Chuffart	for sample in samples:
355	3c88abd0	Florent Chuffart	per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample,
356	3c88abd0	Florent Chuffart	config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"],
357	3c88abd0	Florent Chuffart	config["READ_LENGTH"],config["MAPQ_THRES"])
358	3c88abd0	Florent Chuffart
359	3961deb6	Florent Chuffart	The following table summarizes the number of reads, the involved file
360	3961deb6	Florent Chuffart	sizes and process durations that we experienced when running the two
361	3961deb6	Florent Chuffart	last steps. In our case, alignment process were multithreaded over 3
362	3961deb6	Florent Chuffart	cores.
363	3c88abd0	Florent Chuffart
364	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
365	3c88abd0	Florent Chuffart	\| id \| Illumina reads \| aligned and filtred reads \| ratio \| .bed file size \| TF input file size \| process duration \|
366	3c88abd0	Florent Chuffart	+====+================+===========================+========+==================+====================+==================+
367	3c88abd0	Florent Chuffart	\| 1 \| 16436138 \| 10199695 \| 62,06% \| 1064 Mo \| 60 Mo \| 383 s. \|
368	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
369	3c88abd0	Florent Chuffart	\| 2 \| 16911132 \| 12512727 \| 73,99% \| 1298 Mo \| 64 Mo \| 437 s. \|
370	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
371	3c88abd0	Florent Chuffart	\| 3 \| 15946902 \| 12340426 \| 77,38% \| 1280 Mo \| 65 Mo \| 423 s. \|
372	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
373	3c88abd0	Florent Chuffart	\| 4 \| 13765584 \| 10381903 \| 75,42% \| 931 Mo \| 59 Mo \| 352 s. \|
374	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
375	3c88abd0	Florent Chuffart	\| 5 \| 15168268 \| 11502855 \| 75,83% \| 1031 Mo \| 64 Mo \| 386 s. \|
376	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
377	3c88abd0	Florent Chuffart	\| 6 \| 18850820 \| 14024905 \| 74,40% \| 1254 Mo \| 69 Mo \| 482 s. \|
378	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
379	3c88abd0	Florent Chuffart	\| 36 \| 17715118 \| 14092985 \| 79,55% \| 1404 Mo \| 68 Mo \| 483 s. \|
380	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
381	3c88abd0	Florent Chuffart	\| 37 \| 17288466 \| 7402082 \| 42,82% \| 741 Mo \| 48 Mo \| 339 s. \|
382	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
383	3c88abd0	Florent Chuffart	\| 38 \| 16116394 \| 13178457 \| 81,77% \| 1101 Mo \| 63 Mo \| 420 s. \|
384	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
385	3c88abd0	Florent Chuffart	\| 39 \| 14241106 \| 10537228 \| 73,99% \| 880 Mo \| 57 Mo \| 348 s. \|
386	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
387	3c88abd0	Florent Chuffart	\| 53 \| 40876476 \| 33780065 \| 82,64% \| 3316 Mo \| 103 Mo \| 1165 s. \|
388	3c88abd0	Florent Chuffart	+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
389	3c88abd0	Florent Chuffart
390	3c88abd0	Florent Chuffart
391	3c88abd0	Florent Chuffart	Run TemplateFilter on Mnase Samples
392	3c88abd0	Florent Chuffart	-----------------------------------
393	3c88abd0	Florent Chuffart
394	3c88abd0	Florent Chuffart	Finally, for each sample we perform TemplateFilter analysis.
395	3c88abd0	Florent Chuffart
396	3c88abd0	Florent Chuffart	WARNING TemplateFilter returns a list of nucleosomes. Each
397	3961deb6	Florent Chuffart	nucleosome is defined by its center and its width. An odd width leads
398	3c88abd0	Florent Chuffart	us to consider non- integer lower and upper bound.
399	3c88abd0	Florent Chuffart
400	3961deb6	Florent Chuffart	WARNING TemplateFilter was not designed to handle replicates. So
401	3c88abd0	Florent Chuffart	we recommend to keep a maximum of nucleosomes and filter the aberrant
402	3c88abd0	Florent Chuffart	ones afterwards using the benefits of having replicates. To do this,
403	3c88abd0	Florent Chuffart	we set a low correlation threshold parameter (0.5) and a particularly
404	3c88abd0	Florent Chuffart	high value of overlap (300%).
405	3c88abd0	Florent Chuffart
406	3c88abd0	Florent Chuffart	This step is performed by the following part of the wf.py script:
407	3c88abd0	Florent Chuffart
408	3c88abd0	Florent Chuffart	for sample in samples_mnase:
409	3c88abd0	Florent Chuffart	per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample,
410	3c88abd0	Florent Chuffart	config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"],
411	3c88abd0	Florent Chuffart	config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"],
412	3c88abd0	Florent Chuffart	config["TF_MAXW"], config["TF_OL"])
413	3c88abd0	Florent Chuffart
414	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
415	3c88abd0	Florent Chuffart	\| id \| strain \| found nucs \| nuc file size \| process duration \|
416	3c88abd0	Florent Chuffart	+====+========+============+===============+==================+
417	3c88abd0	Florent Chuffart	\| 1 \| BY \| 96214 \| 68 Mo \| 1022 s. \|
418	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
419	3c88abd0	Florent Chuffart	\| 2 \| BY \| 91694 \| 65 Mo \| 1038 s. \|
420	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
421	3c88abd0	Florent Chuffart	\| 3 \| BY \| 91205 \| 65 Mo \| 1036 s. \|
422	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
423	3c88abd0	Florent Chuffart	\| 4 \| RM \| 88076 \| 62 Mo \| 984 s. \|
424	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
425	3c88abd0	Florent Chuffart	\| 5 \| RM \| 90141 \| 64 Mo \| 967 s. \|
426	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
427	3c88abd0	Florent Chuffart	\| 6 \| RM \| 87517 \| 62 Mo \| 980 s. \|
428	3c88abd0	Florent Chuffart	+----+--------+------------+---------------+------------------+
429	3c88abd0	Florent Chuffart
430	3c88abd0	Florent Chuffart
431	3c88abd0	Florent Chuffart	Inferring Nucleosome Position and Extracting Read Counts
432	3c88abd0	Florent Chuffart	========================================================
433	3c88abd0	Florent Chuffart
434	3c88abd0	Florent Chuffart	The second part of the tutorial uses R
435	3961deb6	Florent Chuffart	(http://http://www.r-project.org). NucleoMiner2 contains a set of R
436	3961deb6	Florent Chuffart	scripts that will be sourced in R from a console launched at the root
437	3961deb6	Florent Chuffart	of your project. These scripts are:
438	3c88abd0	Florent Chuffart
439	3c88abd0	Florent Chuffart	* headers.R
440	3c88abd0	Florent Chuffart
441	3c88abd0	Florent Chuffart	* extract_maps.R
442	3c88abd0	Florent Chuffart
443	3c88abd0	Florent Chuffart	* translate_common_wp.R
444	3c88abd0	Florent Chuffart
445	3c88abd0	Florent Chuffart	* split_samples.R
446	3c88abd0	Florent Chuffart
447	3c88abd0	Florent Chuffart	* count_reads.R
448	3c88abd0	Florent Chuffart
449	3c88abd0	Florent Chuffart	* get_size_factors
450	3c88abd0	Florent Chuffart
451	3c88abd0	Florent Chuffart	* launch_deseq.R
452	3c88abd0	Florent Chuffart
453	3c88abd0	Florent Chuffart
454	3c88abd0	Florent Chuffart	The Script headers.R
455	3c88abd0	Florent Chuffart	--------------------
456	3c88abd0	Florent Chuffart
457	3961deb6	Florent Chuffart	The script headers.R is included in all other R scripts. It is in
458	3c88abd0	Florent Chuffart	charge of:
459	3c88abd0	Florent Chuffart
460	3c88abd0	Florent Chuffart	* launching libraries used in the scripts
461	3c88abd0	Florent Chuffart
462	3c88abd0	Florent Chuffart	* launching configuration (design, strain, marker...)
463	3c88abd0	Florent Chuffart
464	3961deb6	Florent Chuffart	* computing and caching Common Uinterrupted Regions (CURs).
465	3961deb6	Florent Chuffart	Caching means storing the information in the computer's memory.
466	3c88abd0	Florent Chuffart
467	3c88abd0	Florent Chuffart	Note that you can customize the function “translate”. This function
468	3c88abd0	Florent Chuffart	allows you to use the alignments between genomes when performing
469	3961deb6	Florent Chuffart	various tasks.
470	3c88abd0	Florent Chuffart
471	3961deb6	Florent Chuffart	* You may want to analyze data of a single strain (e.g.
472	3961deb6	Florent Chuffart	treatment/control, or only few mutations). In this case, the
473	3961deb6	Florent Chuffart	genome is identical across all samples and you do not need to
474	3961deb6	Florent Chuffart	define particular CURs (CURs are chromosomes). Simply use the
475	3c88abd0	Florent Chuffart	default translate function which is neutral.
476	3c88abd0	Florent Chuffart
477	3961deb6	Florent Chuffart	* If you are analyzing data from two or more strains (as
478	3961deb6	Florent Chuffart	NucleoMiner2 was designed for), then you need to translate
479	3961deb6	Florent Chuffart	coordinates of one genome into the coordinates of another one.
480	3961deb6	Florent Chuffart	You must do this by aligning the two genomes, which will produce
481	3961deb6	Florent Chuffart	a .c2c file (see Appendice "Generate .c2c Files"). thenuse it to
482	3961deb6	Florent Chuffart	produce the list of regions and customise “translate”.
483	3c88abd0	Florent Chuffart
484	3961deb6	Florent Chuffart	In our tutorial, we are in the second case and to perform all these
485	3961deb6	Florent Chuffart	steps run the following command line in your R console:
486	3c88abd0	Florent Chuffart
487	3c88abd0	Florent Chuffart	source("src/current/headers.R")
488	3c88abd0	Florent Chuffart
489	3c88abd0	Florent Chuffart
490	3c88abd0	Florent Chuffart	The Script extract_maps.R
491	3c88abd0	Florent Chuffart	-------------------------
492	3c88abd0	Florent Chuffart
493	3c88abd0	Florent Chuffart	This script is in charge of extracting Maps for well-positioned and
494	3961deb6	Florent Chuffart	sensitive nucleosomes. First of all, this script computes intra and
495	3961deb6	Florent Chuffart	inter-strain matches of nucleosome maps for each CUR. This step can be
496	3961deb6	Florent Chuffart	executed in parallel on many cores using the BoT library. Next, it
497	3961deb6	Florent Chuffart	collects results and produces maps of well-positioned nucleosomes,
498	3961deb6	Florent Chuffart	sensitive nucleosomes and Unaligned Nucleosomal Regions .
499	3961deb6	Florent Chuffart
500	3961deb6	Florent Chuffart	The map of well-positioned nucleosomes for BY is collected in the
501	3961deb6	Florent Chuffart	result directory and is called BY_wp.tab. It is composed of
502	3961deb6	Florent Chuffart	following columns:
503	3c88abd0	Florent Chuffart
504	3c88abd0	Florent Chuffart	* chr, the number of the chromosome
505	3c88abd0	Florent Chuffart
506	3c88abd0	Florent Chuffart	* lower_bound, the lower bound of the nucleosome
507	3c88abd0	Florent Chuffart
508	3c88abd0	Florent Chuffart	* upper_bound, the upper bound of the nucleosome
509	3c88abd0	Florent Chuffart
510	3c88abd0	Florent Chuffart	* cur_index, index of the CUR
511	3c88abd0	Florent Chuffart
512	3c88abd0	Florent Chuffart	* index_nuc, the index of the nucleosome in the CUR
513	3c88abd0	Florent Chuffart
514	3c88abd0	Florent Chuffart	* wp, 1 if it is a well positioned nucleosome, 0 otherwise
515	3c88abd0	Florent Chuffart
516	3c88abd0	Florent Chuffart	* nb_reads, the number of reads that support this nucleosome
517	3c88abd0	Florent Chuffart
518	3c88abd0	Florent Chuffart	* nb_nucs, the number of TemplateFilter nucleosome across
519	3c88abd0	Florent Chuffart	replicates (= the number of replicates in which it is a well-
520	3c88abd0	Florent Chuffart	positioned nucleosome)
521	3c88abd0	Florent Chuffart
522	3c88abd0	Florent Chuffart	* llr_1, for a well-positioned nucleosome, it is the LLR1 (log-
523	3c88abd0	Florent Chuffart	likelihood ratio) between the first and the second TemplateFilter
524	3c88abd0	Florent Chuffart	nucleosome on the chain.
525	3c88abd0	Florent Chuffart
526	3c88abd0	Florent Chuffart	* llr_2, for a well-positioned nucleosome, it is the LLR1 between
527	3c88abd0	Florent Chuffart	the second and the third TemplateFilter nucleosome on the chain.
528	3c88abd0	Florent Chuffart
529	3c88abd0	Florent Chuffart	* wp_llr, for a well-positioned nucleosome, it is the LLR2 that
530	3c88abd0	Florent Chuffart	compares consistency of the positioning over all TemplateFilter
531	3c88abd0	Florent Chuffart	nucleosomes.
532	3c88abd0	Florent Chuffart
533	3c88abd0	Florent Chuffart	* wp_pval, for a well-positioned nucleosome, it is the p-value
534	3961deb6	Florent Chuffart	chi square test obtained from LLR2 (1-pchisq(2.LLR2, df=4))
535	3c88abd0	Florent Chuffart
536	3c88abd0	Florent Chuffart	* dyad_shift, for a well-positioned nucleosome, it is the shift
537	3c88abd0	Florent Chuffart	between the two extreme TemplateFilter nucleosome dyad positions.
538	3c88abd0	Florent Chuffart
539	3961deb6	Florent Chuffart	The sensitive map for BY is collected in the result directory and is
540	3c88abd0	Florent Chuffart	called BY_fuzzy.tab. It is composed of following columns:
541	3c88abd0	Florent Chuffart
542	3c88abd0	Florent Chuffart	* chr, the number of the chromosome
543	3c88abd0	Florent Chuffart
544	3c88abd0	Florent Chuffart	* lower_bound, the lower bound of the nucleosome
545	3c88abd0	Florent Chuffart
546	3c88abd0	Florent Chuffart	* upper_bound, the upper bound of the nucleosome
547	3c88abd0	Florent Chuffart
548	3c88abd0	Florent Chuffart	* cur_index, index of the CUR
549	3c88abd0	Florent Chuffart
550	3c88abd0	Florent Chuffart	The map of common well-positioned nucleosomes aligned between the BY
551	3c88abd0	Florent Chuffart	and RM strains is collected in the result directory and is called
552	3c88abd0	Florent Chuffart	BY_RM_common_wp.tab. It is composed of following columns:
553	3c88abd0	Florent Chuffart
554	3c88abd0	Florent Chuffart	* cur_index, the index of the CUR
555	3c88abd0	Florent Chuffart
556	3c88abd0	Florent Chuffart	* index_nuc_BY, the index of the BY nucleosome in the CUR
557	3c88abd0	Florent Chuffart
558	3c88abd0	Florent Chuffart	* index_nuc_RM, the index of the RM nucleosome in the CUR
559	3c88abd0	Florent Chuffart
560	3c88abd0	Florent Chuffart	* llr_score, , the LLR3 score that estimates conservation between
561	3c88abd0	Florent Chuffart	the positions in BY and RM
562	3c88abd0	Florent Chuffart
563	3c88abd0	Florent Chuffart	* common_wp_pval, the p-value chi square test obtained from LLR3
564	3c88abd0	Florent Chuffart	(1-pchisq(2.LLR3, df=2))
565	3c88abd0	Florent Chuffart
566	3c88abd0	Florent Chuffart	* diff, the dyads shift between the positions in the two strains
567	3961deb6	Florent Chuffart	(in bp)
568	3c88abd0	Florent Chuffart
569	3c88abd0	Florent Chuffart	The common UNR map for BY and RM strains is collected in the result
570	3c88abd0	Florent Chuffart	directory and is called BY_RM_common_unr.tab. It is composed of the
571	3c88abd0	Florent Chuffart	following columns:
572	3c88abd0	Florent Chuffart
573	3c88abd0	Florent Chuffart	* cur_index, the index of the CUR
574	3c88abd0	Florent Chuffart
575	3c88abd0	Florent Chuffart	* index_nuc_BY, the index of the BY nucleosome in the CUR
576	3c88abd0	Florent Chuffart
577	3c88abd0	Florent Chuffart	* index_nuc_RM,the index of the RM nucleosome in the CUR
578	3c88abd0	Florent Chuffart
579	3c88abd0	Florent Chuffart	To execute this script, run the following command in your R console:
580	3c88abd0	Florent Chuffart
581	3c88abd0	Florent Chuffart	source("src/current/extract_maps.R")
582	3c88abd0	Florent Chuffart
583	3c88abd0	Florent Chuffart
584	3c88abd0	Florent Chuffart	The Script translate_common_wp.R
585	3c88abd0	Florent Chuffart	--------------------------------
586	3c88abd0	Florent Chuffart
587	3c88abd0	Florent Chuffart	This script is used to translate common well-positioned nucleosome
588	3961deb6	Florent Chuffart	positions from a strain to another strain and stores it into a table.
589	3c88abd0	Florent Chuffart
590	3c88abd0	Florent Chuffart	For example, the file results/2014-04/RM_wp_tr_2_BY.tab contains RM
591	3961deb6	Florent Chuffart	well-positioned nucleosomes translated into the BY genome coordinates.
592	3c88abd0	Florent Chuffart	It is composed of following columns:
593	3c88abd0	Florent Chuffart
594	3c88abd0	Florent Chuffart	* strain_ref, the reference genome (in which positioned are
595	3c88abd0	Florent Chuffart	defined)
596	3c88abd0	Florent Chuffart
597	3c88abd0	Florent Chuffart	* begin, the translated lower bound of the nucleosome
598	3c88abd0	Florent Chuffart
599	3c88abd0	Florent Chuffart	* end, the translated upper bound of the nucleosome
600	3c88abd0	Florent Chuffart
601	3c88abd0	Florent Chuffart	* chr, the number of chromosomes for the reference genome (in
602	3c88abd0	Florent Chuffart	which positioned are defined)
603	3c88abd0	Florent Chuffart
604	3c88abd0	Florent Chuffart	* length, the length of the nucleosome (could be negative)
605	3c88abd0	Florent Chuffart
606	3c88abd0	Florent Chuffart	* cur_index, the index of the CUR
607	3c88abd0	Florent Chuffart
608	3c88abd0	Florent Chuffart	* index_nuc, the index of the nucleosome in the CUR
609	3c88abd0	Florent Chuffart
610	3c88abd0	Florent Chuffart	To execute this script, run the following command in your R console:
611	3c88abd0	Florent Chuffart
612	3c88abd0	Florent Chuffart	source("src/current/translate_common_wp.R")
613	3c88abd0	Florent Chuffart
614	3c88abd0	Florent Chuffart
615	3c88abd0	Florent Chuffart	The Script split_samples.R
616	3c88abd0	Florent Chuffart	--------------------------
617	3c88abd0	Florent Chuffart
618	3961deb6	Florent Chuffart	To optimize memory space usage, we split and compress TemplateFilter
619	3c88abd0	Florent Chuffart	input files according to their corresponding chromosome. for example,
620	3c88abd0	Florent Chuffart	sample_1_TF.tab will be split into :
621	3c88abd0	Florent Chuffart
622	3c88abd0	Florent Chuffart	* sample_1_chr_1_splited_sample.tab.gz
623	3c88abd0	Florent Chuffart
624	3c88abd0	Florent Chuffart	* sample_1_chr_2_splited_sample.tab.gz
625	3c88abd0	Florent Chuffart
626	3c88abd0	Florent Chuffart	* ...
627	3c88abd0	Florent Chuffart
628	3c88abd0	Florent Chuffart	* sample_1_chr_17_splited_sample.tab.gz
629	3c88abd0	Florent Chuffart
630	3c88abd0	Florent Chuffart	To execute this script, run the following command in your R console:
631	3c88abd0	Florent Chuffart
632	3c88abd0	Florent Chuffart	source("src/current/split_samples.R")
633	3c88abd0	Florent Chuffart
634	3c88abd0	Florent Chuffart
635	3c88abd0	Florent Chuffart	The Script count_reads.R
636	3c88abd0	Florent Chuffart	------------------------
637	3c88abd0	Florent Chuffart
638	3c88abd0	Florent Chuffart	To associate a number of observations (read) to each nucleosome we run
639	3c88abd0	Florent Chuffart	the script count_reads.R. It produces the files
640	3c88abd0	Florent Chuffart	BY_RM_H3K14ac_wp_and_nbreads.tab,
641	3c88abd0	Florent Chuffart	BY_RM_H3K14ac_unr_and_nbreads.tab
642	3c88abd0	Florent Chuffart	BY_RM_Mnase_Seq_wp_and_nbreads.tab and
643	3c88abd0	Florent Chuffart	BY_RM_Mnase_Seq_unr_and_nbreads.tab for H3K14ac common well-
644	3c88abd0	Florent Chuffart	positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned
645	3c88abd0	Florent Chuffart	nucleosomes and Mnase UNRs respectively.
646	3c88abd0	Florent Chuffart
647	3c88abd0	Florent Chuffart	For example, the file BY_RM_H3K14ac_unr_and_nbreads.tab contains
648	3c88abd0	Florent Chuffart	counted reads for well-positioned nucleosomes with the experimental
649	3c88abd0	Florent Chuffart	condition ChIP H3K14ac. It is composed of the following columns:
650	3c88abd0	Florent Chuffart
651	3c88abd0	Florent Chuffart	* chr_BY, the number of the chromosome for BY
652	3c88abd0	Florent Chuffart
653	3c88abd0	Florent Chuffart	* lower_bound_BY, the lower bound of the nucleosome for BY
654	3c88abd0	Florent Chuffart
655	3c88abd0	Florent Chuffart	* upper_bound_BY, the upper bound of the nucleosome for BY
656	3c88abd0	Florent Chuffart
657	3c88abd0	Florent Chuffart	* index_nuc_BY, the index of the BY nucleosome in the CUR for BY
658	3c88abd0	Florent Chuffart
659	3c88abd0	Florent Chuffart	* chr_RM, the number of the chromosome for RM
660	3c88abd0	Florent Chuffart
661	3c88abd0	Florent Chuffart	* lower_bound_RM, the lower bound of the nucleosome for RM
662	3c88abd0	Florent Chuffart
663	3c88abd0	Florent Chuffart	* upper_bound_RM, the upper bound of the nucleosome for RM
664	3c88abd0	Florent Chuffart
665	3c88abd0	Florent Chuffart	* index_nuc_RM,the index of the RM nucleosome in the CUR for RM
666	3c88abd0	Florent Chuffart
667	3c88abd0	Florent Chuffart	* cur_index, index of the CUR
668	3c88abd0	Florent Chuffart
669	3c88abd0	Florent Chuffart	* BY_H3K14ac_36, the number of reads for the current nucleosome
670	3c88abd0	Florent Chuffart	for the sample 36
671	3c88abd0	Florent Chuffart
672	3c88abd0	Florent Chuffart	* BY_H3K14ac_37, #reads for sample 37
673	3c88abd0	Florent Chuffart
674	3c88abd0	Florent Chuffart	* BY_H3K14ac_53, #reads for sample 53
675	3c88abd0	Florent Chuffart
676	3c88abd0	Florent Chuffart	* RM_H3K14ac_38, #reads for sample 38
677	3c88abd0	Florent Chuffart
678	3c88abd0	Florent Chuffart	* RM_H3K14ac_39, #reads for sample 39
679	3c88abd0	Florent Chuffart
680	3c88abd0	Florent Chuffart	To execute this script, run the following command in your R console:
681	3c88abd0	Florent Chuffart
682	3c88abd0	Florent Chuffart	source("src/current/count_reads.R")
683	3c88abd0	Florent Chuffart
684	3c88abd0	Florent Chuffart
685	3c88abd0	Florent Chuffart	The Script get_size_factors.R
686	3c88abd0	Florent Chuffart	-----------------------------
687	3c88abd0	Florent Chuffart
688	3c88abd0	Florent Chuffart	This script uses the DESeq function estimateSizeFactors to compute
689	3c88abd0	Florent Chuffart	the size factor of each sample. It corresponds to normalisation of
690	3c88abd0	Florent Chuffart	read counts from sample to sample, as determined by DESeq. When a
691	3c88abd0	Florent Chuffart	sample has n reads for a nucleosome or a UNR, the normalised count is
692	3c88abd0	Florent Chuffart	n/f where f is the factor contained in this file. The script dumps
693	3c88abd0	Florent Chuffart	computed size factors into the file size_factors.tab. This file has
694	3c88abd0	Florent Chuffart	the form:
695	3c88abd0	Florent Chuffart
696	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
697	3c88abd0	Florent Chuffart	\| sample_id \| wp \| unr \| wpunr \|
698	3c88abd0	Florent Chuffart	+===========+=========+=========+=========+
699	3c88abd0	Florent Chuffart	\| 1 \| 0.87396 \| 0.88097 \| 0.87584 \|
700	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
701	3c88abd0	Florent Chuffart	\| 2 \| 1.07890 \| 1.07440 \| 1.07760 \|
702	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
703	3c88abd0	Florent Chuffart	\| 3 \| 1.06400 \| 1.05890 \| 1.06250 \|
704	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
705	3c88abd0	Florent Chuffart	\| 4 \| 0.85782 \| 0.87948 \| 0.86305 \|
706	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
707	3c88abd0	Florent Chuffart	\| 5 \| 0.97577 \| 0.96590 \| 0.97307 \|
708	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
709	3c88abd0	Florent Chuffart	\| 6 \| 1.19630 \| 1.18120 \| 1.19190 \|
710	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
711	3c88abd0	Florent Chuffart	\| 36 \| 0.93318 \| 0.92762 \| 0.93166 \|
712	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
713	3c88abd0	Florent Chuffart	\| 37 \| 0.48315 \| 0.48453 \| 0.48350 \|
714	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
715	3c88abd0	Florent Chuffart	\| 38 \| 1.11240 \| 1.11210 \| 1.11230 \|
716	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
717	3c88abd0	Florent Chuffart	\| 39 \| 0.89897 \| 0.89917 \| 0.89903 \|
718	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
719	3c88abd0	Florent Chuffart	\| 53 \| 2.22650 \| 2.22700 \| 2.22660 \|
720	3c88abd0	Florent Chuffart	+-----------+---------+---------+---------+
721	3c88abd0	Florent Chuffart
722	3c88abd0	Florent Chuffart	sample_id are given in file samples.csv
723	3c88abd0	Florent Chuffart
724	3961deb6	Florent Chuffart	If you don't know which column to use for normalization, we recommend
725	3961deb6	Florent Chuffart	using wpunr.
726	3c88abd0	Florent Chuffart
727	3961deb6	Florent Chuffart	Here are the details of the factors produced:
728	3c88abd0	Florent Chuffart
729	3c88abd0	Florent Chuffart	* unr: factor computed from data of UNR regions. These regions
730	3c88abd0	Florent Chuffart	are defined for every pairs of aligned genomes (e.g. BY_RM)
731	3c88abd0	Florent Chuffart
732	3c88abd0	Florent Chuffart	* wp: same, but for well-positioned nucleosomes.
733	3c88abd0	Florent Chuffart
734	3c88abd0	Florent Chuffart	* wpunr: both types of regions.
735	3c88abd0	Florent Chuffart
736	3c88abd0	Florent Chuffart	To execute this script, run the following command in your R console:
737	3c88abd0	Florent Chuffart
738	3c88abd0	Florent Chuffart	source("src/current/get_size_factors.R")
739	3c88abd0	Florent Chuffart
740	3c88abd0	Florent Chuffart
741	3c88abd0	Florent Chuffart	The Script launch_deseq.R
742	3c88abd0	Florent Chuffart	-------------------------
743	3c88abd0	Florent Chuffart
744	3c88abd0	Florent Chuffart	Finally, the script launch_deseq.R perform statistical analysis on
745	3c88abd0	Florent Chuffart	each nucleosome using DESeq. It produces files:
746	3c88abd0	Florent Chuffart
747	3c88abd0	Florent Chuffart	* results/current/BY_RM_H3K14ac_wp_snep.tab
748	3c88abd0	Florent Chuffart
749	3c88abd0	Florent Chuffart	* results/current/BY_RM_H3K14ac_unr_snep.tab
750	3c88abd0	Florent Chuffart
751	3c88abd0	Florent Chuffart	* results/current/BY_RM_H3K14ac_wpunr_snep.tab
752	3c88abd0	Florent Chuffart
753	3c88abd0	Florent Chuffart	* results/current/BY_RM_H3K14ac_wp_mnase.tab
754	3c88abd0	Florent Chuffart
755	3c88abd0	Florent Chuffart	* results/current/BY_RM_H3K14ac_unr_mnase.tab
756	3c88abd0	Florent Chuffart
757	3c88abd0	Florent Chuffart	* results/current/BY_RM_H3K14ac_wpunr_mnase.tab
758	3c88abd0	Florent Chuffart
759	3c88abd0	Florent Chuffart	These files are organised with the following columns (see file
760	3c88abd0	Florent Chuffart	BY_RM_H3K14ac_wp_snep.tab for an example):
761	3c88abd0	Florent Chuffart
762	3c88abd0	Florent Chuffart	* chr_BY, the number of the chromosome for BY
763	3c88abd0	Florent Chuffart
764	3c88abd0	Florent Chuffart	* lower_bound_BY, the lower bound of the nucleosome for BY
765	3c88abd0	Florent Chuffart
766	3c88abd0	Florent Chuffart	* upper_bound_BY, the upper bound of the nucleosome for BY
767	3c88abd0	Florent Chuffart
768	3c88abd0	Florent Chuffart	* index_nuc_BY, the index of the BY nucleosome in the CUR for BY
769	3c88abd0	Florent Chuffart
770	3c88abd0	Florent Chuffart	* chr_RM, the number of the chromosome for RM
771	3c88abd0	Florent Chuffart
772	3c88abd0	Florent Chuffart	* lower_bound_RM, the lower bound of the nucleosome for RM
773	3c88abd0	Florent Chuffart
774	3c88abd0	Florent Chuffart	* upper_bound_RM, the upper bound of the nucleosome for RM
775	3c88abd0	Florent Chuffart
776	3c88abd0	Florent Chuffart	* index_nuc_RM,the index of the RM nucleosome in the CUR for RM
777	3c88abd0	Florent Chuffart
778	3c88abd0	Florent Chuffart	* cur_index, index of the CUR
779	3c88abd0	Florent Chuffart
780	3c88abd0	Florent Chuffart	* form
781	3c88abd0	Florent Chuffart
782	3c88abd0	Florent Chuffart	* BY_Mnase_Seq_1, the number of reads for the current nucleosome
783	3c88abd0	Florent Chuffart	for the sample 1
784	3c88abd0	Florent Chuffart
785	3c88abd0	Florent Chuffart	Next columns concern indicators for each sample:
786	3c88abd0	Florent Chuffart
787	3c88abd0	Florent Chuffart	* BY_Mnase_Seq_2, #reads for sample 2
788	3c88abd0	Florent Chuffart
789	3c88abd0	Florent Chuffart	* BY_Mnase_Seq_3, #reads for sample 3
790	3c88abd0	Florent Chuffart
791	3c88abd0	Florent Chuffart	* RM_Mnase_Seq_4, #reads for sample 4
792	3c88abd0	Florent Chuffart
793	3c88abd0	Florent Chuffart	* RM_Mnase_Seq_5, #reads for sample 5
794	3c88abd0	Florent Chuffart
795	3c88abd0	Florent Chuffart	* RM_Mnase_Seq_6, #reads for sample 6
796	3c88abd0	Florent Chuffart
797	3c88abd0	Florent Chuffart	* BY_H3K14ac_36, #reads for sample 36
798	3c88abd0	Florent Chuffart
799	3c88abd0	Florent Chuffart	* BY_H3K14ac_37, #reads for sample 37
800	3c88abd0	Florent Chuffart
801	3c88abd0	Florent Chuffart	* BY_H3K14ac_53, #reads for sample 53
802	3c88abd0	Florent Chuffart
803	3c88abd0	Florent Chuffart	* RM_H3K14ac_38, #reads for sample 38
804	3c88abd0	Florent Chuffart
805	3c88abd0	Florent Chuffart	* RM_H3K14ac_39, #reads for sample 39
806	3c88abd0	Florent Chuffart
807	3c88abd0	Florent Chuffart	The 5 last columns concern DESeq analysis:
808	3c88abd0	Florent Chuffart
809	3c88abd0	Florent Chuffart	* manip[a_manip] strain[a_strain]
810	3c88abd0	Florent Chuffart	manip[a_strain]:strain[a_strain], the manip (marker) effect, the
811	3c88abd0	Florent Chuffart	strain effect and the snep effect. These are the coefficients of
812	3c88abd0	Florent Chuffart	the fitted generalized linear model.
813	3c88abd0	Florent Chuffart
814	3961deb6	Florent Chuffart	* pvalsGLM, the pvalue resulting from the comparison of the GLM
815	3961deb6	Florent Chuffart	model considering the interaction term marker:strain to the GLM
816	3961deb6	Florent Chuffart	model that does not consider it. This is the statsitcial
817	3961deb6	Florent Chuffart	significance of the interaction term and therefore the
818	3961deb6	Florent Chuffart	statistical significance of the SNEP.
819	3c88abd0	Florent Chuffart
820	3c88abd0	Florent Chuffart	* snep_index, a boolean set to TRUE if the pvalueGLM value is
821	3c88abd0	Florent Chuffart	under the threshold computed with FDR function with a rate set to
822	3c88abd0	Florent Chuffart	0.0001.
823	3c88abd0	Florent Chuffart
824	3c88abd0	Florent Chuffart	To execute this script, run the following command in your R console:
825	3c88abd0	Florent Chuffart
826	3c88abd0	Florent Chuffart	source("src/current/launch_deseq.R")
827	3c88abd0	Florent Chuffart
828	3c88abd0	Florent Chuffart
829	3c88abd0	Florent Chuffart	Results: Number of SNEPs
830	3c88abd0	Florent Chuffart	========================
831	3c88abd0	Florent Chuffart
832	3c88abd0	Florent Chuffart	Here are the number of computed SNEPs for each forms.
833	3c88abd0	Florent Chuffart
834	3c88abd0	Florent Chuffart	+-------+---------+-------+---------+
835	3c88abd0	Florent Chuffart	\| form \| strains \| #nucs \| H3K14ac \|
836	3c88abd0	Florent Chuffart	+=======+=========+=======+=========+
837	3c88abd0	Florent Chuffart	\| wp \| BY-RM \| 30464 \| 3549 \|
838	3c88abd0	Florent Chuffart	+-------+---------+-------+---------+
839	3c88abd0	Florent Chuffart	\| unr \| BY-RM \| 9497 \| 1559 \|
840	3c88abd0	Florent Chuffart	+-------+---------+-------+---------+
841	3c88abd0	Florent Chuffart	\| wpunr \| BY-RM \| 39961 \| 5240 \|
842	3c88abd0	Florent Chuffart	+-------+---------+-------+---------+
843	3c88abd0	Florent Chuffart
844	3c88abd0	Florent Chuffart
845	3c88abd0	Florent Chuffart	APPENDICE: Generate .c2c Files
846	3c88abd0	Florent Chuffart	==============================
847	3c88abd0	Florent Chuffart
848	3961deb6	Florent Chuffart	The .c2c files is a simple table that describes how two genome
849	3961deb6	Florent Chuffart	sequences are aligned. This file can be generated by using scripts
850	3961deb6	Florent Chuffart	that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics
851	3961deb6	Florent Chuffart	2010) and which we provide in this release of NucleoMiner2.
852	3c88abd0	Florent Chuffart
853	3961deb6	Florent Chuffart	To use these scripts on your UNIX/LINUX computer you need first to
854	3961deb6	Florent Chuffart	install MUMmer which is designed to rapidly align entire genomes,
855	3961deb6	Florent Chuffart	whether in complete or draft form.
856	3c88abd0	Florent Chuffart
857	3c88abd0	Florent Chuffart
858	3961deb6	Florent Chuffart	Installing MUMmer
859	3961deb6	Florent Chuffart	-----------------
860	3c88abd0	Florent Chuffart
861	c25275e2	Florent Chuffart	Get the last version of MUMmer archive on your computer
862	3961deb6	Florent Chuffart	(MUMmer3.23.tar.gz is provided in the directory deps of your working
863	c25275e2	Florent Chuffart	directory). Copy it in a dedicated directory. Install it locally into
864	c25275e2	Florent Chuffart	the src folder of you working directory by typing (working directory):
865	3c88abd0	Florent Chuffart
866	3961deb6	Florent Chuffart	tar -xvzf MUMmer3.23.tar.gz
867	3c88abd0	Florent Chuffart
868	c25275e2	Florent Chuffart	cd src
869	c25275e2	Florent Chuffart	tar xfvz ../deps/MUMmer3.23.tar.gz
870	c25275e2	Florent Chuffart	cd MUMmer3.23
871	c25275e2	Florent Chuffart	make check
872	c25275e2	Florent Chuffart	make install
873	3c88abd0	Florent Chuffart
874	3c88abd0	Florent Chuffart
875	c25275e2	Florent Chuffart	Installing NucleoMiner 1.0 scripts
876	c25275e2	Florent Chuffart	----------------------------------
877	3c88abd0	Florent Chuffart
878	3961deb6	Florent Chuffart	Get the nucleominer-1.0.tar.gz archive on your computer (this archive
879	3961deb6	Florent Chuffart	is provided in the directory deps of your working directory). Install
880	3961deb6	Florent Chuffart	it locally into the src folder of you working directory by typing
881	3961deb6	Florent Chuffart	(working directory):
882	3c88abd0	Florent Chuffart
883	c25275e2	Florent Chuffart	cd src
884	c25275e2	Florent Chuffart	tar xfvz ../deps/nucleominer-1.0.tar.gz
885	c25275e2	Florent Chuffart	cd ..
886	3c88abd0	Florent Chuffart
887	3961deb6	Florent Chuffart	This creates a directory that contains NucleoMiner 1.0 scripts
888	c25275e2	Florent Chuffart	(src/nucleominer-1.0/scripts).
889	3c88abd0	Florent Chuffart
890	3c88abd0	Florent Chuffart
891	3c88abd0	Florent Chuffart	Generate .c2c Files
892	3c88abd0	Florent Chuffart	-------------------
893	3c88abd0	Florent Chuffart
894	3c88abd0	Florent Chuffart	To generate .c2c files you need to type the following command in a
895	3c88abd0	Florent Chuffart	terminal:
896	3c88abd0	Florent Chuffart
897	c25275e2	Florent Chuffart	export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts
898	c25275e2	Florent Chuffart	export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/
899	c25275e2	Florent Chuffart	NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
900	c25275e2	Florent Chuffart	data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
901	c25275e2	Florent Chuffart	data/byxrm 2>NMgxcomp.log
902	3c88abd0	Florent Chuffart
903	c25275e2	Florent Chuffart	After execution, the directory data will hold the .c2c files.

LBMC » NucleoMiner

root / README @ master