Tutorial
========

This tutorial describes the steps needed to perform a quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project's root directory.

This tutorial is divided into two main parts. The first part covers the Python script `wf.py`, which aligns and converts short sequence reads. The second part covers the R scripts that extract nucleosome-level information (nucleosome positions and indicators) from the dataset.

Experimental Dataset, Working Directory and Configuration File
--------------------------------------------------------------

Working Directory Organisation
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

After installing the NucleoMiner2 environment (see the previous section), go to the root working directory of the tutorial by typing the following command in a terminal:

.. code:: bash

    cd doc/Chuffart_NM2_workdir/

Retrieving Experimental Dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.

$$$ TODO explain how to organise the experimental dataset in the `data` directory of the working directory.

In this tutorial, we compare the nucleosomes of two yeast strains, BY and RM. For each strain, MNase-seq was performed, as well as ChIP-seq using an antibody recognizing the H3K14ac epigenetic mark. Illumina sequencing was performed in single-read mode with 50 bp reads.

The dataset is composed of 55 files organised as follows:

- 3 replicates for BY MNase-seq

  - sample 1 (5 fastq.gz files)
  - sample 2 (5 fastq.gz files)
  - sample 3 (4 fastq.gz files)

- 3 replicates for RM MNase-seq

  - sample 4 (4 fastq.gz files)
  - sample 5 (4 fastq.gz files)
  - sample 6 (5 fastq.gz files)

- 3 replicates for BY ChIP-seq H3K14ac

  - sample 36 (5 fastq.gz files)
  - sample 37 (5 fastq.gz files)
  - sample 53 (9 fastq.gz files)

- 2 replicates for RM ChIP-seq H3K14ac

  - sample 38 (5 fastq.gz files)
  - sample 39 (4 fastq.gz files)

Python and R Common Configuration File
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

First, we need to define the configuration variables that will be passed to the Python and R scripts. These variables are defined in the file `configurator.py`. Executing this Python script dumps the variables into the `nucleominer_config.json` file, which is then read by both the R and Python scripts.

If you need to adapt variable values (paths, default parameters...), edit `configurator.py`. Then go to the root directory of your project and run the following command to dump the configuration file:

.. code:: bash

    python src/current/configurator.py

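
The dump-and-reload mechanism can be sketched as follows. This is a minimal illustration, not the actual content of `configurator.py`: the variable values and the helper names `dump_config` and `load_config` are assumptions made for the example, and a temporary file stands in for `nucleominer_config.json`.

.. code:: python

    import json
    import os
    import tempfile

    # Hypothetical configuration variables; the real ones live in configurator.py.
    config = {
        "ALIGN_DIR": "data/align",   # assumed value
        "LOG_DIR": "data/log",       # assumed value
        "READ_LENGTH": 50,           # read length used in this tutorial
    }


    def dump_config(conf, path):
        """Write the configuration dictionary to a JSON file."""
        with open(path, "w") as handle:
            json.dump(conf, handle, indent=2, sort_keys=True)


    def load_config(path):
        """Read the configuration back, as a downstream script would."""
        with open(path) as handle:
            return json.load(handle)


    # Round trip through a temporary file standing in for nucleominer_config.json.
    json_path = os.path.join(tempfile.mkdtemp(), "nucleominer_config.json")
    dump_config(config, json_path)
    reloaded = load_config(json_path)
    print(reloaded["READ_LENGTH"])

Because JSON is language-neutral, a single dumped file can serve as the only source of configuration for both the Python and the R side.
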
Preprocessing Illumina Fastq Reads for Each Sample
--------------------------------------------------

Once the variables and the design have been specified, the script `wf.py` runs the whole preprocessing automatically, without further intervention. To run it, use the following command:

.. code:: bash

    python src/current/wf.py

This preprocessing consists of 4 steps embedded in the `wf.py` script; they are described below. As a preamble, the script computes `samples`, `samples_mnase` and `strains`, which are used throughout the 4 steps.

.. autodata:: wf.samples
    :noindex:

.. autodata:: wf.samples_mnase
    :noindex:

.. autodata:: wf.strains
    :noindex:

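
As an illustration of what these three variables may contain, the sketch below derives them from the design described earlier (the sample identifiers and strains are those of this tutorial). The data structure and the marker labels `Mnase_Seq` and `H3K14ac` are assumptions; the actual definitions are in `wf.py`.

.. code:: python

    # Hypothetical design table: (sample id, strain, marker).
    design = [
        (1, "BY", "Mnase_Seq"), (2, "BY", "Mnase_Seq"), (3, "BY", "Mnase_Seq"),
        (4, "RM", "Mnase_Seq"), (5, "RM", "Mnase_Seq"), (6, "RM", "Mnase_Seq"),
        (36, "BY", "H3K14ac"), (37, "BY", "H3K14ac"), (53, "BY", "H3K14ac"),
        (38, "RM", "H3K14ac"), (39, "RM", "H3K14ac"),
    ]

    # All samples, as dictionaries that are easy to filter.
    samples = [{"id": i, "strain": s, "marker": m} for i, s, m in design]

    # Only the MNase-seq samples, which are the ones passed to TemplateFilter.
    samples_mnase = [s for s in samples if s["marker"] == "Mnase_Seq"]

    # Sorted list of distinct strains.
    strains = sorted({s["strain"] for s in samples})

    print(len(samples), len(samples_mnase), strains)
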

Creating Bowtie Index from each Reference Genome
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

For each strain, the script `wf.py` then creates a Bowtie index. The Bowtie index of a strain is a compact, searchable representation of its genome (Bowtie builds it with the Burrows-Wheeler transform); Bowtie uses it to align reads. The part of the script performing this step is the following:

.. literalinclude:: ../../../snep/src/current/wf.py
    :start-after: # _STARTOF_ step_1
    :end-before: # _ENDOF_ step_1
    :language: python

As an indication, the following table summarizes the file sizes and process durations that we observed when running this step on a Linux server.

====== ====================== ====================== ================
strain fasta genome file size bowtie index file size process duration
====== ====================== ====================== ================
BY     12 MB                  25 MB                  11 s
RM     12 MB                  24 MB                  9 s
====== ====================== ====================== ================

Aligning Reads to Reference Genome
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Next, the `wf.py` script launches Bowtie to align reads to the reference genome. It produces a `.sam` file that is converted into a `.bed` file. The `bowtie`, `samtools` and `bedtools` binaries are called through the Python `subprocess` module. This step is performed by the following part of the script:

.. literalinclude:: ../../../snep/src/current/wf.py
    :start-after: # _STARTOF_ step_2
    :end-before: # _ENDOF_ step_2
    :language: python

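
Wrapping an external binary with `subprocess` typically looks like the sketch below. The helper name `run_command` is hypothetical, and the command shown is a harmless stand-in (the Python interpreter itself), so that no Bowtie, samtools or bedtools option is assumed.

.. code:: python

    import subprocess
    import sys


    def run_command(args):
        """Run an external command and return its standard output as text.

        Raises subprocess.CalledProcessError if the command exits with a
        non-zero status, so that a failed step stops the workflow.
        """
        completed = subprocess.run(
            args,
            check=True,                # fail loudly on non-zero exit status
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,   # decode output as text
        )
        return completed.stdout


    # Stand-in for a call to an aligner or a format converter.
    output = run_command([sys.executable, "-c", "print('aligned')"])
    print(output.strip())

Passing the command as a list of arguments avoids shell quoting issues with file names.
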
Convert Aligned Reads into TemplateFilter Format
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

TemplateFilter uses a specific input format for reads, so the `.bed` files must be converted. TemplateFilter expects reads in the following format: `chr`, `coord`, `strand` and `#reads`, where:

- `chr` is the number of the chromosome;
- `coord` is the coordinate of the read;
- `strand` is `F` for forward and `R` for reverse;
- `#reads` is the number of reads covering this position.

The fields of each entry are tab-separated.

WARNING: for reverse strands, Bowtie returns the position of the first nucleotide on the left-hand side, whereas TemplateFilter expects the first one on the right-hand side. NucleoMiner2 takes this into account by adding the read length (in our case 50) to the coordinates of reverse reads.

This step is performed by the following part of the `wf.py` script:

.. literalinclude:: ../../../snep/src/current/wf.py
    :start-after: # _STARTOF_ step_3
    :end-before: # _ENDOF_ step_3
    :language: python

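
The conversion rules described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the code of `wf.py`: the function name `bed_to_tf` is hypothetical, reads are given as `(chr, start, strand)` tuples with `+`/`-` strands as in BED files, and `start` is the left-most coordinate reported by the aligner.

.. code:: python

    from collections import Counter

    READ_LENGTH = 50  # single-read length used in this tutorial


    def bed_to_tf(reads, read_length=READ_LENGTH):
        """Convert (chr, start, strand) reads into TemplateFilter-style lines.

        Reverse reads are shifted by the read length, and reads sharing the
        same (chr, coord, strand) are counted together.
        """
        counts = Counter()
        for chrom, start, strand in reads:
            if strand == "+":
                counts[(chrom, start, "F")] += 1
            else:
                counts[(chrom, start + read_length, "R")] += 1
        # One tab-separated line per position: chr, coord, strand, #reads.
        return [
            "\t".join(str(field) for field in (chrom, coord, strand, n))
            for (chrom, coord, strand), n in sorted(counts.items())
        ]


    # Made-up reads: two forward reads at the same position, one reverse read.
    example_reads = [(1, 100, "+"), (1, 100, "+"), (1, 200, "-")]
    for line in bed_to_tf(example_reads):
        print(line)

Aggregating identical positions is what makes the TemplateFilter input files much smaller than the `.bed` files.
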
The following table summarizes the number of reads, the file sizes and the process durations that we observed when running the last two steps. In our case, the alignment processes were multithreaded over 3 cores.

== ============== ========================== ====== ================ ================== ================
id Illumina reads aligned and filtered reads ratio  `.bed` file size TF input file size process duration
== ============== ========================== ====== ================ ================== ================
1  16436138       10199695                   62.06% 1064 MB          60 MB              383 s
2  16911132       12512727                   73.99% 1298 MB          64 MB              437 s
3  15946902       12340426                   77.38% 1280 MB          65 MB              423 s
4  13765584       10381903                   75.42% 931 MB           59 MB              352 s
5  15168268       11502855                   75.83% 1031 MB          64 MB              386 s
6  18850820       14024905                   74.40% 1254 MB          69 MB              482 s
36 17715118       14092985                   79.55% 1404 MB          68 MB              483 s
37 17288466       7402082                    42.82% 741 MB           48 MB              339 s
38 16116394       13178457                   81.77% 1101 MB          63 MB              420 s
39 14241106       10537228                   73.99% 880 MB           57 MB              348 s
53 40876476       33780065                   82.64% 3316 MB          103 MB             1165 s
== ============== ========================== ====== ================ ================== ================

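
The `ratio` column is the number of aligned and filtered reads divided by the number of Illumina reads. As a quick check, the sketch below recomputes it for two rows of the table (the helper name `aligned_ratio` is hypothetical).

.. code:: python

    def aligned_ratio(aligned_reads, illumina_reads):
        """Return the percentage of reads kept after alignment and filtering."""
        return 100.0 * aligned_reads / illumina_reads


    # Values taken from the table (samples 1 and 37).
    ratio_1 = aligned_ratio(10199695, 16436138)
    ratio_37 = aligned_ratio(7402082, 17288466)
    print(round(ratio_1, 2), round(ratio_37, 2))
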
Run TemplateFilter on MNase Samples
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, for each MNase sample, we run the TemplateFilter analysis.

WARNING: TemplateFilter returns a list of nucleosomes. Each nucleosome is defined by its center and its width. An odd width leads us to consider non-integer lower and upper bounds.

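
To see why an odd width gives non-integer bounds, assume that the bounds are placed symmetrically around the center, at `center - width/2` and `center + width/2`. This convention and the function name `nucleosome_bounds` are assumptions made for the illustration.

.. code:: python

    def nucleosome_bounds(center, width):
        """Return (lower_bound, upper_bound) placed symmetrically around center."""
        half = width / 2.0
        return center - half, center + half


    even_bounds = nucleosome_bounds(1000, 146)  # even width: whole-number bounds
    odd_bounds = nucleosome_bounds(1000, 147)   # odd width: bounds end in .5
    print(even_bounds, odd_bounds)
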
WARNING: TemplateFilter was not designed to handle replicates. We therefore recommend keeping as many nucleosomes as possible and filtering out the aberrant ones afterwards, taking advantage of the replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high value of overlap (300%).

This step is performed by the following part of the `wf.py` script:

.. literalinclude:: ../../../snep/src/current/wf.py
    :start-after: # _STARTOF_ step_4
    :end-before: # _ENDOF_ step_4
    :language: python

== ====== ========== ============= ================
id strain found nucs nuc file size process duration
== ====== ========== ============= ================
1  BY     96214      68 MB         1022 s
2  BY     91694      65 MB         1038 s
3  BY     91205      65 MB         1036 s
4  RM     88076      62 MB         984 s
5  RM     90141      64 MB         967 s
6  RM     87517      62 MB         980 s
== ====== ========== ============= ================

..
..
.. - libcoverage.py
.. - wf.py
..
..
..
..
..
..
.. In order to simplify the design of experiment, we consider Mnase as a marker.
.. For each couple `(strain, marker)` we perform 3 replicates. So, theoretically
.. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2
.. replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is identified by a
.. unique identifier. The file `CSV_SAMPLE_FILE` sums up this information.
..
.. .. autodata:: configurator.CSV_SAMPLE_FILE
..     :noindex:
..
.. We use a convention to link sample and Illumina fastq outputs. Illumina output
.. files of the sample `ID` will be stored in the directory
.. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be
.. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
..
.. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
..     :noindex:
..
.. For BY (resp. RM and YJM) we use the following reference genome
.. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
.. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and
.. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
.. The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
..
.. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
..     :noindex:
..
.. Each chromosome/contig is identified in the fasta file by an obscure identifier.
.. For example, BY chromosome I is identified by `gi\|144228165\|ref\|NC_001133.7\|` whereas
.. TemplateFilter expects an integer. So, we translate it. The index
.. `FASTA_INDEXES` stores this translation.
..
.. .. autodata:: configurator.FASTA_INDEXES
..     :noindex:
..
.. From a pragmatic point of view we discard some parts of the genome (repeated
.. sequences etc.). The list of the blacklisted areas is explicitly detailed in
.. `AREA_BLACK_LIST`.
..
.. .. autodata:: configurator.AREA_BLACK_LIST
..     :noindex:
..
.. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use the previously
.. computed .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp.
.. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about
.. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old
.. version of `NucleoMiner2`
.. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
..
.. .. autodata:: configurator.C2C_FILES
..     :noindex:
..
.. `nucleominer` uses specific directories to work in; these are described in
.. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
..
.. Finally, `nucleominer` uses external resources; the paths to these resources
.. are described in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`,
.. `BEDTOOLS_BIN`, `TF_BIN` and `TF_TEMPLATES_FILE`.
..
.. All paths, prefixes and indexes can be changed in the
.. `src/current/nucleominer_config.json` file.
..
.. .. autodata:: wf.json_conf_file
..     :noindex:
..

Inferring Nucleosome Position and Extracting Read Counts
--------------------------------------------------------

The second part of the tutorial uses R (http://www.r-project.org). NucleoMiner2 contains a set of R scripts that are sourced in R from a console launched at the root of your project. These scripts are:

- headers.R
- extract_maps.R
- translate_common_wp.R
- split_samples.R
- count_reads.R
- get_size_factors
- launch_deseq.R

The Script headers.R
^^^^^^^^^^^^^^^^^^^^

The script headers.R is included in all the other R scripts. It is in charge of:

- loading the libraries used in the scripts
- loading the configuration (design, strain, marker...)
- computing and caching Common Uninterrupted Regions (CURs). Caching means storing the information in the computer's memory.

Note that you can customize the function `translate`. This function allows you to use the alignments between genomes when performing various tasks.

- You may want to analyze data from a single strain (e.g. treatment/control, or only a few mutations). In this case, the genome is identical across all samples and you do not need to define particular CURs (the CURs are the chromosomes). Simply use the default `translate` function, which is neutral.

- If you are analyzing data from two or more strains (as NucleoMiner2 was designed for), then you need to translate the coordinates of one genome into the coordinates of another one. To do this, align the two genomes, which produces a .c2c file (see Appendix "Generate .c2c Files"), then use it to produce the list of regions and customise `translate`.

This tutorial falls into the second case. To perform all these steps, run the following command in your R console:

.. code:: r

    source("src/current/headers.R")

The Script extract_maps.R
^^^^^^^^^^^^^^^^^^^^^^^^^

This script is in charge of extracting the maps of well-positioned and sensitive nucleosomes. First, it computes intra- and inter-strain matches of nucleosome maps for each CUR. This step can be executed in parallel on many cores using the BoT library. Next, it collects the results and produces maps of well-positioned nucleosomes, sensitive nucleosomes and Unaligned Nucleosomal Regions (UNRs).

The map of well-positioned nucleosomes for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of the following columns:

- chr, the number of the chromosome
- lower_bound, the lower bound of the nucleosome
- upper_bound, the upper bound of the nucleosome
- cur_index, the index of the CUR
- index_nuc, the index of the nucleosome in the CUR
- wp, 1 if it is a well-positioned nucleosome, 0 otherwise
- nb_reads, the number of reads that support this nucleosome
- nb_nucs, the number of TemplateFilter nucleosomes across replicates (= the number of replicates in which it is a well-positioned nucleosome)
- llr_1, for a well-positioned nucleosome, the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosome on the chain
- llr_2, for a well-positioned nucleosome, the LLR1 between the second and the third TemplateFilter nucleosome on the chain
- wp_llr, for a well-positioned nucleosome, the LLR2 that compares the consistency of the positioning over all TemplateFilter nucleosomes
- wp_pval, for a well-positioned nucleosome, the p-value of the chi-square test obtained from LLR2 (`1-pchisq(2*LLR2, df=4)`)
- dyad_shift, for a well-positioned nucleosome, the shift between the two extreme TemplateFilter nucleosome dyad positions

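
The p-value `wp_pval` is the upper tail probability of a chi-square distribution. For an even number of degrees of freedom this tail has a closed form, which the following sketch uses to reproduce `1-pchisq(2*LLR2, df=4)` without any statistical library. The function names and the example LLR value are made up for the illustration.

.. code:: python

    import math


    def chi2_upper_tail(x, df):
        """Upper tail probability P(X > x) of a chi-square law with even df."""
        if df <= 0 or df % 2 != 0:
            raise ValueError("this closed form only holds for even, positive df")
        half = x / 2.0
        # P(X > x) = exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        terms = (half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * sum(terms)


    def wp_pval(llr2):
        """Equivalent of the R expression 1 - pchisq(2 * LLR2, df=4)."""
        return chi2_upper_tail(2.0 * llr2, df=4)


    print(wp_pval(5.0))

With `df=4` the expression reduces to `exp(-LLR2) * (1 + LLR2)`; calling `chi2_upper_tail` with `df=2` gives the simpler `exp(-LLR)`.
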
The sensitive map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of the following columns:

- chr, the number of the chromosome
- lower_bound, the lower bound of the nucleosome
- upper_bound, the upper bound of the nucleosome
- cur_index, the index of the CUR

The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of the following columns:

- cur_index, the index of the CUR
- index_nuc_BY, the index of the BY nucleosome in the CUR
- index_nuc_RM, the index of the RM nucleosome in the CUR
- llr_score, the LLR3 score that estimates conservation between the positions in BY and RM
- common_wp_pval, the p-value of the chi-square test obtained from LLR3 (`1-pchisq(2*LLR3, df=2)`)
- diff, the dyad shift between the positions in the two strains (in bp)

The common UNR map for the BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:

- cur_index, the index of the CUR
- index_nuc_BY, the index of the BY nucleosome in the CUR
- index_nuc_RM, the index of the RM nucleosome in the CUR

To execute this script, run the following command in your R console:

.. code:: r

    source("src/current/extract_maps.R")

The Script translate_common_wp.R
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This script translates the positions of common well-positioned nucleosomes from one strain to another and stores them in a table.

For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosomes translated into the BY genome coordinates. It is composed of the following columns:

- strain_ref, the reference genome (in which positions are defined)
- begin, the translated lower bound of the nucleosome
- end, the translated upper bound of the nucleosome
- chr, the number of the chromosome in the reference genome (in which positions are defined)
- length, the length of the nucleosome (could be negative)
- cur_index, the index of the CUR
- index_nuc, the index of the nucleosome in the CUR

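
A table with these columns can be inspected with the standard `csv` module. The sketch assumes a tab-separated file with a header line holding the column names listed in this section; the rows are made up and only serve to show how entries with a negative `length` could be spotted.

.. code:: python

    import csv
    import io

    # Made-up content standing in for a translated nucleosome table.
    content = (
        "strain_ref\tbegin\tend\tchr\tlength\tcur_index\tindex_nuc\n"
        "BY\t1000\t1146\t1\t146\t1\t1\n"
        "BY\t2300\t2154\t1\t-146\t1\t2\n"
    )

    rows = list(csv.DictReader(io.StringIO(content), delimiter="\t"))

    # Keep the nucleosomes whose translated length is negative.
    negative = [row for row in rows if int(row["length"]) < 0]
    print([row["index_nuc"] for row in negative])
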
To execute this script, run the following command in your R console:

.. code:: r

    source("src/current/translate_common_wp.R")

The Script split_samples.R
^^^^^^^^^^^^^^^^^^^^^^^^^^

To optimize memory usage, we split and compress the TemplateFilter input files according to their chromosome. For example, `sample_1_TF.tab` is split into:

- sample_1_chr_1_splited_sample.tab.gz
- sample_1_chr_2_splited_sample.tab.gz
- ...
- sample_1_chr_17_splited_sample.tab.gz

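
The splitting logic can be sketched as follows. The actual script is written in R; the Python sketch below only illustrates the idea. It assumes that the first tab-separated field of each line is the chromosome number, as in the TemplateFilter input format, and the function name `split_by_chromosome` is hypothetical.

.. code:: python

    import gzip
    import os
    import tempfile
    from collections import defaultdict


    def split_by_chromosome(lines, sample_id, out_dir):
        """Write one gzip-compressed file per chromosome and return their paths."""
        per_chr = defaultdict(list)
        for line in lines:
            chrom = line.split("\t", 1)[0]
            per_chr[chrom].append(line)
        paths = {}
        for chrom, chr_lines in per_chr.items():
            name = "sample_%s_chr_%s_splited_sample.tab.gz" % (sample_id, chrom)
            path = os.path.join(out_dir, name)
            with gzip.open(path, "wt") as handle:
                handle.write("\n".join(chr_lines) + "\n")
            paths[chrom] = path
        return paths


    # Made-up TemplateFilter input lines: chr, coord, strand, #reads.
    tf_lines = ["1\t100\tF\t2", "1\t250\tR\t1", "2\t40\tF\t3"]
    out_paths = split_by_chromosome(tf_lines, 1, tempfile.mkdtemp())
    print(sorted(os.path.basename(p) for p in out_paths.values()))

Later steps can then load a single chromosome at a time instead of the whole sample.
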
489	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
490	b20637ed	Florent Chuffart
491	b20637ed	Florent Chuffart	.. code:: r
492	b20637ed	Florent Chuffart
493	e5603c3f	Florent Chuffart	    source("src/current/split_samples.R")
494	b20637ed	Florent Chuffart
495	b20637ed	Florent Chuffart
496	e5603c3f	Florent Chuffart	The Script count_reads.R
497	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^
498	e5603c3f	Florent Chuffart
499	e5603c3f	Florent Chuffart	To associate a number of observations (reads) with each nucleosome, we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab`, `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab`
500	e5603c3f	Florent Chuffart	for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned nucleosomes and Mnase UNRs, respectively.
501	e5603c3f	Florent Chuffart
502	e5603c3f	Florent Chuffart	For example, the file `BY_RM_H3K14ac_wp_and_nbreads.tab` contains the read counts for well-positioned nucleosomes in the experimental condition ChIP H3K14ac. It is composed of the following columns:
503	e5603c3f	Florent Chuffart
504	e5603c3f	Florent Chuffart	- chr_BY, the number of the chromosome for BY
505	e5603c3f	Florent Chuffart	- lower_bound_BY, the lower bound of the nucleosome for BY
506	e5603c3f	Florent Chuffart	- upper_bound_BY, the upper bound of the nucleosome for BY
507	e5603c3f	Florent Chuffart	- index_nuc_BY, the index of the BY nucleosome in the CUR for BY
508	e5603c3f	Florent Chuffart	- chr_RM, the number of the chromosome for RM
509	e5603c3f	Florent Chuffart	- lower_bound_RM, the lower bound of the nucleosome for RM
510	e5603c3f	Florent Chuffart	- upper_bound_RM, the upper bound of the nucleosome for RM
511	e5603c3f	Florent Chuffart	- index_nuc_RM, the index of the RM nucleosome in the CUR for RM
512	e5603c3f	Florent Chuffart	- cur_index, the index of the CUR
513	e5603c3f	Florent Chuffart	- BY_H3K14ac_36, the number of reads for the current nucleosome for the sample 36
514	e5603c3f	Florent Chuffart	- BY_H3K14ac_37, #reads for sample 37
515	e5603c3f	Florent Chuffart	- BY_H3K14ac_53, #reads for sample 53
516	e5603c3f	Florent Chuffart	- RM_H3K14ac_38, #reads for sample 38
517	e5603c3f	Florent Chuffart	- RM_H3K14ac_39, #reads for sample 39
518	e5603c3f	Florent Chuffart
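The idea of attaching a read count to each nucleosome can be sketched as below. The tutorial does not describe how `count_reads.R` assigns reads, so treating each read as a single position and counting it when it falls inside the inclusive interval [lower_bound, upper_bound] is an assumption, and the positions and bounds are made up.

```python
from bisect import bisect_left, bisect_right


def count_reads(read_positions, lower_bound, upper_bound):
    """Count reads whose position lies in [lower_bound, upper_bound].

    read_positions must be sorted in increasing order.
    """
    first = bisect_left(read_positions, lower_bound)
    last = bisect_right(read_positions, upper_bound)
    return last - first


# Hypothetical read positions (one chromosome) for two samples.
reads = {
    "BY_H3K14ac_36": sorted([105, 130, 160, 420, 455]),
    "RM_H3K14ac_38": sorted([98, 150, 440]),
}
# Hypothetical nucleosomes as (lower_bound, upper_bound).
nucleosomes = [(100, 246), (400, 546)]

# One dictionary per nucleosome, one entry per sample column.
table = [
    {sample: count_reads(positions, low, high) for sample, positions in reads.items()}
    for low, high in nucleosomes
]
print(table)
```

Sorting the positions once and using binary search keeps each lookup logarithmic in the number of reads.
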
519	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
520	935a568c	Florent Chuffart
521	935a568c	Florent Chuffart	.. code:: r
522	935a568c	Florent Chuffart
523	e5603c3f	Florent Chuffart	    source("src/current/count_reads.R")
524	e5603c3f	Florent Chuffart
525	e5603c3f	Florent Chuffart
526	e5603c3f	Florent Chuffart	The Script get_size_factors.R
527	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
528	e5603c3f	Florent Chuffart
529	e5603c3f	Florent Chuffart
530	e5603c3f	Florent Chuffart	This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. The size factor is used to normalise read counts from sample to sample, as determined by DESeq. When a sample has n reads for a nucleosome or a UNR,
531	e5603c3f	Florent Chuffart	the normalised count is n/f, where f is the size factor of that sample.
532	e5603c3f	Florent Chuffart	The script dumps the computed size factors into the file `size_factors.tab`. This file has the form:
533	e5603c3f	Florent Chuffart
534	e5603c3f	Florent Chuffart	========= ======= ======= =======
535	e5603c3f	Florent Chuffart	sample_id wp      unr     wpunr
536	e5603c3f	Florent Chuffart	========= ======= ======= =======
537	e5603c3f	Florent Chuffart	1         0.87396 0.88097 0.87584
538	e5603c3f	Florent Chuffart	2         1.07890 1.07440 1.07760
539	e5603c3f	Florent Chuffart	3         1.06400 1.05890 1.06250
540	e5603c3f	Florent Chuffart	4         0.85782 0.87948 0.86305
541	e5603c3f	Florent Chuffart	5         0.97577 0.96590 0.97307
542	e5603c3f	Florent Chuffart	6         1.19630 1.18120 1.19190
543	e5603c3f	Florent Chuffart	36        0.93318 0.92762 0.93166
544	e5603c3f	Florent Chuffart	37        0.48315 0.48453 0.48350
545	e5603c3f	Florent Chuffart	38        1.11240 1.11210 1.11230
546	e5603c3f	Florent Chuffart	39        0.89897 0.89917 0.89903
547	e5603c3f	Florent Chuffart	53        2.22650 2.22700 2.22660
548	e5603c3f	Florent Chuffart	========= ======= ======= =======
549	e5603c3f	Florent Chuffart
550	e5603c3f	Florent Chuffart	The sample_id values are given in the file `samples.csv`.
551	935a568c	Florent Chuffart
552	3961deb6	Florent Chuffart	If you don't know which column to use for normalization, we recommend using wpunr.
553	935a568c	Florent Chuffart
554	3961deb6	Florent Chuffart	Here are the details of the factors produced:
555	e5603c3f	Florent Chuffart
556	e5603c3f	Florent Chuffart	- unr: factor computed from the data of UNR regions. These regions are defined for every pair of aligned genomes (e.g. BY_RM).
557	e5603c3f	Florent Chuffart	- wp: same, but for well-positioned nucleosomes.
558	e5603c3f	Florent Chuffart	- wpunr: same, but for both types of regions.
559	e5603c3f	Florent Chuffart
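As a worked illustration of the n/f normalisation, the sketch below divides raw counts by the wpunr size factors of three samples taken from the table above. The raw counts and the helper name `normalise` are made up for the example.

```python
# wpunr size factors copied from the table above (samples 36, 37 and 53).
size_factors = {36: 0.93166, 37: 0.48350, 53: 2.22660}


def normalise(raw_count, sample_id, factors=size_factors):
    """Return n/f: the raw count divided by the sample's size factor."""
    return raw_count / factors[sample_id]


# Hypothetical raw read counts of one nucleosome in each sample.
raw_counts = {36: 100, 37: 50, 53: 230}
normalised = {s: round(normalise(n, s), 2) for s, n in raw_counts.items()}
print(normalised)
```

A factor below 1 (sample 37) scales the count up, and a factor above 1 (sample 53) scales it down, which makes counts comparable across samples of different sequencing depth.
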
560	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
561	935a568c	Florent Chuffart
562	935a568c	Florent Chuffart	.. code:: r
563	935a568c	Florent Chuffart
564	e5603c3f	Florent Chuffart	    source("src/current/get_size_factors.R")
565	935a568c	Florent Chuffart
566	935a568c	Florent Chuffart
567	e5603c3f	Florent Chuffart	The Script launch_deseq.R
568	935a568c	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^
569	935a568c	Florent Chuffart
570	e5603c3f	Florent Chuffart	Finally, the script `launch_deseq.R` performs a statistical analysis on each nucleosome using `DESeq`. It produces the following files:
571	e5603c3f	Florent Chuffart
572	e5603c3f	Florent Chuffart	- results/current/BY_RM_H3K14ac_wp_snep.tab
573	e5603c3f	Florent Chuffart	- results/current/BY_RM_H3K14ac_unr_snep.tab
574	e5603c3f	Florent Chuffart	- results/current/BY_RM_H3K14ac_wpunr_snep.tab
575	e5603c3f	Florent Chuffart	- results/current/BY_RM_H3K14ac_wp_mnase.tab
576	e5603c3f	Florent Chuffart	- results/current/BY_RM_H3K14ac_unr_mnase.tab
577	e5603c3f	Florent Chuffart	- results/current/BY_RM_H3K14ac_wpunr_mnase.tab
578	e5603c3f	Florent Chuffart
579	e5603c3f	Florent Chuffart	These files are organised with the following columns (see file `BY_RM_H3K14ac_wp_snep.tab` for an example):
580	e5603c3f	Florent Chuffart
581	e5603c3f	Florent Chuffart	- chr_BY, the number of the chromosome for BY
582	e5603c3f	Florent Chuffart	- lower_bound_BY, the lower bound of the nucleosome for BY
583	e5603c3f	Florent Chuffart	- upper_bound_BY, the upper bound of the nucleosome for BY
584	e5603c3f	Florent Chuffart	- index_nuc_BY, the index of the BY nucleosome in the CUR for BY
585	e5603c3f	Florent Chuffart	- chr_RM, the number of the chromosome for RM
586	e5603c3f	Florent Chuffart	- lower_bound_RM, the lower bound of the nucleosome for RM
587	e5603c3f	Florent Chuffart	- upper_bound_RM, the upper bound of the nucleosome for RM
588	e5603c3f	Florent Chuffart	- index_nuc_RM, the index of the RM nucleosome in the CUR for RM
589	e5603c3f	Florent Chuffart	- cur_index, the index of the CUR
590	e5603c3f	Florent Chuffart	- form
591	e5603c3f	Florent Chuffart
592	e5603c3f	Florent Chuffart	The next columns give the indicators (numbers of reads) for each sample:
593	e5603c3f	Florent Chuffart
594	e5603c3f	Florent Chuffart	- BY_Mnase_Seq_1, the number of reads for the current nucleosome for the sample 1
595	e5603c3f	Florent Chuffart	- BY_Mnase_Seq_2, #reads for sample 2
596	e5603c3f	Florent Chuffart	- BY_Mnase_Seq_3, #reads for sample 3
597	e5603c3f	Florent Chuffart	- RM_Mnase_Seq_4, #reads for sample 4
598	e5603c3f	Florent Chuffart	- RM_Mnase_Seq_5, #reads for sample 5
599	e5603c3f	Florent Chuffart	- RM_Mnase_Seq_6, #reads for sample 6
600	e5603c3f	Florent Chuffart	- BY_H3K14ac_36, #reads for sample 36
601	e5603c3f	Florent Chuffart	- BY_H3K14ac_37, #reads for sample 37
602	e5603c3f	Florent Chuffart	- BY_H3K14ac_53, #reads for sample 53
603	e5603c3f	Florent Chuffart	- RM_H3K14ac_38, #reads for sample 38
604	e5603c3f	Florent Chuffart	- RM_H3K14ac_39, #reads for sample 39
605	e5603c3f	Florent Chuffart
606	e5603c3f	Florent Chuffart	The last 5 columns concern the DESeq analysis:
607	e5603c3f	Florent Chuffart
608	e5603c3f	Florent Chuffart	- manip[a_manip], strain[a_strain] and manip[a_strain]:strain[a_strain], respectively the manip (marker) effect, the strain effect and the snep effect. These are the coefficients of the fitted generalized linear model (GLM).
609	3961deb6	Florent Chuffart	- pvalsGLM, the p-value resulting from the comparison of the GLM that includes the interaction term marker:strain with the GLM that does not include it. This is the statistical significance of the interaction term and therefore the statistical significance of the SNEP.
610	e5603c3f	Florent Chuffart	- snep_index, a boolean set to TRUE if the pvalsGLM value is below the threshold computed with the FDR function, with a rate set to 0.0001.
611	e5603c3f	Florent Chuffart
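The tutorial does not say which procedure the FDR function implements. One common way to turn a false discovery rate into a p-value threshold is the Benjamini-Hochberg procedure, sketched below. Its use here is an assumption, not a description of NucleoMiner2, and the helper name `bh_threshold` and the p-values are made up.

```python
def bh_threshold(pvalues, rate):
    """Benjamini-Hochberg cutoff: largest p_(k) with p_(k) <= k/m * rate.

    Returns 0.0 when no p-value passes, so that nothing is flagged.
    """
    m = len(pvalues)
    threshold = 0.0
    for k, p in enumerate(sorted(pvalues), start=1):
        if p <= k / m * rate:
            threshold = p
    return threshold


# Made-up p-values standing in for the pvalsGLM column.
pvals = [1e-9, 2e-6, 3e-5, 0.004, 0.2, 0.7]
cutoff = bh_threshold(pvals, rate=0.0001)

# Flag each region whose p-value is at or below the cutoff.
snep_index = [p <= cutoff for p in pvals]
print(cutoff, snep_index)
```

Because the cutoff depends on the whole set of p-values, it has to be recomputed for each result file (wp, unr, wpunr).
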
612	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
613	e5603c3f	Florent Chuffart
614	935a568c	Florent Chuffart	.. code:: r
615	935a568c	Florent Chuffart
616	e5603c3f	Florent Chuffart	    source("src/current/launch_deseq.R")
617	935a568c	Florent Chuffart
618	935a568c	Florent Chuffart
619	e5603c3f	Florent Chuffart	Results: Number of SNEPs
620	e5603c3f	Florent Chuffart	------------------------
621	935a568c	Florent Chuffart
622	e5603c3f	Florent Chuffart	Here is the number of computed SNEPs for each form.
623	935a568c	Florent Chuffart
624	e5603c3f	Florent Chuffart	===== ======= ===== =======
625	e5603c3f	Florent Chuffart	form  strains #nucs H3K14ac
626	e5603c3f	Florent Chuffart	===== ======= ===== =======
627	e5603c3f	Florent Chuffart	wp    BY-RM   30464 3549
628	e5603c3f	Florent Chuffart	unr   BY-RM   9497  1559
629	e5603c3f	Florent Chuffart	wpunr BY-RM   39961 5240
630	e5603c3f	Florent Chuffart	===== ======= ===== =======
631	e5603c3f	Florent Chuffart
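To read these counts as proportions, the short sketch below divides the number of SNEPs by the number of regions for each form. The figures are copied from the table above; only the variable names are invented.

```python
# (form, #nucs, H3K14ac SNEPs) copied from the table above.
results = [("wp", 30464, 3549), ("unr", 9497, 1559), ("wpunr", 39961, 5240)]

# Fraction of regions flagged as SNEPs, per form.
fractions = {form: n_sneps / n_nucs for form, n_nucs, n_sneps in results}

for form, n_nucs, n_sneps in results:
    print("%-5s %5.1f%% of regions are SNEPs" % (form, 100.0 * fractions[form]))
```
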
632	935a568c	Florent Chuffart
633	935a568c	Florent Chuffart
634	935a568c	Florent Chuffart
635	935a568c	Florent Chuffart
636	935a568c	Florent Chuffart
637	e5603c3f	Florent Chuffart	APPENDIX: Generate .c2c Files
638	e5603c3f	Florent Chuffart	-----------------------------
639	935a568c	Florent Chuffart
640	3961deb6	Florent Chuffart	A `.c2c` file is a simple table that describes how two genome
641	3961deb6	Florent Chuffart	sequences are aligned. This file can be generated with scripts that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics 2010) and that we provide in this release of NucleoMiner2.
642	5badc2fd	Florent Chuffart
643	5badc2fd	Florent Chuffart
644	3961deb6	Florent Chuffart	To use these scripts on your UNIX/LINUX computer, you first need to install MUMmer, which is designed to rapidly align entire genomes, whether in complete or draft form.
645	935a568c	Florent Chuffart
646	3961deb6	Florent Chuffart	Installing MUMmer
647	3961deb6	Florent Chuffart	^^^^^^^^^^^^^^^^^
648	935a568c	Florent Chuffart
649	3961deb6	Florent Chuffart	Get the latest version of the MUMmer archive on your computer (MUMmer3.23.tar.gz is provided in the directory deps of your working directory). Copy it into a dedicated directory. Install it locally into the src folder of your working directory by typing (from the working directory):
650	935a568c	Florent Chuffart
651	3961deb6	Florent Chuffart	tar -xvzf MUMmer3.23.tar.gz
652	935a568c	Florent Chuffart
653	935a568c	Florent Chuffart
654	935a568c	Florent Chuffart	.. code:: bash
655	935a568c	Florent Chuffart
656	c25275e2	Florent Chuffart	    cd src
657	c25275e2	Florent Chuffart	    tar xfvz ../deps/MUMmer3.23.tar.gz
658	c25275e2	Florent Chuffart	    cd MUMmer3.23
659	c25275e2	Florent Chuffart	    make check
660	c25275e2	Florent Chuffart	    make install
660	c25275e2	Florent Chuffart	    cd ../..  # back to the working directory
661	935a568c	Florent Chuffart
662	c25275e2	Florent Chuffart	Installing NucleoMiner 1.0 scripts
663	c25275e2	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
664	e5603c3f	Florent Chuffart
665	3961deb6	Florent Chuffart	Get the nucleominer-1.0.tar.gz archive on your computer (this archive is provided in the directory deps of your working directory). Install it locally into the src folder of your working directory by typing (from the working directory):
666	e5603c3f	Florent Chuffart
667	935a568c	Florent Chuffart
668	e5603c3f	Florent Chuffart	.. code:: bash
669	935a568c	Florent Chuffart
670	c25275e2	Florent Chuffart	    cd src
671	c25275e2	Florent Chuffart	    tar xfvz ../deps/nucleominer-1.0.tar.gz
672	c25275e2	Florent Chuffart	    cd ..
673	935a568c	Florent Chuffart
674	3961deb6	Florent Chuffart	This creates a directory that contains NucleoMiner 1.0 scripts (src/nucleominer-1.0/scripts).
675	935a568c	Florent Chuffart
676	e5603c3f	Florent Chuffart
677	e5603c3f	Florent Chuffart	Generate .c2c Files
678	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^
679	e5603c3f	Florent Chuffart
680	e5603c3f	Florent Chuffart	To generate .c2c files, type the following commands in a terminal:
681	e5603c3f	Florent Chuffart
682	e5603c3f	Florent Chuffart	.. code:: bash
683	935a568c	Florent Chuffart
684	c25275e2	Florent Chuffart	    export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts
685	c25275e2	Florent Chuffart	    export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/
686	c25275e2	Florent Chuffart	    NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
687	c25275e2	Florent Chuffart	        data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
688	c25275e2	Florent Chuffart	        data/byxrm 2>NMgxcomp.log
689	e5603c3f	Florent Chuffart
690	c25275e2	Florent Chuffart	After execution, the directory `data` will hold the .c2c files.
