/doc/sphinx_doc/tuto.rst - Annoter - NucleoMiner - Forge du Centre Blaise Pascal

1	935a568c	Florent Chuffart	Tutorial
2	935a568c	Florent Chuffart	========
3	935a568c	Florent Chuffart
4	e5603c3f	Florent Chuffart	This tutorial describes the steps needed to perform a quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory.
5	935a568c	Florent Chuffart
6	e5603c3f	Florent Chuffart	This tutorial is divided into two main parts. The first part covers the Python script `wf.py`, which aligns and converts short sequence reads. The second part covers the R scripts that extract information (nucleosome positions and indicators) from the dataset.
7	935a568c	Florent Chuffart
8	935a568c	Florent Chuffart
9	dadb6a4d	Florent Chuffart
10	dadb6a4d	Florent Chuffart
11	e5603c3f	Florent Chuffart	Experimental Dataset, Working Directory and Configuration File
12	e5603c3f	Florent Chuffart	--------------------------------------------------------------
13	dadb6a4d	Florent Chuffart
14	e5603c3f	Florent Chuffart	Working Directory Organisation
15	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
16	dadb6a4d	Florent Chuffart
17	5badc2fd	Florent Chuffart	After installing the NucleoMiner2 environment (see the previous section), go to the root working directory of the tutorial by typing the following command in a terminal:
18	dadb6a4d	Florent Chuffart
19	5badc2fd	Florent Chuffart	.. code:: bash
20	dadb6a4d	Florent Chuffart
21	5badc2fd	Florent Chuffart	   cd doc/Chuffart_NM2_workdir/
22	dadb6a4d	Florent Chuffart
23	dadb6a4d	Florent Chuffart
24	e5603c3f	Florent Chuffart	Retrieving Experimental Dataset
25	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
26	935a568c	Florent Chuffart
27	e5603c3f	Florent Chuffart	The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.
28	935a568c	Florent Chuffart
29	e5603c3f	Florent Chuffart	$$$ TODO: explain how to organise the experimental dataset in the `data` directory of the working directory.
30	935a568c	Florent Chuffart
31	935a568c	Florent Chuffart
32	e5603c3f	Florent Chuffart	We want to compare the nucleosomes of 2 yeast strains: BY and RM. For each strain we performed MNase-seq and ChIP-seq using an antibody recognizing the H3K14ac epigenetic mark.
33	935a568c	Florent Chuffart
34	e5603c3f	Florent Chuffart	The dataset is composed of 55 files organised as follows:
35	935a568c	Florent Chuffart
36	e5603c3f	Florent Chuffart	- 3 replicates for BY MNase Seq
37	e5603c3f	Florent Chuffart
38	e5603c3f	Florent Chuffart	  - sample 1 (5 fastq.gz files)
39	e5603c3f	Florent Chuffart	  - sample 2 (5 fastq.gz files)
40	e5603c3f	Florent Chuffart	  - sample 3 (4 fastq.gz files)
41	e5603c3f	Florent Chuffart
42	e5603c3f	Florent Chuffart	- 3 replicates for RM MNase Seq
43	e5603c3f	Florent Chuffart
44	e5603c3f	Florent Chuffart	  - sample 4 (4 fastq.gz files)
45	e5603c3f	Florent Chuffart	  - sample 5 (4 fastq.gz files)
46	e5603c3f	Florent Chuffart	  - sample 6 (5 fastq.gz files)
47	e5603c3f	Florent Chuffart
48	e5603c3f	Florent Chuffart	- 3 replicates for BY ChIP Seq H3K14ac
49	e5603c3f	Florent Chuffart
50	e5603c3f	Florent Chuffart	  - sample 36 (5 fastq.gz files)
51	e5603c3f	Florent Chuffart	  - sample 37 (5 fastq.gz files)
52	e5603c3f	Florent Chuffart	  - sample 53 (9 fastq.gz files)
53	e5603c3f	Florent Chuffart
54	e5603c3f	Florent Chuffart	- 2 replicates for RM ChIP Seq H3K14ac
55	e5603c3f	Florent Chuffart
56	e5603c3f	Florent Chuffart	  - sample 38 (5 fastq.gz files)
57	e5603c3f	Florent Chuffart	  - sample 39 (4 fastq.gz files)
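
As a quick consistency check, the file counts listed above can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary layout and its labels are assumptions made for the example, not a structure used by `wf.py`.

```python
# Number of fastq.gz files per sample, copied from the list above.
# The grouping labels are illustrative only.
files_per_sample = {
    "BY MNase": {1: 5, 2: 5, 3: 4},
    "RM MNase": {4: 4, 5: 4, 6: 5},
    "BY ChIP H3K14ac": {36: 5, 37: 5, 53: 9},
    "RM ChIP H3K14ac": {38: 5, 39: 4},
}

# Total number of samples and of fastq.gz files across all conditions.
n_samples = sum(len(samples) for samples in files_per_sample.values())
n_files = sum(sum(samples.values()) for samples in files_per_sample.values())
print(n_samples, n_files)
```

The totals come to 11 samples and 55 files, which matches the figure quoted above.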
58	e5603c3f	Florent Chuffart
59	935a568c	Florent Chuffart
60	935a568c	Florent Chuffart
61	935a568c	Florent Chuffart
62	e5603c3f	Florent Chuffart	Python and R Common Configuration File
63	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
64	935a568c	Florent Chuffart
65	e5603c3f	Florent Chuffart	First of all, we define in one place the configuration variables that will be used by the Python and R scripts. These variables are set in the file `configurator.py`. Running this Python script dumps the variables into the `nucleominer_config.json` file, which is then read by both the R and Python scripts.
66	935a568c	Florent Chuffart
67	e5603c3f	Florent Chuffart	To do this, go to the root directory of your project and run the following command:
68	935a568c	Florent Chuffart
69	e5603c3f	Florent Chuffart	.. code:: bash
70	935a568c	Florent Chuffart
71	e5603c3f	Florent Chuffart	   python src/current/configurator.py
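
The dump-then-load pattern described above can be sketched as follows. This is a minimal illustration and not the content of `configurator.py`: the keys and values in `config` are hypothetical placeholders, and the file is written to a temporary directory rather than to the project tree.

```python
import json
import os
import tempfile

# Hypothetical configuration variables (placeholder keys and values).
config = {
    "INDEX_DIR": "data/index",
    "ALIGN_DIR": "data/align",
    "LOG_DIR": "data/log",
}


def dump_config(conf, path):
    """Write the configuration dictionary to a JSON file."""
    with open(path, "w") as handle:
        json.dump(conf, handle, indent=2, sort_keys=True)


def load_config(path):
    """Read the configuration back, as any downstream script would."""
    with open(path) as handle:
        return json.load(handle)


# Round trip through a temporary file named like the real configuration file.
json_path = os.path.join(tempfile.mkdtemp(), "nucleominer_config.json")
dump_config(config, json_path)
loaded = load_config(json_path)
```

Because JSON is language-neutral, the same file can be parsed from R with any JSON reader, which is what makes a single shared configuration file practical.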
72	e5603c3f	Florent Chuffart
73	935a568c	Florent Chuffart
74	935a568c	Florent Chuffart
75	935a568c	Florent Chuffart
76	935a568c	Florent Chuffart
77	935a568c	Florent Chuffart
78	935a568c	Florent Chuffart
79	935a568c	Florent Chuffart	Preprocessing Illumina Fastq Reads for Each Sample
80	935a568c	Florent Chuffart	--------------------------------------------------
81	935a568c	Florent Chuffart
82	e5603c3f	Florent Chuffart	This preprocessing consists of 4 main steps embedded in the `wf.py` script. They are described below. As a preamble, the script computes `samples`, `samples_mnase` and `strains`, which are used throughout the 4 steps.
83	e5603c3f	Florent Chuffart
84	935a568c	Florent Chuffart
85	935a568c	Florent Chuffart	.. autodata:: wf.samples
86	935a568c	Florent Chuffart	   :noindex:
87	935a568c	Florent Chuffart
88	935a568c	Florent Chuffart	.. autodata:: wf.samples_mnase
89	935a568c	Florent Chuffart	   :noindex:
90	935a568c	Florent Chuffart
91	935a568c	Florent Chuffart	.. autodata:: wf.strains
92	935a568c	Florent Chuffart	   :noindex:
93	935a568c	Florent Chuffart
94	935a568c	Florent Chuffart
95	935a568c	Florent Chuffart	Creating Bowtie Index from each Reference Genome
96	935a568c	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
97	935a568c	Florent Chuffart
98	e5603c3f	Florent Chuffart	For each strain, we need to create a bowtie index. The bowtie index of a strain is a compact, searchable representation of the genome of this strain (bowtie builds it from the Burrows-Wheeler transform of the sequence). It is used by bowtie to align reads. This step is performed by the following part of the `wf.py` script:
99	935a568c	Florent Chuffart
100	8e9facd8	Florent Chuffart	.. literalinclude:: ../../../snep/src/current/wf.py
101	935a568c	Florent Chuffart	   :start-after: # _STARTOF_ step_1
102	935a568c	Florent Chuffart	   :end-before: # _ENDOF_ step_1
103	935a568c	Florent Chuffart	   :language: python
104	935a568c	Florent Chuffart
105	e5603c3f	Florent Chuffart	The following table summarizes the file sizes and process durations concerning this step.
106	935a568c	Florent Chuffart
107	935a568c	Florent Chuffart	====== ====================== ====================== ================
108	935a568c	Florent Chuffart	strain fasta genome file size bowtie index file size process duration
109	935a568c	Florent Chuffart	====== ====================== ====================== ================
110	935a568c	Florent Chuffart	BY     12 MB                  25 MB                  11 s
111	935a568c	Florent Chuffart	RM     12 MB                  24 MB                  9 s
112	935a568c	Florent Chuffart	====== ====================== ====================== ================
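
For readers who cannot see the included source, the index-building step can be sketched as below. This is not the code of `wf.py`: the function names, the file paths and the default binary name `bowtie2-build` are assumptions made for the example. The only fact relied upon is that the bowtie index builders take a reference FASTA file followed by an index prefix.

```python
import subprocess


def bowtie_build_command(fasta_path, index_prefix, build_bin="bowtie2-build"):
    """Return the argument list that builds an index from a FASTA file."""
    return [build_bin, fasta_path, index_prefix]


def build_index(fasta_path, index_prefix, build_bin="bowtie2-build"):
    """Run the index builder; raises CalledProcessError if it fails."""
    subprocess.check_call(bowtie_build_command(fasta_path, index_prefix, build_bin))


# Hypothetical paths: one reference genome per strain.
references = {"BY": "data/BY.fasta", "RM": "data/RM.fasta"}
commands = {
    strain: bowtie_build_command(fasta, "index/" + strain)
    for strain, fasta in references.items()
}
```

Building the argument list separately from running it makes the command easy to log or inspect before anything is executed.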
113	935a568c	Florent Chuffart
114	935a568c	Florent Chuffart
115	935a568c	Florent Chuffart
116	935a568c	Florent Chuffart
117	935a568c	Florent Chuffart	Aligning Reads to Reference Genome
118	935a568c	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
119	935a568c	Florent Chuffart
120	935a568c	Florent Chuffart	Next, we launch bowtie to align reads to the reference genome. It produces a
121	e5603c3f	Florent Chuffart	`.sam` file that we convert into a `.bed` file. The `bowtie`, `samtools` and `bedtools` binaries are called through the Python `subprocess` module. This step is performed by the following part of the `wf.py` script:
122	935a568c	Florent Chuffart
123	8e9facd8	Florent Chuffart	.. literalinclude:: ../../../snep/src/current/wf.py
124	935a568c	Florent Chuffart	   :start-after: # _STARTOF_ step_2
125	935a568c	Florent Chuffart	   :end-before: # _ENDOF_ step_2
126	935a568c	Florent Chuffart	   :language: python
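
The chain of external tools described above can be sketched as follows. This is an assumption-laden illustration rather than the actual `wf.py` code: it assumes the bowtie2 command-line interface, the file names are hypothetical, and the read-filtering performed by the real workflow is not reproduced.

```python
import subprocess


def alignment_commands(index_prefix, fastq_files, out_prefix, threads=3):
    """Return (argv, stdout_path) pairs for the three tools.

    stdout_path is None when the tool writes its own output file.
    """
    sam = out_prefix + ".sam"
    bam = out_prefix + ".bam"
    bed = out_prefix + ".bed"
    return [
        # bowtie2: -x index prefix, -U unpaired reads (comma-separated),
        # -S SAM output file, -p number of threads.
        (["bowtie2", "-p", str(threads), "-x", index_prefix,
          "-U", ",".join(fastq_files), "-S", sam], None),
        # samtools view: -b writes BAM, -S declares SAM input; output on stdout.
        (["samtools", "view", "-b", "-S", sam], bam),
        # bedtools bamtobed: converts BAM alignments to BED lines on stdout.
        (["bedtools", "bamtobed", "-i", bam], bed),
    ]


def run_all(commands):
    """Run each command in order, redirecting stdout to a file when requested."""
    for argv, stdout_path in commands:
        if stdout_path is None:
            subprocess.check_call(argv)
        else:
            with open(stdout_path, "wb") as handle:
                subprocess.check_call(argv, stdout=handle)


# Hypothetical file names for one sample aligned against the BY index.
commands = alignment_commands(
    "index/BY", ["s1_a.fastq.gz", "s1_b.fastq.gz"], "align/sample_1")
```

Calling `run_all(commands)` would execute the three tools in sequence, provided they are installed and on the `PATH`.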
127	935a568c	Florent Chuffart
128	e5603c3f	Florent Chuffart	Convert Aligned Reads into TemplateFilter Format
129	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
130	e5603c3f	Florent Chuffart
131	e5603c3f	Florent Chuffart	TemplateFilter uses a particular input format for reads, so it is necessary to convert the `.bed` files. TemplateFilter expects each entry to contain the fields `chr`, `coord`, `strand` and `#reads`, where:
132	935a568c	Florent Chuffart
133	e5603c3f	Florent Chuffart	- `chr` is the number of the chromosome;
134	e5603c3f	Florent Chuffart	- `coord` is the coordinate of the reads;
135	e5603c3f	Florent Chuffart	- `strand` is `F` for forward and `R` for reverse;
136	e5603c3f	Florent Chuffart	- `#reads` is the number of reads covering this position.
137	935a568c	Florent Chuffart
138	935a568c	Florent Chuffart	Each entry is tab-separated.
139	935a568c	Florent Chuffart
140	e5603c3f	Florent Chuffart	WARNING: for reads on the reverse strand, bowtie returns the position of the leftmost nucleotide, whereas TemplateFilter expects the rightmost one. This step takes this into account by adding the read length (50 in our case) to the coordinates of reverse reads.
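
The conversion and the reverse-strand correction can be sketched as follows. This is a minimal illustration, not the `wf.py` implementation: it assumes standard six-column BED lines whose first column already holds the chromosome number, uses the BED start column as the read coordinate, and leaves aside any off-by-one convention between the two formats.

```python
from collections import Counter

READ_LENGTH = 50  # read length quoted in the text


def bed_to_tf_rows(bed_lines, read_length=READ_LENGTH):
    """Convert BED lines into tab-separated chr, coord, strand, #reads rows."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        if strand == "+":
            counts[(chrom, start, "F")] += 1
        else:
            # Reverse read: shift the coordinate by the read length.
            counts[(chrom, start + read_length, "R")] += 1
    return ["\t".join(str(item) for item in key + (n,))
            for key, n in sorted(counts.items())]


# Three toy reads: two identical forward reads and one reverse read.
bed = [
    "1\t100\t150\tread_a\t42\t+",
    "1\t100\t150\tread_b\t42\t+",
    "1\t200\t250\tread_c\t42\t-",
]
rows = bed_to_tf_rows(bed)
```

Reads sharing the same chromosome, coordinate and strand are merged into a single row whose last field is their count.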
141	935a568c	Florent Chuffart
142	e5603c3f	Florent Chuffart	This step is performed by the following part of the `wf.py` script:
143	935a568c	Florent Chuffart
144	8e9facd8	Florent Chuffart	.. literalinclude:: ../../../snep/src/current/wf.py
145	935a568c	Florent Chuffart	   :start-after: # _STARTOF_ step_3
146	935a568c	Florent Chuffart	   :end-before: # _ENDOF_ step_3
147	935a568c	Florent Chuffart	   :language: python
148	935a568c	Florent Chuffart
149	e5603c3f	Florent Chuffart	The following table summarises the number of reads, the file sizes involved and the process durations for the last two steps. In our case, the alignment process was multithreaded over 3 cores.
150	935a568c	Florent Chuffart
151	935a568c	Florent Chuffart	== ============== ========================== ====== ================ ================== ================
152	935a568c	Florent Chuffart	id Illumina reads aligned and filtered reads ratio  `.bed` file size TF input file size process duration
153	935a568c	Florent Chuffart	== ============== ========================== ====== ================ ================== ================
154	935a568c	Florent Chuffart	1  16436138       10199695                   62.06% 1064 MB          60 MB              383 s
155	935a568c	Florent Chuffart	2  16911132       12512727                   73.99% 1298 MB          64 MB              437 s
156	935a568c	Florent Chuffart	3  15946902       12340426                   77.38% 1280 MB          65 MB              423 s
157	935a568c	Florent Chuffart	4  13765584       10381903                   75.42% 931 MB           59 MB              352 s
158	935a568c	Florent Chuffart	5  15168268       11502855                   75.83% 1031 MB          64 MB              386 s
159	935a568c	Florent Chuffart	6  18850820       14024905                   74.40% 1254 MB          69 MB              482 s
160	935a568c	Florent Chuffart	36 17715118       14092985                   79.55% 1404 MB          68 MB              483 s
161	935a568c	Florent Chuffart	37 17288466       7402082                    42.82% 741 MB           48 MB              339 s
162	935a568c	Florent Chuffart	38 16116394       13178457                   81.77% 1101 MB          63 MB              420 s
163	935a568c	Florent Chuffart	39 14241106       10537228                   73.99% 880 MB           57 MB              348 s
164	935a568c	Florent Chuffart	53 40876476       33780065                   82.64% 3316 MB          103 MB             1165 s
165	935a568c	Florent Chuffart	== ============== ========================== ====== ================ ================== ================
166	935a568c	Florent Chuffart
167	935a568c	Florent Chuffart	Run TemplateFilter on MNase Samples
168	dadb6a4d	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
169	935a568c	Florent Chuffart
170	e5603c3f	Florent Chuffart	Finally, for each MNase sample we run the TemplateFilter analysis.
171	935a568c	Florent Chuffart
172	935a568c	Florent Chuffart	WARNING: TemplateFilter returns a list of nucleosomes. Each nucleosome is
173	e5603c3f	Florent Chuffart	defined by its center and its width. An odd width leads us to consider
174	e5603c3f	Florent Chuffart	non-integer lower and upper bounds.
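
The remark about odd widths can be made concrete with a short sketch. It assumes that the bounds are taken symmetrically around the center, at half the width on each side; the function name is hypothetical.

```python
def nucleosome_bounds(center, width):
    """Return (lower_bound, upper_bound) placed symmetrically around the center."""
    half = width / 2.0
    return center - half, center + half


# An odd width gives half-integer bounds, an even width gives whole numbers.
odd_bounds = nucleosome_bounds(1000, 147)
even_bounds = nucleosome_bounds(1000, 146)
```

With a center at 1000, a width of 147 gives bounds of 926.5 and 1073.5, whereas a width of 146 gives 927.0 and 1073.0.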
175	935a568c	Florent Chuffart
176	e5603c3f	Florent Chuffart	WARNING: TemplateFilter is not designed to deal with replicates. We therefore recommend keeping as many nucleosomes as possible and filtering out the aberrant ones afterwards, taking advantage of the replicates. To do this, we set a low correlation threshold (0.5) and a particularly high overlap value (300%).
177	935a568c	Florent Chuffart
178	e5603c3f	Florent Chuffart	This step is performed by the following part of the `wf.py` script:
179	935a568c	Florent Chuffart
180	8e9facd8	Florent Chuffart	.. literalinclude:: ../../../snep/src/current/wf.py
181	935a568c	Florent Chuffart	   :start-after: # _STARTOF_ step_4
182	935a568c	Florent Chuffart	   :end-before: # _ENDOF_ step_4
183	935a568c	Florent Chuffart	   :language: python
184	935a568c	Florent Chuffart
185	935a568c	Florent Chuffart	== ====== ========== ============= ================
186	935a568c	Florent Chuffart	id strain found nucs nuc file size process duration
187	935a568c	Florent Chuffart	== ====== ========== ============= ================
188	935a568c	Florent Chuffart	1  BY     96214      68 MB         1022 s
189	935a568c	Florent Chuffart	2  BY     91694      65 MB         1038 s
190	935a568c	Florent Chuffart	3  BY     91205      65 MB         1036 s
191	935a568c	Florent Chuffart	4  RM     88076      62 MB         984 s
192	935a568c	Florent Chuffart	5  RM     90141      64 MB         967 s
193	935a568c	Florent Chuffart	6  RM     87517      62 MB         980 s
194	935a568c	Florent Chuffart	== ====== ========== ============= ================
195	935a568c	Florent Chuffart
196	935a568c	Florent Chuffart
197	935a568c	Florent Chuffart
198	935a568c	Florent Chuffart
199	935a568c	Florent Chuffart
200	935a568c	Florent Chuffart
201	935a568c	Florent Chuffart
202	935a568c	Florent Chuffart
203	935a568c	Florent Chuffart
204	935a568c	Florent Chuffart
205	935a568c	Florent Chuffart
206	935a568c	Florent Chuffart
207	935a568c	Florent Chuffart
208	e5603c3f	Florent Chuffart
209	e5603c3f	Florent Chuffart
210	e5603c3f	Florent Chuffart
211	e5603c3f	Florent Chuffart
212	e5603c3f	Florent Chuffart
213	e5603c3f	Florent Chuffart
214	e5603c3f	Florent Chuffart
215	e5603c3f	Florent Chuffart
216	e5603c3f	Florent Chuffart
217	e5603c3f	Florent Chuffart
218	e5603c3f	Florent Chuffart
219	e5603c3f	Florent Chuffart
220	e5603c3f	Florent Chuffart
221	e5603c3f	Florent Chuffart
222	e5603c3f	Florent Chuffart
223	e5603c3f	Florent Chuffart
224	e5603c3f	Florent Chuffart
225	e5603c3f	Florent Chuffart
226	e5603c3f	Florent Chuffart
227	e5603c3f	Florent Chuffart
228	e5603c3f	Florent Chuffart
229	e5603c3f	Florent Chuffart
230	e5603c3f	Florent Chuffart
231	e5603c3f	Florent Chuffart
232	e5603c3f	Florent Chuffart
233	e5603c3f	Florent Chuffart
234	e5603c3f	Florent Chuffart
235	e5603c3f	Florent Chuffart
236	e5603c3f	Florent Chuffart
237	e5603c3f	Florent Chuffart
238	e5603c3f	Florent Chuffart
239	e5603c3f	Florent Chuffart
240	e5603c3f	Florent Chuffart
241	e5603c3f	Florent Chuffart
242	e5603c3f	Florent Chuffart
243	e5603c3f	Florent Chuffart
244	e5603c3f	Florent Chuffart
245	e5603c3f	Florent Chuffart
246	e5603c3f	Florent Chuffart
247	e5603c3f	Florent Chuffart
248	e5603c3f	Florent Chuffart
249	e5603c3f	Florent Chuffart	..
250	e5603c3f	Florent Chuffart	..
251	e5603c3f	Florent Chuffart	.. - libcoverage.py
252	e5603c3f	Florent Chuffart	.. - wf.py
253	e5603c3f	Florent Chuffart	..
254	e5603c3f	Florent Chuffart	..
255	e5603c3f	Florent Chuffart	..
256	e5603c3f	Florent Chuffart	..
257	e5603c3f	Florent Chuffart	..
258	e5603c3f	Florent Chuffart	..
259	e5603c3f	Florent Chuffart	.. To simplify the experimental design, we consider MNase as a marker.
260	e5603c3f	Florent Chuffart	.. For each pair `(strain, marker)` we perform 3 replicates. So, theoretically,
261	e5603c3f	Florent Chuffart	.. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtained 2
262	e5603c3f	Florent Chuffart	.. replicates for `(YJM, H3K4me1)`. Each of the 53 samples is identified by a
263	e5603c3f	Florent Chuffart	.. unique identifier. The file `CSV_SAMPLE_FILE` sums up this information.
264	e5603c3f	Florent Chuffart	..
265	e5603c3f	Florent Chuffart	.. .. autodata:: configurator.CSV_SAMPLE_FILE
266	e5603c3f	Florent Chuffart	.. :noindex:
267	e5603c3f	Florent Chuffart	..
268	e5603c3f	Florent Chuffart	.. We use a convention to link sample and Illumina fastq outputs. Illumina output
269	e5603c3f	Florent Chuffart	.. files of the sample `ID` will be stored in the directory
270	e5603c3f	Florent Chuffart	.. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be
271	e5603c3f	Florent Chuffart	.. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
272	e5603c3f	Florent Chuffart	..
273	e5603c3f	Florent Chuffart	.. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
274	e5603c3f	Florent Chuffart	.. :noindex:
275	e5603c3f	Florent Chuffart	..
276	e5603c3f	Florent Chuffart	.. For BY (resp. RM and YJM) we use the following reference genome
277	e5603c3f	Florent Chuffart	.. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
278	e5603c3f	Florent Chuffart	.. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and
279	e5603c3f	Florent Chuffart	.. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
280	e5603c3f	Florent Chuffart	.. The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
281	e5603c3f	Florent Chuffart	..
282	e5603c3f	Florent Chuffart	.. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
283	e5603c3f	Florent Chuffart	.. :noindex:
284	e5603c3f	Florent Chuffart	..
285	e5603c3f	Florent Chuffart	.. Each chromosome/contig is identified in the fasta file by an obscure identifier.
286	e5603c3f	Florent Chuffart	.. For example, BY chromosome I is identified by `gi\|144228165\|ref\|NC_001133.7\|` whereas
287	e5603c3f	Florent Chuffart	.. TemplateFilter expects an integer. So, we translate it. The index
288	e5603c3f	Florent Chuffart	.. `FASTA_INDEXES` stores this translation.
289	e5603c3f	Florent Chuffart	..
290	e5603c3f	Florent Chuffart	.. .. autodata:: configurator.FASTA_INDEXES
291	e5603c3f	Florent Chuffart	.. :noindex:
292	e5603c3f	Florent Chuffart	..
293	e5603c3f	Florent Chuffart	.. For practical reasons we discard some parts of the genome (repeated
294	e5603c3f	Florent Chuffart	.. sequences, etc.). The blacklisted areas are explicitly detailed in
295	e5603c3f	Florent Chuffart	.. `AREA_BLACK_LIST`.
296	e5603c3f	Florent Chuffart	..
297	e5603c3f	Florent Chuffart	.. .. autodata:: configurator.AREA_BLACK_LIST
298	e5603c3f	Florent Chuffart	.. :noindex:
299	e5603c3f	Florent Chuffart	..
300	e5603c3f	Florent Chuffart	.. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use the previously
301	e5603c3f	Florent Chuffart	.. computed .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp.
302	e5603c3f	Florent Chuffart	.. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about
303	e5603c3f	Florent Chuffart	.. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old
304	e5603c3f	Florent Chuffart	.. version of `NucleoMiner2`
305	e5603c3f	Florent Chuffart	.. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
306	e5603c3f	Florent Chuffart	..
307	e5603c3f	Florent Chuffart	.. .. autodata:: configurator.C2C_FILES
308	e5603c3f	Florent Chuffart	.. :noindex:
309	e5603c3f	Florent Chuffart	..
310	e5603c3f	Florent Chuffart	.. `nucleominer` uses specific directories to work in; these are described in
311	e5603c3f	Florent Chuffart	.. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
312	e5603c3f	Florent Chuffart	..
313	e5603c3f	Florent Chuffart	.. Finally, `nucleominer` uses external resources; the paths to these resources
314	e5603c3f	Florent Chuffart	.. are described in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`,
315	e5603c3f	Florent Chuffart	.. `BEDTOOLS_BIN`, `TF_BIN` and `TF_TEMPLATES_FILE`.
316	e5603c3f	Florent Chuffart	..
317	e5603c3f	Florent Chuffart	.. All paths, prefixes and indexes can be changed in the
318	e5603c3f	Florent Chuffart	.. `src/current/nucleominer_config.json` file.
319	e5603c3f	Florent Chuffart	..
320	e5603c3f	Florent Chuffart	.. .. autodata:: wf.json_conf_file
321	e5603c3f	Florent Chuffart	.. :noindex:
322	e5603c3f	Florent Chuffart	..
323	e5603c3f	Florent Chuffart
324	e5603c3f	Florent Chuffart
325	e5603c3f	Florent Chuffart
326	e5603c3f	Florent Chuffart
327	e5603c3f	Florent Chuffart
328	e5603c3f	Florent Chuffart
329	e5603c3f	Florent Chuffart
330	e5603c3f	Florent Chuffart
331	e5603c3f	Florent Chuffart
332	e5603c3f	Florent Chuffart
333	e5603c3f	Florent Chuffart
334	e5603c3f	Florent Chuffart
335	e5603c3f	Florent Chuffart
336	e5603c3f	Florent Chuffart
337	e5603c3f	Florent Chuffart
338	e5603c3f	Florent Chuffart
339	e5603c3f	Florent Chuffart
340	e5603c3f	Florent Chuffart
341	e5603c3f	Florent Chuffart
342	e5603c3f	Florent Chuffart
343	e5603c3f	Florent Chuffart
344	e5603c3f	Florent Chuffart
345	e5603c3f	Florent Chuffart
346	e5603c3f	Florent Chuffart
347	e5603c3f	Florent Chuffart
348	e5603c3f	Florent Chuffart
349	e5603c3f	Florent Chuffart
350	e5603c3f	Florent Chuffart
351	e5603c3f	Florent Chuffart
352	e5603c3f	Florent Chuffart
353	e5603c3f	Florent Chuffart
354	e5603c3f	Florent Chuffart
355	e5603c3f	Florent Chuffart
356	e5603c3f	Florent Chuffart
357	e5603c3f	Florent Chuffart
358	e5603c3f	Florent Chuffart
359	e5603c3f	Florent Chuffart
360	e5603c3f	Florent Chuffart
361	e5603c3f	Florent Chuffart
362	e5603c3f	Florent Chuffart
363	e5603c3f	Florent Chuffart
364	e5603c3f	Florent Chuffart
365	935a568c	Florent Chuffart	Inferring Nucleosome Position and Extracting Read Counts
366	935a568c	Florent Chuffart	--------------------------------------------------------
367	935a568c	Florent Chuffart
368	935a568c	Florent Chuffart
369	935a568c	Florent Chuffart
370	e5603c3f	Florent Chuffart	The second part of the tutorial uses R (http://www.r-project.org). It consists of a set of R scripts that are sourced from an R console launched at the root of your project. These scripts are:
371	935a568c	Florent Chuffart
372	dadb6a4d	Florent Chuffart	- headers.R
373	935a568c	Florent Chuffart	- extract_maps.R
374	e5603c3f	Florent Chuffart	- translate_common_wp.R
375	b20637ed	Florent Chuffart	- split_samples.R
376	935a568c	Florent Chuffart	- count_reads.R
377	935a568c	Florent Chuffart	- get_size_factors.R
378	935a568c	Florent Chuffart	- launch_deseq.R
379	935a568c	Florent Chuffart
380	dadb6a4d	Florent Chuffart	The Script headers.R
381	dadb6a4d	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^
382	dadb6a4d	Florent Chuffart
383	e5603c3f	Florent Chuffart	The script headers.R is sourced by each of the other scripts. It is in charge of:
384	dadb6a4d	Florent Chuffart
385	e5603c3f	Florent Chuffart	- loading the libraries used in the scripts
386	dadb6a4d	Florent Chuffart	- loading the configuration (design, strain, marker...)
387	e5603c3f	Florent Chuffart	- computing and caching CURs (caching means storing the information in the computer's memory)
388	e5603c3f	Florent Chuffart
389	e5603c3f	Florent Chuffart	Note that you can customise the `translate` function. This function allows you to use the alignments between genomes when performing various tasks. You may be using NucleoMiner2 to analyse data from a single strain or from several strains:
390	e5603c3f	Florent Chuffart
391	e5603c3f	Florent Chuffart	- All the data correspond to the same strain (e.g. treatment/control, or only a few mutations): in step 1), the regions to use are entire chromosomes. In step 2), simply use the default `translate` function, which is neutral.
392	e5603c3f	Florent Chuffart
393	e5603c3f	Florent Chuffart	- The data come from two or more strains: in this case, edit a list of regions and customise the `translate` function, which performs the correspondence between the different genomes. How we did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to the Appendix "Generate .c2c Files"), then used to produce the list of regions and to customise `translate`.
394	e5603c3f	Florent Chuffart
395	e5603c3f	Florent Chuffart
396	e5603c3f	Florent Chuffart
397	dadb6a4d	Florent Chuffart
398	dadb6a4d	Florent Chuffart	In your R console, run the following command line:
399	935a568c	Florent Chuffart
400	935a568c	Florent Chuffart	.. code:: r
401	935a568c	Florent Chuffart
402	e5603c3f	Florent Chuffart	   source("src/current/headers.R")
403	935a568c	Florent Chuffart
404	935a568c	Florent Chuffart
405	dadb6a4d	Florent Chuffart	The Script extract_maps.R
406	dadb6a4d	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^
407	e5603c3f	Florent Chuffart	This script is in charge of extracting maps for well-positioned and fuzzy nucleosomes. First, it computes intra- and inter-strain nucleosome maps for each CUR. This step is executed in parallel on several cores using the BoT library. Next, it collects the results and produces the well-positioned, fuzzy and UNR maps.
408	dadb6a4d	Florent Chuffart
409	e5603c3f	Florent Chuffart	The well-positioned map for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of the following columns:
410	dadb6a4d	Florent Chuffart
411	dadb6a4d	Florent Chuffart	- chr, the number of the chromosome
412	dadb6a4d	Florent Chuffart	- lower_bound, the lower bound of the nucleosome
413	dadb6a4d	Florent Chuffart	- upper_bound, the upper bound of the nucleosome
414	dadb6a4d	Florent Chuffart	- cur_index, index of the CUR
415	dadb6a4d	Florent Chuffart	- index_nuc, the index of the nucleosome in the CUR
416	e5603c3f	Florent Chuffart	- wp, 1 if it is a well positioned nucleosome, 0 otherwise
417	e5603c3f	Florent Chuffart	- nb_reads, the number of reads that support this nucleosome
418	e5603c3f	Florent Chuffart	- nb_nucs, the number of TemplateFilter nucleosomes across replicates (= the number of replicates in which it is a well-positioned nucleosome)
419	e5603c3f	Florent Chuffart	- llr_1, for a well-positioned nucleosome, the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosomes on the chain
420	e5603c3f	Florent Chuffart	- llr_2, for a well-positioned nucleosome, the LLR1 between the second and the third TemplateFilter nucleosomes on the chain
421	e5603c3f	Florent Chuffart	- wp_llr, for a well-positioned nucleosome, the LLR2 that compares the consistency of the positioning over all TemplateFilter nucleosomes
422	e5603c3f	Florent Chuffart	- wp_pval, for a well-positioned nucleosome, the p-value of the chi-square test obtained with the LLR2 (`1-pchisq(2*LLR2, df=4)`)
423	e5603c3f	Florent Chuffart	- dyad_shift, for a well-positioned nucleosome, the shift between the two extreme TemplateFilter nucleosome dyad positions
424	dadb6a4d	Florent Chuffart
425	e5603c3f	Florent Chuffart	The fuzzy map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of the following columns:
426	dadb6a4d	Florent Chuffart
427	dadb6a4d	Florent Chuffart	- chr, the number of the chromosome
428	dadb6a4d	Florent Chuffart	- lower_bound, the lower bound of the nucleosome
429	dadb6a4d	Florent Chuffart	- upper_bound, the upper bound of the nucleosome
430	dadb6a4d	Florent Chuffart	- cur_index, index of the CUR
431	dadb6a4d	Florent Chuffart
432	e5603c3f	Florent Chuffart	The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of the following columns:
433	dadb6a4d	Florent Chuffart
434	dadb6a4d	Florent Chuffart	- cur_index, the index of the CUR
435	dadb6a4d	Florent Chuffart	- index_nuc_BY, the index of the BY nucleosome in the CUR
436	e5603c3f	Florent Chuffart	- index_nuc_RM, the index of the RM nucleosome in the CUR
437	e5603c3f	Florent Chuffart	- llr_score, the LLR3 score that estimates conservation between the positions in BY and RM
438	e5603c3f	Florent Chuffart	- common_wp_pval, the p-value of the chi-square test obtained from LLR3 (`1-pchisq(2*LLR3, df=2)`)
439	e5603c3f	Florent Chuffart	- diff, the dyad shift between the positions in the two strains
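
The two p-value columns (`wp_pval` with 4 degrees of freedom and `common_wp_pval` with 2) are chi-square upper-tail probabilities evaluated at twice the LLR. For an even number of degrees of freedom this tail has a closed form, so the computation can be sketched with the Python standard library alone. The function names below are hypothetical, and this is an illustration of the formula rather than the code used by `extract_maps.R`.

```python
import math


def chi2_sf_even_df(x, df):
    """Upper tail 1 - CDF of a chi-square law with an even df.

    For df = 2k, the tail equals exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def wp_pval(llr2):
    """p-value for a well-positioned nucleosome (4 degrees of freedom)."""
    return chi2_sf_even_df(2.0 * llr2, df=4)


def common_wp_pval(llr3):
    """p-value for a common well-positioned nucleosome (2 degrees of freedom)."""
    return chi2_sf_even_df(2.0 * llr3, df=2)
```

With 2 degrees of freedom the p-value reduces to `exp(-LLR3)`, and with 4 degrees of freedom to `exp(-LLR2) * (1 + LLR2)`, which gives a quick way to sanity-check values in the result tables.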
440	dadb6a4d	Florent Chuffart
441	e5603c3f	Florent Chuffart	The common UNR map for BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:
442	dadb6a4d	Florent Chuffart
443	dadb6a4d	Florent Chuffart	- cur_index, the index of the CUR
444	dadb6a4d	Florent Chuffart	- index_nuc_BY, the index of the BY nucleosome in the CUR
445	dadb6a4d	Florent Chuffart	- index_nuc_RM, the index of the RM nucleosome in the CUR
446	dadb6a4d	Florent Chuffart
447	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
448	935a568c	Florent Chuffart
449	935a568c	Florent Chuffart	.. code:: r
450	935a568c	Florent Chuffart
451	dadb6a4d	Florent Chuffart	   source("src/current/extract_maps.R")
452	dadb6a4d	Florent Chuffart
453	dadb6a4d	Florent Chuffart
454	e5603c3f	Florent Chuffart	The Script translate_common_wp.R
455	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
456	dadb6a4d	Florent Chuffart
457	e5603c3f	Florent Chuffart	This script translates the common well-positioned nucleosome maps from one strain to another and stores the result in a table.
458	dadb6a4d	Florent Chuffart
459	e5603c3f	Florent Chuffart	For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains the RM well-positioned nucleosomes translated into BY genome coordinates. It is composed of the following columns:
460	dadb6a4d	Florent Chuffart
461	dadb6a4d	Florent Chuffart	- strain_ref, the reference genome (in which positions are defined)
462	dadb6a4d	Florent Chuffart	- begin, the translated lower bound of the nucleosome
463	dadb6a4d	Florent Chuffart	- end, the translated upper bound of the nucleosome
464	e5603c3f	Florent Chuffart	- chr, the chromosome number in the reference genome (in which positions are defined)
465	dadb6a4d	Florent Chuffart	- length, the length of the nucleosome (could be negative)
466	dadb6a4d	Florent Chuffart	- cur_index, the index of the CUR
467	dadb6a4d	Florent Chuffart	- index_nuc, the index of the nucleosome in the CUR
468	dadb6a4d	Florent Chuffart
469	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
470	935a568c	Florent Chuffart
471	e5603c3f	Florent Chuffart	.. code:: r
472	935a568c	Florent Chuffart
473	e5603c3f	Florent Chuffart	   source("src/current/translate_common_wp.R")
474	b20637ed	Florent Chuffart
475	b20637ed	Florent Chuffart
476	e5603c3f	Florent Chuffart	The Script split_samples.R
477	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^^^
478	b20637ed	Florent Chuffart
479	e5603c3f	Florent Chuffart	For memory usage reasons, we split and compress the TemplateFilter input files by chromosome. For example, `sample_1_TF.tab` will be split into:
480	b20637ed	Florent Chuffart
481	e5603c3f	Florent Chuffart	- sample_1_chr_1_splited_sample.tab.gz
482	e5603c3f	Florent Chuffart	- sample_1_chr_2_splited_sample.tab.gz
483	e5603c3f	Florent Chuffart	- ...
484	e5603c3f	Florent Chuffart	- sample_1_chr_17_splited_sample.tab.gz
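
The splitting scheme can be sketched in Python as follows. The real work is done by the R script `split_samples.R`; this sketch only reuses the file-naming pattern shown in the list above and assumes that the first tab-separated field of each row is the chromosome number. The function name and the toy rows are hypothetical.

```python
import gzip
import os
import tempfile
from collections import defaultdict


def split_tf_rows(rows, sample_id, out_dir):
    """Write one gzip-compressed file per chromosome and return their paths."""
    by_chr = defaultdict(list)
    for row in rows:
        # The chromosome number is the first tab-separated field.
        by_chr[row.split("\t", 1)[0]].append(row)
    paths = {}
    for chrom, chrom_rows in by_chr.items():
        name = "sample_%s_chr_%s_splited_sample.tab.gz" % (sample_id, chrom)
        path = os.path.join(out_dir, name)
        with gzip.open(path, "wt") as handle:
            handle.write("\n".join(chrom_rows) + "\n")
        paths[chrom] = path
    return paths


# Toy TemplateFilter rows: chr, coord, strand, #reads.
tf_rows = ["1\t100\tF\t2", "2\t50\tR\t1", "1\t300\tR\t4"]
split_paths = split_tf_rows(tf_rows, 1, tempfile.mkdtemp())
```

Later steps can then load a single chromosome at a time instead of holding a whole sample in memory.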
485	e5603c3f	Florent Chuffart
486	e5603c3f	Florent Chuffart
487	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
488	b20637ed	Florent Chuffart
489	b20637ed	Florent Chuffart	.. code:: r
490	b20637ed	Florent Chuffart
491	e5603c3f	Florent Chuffart	   source("src/current/split_samples.R")
492	b20637ed	Florent Chuffart
493	b20637ed	Florent Chuffart
494	e5603c3f	Florent Chuffart	The Script count_reads.R
495	e5603c3f	Florent Chuffart	^^^^^^^^^^^^^^^^^^^^^^^^
496	e5603c3f	Florent Chuffart
497	e5603c3f	Florent Chuffart	To associate a number of observations (reads) with each nucleosome, we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab`, `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab`
498	e5603c3f	Florent Chuffart	for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, MNase common well-positioned nucleosomes and MNase UNRs, respectively.
499	e5603c3f	Florent Chuffart
500	e5603c3f	Florent Chuffart	For example, the file `BY_RM_H3K14ac_unr_and_nbreads.tab` contains counted reads for well-positioned nucleosomes with the experimental condition ChIP H3K14ac. It is composed of the following columns:
501	e5603c3f	Florent Chuffart
502	e5603c3f	Florent Chuffart	- chr_BY, the number of the chromosome for BY
503	e5603c3f	Florent Chuffart	- lower_bound_BY, the lower bound of the nucleosome for BY
504	e5603c3f	Florent Chuffart	- upper_bound_BY, the upper bound of the nucleosome for BY
505	e5603c3f	Florent Chuffart	- index_nuc_BY, the index of the BY nucleosome in the CUR for BY
506	e5603c3f	Florent Chuffart	- chr_RM, the number of the chromosome for RM
507	e5603c3f	Florent Chuffart	- lower_bound_RM, the lower bound of the nucleosome for RM
508	e5603c3f	Florent Chuffart	- upper_bound_RM, the upper bound of the nucleosome for RM
509	e5603c3f	Florent Chuffart	- index_nuc_RM,the index of the RM nucleosome in the CUR for RM
510	e5603c3f	Florent Chuffart	- cur_index, index of the CUR
511	e5603c3f	Florent Chuffart	- BY_H3K14ac_36, the number of reads for the current nucleosome for the sample 36
512	e5603c3f	Florent Chuffart	- BY_H3K14ac_37, #reads for sample 37
513	e5603c3f	Florent Chuffart	- BY_H3K14ac_53, #reads for sample 53
514	e5603c3f	Florent Chuffart	- RM_H3K14ac_38, #reads for sample 38
515	e5603c3f	Florent Chuffart	- RM_H3K14ac_39, #reads for sample 39
516	e5603c3f	Florent Chuffart
517	e5603c3f	Florent Chuffart	To execute this script, run the following command in your R console:
518	935a568c	Florent Chuffart
519	935a568c	Florent Chuffart	.. code:: bash
520	935a568c	Florent Chuffart
521	e5603c3f	Florent Chuffart	source("src/current/count_reads.R")
522	e5603c3f	Florent Chuffart
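The counting idea can be sketched in Python as follows. This is not the implementation of `count_reads.R`. It assumes that each read is summarised by a single position (for instance its centre) and that a read counts for a nucleosome when this position lies between the lower and upper bounds, both inclusive. The helper `count_reads_in_regions` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def count_reads_in_regions(read_positions, regions):
    """Count the reads whose position lies in each [lower, upper] region.

    read_positions: iterable of integer positions on one chromosome.
    regions: list of (lower_bound, upper_bound) pairs, bounds inclusive.
    """
    positions = sorted(read_positions)
    counts = []
    for lower, upper in regions:
        first = bisect_left(positions, lower)   # index of first read >= lower
        last = bisect_right(positions, upper)   # one past last read <= upper
        counts.append(last - first)
    return counts


reads = [95, 100, 150, 246, 247, 400]
nucleosomes = [(100, 246), (300, 446)]
print(count_reads_in_regions(reads, nucleosomes))  # prints [3, 1]
```

Sorting the reads once and using binary search keeps the cost of each nucleosome logarithmic in the number of reads.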

The Script get_size_factors.R
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. Size factors normalise read counts from sample to sample, as determined by DESeq: when a sample has n reads for a nucleosome or a UNR, the normalised count is n/f, where f is the size factor of that sample. The script dumps the computed size factors into the file `size_factors.tab`. This file has the form:

========= ======= ======= =======
sample_id wp      unr     wpunr
========= ======= ======= =======
1         0.87396 0.88097 0.87584
2         1.07890 1.07440 1.07760
3         1.06400 1.05890 1.06250
4         0.85782 0.87948 0.86305
5         0.97577 0.96590 0.97307
6         1.19630 1.18120 1.19190
36        0.93318 0.92762 0.93166
37        0.48315 0.48453 0.48350
38        1.11240 1.11210 1.11230
39        0.89897 0.89917 0.89903
53        2.22650 2.22700 2.22660
========= ======= ======= =======

The `sample_id` values are given in the file `samples.csv`.

If you don't know which column to use, we recommend using wpunr.

The columns correspond to the regions from which DESeq computed the factors:

- unr: factor computed from the data of UNRs. These regions are defined for every pair of aligned genomes (e.g. BY_RM).
- wp: same, but for well-positioned nucleosomes.
- wpunr: both types of regions.

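As a worked example of the n/f normalisation, the short Python sketch below applies the wpunr factors of the H3K14ac samples, copied from the table above, to a raw count. The names `WPUNR_SIZE_FACTORS` and `normalise_count` are hypothetical and only serve the illustration.

```python
# wpunr size factors copied from the table above (sample_id -> f).
WPUNR_SIZE_FACTORS = {
    36: 0.93166,
    37: 0.48350,
    38: 1.11230,
    39: 0.89903,
    53: 2.22660,
}


def normalise_count(n_reads, sample_id, size_factors=WPUNR_SIZE_FACTORS):
    """Return the normalised count n/f for one sample."""
    return n_reads / size_factors[sample_id]


# The same raw count of 100 reads weighs very differently across samples:
# about 206.8 for sample 37 and about 44.9 for sample 53.
for sample_id in (37, 53):
    print(sample_id, round(normalise_count(100, sample_id), 1))
```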

To execute this script, run the following command in your R console:

.. code:: r

    source("src/current/get_size_factors.R")

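For intuition, the median-of-ratios estimator on which DESeq's `estimateSizeFactors` is based can be sketched in plain Python. This is a simplified illustration and not a substitute for DESeq: regions with a zero count in any sample are simply skipped, and the function name `median_of_ratios` is hypothetical.

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Estimate one size factor per sample.

    counts: list of rows, one row per region (nucleosome or UNR) and
            one column per sample. At least one row must be free of
            zero counts, otherwise no ratio can be computed.
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c <= 0 for c in row):
            continue  # geometric mean would be zero: region is skipped
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Size factor = median ratio of a sample's counts to the geometric mean.
    return [math.exp(median(r)) for r in log_ratios]


# The second sample is sequenced twice as deeply as the first one.
toy_counts = [[10, 20], [30, 60], [5, 10], [0, 7]]
print([round(f, 3) for f in median_of_ratios(toy_counts)])  # prints [0.707, 1.414]
```

Dividing each column by its factor brings the two toy samples to the same scale, which is the n/f normalisation described above.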

The Script launch_deseq.R
^^^^^^^^^^^^^^^^^^^^^^^^^

Finally, the script `launch_deseq.R` performs the statistical analysis on each nucleosome using `DESeq`. It produces the following files:

- results/current/BY_RM_H3K14ac_wp_snep.tab
- results/current/BY_RM_H3K14ac_unr_snep.tab
- results/current/BY_RM_H3K14ac_wpunr_snep.tab
- results/current/BY_RM_H3K14ac_wp_mnase.tab
- results/current/BY_RM_H3K14ac_unr_mnase.tab
- results/current/BY_RM_H3K14ac_wpunr_mnase.tab

These files are organised with the following columns (see file `BY_RM_H3K14ac_wp_snep.tab` for an example):

- chr_BY, the number of the chromosome for BY
- lower_bound_BY, the lower bound of the nucleosome for BY
- upper_bound_BY, the upper bound of the nucleosome for BY
- index_nuc_BY, the index of the BY nucleosome in the CUR for BY
- chr_RM, the number of the chromosome for RM
- lower_bound_RM, the lower bound of the nucleosome for RM
- upper_bound_RM, the upper bound of the nucleosome for RM
- index_nuc_RM, the index of the RM nucleosome in the CUR for RM
- cur_index, index of the CUR
- form, the form of the region (e.g. `wp` or `unr`)

The next columns hold the indicators (read counts) for each sample:

- BY_Mnase_Seq_1, the number of reads for the current nucleosome for sample 1
- BY_Mnase_Seq_2, #reads for sample 2
- BY_Mnase_Seq_3, #reads for sample 3
- RM_Mnase_Seq_4, #reads for sample 4
- RM_Mnase_Seq_5, #reads for sample 5
- RM_Mnase_Seq_6, #reads for sample 6
- BY_H3K14ac_36, #reads for sample 36
- BY_H3K14ac_37, #reads for sample 37
- BY_H3K14ac_53, #reads for sample 53
- RM_H3K14ac_38, #reads for sample 38
- RM_H3K14ac_39, #reads for sample 39

The last 5 columns concern the DESeq analysis:

- manip[a_manip], strain[a_strain] and manip[a_manip]:strain[a_strain], the manip (marker) effect, the strain effect and the SNEP (interaction) effect. These are the coefficients of the fitted generalized linear model.
- pvalsGLM, the p-value obtained by comparing the GLM with and without the interaction term marker:strain. This is the statistical significance of the interaction term and therefore the statistical significance of the SNEP.
- snep_index, a boolean set to TRUE if the pvalsGLM value is under the threshold computed with the FDR function with a rate set to 0.0001.

To execute this script, run the following command in your R console:

.. code:: r

    source("src/current/launch_deseq.R")

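The tutorial does not say which FDR procedure is behind the `snep_index` threshold. The Python sketch below assumes the Benjamini-Hochberg procedure, which is only one possible choice, to show how a p-value threshold can be derived from a rate of 0.0001 and then turned into a boolean column. The function names `bh_threshold` and `snep_index` are hypothetical.

```python
def bh_threshold(pvalues, rate=0.0001):
    """Benjamini-Hochberg threshold (an assumed choice of FDR procedure).

    Returns the largest sorted p-value p_(k) such that
    p_(k) <= rate * k / m, or None when no p-value qualifies.
    """
    ordered = sorted(pvalues)
    m = len(ordered)
    threshold = None
    for k, p in enumerate(ordered, start=1):
        if p <= rate * k / m:
            threshold = p
    return threshold


def snep_index(pvalues, rate=0.0001):
    """Return one boolean per p-value: True when it passes the threshold."""
    threshold = bh_threshold(pvalues, rate)
    if threshold is None:
        return [False] * len(pvalues)
    return [p <= threshold for p in pvalues]


print(snep_index([0.5, 1e-9, 0.03, 2e-5]))  # prints [False, True, False, True]
```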

Results: Number of SNEPs
------------------------

The table below gives the number of SNEPs computed for each form.

===== ======= ===== =======
form  strains #nucs H3K14ac
===== ======= ===== =======
wp    BY-RM   30464 3549
unr   BY-RM   9497  1559
wpunr BY-RM   39961 5240
===== ======= ===== =======

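Counts of this kind can be recomputed from a result file with a few lines of Python. The sketch below assumes that the file is tab-separated, starts with a header row containing a `snep_index` column, and stores booleans as `TRUE`/`FALSE`, as R usually writes them; check the actual files before relying on it. The function `count_sneps` is hypothetical.

```python
import csv


def count_sneps(tab_path):
    """Return (number of regions, number of SNEPs) for one result file.

    Assumptions: tab-separated file with a header row and a
    `snep_index` column holding TRUE or FALSE.
    """
    n_regions = 0
    n_sneps = 0
    with open(tab_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            n_regions += 1
            if row["snep_index"].strip().upper() == "TRUE":
                n_sneps += 1
    return n_regions, n_sneps
```

For instance, `count_sneps("results/current/BY_RM_H3K14ac_wp_snep.tab")` should return the `#nucs` and `H3K14ac` values of the `wp` row if these assumptions hold.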


APPENDIX: Generate .c2c Files
-----------------------------

TODO: make this procedure work properly from the working directory.

The `.c2c` file is a simple table that describes how the genome sequences can be aligned. We generate it using NucleoMiner 1.0.

To install NucleoMiner 1.0 on your UNIX/Linux computer, you first need to install the Genetic Data analysis Library (GDL), which is a dynamic library of useful C functions derived from the GNU Scientific Library.

Installing the GDL library
^^^^^^^^^^^^^^^^^^^^^^^^^^

The `gdl-1.0.tar.gz` archive is in the `deps` directory of your working directory. Copy it into a dedicated directory, go into this directory and unpack the archive, which creates a directory called `gdl-1.0`. Then go into `gdl-1.0` and compile the library, by typing:

.. code:: bash

    mkdir tmp_c2c_workdir
    cd tmp_c2c_workdir
    cp ../deps/gdl-1.0.tar.gz .
    tar -xvzf gdl-1.0.tar.gz
    cd gdl-1.0
    ./configure
    make

    cd ..

Now you need to install the library on your system. This needs root privileges:

.. code:: bash

    # Run from tmp_c2c_workdir; the subshell leaves the current directory unchanged.
    (cd gdl-1.0 && sudo make install)

Installing NucleoMiner 1.0
^^^^^^^^^^^^^^^^^^^^^^^^^^

Get the `nucleominer-1.0.tar.gz` archive on your computer (the commands below take it from the `deps` directory). Copy it into the same dedicated directory and unpack it, which creates a directory called `nucleominer-1.0`. Then go into this directory, create a symbolic link to the `gdl` directory of GDL and compile, by typing:

.. code:: bash

    cp ../deps/nucleominer-1.0.tar.gz .
    tar -xvzf nucleominer-1.0.tar.gz
    cd nucleominer-1.0
    ln -s ../gdl-1.0/gdl
    ./configure
    make

You can then use the binaries directly from this folder (it is then best to add the path to this folder to your PATH environment variable). If you want to install NucleoMiner at the system level (useful if multiple users will need it), type the following with root privileges:

.. code:: bash

    sudo make install

Generate .c2c Files
^^^^^^^^^^^^^^^^^^^

To generate the .c2c files, type the following commands in a terminal:

.. code:: bash

    mkdir dir_4_c2c
    NMgxcomp ../data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
      ../data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
      dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log

After execution, the directory `dir_4_c2c` will hold the .c2c files.
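If you prefer to drive this step from Python, the call can be wrapped as in the sketch below. It passes exactly the three arguments used in the command above and redirects the standard error stream to the log file. The function names are hypothetical, and the sketch assumes that `NMgxcomp` is on your PATH and that the generated files end in `.c2c`.

```python
import glob
import os
import subprocess


def build_nmgxcomp_command(genome_a, genome_b, out_prefix):
    """Return the argument list of the NMgxcomp call shown above."""
    return ["NMgxcomp", genome_a, genome_b, out_prefix]


def run_nmgxcomp(genome_a, genome_b, out_dir="dir_4_c2c", name="BY_RM"):
    """Run NMgxcomp, write stderr to <name>.log and list the .c2c files.

    Assumption: the output files carry the .c2c extension.
    """
    os.makedirs(out_dir, exist_ok=True)
    prefix = os.path.join(out_dir, name)
    with open(prefix + ".log", "w") as log:
        subprocess.run(
            build_nmgxcomp_command(genome_a, genome_b, prefix),
            stderr=log,
            check=True,  # raise CalledProcessError if NMgxcomp fails
        )
    return sorted(glob.glob(os.path.join(out_dir, "*.c2c")))
```

Passing the arguments as a list avoids the shell entirely, so no line-continuation or quoting issue can arise.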
