Révision 3c88abd0
b/README | ||
---|---|---|
1 | 1 |
|
2 |
Welcome to *NucleoMiner2* |
|
3 |
************************* |
|
4 |
|
|
5 |
* Readme / Documentation for *NucleoMiner2* |
|
6 |
|
|
7 |
* License |
|
8 |
|
|
9 |
* Installation Instructions |
|
10 |
|
|
11 |
* Tutorial |
|
12 |
|
|
13 |
* Experimental Dataset, Working Directory and Configuration File |
|
14 |
|
|
15 |
* Preprocessing Illumina Fastq Reads for Each Sample |
|
16 |
|
|
17 |
* Inferring Nucleosome Position and Extracting Read Counts |
|
18 |
|
|
19 |
* Results: Number of SNEPs |
|
20 |
|
|
21 |
* APPENDICE: Generate .c2c Files |
|
22 |
|
|
23 |
* References |
|
24 |
|
|
25 |
* Python Reference |
|
26 |
|
|
27 |
* R Reference |
|
28 |
|
|
2 | 29 |
Readme / Documentation for *NucleoMiner2* |
3 | 30 |
***************************************** |
4 | 31 |
|
... | ... | |
130 | 157 |
R CMD INSTALL doc/Chuffart_NM2_workdir/deps/bot_0.14.tar.gz\ |
131 | 158 |
doc/Chuffart_NM2_workdir/deps/cachecache_0.1.tar.gz\ |
132 | 159 |
build/nucleominer_2.3.46.tar.gz |
160 |
|
|
161 |
Tutorial |
|
162 |
******** |
|
163 |
|
|
164 |
This tutorial describes steps allowing performing quantitative |
|
165 |
analysis of epigenetic marks on individual nucleosomes. We assume that |
|
166 |
files are organised according to a given hierarchy and that all |
|
167 |
command lines are launched from the project’s root directory. |
|
168 |
|
|
169 |
This tutorial is divided into two main parts. The first part covers |
|
170 |
the python script *wf.py* that aligns and converts short sequence |
|
171 |
reads. The second part covers the R scripts that extracts information |
|
172 |
(nucleosome position and indicators) from the dataset. |
|
173 |
|
|
174 |
|
|
175 |
Experimental Dataset, Working Directory and Configuration File |
|
176 |
============================================================== |
|
177 |
|
|
178 |
|
|
179 |
Working Directory Organisation |
|
180 |
------------------------------ |
|
181 |
|
|
182 |
After having install NucleoMiner2 environment (Previous section), go |
|
183 |
to the root working directory of the tutorial by typing the following |
|
184 |
command in a terminal: |
|
185 |
|
|
186 |
cd doc/Chuffart_NM2_workdir/ |
|
187 |
|
|
188 |
|
|
189 |
Retrieving Experimental Dataset |
|
190 |
------------------------------- |
|
191 |
|
|
192 |
The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress |
|
193 |
(http://www.ebi.ac.uk/arrayexpress/) under accession number |
|
194 |
E-MTAB-2671. |
|
195 |
|
|
196 |
$$$ TODO explain how organise Experimental Dataset into the *data* |
|
197 |
directory of the working directory. |
|
198 |
|
|
199 |
We want to compare nucleosomes of 2 yeast strains: BY and RM. For each |
|
200 |
strain we performed Mnase-Seq and ChIP-Seq using an antibody |
|
201 |
recognizing the H3K14ac epigenetic mark. |
|
202 |
|
|
203 |
The dataset is composed of 55 files organised as follows: |
|
204 |
|
|
205 |
* 3 replicates for BY MNase Seq |
|
206 |
|
|
207 |
* sample 1 (5 fastq.gz files) |
|
208 |
|
|
209 |
* sample 2 (5 fastq.gz files) |
|
210 |
|
|
211 |
* sample 3 (4 fastq.gz files) |
|
212 |
|
|
213 |
* 3 replicates for RM MNase Seq |
|
214 |
|
|
215 |
* sample 4 (4 fastq.gz files) |
|
216 |
|
|
217 |
* sample 5 (4 fastq.gz files) |
|
218 |
|
|
219 |
* sample 6 (5 fastq.gz files) |
|
220 |
|
|
221 |
* 3 replicates for BY ChIP Seq H3K14ac |
|
222 |
|
|
223 |
* sample 36 (5 fastq.gz files) |
|
224 |
|
|
225 |
* sample 37 (5 fastq.gz files) |
|
226 |
|
|
227 |
* sample 53 (9 fastq.gz files) |
|
228 |
|
|
229 |
* 2 replicates for RM ChIP Seq H3K14ac |
|
230 |
|
|
231 |
* sample 38 (5 fastq.gz files) |
|
232 |
|
|
233 |
* sample 39 (4 fastq.gz files) |
|
234 |
|
|
235 |
|
|
236 |
Python and R Common Configuration File |
|
237 |
-------------------------------------- |
|
238 |
|
|
239 |
First of all we define in one place some configuration variables that |
|
240 |
will be launched by python and R scripts. These variables are |
|
241 |
contained in file *configurator.py*. The execution of this python |
|
242 |
script dumps variables into the *nucleominer_config.json* file that |
|
243 |
will then be used by both R and python scripts. |
|
244 |
|
|
245 |
To do this, go to the root directory of your project and run the |
|
246 |
following command: |
|
247 |
|
|
248 |
python src/current/configurator.py |
|
249 |
|
|
250 |
|
|
251 |
Preprocessing Illumina Fastq Reads for Each Sample |
|
252 |
================================================== |
|
253 |
|
|
254 |
This preprocessing step consists of 4 main steps embedded in the |
|
255 |
*wf.py* script. They are described bellow. As a preamble, this script |
|
256 |
computes *samples*, *samples_mnase* and *strains* that will be used |
|
257 |
along the 4 steps. |
|
258 |
|
|
259 |
wf.samples = [] |
|
260 |
|
|
261 |
List of samples where a sample is identified by an id (key: *id*) |
|
262 |
and a strain name (key *strain*). |
|
263 |
|
|
264 |
wf.samples_mnase = [] |
|
265 |
|
|
266 |
List of Mnase samples. |
|
267 |
|
|
268 |
wf.strains = [] |
|
269 |
|
|
270 |
List of reference strains. |
|
271 |
|
|
272 |
|
|
273 |
Creating Bowtie Index from each Reference Genome |
|
274 |
------------------------------------------------ |
|
275 |
|
|
276 |
For each strain, we need to create bowtie index. Bowtie index of a |
|
277 |
strain is a tree view of the genome of this strain. It will be used by |
|
278 |
bowtie to align reads. This step is performed by the following part of |
|
279 |
the *wf.py* script: |
|
280 |
|
|
281 |
for strain in strains: |
|
282 |
per_strain_stats[strain] = create_bowtie_index(strain, |
|
283 |
config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"], |
|
284 |
config["BOWTIE_BUILD_BIN"]) |
|
285 |
|
|
286 |
The following table summarizes the file sizes and process durations |
|
287 |
concerning this step. |
|
288 |
|
|
289 |
+--------+------------------------+------------------------+------------------+ |
|
290 |
| strain | fasta genome file size | bowtie index file size | process duration | |
|
291 |
+========+========================+========================+==================+ |
|
292 |
| BY | 12 Mo | 25 Mo | 11 s. | |
|
293 |
+--------+------------------------+------------------------+------------------+ |
|
294 |
| RM | 12 Mo | 24 Mo | 9 s. | |
|
295 |
+--------+------------------------+------------------------+------------------+ |
|
296 |
|
|
297 |
|
|
298 |
Aligning Reads to Reference Genome |
|
299 |
---------------------------------- |
|
300 |
|
|
301 |
Next, we launch bowtie to align reads to the reference genome. It |
|
302 |
produces a *.sam* file that we convert into a *.bed* file. Binaries |
|
303 |
for *bowtie*, *samtools* and *bedtools* are wrapped using python |
|
304 |
*subprocess* class. This step is performed by the following part of |
|
305 |
the *wf.py* script: |
|
306 |
|
|
307 |
for sample in samples: |
|
308 |
per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample, |
|
309 |
config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"], |
|
310 |
config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"], |
|
311 |
config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"]) |
|
312 |
|
|
313 |
|
|
314 |
Convert Aligned Reads into TemplateFilter Format |
|
315 |
------------------------------------------------ |
|
316 |
|
|
317 |
TemplateFilter uses particular input formats for reads, so it is |
|
318 |
necessary to convert the *.bed* files. TemplateFilter expect reads as |
|
319 |
follows: *chr*, *coord*, *strand* and *#read* where: |
|
320 |
|
|
321 |
* *chr* is the number of the chromosome; |
|
322 |
|
|
323 |
* *coord* is the coordinate of the reads; |
|
324 |
|
|
325 |
* *strand* is *F* for forward and *R* for reverse; |
|
326 |
|
|
327 |
* *#reads* the number of reads covering this position. |
|
328 |
|
|
329 |
Each entry is *tab*-separated. |
|
330 |
|
|
331 |
**WARNING** for reverse strands, bowtie returns the position of the |
|
332 |
first nucleotide on the left hand side, whereas TemplateFilter expects |
|
333 |
the first one on the right hand side. This step takes this into |
|
334 |
account by adding the read length (in our case 50) to the reverse |
|
335 |
reads coordinates. |
|
336 |
|
|
337 |
This step is performed by the following part of the *wf.py* script: |
|
338 |
|
|
339 |
for sample in samples: |
|
340 |
per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample, |
|
341 |
config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"], |
|
342 |
config["READ_LENGTH"],config["MAPQ_THRES"]) |
|
343 |
|
|
344 |
The following table summarises the number of reads, the involved file |
|
345 |
sizes and process durations concerning the two last steps. In our |
|
346 |
case, alignment process have been multithreaded over 3 cores. |
|
347 |
|
|
348 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
349 |
| id | Illumina reads | aligned and filtred reads | ratio | *.bed* file size | TF input file size | process duration | |
|
350 |
+====+================+===========================+========+==================+====================+==================+ |
|
351 |
| 1 | 16436138 | 10199695 | 62,06% | 1064 Mo | 60 Mo | 383 s. | |
|
352 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
353 |
| 2 | 16911132 | 12512727 | 73,99% | 1298 Mo | 64 Mo | 437 s. | |
|
354 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
355 |
| 3 | 15946902 | 12340426 | 77,38% | 1280 Mo | 65 Mo | 423 s. | |
|
356 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
357 |
| 4 | 13765584 | 10381903 | 75,42% | 931 Mo | 59 Mo | 352 s. | |
|
358 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
359 |
| 5 | 15168268 | 11502855 | 75,83% | 1031 Mo | 64 Mo | 386 s. | |
|
360 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
361 |
| 6 | 18850820 | 14024905 | 74,40% | 1254 Mo | 69 Mo | 482 s. | |
|
362 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
363 |
| 36 | 17715118 | 14092985 | 79,55% | 1404 Mo | 68 Mo | 483 s. | |
|
364 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
365 |
| 37 | 17288466 | 7402082 | 42,82% | 741 Mo | 48 Mo | 339 s. | |
|
366 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
367 |
| 38 | 16116394 | 13178457 | 81,77% | 1101 Mo | 63 Mo | 420 s. | |
|
368 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
369 |
| 39 | 14241106 | 10537228 | 73,99% | 880 Mo | 57 Mo | 348 s. | |
|
370 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
371 |
| 53 | 40876476 | 33780065 | 82,64% | 3316 Mo | 103 Mo | 1165 s. | |
|
372 |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+ |
|
373 |
|
|
374 |
|
|
375 |
Run TemplateFilter on Mnase Samples |
|
376 |
----------------------------------- |
|
377 |
|
|
378 |
Finally, for each sample we perform TemplateFilter analysis. |
|
379 |
|
|
380 |
**WARNING** TemplateFilter returns a list of nucleosomes. Each |
|
381 |
nucleosome is define by its center and its width. An odd width leads |
|
382 |
us to consider non- integer lower and upper bound. |
|
383 |
|
|
384 |
**WARNING** TemplateFilter is not designed to deal with replicates. So |
|
385 |
we recommend to keep a maximum of nucleosomes and filter the aberrant |
|
386 |
ones afterwards using the benefits of having replicates. To do this, |
|
387 |
we set a low correlation threshold parameter (0.5) and a particularly |
|
388 |
high value of overlap (300%). |
|
389 |
|
|
390 |
This step is performed by the following part of the *wf.py* script: |
|
391 |
|
|
392 |
for sample in samples_mnase: |
|
393 |
per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, |
|
394 |
config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], |
|
395 |
config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], |
|
396 |
config["TF_MAXW"], config["TF_OL"]) |
|
397 |
|
|
398 |
+----+--------+------------+---------------+------------------+ |
|
399 |
| id | strain | found nucs | nuc file size | process duration | |
|
400 |
+====+========+============+===============+==================+ |
|
401 |
| 1 | BY | 96214 | 68 Mo | 1022 s. | |
|
402 |
+----+--------+------------+---------------+------------------+ |
|
403 |
| 2 | BY | 91694 | 65 Mo | 1038 s. | |
|
404 |
+----+--------+------------+---------------+------------------+ |
|
405 |
| 3 | BY | 91205 | 65 Mo | 1036 s. | |
|
406 |
+----+--------+------------+---------------+------------------+ |
|
407 |
| 4 | RM | 88076 | 62 Mo | 984 s. | |
|
408 |
+----+--------+------------+---------------+------------------+ |
|
409 |
| 5 | RM | 90141 | 64 Mo | 967 s. | |
|
410 |
+----+--------+------------+---------------+------------------+ |
|
411 |
| 6 | RM | 87517 | 62 Mo | 980 s. | |
|
412 |
+----+--------+------------+---------------+------------------+ |
|
413 |
|
|
414 |
|
|
415 |
Inferring Nucleosome Position and Extracting Read Counts |
|
416 |
======================================================== |
|
417 |
|
|
418 |
The second part of the tutorial uses R |
|
419 |
(http://http://www.r-project.org). It consists of a set of R scripts |
|
420 |
that will be sourced in an R from a console launched at the root of |
|
421 |
your project. These scripts are: |
|
422 |
|
|
423 |
* headers.R |
|
424 |
|
|
425 |
* extract_maps.R |
|
426 |
|
|
427 |
* translate_common_wp.R |
|
428 |
|
|
429 |
* split_samples.R |
|
430 |
|
|
431 |
* count_reads.R |
|
432 |
|
|
433 |
* get_size_factors |
|
434 |
|
|
435 |
* launch_deseq.R |
|
436 |
|
|
437 |
|
|
438 |
The Script headers.R |
|
439 |
-------------------- |
|
440 |
|
|
441 |
The script headers.R is included in each other scripts. It is in |
|
442 |
charge of: |
|
443 |
|
|
444 |
* launching libraries used in the scripts |
|
445 |
|
|
446 |
* launching configuration (design, strain, marker...) |
|
447 |
|
|
448 |
* computing and caching CURs (caching means storing the |
|
449 |
information in the computer's memory) |
|
450 |
|
|
451 |
Note that you can customize the function “translate”. This function |
|
452 |
allows you to use the alignments between genomes when performing |
|
453 |
various tasks. You may be using NucleoMiner2 to analyse data of a |
|
454 |
single strain, or of several strains. |
|
455 |
|
|
456 |
* All the data corresponds to the same strain (e.g. |
|
457 |
treatment/control, or only few mutations): Then in step 1), the |
|
458 |
regions to use are entire chromosomes. Instep 2) simply use the |
|
459 |
default translate function which is neutral. |
|
460 |
|
|
461 |
* The data come from two or more strains: In this case, edit a |
|
462 |
list of regions and customize the translate function which |
|
463 |
performs the correspondence between the different genomes. How we |
|
464 |
did it: a .c2c file is obtained with NucleoMiner 1.0 (refer to |
|
465 |
the Appendice "Generate .c2c Files"), then use it to produce the |
|
466 |
list of regions and customise “translate”. |
|
467 |
|
|
468 |
In your R console, run the following command line: |
|
469 |
|
|
470 |
source("src/current/headers.R") |
|
471 |
|
|
472 |
|
|
473 |
The Script extract_maps.R |
|
474 |
------------------------- |
|
475 |
|
|
476 |
This script is in charge of extracting Maps for well-positioned and |
|
477 |
fuzzy nucleosomes. First of all, this script computes intra and inter- |
|
478 |
strain nucleosome maps for each CUR. This step is executed in parallel |
|
479 |
on many cores using the BoT library. Next, it collects results and |
|
480 |
produces well-positioned, fuzzy and UNR maps. |
|
481 |
|
|
482 |
The well-positioned map for BY is collected in the result directory |
|
483 |
and is called *BY_wp.tab*. It is composed of following columns: |
|
484 |
|
|
485 |
* chr, the number of the chromosome |
|
486 |
|
|
487 |
* lower_bound, the lower bound of the nucleosome |
|
488 |
|
|
489 |
* upper_bound, the upper bound of the nucleosome |
|
490 |
|
|
491 |
* cur_index, index of the CUR |
|
492 |
|
|
493 |
* index_nuc, the index of the nucleosome in the CUR |
|
494 |
|
|
495 |
* wp, 1 if it is a well positioned nucleosome, 0 otherwise |
|
496 |
|
|
497 |
* nb_reads, the number of reads that support this nucleosome |
|
498 |
|
|
499 |
* nb_nucs, the number of TemplateFilter nucleosome across |
|
500 |
replicates (= the number of replicates in which it is a well- |
|
501 |
positioned nucleosome) |
|
502 |
|
|
503 |
* llr_1, for a well-positioned nucleosome, it is the LLR1 (log- |
|
504 |
likelihood ratio) between the first and the second TemplateFilter |
|
505 |
nucleosome on the chain. |
|
506 |
|
|
507 |
* llr_2, for a well-positioned nucleosome, it is the LLR1 between |
|
508 |
the second and the third TemplateFilter nucleosome on the chain. |
|
509 |
|
|
510 |
* wp_llr, for a well-positioned nucleosome, it is the LLR2 that |
|
511 |
compares consistency of the positioning over all TemplateFilter |
|
512 |
nucleosomes. |
|
513 |
|
|
514 |
* wp_pval, for a well-positioned nucleosome, it is the p-value |
|
515 |
chi square test obtained with the LLR2 (*1-pchisq(2.LLR2, df=4)*) |
|
516 |
|
|
517 |
* dyad_shift, for a well-positioned nucleosome, it is the shift |
|
518 |
between the two extreme TemplateFilter nucleosome dyad positions. |
|
519 |
|
|
520 |
The fuzzy map for BY is collected in the result directory and is |
|
521 |
called *BY_fuzzy.tab*. It is composed of following columns: |
|
522 |
|
|
523 |
* chr, the number of the chromosome |
|
524 |
|
|
525 |
* lower_bound, the lower bound of the nucleosome |
|
526 |
|
|
527 |
* upper_bound, the upper bound of the nucleosome |
|
528 |
|
|
529 |
* cur_index, index of the CUR |
|
530 |
|
|
531 |
The map of common well-positioned nucleosomes aligned between the BY |
|
532 |
and RM strains is collected in the result directory and is called |
|
533 |
*BY_RM_common_wp.tab*. It is composed of following columns: |
|
534 |
|
|
535 |
* cur_index, the index of the CUR |
|
536 |
|
|
537 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
|
538 |
|
|
539 |
* index_nuc_RM, the index of the RM nucleosome in the CUR |
|
540 |
|
|
541 |
* llr_score, , the LLR3 score that estimates conservation between |
|
542 |
the positions in BY and RM |
|
543 |
|
|
544 |
* common_wp_pval, the p-value chi square test obtained from LLR3 |
|
545 |
(*1-pchisq(2.LLR3, df=2)*) |
|
546 |
|
|
547 |
* diff, the dyads shift between the positions in the two strains |
|
548 |
|
|
549 |
The common UNR map for BY and RM strains is collected in the result |
|
550 |
directory and is called *BY_RM_common_unr.tab*. It is composed of the |
|
551 |
following columns: |
|
552 |
|
|
553 |
* cur_index, the index of the CUR |
|
554 |
|
|
555 |
* index_nuc_BY, the index of the BY nucleosome in the CUR |
|
556 |
|
|
557 |
* index_nuc_RM,the index of the RM nucleosome in the CUR |
|
558 |
|
|
559 |
To execute this script, run the following command in your R console: |
|
560 |
|
|
561 |
source("src/current/extract_maps.R") |
|
562 |
|
|
563 |
|
|
564 |
The Script translate_common_wp.R |
|
565 |
-------------------------------- |
|
566 |
|
|
567 |
This script is used to translate common well-positioned nucleosome |
|
568 |
maps from a strain to another strain and stores it into a table. |
|
569 |
|
|
570 |
For example, the file *results/2014-04/RM_wp_tr_2_BY.tab* contains RM |
|
571 |
well-positioned nucleosome translated into the BY genome coordinates. |
|
572 |
It is composed of following columns: |
|
573 |
|
|
574 |
* strain_ref, the reference genome (in which positioned are |
|
575 |
defined) |
|
576 |
|
|
577 |
* begin, the translated lower bound of the nucleosome |
|
578 |
|
|
579 |
* end, the translated upper bound of the nucleosome |
|
580 |
|
|
581 |
* chr, the number of chromosomes for the reference genome (in |
|
582 |
which positioned are defined) |
|
583 |
|
|
584 |
* length, the length of the nucleosome (could be negative) |
|
585 |
|
|
586 |
* cur_index, the index of the CUR |
|
587 |
|
|
588 |
* index_nuc, the index of the nucleosome in the CUR |
|
589 |
|
|
590 |
To execute this script, run the following command in your R console: |
|
591 |
|
|
592 |
source("src/current/translate_common_wp.R") |
|
593 |
|
|
594 |
|
|
595 |
The Script split_samples.R |
|
596 |
-------------------------- |
|
597 |
|
|
598 |
For memory space usage reasons, we split and compress TemplateFilter |
|
599 |
input files according to their corresponding chromosome. for example, |
|
600 |
*sample_1_TF.tab* will be split into : |
|
601 |
|
|
602 |
* sample_1_chr_1_splited_sample.tab.gz |
|
603 |
|
|
604 |
* sample_1_chr_2_splited_sample.tab.gz |
|
605 |
|
|
606 |
* ... |
|
607 |
|
|
608 |
* sample_1_chr_17_splited_sample.tab.gz |
|
609 |
|
|
610 |
To execute this script, run the following command in your R console: |
|
611 |
|
|
612 |
source("src/current/split_samples.R") |
|
613 |
|
|
614 |
|
|
615 |
The Script count_reads.R |
|
616 |
------------------------ |
|
617 |
|
|
618 |
To associate a number of observations (read) to each nucleosome we run |
|
619 |
the script *count_reads.R*. It produces the files |
|
620 |
*BY_RM_H3K14ac_wp_and_nbreads.tab*, |
|
621 |
*BY_RM_H3K14ac_unr_and_nbreads.tab* |
|
622 |
*BY_RM_Mnase_Seq_wp_and_nbreads.tab* and |
|
623 |
*BY_RM_Mnase_Seq_unr_and_nbreads.tab* for H3K14ac common well- |
|
624 |
positioned nucleosomes, H3K14ac UNRs, Mnase common well-positioned |
|
625 |
nucleosomes and Mnase UNRs respectively. |
|
626 |
|
|
627 |
For example, the file *BY_RM_H3K14ac_unr_and_nbreads.tab* contains |
|
628 |
counted reads for well-positioned nucleosomes with the experimental |
|
629 |
condition ChIP H3K14ac. It is composed of the following columns: |
|
630 |
|
|
631 |
* chr_BY, the number of the chromosome for BY |
|
632 |
|
|
633 |
* lower_bound_BY, the lower bound of the nucleosome for BY |
|
634 |
|
|
635 |
* upper_bound_BY, the upper bound of the nucleosome for BY |
|
636 |
|
|
637 |
* index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
|
638 |
|
|
639 |
* chr_RM, the number of the chromosome for RM |
|
640 |
|
|
641 |
* lower_bound_RM, the lower bound of the nucleosome for RM |
|
642 |
|
|
643 |
* upper_bound_RM, the upper bound of the nucleosome for RM |
|
644 |
|
|
645 |
* index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
|
646 |
|
|
647 |
* cur_index, index of the CUR |
|
648 |
|
|
649 |
* BY_H3K14ac_36, the number of reads for the current nucleosome |
|
650 |
for the sample 36 |
|
651 |
|
|
652 |
* BY_H3K14ac_37, #reads for sample 37 |
|
653 |
|
|
654 |
* BY_H3K14ac_53, #reads for sample 53 |
|
655 |
|
|
656 |
* RM_H3K14ac_38, #reads for sample 38 |
|
657 |
|
|
658 |
* RM_H3K14ac_39, #reads for sample 39 |
|
659 |
|
|
660 |
To execute this script, run the following command in your R console: |
|
661 |
|
|
662 |
source("src/current/count_reads.R") |
|
663 |
|
|
664 |
|
|
665 |
The Script get_size_factors.R |
|
666 |
----------------------------- |
|
667 |
|
|
668 |
This script uses the DESeq function *estimateSizeFactors* to compute |
|
669 |
the size factor of each sample. It corresponds to normalisation of |
|
670 |
read counts from sample to sample, as determined by DESeq. When a |
|
671 |
sample has n reads for a nucleosome or a UNR, the normalised count is |
|
672 |
n/f where f is the factor contained in this file. The script dumps |
|
673 |
computed size factors into the file *size_factors.tab*. This file has |
|
674 |
the form: |
|
675 |
|
|
676 |
+-----------+---------+---------+---------+ |
|
677 |
| sample_id | wp | unr | wpunr | |
|
678 |
+===========+=========+=========+=========+ |
|
679 |
| 1 | 0.87396 | 0.88097 | 0.87584 | |
|
680 |
+-----------+---------+---------+---------+ |
|
681 |
| 2 | 1.07890 | 1.07440 | 1.07760 | |
|
682 |
+-----------+---------+---------+---------+ |
|
683 |
| 3 | 1.06400 | 1.05890 | 1.06250 | |
|
684 |
+-----------+---------+---------+---------+ |
|
685 |
| 4 | 0.85782 | 0.87948 | 0.86305 | |
|
686 |
+-----------+---------+---------+---------+ |
|
687 |
| 5 | 0.97577 | 0.96590 | 0.97307 | |
|
688 |
+-----------+---------+---------+---------+ |
|
689 |
| 6 | 1.19630 | 1.18120 | 1.19190 | |
|
690 |
+-----------+---------+---------+---------+ |
|
691 |
| 36 | 0.93318 | 0.92762 | 0.93166 | |
|
692 |
+-----------+---------+---------+---------+ |
|
693 |
| 37 | 0.48315 | 0.48453 | 0.48350 | |
|
694 |
+-----------+---------+---------+---------+ |
|
695 |
| 38 | 1.11240 | 1.11210 | 1.11230 | |
|
696 |
+-----------+---------+---------+---------+ |
|
697 |
| 39 | 0.89897 | 0.89917 | 0.89903 | |
|
698 |
+-----------+---------+---------+---------+ |
|
699 |
| 53 | 2.22650 | 2.22700 | 2.22660 | |
|
700 |
+-----------+---------+---------+---------+ |
|
701 |
|
|
702 |
sample_id are given in file samples.csv |
|
703 |
|
|
704 |
If you don't know which column to use, we recommend using wpunr. |
|
705 |
|
|
706 |
If you want the very detailed factors produced by DESeq, here are the |
|
707 |
information: |
|
708 |
|
|
709 |
* unr: factor computed from data of UNR regions. These regions |
|
710 |
are defined for every pairs of aligned genomes (e.g. BY_RM) |
|
711 |
|
|
712 |
* wp: same, but for well-positioned nucleosomes. |
|
713 |
|
|
714 |
* wpunr: both types of regions. |
|
715 |
|
|
716 |
To execute this script, run the following command in your R console: |
|
717 |
|
|
718 |
source("src/current/get_size_factors.R") |
|
719 |
|
|
720 |
|
|
721 |
The Script launch_deseq.R |
|
722 |
------------------------- |
|
723 |
|
|
724 |
Finally, the script *launch_deseq.R* perform statistical analysis on |
|
725 |
each nucleosome using *DESeq*. It produces files: |
|
726 |
|
|
727 |
* results/current/BY_RM_H3K14ac_wp_snep.tab |
|
728 |
|
|
729 |
* results/current/BY_RM_H3K14ac_unr_snep.tab |
|
730 |
|
|
731 |
* results/current/BY_RM_H3K14ac_wpunr_snep.tab |
|
732 |
|
|
733 |
* results/current/BY_RM_H3K14ac_wp_mnase.tab |
|
734 |
|
|
735 |
* results/current/BY_RM_H3K14ac_unr_mnase.tab |
|
736 |
|
|
737 |
* results/current/BY_RM_H3K14ac_wpunr_mnase.tab |
|
738 |
|
|
739 |
These files are organised with the following columns (see file |
|
740 |
*BY_RM_H3K14ac_wp_snep.tab* for an example): |
|
741 |
|
|
742 |
* chr_BY, the number of the chromosome for BY |
|
743 |
|
|
744 |
* lower_bound_BY, the lower bound of the nucleosome for BY |
|
745 |
|
|
746 |
* upper_bound_BY, the upper bound of the nucleosome for BY |
|
747 |
|
|
748 |
* index_nuc_BY, the index of the BY nucleosome in the CUR for BY |
|
749 |
|
|
750 |
* chr_RM, the number of the chromosome for RM |
|
751 |
|
|
752 |
* lower_bound_RM, the lower bound of the nucleosome for RM |
|
753 |
|
|
754 |
* upper_bound_RM, the upper bound of the nucleosome for RM |
|
755 |
|
|
756 |
* index_nuc_RM,the index of the RM nucleosome in the CUR for RM |
|
757 |
|
|
758 |
* cur_index, index of the CUR |
|
759 |
|
|
760 |
* form |
|
761 |
|
|
762 |
* BY_Mnase_Seq_1, the number of reads for the current nucleosome |
|
763 |
for the sample 1 |
|
764 |
|
|
765 |
Next columns concern indicators for each sample: |
|
766 |
|
|
767 |
* BY_Mnase_Seq_2, #reads for sample 2 |
|
768 |
|
|
769 |
* BY_Mnase_Seq_3, #reads for sample 3 |
|
770 |
|
|
771 |
* RM_Mnase_Seq_4, #reads for sample 4 |
|
772 |
|
|
773 |
* RM_Mnase_Seq_5, #reads for sample 5 |
|
774 |
|
|
775 |
* RM_Mnase_Seq_6, #reads for sample 6 |
|
776 |
|
|
777 |
* BY_H3K14ac_36, #reads for sample 36 |
|
778 |
|
|
779 |
* BY_H3K14ac_37, #reads for sample 37 |
|
780 |
|
|
781 |
* BY_H3K14ac_53, #reads for sample 53 |
|
782 |
|
|
783 |
* RM_H3K14ac_38, #reads for sample 38 |
|
784 |
|
|
785 |
* RM_H3K14ac_39, #reads for sample 39 |
|
786 |
|
|
787 |
The 5 last columns concern DESeq analysis: |
|
788 |
|
|
789 |
* manip[a_manip] strain[a_strain] |
|
790 |
manip[a_strain]:strain[a_strain], the manip (marker) effect, the |
|
791 |
strain effect and the snep effect. These are the coefficients of |
|
792 |
the fitted generalized linear model. |
|
793 |
|
|
794 |
* pvalsGLM, the pvalue resulting of the comparison of the GLM |
|
795 |
model considering or not the interaction term marker:strain. This |
|
796 |
is the statsitcial significance of the interaction term and |
|
797 |
therefore the statistical significance of the SNEP. |
|
798 |
|
|
799 |
* snep_index, a boolean set to TRUE if the pvalueGLM value is |
|
800 |
under the threshold computed with FDR function with a rate set to |
|
801 |
0.0001. |
|
802 |
|
|
803 |
To execute this script, run the following command |
|
804 |
|
|
805 |
To execute this script, run the following command in your R console: |
|
806 |
|
|
807 |
source("src/current/launch_deseq.R") |
|
808 |
|
|
809 |
|
|
810 |
Results: Number of SNEPs |
|
811 |
======================== |
|
812 |
|
|
813 |
Here are the number of computed SNEPs for each forms. |
|
814 |
|
|
815 |
+-------+---------+-------+---------+ |
|
816 |
| form | strains | #nucs | H3K14ac | |
|
817 |
+=======+=========+=======+=========+ |
|
818 |
| wp | BY-RM | 30464 | 3549 | |
|
819 |
+-------+---------+-------+---------+ |
|
820 |
| unr | BY-RM | 9497 | 1559 | |
|
821 |
+-------+---------+-------+---------+ |
|
822 |
| wpunr | BY-RM | 39961 | 5240 | |
|
823 |
+-------+---------+-------+---------+ |
|
824 |
|
|
825 |
|
|
826 |
APPENDICE: Generate .c2c Files |
|
827 |
============================== |
|
828 |
|
|
829 |
$$$ TODO make it works properly. working directory. |
|
830 |
|
|
831 |
The *.c2c* files is a simple table that describes how the genome |
|
832 |
sequence can be aligned. We generate it using NucleoMiner 1.0. |
|
833 |
|
|
834 |
To install NucleoMiner 1.0 on your UNIX/LINUX computer you need first |
|
835 |
to install the Genetic Data analysis Library (GDL), which is a dynamic |
|
836 |
library of useful C functions derived from the GNU Scientific Library. |
|
837 |
|
|
838 |
|
|
839 |
Installing the GDL library |
|
840 |
-------------------------- |
|
841 |
|
|
842 |
Get the gdl-1.0.tar.gz archive on your computer (in the directory deps |
|
843 |
of your working directory). Copy it in a dedicated directory. Go into |
|
844 |
this directory using the cd command, and then unfold the archive by |
|
845 |
typing: |
|
846 |
|
|
847 |
tar -xvzf gdl-1.0.tar.gz |
|
848 |
|
|
849 |
This creates a directory called gdl-1.0. You now need to go into this |
|
850 |
directory and compile the library, by typing: |
|
851 |
|
|
852 |
mkdir tmp_c2c_workdir |
|
853 |
cd tmp_c2c_workdir |
|
854 |
cp ../deps/gdl-1.0.tar.gz . |
|
855 |
tar -xvzf gdl-1.0.tar.gz |
|
856 |
cd gdl-1.0 |
|
857 |
./configure |
|
858 |
make |
|
859 |
|
|
860 |
cd .. |
|
861 |
|
|
862 |
Now you need to install the library on your system. This needs root |
|
863 |
priviledges: |
|
864 |
|
|
865 |
sudo make install |
|
866 |
|
|
867 |
|
|
868 |
Installing NucleoMiner 1.0 |
|
869 |
-------------------------- |
|
870 |
|
|
871 |
Get the nucleominer-1.0.tar.gz archive on your computer. Copy it in a |
|
872 |
dedicated directory. Go into this directory using the cd command, and |
|
873 |
then unfold the archive by typing: |
|
874 |
|
|
875 |
This creates a directory called nucleominer-1.0. You now need to go |
|
876 |
into this directory and compile the library, by typing: |
|
877 |
|
|
878 |
cp ../deps/nucleominer-1.0.tar.gz . |
|
879 |
tar -xvzf nucleominer-1.0.tar.gz |
|
880 |
cd nucleominer-1.0 |
|
881 |
ln -s ../gdl-1.0/gdl |
|
882 |
./configure |
|
883 |
make |
|
884 |
|
|
885 |
You can then use the binaries dircetly from this folder (best then is |
|
886 |
to add the path to this folder in your PATH environment variable). If |
|
887 |
you want to install nucleominer at the system's level (useful if |
|
888 |
mutiple users will need it) then type, with root priviledges: |
|
889 |
|
|
890 |
sudo make install |
|
891 |
|
|
892 |
|
|
893 |
Generate .c2c Files |
|
894 |
------------------- |
|
895 |
|
|
896 |
To generate .c2c files you need to type the following command in a |
|
897 |
terminal: |
|
898 |
|
|
899 |
mkdir dir_4_c2c |
|
900 |
NMgxcomp ../data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta\ |
|
901 |
../data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta\ |
|
902 |
dir_4_c2c/BY_RM 2>dir_4_c2c/BY_RM.log |
|
903 |
|
|
904 |
After execution, the directory *dir_4_c2c* will hold the .c2c files. |
Formats disponibles : Unified diff