Tutorial
********

This tutorial describes the steps needed to perform a quantitative
analysis of the nucleosomal epigenome. We assume that files are
organised according to a given hierarchy and that all command lines
are launched from the project's root.

This tutorial is divided into two main parts. The first one consists of
the Python script *wf.py*, which aligns and converts Illumina reads.
The second one is the R script *main.r*, which extracts information
(nucleosome positions and indicators) from the dataset.


Dataset and Configuration File
==============================

We want to compare the nucleosomes of 3 yeast strains:

* BY

* RM

* YJM

For each strain we perform MNase-Seq and ChIP-Seq using the 5
following markers:

* H3K4me1

* H3K4me3

* H3K9ac

* H3K14ac

* H4K12ac

To simplify the experimental design, we consider MNase as a
marker. For each pair *(strain, marker)* we perform 3 replicates,
so theoretically we should have *3 * (1 + 5) * 3 = 54* samples. In
practice we only obtained 2 replicates for *(YJM, H3K4me1)*. Each of
the 53 samples is identified by a unique identifier. The file
*CSV_SAMPLE_FILE* sums up this information.

configurator.CSV_SAMPLE_FILE = None

   Path to the CSV file that contains sample information.

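As a quick check of the sample count above, the design can be enumerated in a few lines of Python. This is only an illustrative sketch: the replicate numbering and the choice of which *(YJM, H3K4me1)* replicate is missing are assumptions, not values read from *CSV_SAMPLE_FILE*.

```python
from itertools import product

strains = ["BY", "RM", "YJM"]
# MNase is treated as a marker, alongside the 5 ChIP-Seq markers.
markers = ["Mnase", "H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
replicates = [1, 2, 3]

# Theoretical design: every (strain, marker) pair with 3 replicates.
design = list(product(strains, markers, replicates))

# Assumption for illustration only: the third (YJM, H3K4me1) replicate
# is the one that could not be obtained.
missing = {("YJM", "H3K4me1", 3)}
obtained = [sample for sample in design if sample not in missing]

print(len(design), len(obtained))  # prints "54 53"
```
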
We use a convention to link each sample to its Illumina fastq outputs.
The Illumina output files of the sample *ID* are stored in the
directory *ILLUMINA_OUTPUTFILE_PREFIX* + *ID*. For example, the
outputs of sample 41 are stored in the directory
*data/2012-09-05/FASTQ/Sample_Yvert_Bq41/*.

configurator.ILLUMINA_OUTPUTFILE_PREFIX = None

   Prefix for Illumina fastq output files.

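The convention can be sketched as a simple string concatenation. The prefix value below is inferred from the example directory of sample 41, and the helper name *sample_dir* is hypothetical; the real value is whatever *ILLUMINA_OUTPUTFILE_PREFIX* holds in the configuration file.

```python
# Prefix inferred from the example directory of sample 41 (assumption).
ILLUMINA_OUTPUTFILE_PREFIX = "data/2012-09-05/FASTQ/Sample_Yvert_Bq"

def sample_dir(sample_id, prefix=ILLUMINA_OUTPUTFILE_PREFIX):
    """Return the directory holding the fastq files of a sample (hypothetical helper)."""
    return "%s%s/" % (prefix, sample_id)

print(sample_dir(41))  # data/2012-09-05/FASTQ/Sample_Yvert_Bq41/
```
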
For BY (resp. RM and YJM) we use the following reference genome:
*saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta* (resp.
*saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta* and
*saccharomyces_cerevisiae_YJM_789_screencontig.fasta*). The index
*FASTA_REFERENCE_GENOME_FILES* stores this information.

configurator.FASTA_REFERENCE_GENOME_FILES = None

   Dictionary where each fasta reference genome is indexed by the
   reference strain to which it corresponds.

Each chromosome/contig is identified in the fasta file by an obscure
identifier. For example, BY chromosome I is identified by
*gi|144228165|ref|NC_001133.7|*, whereas TemplateFilter expects an
integer. So we translate it. The index *FASTA_INDEXES* stores this
translation.

configurator.FASTA_INDEXES = None

   Dictionary indexed by strain whose values are dictionaries where
   keys are chromosome references from the fasta file and values are
   their counterparts for TemplateFilter.

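The shape of such a translation index can be sketched as a nested dictionary. Only the BY chromosome I identifier comes from the text above; mapping it to the integer 1 and the helper name *translate_chr* are assumptions made for illustration.

```python
# Nested dictionary: strain -> {identifier in the fasta file -> integer}.
# Only the BY chromosome I identifier comes from the tutorial;
# mapping it to 1 is an assumption.
FASTA_INDEXES = {
    "BY": {
        "gi|144228165|ref|NC_001133.7|": 1,
    },
}

def translate_chr(strain, fasta_id, indexes=FASTA_INDEXES):
    """Return the integer used for this chromosome, or None if it is unknown."""
    return indexes.get(strain, {}).get(fasta_id)

print(translate_chr("BY", "gi|144228165|ref|NC_001133.7|"))
```

Returning *None* for an unknown identifier lets a caller skip contigs that have no translation instead of failing on them.
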
From a pragmatic point of view, we discard some parts of the genome
(repeated sequences, etc.). The list of blacklisted areas is
explicitly detailed in *AREA_BLACK_LIST*.

configurator.AREA_BLACK_LIST = None

   Dictionary where keys are strains and values are lists of
   blacklisted genome regions.

For the BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we
use the previously computed .c2c file
*data/2012-03_primarydata/BY_RM_gxcomp.c2c* (resp.
*BY_YJM_GComp_All.c2c* and *RM_YJM_gxcomp.c2c*). For more information
about .c2c files, please read section 5 of the manual of
*NucleoMiner*, the previous version of *NucleoMiner2*
(http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).

configurator.C2C_FILES = None

   Dictionary where each strain combination indexes its genome
   alignment.

*nucleominer* works in specific directories, which are described in
*INDEX_DIR*, *ALIGN_DIR* and *LOG_DIR*.

Finally, *nucleominer* uses external resources. The paths to these
resources are described in *BOWTIE_BUILD_BIN*, *BOWTIE2_BIN*,
*SAMTOOLS_BIN*, *BEDTOOLS_BIN*, *TF_BIN* and *TF_TEMPLATES_FILE*.

All paths, prefixes and indexes can be changed in the
*src/current/nucleominer_config.json* file.

wf.json_conf_file = 'src/nucleo_miner/nucleo_miner_config.json'

   Path to the json configuration file.


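To give an idea of how such a json configuration file can be consumed, here is a minimal sketch. The key names are those listed above; the directory values are hypothetical placeholders, and the file is written to a temporary location only so that the example is self-contained.

```python
import json
import os
import tempfile

# Hypothetical excerpt of a configuration file; real paths will differ.
example_conf = {
    "INDEX_DIR": "data/index",
    "ALIGN_DIR": "data/align",
    "LOG_DIR": "data/log",
    "READ_LENGTH": 50,
}

# Write the excerpt to a temporary file so that the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    json.dump(example_conf, handle)
    json_conf_file = handle.name

# Load the configuration the way a script could do it at start-up.
with open(json_conf_file) as handle:
    config = json.load(handle)
os.remove(json_conf_file)

print(config["ALIGN_DIR"], config["READ_LENGTH"])
```
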
Preprocessing Illumina Fastq Reads for Each Sample
==================================================

This preprocessing consists of 4 main steps embedded in the *wf.py*
script and described below. As a preamble, this script computes
*samples*, *samples_mnase* and *strains*, which are used throughout
the 4 steps.

wf.samples = []

   List of samples, where a sample is identified by an id (key *id*)
   and a strain name (key *strain*).

wf.samples_mnase = []

   List of MNase samples.

wf.strains = []

   List of reference strains.


Creating Bowtie Index from each Reference Genome
------------------------------------------------

For each strain, we need to create a bowtie index. The bowtie index of
a strain is a compressed, searchable representation of the reference
genome for this strain (bowtie builds an FM-index based on the
Burrows-Wheeler transform). It is used by bowtie to align reads. This
step is performed by the following part of the *wf.py* script:

     for strain in strains:
       per_strain_stats[strain] = create_bowtie_index(strain,
         config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"],
         config["BOWTIE_BUILD_BIN"])

The following table sums up the file sizes involved and the process
durations for this step.

+--------+------------------------+------------------------+------------------+
| strain | fasta genome file size | bowtie index file size | process duration |
+========+========================+========================+==================+
| BY     | 12 MB                  | 25 MB                  | 11 s.            |
+--------+------------------------+------------------------+------------------+
| RM     | 12 MB                  | 24 MB                  | 9 s.             |
+--------+------------------------+------------------------+------------------+
| YJM    | 12 MB                  | 25 MB                  | 11 s.            |
+--------+------------------------+------------------------+------------------+


Aligning Reads to Reference Genome
----------------------------------

Next, we launch bowtie to align reads to the reference genome. It
produces a *.sam* file that we convert into a *.bed* file. Binaries
for *bowtie*, *samtools* and *bedtools* are wrapped using the Python
*subprocess* module. This step is performed by the following part of
the *wf.py* script:

     for sample in samples:
       per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample,
         config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"],
         config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"],
         config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])


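Wrapping an external binary with the *subprocess* module can look like the sketch below. It is not the actual *align_reads* code: the helper name *run_and_time* is hypothetical, and the Python interpreter is used as a stand-in command so that the example runs without bowtie, samtools or bedtools installed.

```python
import subprocess
import sys
import time

def run_and_time(cmd):
    """Run an external command and return its exit code, output and duration."""
    start = time.time()
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    return {
        "returncode": proc.returncode,
        "output": proc.stdout.decode(),
        "duration": time.time() - start,
    }

# Stand-in command: a real call would pass the path of a binary
# (for example the value of config["BOWTIE2_BIN"]) and its arguments.
stats = run_and_time([sys.executable, "-c", "print('aligned')"])
print(stats["returncode"], stats["output"].strip())
```

Passing the command as a list avoids shell quoting issues, and keeping the duration next to the exit code gives the kind of per-sample statistics collected in the loops above.
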
Convert Aligned Reads for TemplateFilter
----------------------------------------

TemplateFilter uses a particular input format for reads, so we convert
the *.bed* file. TemplateFilter expects reads as follows: *chr coord
strand #reads* where:

* chr is the number of the chromosome;

* coord is the coordinate of the read;

* strand is *F* for forward and *R* for reverse;

* #reads is the number of reads for this position.

Fields are *tab*-separated.

**WARNING** for the reverse strand, bowtie returns the position of the
leftmost nucleotide, whereas TemplateFilter expects the rightmost one.
So this step takes it into account and adds the read length (in our
case 50) to the coordinates of reverse reads.

This step is performed by the following part of the *wf.py* script:

     for sample in samples:
       per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample,
         config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"],
         config["READ_LENGTH"], config["MAPQ_THRES"])

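The conversion described above can be sketched as follows. This is not the *split_fr_4_TF* implementation: the function name *bed_to_tf* is hypothetical, the input is assumed to be 6-column BED lines (chrom, start, end, name, score, strand) with the mapping quality in the score column, the forward coordinate is taken as the BED start without any further offset, and blacklist filtering is left out.

```python
from collections import Counter

def bed_to_tf(bed_lines, fasta_index, read_length=50, mapq_thres=0):
    """Convert 6-column BED lines into 'chr coord strand #reads' lines."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start = fields[0], int(fields[1])
        score, strand = int(fields[4]), fields[5]
        # Skip unknown contigs and reads below the quality threshold.
        if chrom not in fasta_index or score < mapq_thres:
            continue
        if strand == "+":
            counts[(fasta_index[chrom], start, "F")] += 1
        else:
            # Reverse strand: shift from the leftmost to the rightmost position.
            counts[(fasta_index[chrom], start + read_length, "R")] += 1
    return ["%d\t%d\t%s\t%d" % (c, pos, s, n)
            for (c, pos, s), n in sorted(counts.items())]

chr1 = "gi|144228165|ref|NC_001133.7|"
bed = [
    chr1 + "\t100\t150\tread1\t42\t+",
    chr1 + "\t100\t150\tread2\t42\t+",
    chr1 + "\t200\t250\tread3\t42\t-",
    chr1 + "\t300\t350\tread4\t5\t+",   # dropped: quality below threshold
]
tf_lines = bed_to_tf(bed, {chr1: 1}, read_length=50, mapq_thres=30)
print("\n".join(tf_lines))
```

With these four toy reads, the two forward reads at position 100 are merged into one line with a count of 2, and the reverse read is reported at 200 + 50 = 250.
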
The following table sums up the number of reads, the file sizes
involved and the process durations for the last two steps. In our
case, the alignment process was multithreaded over 3 cores.

+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| id | Illumina reads | aligned & filtered reads  | ratio  | *.bed* file size | TF input file size | process duration |
+====+================+===========================+========+==================+====================+==================+
| 1  | 16436138       | 10199695                  | 62.06% | 1064 MB          | 60  MB             | 383   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 2  | 16911132       | 12512727                  | 73.99% | 1298 MB          | 64  MB             | 437   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 3  | 15946902       | 12340426                  | 77.38% | 1280 MB          | 65  MB             | 423   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 4  | 13765584       | 10381903                  | 75.42% | 931  MB          | 59  MB             | 352   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 5  | 15168268       | 11502855                  | 75.83% | 1031 MB          | 64  MB             | 386   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 6  | 18850820       | 14024905                  | 74.40% | 1254 MB          | 69  MB             | 482   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 7  | 15591124       | 12126623                  | 77.78% | 1163 MB          | 72  MB             | 405   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 8  | 15659905       | 12475664                  | 79.67% | 1194 MB          | 71  MB             | 416   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 9  | 14668641       | 10960565                  | 74.72% | 1052 MB          | 70  MB             | 375   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 10 | 14339179       | 10454451                  | 72.91% | 1049 MB          | 51  MB             | 363   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 11 | 18019895       | 13688774                  | 75.96% | 1378 MB          | 59  MB             | 474   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 12 | 13746796       | 10810022                  | 78.64% | 1084 MB          | 54  MB             | 360   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 13 | 15205065       | 11766016                  | 77.38% | 990  MB          | 54  MB             | 381   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 14 | 17803097       | 13838883                  | 77.73% | 1154 MB          | 60  MB             | 452   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 15 | 15434564       | 12307878                  | 79.74% | 1032 MB          | 57  MB             | 394   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 16 | 16802587       | 12725665                  | 75.74% | 1221 MB          | 48  MB             | 438   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 17 | 16058417       | 12513734                  | 77.93% | 1192 MB          | 63  MB             | 422   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 18 | 16154482       | 13204331                  | 81.74% | 1277 MB          | 52  MB             | 430   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 19 | 21013924       | 17102120                  | 81.38% | 1646 MB          | 59  MB             | 555   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 20 | 17213114       | 14433357                  | 83.85% | 1389 MB          | 53  MB             | 459   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 21 | 17360907       | 14733001                  | 84.86% | 1203 MB          | 55  MB             | 450   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 22 | 18136816       | 15389581                  | 84.85% | 1257 MB          | 53  MB             | 469   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 23 | 14763678       | 12173025                  | 82.45% | 1140 MB          | 56  MB             | 393   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 24 | 15541709       | 12890345                  | 82.94% | 1057 MB          | 48  MB             | 398   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 25 | 16433215       | 13094314                  | 79.68% | 1241 MB          | 57  MB             | 433   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 26 | 17370850       | 14264136                  | 82.12% | 1347 MB          | 51  MB             | 466   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 27 | 14613512       | 8654495                   | 59.22% | 887  MB          | 56  MB             | 339   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 28 | 15248545       | 11367589                  | 74.55% | 1166 MB          | 67  MB             | 405   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 29 | 14316809       | 10767926                  | 75.21% | 1103 MB          | 63  MB             | 379   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 30 | 15178058       | 12265794                  | 80.81% | 1030 MB          | 66  MB             | 390   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 31 | 14968579       | 11876186                  | 79.34% | 1009 MB          | 63  MB             | 387   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 32 | 16912705       | 13550508                  | 80.12% | 1143 MB          | 70  MB             | 442   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 33 | 16782154       | 12755111                  | 76.00% | 1227 MB          | 65  MB             | 438   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 34 | 16741443       | 13168071                  | 78.66% | 1260 MB          | 71  MB             | 442   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 35 | 13096171       | 10367041                  | 79.16% | 992  MB          | 62  MB             | 350   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 36 | 17715118       | 14092985                  | 79.55% | 1404 MB          | 68  MB             | 483   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 37 | 17288466       | 7402082                   | 42.82% | 741  MB          | 48  MB             | 339   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 38 | 16116394       | 13178457                  | 81.77% | 1101 MB          | 63  MB             | 420   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 39 | 14241106       | 10537228                  | 73.99% | 880  MB          | 57  MB             | 348   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 40 | 13784738       | 10598464                  | 76.89% | 1005 MB          | 64  MB             | 358   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 41 | 12438007       | 9620975                   | 77.35% | 911  MB          | 60  MB             | 326   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 42 | 13853959       | 11031238                  | 79.63% | 1045 MB          | 64  MB             | 365   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 43 | 12036162       | 6654780                   | 55.29% | 684  MB          | 46  MB             | 268   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 44 | 13873129       | 10251074                  | 73.89% | 1048 MB          | 61  MB             | 365   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 45 | 19817751       | 14904502                  | 75.21% | 1520 MB          | 72  MB             | 528   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 46 | 13368959       | 10818619                  | 80.92% | 912  MB          | 63  MB             | 350   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 47 | 7566467        | 6139001                   | 81.13% | 520  MB          | 44  MB             | 201   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 48 | 32586928       | 21191363                  | 65.03% | 1816 MB          | 82  MB             | 766   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 49 | 30733184       | 18791373                  | 61.14% | 1801 MB          | 89  MB             | 721   s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 50 | 41287616       | 30383875                  | 73.59% | 2911 MB          | 112 MB             | 1065  s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 51 | 40439965       | 31177914                  | 77.10% | 2981 MB          | 117 MB             | 1070  s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 53 | 40876476       | 33780065                  | 82.64% | 3316 MB          | 103 MB             | 1165  s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+
| 55 | 52424414       | 47117107                  | 89.88% | 3811 MB          | 119 MB             | 1477  s.         |
+----+----------------+---------------------------+--------+------------------+--------------------+------------------+

For experimental reasons (handling efficiency, e.g. PCR), we remove
samples 33, 45, 48 and 55.


Run TemplateFilter on MNase Samples
-----------------------------------

Finally, for each MNase sample we perform the TemplateFilter analysis.

**WARNING** TemplateFilter returns a list of nucleosomes. Each
nucleosome is defined by its center and its width. An odd width leads
us to consider non-integer lower and upper bounds.

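The effect of an odd width can be illustrated with a short sketch. It assumes that the bounds are taken symmetrically around the center (center minus or plus half the width); the function name *nucleosome_bounds* and the numeric values are hypothetical.

```python
def nucleosome_bounds(center, width):
    """Return (lower, upper) bounds, assumed symmetric around the center."""
    half = width / 2.0
    return (center - half, center + half)

print(nucleosome_bounds(1000, 147))  # odd width: (926.5, 1073.5)
print(nucleosome_bounds(1000, 146))  # even width: (927.0, 1073.0)
```
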
**WARNING** TemplateFilter is not design to deal with replicate. So we
342 935a568c Florent Chuffart
choose to keep a maximum of nucleosome and filter in a second time
343 935a568c Florent Chuffart
using the benefit of replicate. To do that we set a low correlation
344 935a568c Florent Chuffart
threshold parameter (*0.5*) and a particularly high value of
345 935a568c Florent Chuffart
overlaping (*300%*).
346 935a568c Florent Chuffart
347 935a568c Florent Chuffart
This step is performed by the followinw part of the *wf.py* script:
348 935a568c Florent Chuffart
349 935a568c Florent Chuffart
     for sample in samples_mnase:
350 935a568c Florent Chuffart
       per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample, 
351 935a568c Florent Chuffart
         config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"], 
352 935a568c Florent Chuffart
         config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"], 
353 935a568c Florent Chuffart
         config["TF_MAXW"], config["TF_OL"])  

+----+--------+------------+---------------+------------------+
| id | strain | found nucs | nuc file size | process duration |
+====+========+============+===============+==================+
| 1  | BY     | 96214      | 68 Mo         | 1022 s.          |
+----+--------+------------+---------------+------------------+
| 2  | BY     | 91694      | 65 Mo         | 1038 s.          |
+----+--------+------------+---------------+------------------+
| 3  | BY     | 91205      | 65 Mo         | 1036 s.          |
+----+--------+------------+---------------+------------------+
| 4  | RM     | 88076      | 62 Mo         | 984 s.           |
+----+--------+------------+---------------+------------------+
| 5  | RM     | 90141      | 64 Mo         | 967 s.           |
+----+--------+------------+---------------+------------------+
| 6  | RM     | 87517      | 62 Mo         | 980 s.           |
+----+--------+------------+---------------+------------------+
| 7  | YJM    | 88945      | 64 Mo         | 566 s.           |
+----+--------+------------+---------------+------------------+
| 8  | YJM    | 88689      | 64 Mo         | 570 s.           |
+----+--------+------------+---------------+------------------+
| 9  | YJM    | 88128      | 63 Mo         | 565 s.           |
+----+--------+------------+---------------+------------------+


Inferring Nucleosome Position and Extracting Read Counts
========================================================

This preprocessing step consists of the 4 main steps embedded in the
*wf.py* script and described below. As a preamble, this script computes
*samples*, *samples_mnase* and *strains*, which are used throughout the
4 steps.

The second part of the tutorial uses *R*
(http://www.r-project.org). It consists of the following main
steps:

   * compute_rois.R

   * extract_maps.R

   * compare_common_wp.R

   * split_samples.R

   * count_reads.R

   * get_size_factors.R

   * launch_deseq.R
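
The steps listed above are each launched with `R CMD BATCH` from the
project's root, as shown in the subsections that follow. A minimal
sketch of a driver that chains them in order could look like this. The
helper names `build_commands` and `run_pipeline` are hypothetical and
are not part of *wf.py*:

```python
import subprocess

# R scripts in the order given by the tutorial.
R_STEPS = [
    "compute_rois.R",
    "extract_maps.R",
    "compare_common_wp.R",
    "split_samples.R",
    "count_reads.R",
    "get_size_factors.R",
    "launch_deseq.R",
]


def build_commands(steps=R_STEPS, src_dir="src/current"):
    """Build one 'R CMD BATCH <script>' command per step (hypothetical helper)."""
    return [["R", "CMD", "BATCH", "%s/%s" % (src_dir, step)] for step in steps]


def run_pipeline(commands):
    """Run the commands in order and stop at the first failing step."""
    for command in commands:
        subprocess.check_call(command)


commands = build_commands()
# To actually launch the R scripts from the project's root:
# run_pipeline(commands)
```

Stopping at the first failure avoids running later steps on incomplete
outputs from an earlier one.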


Computing Common Genome Region Between Strains
----------------------------------------------

   R CMD BATCH src/current/compute_rois.R


Extracting Maps for Well Positioned and Fuzzy Nucleosomes
---------------------------------------------------------

   R CMD BATCH src/current/extract_maps.R


Compute Distance Between Well Positioned Nucleosomes
----------------------------------------------------

   R CMD BATCH src/current/compare_common_wp.R


Split and Compress Samples According to CURs
--------------------------------------------

   R CMD BATCH src/current/split_samples.R


Count Reads for Each Nucleosome
-------------------------------

   R CMD BATCH src/current/count_reads.R


Get Size Factors Using DESeq
----------------------------

   R CMD BATCH src/current/get_size_factors.R


Performing DESeq Analysis
-------------------------

   R CMD BATCH src/current/launch_deseq.R


Results
=======


Output Files Organisation
-------------------------

The previous steps produce the following 45 files. Each filename has
the form

   results/current/[combi]_[marker]_[form]_snep.tab

where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination,
marker is in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each
post-translational histone modification, and form is in {wp, fuzzy,
wpfuzzy}, considering well positioned nucleosomes, fuzzy nucleosomes
or both for SNEP computation.
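
The count of 45 files follows from 3 combinations x 5 markers x 3
forms. As a short sketch, the expected filenames can be enumerated as
follows (the function name `snep_filenames` is a hypothetical helper,
not part of the scripts):

```python
from itertools import product

COMBIS = ["BY_RM", "BY_YJM", "RM_YJM"]
MARKERS = ["H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
FORMS = ["wp", "fuzzy", "wpfuzzy"]


def snep_filenames(results_dir="results/current"):
    """List every [combi]_[marker]_[form]_snep.tab path (hypothetical helper)."""
    return [
        "%s/%s_%s_%s_snep.tab" % (results_dir, combi, marker, form)
        for combi, marker, form in product(COMBIS, MARKERS, FORMS)
    ]


filenames = snep_filenames()
```

Such a list can be used, for instance, to check that no output file is
missing once the last step has finished.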

chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM
lower_bound_RM upper_bound_RM index_nuc_RM roi_index form
BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4
RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 BY_H3K14ac_37
BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM

For each file, there is 1 line per nucleosome and each line is
composed of many columns divided into 3 main topics:
   * nuc information

   * number of reads for each sample

   * DESeq analysis results.

For example, for the file *BY_RM_H3K14ac_wp_snep.tab*, the nuc
information columns are:
   * chr_BY, the BY chr involved

   * lower_bound_BY, the lower bound of the BY nuc

   * upper_bound_BY, the upper bound of the BY nuc

   * index_nuc_BY, the index of the nuc in the entire list of BY
     nucs

   * chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM, the same
     information for the RM strain

   * roi_index, the index of the region of interest involved.

The next columns hold indicators for each sample. They are labeled
[strain]_[marker]_[sample_id] and each value is the number of
reads for the current nuc in the sample *sample_id*.
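
As an illustration of this labeling, a sample column name can be split
back into its three parts. The sketch below assumes that the strain
never contains an underscore and that the sample id is the last
underscore-separated field, which holds for labels such as
BY_Mnase_Seq_1. The function name `parse_sample_column` is hypothetical:

```python
def parse_sample_column(label):
    """Split '[strain]_[marker]_[sample_id]' into (strain, marker, sample_id).

    Assumption: the strain has no underscore and the id is the last field,
    so a marker such as 'Mnase_Seq' may itself contain an underscore.
    """
    strain, rest = label.split("_", 1)
    marker, sample_id = rest.rsplit("_", 1)
    return strain, marker, int(sample_id)


mnase_column = parse_sample_column("BY_Mnase_Seq_1")
chip_column = parse_sample_column("RM_H3K14ac_38")
```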

The 5 final columns concern the DESeq analysis:
   * manip[a_manip], strain[a_strain] and
     manip[a_strain]:strain[a_strain], respectively the manip (marker)
     effect, the strain effect and the snep effect.

   * pvalsGLM, the p-value resulting from the comparison of the GLM
     models with and without the interaction term *marker:strain*

   * snep_index, a boolean set to TRUE if the *pvalsGLM* value is
     under the threshold computed with the FDR function with a rate
     set to 0.01%.
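
The tutorial does not say which FDR function computes the threshold. As
a sketch only, here is how such a threshold could be derived with the
Benjamini-Hochberg procedure, which is an assumption and not necessarily
the method used by the R scripts. The names `bh_threshold` and
`snep_flags` are hypothetical:

```python
def bh_threshold(pvalues, rate):
    """Largest p-value accepted by Benjamini-Hochberg at the given rate.

    Assumption: the 'FDR function' behaves like Benjamini-Hochberg.
    Returns 0.0 when no p-value passes.
    """
    n_tests = len(pvalues)
    threshold = 0.0
    for rank, pvalue in enumerate(sorted(pvalues), start=1):
        if pvalue <= rate * rank / n_tests:
            threshold = pvalue
    return threshold


def snep_flags(pvalues, rate):
    """Boolean flag per nucleosome, analogous to the snep_index column."""
    threshold = bh_threshold(pvalues, rate)
    return [pvalue <= threshold for pvalue in pvalues]


# Toy p-values, not taken from the dataset.
flags = snep_flags([0.001, 0.04, 0.03, 0.5], rate=0.05)
```

The toy rate of 0.05 is chosen only to make the example readable; the
rate stated above for the real analysis is much stricter.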

These steps also produce a file that gives the size factor of each
sample involved, for the different strain combinations and nucleosomal
region types:

TODO: include this file...
/home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt

   results/current/size_factors.tab


Number of SNEPs
---------------

Here are the numbers of SNEPs computed for each form.

   [1] "wp"
          #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
   BY-RM  30234     520     798     83    3566      26
   BY-YJM 31298     303     619    102     103     128
   RM-YJM 29863     129     340     46    3177      18
   [1] "fuzzy"
          #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
   BY-RM  10748     294     308    101    1681      42
   BY-YJM 10669     122     176    124      93      87
   RM-YJM 11478      54     112     41    1389      20
   [1] "wpfuzzy"
          #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
   BY-RM  40982     770    1136    183    5404      73
   BY-YJM 41967     439     804    214     198     199
   RM-YJM 41341     184     468     87    4687      37
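
As a quick consistency check on the figures above, the #nucs column of
the wpfuzzy form equals the sum of the wp and fuzzy forms for every
strain combination. The sketch below verifies this and computes the
share of SNEPs among nucleosomes. The helper names are hypothetical and
the numbers are copied from the table:

```python
# #nucs values copied from the table above.
NUCS = {
    "wp": {"BY-RM": 30234, "BY-YJM": 31298, "RM-YJM": 29863},
    "fuzzy": {"BY-RM": 10748, "BY-YJM": 10669, "RM-YJM": 11478},
    "wpfuzzy": {"BY-RM": 40982, "BY-YJM": 41967, "RM-YJM": 41341},
}


def nucs_are_additive(nucs):
    """True if wp + fuzzy equals wpfuzzy for every strain combination."""
    return all(
        nucs["wp"][combi] + nucs["fuzzy"][combi] == nucs["wpfuzzy"][combi]
        for combi in nucs["wpfuzzy"]
    )


def snep_fraction(n_sneps, n_nucs):
    """Share of nucleosomes flagged as SNEPs."""
    return float(n_sneps) / n_nucs


additive = nucs_are_additive(NUCS)
# H3K14ac, BY-RM, wp form: 3566 SNEPs among 30234 nucleosomes.
h3k14ac_by_rm_wp = snep_fraction(3566, 30234)
```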

TODO:
   * Print/study intra/inter strain LODs.

   * Check the normality of samples using Shapiro–Wilk (hypothesis
     for computing LODs)