/doc/sphinx_doc/build/text/tuto.txt - NucleoMiner - Forge du Centre Blaise Pascal


       Tutorial
       ********
       This tutorial describes the steps needed to perform a quantitative
       analysis of the nucleosomal epigenome. We assume that files are
       organised according to a given hierarchy and that all command lines
       are launched from the project's root.
       This tutorial is divided into two main parts. The first one is the
       Python script *wf.py*, which aligns and converts Illumina reads. The
       second one is the R script *main.r*, which extracts information
       (nucleosome positions and indicators) from the dataset.
       Dataset and Configuration File
       ==============================
       We want to compare nucleosomes of 3 yeast strains:
       * BY
       * RM
       * YJM
       For each strain we perform Mnase-Seq and ChIP-Seq using the following
       5 markers:
       * H3K4me1
       * H3K4me3
       * H3K9ac
       * H3K14ac
       * H4K12ac
       In order to simplify the design of the experiment, we consider Mnase
       as a marker. For each couple *(strain, marker)* we perform 3
       replicates. So, theoretically, we should have
       *3 * (1 + 5) * 3 = 54* samples. In practice we only obtained 2
       replicates for *(YJM, H3K4me1)*. Each of the 53 samples is identified
       by a unique identifier. The file *CSV_SAMPLE_FILE* sums up this
       information.
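       As a quick check of the arithmetic above, the design can be
       enumerated in a few lines of Python. This is only a sketch: the
       variable names are illustrative, and which replicate of *(YJM,
       H3K4me1)* is missing is an assumption made here for the example.

```python
from itertools import product

strains = ["BY", "RM", "YJM"]
# Mnase is treated as a marker, next to the 5 histone marks.
markers = ["Mnase", "H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
replicates = [1, 2, 3]

# Full theoretical design: 3 strains * (1 + 5) markers * 3 replicates.
design = list(product(strains, markers, replicates))

# One replicate of (YJM, H3K4me1) was not obtained.
# Assumption: we arbitrarily drop replicate 3 for the illustration.
missing = ("YJM", "H3K4me1", 3)
obtained = [sample for sample in design if sample != missing]

print(len(design), len(obtained))
```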
       We use a convention to link each sample to its Illumina fastq outputs.
       The Illumina output files of the sample *ID* are stored in the
       directory *ILLUMINA_OUTPUTFILE_PREFIX* + *ID*. For example, the
       outputs of sample 41 are stored in the directory
       *data/2012-09-05/FASTQ/Sample_Yvert_Bq41/*.
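       The naming convention can be sketched as follows. The prefix value
       is inferred from the example directory of sample 41 and is therefore
       an assumption, as is the helper name *illumina_output_dir*.

```python
# Assumption: prefix inferred from the example directory of sample 41.
ILLUMINA_OUTPUTFILE_PREFIX = "data/2012-09-05/FASTQ/Sample_Yvert_Bq"


def illumina_output_dir(sample_id):
    """Return the directory holding the Illumina outputs of a sample."""
    return "{}{}/".format(ILLUMINA_OUTPUTFILE_PREFIX, sample_id)


print(illumina_output_dir(41))
```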
       For BY (resp. RM and YJM) we use the following reference genome:
       *saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta* (resp.
       *saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta* and
       *saccharomyces_cerevisiae_YJM_789_screencontig.fasta*). The index
       *FASTA_REFERENCE_GENOME_FILES* stores this information.
       Each chromosome/contig is identified in the fasta file by an obscure
       identifier. For example, BY chromosome I is identified by
       *gi|144228165|ref|NC_001133.7|*, whereas TemplateFilter expects an
       integer. So, we translate it. The index *FASTA_INDEXES* stores this
       translation.
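       A minimal sketch of such a translation is given below. The layout
       of *FASTA_INDEXES* (one dictionary per strain) and the function name
       are hypothetical. Only the BY chromosome I identifier comes from
       this tutorial, and mapping it to the integer 1 is an assumption.

```python
# Hypothetical layout: strain -> {fasta identifier: integer index}.
# Mapping BY chromosome I to 1 is an assumption made for the example.
FASTA_INDEXES = {
    "BY": {"gi|144228165|ref|NC_001133.7|": 1},
}


def translate_chromosome(strain, fasta_id):
    """Translate a fasta identifier into the integer used downstream."""
    try:
        return FASTA_INDEXES[strain][fasta_id]
    except KeyError:
        raise KeyError(
            "no integer index for %r in strain %r" % (fasta_id, strain)
        )


print(translate_chromosome("BY", "gi|144228165|ref|NC_001133.7|"))
```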
       For pragmatic reasons, we discard some parts of the genome (repeated
       sequences, etc.). The blacklisted areas are explicitly listed in
       *AREA_BLACK_LIST*.
       For the BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we
       use the previously computed .c2c file
       *data/2012-03_primarydata/BY_RM_gxcomp.c2c* (resp.
       *BY_YJM_GComp_All.c2c* and *RM_YJM_gxcomp.c2c*). For more information
       about .c2c files, please read section 5 of the manual of
       *NucleoMiner*, the old version of *NucleoMiner2*
       (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
       *nucleominer* works in specific directories, which are described in
       *INDEX_DIR*, *ALIGN_DIR* and *LOG_DIR*.
       Finally, *nucleominer* uses external resources. The paths to these
       resources are described in *BOWTIE_BUILD_BIN*, *BOWTIE2_BIN*,
       *SAMTOOLS_BIN*, *BEDTOOLS_BIN*, *TF_BIN* and *TF_TEMPLATES_FILE*.
       All paths, prefixes and indexes can be changed in the
       *src/current/nucleominer_config.json* file.
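       Before launching the pipeline, it can be useful to check that the
       configuration file defines every entry mentioned above. The sketch
       below assumes that these names are top-level keys of the JSON file,
       which this tutorial does not state explicitly. The helper names are
       illustrative.

```python
import json

# Entries named in this section; assumed to be top-level JSON keys.
REQUIRED_KEYS = [
    "CSV_SAMPLE_FILE", "ILLUMINA_OUTPUTFILE_PREFIX",
    "FASTA_REFERENCE_GENOME_FILES", "FASTA_INDEXES", "AREA_BLACK_LIST",
    "INDEX_DIR", "ALIGN_DIR", "LOG_DIR",
    "BOWTIE_BUILD_BIN", "BOWTIE2_BIN", "SAMTOOLS_BIN", "BEDTOOLS_BIN",
    "TF_BIN", "TF_TEMPLATES_FILE",
]


def missing_keys(config):
    """Return the required entries that are absent from the config."""
    return [key for key in REQUIRED_KEYS if key not in config]


def load_config(path="src/current/nucleominer_config.json"):
    """Load the JSON configuration and fail early if an entry is missing."""
    with open(path) as handle:
        config = json.load(handle)
    absent = missing_keys(config)
    if absent:
        raise KeyError("missing configuration keys: " + ", ".join(absent))
    return config
```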
       Preprocessing Illumina Fastq Reads for Each Sample
       ==================================================
       This preprocessing step consists of the 4 main steps embedded in the
       *wf.py* script and described below. As a preamble, this script
       computes *samples*, *samples_mnase* and *strains*, which are used
       throughout the 4 steps.
       Creating Bowtie Index from each Reference Genome
       ------------------------------------------------
       For each strain, we need to create a bowtie index. The bowtie index
       of a strain is a compressed index (based on the Burrows-Wheeler
       transform) of the reference genome for this strain. It is used by
       bowtie to align reads. This step is performed by the following part
       of the *wf.py* script:
       The following table sums up the file sizes (in Mo, i.e. megabytes)
       and process durations involved in this step.
       +--------+------------------------+------------------------+------------------+
       | strain | fasta genome file size | bowtie index file size | process duration |
       +========+========================+========================+==================+
       | BY     | 12 Mo                  | 25 Mo                  | 11 s.            |
       +--------+------------------------+------------------------+------------------+
       | RM     | 12 Mo                  | 24 Mo                  | 9 s.             |
       +--------+------------------------+------------------------+------------------+
       | YJM    | 12 Mo                  | 25 Mo                  | 11 s.            |
       +--------+------------------------+------------------------+------------------+
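       As a rough illustration, independent of the actual *wf.py* code,
       building an index amounts to one call to the bowtie build binary,
       which takes the reference fasta file followed by the basename of the
       index to create. The function names and the *index/BY* basename are
       hypothetical.

```python
import subprocess


def bowtie_build_command(bowtie_build_bin, fasta_file, index_basename):
    """Build the argument list: reference fasta, then index basename."""
    return [bowtie_build_bin, fasta_file, index_basename]


def build_index(bowtie_build_bin, fasta_file, index_basename):
    """Run the build; raises CalledProcessError if the tool fails."""
    subprocess.check_call(
        bowtie_build_command(bowtie_build_bin, fasta_file, index_basename)
    )


# Only prints the command; "index/BY" is a hypothetical index basename.
cmd = bowtie_build_command(
    "bowtie2-build",
    "saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta",
    "index/BY",
)
print(" ".join(cmd))
```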
       Aligning Reads to Reference Genome
       ----------------------------------
       Next, we launch bowtie to align the reads to the reference genome. It
       produces a *.sam* file that we convert into a *.bed* file. The
       binaries for *bowtie*, *samtools* and *bedtools* are wrapped using
       the Python *subprocess* module. This step is performed by the
       following part of the *wf.py* script:
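       As a rough illustration, independent of the actual *wf.py* code, the
       three tools could be chained with *subprocess* as sketched below. The
       function names and file layout are assumptions, the read filtering
       applied by the real script is not reproduced, and the sketch assumes
       single-end reads aligned with bowtie2.

```python
import subprocess


def alignment_commands(bowtie2_bin, samtools_bin, bedtools_bin,
                       index_basename, fastq, prefix, threads=3):
    """Return the three commands: align, SAM -> BAM, BAM -> BED."""
    sam, bam = prefix + ".sam", prefix + ".bam"
    return [
        # Align single-end reads and write a SAM file.
        [bowtie2_bin, "-p", str(threads), "-x", index_basename,
         "-U", fastq, "-S", sam],
        # Convert SAM to BAM (-S is only needed by old samtools versions).
        [samtools_bin, "view", "-b", "-S", "-o", bam, sam],
        # Convert BAM to BED; bedtools writes the result to stdout.
        [bedtools_bin, "bamtobed", "-i", bam],
    ]


def align(commands, bed_path):
    """Run the commands in order, redirecting the last one to bed_path."""
    subprocess.check_call(commands[0])
    subprocess.check_call(commands[1])
    with open(bed_path, "w") as bed:
        subprocess.check_call(commands[2], stdout=bed)
```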
       Convert Aligned Reads for TemplateFilter
       ----------------------------------------
       TemplateFilter uses a particular input format for reads, so we
       convert the *.bed* file. TemplateFilter expects reads as follows:
       *chr coord strand #reads* where:
       * chr is the number of the chromosome;
       * coord is the coordinate of the reads;
       * strand is *F* for forward and *R* for reverse;
       * #reads is the number of reads for this position.
       Each entry is *tab*-separated.
       **WARNING** for the reverse strand, bowtie returns the position of
       the leftmost nucleotide whereas TemplateFilter expects the rightmost
       one. So this step takes it into account and adds the length of the
       reads (in our case 50) to the coordinates of reverse reads.
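       The conversion described above can be sketched as follows. This is
       not the *wf.py* code: the function name, the *chr_index* argument
       and the example reads are hypothetical, the input is assumed to be
       the 6-column output of *bedtools bamtobed*, and the exact off-by-one
       convention between *.bed* coordinates and TemplateFilter should be
       checked against the real script.

```python
from collections import Counter

READ_LENGTH = 50  # read length given in this tutorial


def bed_to_tf(bed_lines, chr_index, read_length=READ_LENGTH):
    """Convert 6-column BED lines into 'chr coord strand #reads' entries."""
    counts = Counter()
    for line in bed_lines:
        fields = line.rstrip("\n").split("\t")
        chrom, start, strand = fields[0], int(fields[1]), fields[5]
        if strand == "+":
            counts[(chr_index[chrom], start, "F")] += 1
        else:
            # Reverse read: shift the left coordinate by the read length.
            counts[(chr_index[chrom], start + read_length, "R")] += 1
    return [
        "%d\t%d\t%s\t%d" % (chrom, coord, strand, n)
        for (chrom, coord, strand), n in sorted(counts.items())
    ]


# Hypothetical reads: two forward reads at 100, one reverse read at 200.
bed = [
    "chrI\t100\t150\tread1\t42\t+",
    "chrI\t100\t150\tread2\t42\t+",
    "chrI\t200\t250\tread3\t42\t-",
]
for entry in bed_to_tf(bed, {"chrI": 1}):
    print(entry)
```

       Counting identical *(chr, coord, strand)* triples is what makes the
       TemplateFilter input much smaller than the *.bed* file.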
       This step is performed by the following part of the *wf.py* script:
       The following table sums up the number of reads, the file sizes and
       the process durations for the last two steps. In our case, the
       alignment process was multithreaded over 3 cores.
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | id | Illumina reads | aligned & filtered reads  | ratio  | *.bed* file size | TF input file size | process duration |
       +====+================+===========================+========+==================+====================+==================+
       | 1  | 16436138       | 10199695                  | 62,06% | 1064 Mo          | 60  Mo             | 383   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 2  | 16911132       | 12512727                  | 73,99% | 1298 Mo          | 64  Mo             | 437   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 3  | 15946902       | 12340426                  | 77,38% | 1280 Mo          | 65  Mo             | 423   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 4  | 13765584       | 10381903                  | 75,42% | 931  Mo          | 59  Mo             | 352   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 5  | 15168268       | 11502855                  | 75,83% | 1031 Mo          | 64  Mo             | 386   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 6  | 18850820       | 14024905                  | 74,40% | 1254 Mo          | 69  Mo             | 482   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 7  | 15591124       | 12126623                  | 77,78% | 1163 Mo          | 72  Mo             | 405   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 8  | 15659905       | 12475664                  | 79,67% | 1194 Mo          | 71  Mo             | 416   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 9  | 14668641       | 10960565                  | 74,72% | 1052 Mo          | 70  Mo             | 375   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 10 | 14339179       | 10454451                  | 72,91% | 1049 Mo          | 51  Mo             | 363   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 11 | 18019895       | 13688774                  | 75,96% | 1378 Mo          | 59  Mo             | 474   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 12 | 13746796       | 10810022                  | 78,64% | 1084 Mo          | 54  Mo             | 360   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 13 | 15205065       | 11766016                  | 77,38% | 990  Mo          | 54  Mo             | 381   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 14 | 17803097       | 13838883                  | 77,73% | 1154 Mo          | 60  Mo             | 452   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 15 | 15434564       | 12307878                  | 79,74% | 1032 Mo          | 57  Mo             | 394   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 16 | 16802587       | 12725665                  | 75,74% | 1221 Mo          | 48  Mo             | 438   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 17 | 16058417       | 12513734                  | 77,93% | 1192 Mo          | 63  Mo             | 422   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 18 | 16154482       | 13204331                  | 81,74% | 1277 Mo          | 52  Mo             | 430   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 19 | 21013924       | 17102120                  | 81,38% | 1646 Mo          | 59  Mo             | 555   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 20 | 17213114       | 14433357                  | 83,85% | 1389 Mo          | 53  Mo             | 459   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 21 | 17360907       | 14733001                  | 84,86% | 1203 Mo          | 55  Mo             | 450   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 22 | 18136816       | 15389581                  | 84,85% | 1257 Mo          | 53  Mo             | 469   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 23 | 14763678       | 12173025                  | 82,45% | 1140 Mo          | 56  Mo             | 393   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 24 | 15541709       | 12890345                  | 82,94% | 1057 Mo          | 48  Mo             | 398   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 25 | 16433215       | 13094314                  | 79,68% | 1241 Mo          | 57  Mo             | 433   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 26 | 17370850       | 14264136                  | 82,12% | 1347 Mo          | 51  Mo             | 466   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 27 | 14613512       | 8654495                   | 59,22% | 887  Mo          | 56  Mo             | 339   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 28 | 15248545       | 11367589                  | 74,55% | 1166 Mo          | 67  Mo             | 405   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 29 | 14316809       | 10767926                  | 75,21% | 1103 Mo          | 63  Mo             | 379   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 30 | 15178058       | 12265794                  | 80,81% | 1030 Mo          | 66  Mo             | 390   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 31 | 14968579       | 11876186                  | 79,34% | 1009 Mo          | 63  Mo             | 387   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 32 | 16912705       | 13550508                  | 80,12% | 1143 Mo          | 70  Mo             | 442   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 33 | 16782154       | 12755111                  | 76,00% | 1227 Mo          | 65  Mo             | 438   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 34 | 16741443       | 13168071                  | 78,66% | 1260 Mo          | 71  Mo             | 442   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 35 | 13096171       | 10367041                  | 79,16% | 992  Mo          | 62  Mo             | 350   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 36 | 17715118       | 14092985                  | 79,55% | 1404 Mo          | 68  Mo             | 483   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 37 | 17288466       | 7402082                   | 42,82% | 741  Mo          | 48  Mo             | 339   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 38 | 16116394       | 13178457                  | 81,77% | 1101 Mo          | 63  Mo             | 420   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 39 | 14241106       | 10537228                  | 73,99% | 880  Mo          | 57  Mo             | 348   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 40 | 13784738       | 10598464                  | 76,89% | 1005 Mo          | 64  Mo             | 358   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 41 | 12438007       | 9620975                   | 77,35% | 911  Mo          | 60  Mo             | 326   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 42 | 13853959       | 11031238                  | 79,63% | 1045 Mo          | 64  Mo             | 365   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 43 | 12036162       | 6654780                   | 55,29% | 684  Mo          | 46  Mo             | 268   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 44 | 13873129       | 10251074                  | 73,89% | 1048 Mo          | 61  Mo             | 365   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 45 | 19817751       | 14904502                  | 75,21% | 1520 Mo          | 72  Mo             | 528   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 46 | 13368959       | 10818619                  | 80,92% | 912  Mo          | 63  Mo             | 350   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 47 | 7566467        | 6139001                   | 81,13% | 520  Mo          | 44  Mo             | 201   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 48 | 32586928       | 21191363                  | 65,03% | 1816 Mo          | 82  Mo             | 766   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 49 | 30733184       | 18791373                  | 61,14% | 1801 Mo          | 89  Mo             | 721   s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 50 | 41287616       | 30383875                  | 73,59% | 2911 Mo          | 112 Mo             | 1065  s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 51 | 40439965       | 31177914                  | 77,10% | 2981 Mo          | 117 Mo             | 1070  s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 53 | 40876476       | 33780065                  | 82,64% | 3316 Mo          | 103 Mo             | 1165  s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       | 55 | 52424414       | 47117107                  | 89,88% | 3811 Mo          | 119 Mo             | 1477  s.         |
       +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
       For experimental reasons (manipulation efficiency, e.g. PCR), we
       remove samples 33, 45, 48 and 55.
       Run TemplateFilter on Mnase Samples
       -----------------------------------
       Finally, for each Mnase sample we perform the TemplateFilter analysis.
       **WARNING** TemplateFilter returns a list of nucleosomes. Each
       nucleosome is defined by its center and its width. An odd width leads
       us to consider non-integer lower and upper bounds.
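       To make this point concrete, here is a minimal sketch that assumes
       the bounds are taken symmetrically around the center. The function
       name and the numerical values are hypothetical.

```python
def nucleosome_bounds(center, width):
    """Return (lower, upper) bounds placed symmetrically around center."""
    half = width / 2.0
    return center - half, center + half


# An odd width gives non-integer bounds, an even width gives integers.
print(nucleosome_bounds(1000, 147))
print(nucleosome_bounds(1000, 146))
```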
       **WARNING** TemplateFilter is not designed to deal with replicates.
       So we choose to keep as many nucleosomes as possible and to filter
       them in a second step, taking advantage of the replicates. To do
       that, we set a low correlation threshold parameter (*0.5*) and a
       particularly high overlap value (*300%*).
       This step is performed by the following part of the *wf.py* script:
       +----+--------+------------+---------------+------------------+
       | id | strain | found nucs | nuc file size | process duration |
       +====+========+============+===============+==================+
       | 1  | BY     | 96214      | 68 Mo         | 1022 s.          |
       +----+--------+------------+---------------+------------------+
       | 2  | BY     | 91694      | 65 Mo         | 1038 s.          |
       +----+--------+------------+---------------+------------------+
       | 3  | BY     | 91205      | 65 Mo         | 1036 s.          |
       +----+--------+------------+---------------+------------------+
       | 4  | RM     | 88076      | 62 Mo         | 984 s.           |
       +----+--------+------------+---------------+------------------+
       | 5  | RM     | 90141      | 64 Mo         | 967 s.           |
       +----+--------+------------+---------------+------------------+
       | 6  | RM     | 87517      | 62 Mo         | 980 s.           |
       +----+--------+------------+---------------+------------------+
       | 7  | YJM    | 88945      | 64 Mo         | 566 s.           |
       +----+--------+------------+---------------+------------------+
       | 8  | YJM    | 88689      | 64 Mo         | 570 s.           |
       +----+--------+------------+---------------+------------------+
       | 9  | YJM    | 88128      | 63 Mo         | 565 s.           |
       +----+--------+------------+---------------+------------------+
       Inferring Nucleosome Position and Extracting Read Counts
       ========================================================
       The second part of the tutorial uses *R*
       (http://www.r-project.org). It consists of the following main steps:
          * compute_rois.R
          * extract_maps.R
          * compare_common_wp.R
          * split_samples.R
          * count_reads.R
          * get_size_factors.R
          * launch_deseq.R
       Computing Common Genome Region Between Strains
       ----------------------------------------------
          R CMD BATCH src/current/compute_rois.R
       Extracting Maps for Well Positioned and Fuzzy Nucleosomes
       ---------------------------------------------------------
          R CMD BATCH src/current/extract_maps.R
       Compute Distance Between Well Positioned Nucleosomes
       ----------------------------------------------------
          R CMD BATCH src/current/compare_common_wp.R
       Split and Compress Samples According to CURs
       --------------------------------------------
          R CMD BATCH src/current/split_samples.R
       Count Reads for Each Nucleosome
       -------------------------------
          R CMD BATCH src/current/count_reads.R
       Get Size Factors Using DESeq
       ----------------------------
          R CMD BATCH src/current/get_size_factors.R
       Performing DESeq Analysis
       -------------------------
          R CMD BATCH src/current/launch_deseq.R
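       The seven commands above can also be launched in sequence from a
       small driver. The sketch below is a convenience, not part of the
       NucleoMiner sources. The function names are hypothetical, and it
       assumes that *R* is on the PATH and that the driver is run from the
       project's root.

```python
import subprocess

# Scripts listed in this section, in their order of execution.
R_SCRIPTS = [
    "compute_rois.R", "extract_maps.R", "compare_common_wp.R",
    "split_samples.R", "count_reads.R", "get_size_factors.R",
    "launch_deseq.R",
]


def r_batch_commands(script_dir="src/current"):
    """Return one 'R CMD BATCH' command per script."""
    return [
        ["R", "CMD", "BATCH", "%s/%s" % (script_dir, name)]
        for name in R_SCRIPTS
    ]


def run_all():
    """Run every script in order, stopping at the first failure."""
    for command in r_batch_commands():
        subprocess.check_call(command)
```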
       Results
       =======
       Output Files Organisation
       -------------------------
       The previous steps produce the following 45 files. Each filename has
       the form
          results/current/[combi]_[marker]_[form]_snep.tab
       where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination,
       marker is in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each
       post-translational histone modification and form is in {wp, fuzzy,
       wpfuzzy}, considering well positioned nucleosomes, fuzzy nucleosomes
       or both for the SNEP computation.
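       The count of 45 files follows from the 3 strain combinations, 5
       markers and 3 forms. A short sketch (variable names are
       illustrative) that enumerates the expected filenames:

```python
from itertools import product

combis = ["BY_RM", "BY_YJM", "RM_YJM"]
markers = ["H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
forms = ["wp", "fuzzy", "wpfuzzy"]

# One file per (combi, marker, form) triple: 3 * 5 * 3 = 45.
filenames = [
    "results/current/%s_%s_%s_snep.tab" % triple
    for triple in product(combis, markers, forms)
]

print(len(filenames))
print(filenames[0])
```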
       chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM
       lower_bound_RM upper_bound_RM index_nuc_RM roi_index form
       BY_Mnase_Seq_1 BY_Mnase_Seq_2 BY_Mnase_Seq_3 RM_Mnase_Seq_4
       RM_Mnase_Seq_5 RM_Mnase_Seq_6 BY_H3K14ac_36 BY_H3K14ac_37
       BY_H3K14ac_53 RM_H3K14ac_38 RM_H3K14ac_39 pvalsGLM
       For each file, there is 1 line per nucleosome and each line is
       composed of many columns divided into 3 main topics:
          * nuc information
          * number of reads for each sample
          * DESeq analysis results.
       For example, for the file *BY_RM_H3K14ac_wp_snep.tab* the information
       is:
          * chr_BY, the BY chr involved
          * lower_bound_BY, the lower bound of the BY nuc
          * upper_bound_BY, the upper bound of the BY nuc
          * index_nuc_BY, the index of the nuc in the entire list of BY
            nucs
          * chr_RM, lower_bound_RM, upper_bound_RM, index_nuc_RM are the
            same information for the RM strain
          * roi_index, the index of the region of interest involved.
       The next columns hold the indicators for each sample. They are labeled
       [strain]_[marker]_[sample_id] and each value represents the number of
       reads for the current nuc for the sample *sample_id*.
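       Since some markers contain an underscore themselves (e.g.
       *Mnase_Seq*), such a label is best split on its first and last
       underscore. A minimal sketch, with a hypothetical function name:

```python
def parse_sample_label(label):
    """Split '[strain]_[marker]_[sample_id]' into its three parts."""
    strain, rest = label.split("_", 1)         # first underscore
    marker, sample_id = rest.rsplit("_", 1)    # last underscore
    return strain, marker, int(sample_id)


print(parse_sample_label("BY_Mnase_Seq_1"))
print(parse_sample_label("RM_H3K14ac_38"))
```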
       The 5 final columns concern DESeq analysis:
          * manip[a_manip] strain[a_strain]
            manip[a_strain]:strain[a_strain], the manip (marker) effect, the
            strain effect and the snep effect.
          * pvalsGLM, the p-value resulting from the comparison of the GLM
            models with and without the interaction term *marker:strain*
          * snep_index, a boolean set to TRUE if the *pvalsGLM* value is
            under the threshold computed with the FDR function with a rate
            set to 0.01%.
       It also produces a file that lists the size factor of each involved
       sample for the different strain combinations and nucleosomal region
       types:
          results/current/size_factors.tab
       TODO: include this file...
       /home/filleton/analyses/snepcatalog/data/2013-10-09/current/README.txt
       Number of SNEPs
       ---------------
       Here are the numbers of SNEPs computed for each form.
          [1] "wp"
                 #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
          BY-RM  30234     520     798     83    3566      26
          BY-YJM 31298     303     619    102     103     128
          RM-YJM 29863     129     340     46    3177      18
          [1] "fuzzy"
                 #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
          BY-RM  10748     294     308    101    1681      42
          BY-YJM 10669     122     176    124      93      87
          RM-YJM 11478      54     112     41    1389      20
          [1] "wpfuzzy"
                 #nucs H3K4me1 H3K4me3 H3K9ac H3K14ac H4K12ac
          BY-RM  40982     770    1136    183    5404      73
          BY-YJM 41967     439     804    214     198     199
          RM-YJM 41341     184     468     87    4687      37
       TODO:
          * Print/study intra/inter strain LODs.
          * Check the normality of the samples using Shapiro–Wilk
            (hypothesis for computing LODs)
