/doc/sphinx_doc/build/text/tuto.txt - Diff - NucleoMiner - Forge du Centre Blaise Pascal

Revision dadb6a4d doc/sphinx_doc/build/text/tuto.txt

     (nucleosome position and indicators) from the dataset.
     Dataset and Configuration File
     ==============================
     Python and R Common Configuration File
     ======================================
     First of all we define in one place some configuration variables that
     will be loaded by Python and R scripts. This file is
     **configurator.py**. Executing this Python script dumps the
     variables into **nucleo_miner_config.json**, which will be read by
     both kinds of scripts (R and Python).
     To do this, run the following command line at the root of your
     project:
        python src/current/configurator.py
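     The dump-then-reload idea described above can be sketched as follows.
     This is only an illustration: the keys, values and file name used here
     are placeholders, and the real **configurator.py** may be organised
     differently.

```python
import json
import os
import tempfile

# Hypothetical configuration values; the real configurator.py defines its own.
CONFIG = {
    "CSV_SAMPLE_FILE": "data/samples.csv",  # placeholder path
    "READ_LENGTH": 50,                       # placeholder value
}


def dump_config(config, path):
    """Write the configuration dictionary to a JSON file."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2, sort_keys=True)


def load_config(path):
    """Read the configuration back, as a Python script would."""
    with open(path) as handle:
        return json.load(handle)


# Round trip in a temporary directory so the sketch leaves no file behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "example_config.json")
    dump_config(CONFIG, json_path)
    reloaded = load_config(json_path)
```

     Because JSON is language neutral, an R script can read the same file
     with any JSON reader, which is what makes a single shared
     configuration possible.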
     $$$ other python script to describe: - libcoverage.py - wf.py
     Dataset and Configuration Variables
     ===================================
     We want to compare nucleosomes of 3 yeast strains:
-...
     * H4K12ac
     In order to simplify the experimental design, we consider Mnase as a
     marker. For each couple *(strain, marker)* we perform 3 replicates.
     So, theoretically we should have *3 * (1 + 5) * 3 = 54* samples. In
     practice we only obtained 2 replicates for *(YJM, H3K4me1)*. Each one of
     the 53 samples is identified by a unique identifier. The file
     *CSV_SAMPLE_FILE* sums up this information.
     configurator.CSV_SAMPLE_FILE = None
        Path to the CSV file that contains sample information.
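     The sample count can be checked with a few lines of Python. The
     strain and marker names are those used in this tutorial; the variable
     names are only illustrative.

```python
# Strains and markers as named in this tutorial (Mnase counted as a marker).
strains = ["BY", "RM", "YJM"]
markers = ["Mnase", "H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
replicates_per_couple = 3

# Theoretical design: every (strain, marker) couple has 3 replicates.
expected_samples = len(strains) * len(markers) * replicates_per_couple

# In practice one replicate is missing for (YJM, H3K4me1).
missing_replicates = {("YJM", "H3K4me1"): 1}
obtained_samples = expected_samples - sum(missing_replicates.values())
```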
     We use a convention to link sample and Illumina fastq outputs.
     Illumina output files of the sample *ID* will be stored in the
     directory *ILLUMINA_OUTPUTFILE_PREFIX* + *ID*. For example, sample 41
     outputs will be stored in the directory
     *data/2012-09-05/FASTQ/Sample_Yvert_Bq41/*.
     configurator.ILLUMINA_OUTPUTFILE_PREFIX = None
        Prefix for Illumina fastq output files.
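     A minimal sketch of this naming convention follows. The prefix value
     is an assumption inferred from the sample 41 example above, and the
     helper name is hypothetical.

```python
# Assumed prefix, inferred from the example directory of sample 41.
ILLUMINA_OUTPUTFILE_PREFIX = "data/2012-09-05/FASTQ/Sample_Yvert_Bq"


def illumina_output_dir(prefix, sample_id):
    """Return the directory holding the Illumina outputs of one sample."""
    return "%s%s" % (prefix, sample_id)


sample_41_dir = illumina_output_dir(ILLUMINA_OUTPUTFILE_PREFIX, 41)
```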
     For BY (resp. RM and YJM) we use the following reference genome
     *saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta* (resp.
     *saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta* and
     *saccharomyces_cerevisiae_YJM_789_screencontig.fasta*). The index
     *FASTA_REFERENCE_GENOME_FILES* stores this information.
     configurator.FASTA_REFERENCE_GENOME_FILES = None
        Dictionary where each fasta reference genome is indexed by the
        reference strain it corresponds to.
     Each chromosome/contig is identified in the fasta file by an obscure
     identifier. For example, BY chromosome I is identified by
     *gi|144228165|ref|NC_001133.7|* whereas TemplateFilter expects an
     integer. So, we translate it. The index *FASTA_INDEXES* stores this
     translation.
     configurator.FASTA_INDEXES = None
        Dictionary indexed by strain. Each value is a dictionary whose keys
        are the chromosome references from the fasta file and whose values
        are the corresponding identifiers for TemplateFilter.
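     The translation can be pictured as a two-level dictionary lookup. In
     the sketch below, mapping BY chromosome I to the integer 1 is an
     assumption, and the helper name is hypothetical.

```python
# One entry only; the identifier comes from the text, the value 1 is assumed.
FASTA_INDEXES = {
    "BY": {"gi|144228165|ref|NC_001133.7|": 1},
}


def translate_chromosome(fasta_indexes, strain, fasta_id):
    """Return the integer TemplateFilter expects for a fasta identifier."""
    try:
        return fasta_indexes[strain][fasta_id]
    except KeyError:
        raise KeyError(
            "no TemplateFilter index for %r in strain %r" % (fasta_id, strain)
        )


by_chr_1 = translate_chromosome(
    FASTA_INDEXES, "BY", "gi|144228165|ref|NC_001133.7|"
)
```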
     From a pragmatic point of view we discard some parts of the genome
     (repeated sequences, etc.). The list of blacklisted areas is
     explicitly detailed in *AREA_BLACK_LIST*.
     configurator.AREA_BLACK_LIST = None
        Dictionary where keys are strains and values are the blacklisted
        genome regions.
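     As an illustration of how such a blacklist could be applied, the
     sketch below tests whether an interval overlaps a blacklisted region.
     The layout (a list of *(chromosome, begin, end)* tuples per strain),
     the coordinates and the helper name are all assumptions; the actual
     structure of *AREA_BLACK_LIST* is not shown here.

```python
# Hypothetical layout: strain -> list of (chromosome, begin, end) tuples.
AREA_BLACK_LIST = {
    "BY": [(1, 1000, 2000), (3, 500, 800)],  # made-up coordinates
}


def is_black_listed(area_black_list, strain, chrom, begin, end):
    """Return True if [begin, end] overlaps a blacklisted region."""
    for bl_chrom, bl_begin, bl_end in area_black_list.get(strain, []):
        # Two closed intervals overlap when each starts before the other ends.
        if bl_chrom == chrom and begin <= bl_end and bl_begin <= end:
            return True
    return False
```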
     For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use
     the previously computed .c2c file
     *data/2012-03_primarydata/BY_RM_gxcomp.c2c* (resp.
-...
     *NucleoMiner*, the old version of *NucleoMiner2*
     (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
     configurator.C2C_FILES = None
        Dictionary where each strain combination indexes the corresponding
        genome alignment.
     *nucleominer* uses specific directories to work in; these are
     described in *INDEX_DIR*, *ALIGN_DIR* and *LOG_DIR*.
-...
     All paths, prefixes and indexes can be changed in the
     *src/current/nucleominer_config.json* file.
     wf.json_conf_file = 'src/current/nucleominer_config.json'
        Path to the json configuration file.
     Preprocessing Illumina Fastq Reads for Each Sample
     ==================================================
-...
     *samples*, *samples_mnase* and *strains* that will be used throughout
     the 4 steps.
     wf.samples = []
        List of samples where a sample is identified by an id (key: *id*)
        and a strain name (key: *strain*).
     wf.samples_mnase = []
        List of Mnase samples.
     wf.strains = []
        List of reference strains.
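     The shape of these three variables can be sketched as follows. The
     *id* and *strain* keys come from the description above; the *marker*
     key, the ids and the way the lists are derived are assumptions made
     for illustration only.

```python
# Hypothetical samples; only the "id" and "strain" keys are documented.
samples = [
    {"id": 1, "strain": "BY", "marker": "Mnase"},
    {"id": 2, "strain": "RM", "marker": "Mnase"},
    {"id": 3, "strain": "YJM", "marker": "H3K4me1"},
    {"id": 4, "strain": "BY", "marker": "H3K9ac"},
]

# Mnase samples are the ones later passed to TemplateFilter.
samples_mnase = [s for s in samples if s["marker"] == "Mnase"]

# Each reference strain appears once, in a stable order.
strains = sorted(set(s["strain"] for s in samples))
```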
     Creating Bowtie Index from each Reference Genome
     ------------------------------------------------
-...
     will be used by bowtie to align reads. This step is performed by the
     following part of the *wf.py* script:
          for strain in strains:
            per_strain_stats[strain] = create_bowtie_index(strain,
              config["FASTA_REFERENCE_GENOME_FILES"][strain], config["INDEX_DIR"],
              config["BOWTIE_BUILD_BIN"])
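     The body of *create_bowtie_index* is not shown here. As a rough
     sketch, and assuming the binary follows the usual bowtie-build /
     bowtie2-build calling convention (reference fasta first, index
     basename second), the command for one strain could be assembled as
     below. The helper name and the choice of the strain name as index
     basename are hypothetical.

```python
import os


def bowtie_build_command(strain, fasta_ref, index_dir, bowtie_build_bin):
    """Assemble the argument list used to index one reference genome."""
    # Assumption: the index basename is the strain name inside index_dir.
    index_basename = os.path.join(index_dir, strain)
    return [bowtie_build_bin, fasta_ref, index_basename]


command = bowtie_build_command(
    "BY",
    "saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta",
    "index",
    "bowtie2-build",
)
```

     Such a list can then be handed to *subprocess.check_call* to run the
     indexing.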
     The following table sums up the file sizes and process durations
     involved in this step.
-...
     *subprocess* class. This step is performed by the following part of
     the *wf.py* script:
          for sample in samples:
            per_sample_align_stats["sample_%s" % sample["id"]] = align_reads(sample,
              config["ALIGN_DIR"], config["LOG_DIR"], config["INDEX_DIR"],
              config["ILLUMINA_OUTPUTFILE_PREFIX"], config["BOWTIE2_BIN"],
              config["SAMTOOLS_BIN"], config["BEDTOOLS_BIN"])
     Convert Aligned Reads for TemplateFilter
     ----------------------------------------
-...
     This step is performed by the following part of the *wf.py* script:
          for sample in samples:
            per_sample_convert_stats["sample_%s" % sample["id"]] = split_fr_4_TF(sample,
              config["ALIGN_DIR"], config["FASTA_INDEXES"], config["AREA_BLACK_LIST"],
              config["READ_LENGTH"],config["MAPQ_THRES"])
     The following table sums up the number of reads, the file sizes and
     the process durations for the last two steps. In our case, the
     alignment process was multithreaded over 3 cores.
-...
     | 55 | 52424414       | 47117107                  | 89.88% | 3811 MB          | 119 MB             | 1477  s.         |
     +----+----------------+---------------------------+--------+------------------+--------------------+------------------+
     For some reasons (manipulation efficiency, e.g. PCR...), we remove
     samples 33, 45, 48 and 55.
-...
     This step is performed by the following part of the *wf.py* script:
          for sample in samples_mnase:
            per_mnase_sample_stats["sample_%s" % sample["id"]] = template_filter(sample,
              config["ALIGN_DIR"], config["LOG_DIR"], config["TF_BIN"],
              config["TF_TEMPLATES_FILE"], config["TF_CORR"], config["TF_MINW"],
              config["TF_MAXW"], config["TF_OL"])
     +----+--------+------------+---------------+------------------+
     | id | strain | found nucs | nuc file size | process duration |
     +====+========+============+===============+==================+
-...
     Inferring Nucleosome Position and Extracting Read Counts
     ========================================================
     This preprocessing step consists of the 4 main steps embedded in
     *wf.py* and described below. As a preamble, this script computes
     *samples*, *samples_mnase* and *strains* that will be used throughout
     the 4 steps.
     The second part of the tutorial uses *R*
     (http://www.r-project.org). It consists of a set of R scripts
     that will be sourced in an R console launched at the root of your
     project. The R scripts are:
        * compute_rois.R
        * headers.R
        * extract_maps.R
-...
        * launch_deseq.R
     Computing Common Genome Region Between Strains
     ----------------------------------------------
     The Script headers.R
     --------------------
     The script header.R is included in each of the other scripts. It is in
     charge of:
        * loading the libraries used in these scripts
        * loading the configuration (design, strain, marker...)
        * computing and caching CURs
     In your R console, run the following command line:
        R CMD BATCH src/current/header.R
     The Script extract_maps.R
     -------------------------
     This script is in charge of extracting maps for well-positioned and
     fuzzy nucleosomes. First of all, this script computes intra- and
     inter-strain nucleosome maps for each CUR. This step is executed in
     parallel on many cores using the BoT library. Next, it collects the
     results and produces well-positioned, fuzzy and UNR maps.
     The well-positioned map for BY is collected in the result directory
     and is called **BY_wp.tab**. It is composed of the following columns:
        * chr, the number of the chromosome
        * lower_bound, the lower bound of the nucleosome
        * upper_bound, the upper bound of the nucleosome
        * cur_index, index of the CUR
        * index_nuc, the index of the nucleosome in the CUR
        * wp, 1 if it is a well-positioned nucleosome, 0 otherwise
        * nb_reads, the number of reads that support this nucleosome
        * nb_nucs, the number of TemplateFilter nucleosomes across
          replicates (= the number of replicates if it is a well-positioned
          nucleosome)
        * llr_1, for a well-positioned nucleosome, it is the LLR1 between
          the first and the second TemplateFilter nucleosome.
        * llr_2, for a well-positioned nucleosome, it is the LLR1 between
          the second and the first TemplateFilter nucleosome.
        * wp_llr, for a well-positioned nucleosome, it is the LLR2
          over all TemplateFilter nucleosomes.
        * wp_pval, for a well-positioned nucleosome, it is the p-value of
          the chi-square test obtained with the LLR2 (**1-pchisq(2.LLR2,
          df=4)**)
        * dyad_shift, for a well-positioned nucleosome, it is the shift
          between the two extreme TemplateFilter nucleosome dyad positions.
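     The wp_pval formula, read as *1 - pchisq(2 * LLR2, df=4)*, can be
     reproduced without R. The sketch below relies on the closed form of
     the chi-square upper tail for an even number of degrees of freedom,
     *exp(-x/2) * sum((x/2)^i / i!, i = 0 .. df/2 - 1)*. The function
     names are hypothetical, and this is not the code used by the R
     scripts.

```python
import math


def chi2_upper_tail_even_df(x, df):
    """Upper tail of the chi-square law, i.e. 1 - pchisq(x, df), even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    term = 1.0   # (x/2)^0 / 0!
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i  # builds (x/2)^i / i! incrementally
        total += term
    return math.exp(-half) * total


def llr_pvalue(llr, df):
    """p-value associated with a log-likelihood ratio: 1 - pchisq(2*llr, df)."""
    return chi2_upper_tail_even_df(2.0 * llr, df)
```

     With *df=4* this gives wp_pval from LLR2; with *df=2* the same
     function applies to an LLR3 score.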
     The fuzzy map for BY is collected in the result directory and is
     called **BY_fuzzy.tab**. It is composed of the following columns:
        * chr, the number of the chromosome
        * lower_bound, the lower bound of the nucleosome
        * upper_bound, the upper bound of the nucleosome
        * cur_index, index of the CUR
     The common well-positioned map for BY and RM strains is collected in
     the result directory and is called **BY_RM_common_wp.tab**. It is
     composed of the following columns:
        * cur_index, the index of the CUR
        * index_nuc_BY, the index of the BY nucleosome in the CUR
        * index_nuc_RM, the index of the RM nucleosome in the CUR
        * llr_score, the LLR3 score between the BY and RM nucleosomes
        * common_wp_pval, the p-value of the chi-square test obtained with
          the LLR3 (**1-pchisq(2.LLR3, df=2)**)
     The common UNR map for BY and RM strains is collected in the result
     directory and is called **BY_RM_common_unr.tab**. It is composed of
     the following columns:
        * cur_index, the index of the CUR
        * index_nuc_BY, the index of the BY nucleosome in the CUR
        * index_nuc_RM, the index of the RM nucleosome in the CUR
     To execute this script, run the following command line in your R
     console:
        source("src/current/extract_maps.R")
     The Script compare_common_wp.R
     ------------------------------
     This script is used to compare inter strain distances between common
     well-positioned nucleosomes.
     For example, it computes the file **BY_RM_common_wp_diff.tab** that
     contains dyad shifts between two well-positioned nucleosomes. It is
     composed of the following columns:
        * cur_index, the index of the CUR
        * index_nuc_BY, the index of the BY nucleosome in the CUR
        * index_nuc_RM, the index of the RM nucleosome in the CUR
        * llr_score, the LLR3 score between the BY and RM nucleosomes
        * common_wp_pval, the p-value of the chi-square test obtained with
          the LLR3 (**1-pchisq(2.LLR3, df=2)**)
        * diff, the dyad shifts between two well-positioned nucleosomes
     It also translates well-positioned nucleosome maps from one strain to
     another and stores them in a table.
     For example, the file **results/2014-04/RM_wp_tr_2_BY.tab** contains
     RM well-positioned nucleosomes translated into the BY genome
     coordinate system. It is composed of the following columns:
        * strain_ref, the reference genome (in which positions are
          defined)
        * begin, the translated lower bound of the nucleosome
        * end, the translated upper bound of the nucleosome
        * chr, the chromosome number in the reference genome (in which
          positions are defined)
        * length, the length of the nucleosome (could be negative)
        * cur_index, the index of the CUR
        * index_nuc, the index of the nucleosome in the CUR
     Compute Distance Between Well Positioned Nucleosomes
     ----------------------------------------------------
     To execute this script, run the following command line in your R
     console:
        R CMD BATCH src/current/compare_common_wp.R
-...
     Where combi is in {BY_RM, BY_YJM, RM_YJM} for each strain combination,
     marker is in {H3K4me1, H3K4me3, H3K9ac, H3K14ac, H4K12ac} for each
     post-translational histone modification and form is in {wp, fuzzy,
     wpfuzzy} considering well-positioned nucleosomes, fuzzy nucleosomes or
     both for SNEP computation.
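     The set of *(combi, marker, form)* triples can be enumerated as
     follows. Only the three value sets come from the text above; the
     variable names are illustrative.

```python
import itertools

combis = ["BY_RM", "BY_YJM", "RM_YJM"]
markers = ["H3K4me1", "H3K4me3", "H3K9ac", "H3K14ac", "H4K12ac"]
forms = ["wp", "fuzzy", "wpfuzzy"]

# One triple per strain combination, histone modification and nucleosome form.
all_combinations = list(itertools.product(combis, markers, forms))
```

     This yields 3 * 5 * 3 = 45 triples, one per SNEP computation.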
     chr_BY lower_bound_BY upper_bound_BY index_nuc_BY chr_RM
     lower_bound_RM upper_bound_RM index_nuc_RM roi_index form
