       Tutorial
       ========
       This tutorial describes the steps required to perform a quantitative analysis of epigenetic marks on individual nucleosomes. We assume that files are organised according to a given hierarchy and that all command lines are launched from the project’s root directory.
       This tutorial is divided into two main parts. The first part covers the Python script `wf.py`, which aligns and converts short sequence reads. The second part covers the R scripts that extract nucleosome-level information (nucleosome positions and indicators) from the dataset.
       Experimental Dataset, Working Directory and Configuration File
       --------------------------------------------------------------
       Working Directory Organisation
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       After installing the NucleoMiner2 environment (see the previous section), go to the root working directory of the tutorial by typing the following command in a terminal:
       .. code:: bash
         cd doc/Chuffart_NM2_workdir/
       Retrieving Experimental Dataset
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       The MNase-seq and MN-ChIP-seq raw data are available at ArrayExpress (http://www.ebi.ac.uk/arrayexpress/) under accession number E-MTAB-2671.
       $$$ TODO explain how organise Experimental Dataset into the `data` directory of the working directory.
       In this tutorial, we want to compare the nucleosomes of 2 yeast strains: BY and RM. For each strain, MNase-seq was performed, as well as ChIP-seq using an antibody recognizing the H3K14ac epigenetic mark. Illumina sequencing produced single-end reads of 50 bp.
       The dataset is composed of 55 files organised as follows:
         - 3 replicates for BY MNase Seq
           - sample 1 (5 fastq.gz files)
           - sample 2 (5 fastq.gz files)
           - sample 3 (4 fastq.gz files)
         - 3 replicates for RM MNase Seq
           - sample 4 (4 fastq.gz files)
           - sample 5 (4 fastq.gz files)
           - sample 6 (5 fastq.gz files)
         - 3 replicates for BY ChIP Seq H3K14ac
           - sample 36 (5 fastq.gz files)
           - sample 37 (5 fastq.gz files)
           - sample 53 (9 fastq.gz files)
         - 2 replicates for RM ChIP Seq H3K14ac
           - sample 38 (5 fastq.gz files)
           - sample 39 (4 fastq.gz files)
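
       The sample layout listed above can be written down as a small Python structure, which makes it easy to check that the file counts add up to 55. This is only an illustrative sketch: the name `DESIGN` and its layout are assumptions and are not taken from `configurator.py`.

       .. code:: python

         # Hypothetical description of the experimental design listed above.
         # Keys are (strain, assay); values map sample id -> number of fastq.gz files.
         DESIGN = {
             ("BY", "MNase-seq"): {1: 5, 2: 5, 3: 4},
             ("RM", "MNase-seq"): {4: 4, 5: 4, 6: 5},
             ("BY", "H3K14ac ChIP-seq"): {36: 5, 37: 5, 53: 9},
             ("RM", "H3K14ac ChIP-seq"): {38: 5, 39: 4},
         }


         def count_files(design):
             """Return the total number of fastq.gz files in the design."""
             return sum(sum(files.values()) for files in design.values())


         def sample_ids(design):
             """Return all sample ids, sorted numerically."""
             return sorted(sid for files in design.values() for sid in files)


         print(count_files(DESIGN))
         print(sample_ids(DESIGN))
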
       Python and R Common Configuration File
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       First, we need to define the configuration variables that will be passed to the Python and R scripts. These variables are initialized in the file `configurator.py`. Executing this Python script dumps them into the file `nucleominer_config.json`, which is then read by both the R and Python scripts.
       If you need to adapt variable values (paths, default parameters...), edit `configurator.py`. Then go to the root directory of your project and run the following command to dump the configuration file:
       .. code:: bash
         python src/current/configurator.py
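
       The dump-then-reload mechanism described above can be sketched as follows. The variable values and the `READ_LENGTH` key are hypothetical placeholders chosen for illustration; the real content of `configurator.py` may differ.

       .. code:: python

         import json
         import os
         import tempfile

         # Hypothetical configuration values; the real ones live in configurator.py.
         config = {
             "CSV_SAMPLE_FILE": "data/samples.csv",
             "INDEX_DIR": "data/index",
             "ALIGN_DIR": "data/align",
             "READ_LENGTH": 50,
         }


         def dump_config(conf, path):
             """Write the configuration as JSON so that Python and R can both read it."""
             with open(path, "w") as handle:
                 json.dump(conf, handle, indent=2, sort_keys=True)


         def load_config(path):
             """Read the configuration back, as a downstream script would do."""
             with open(path) as handle:
                 return json.load(handle)


         # Demonstration in a temporary directory, so that no project file is touched.
         workdir = tempfile.mkdtemp()
         json_path = os.path.join(workdir, "nucleominer_config.json")
         dump_config(config, json_path)
         reloaded = load_config(json_path)
         print(reloaded["READ_LENGTH"])

       JSON is a convenient exchange format here because both Python and R can parse it without sharing any code.
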
       Preprocessing Illumina Fastq Reads for Each Sample
       --------------------------------------------------
       Once the variables and the experimental design have been specified, the script `wf.py` runs the whole preprocessing automatically, with no further intervention.
       To launch it, run the following command:
       .. code:: bash
         python src/current/wf.py
       This preprocessing consists of 4 steps embedded in the `wf.py` script, which are described below. As a preamble, the script computes `samples`, `samples_mnase` and `strains`, which are used throughout the 4 steps.
       .. autodata:: wf.samples
           :noindex:
       .. autodata:: wf.samples_mnase
           :noindex:
       .. autodata:: wf.strains
           :noindex:
       Creating Bowtie Index from each Reference Genome
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       For each strain, the script *wf.py* then creates a bowtie index. The bowtie index of a strain is a compressed, searchable representation of the genome of this strain (based on the Burrows-Wheeler transform). It is used by bowtie to align reads. The part of the script performing this is the following:
       .. literalinclude:: ../../../snep/src/current/wf.py
          :start-after: # _STARTOF_ step_1
          :end-before: # _ENDOF_ step_1
          :language: python
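
       As a rough illustration of this step, the sketch below builds the index command and runs it through `subprocess`. It assumes the `bowtie2-build` executable (whose usage is `bowtie2-build <reference> <index prefix>`); the function names and paths are hypothetical and are not copied from `wf.py`.

       .. code:: python

         import subprocess


         def bowtie_build_command(fasta_file, index_prefix, build_bin="bowtie2-build"):
             """Return the argument list that builds an index from a FASTA genome."""
             return [build_bin, fasta_file, index_prefix]


         def run(command, dry_run=True):
             """Run a command, or only return it as a string when dry_run is set."""
             if not dry_run:
                 subprocess.check_call(command)
             return " ".join(command)


         # Hypothetical paths, for illustration only.
         cmd = bowtie_build_command("data/BY.fasta", "data/index/BY")
         print(run(cmd))
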
       As an indication, the following table summarizes the file sizes and process durations that we observed when running this step on a Linux server.
       ======  ======================  ======================  ================
       strain  fasta genome file size  bowtie index file size  process duration
       ======  ======================  ======================  ================
       BY      12 MB                   25 MB                   11 s
       RM      12 MB                   24 MB                   9 s
       ======  ======================  ======================  ================
       Aligning Reads to Reference Genome
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       Next, the *wf.py* script launches bowtie to align reads to the reference genome. It produces a `.sam` file that is converted into a `.bed` file. The `bowtie`, `samtools` and `bedtools` binaries are called through the Python `subprocess` module. This step is performed by the following part of the script:
       .. literalinclude:: ../../../snep/src/current/wf.py
          :start-after: # _STARTOF_ step_2
          :end-before: # _ENDOF_ step_2
          :language: python
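
       The chain of tools described above can be sketched as three commands. This is a hedged example and not the code of `wf.py`: it assumes `bowtie2`, `samtools view` and `bedtools bamtobed`, and the function names, paths and the intermediate `.bam` file are hypothetical.

       .. code:: python

         import subprocess


         def alignment_commands(index_prefix, fastq_files, out_prefix, threads=3):
             """Return the three commands of the alignment step and the BED file name."""
             sam, bam, bed = (out_prefix + ext for ext in (".sam", ".bam", ".bed"))
             # bowtie2 accepts a comma-separated list of unpaired read files with -U.
             align = ["bowtie2", "-p", str(threads), "-x", index_prefix,
                      "-U", ",".join(fastq_files), "-S", sam]
             to_bam = ["samtools", "view", "-b", "-o", bam, sam]
             # bedtools writes BED records to standard output.
             to_bed = ["bedtools", "bamtobed", "-i", bam]
             return align, to_bam, to_bed, bed


         def run_alignment(index_prefix, fastq_files, out_prefix, threads=3):
             """Execute the three commands in order (requires the tools to be installed)."""
             align, to_bam, to_bed, bed = alignment_commands(
                 index_prefix, fastq_files, out_prefix, threads)
             subprocess.check_call(align)
             subprocess.check_call(to_bam)
             with open(bed, "w") as handle:
                 subprocess.check_call(to_bed, stdout=handle)
             return bed


         align, to_bam, to_bed, bed = alignment_commands(
             "data/index/BY", ["a.fastq.gz", "b.fastq.gz"], "data/align/sample_1")
         print(" ".join(align))
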
       Convert Aligned Reads into TemplateFilter Format
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       TemplateFilter uses a particular input format for reads, so it is necessary to convert the `.bed` files. TemplateFilter expects reads in the following format: `chr`, `coord`, `strand` and `#reads` where:
       - `chr` is the number of the chromosome;
       - `coord` is the coordinate of the reads;
       - `strand` is `F` for forward and `R` for reverse;
       - `#reads` is the number of reads covering this position.
       Each entry is *tab*-separated.
       **WARNING** For reverse strands, bowtie returns the position of the first nucleotide on the left-hand side, whereas TemplateFilter expects the first one on the right-hand side. NucleoMiner2 takes this into account by adding the read length (in our case 50) to the coordinates of reverse reads.
       This step is performed by the following part of the *wf.py* script:
       .. literalinclude:: ../../../snep/src/current/wf.py
          :start-after: # _STARTOF_ step_3
          :end-before: # _ENDOF_ step_3
          :language: python
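
       The conversion rule described above (one row per position and strand, with the read length added to reverse reads) can be sketched in a few lines. This is an illustrative sketch and not the code of `wf.py`: it assumes standard 6-column BED input whose first column already holds the chromosome number, and it ignores any 0-based versus 1-based offset.

       .. code:: python

         from collections import Counter


         def bed_to_templatefilter(bed_lines, read_length=50):
             """Convert BED records into TemplateFilter rows (chr, coord, strand, #reads)."""
             counts = Counter()
             for line in bed_lines:
                 fields = line.rstrip("\n").split("\t")
                 chrom, start, strand = fields[0], int(fields[1]), fields[5]
                 if strand == "+":
                     counts[(chrom, start, "F")] += 1
                 else:
                     # Reverse read: move from the leftmost to the rightmost position.
                     counts[(chrom, start + read_length, "R")] += 1
             return ["\t".join([chrom, str(coord), strand, str(n)])
                     for (chrom, coord, strand), n in sorted(counts.items())]


         bed = [
             "1\t100\t150\tread_a\t42\t+",
             "1\t100\t150\tread_b\t42\t+",
             "1\t200\t250\tread_c\t42\t-",
         ]
         rows = bed_to_templatefilter(bed)
         for row in rows:
             print(row)

       Here the two forward reads starting at position 100 collapse into a single row with a count of 2, and the reverse read starting at 200 is reported at 250.
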
       The following table summarizes the number of reads, the file sizes involved and the process durations that we observed when running the last two steps. In our case, the alignment processes were multithreaded over 3 cores.
       ==  ==============  ==========================  ======  ================  ==================  ================
       id  Illumina reads  aligned and filtered reads  ratio   `.bed` file size  TF input file size  process duration
       ==  ==============  ==========================  ======  ================  ==================  ================
       1   16436138        10199695                    62.06%  1064 MB           60 MB               383 s
       2   16911132        12512727                    73.99%  1298 MB           64 MB               437 s
       3   15946902        12340426                    77.38%  1280 MB           65 MB               423 s
       4   13765584        10381903                    75.42%  931 MB            59 MB               352 s
       5   15168268        11502855                    75.83%  1031 MB           64 MB               386 s
       6   18850820        14024905                    74.40%  1254 MB           69 MB               482 s
       36  17715118        14092985                    79.55%  1404 MB           68 MB               483 s
       37  17288466        7402082                     42.82%  741 MB            48 MB               339 s
       38  16116394        13178457                    81.77%  1101 MB           63 MB               420 s
       39  14241106        10537228                    73.99%  880 MB            57 MB               348 s
       53  40876476        33780065                    82.64%  3316 MB           103 MB              1165 s
       ==  ==============  ==========================  ======  ================  ==================  ================
       Run TemplateFilter on MNase Samples
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       Finally, we run the TemplateFilter analysis on each MNase sample.
       **WARNING** TemplateFilter returns a list of nucleosomes. Each nucleosome is defined by its center and its width. An odd width therefore leads to non-integer lower and upper bounds.
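
       The remark about odd widths can be checked with a short computation. The sketch assumes that bounds are obtained as center plus or minus half the width; the exact convention used by NucleoMiner2 is not stated here, so treat this as an assumption.

       .. code:: python

         def nucleosome_bounds(center, width):
             """Return (lower, upper) bounds of a nucleosome from its center and width."""
             half = width / 2.0
             return center - half, center + half


         even = nucleosome_bounds(1000, 146)  # even width: whole-number bounds
         odd = nucleosome_bounds(1000, 147)   # odd width: bounds end in .5
         print(even, odd)
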
       **WARNING** TemplateFilter was not designed to handle replicates. We therefore recommend keeping as many nucleosomes as possible and filtering out the aberrant ones afterwards, taking advantage of the replicates. To do this, we set a low correlation threshold parameter (0.5) and a particularly high overlap value (300%).
       This step is performed by the following part of the `wf.py` script:
       .. literalinclude:: ../../../snep/src/current/wf.py
          :start-after: # _STARTOF_ step_4
          :end-before: # _ENDOF_ step_4
          :language: python
       The following table summarizes the number of nucleosomes found, the file sizes and the process durations that we observed for this step.
       ==  ======  ==========  =============  ================
       id  strain  found nucs  nuc file size  process duration
       ==  ======  ==========  =============  ================
       1   BY      96214       68 MB          1022 s
       2   BY      91694       65 MB          1038 s
       3   BY      91205       65 MB          1036 s
       4   RM      88076       62 MB          984 s
       5   RM      90141       64 MB          967 s
       6   RM      87517       62 MB          980 s
       ==  ======  ==========  =============  ================
       ..
       ..
       .. - libcoverage.py
       .. - wf.py
       ..
       ..
       ..
       ..
       ..
       ..
       .. In order to simplify the design of experiment, we consider Mnase as a marker.
       .. For each couple `(strain, marker)` we perform 3 replicates. So, theoretically
       .. we should have `3 * (1 + 5) * 3 = 54` samples. In practice we only obtain 2
       .. replicates for `(YJM, H3K4me1)`. Each one of the 53 samples is identified by a
       .. unique identifier. The file `CSV_SAMPLE_FILE` sums up this information.
       ..
       .. .. autodata:: configurator.CSV_SAMPLE_FILE
       ..     :noindex:
       ..
       .. We use a convention to link sample and Illumina fastq outputs. Illumina output
       .. files of the sample `ID` will be stored in the directory
       .. `ILLUMINA_OUTPUTFILE_PREFIX` + `ID`. For example, sample 41 outputs will be
       .. stored in the directory `data/2012-09-05/FASTQ/Sample_Yvert_Bq41/`.
       ..
       .. .. autodata:: configurator.ILLUMINA_OUTPUTFILE_PREFIX
       ..     :noindex:
       ..
       .. For BY (resp. RM and YJM) we use following reference genome
       .. `saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta`
       .. (resp. `saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta` and
       .. `saccharomyces_cerevisiae_YJM_789_screencontig.fasta`).
       .. The index `FASTA_REFERENCE_GENOME_FILES` stores this information.
       ..
       .. .. autodata:: configurator.FASTA_REFERENCE_GENOME_FILES
       ..     :noindex:
       ..
       .. Each chromosome/contig is identified in the fasta file by an obscure identifier.
       .. For example, BY chromosome I is identified by `gi|144228165|ref|NC_001133.7|`, whereas
       .. TemplateFilter expects an integer. So, we translate it. The index
       .. `FASTA_INDEXES` stores this translation.
       ..
       .. .. autodata:: configurator.FASTA_INDEXES
       ..     :noindex:
       ..
       .. From a pragmatic point of view, we discard some parts of the genome (repeated
       .. sequences etc.). The list of blacklisted areas is explicitly detailed in
       .. `AREA_BLACK_LIST`.
       ..
       .. .. autodata:: configurator.AREA_BLACK_LIST
       ..     :noindex:
       ..
       .. For BY-RM (resp. BY-YJM and RM-YJM) genome sequence alignment we use the previously
       .. computed .c2c file `data/2012-03_primarydata/BY_RM_gxcomp.c2c` (resp.
       .. `BY_YJM_GComp_All.c2c` and `RM_YJM_gxcomp.c2c`). For more information about
       .. .c2c files, please read section 5 of the manual of `NucleoMiner`, the old
       .. version of `NucleoMiner2`
       .. (http://www.ens-lyon.fr/LBMC/gisv/NucleoMiner_Manual/manual.pdf).
       ..
       .. .. autodata:: configurator.C2C_FILES
       ..     :noindex:
       ..
       .. `nucleominer` uses specific directories to work in; these are described in
       .. `INDEX_DIR`, `ALIGN_DIR` and `LOG_DIR`.
       ..
       .. Finally, `nucleominer` uses external resources; the paths to these resources
       .. are described in `BOWTIE_BUILD_BIN`, `BOWTIE2_BIN`, `SAMTOOLS_BIN`,
       .. `BEDTOOLS_BIN`, `TF_BIN` and `TF_TEMPLATES_FILE`.
       ..
       .. All paths, prefixes and indexes can be changed in the
       .. `src/current/nucleominer_config.json` file.
       ..
       .. .. autodata:: wf.json_conf_file
       ..     :noindex:
       ..
       Inferring Nucleosome Position and Extracting Read Counts
       --------------------------------------------------------
       The second part of the tutorial uses R (http://www.r-project.org). NucleoMiner2 contains a set of R scripts that are sourced in R from a console launched at the root of your project. These scripts are:
         - headers.R
         - extract_maps.R
         - translate_common_wp.R
         - split_samples.R
         - count_reads.R
         - get_size_factors.R
         - launch_deseq.R
       The Script headers.R
       ^^^^^^^^^^^^^^^^^^^^
       The script headers.R is included in all other R scripts. It is in charge of:
         - loading the libraries used in the scripts
         - loading the configuration (design, strain, marker...)
         - computing and caching Common Uninterrupted Regions (CURs). Caching means storing the information in the computer's memory.
       Note that you can customize the function “translate”. This function allows you to use the alignments between genomes when performing various tasks.
         - You may want to analyze data from a single strain (e.g. treatment/control, or only a few mutations). In this case, the genome is identical across all samples and you do not need to define particular CURs (the CURs are the chromosomes). Simply use the default translate function, which is neutral.
         - If you are analyzing data from two or more strains (as NucleoMiner2 was designed for), then you need to translate the coordinates of one genome into the coordinates of another one. You must do this by aligning the two genomes, which will produce a .c2c file (see Appendix "Generate .c2c Files"). Then use it to produce the list of regions and customise “translate”.
       In our tutorial, we are in the second case. To perform all these steps, run the following command in your R console:
       .. code:: r
         source("src/current/headers.R")
       The Script extract_maps.R
       ^^^^^^^^^^^^^^^^^^^^^^^^^
       This script is in charge of extracting maps of well-positioned and sensitive nucleosomes. First, it computes intra- and inter-strain matches of nucleosome maps for each CUR. This step can be executed in parallel on many cores using the BoT library. Next, it collects the results and produces maps of well-positioned nucleosomes, sensitive nucleosomes and Unaligned Nucleosomal Regions (UNRs).
       The map of well-positioned nucleosomes for BY is collected in the result directory and is called `BY_wp.tab`. It is composed of the following columns:
        - chr, the number of the chromosome
        - lower_bound, the lower bound of the nucleosome
        - upper_bound, the upper bound of the nucleosome
        - cur_index, index of the CUR
        - index_nuc, the index of the nucleosome in the CUR
        - wp, 1 if it is a well positioned nucleosome, 0 otherwise
        - nb_reads, the number of reads that support this nucleosome
        - nb_nucs, the number of TemplateFilter nucleosomes across replicates (= the number of replicates in which it is a well-positioned nucleosome)
        - llr_1, for a well-positioned nucleosome, the LLR1 (log-likelihood ratio) between the first and the second TemplateFilter nucleosomes on the chain.
        - llr_2, for a well-positioned nucleosome, the LLR1 between the second and the third TemplateFilter nucleosomes on the chain.
        - wp_llr, for a well-positioned nucleosome, the LLR2 that compares the consistency of the positioning over all TemplateFilter nucleosomes.
        - wp_pval, for a well-positioned nucleosome, the p-value of the chi-square test obtained from LLR2 (`1-pchisq(2*LLR2, df=4)`)
        - dyad_shift, for a well-positioned nucleosome, the shift between the two extreme TemplateFilter nucleosome dyad positions.
       The sensitive map for BY is collected in the result directory and is called `BY_fuzzy.tab`. It is composed of the following columns:
        - chr, the number of the chromosome
        - lower_bound, the lower bound of the nucleosome
        - upper_bound, the upper bound of the nucleosome
        - cur_index, index of the CUR
       The map of common well-positioned nucleosomes aligned between the BY and RM strains is collected in the result directory and is called `BY_RM_common_wp.tab`. It is composed of the following columns:
        - cur_index, the index of the CUR
        - index_nuc_BY, the index of the BY nucleosome in the CUR
        - index_nuc_RM, the index of the RM nucleosome in the CUR
        - llr_score, the LLR3 score that estimates conservation between the positions in BY and RM
        - common_wp_pval, the p-value of the chi-square test obtained from LLR3 (`1-pchisq(2*LLR3, df=2)`)
        - diff, the dyad shift between the positions in the two strains (in bp)
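
       The two p-value formulas quoted in the column descriptions above, `1-pchisq(2*LLR2, df=4)` and `1-pchisq(2*LLR3, df=2)`, can be reproduced without R. The sketch below relies on the closed form of the chi-square survival function for an even number of degrees of freedom; the function names are hypothetical and the code only illustrates the formulas, not the NucleoMiner2 R implementation.

       .. code:: python

         import math


         def chi2_sf_even_df(x, df):
             """Return 1 - pchisq(x, df) for a chi-square law with even df.

             For even df, the survival function equals
             exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
             """
             if df <= 0 or df % 2:
                 raise ValueError("this closed form only holds for even, positive df")
             half = x / 2.0
             term, total = 1.0, 1.0
             for i in range(1, df // 2):
                 term *= half / i
                 total += term
             return math.exp(-half) * total


         def wp_pval(llr2):
             """p-value of a well-positioned nucleosome: 1 - pchisq(2*LLR2, df=4)."""
             return chi2_sf_even_df(2.0 * llr2, df=4)


         def common_wp_pval(llr3):
             """p-value of a common well-positioned nucleosome: 1 - pchisq(2*LLR3, df=2)."""
             return chi2_sf_even_df(2.0 * llr3, df=2)


         print(wp_pval(1.0), common_wp_pval(1.0))

       With these closed forms, `wp_pval(LLR2)` reduces to `exp(-LLR2) * (1 + LLR2)` and `common_wp_pval(LLR3)` to `exp(-LLR3)`.
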
       The common UNR map for BY and RM strains is collected in the result directory and is called `BY_RM_common_unr.tab`. It is composed of the following columns:
        - cur_index, the index of the CUR
        - index_nuc_BY, the index of the BY nucleosome in the CUR
        - index_nuc_RM, the index of the RM nucleosome in the CUR
       To execute this script, run the following command in your R console:
       .. code:: r
         source("src/current/extract_maps.R")
       The Script translate_common_wp.R
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       This script translates the positions of common well-positioned nucleosomes from one strain to another and stores them in a table.
       For example, the file `results/2014-04/RM_wp_tr_2_BY.tab` contains RM well-positioned nucleosomes translated into the BY genome coordinates. It is composed of the following columns:
        - strain_ref, the reference genome (in which positions are defined)
        - begin, the translated lower bound of the nucleosome
        - end, the translated upper bound of the nucleosome
        - chr, the chromosome number in the reference genome (in which positions are defined)
        - length, the length of the nucleosome (could be negative)
        - cur_index, the index of the CUR
        - index_nuc, the index of the nucleosome in the CUR
       To execute this script, run the following command in your R console:
       .. code:: r
         source("src/current/translate_common_wp.R")
       The Script split_samples.R
       ^^^^^^^^^^^^^^^^^^^^^^^^^^
       To optimize memory usage, we split and compress the TemplateFilter input files according to their corresponding chromosome. For example, `sample_1_TF.tab` will be split into:
         - sample_1_chr_1_splited_sample.tab.gz
         - sample_1_chr_2_splited_sample.tab.gz
         - ...
         - sample_1_chr_17_splited_sample.tab.gz
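
       The splitting scheme above can be illustrated with a short Python sketch. The actual work is done by the R script `split_samples.R`; the function below is a hypothetical equivalent that only assumes a tab-separated input whose first column is the chromosome number, and it reuses the file naming shown in the list.

       .. code:: python

         import gzip
         import os
         import tempfile


         def split_by_chromosome(tf_file, sample_id, out_dir):
             """Split a TemplateFilter input file into one gzipped file per chromosome."""
             handles = {}
             try:
                 with open(tf_file) as source:
                     for line in source:
                         chrom = line.split("\t", 1)[0]
                         if chrom not in handles:
                             name = "sample_%s_chr_%s_splited_sample.tab.gz" % (sample_id, chrom)
                             handles[chrom] = gzip.open(os.path.join(out_dir, name), "wt")
                         handles[chrom].write(line)
             finally:
                 for handle in handles.values():
                     handle.close()
             return sorted(handles)


         # Demonstration on a tiny made-up file in a temporary directory.
         workdir = tempfile.mkdtemp()
         tf_path = os.path.join(workdir, "sample_1_TF.tab")
         with open(tf_path, "w") as handle:
             handle.write("1\t100\tF\t2\n1\t250\tR\t1\n2\t75\tF\t3\n")
         chromosomes = split_by_chromosome(tf_path, 1, workdir)
         print(chromosomes)
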
       To execute this script, run the following command in your R console:
       .. code:: r
         source("src/current/split_samples.R")
       The Script count_reads.R
       ^^^^^^^^^^^^^^^^^^^^^^^^
       To associate a number of observations (reads) with each nucleosome, we run the script `count_reads.R`. It produces the files `BY_RM_H3K14ac_wp_and_nbreads.tab`, `BY_RM_H3K14ac_unr_and_nbreads.tab`, `BY_RM_Mnase_Seq_wp_and_nbreads.tab` and `BY_RM_Mnase_Seq_unr_and_nbreads.tab` for H3K14ac common well-positioned nucleosomes, H3K14ac UNRs, MNase common well-positioned nucleosomes and MNase UNRs respectively.
       For example, the file `BY_RM_H3K14ac_wp_and_nbreads.tab` contains the read counts for well-positioned nucleosomes under the ChIP H3K14ac experimental condition. It is composed of the following columns:
         - chr_BY, the number of the chromosome for BY
         - lower_bound_BY, the lower bound of the nucleosome for BY
         - upper_bound_BY, the upper bound of the nucleosome  for BY
         - index_nuc_BY, the index of the BY nucleosome in the CUR for BY
         - chr_RM, the number of the chromosome for RM
         - lower_bound_RM, the lower bound of the nucleosome for RM
         - upper_bound_RM, the upper bound of the nucleosome  for RM
         - index_nuc_RM, the index of the RM nucleosome in the CUR for RM
         - cur_index, index of the CUR
         - BY_H3K14ac_36, the number of reads for the current nucleosome for the sample 36
         - BY_H3K14ac_37, #reads for sample 37
         - BY_H3K14ac_53, #reads for sample 53
         - RM_H3K14ac_38, #reads for sample 38
         - RM_H3K14ac_39, #reads for sample 39
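
       To make the idea of per-nucleosome read counts concrete, here is a minimal sketch that sums the reads whose coordinate falls between the lower and upper bounds of a nucleosome. It is an assumption about the counting rule, written for illustration only; `count_reads.R` may apply additional processing. All names are hypothetical.

       .. code:: python

         from bisect import bisect_left, bisect_right
         from itertools import accumulate


         def build_index(reads):
             """Index (coord, n_reads) pairs of one chromosome for fast range sums."""
             reads = sorted(reads)
             coords = [coord for coord, _ in reads]
             cumulative = [0] + list(accumulate(n for _, n in reads))
             return coords, cumulative


         def count_reads(index, lower_bound, upper_bound):
             """Sum the reads whose coordinate lies in [lower_bound, upper_bound]."""
             coords, cumulative = index
             first = bisect_left(coords, lower_bound)
             last = bisect_right(coords, upper_bound)
             return cumulative[last] - cumulative[first]


         # Made-up reads: (coordinate, number of reads at this coordinate).
         reads = [(100, 2), (180, 1), (250, 4), (400, 3)]
         index = build_index(reads)
         print(count_reads(index, 90.5, 250.5))

       Sorting once and using cumulative sums keeps each query logarithmic in the number of positions, which matters when tens of thousands of nucleosomes are counted per sample.
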
       To execute this script, run the following command in your R console:
       .. code:: r
         source("src/current/count_reads.R")
       The Script get_size_factors.R
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       This script uses the DESeq function `estimateSizeFactors` to compute the size factor of each sample. Size factors normalise read counts from sample to sample, as determined by DESeq. When a sample has n reads for a nucleosome or a UNR, the normalised count is n/f, where f is the size factor of that sample.
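
       For readers who want to see what a size factor is, the sketch below re-implements the median-of-ratios idea behind `estimateSizeFactors` in plain Python. It is a simplified illustration under the assumption that rows containing a zero count are ignored; it is not the DESeq code itself, and the function names are hypothetical.

       .. code:: python

         import math
         from statistics import median


         def estimate_size_factors(counts):
             """Median-of-ratios size factors.

             counts is a list of rows (one per nucleosome or UNR), each row holding
             one read count per sample. Rows containing a zero are ignored.
             """
             n_samples = len(counts[0])
             kept = [row for row in counts if all(value > 0 for value in row)]
             # Log of the geometric mean of each kept row.
             log_geo_means = [sum(math.log(v) for v in row) / n_samples for row in kept]
             factors = []
             for j in range(n_samples):
                 log_ratios = [math.log(row[j]) - lgm
                               for row, lgm in zip(kept, log_geo_means)]
                 factors.append(math.exp(median(log_ratios)))
             return factors


         def normalise(n, f):
             """Normalised count n/f, as described above."""
             return n / f


         # Made-up counts: the second sample is sequenced exactly twice as deeply.
         counts = [[10, 20], [100, 200], [30, 60], [0, 5]]
         factors = estimate_size_factors(counts)
         print(factors)
         print(normalise(10, factors[0]), normalise(20, factors[1]))

       In this toy example the two normalised counts of the first row coincide, which is the expected behaviour when samples differ only by sequencing depth.
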
       The script dumps the computed size factors into the file `size_factors.tab`. This file has the form:
       ========= ======= ======= =======
       sample_id wp      unr     wpunr
       ========= ======= ======= =======
       1         0.87396 0.88097 0.87584
       2         1.07890 1.07440 1.07760
       3         1.06400 1.05890 1.06250
       4         0.85782 0.87948 0.86305
       5         0.97577 0.96590 0.97307
       6         1.19630 1.18120 1.19190
       36        0.93318 0.92762 0.93166
       37        0.48315 0.48453 0.48350
       38        1.11240 1.11210 1.11230
       39        0.89897 0.89917 0.89903
       53        2.22650 2.22700 2.22660
       ========= ======= ======= =======
       The sample ids are listed in the file `samples.csv`.
       If you don't know which column to use for normalization, we recommend using wpunr.
       Here are the details of the factors produced:
         - unr: factor computed from the data of UNR regions. These regions are defined for every pair of aligned genomes (e.g. BY_RM)
         - wp: same, but for well-positioned nucleosomes.
         - wpunr: both types of regions.
       To execute this script, run the following command in your R console:
       .. code:: r
         source("src/current/get_size_factors.R")
       The Script launch_deseq.R
       ^^^^^^^^^^^^^^^^^^^^^^^^^
       Finally, the script `launch_deseq.R` performs a statistical analysis of each nucleosome using `DESeq`. It produces the following files:
         - results/current/BY_RM_H3K14ac_wp_snep.tab
         - results/current/BY_RM_H3K14ac_unr_snep.tab
         - results/current/BY_RM_H3K14ac_wpunr_snep.tab
         - results/current/BY_RM_H3K14ac_wp_mnase.tab
         - results/current/BY_RM_H3K14ac_unr_mnase.tab
         - results/current/BY_RM_H3K14ac_wpunr_mnase.tab
       These files are organised with the following columns (see file `BY_RM_H3K14ac_wp_snep.tab` for an example):
         - chr_BY, the number of the chromosome for BY
         - lower_bound_BY, the lower bound of the nucleosome for BY
         - upper_bound_BY, the upper bound of the nucleosome  for BY
         - index_nuc_BY, the index of the BY nucleosome in the CUR for BY
         - chr_RM, the number of the chromosome for RM
         - lower_bound_RM, the lower bound of the nucleosome for RM
         - upper_bound_RM, the upper bound of the nucleosome  for RM
         - index_nuc_RM, the index of the RM nucleosome in the CUR for RM
         - cur_index, index of the CUR
         - form
       The next columns contain the read counts for each sample:
         - BY_Mnase_Seq_1, the number of reads for the current nucleosome for the sample 1
         - BY_Mnase_Seq_2, #reads for sample 2
         - BY_Mnase_Seq_3, #reads for sample 3
         - RM_Mnase_Seq_4, #reads for sample 4
         - RM_Mnase_Seq_5, #reads for sample 5
         - RM_Mnase_Seq_6, #reads for sample 6
         - BY_H3K14ac_36, #reads for sample 36
         - BY_H3K14ac_37, #reads for sample 37
         - BY_H3K14ac_53, #reads for sample 53
         - RM_H3K14ac_38, #reads for sample 38
         - RM_H3K14ac_39, #reads for sample 39
       The last 5 columns concern the DESeq analysis:
         - manip[a_manip], strain[a_strain] and manip[a_manip]:strain[a_strain], the manip (marker) effect, the strain effect and the SNEP effect. These are the coefficients of the fitted generalized linear model (GLM).
         - pvalsGLM, the p-value resulting from the comparison of the GLM that includes the interaction term *marker:strain* with the GLM that does not. This is the statistical significance of the interaction term and therefore the statistical significance of the SNEP.
         - snep_index, a boolean set to TRUE if the pvalsGLM value is under the threshold computed with the FDR function with a rate set to 0.0001.
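
       The text above does not say which FDR procedure defines the threshold. As an illustration only, the sketch below assumes the Benjamini-Hochberg procedure, a common choice, and shows how a boolean column similar to `snep_index` could be derived from a list of p-values. The function name and the example p-values are hypothetical.

       .. code:: python

         def benjamini_hochberg(pvalues, rate):
             """Return booleans: True where the p-value passes the BH threshold."""
             m = len(pvalues)
             order = sorted(range(m), key=lambda i: pvalues[i])
             # Largest rank k such that p_(k) <= k * rate / m.
             threshold_rank = 0
             for rank, i in enumerate(order, start=1):
                 if pvalues[i] <= rank * rate / m:
                     threshold_rank = rank
             selected = [False] * m
             for rank, i in enumerate(order, start=1):
                 if rank <= threshold_rank:
                     selected[i] = True
             return selected


         # Made-up p-values, tested at the rate quoted above.
         pvals = [0.00001, 0.5, 0.00004, 0.2]
         snep_flags = benjamini_hochberg(pvals, rate=0.0001)
         print(snep_flags)
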
       To execute this script, run the following command in your R console:
       .. code:: r
         source("src/current/launch_deseq.R")
       Results: Number of SNEPs
       ------------------------
       The number of SNEPs computed for each form is given below.
       ===== ======= ===== =======
        form strains #nucs H3K14ac
       ===== ======= ===== =======
          wp   BY-RM 30464    3549
         unr   BY-RM  9497    1559
       wpunr   BY-RM 39961    5240
       ===== ======= ===== =======
       Appendix: Generate .c2c Files
       -----------------------------
       A `.c2c` file is a simple table that describes how two genome sequences are aligned. It can be generated using scripts that were developed in NucleoMiner 1.0 (Nagarajan et al. PLoS Genetics 2010) and that we provide in this release of NucleoMiner2.
       To use these scripts on your UNIX/Linux computer, you first need to install MUMmer, which is designed to rapidly align entire genomes, whether in complete or draft form.
       Installing MUMmer
       ^^^^^^^^^^^^^^^^^
       Get the latest version of the MUMmer archive (MUMmer3.23.tar.gz is provided in the directory deps of your working directory). Install it locally into the src folder of your working directory by typing the following commands from the working directory:
       .. code:: bash
         # Unpack and build MUMmer inside src/
         cd src
         tar xfvz ../deps/MUMmer3.23.tar.gz
         cd MUMmer3.23
         make check
         make install
         # Return to the working directory for the next steps
         cd ../..
       Installing NucleoMiner 1.0 scripts
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
       Get the nucleominer-1.0.tar.gz archive (it is provided in the directory deps of your working directory). Install it locally into the src folder of your working directory by typing the following commands from the working directory:
       .. code:: bash
         cd src
         tar xfvz ../deps/nucleominer-1.0.tar.gz
         cd ..
       This creates a directory that contains the NucleoMiner 1.0 scripts (src/nucleominer-1.0/scripts).
       Generate .c2c Files
       ^^^^^^^^^^^^^^^^^^^
       To generate .c2c files, type the following commands in a terminal from the working directory:
       .. code:: bash
         export PATH=$PATH:src/MUMmer3.23:src/nucleominer-1.0/scripts
         export PERL5LIB=$PERL5LIB:src/nucleominer-1.0/scripts/
         NMgxcomp data/saccharomyces_cerevisiae_BY_S288c_chromosomes.fasta \
           data/saccharomyces_cerevisiae_rm11-1a_1_supercontigs.fasta \
           data/byxrm 2>NMgxcomp.log
       After execution, the directory `data` will hold the .c2c files.
