




Functional and structural genomics using PEDANT

  • Institute of Biotechnology, Yang-Ming University (陽明生技所)

  • Bioinformatics Program (生物資訊學程)

  • 林千涵


Introduction

  • With the growing volume of biological sequence data, a system is needed that can store and retrieve tens of gigabytes of data, backed by a mature database management system and good visualization tools



Introduction-PEDANT

  • Differences among existing genome analysis programs

    • protein-oriented vs. DNA-oriented analysis
    • interactive work vs. command-line operation
    • bioinformatics methods applied
    • user interface
    • convenience features: project management and data editors
    • fidelity of the results produced
  • Benchmarks may vary depending on the chosen balance between sensitivity and selectivity of the analyses

  • PEDANT (Protein Extraction, Description, and ANalysis Tool) became available in mid-1997 (using FASTA for similarity searches)

    • a workhorse for general bioinformatics research
    • a common framework for a number of genome analysis projects
    • a complete database of automatically annotated genomes
    • a tool for routine analysis of large numbers of genomic contigs and ESTs
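
The trade-off between sensitivity and selectivity mentioned above can be illustrated with a minimal sketch. It assumes that sensitivity means the fraction of true homologs recovered and selectivity means the fraction of accepted hits that are true; the hit list, the E-value cutoffs, and the function name are hypothetical and do not come from PEDANT.

```python
from typing import List, Tuple

def sensitivity_selectivity(
    hits: List[Tuple[float, bool]], evalue_cutoff: float
) -> Tuple[float, float]:
    """Return (sensitivity, selectivity) for hits accepted at a cutoff.

    hits is a list of (e-value, is_true_homolog) pairs. The definitions
    used here are an assumption made for illustration.
    """
    total_true = sum(1 for _, is_true in hits if is_true)
    accepted = [is_true for evalue, is_true in hits if evalue <= evalue_cutoff]
    true_accepted = sum(accepted)
    sensitivity = true_accepted / total_true if total_true else 0.0
    selectivity = true_accepted / len(accepted) if accepted else 1.0
    return sensitivity, selectivity

# Hypothetical similarity-search hits: (e-value, is it a real homolog?)
hits = [(1e-50, True), (1e-20, True), (1e-5, True),
        (1e-3, False), (0.5, True), (2.0, False)]

for cutoff in (1e-10, 1e-2, 1.0):
    sens, sel = sensitivity_selectivity(hits, cutoff)
    print(f"cutoff={cutoff:g} sensitivity={sens:.2f} selectivity={sel:.2f}")
```

A strict cutoff accepts only reliable hits but misses distant homologs; a permissive cutoff recovers more homologs at the cost of false positives, which is why benchmark results depend on where that balance is set.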


System Architecture

  • Overview

    • database module: storing, modifying and accessing data
    • processing module: bioinformatics computations
    • user interface: web based communication


System Architecture-Cont.

  • Data access

    • primary tables: store raw data (e.g. DNA and protein sequences) and raw program results (e.g. BLAST output)
    • secondary tables: store parsed program results
    • simplified schema
  • Operation in command line mode

    • applying bioinformatics methods to sequences
    • parsing data tables
    • querying the resulting databases
  • Web interface

    • No static HTML pages required
    • DNA and protein viewers access the SQL tables directly
  • Implementation and system requirements

    • Perl 5; C++ for the graphical viewers
  • Performance
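
The split between primary tables (raw output) and secondary tables (parsed results) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for the SQL database, the table and column names are hypothetical, and the tab-separated "report" is a toy format, not real BLAST output.

```python
import sqlite3

# Toy raw report (hypothetical tab-separated format, not real BLAST output).
RAW_REPORT = """\
# query\tsubject\tevalue
orf001\tsp|P12345\t1e-40
orf001\tsp|Q99999\t3e-05
"""

def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one primary and one secondary table."""
    conn = sqlite3.connect(":memory:")
    # Primary table: raw program output stored verbatim, one row per ORF.
    conn.execute("CREATE TABLE blast_raw (orf_id TEXT PRIMARY KEY, report TEXT)")
    # Secondary table: parsed results that can be queried with plain SQL.
    conn.execute("CREATE TABLE blast_hits (orf_id TEXT, subject TEXT, evalue REAL)")
    conn.execute("INSERT INTO blast_raw VALUES (?, ?)", ("orf001", RAW_REPORT))
    return conn

def parse_raw_into_secondary(conn: sqlite3.Connection) -> None:
    """Parse every raw report and fill the secondary table."""
    rows = conn.execute("SELECT orf_id, report FROM blast_raw").fetchall()
    for orf_id, report in rows:
        for line in report.splitlines():
            if not line or line.startswith("#"):
                continue  # skip blank lines and the header comment
            _query, subject, evalue = line.split("\t")
            conn.execute(
                "INSERT INTO blast_hits VALUES (?, ?, ?)",
                (orf_id, subject, float(evalue)),
            )

conn = build_demo_db()
parse_raw_into_secondary(conn)
significant = conn.execute(
    "SELECT subject FROM blast_hits WHERE evalue < 1e-10"
).fetchall()
print(significant)
```

Keeping the raw output means the secondary tables can be rebuilt with a different parser or threshold without rerunning the underlying computation.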



Schema



Bioinformatics Method

  • Overview of the PEDANT processing pipeline

    • identification of coding regions and various other genetic elements
    • homology searches
    • detection of protein motifs, prediction of secondary structure and other protein features, and sensitive fold recognition
    • automatic assignment of gene products to pre-defined functional categories
  • Prediction of genes and other genetic elements

    • Table 1
    • choose one of 15 genetic codes
    • http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
  • Functional and structural categories

    • similarity search: PSI-BLAST (Position-Specific Iterated BLAST)
    • special datasets: MIPS, COG, PROSITE, PFAM and BLOCKS
    • significant matches to PIR entries: annotations, keywords, enzyme classification and superfamily information
    • for sequences with a significant relationship to PDB entries, secondary structure information: STRIDE (upper case), PREDATOR (lower case)
    • low-complexity regions, membrane regions, coiled coils and signal peptides
    • comparison against SCOP using IMPALA
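
Choosing a genetic code for gene prediction matters because the same codon can translate differently. The sketch below is an illustration only, not PEDANT's implementation: it includes just two of the NCBI translation tables (1, the standard code, and 4, the mold/protozoan mitochondrial and Mycoplasma/Spiroplasma code), and the function names and example sequence are hypothetical.

```python
BASES = "TCAG"

# Amino acids in NCBI order (first base T,C,A,G; then second; then third).
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}

def codon_table(code_id: int) -> dict:
    """Build a codon -> amino acid mapping for the chosen genetic code."""
    amino_acids = GENETIC_CODES[code_id]
    codons = [a + b + c for a in BASES for b in BASES for c in BASES]
    return dict(zip(codons, amino_acids))

def translate(dna: str, code_id: int = 1) -> str:
    """Translate from the first base up to (not including) the first stop."""
    table = codon_table(code_id)
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = table.get(dna[i:i + 3], "X")  # X for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

orf = "ATGTGGTGATTTTAA"
print(translate(orf, code_id=1))  # TGA is a stop codon in the standard code
print(translate(orf, code_id=4))  # TGA encodes tryptophan in table 4
```

With the standard code the reading frame ends at TGA, while table 4 reads through it, so the predicted protein length depends directly on the code selected for the organism.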


Table 1



Bioinformatics Method-Cont.

  • Yeast biological role categories

    • first system of biological role categories: E. coli
    • MIPS: advanced hierarchical functional catalogue (yeast)
    • Multidimensionality: the protein:gene relationship is many-to-many (M:M)
    • automated assignment to MIPS categories is a first approximation, to be refined by manual annotation
    • Distribution of ORFs
  • Visualization

    • an integrated, hypertext-linked protein report with calculated parameters and sequences, serving as a reference for further manual annotation
    • Protein report page


Distribution of ORFs



Protein report page



Bioinformatics Method-Cont.2

  • Automatic versus manual annotation

    • Problem of error propagation
      • erroneous annotations caused by human error and spurious similarity hits
      • mitigation by filtering algorithms and domain structure analysis?
      • quality improvement through manual review by human experts!
    • Manual annotation
      • Catalogue independent
      • Flexibility: a gene product is first placed in a higher-level category and moved to finer categories at a later step
    • 528 categories: 20 main categories and 6 levels
    • confidence levels: “reject”, “low”, “medium”, “high”; the default is “auto”
  • Data release management

    • data from a new release can be intelligently merged with the existing data pool
    • manual annotations are transferred between subsequent data releases
    • “manual” field: “yes” or “no”; the initial default is “no”
    • example: a PFAM domain identified in an ORF of a new release gets “manual: no” and “conf: auto”
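
The transfer of manual annotations between releases can be sketched as below. This is a simplified illustration, not PEDANT's actual merge logic: it assumes ORFs can be matched by an identical identifier across releases, and the class, function, ORF ids and category labels are hypothetical. Only the “manual” and “conf” field names and their defaults come from the slides.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass(frozen=True)
class Annotation:
    category: str
    manual: str = "no"   # "yes" or "no"; automatic results start as "no"
    conf: str = "auto"   # "reject", "low", "medium", "high" or "auto"

def merge_release(
    old: Dict[str, Annotation], new: Dict[str, Annotation]
) -> Dict[str, Annotation]:
    """Merge a new automatic release with the existing annotations.

    Assumption: ORFs are matched by identical id. A curated entry
    (manual == "yes") is kept; everything else takes the new result.
    """
    merged = {}
    for orf_id, new_annotation in new.items():
        old_annotation = old.get(orf_id)
        if old_annotation is not None and old_annotation.manual == "yes":
            merged[orf_id] = old_annotation
        else:
            merged[orf_id] = new_annotation
    return merged

old_release = {
    "orf1": Annotation("category A", manual="yes", conf="high"),
    "orf2": Annotation("category B"),
}
new_release = {
    "orf1": Annotation("category C"),
    "orf2": Annotation("category D"),
    "orf3": Annotation("category E"),
}
merged = merge_release(old_release, new_release)
for orf_id, annotation in sorted(merged.items()):
    print(orf_id, annotation)
```

Here the curated entry for orf1 survives the update, while orf2 and the newly predicted orf3 carry the fresh automatic result with “manual: no” and “conf: auto”.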


Manual annotation transfer



The PEDANT Genome Database

  • Annotation of publicly available completely sequenced and unfinished genomes

    • Genomes annotated by MIPS
    • Completely sequenced and published genomic sequences
    • Unfinished and/or unpublished genomic sequences
    • gene prediction by ORPHEUS, allowing large overlaps between ORFs
  • PEDANT as a structural genomics resource (0.3 million proteins)

    • class-based approach, cost-saving
    • (i) start from a non-redundant protein sequence database
    • (ii) run PSI-BLAST searches with SCOP sequences against (i) and save the resulting profiles
    • (iii) construct a SCOP profile library using IMPALA
    • (iv) run an IMPALA search with each genomic sequence against the SCOP library
    • the same procedure is applied to the non-redundant PDB sequence database
    • performance of IMPALA
  • Cross-genome comparison

    • treat each genome as an individual contig: cross-genome datasets can be created without any modification
    • 44 genomes


Performance of IMPALA



Applications

  • Arabidopsis thaliana chromosome IV

    • 3744 predicted protein coding genes
    • roughly 30% are known proteins or strongly similar to known proteins
    • multicellular organisms have a higher proportion of all-alpha and a lower proportion of mixed alpha/beta structural domains than unicellular species
  • Assembled human transcripts

    • human UniGene was subjected to PEDANT analysis, covering over 75000 contigs
    • the resulting MySQL database is close to 8 GB
    • acceptable query times show the suitability of PEDANT for large-scale EST sequencing projects
  • Analysis of the GroEL substrates

    • GroEL: a common E. coli chaperonin
    • a structural motif is common to 52 substrates that rely on GroEL for folding in vivo: two or more alpha/beta domains involving buried beta-sheets with large hydrophobic surfaces, which makes them prone to aggregation


Classification of predicted genes



Summary and Outlook

  • PEDANT is a useful tool for genome annotation and bioinformatics research

  • It supports automated and manual assignment of gene products to functional and structural categories

  • It provides extensive hyperlinked protein reports and advanced viewers

  • Outlook

    • better decision rules need to be employed
    • manual annotation of predicted genetic elements (e.g. LTRs)
    • supporting Oracle RDBMS
    • automatic gene prediction pipeline for higher eukaryotes
    • interactive capabilities



