Date: 23.09.2017

Functional and structural genomics using PEDANT

Institute of Biotechnology, Yang-Ming University (陽明生技所)
Bioinformatics Program (生物資訊學程)
林千涵

Introduction

As the volume of biological sequence data grows, we need a system capable of storing and retrieving tens of gigabytes of data, a mature database management system, and good visualization tools

Introduction-PEDANT

Differences among existing genome analysis programs

protein oriented vs. DNA oriented analysis
interactive work vs. commandline operation
bioinformatics method applied
user interface
convenience features: project management and data editors
fidelity of the results produced

Benchmarks may vary depending on the chosen balance between sensitivity and selectivity of the analyses
PEDANT (Protein Extraction, Description, and ANalysis Tool) became available in mid-1997 (using FASTA for similarity searches)

a workhorse for general bioinformatics research
a common framework for a number of genome analysis projects
a complete database of automatically annotated genomes
a tool for routine analysis of large amounts of genomic contigs and ESTs

System Architecture

Overview

database module: storing, modifying and accessing data
processing module: bioinformatics computations
user interface: web based communication

System Architecture-Cont.

Data access

primary tables: store raw data (e.g. DNA and protein sequences) and raw program results (e.g. BLAST output)
secondary tables: store parsed program results
simplified schema
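
The split between primary and secondary tables can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 as a stand-in for the database server; the table names, column names, ORF identifier, hit names and the two-column raw format are all hypothetical and far simpler than real BLAST output or the real PEDANT schema.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the real database server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- primary table: raw program output, stored verbatim
    CREATE TABLE blast_raw (
        orf_id TEXT PRIMARY KEY,
        output TEXT NOT NULL
    );
    -- secondary table: one parsed hit per row
    CREATE TABLE blast_hits (
        orf_id  TEXT NOT NULL,
        subject TEXT NOT NULL,
        evalue  REAL NOT NULL
    );
""")

# Hypothetical raw output: one "subject<TAB>e-value" pair per line.
raw = "hit_B\t3e-80\nhit_A\t1e-120\n"
conn.execute("INSERT INTO blast_raw VALUES (?, ?)", ("orf0001", raw))


def parse_raw(conn, orf_id):
    """Parse the raw output of one ORF into the secondary table."""
    (output,) = conn.execute(
        "SELECT output FROM blast_raw WHERE orf_id = ?", (orf_id,)
    ).fetchone()
    rows = []
    for line in output.splitlines():
        subject, evalue = line.split("\t")
        rows.append((orf_id, subject, float(evalue)))
    conn.executemany("INSERT INTO blast_hits VALUES (?, ?, ?)", rows)
    return len(rows)


n_hits = parse_raw(conn, "orf0001")
# Queries run against the parsed table, never against the raw text.
best = conn.execute(
    "SELECT subject FROM blast_hits WHERE orf_id = ? ORDER BY evalue LIMIT 1",
    ("orf0001",),
).fetchone()[0]
print(n_hits, best)
```

Keeping the raw output means the secondary table can be rebuilt with an improved parser without rerunning the search.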

Operation in command line mode

applying bioinformatics methods to sequences
parsing data tables
querying the resulting databases

Web interface

No static HTML pages required
DNA and protein viewers access the SQL tables directly
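
A rough sketch of what "no static HTML pages" means in practice: each page is rendered from the SQL tables at request time. The original system is written in Perl; this Python version, the `proteins` table, its columns and the sample record are illustrative assumptions only.

```python
import sqlite3
from html import escape

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proteins (orf_id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
)
# Hypothetical record used only for this illustration.
conn.execute(
    "INSERT INTO proteins VALUES (?, ?, ?)",
    ("orf0001", "hypothetical protein <example>", "MKTAYIAKQR"),
)


def protein_report(conn, orf_id):
    """Render a small HTML report for one ORF directly from the database."""
    row = conn.execute(
        "SELECT orf_id, description, sequence FROM proteins WHERE orf_id = ?",
        (orf_id,),
    ).fetchone()
    if row is None:
        return "<p>No such ORF</p>"
    orf, description, sequence = row
    # Escape every database value before placing it into markup.
    return (
        f"<h1>{escape(orf)}</h1>\n"
        f"<p>{escape(description)}</p>\n"
        f"<pre>{escape(sequence)}</pre>\n"
        f"<p>Length: {len(sequence)} aa</p>"
    )


page = protein_report(conn, "orf0001")
print(page)
```

Because nothing is cached as a file, a page always reflects the current state of the tables, including any manual edits.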

Implementation and system requirements

Perl 5; C++ for the graphical viewers

Performance

parallel capabilities

Schema

Bioinformatics Method

Overview of the PEDANT processing pipeline

identification of coding regions and analysis of various genetic elements
homology search
detection of protein motifs, prediction of secondary structure and other protein features, and sensitive fold recognition
automatic assignment of proteins to pre-defined functional categories

Prediction of genes and other genetic elements

Table 1
choose one of 15 genetic codes
http://www.ncbi.nlm.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c
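
Why the choice of genetic code matters for gene prediction can be shown with a short sketch. It includes only two of the NCBI translation tables (1, the standard code, and 4, in which TGA encodes tryptophan instead of a stop); the function names and the test sequence are made up for this example, and real ORF prediction involves much more than translation.

```python
from itertools import product

BASES = "TCAG"
# Amino acids in NCBI order: codons enumerated as TTT, TTC, TTA, TTG, TCT, ...
GENETIC_CODES = {
    1: "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
    4: "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
}


def codon_table(code_id):
    """Map each of the 64 codons to an amino acid for the chosen code."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, GENETIC_CODES[code_id]))


def translate(dna, code_id=1):
    """Translate a DNA string in frame 0; unknown codons become 'X'."""
    table = codon_table(code_id)
    dna = dna.upper()
    return "".join(
        table.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


seq = "ATGTGATAA"  # hypothetical 3-codon sequence
print(translate(seq, 1))  # M** : TGA read as a stop codon
print(translate(seq, 4))  # MW* : TGA read as tryptophan
```

With the wrong table, an ORF from an organism using code 4 would be truncated at the first TGA.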

Functional and structural categories

similarity search: PSI-BLAST (Position-Specific Iterated BLAST)
special datasets: MIPS, COG, PROSITE, PFAM and BLOCKS
significant matches in PIR: annotations, keywords, enzyme classification and superfamily information
for proteins significantly related to PDB entries, secondary structure information: STRIDE assignments (shown in upper case) and PREDATOR predictions (shown in lower case)
low-complexity regions, membrane regions, coiled coils and signal peptides
comparison against SCOP using IMPALA

Table 1

Bioinformatics Method-Cont.

Yeast biological role categories

first system of biological role categories: E. coli
MIPS: advanced hierarchical functional catalogue (yeast)
Multidimensionality: the protein:gene relationship is many-to-many (M:M)
automated assignment to MIPS categories is a first approximation, to be refined by manual annotation
Distribution of ORFs
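
A hierarchical catalogue and the resulting distribution of ORFs over main categories can be sketched as follows. The dotted category codes, the ORF identifiers and the assignments are invented for illustration, and the sketch assumes that an ORF may carry several categories while a category may hold many ORFs.

```python
from collections import Counter, defaultdict

# Hypothetical ORF -> category assignments using dotted hierarchical codes.
assignments = {
    "orf0001": ["01.05.01", "30.01"],
    "orf0002": ["01.05"],
    "orf0003": ["30.01", "01.03"],
}


def ancestors(code):
    """Return the code's lineage from the main category down to itself."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def main_category_distribution(assignments):
    """Count ORFs per main category, counting each ORF once per category."""
    counts = Counter()
    for codes in assignments.values():
        counts.update({ancestors(code)[0] for code in codes})
    return counts


def orfs_by_category(assignments):
    """Invert the mapping so every level of the hierarchy lists its ORFs."""
    index = defaultdict(set)
    for orf, codes in assignments.items():
        for code in codes:
            for level in ancestors(code):
                index[level].add(orf)
    return index


distribution = main_category_distribution(assignments)
index = orfs_by_category(assignments)
print(dict(distribution))
print(sorted(index["01.05"]))
```

Rolling assignments up to their ancestors is what allows a protein to be placed in a coarse category first and refined later without changing how the counts are computed.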

Visualization

an integrated, hypertext-linked protein report with calculated parameters and sequences, serving as a reference for further manual annotation
Protein report page

Bioinformatics Method-Cont.2

Automatic versus manual annotation

Problem of error propagation

erroneous annotations arise from human error and spurious similarity hits
can filtering algorithms and knowledge of domain structure prevent this?
quality is improved by manual review by human experts!

Manual annotation

Catalogue independent
Flexibility: a protein is first placed in a higher-level category and moved to finer categories in a later step

528 categories: 20 main categories and 6 levels
confidence levels: “reject”, “low”, “medium” and “high”; the default is “auto”

Data release management

data from a new release can be intelligently merged with the existing data pool
manual annotations are transferred between subsequent data releases
“manual” field: “yes” or “no”; initially set to “no”
example: a PFAM domain identified in an ORF of a new release gets “manual: no” and “conf: auto”
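
One possible reading of the merge rule is sketched below: manually curated records survive a new release, and everything else is replaced by fresh automatic results with the default flags. The record layout, key structure and sample values are assumptions; the slides do not describe the actual merge algorithm.

```python
def merge_release(existing, new_release):
    """Merge a new automatic release into the existing annotation pool.

    Both arguments map (orf_id, feature) to a record dict. Records flagged
    manual == "yes" are carried over unchanged; all others take the new
    automatic result with manual == "no" and conf == "auto".
    """
    merged = {}
    for key, new in new_release.items():
        old = existing.get(key)
        if old is not None and old["manual"] == "yes":
            merged[key] = dict(old)  # keep the curated record
        else:
            merged[key] = {**new, "manual": "no", "conf": "auto"}
    return merged


# Hypothetical data: one curated record, one automatic record, one new ORF.
existing = {
    ("orf0001", "domain_A"): {"note": "curated", "manual": "yes", "conf": "high"},
    ("orf0002", "domain_A"): {"note": "old auto", "manual": "no", "conf": "auto"},
}
new_release = {
    ("orf0001", "domain_A"): {"note": "new auto"},
    ("orf0002", "domain_A"): {"note": "new auto"},
    ("orf0003", "domain_B"): {"note": "new auto"},
}

merged = merge_release(existing, new_release)
for key in sorted(merged):
    print(key, merged[key])
```

In this sketch, records that disappear from the new release are dropped; whether curated records should instead be retained in that case is a design decision the slides leave open.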

Manual annotation transfer

The PEDANT Genome Database

Annotation of publicly available completely sequenced and unfinished genomes

Genomes annotated by MIPS
Completely sequenced and published genomic sequences
Unfinished and/or unpublished genomic sequences
gene prediction by ORPHEUS, allowing large overlaps between ORFs

PEDANT as a structural genomics resource (0.3 million proteins)

class-based approach, cost-saving
(i) assemble a non-redundant protein sequence database
(ii) run a PSI-BLAST search with each SCOP sequence against (i) and save the resulting profiles
(iii) construct a SCOP profile library using IMPALA
(iv) run an IMPALA search with each genomic sequence against the SCOP profile library
same procedure for the non-redundant (nr) PDB sequence database
performance of IMPALA
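
The idea behind steps (ii) to (iv), searching a sequence against a library of profiles rather than against single sequences, can be illustrated with a toy position-specific scoring matrix. This is not IMPALA or PSI-BLAST: it is an ungapped sliding-window score with a uniform background and simple pseudocounts, and the aligned family and query sequence are invented.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Build a log-odds scoring matrix from equal-length aligned sequences."""
    background = 1.0 / len(ALPHABET)  # uniform background, a simplification
    profile = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(ALPHABET)
        scores = {}
        for aa in ALPHABET:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def best_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, start)."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col[aa] for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Hypothetical aligned family members and a hypothetical query sequence.
family = ["GDSGG", "GDSGG", "GDAGG", "GESGG"]
profile = build_profile(family)
score, start = best_match(profile, "MKTLLGDSGGPLVA")
print(round(score, 2), start)
```

A profile rewards residues that are conserved across the whole family, which is why profile searches can detect relationships that a single-sequence comparison misses.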

Cross-genome comparison

treat each genome as an individual contig: create cross-genome datasets without any modification
44 genomes

Performance of IMPALA

Applications

Arabidopsis thaliana chromosome IV

3744 predicted protein coding genes
roughly 30% are known proteins or strongly similar to known proteins
multicellular organisms have a higher proportion of all-alpha and a lower proportion of mixed alpha/beta structural domains than unicellular species

Assembled human transcripts

human UniGene data subjected to PEDANT analysis: over 75,000 contigs
the resulting MySQL database is close to 8 GB
acceptable query times show the suitability of PEDANT for large-scale EST sequencing projects

Analysis of the GroEL substrates

GroEL: a common E. coli chaperonin
structural motif common to 52 substrates that rely on GroEL for folding in vivo: two or more alpha/beta domains with buried beta-sheets and large hydrophobic surfaces, which aggregate easily

Classification of predicted genes

Summary and Outlook

PEDANT is a useful tool for genome annotation and bioinformatics research
It supports automated and manual assignment of gene products to functional and structural categories
It provides extensive hyperlinked protein reports and advanced viewers
Outlook

better decision rules need to be employed
manual annotation of predicted genetic elements (e.g. LTRs)
support for the Oracle RDBMS
automatic gene prediction pipeline for higher eukaryotes
interactive capabilities
