Nucleic Acids Research, 2009, Vol. 37, Database issue Published online 21 October 2008


D408–D411


doi:10.1093/nar/gkn749

PEDANT covers all complete RefSeq genomes

Mathias C. Walter, Thomas Rattei, Roland Arnold, Ulrich Güldener, Martin Münsterkötter, Karamfilka Nenova, Gabi Kastenmüller, Patrick Tischler, Andreas Wölling, Andreas Volz, Norbert Pongratz, Ralf Jost, Hans-Werner Mewes 1,2 and Dmitrij Frishman 1,2,*

1 Institute of Bioinformatics and Systems Biology, Helmholtz Zentrum München, German Research Center for Environmental Health (GmbH), Ingolstädter Landstrasse 1, 85764 Neuherberg, 2 Department of Genome-oriented Bioinformatics, Wissenschaftszentrum Weihenstephan, Technische Universität München, Am Forum 1, 85350 Freising and 3 Biomax Informatics AG, Lochhamer Strasse 9, 82152 Martinsried, Germany

Received September 15, 2008; Accepted October 3, 2008

ABSTRACT

The PEDANT genome database provides exhaustive annotation of nearly 3000 publicly available eukaryotic, eubacterial, archaeal and viral genomes with more than 4.5 million proteins by a broad set of bioinformatics algorithms. In particular, all completely sequenced genomes from the NCBI's Reference Sequence collection (RefSeq) are covered. The PEDANT processing pipeline has been sped up by an order of magnitude through the utilization of precalculated similarity information stored in the similarity matrix of proteins (SIMAP) database, making it possible to process newly sequenced genomes immediately as they become available. PEDANT is freely accessible to academic users at http://pedant.gsf.de. For programmatic access, Web Services are available at http://pedant.gsf.de/webservices.jsp.

INTRODUCTION

Since its first announcement in 1997 (1), the PEDANT genome database has steadily grown to become one of the most comprehensive collections of automatically annotated genomes. As of September 2008, PEDANT covers all complete genomes as provided by the RefSeq (2) database. In total, 861 completely sequenced genomes from all three domains of life as well as 2081 complete viral genomes are available (Table 1). Here, we define a 'complete genome' as a genome whose chromosomal datasets exist as RefSeq records or Ensembl (3) entries and for which genes have been predicted. For those eukaryotic genomes (currently 33) that are available from both RefSeq and Ensembl, we provide the annotation of both versions. This results in a total of 2975 genome databases with 4.5 million proteins occupying 3.1 TB of storage. All PEDANT databases are continuously updated. For example, assignments of genes to the MIPS Functional Catalog (FunCat) (4) have recently been recalculated using the new 2.1 version of FunCat (http://mips.gsf.de/projects/funcat).

The current version of the software driving the PEDANT web site, which we refer to as PEDANT3, represents an industry-strength Java workbench that supports large-scale grid computing and utilizes a workflow-based processing engine (D. Frishman et al., manuscript in preparation). Dozens of custom workflows are available: generic workflows for eukaryotic, prokaryotic and viral genomes as well as more specialized workflows supporting specific genome groups (gram-positive versus gram-negative bacteria, fungi, plants), data types (EST collections, raw contigs without any predicted Open Reading Frames (ORFs), protein-only datasets, etc.) and bioinformatics methods (e.g. alternative gene prediction techniques). Advanced protein and DNA viewers implemented using server-side Java provide graphical representation of protein annotation features as well as genetic elements on chromosomes.

*To whom correspondence should be addressed. Tel: +49 8161 712134; Fax: +49 8161 712186; Email: d.frishman@wzw.tum.de

© 2008 The Author(s)

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

NEW FEATURES AND IMPROVEMENTS

Genome import pipeline

Given the quick pace of genome sequencing, keeping track of currently available data and obtaining them from source databases for local processing is a time-consuming and technically challenging task. To organize a more efficient import of genomes into PEDANT from various sources, we set up a specialized processing pipeline (Figure 1). In the first step, we acquire a list of available genomes from each genome resource. Then we try to determine the Entrez genome project ID by using the Entrez Programming Utilities (eUtils, http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html) and querying the NCBI databases (5) for genome project information. If available, we use the genome project ID as the primary key for a given genome; otherwise the NCBI taxonomy ID is used. The advantage of genome project IDs is that they are stable, in contrast to taxonomy IDs, which may change, especially for the species/strains of newly sequenced genomes. The genome IDs are then stored in our local meta-database, which also serves as the data basis for generating the full genome list for the PEDANT web page.
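The primary-key rule of the import pipeline (genome project ID when available, NCBI taxonomy ID otherwise) can be sketched in a few lines. This is only an illustration, not the PEDANT code: the ESearch endpoint URL, the `genomeprj` database name used in the example call, the `project:`/`taxon:` key prefixes and all example IDs are assumptions that do not come from the article.

```python
from typing import Optional
from urllib.parse import urlencode

# ESearch endpoint of the NCBI E-utilities (an assumption of this sketch;
# the article only links to the eUtils help page).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str) -> str:
    """Build an ESearch request URL for one Entrez database and query term."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term})


def primary_key(project_id: Optional[int], taxonomy_id: int) -> str:
    """Prefer the stable genome project ID; otherwise use the taxonomy ID.

    The "project:" / "taxon:" prefixes are hypothetical. They only keep
    the two ID spaces from colliding in a single key column.
    """
    if project_id is not None:
        return f"project:{project_id}"
    return f"taxon:{taxonomy_id}"


# Hypothetical example values, not taken from PEDANT.
print(esearch_url("genomeprj", "Escherichia coli"))
print(primary_key(None, 562))
print(primary_key(12345, 562))
```

Keeping the fallback in one small function means the meta-database only ever sees one kind of key, whichever ID was available at import time.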

Data retrieval procedures have been adapted to several different sources of genome information. For downloading RefSeq genomes, we use a patched version (retry on connection timeouts, improved error handling) of the NCBI ToolBox (http://www.ncbi.nlm.nih.gov/IEB/ToolBox) program. For Ensembl genomes, we install the provided MySQL database dumps (ftp://ftp.ensembl.org/pub/current_mysql) on our local MySQL server and extract the genomic data directly.
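The "retry on connection timeouts" behaviour mentioned for the patched download program can be illustrated generically. The sketch below is not the NCBI ToolBox patch. The function name `with_retries`, the caught exception types and the linear back-off are all assumptions chosen for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def with_retries(fetch: Callable[[], T], attempts: int = 3, delay_s: float = 0.0) -> T:
    """Call fetch(); on a timeout or connection error, wait and try again."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fetch()
        except (TimeoutError, ConnectionError) as err:
            last_error = err
            if attempt < attempts - 1:
                # Linear back-off: wait a little longer after each failure.
                time.sleep(delay_s * (attempt + 1))
    # All attempts failed: report this explicitly and keep the root cause.
    raise RuntimeError(f"download failed after {attempts} attempts") from last_error


# Usage with a simulated download that times out twice before succeeding.
state = {"calls": 0}


def flaky_download() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated connection timeout")
    return "genome record"


print(with_retries(flaky_download, attempts=3))
```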

Retrieval of genomes not contained in RefSeq or Ensembl can only be done in a semi-automatic fashion with manual verification. In many cases, RefSeq lists the sequencing centers involved, from which the original data can be obtained. Another useful resource for locating genomes is the Genomes OnLine Database (GOLD) (6). We then retrieve the assembly and annotation data directly from the sequencing centers and check them for missing sequences, nonunique identifiers and unusual formatting. If the gene annotation data are missing or only available as a draft (especially for fungal genomes), gene predictions are carried out or existing models are improved, depending on the annotation project (7,8).
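Two of the checks named above, missing sequences and nonunique identifiers, are easy to automate. The following minimal sketch assumes the data arrive as FASTA text, which the article does not state. The function names and the example records are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def parse_fasta(text: str) -> List[Tuple[str, str]]:
    """Parse FASTA text into (identifier, sequence) pairs, keeping duplicates."""
    records: List[Tuple[str, str]] = []
    header = None
    chunks: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            fields = line[1:].split()
            header = fields[0] if fields else ""  # ID = first word of the header
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Report identifiers that occur more than once or have no sequence."""
    counts = Counter(identifier for identifier, _ in records)
    return {
        "duplicate_ids": sorted(i for i, n in counts.items() if n > 1),
        "empty_sequences": sorted({i for i, seq in records if not seq}),
    }


example = ">g1 first gene\nATG\nCC\n>g2\n>g1\nTTT\n"
print(check_records(parse_fasta(example)))
```

Checks for "unusual formatting" depend on each sequencing center's conventions and are left out here, which matches the article's point that this step still needs manual verification.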

Integration of PEDANT and SIMAP

Calculating and updating protein similarities and domain assignments is the most time-consuming and computationally expensive task in our genome annotation pipeline. Previously, BLASTP (9) and InterProScan (10) searches required up to 80% of the total CPU time of the PEDANT genome annotation workflow. To cope with the high number of newly sequenced genomes and to keep the data in PEDANT up-to-date, a radical reduction of this huge computational effort has become necessary.

The most obvious answer to this problem is to utilize high-performance computing facilities and avoid redundant calculations. The similarity matrix of proteins (SIMAP) (11) provides precalculated and up-to-date all-against-all alignments as well as domain assignments for essentially all publicly available protein sequences (21 million as of this writing). Our recent efforts to integrate PEDANT with SIMAP made it possible to avoid computationally intensive BLASTP and InterProScan runs and have led to a dramatic acceleration of the genome annotation work. Compared with de novo calculations, retrieving similarities and domains from the SIMAP database reduces the required CPU time by factors between 5 and 60. A typical bacterial genome with 3000 predicted genes can be processed at MIPS in <40 min using 60 Sun Grid Engine (SGE, http://gridengine.sunsource.net) nodes.
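As a back-of-the-envelope reading of these figures, the node time spent on such a genome can be bounded from above. The bound assumes that all 60 nodes are busy for the full wall-clock time, which the text does not state, so the true figure may be lower.

```latex
\begin{align*}
T_{\text{node}} &< 60 \times 40\ \text{min} = 2400\ \text{node-min} = 40\ \text{node-h},\\
\frac{T_{\text{node}}}{3000\ \text{genes}} &< 0.8\ \text{node-min per gene} = 48\ \text{node-s per gene}.
\end{align*}
```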

[Figure 1: activity diagram. Activity labels: query genome IDs; fetch genome information; download genome data; verify, reformat, convert; create or update PEDANT analysis; import into SIMAP; computation; re-fetch; compute similarity; find domains; fetch blastp hits and InterPro domains; [SIMAP update]; mirror to public webserver.]

Figure 1. UML activity model of the PEDANT genome import and processing pipeline. Symbols according to the UML 2.0 specification (http://www.uml.org) for activity diagrams.

Table 1. The number of species from major taxonomic groups contained in the PEDANT genome database as of September 2008

NCBI Taxonomy ID    Taxonomic group       Number of genomes
131567              Cellular organisms    861
2157                Archaea
                    Bacteria              691
2759                Eukaryota             117
4751                Fungi
33208               Metazoa
33090               Viridiplantae
                    Other
10239               Viruses               2081
                    Total                 2942

Other groups: Alveolata (2), Amoebozoa (1), Cryptophyta (1), Euglenozoa (5), Rhodophyta (1), Stramenopiles (2).

To generate and obtain these data, we have developed a computational workflow that coordinates the tasks between PEDANT and SIMAP. The first step in this workflow involves the import and maintenance of genome sequences and primary annotation provided by the respective source databases in PEDANT. In a subsequent step, SIMAP automatically retrieves protein and sequence data from PEDANT. If novel protein sequences previously unknown to SIMAP have been imported, their similarities to all other protein sequences and their domain architecture are calculated in SIMAP by utilizing large public resource computing facilities (12). As soon as the precalculated data are completely available in SIMAP, a notification event is triggered to start the SIMAP-based methods in PEDANT. These methods have been implemented as remote Enterprise Java Bean (EJB) invocations, which allow for rapid and efficient retrieval of data from SIMAP. One method, designed to replace BLASTP, retrieves homologs from a composite nonredundant database that includes PDB, UniProt/Swissprot, UniProt/TrEMBL, as well as all protein sequences already present in PEDANT. The second method, which serves as a substitute for InterProScan, retrieves precalculated protein domain assignments considering all InterPro member databases according to the InterPro XML format specification, except for the TMHMM (13), SignalP (14) and TargetP (15) methods, which are run by PEDANT itself considering the appropriate genomic context (i.e. gram stain for signal peptides).
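The coordination step (compute only for sequences that are new to SIMAP, then notify PEDANT) can be sketched as follows. This is an illustrative model, not the actual PEDANT/SIMAP interface, which the article describes as remote EJB invocations. Identifying sequences by a SHA-256 hash, and the names `sequence_key`, `sync`, `compute` and `notify`, are assumptions of this sketch.

```python
import hashlib
from typing import Callable, Dict, List, Set


def sequence_key(sequence: str) -> str:
    """Hash a protein sequence so that identical sequences share one key."""
    return hashlib.sha256(sequence.upper().encode("ascii")).hexdigest()


def sync(
    imported: Dict[str, str],
    known_keys: Set[str],
    compute: Callable[[str, str], None],
    notify: Callable[[List[str]], None],
) -> List[str]:
    """Run the expensive computation only for novel sequences, then notify."""
    computed: List[str] = []
    for protein_id, seq in imported.items():
        key = sequence_key(seq)
        if key in known_keys:
            continue  # already precalculated, nothing to do
        compute(protein_id, seq)  # stands in for similarity and domain searches
        known_keys.add(key)
        computed.append(protein_id)
    notify(computed)  # stands in for the notification event
    return computed


# Hypothetical data: p2 is already known, and p3 repeats the sequence of p1.
known = {sequence_key("MAA")}
new_ids = sync(
    {"p1": "MKT", "p2": "MAA", "p3": "mkt"},
    known,
    compute=lambda pid, seq: None,
    notify=lambda ids: print("precalculation complete for", ids),
)
```

Because the set of known keys is updated inside the loop, a sequence that appears twice in one import is computed only once. Avoiding that kind of redundant calculation is the point the text makes about SIMAP.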

Web Services

The comprehensive collection of nearly 3000 extensively annotated genomes provides a unique foundation for data mining and large-scale investigation of genome properties. While information on a limited number of genes of interest can be conveniently explored using the PEDANT web interface, any computational analysis of genomes at large necessitates local access to data. However, the large amount of annotation data computed for 4.5 million PEDANT proteins makes systematic dissemination of database dumps or flat files impractical (although we do provide them upon request). Instead, we offer simple, transparent and programming-language-independent remote access based on Web Service technology. This service has been implemented as a document-style, SOAP-based Web Service (see http://www.w3.org/TR/soap12-part0). It can be easily integrated into users' own applications, since libraries for accessing this kind of service exist for most programming languages. The functions provided by the Web Service are described in a Web Services Description Language (WSDL) file (see http://www.w3.org/TR/wsdl), which allows for automatic generation of a client program, e.g. by using the Perl SOAP::Lite (http://www.soaplite.com) or the Java Axis (http://ws.apache.org/axis/java/index.html) libraries.

The PEDANT3 WSDL file can be found at http://mips.gsf.de/webservice/pedant3/Pedant3AccessBeanService/Pedant3AccessWebService?wsdl. At present, the service provides the following query types:

(i) return the list of organisms processed in PEDANT,

(ii) return the computational methods used to annotate a particular organism,

(iii) return a result overview (e.g. which functional category appears how many times) for a certain method in a certain organism,

(iv) return the genetic elements of an organism,

(v) return the result of a certain method for a single genetic element or for a whole genome ordered by its genetic elements.

For the latter query type it is possible to search in both directions: the service can return all genetic elements having a certain property (e.g. a certain functional attribute), or all properties of a certain genetic element (e.g. all functional attributes of a protein). Furthermore, in the former case it is possible to query several genomes at once. For BLASTP- and SIMAP-based methods, it is possible to restrict the results by an E-value cutoff. A detailed overview of the Web Service functionality can be found at http://pedant.gsf.de/webservices.jsp.
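To give a feel for what a document-style SOAP request looks like on the wire, the sketch below builds a SOAP 1.2 envelope using only the Python standard library. The operation name `listOrganisms`, the service namespace `urn:example:pedant3` and the `evalueCutoff` parameter are hypothetical placeholders. The real operation and parameter names are defined in the PEDANT3 WSDL and are not reproduced in this article. Only the SOAP 1.2 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Envelope namespace defined by the SOAP 1.2 specification.
SOAP12_NS = "http://www.w3.org/2003/05/soap-envelope"


def build_envelope(operation: str, service_ns: str, params: Dict[str, str]) -> bytes:
    """Serialize a minimal SOAP 1.2 request with one operation element in the body."""
    ET.register_namespace("soap", SOAP12_NS)
    envelope = ET.Element(f"{{{SOAP12_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP12_NS}}}Body")
    request = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(request, name).text = value
    return ET.tostring(envelope, encoding="utf-8", xml_declaration=True)


# Hypothetical operation, namespace and parameter (see the lead-in above).
message = build_envelope("listOrganisms", "urn:example:pedant3", {"evalueCutoff": "1e-10"})
print(message.decode("utf-8"))
```

In practice a client generated from the WSDL would build such messages automatically, so writing envelopes by hand is mainly useful for debugging.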

The PEDANT3 Web Service encapsulates the complicated internal data structures of the PEDANT database and returns the results in a generic format that consists of key-value pairs of properties assigned to a given genetic element. This generic format assures that the end-user client software will not have to be reprogrammed if new methods are introduced into the PEDANT system.
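The benefit of a generic key-value format is that client code never needs to know the method names in advance. The sketch below models results as (genetic element, key, value) triples and supports both search directions described for the Web Service. The triple layout, the function names and the example keys and values are assumptions for illustration, not the actual response schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One result row: (genetic element ID, property key, property value).
Row = Tuple[str, str, str]


def properties_of(rows: Iterable[Row], element_id: str) -> Dict[str, List[str]]:
    """All properties of one genetic element, grouped by key."""
    props: Dict[str, List[str]] = defaultdict(list)
    for element, key, value in rows:
        if element == element_id:
            props[key].append(value)
    return dict(props)


def elements_with(rows: Iterable[Row], key: str, value: str) -> List[str]:
    """All genetic elements that carry a given property value."""
    return sorted({element for element, k, v in rows if k == key and v == value})


# Hypothetical rows; keys and values are placeholders, not real annotations.
rows = [
    ("gene1", "funcat", "01.01"),
    ("gene1", "domain", "IPR000001"),
    ("gene2", "funcat", "01.01"),
    ("gene2", "newmethod", "x"),  # an unknown method needs no client change
]
print(properties_of(rows, "gene1"))
print(elements_with(rows, "funcat", "01.01"))
```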

DISCUSSION

There is no fixed release cycle for PEDANT. As soon as new genomes become available at RefSeq or any other listed genome resource, they will be imported, processed and made available via the web server. However, since SIMAP has a monthly release cycle, the computation of a genome by PEDANT is typically finished roughly 1 month after its import. Since the PEDANT3 software is now stable and all genomes from the previous version, PEDANT2, have been either migrated or reimported into PEDANT3, we took PEDANT2 and its Web Service offline. We also discarded all incomplete genomes previously available via PEDANT2 because the new high-throughput technologies now allow genome sequencing projects to be finished within a very short time frame.

In the future, genomes from further resources [i.e. UCSC Genome Browser Database (16), Vega (17)] will be imported and previously imported genomes will be kept up-to-date. We are also in the process of supplementing the PEDANT web site with multiple new features, including viewing the genome project information [RefSeq status, source sequencing centers, whole-genome shotgun (WGS) (18) sequencing coverage, number of records, etc.], taxonomic selection of genomes and improved search capabilities. A cross-genome index for precomputed annotations is nearly finished and will be available online shortly. This will allow for comparison of genomes based on their annotated features, such as domain content, functional categories and structural folds.

ACKNOWLEDGEMENTS

We are grateful to Volker Stümpflen for assistance with the Web Services.

FUNDING

Funding for open access charge: Helmholtz Gemeinschaft.

Conflict of interest statement. None declared.


REFERENCES

1. Frishman,D. and Mewes,H.-W. (1997) Pedantic genome analysis. Trends Genet., 13, 415–416.

2. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65.

3. Hubbard,T.J.P., Aken,B.L., Beal,K., Ballester,B., Caccamo,M., Chen,Y., Clarke,L., Coates,G., Cunningham,F., Cutts,T. et al. (2007) Ensembl 2007. Nucleic Acids Res., 35, D610–D617.

4. Ruepp,A., Zollner,A., Maier,D., Albermann,K., Hani,J., Mokrejs,M., Tetko,I., Güldener,U., Mannhaupt,G., Münsterkötter,M. et al. (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res., 32, 5539–5545.

5. Wheeler,D.L., Barrett,T., Benson,D.A., Bryant,S.H., Canese,K., Chetvernin,V., Church,D.M., Dicuccio,M., Edgar,R., Federhen,S. et al. (2008) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res., 36, D13–D21.

6. Liolios,K., Mavromatis,K., Tavernarakis,N. and Kyrpides,N.C. (2008) The Genomes On Line Database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res., 36, D475–D479.

7. Güldener,U., Mannhaupt,G., Münsterkötter,M., Haase,D., Oesterheld,M., Stümpflen,V., Mewes,H.-W. and Adam,G. (2006) FGDB: a comprehensive fungal genome resource on the plant pathogen Fusarium graminearum. Nucleic Acids Res., 34, D456–D458.

8. Kämper,J., Kahmann,R., Bölker,M., Ma,L.-J., Brefort,T., Saville,B.J., Banuett,F., Kronstad,J.W., Gold,S.E., Müller,O. et al. (2006) Insights from the genome of the biotrophic fungal plant pathogen Ustilago maydis. Nature, 444, 97–101.

9. Altschul,S.F., Madden,T.L., Schäffer,A.A., Zhang,J., Zhang,Z., Miller,W. and Lipman,D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

10. Quevillon,E., Silventoinen,V., Pillai,S., Harte,N., Mulder,N., Apweiler,R. and Lopez,R. (2005) InterProScan: protein domains identifier. Nucleic Acids Res., 33, W116–W120.

11. Rattei,T., Tischler,P., Arnold,R., Hamberger,F., Krebs,J., Krumsiek,J., Wachinger,B., Stümpflen,V. and Mewes,H.-W. (2008) SIMAP–structuring the network of protein similarities. Nucleic Acids Res., 36, D289–D292.

12. Rattei,T., Walter,M., Arnold,R., Anderson,D. and Mewes,W. (2007) Using public resource computing and systematic pre-calculation for large scale sequence analysis. Lect. Notes Bioinform., 4360, 11–18.

13. Kahsay,R.Y., Gao,G. and Liao,L. (2005) An improved hidden Markov model for transmembrane protein detection and topology prediction and its applications to complete genomes. Bioinformatics, 21, 1853–1858.

14. Bendtsen,J.D., Nielsen,H., von Heijne,G. and Brunak,S. (2004) Improved prediction of signal peptides: SignalP 3.0. J. Mol. Biol., 340, 783–795.

15. Emanuelsson,O., Nielsen,H., Brunak,S. and von Heijne,G. (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol., 300, 1005–1016.

16. Karolchik,D., Kuhn,R.M., Baertsch,R., Barber,G.P., Clawson,H., Diekhans,M., Giardine,B., Harte,R.A., Hinrichs,A.S., Hsu,F. et al. (2008) The UCSC Genome Browser Database: 2008 update. Nucleic Acids Res., 36, D773–D779.

17. Wilming,L.G., Gilbert,J.G.R., Howe,K., Trevanion,S., Hubbard,T. and Harrow,J.L. (2008) The vertebrate genome annotation (Vega) database. Nucleic Acids Res., 36, D753–D760.

18. Staden,R. (1979) A strategy of DNA sequencing employing computer programs. Nucleic Acids Res., 6, 2601–2610.
