
Bioinformatics Toolbox

PCQ Bureau

04 Jul 2002 09:48 IST


DNA and RNA are the nucleic acids that store the hereditary information about an organism. These macromolecules, like the proteins they encode, have defined structures that biologists can analyze with the help of bioinformatics tools and databases. We look at a few of these.

Databases

A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource.

GenBank

GenBank (Genetic Sequence Databank) is one of the fastest growing repositories of known genetic sequences. It uses a flat-file structure: each record is plain ASCII text, readable by both humans and computers. In addition to sequence data, GenBank files contain information such as accession numbers, gene names, phylogenetic classification and references to published literature.
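
As a rough illustration of why a flat file is readable by both humans and computers, the sketch below parses a made-up record laid out in GenBank style. The record contents (the accession DEMO0001 and the sequence) and the function name parse_flat_record are assumptions for illustration only. Real GenBank records carry many more fields, multi-line values and a feature table that this sketch does not handle.

```python
# Illustrative record only: the accession, description and sequence are made up.
SAMPLE_RECORD = """\
LOCUS       DEMO0001                  24 bp    DNA     linear   UNA 01-JAN-2002
DEFINITION  Hypothetical demo gene, partial sequence.
ACCESSION   DEMO0001
SOURCE      Example organism
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""


def parse_flat_record(text):
    """Return a dict of top-level header fields plus the sequence."""
    fields = {}
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):        # record terminator
            break
        if line.startswith("ORIGIN"):    # the sequence block follows this line
            in_sequence = True
            continue
        if in_sequence:
            # Sequence lines look like "   1 atggccattg taatgggccg":
            # drop the leading position number and join the blocks.
            sequence_parts.extend(line.split()[1:])
        elif line[:1].isalpha():
            # Top-level keywords start in the first column.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["SEQUENCE"] = "".join(sequence_parts).upper()
    return fields


record = parse_flat_record(SAMPLE_RECORD)
print(record["ACCESSION"], len(record["SEQUENCE"]))
```

The same text a biologist reads in an editor is split into named fields with a few string operations, which is the practical appeal of the flat-file layout.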

SwissProt

This is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy (that is, few identical sequences are present in the database).

PIR-PSD

PIR (Protein Information Resource) produces and distributes the PIR-International Protein Sequence Database (PSD). It is billed as a comprehensive and expertly annotated protein sequence database.

Markup languages
	XML. Bioinformatics databases can be built on flat files or on relational database management systems (RDBMS) such as Oracle, IBM DB2 or MySQL. These existing databases and data formats can be translated into XML. The aim is to design XML documents that capture the flexible data structures found in bioinformatics. Ultimately, XML documents must be stored efficiently in RDBMS-based systems in a way that supports querying the documents and combining those queries with data held in other relational tables (such as sequence or expression data). XML also offers a way to move data between programming languages. Using modules such as Perl's XML::Parser, structured data objects can be moved between C/C++, Java and Perl without worrying about the details of encoding and decoding.
	AnatML (Anatomical Markup Language). This is an XML-based language used to describe anatomy, in particular for modeling the human musculoskeletal system, and facilitating data exchange among contributing scientists.
	BSML (Bioinformatic Sequence Markup Language). This is a document type definition (DTD) for representing molecular biological data. Project data and links are stored in BSML documents. It provides management tools, a visualization interface and containers for sequences and other bioinformatic data.
	BIOML (Biopolymer Markup Language). Developed by Proteometrics, BIOML allows the full specification of all experimental information known about polymers.
	Databases like Oracle and Sybase
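
The XML item above mentions storing XML-derived data in an RDBMS and combining queries over it with other relational tables. A minimal sketch of that idea follows, using Python's standard xml.etree.ElementTree and sqlite3 modules in place of a production database. The XML element names, table names and expression values are all invented for illustration and do not follow BSML, BIOML or any other published schema.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical XML layout; element and attribute names are invented here.
XML_DOC = """
<sequences>
  <sequence id="DEMO0001" type="dna">
    <gene>demoA</gene>
    <residues>ATGGCCATTG</residues>
  </sequence>
  <sequence id="DEMO0002" type="dna">
    <gene>demoB</gene>
    <residues>ATGCTGA</residues>
  </sequence>
</sequences>
"""

root = ET.fromstring(XML_DOC)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sequence (id TEXT PRIMARY KEY, gene TEXT, residues TEXT)"
)
conn.execute("CREATE TABLE expression (seq_id TEXT, level REAL)")

# Turn each XML element into one relational row.
for node in root.findall("sequence"):
    conn.execute(
        "INSERT INTO sequence VALUES (?, ?, ?)",
        (node.get("id"), node.findtext("gene"), node.findtext("residues")),
    )

# Made-up expression values held in an ordinary relational table.
conn.executemany(
    "INSERT INTO expression VALUES (?, ?)",
    [("DEMO0001", 2.5), ("DEMO0002", 0.4)],
)

# One query combining the XML-derived rows with the expression table.
rows = conn.execute(
    "SELECT s.gene, length(s.residues), e.level "
    "FROM sequence s JOIN expression e ON e.seq_id = s.id "
    "WHERE e.level > 1.0"
).fetchall()
print(rows)
```

Flattening the XML into rows is only one possible design. It makes joins simple, but loses any nesting that does not map onto the chosen columns.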

Protein sequence databases are classified as primary, secondary and composite depending upon the content stored in them. PIR and SwissProt are primary databases that contain protein sequences as ‘raw’ data. Secondary databases (like Prosite) contain the information derived from protein sequences. Primary databases are combined and filtered to form non-redundant composite databases.
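
The combine-and-filter step can be sketched as below. This is a simplified assumption of how a composite might be built: each primary database is modeled as a plain dictionary from entry ID to sequence, redundancy means exact sequence identity, and the entry IDs and sequences are made up. Real composite databases apply more elaborate rules.

```python
def build_composite(*primary_databases):
    """Merge primary databases, keeping one entry per distinct sequence."""
    composite = {}
    seen_sequences = set()
    for database in primary_databases:
        for entry_id, sequence in database.items():
            key = sequence.upper()
            if key in seen_sequences:
                continue  # an identical sequence is already in the composite
            seen_sequences.add(key)
            composite[entry_id] = sequence
    return composite


# Hypothetical entries standing in for two primary databases.
database_a = {"A1": "MKTAYIAK", "A2": "MSLLTEVE"}
database_b = {"B1": "MKTAYIAK", "B2": "MGDVEKGK"}

composite = build_composite(database_a, database_b)
print(sorted(composite))
```

Entry B1 is dropped because its sequence duplicates A1. The order in which the primary databases are passed decides which ID survives.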

Tools

Both standard and customized products are available to meet the requirements of particular projects. Data-mining software retrieves data from genomic sequence databases, and visualization tools analyze and retrieve information from proteomic databases. These tools can be classified as homology and similarity tools, protein functional analysis tools, sequence analysis tools and miscellaneous tools. Here is a brief description of a few of these.

BLAST

BLAST (Basic Local Alignment Search Tool) falls under the category of homology and similarity tools. It is a set of search programs, available for several platforms including UNIX and Windows, that performs fast similarity searches regardless of whether the query is a protein or a DNA sequence. A nucleotide query can be compared against the nucleotide sequences in a database, and a protein database can be searched for matches to a queried protein sequence.

NCBI has also introduced a new queuing system for BLAST (QBLAST) that allows users to retrieve results at their convenience and format their results multiple times with different formatting options.
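
To show what "local alignment" means, the sketch below scores the best-matching region shared by two sequences using Smith-Waterman dynamic programming. This is not how BLAST works internally: BLAST uses a faster heuristic search, while this exhaustive method is only practical for short sequences. The function name and the scoring values (match, mismatch, gap) are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                              # start a new local alignment
                previous[j - 1] + pair_score,   # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best


# The shared region "ACG" scores 3 matches x 2 = 6, ignoring the flanks.
print(local_alignment_score("TTTACGTTTT", "GGACGGG"))
```

Because scores are floored at zero, poorly matching flanking regions do not drag down the score of a well-matching core, which is what makes the alignment local.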

EMBOSS

EMBOSS (European Molecular Biology Open Software Suite) is a sequence-analysis software package. It can work with data in a range of formats and can retrieve sequence data transparently from the Web. Extensive libraries are provided with the package, so that other scientists can develop and release their own software as open source. It provides a set of sequence-analysis programs and runs on UNIX platforms.

ClustalW

ClustalW is a fully automated multiple sequence alignment tool for DNA and protein sequences. It produces a global alignment, that is, the best match over the total length of the input sequences, whether they are proteins or nucleic acids.
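
Matching "over the total length" can be illustrated with a pairwise global alignment score computed by Needleman-Wunsch dynamic programming. This is only a sketch of the pairwise idea and not the tool's full method for aligning many sequences at once. The function name and the scoring values are assumptions for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score for aligning all of a against all of b."""
    # Row 0: aligning an empty prefix of a against b costs one gap per residue.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            pair_score = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                previous[j - 1] + pair_score,   # align a[i-1] with b[j-1]
                previous[j] + gap,              # gap in b
                current[j - 1] + gap,           # gap in a
            )
        previous = current
    return previous[len(b)]


# GATTACA vs GAT-ACA: six matches and one gap give 6 - 1 = 5.
print(global_alignment_score("GATTACA", "GATACA"))
```

Every residue of both inputs must be accounted for, so length differences are always paid for in gaps.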

RasMol

RasMol is a powerful research tool for displaying the structure of DNA, proteins and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier-to-use program.

PROSPECT

PROSPECT (PROtein Structure Prediction and Evaluation Computer ToolKit) is a protein-structure prediction system that employs a computational technique called protein threading to construct a protein’s 3-D model.

PatternHunter

PatternHunter, written in Java, can identify all approximate repeats in a complete genome in a short time using little memory on a desktop computer. Its key features are its patented algorithm and data structures and its Java implementation. The Java version of PatternHunter is just 40 KB, only 1% of the size of BLAST, while offering a large portion of its functionality.
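
For a sense of what repeat finding involves, the sketch below indexes every substring of length k (a k-mer) and reports those that occur more than once. It finds exact repeats only and is not PatternHunter's patented algorithm, which handles approximate repeats. The function name, the toy sequence and the choice of k are assumptions for illustration.

```python
from collections import defaultdict


def find_exact_repeats(genome, k):
    """Map each k-mer that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for start in range(len(genome) - k + 1):
        positions[genome[start:start + k]].append(start)
    return {
        kmer: starts for kmer, starts in positions.items() if len(starts) > 1
    }


# "ACG" occurs at offsets 0 and 6 in this toy sequence.
repeats = find_exact_repeats("ACGTTTACGA", 3)
print(repeats)
```

The index needs memory proportional to the genome length, which hints at why compact data structures matter for whole-genome work.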

COPIA

COPIA (COnsensus Pattern Identification and Analysis) is a protein structure analysis tool for discovering motifs (conserved regions) in a family of protein sequences. Such motifs can then be used to determine whether new protein sequences belong to the family, to predict the secondary and tertiary structure and function of proteins, and to study the evolutionary history of the sequences.
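
Once a motif is known, testing new sequences for family membership can be as simple as a pattern match. The sketch below assumes the motif is expressed as a regular expression. The motif itself, the sequences and the function name are invented for illustration, and it says nothing about how COPIA discovers motifs in the first place.

```python
import re

# Hypothetical motif: C, any two residues, C, then 3 to 5 residues, then H.
MOTIF = re.compile(r"C.{2}C.{3,5}H")


def family_members(sequences, motif=MOTIF):
    """Return the names of the sequences that contain the motif."""
    return [name for name, seq in sequences.items() if motif.search(seq)]


# Made-up protein sequences.
candidates = {
    "seq1": "MKCAACGGGSHLL",   # contains C-AA-C-GGGS-H
    "seq2": "MKTAYIAKQRQIS",   # no cysteine at all, so no match
    "seq3": "AACPECKLMNHPQ",   # contains C-PE-C-KLMN-H
}

members = family_members(candidates)
print(members)
```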

Java in Bioinformatics

Research centers are scattered around the globe, range from private to academic settings, and use a wide variety of hardware and operating systems. For this reason, the platform-independent Java is emerging as a key player in bioinformatics. Physiome Sciences’ computer-based biological simulation technologies and Bioinformatics Solutions’ PatternHunter are two examples of the growing adoption of Java in bioinformatics.

Perl in Bioinformatics

String manipulation, regular expression matching, file parsing and data format interconversion are common text-processing tasks in bioinformatics. Perl excels at such tasks and is used by many developers. Perl has no standard modules designed specifically for bioinformatics, but developers have written their own modules for the purpose. These have become quite popular and are coordinated by the BioPerl project.
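
The text-processing tasks listed above are sketched here in Python rather than Perl. The example parses FASTA-formatted text, converts it to tab-separated lines, searches for a pattern with a regular expression and computes a reverse complement. The sequence names and contents are made up, and the helper names are assumptions. They are not taken from BioPerl or any other library.

```python
import re

# Made-up FASTA text: a ">" header line followed by sequence lines.
FASTA_TEXT = """>demo1 hypothetical sequence
ATGGCC
ATTGTA
>demo2 another made-up entry
GGGTTTAAA
"""

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def parse_fasta(text):
    """File parsing: return {name: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]   # first word of the header
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def reverse_complement(dna):
    """String manipulation: complement each base, then reverse."""
    return dna.translate(COMPLEMENT)[::-1]


records = parse_fasta(FASTA_TEXT)

# Format interconversion: FASTA to "name<TAB>sequence" lines.
tsv = "\n".join(f"{name}\t{seq}" for name, seq in records.items())

# Regular expression matching: offsets of every ATG in demo1.
atg_positions = [m.start() for m in re.finditer("ATG", records["demo1"])]

print(tsv)
print(atg_positions, reverse_complement("ATGGCC"))
```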

BioJava

The BioJava Project is dedicated to providing Java tools for processing biological data, including objects for manipulating sequences, dynamic programming, file parsers and simple statistical routines.

BioPerl

The BioPerl project is an international association of developers of Perl tools for bioinformatics and provides an online resource for modules, scripts and web links for developers of Perl-based software.

BioXML

A part of the BioPerl project, this is a resource that gathers XML documentation, DTDs and XML-aware tools for biology in one location.

Rashmi Sahu
