Bioinformatics Toolbox

PCQ Bureau

DNA and RNA are the nucleic acids that store the hereditary information about an organism. These macromolecules have a defined structure, which biologists can analyze with the help of bioinformatics tools and databases. We look at a few of these.

Databases 



A few popular databases are GenBank from NCBI (National Center for Biotechnology Information), SwissProt from the Swiss Institute of Bioinformatics and PIR from the Protein Information Resource. 

GenBank



GenBank (Genetic Sequence Databank) is one of the fastest-growing repositories of known genetic sequences. It uses a flat-file structure, that is, ASCII text files that are readable by both humans and computers. In addition to sequence data, GenBank files contain information such as accession numbers, gene names, phylogenetic classification and references to published literature.
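Because the flat file is plain text organized by keywords, a short script can pull out individual fields. The following is a minimal sketch in Python, assuming a heavily abbreviated record; the sample data and the `parse_genbank` helper are hypothetical, and real records carry many more fields (features, references and so on) that this sketch ignores.

```python
import re

# Hypothetical, heavily abbreviated GenBank-style record (illustrative data only).
SAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   UNA 01-JAN-2000
DEFINITION  Hypothetical example sequence.
ACCESSION   X00001
ORIGIN
        1 atggcgtacg ttagcctaga tcga
//
"""

def parse_genbank(text):
    """Extract top-level keyword fields and the sequence from one record."""
    record = {"sequence": ""}
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if in_origin:
            # Sequence lines hold a position number and space-separated blocks;
            # keep only the letters.
            record["sequence"] += re.sub(r"[^a-zA-Z]", "", line).upper()
        elif line.startswith("ORIGIN"):      # sequence data starts on the next line
            in_origin = True
        elif line[:1].isalpha():             # top-level keywords start in column 1
            keyword, _, value = line.partition(" ")
            record[keyword.lower()] = value.strip()
    return record

record = parse_genbank(SAMPLE_RECORD)
print(record["accession"], len(record["sequence"]))
```

For production work, an established parser such as the ones maintained by the BioPerl or BioJava projects is a safer choice than a hand-written one.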

SwissProt



This is a protein sequence database that provides a high level of integration with other databases and has a very low level of redundancy (that is, few identical sequences are present in the database).

PIR-PSD



PIR (Protein Information Resource) produces and distributes the PIR-International Protein Sequence Database (PSD). It is the most comprehensive and expertly annotated protein sequence database.

Markup languages

XML. Bioinformatics databases can be built on flat files or on relational database management systems (RDBMS) such as Oracle, IBM DB2 or MySQL. These existing databases and data formats can be translated into XML. The aim is to design good XML documents that capture the flexible data structures found in bioinformatics. What is ultimately needed is a way to store XML documents efficiently in RDBMS-based systems, so that the documents can be queried and those queries combined with data held in other relational tables (such as sequence or expression data).



XML provides a solution for data migration among programming languages. Using modules like XML::Parser, it is possible to move structured data objects between C/C++, Java and Perl without having to worry about the encoding and decoding details.
AnatML (Anatomical Markup Language). This is an XML-based language used to describe anatomy, in particular for modeling the human musculoskeletal system, and facilitating data exchange among contributing scientists.
BSML (Bioinformatic Sequence Markup Language). This is a document type definition (DTD) for the representation of molecular biological data. Project data and links are stored in BSML documents. It provides management tools, a visualization interface and containers for sequences and other bioinformatic data.
BIOML (Bioinformatic Polymer Markup Language). Developed by Proteometrics, BIOML allows full specifications of all experimental information known about polymers.
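As a rough illustration of the XML translation described above, the sketch below turns one flat record into an XML element and reads it back, using only Python's standard library. The record contents and the element names (`entry`, `organism`, `sequence`) are assumptions made for this example; they do not follow BSML, BIOML or any other published schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical flat record; field names are illustrative, not a standard schema.
flat_record = {
    "accession": "X00001",
    "organism": "Example organism",
    "sequence": "ATGGCGTACG",
}

def record_to_xml(record):
    """Wrap a flat record in an <entry> element with one child per field."""
    root = ET.Element("entry", accession=record["accession"])
    for field in ("organism", "sequence"):
        ET.SubElement(root, field).text = record[field]
    return root

# Serialize to text that any XML-aware language can consume.
xml_text = ET.tostring(record_to_xml(flat_record), encoding="unicode")
print(xml_text)

# Read the document back and query it without caring how it was encoded.
parsed = ET.fromstring(xml_text)
print(parsed.get("accession"), parsed.findtext("sequence"))
```

The same text could be handed to a Perl, Java or C++ parser unchanged, which is the data-migration benefit the XML paragraph refers to.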
Databases like Oracle and Sybase

Protein sequence databases are classified as primary, secondary and composite depending upon the content stored in them. PIR and SwissProt are primary databases that contain protein sequences as ‘raw’ data. Secondary databases (like Prosite) contain the information derived from protein sequences. Primary databases are combined and filtered to form non-redundant composite databases.
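The combine-and-filter step can be pictured with a small sketch. It assumes the simplest possible redundancy rule, namely that two entries are redundant only when their sequences are exactly identical; the database names, identifiers and sequences are made up, and real composite databases may apply more elaborate criteria.

```python
def build_composite(*databases):
    """Merge primary databases, keeping one entry per distinct sequence."""
    by_sequence = {}
    for db_name, entries in databases:
        for entry_id, sequence in entries.items():
            # The first database listed wins when a sequence appears again.
            by_sequence.setdefault(sequence, (db_name, entry_id))
    return {f"{db}:{eid}": seq for seq, (db, eid) in by_sequence.items()}

# Hypothetical primary databases (names, identifiers and sequences are made up).
db_a = ("DB_A", {"A1": "MKTAYIAK", "A2": "MVLSPADK"})
db_b = ("DB_B", {"B1": "MKTAYIAK", "B2": "MNIFEMLR"})

composite = build_composite(db_a, db_b)
print(composite)
```

Here entry `B1` is dropped because its sequence duplicates `A1`, so the composite holds three entries instead of four.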

Tools 



Both standard and customized products are available to meet the requirements of particular projects. There is data-mining software that retrieves data from genomic sequence databases, and there are visualization tools to analyze and retrieve information from proteomic databases. These can be classified as homology and similarity tools, protein functional analysis tools, sequence analysis tools and miscellaneous tools. Here is a brief description of a few of these.

BLAST



BLAST (Basic Local Alignment Search Tool) falls under the category of homology and similarity tools. It is a set of search programs, available for several platforms including Windows, that performs fast similarity searches whether the query is a protein or a DNA sequence. A nucleotide query can be compared against the nucleotide sequences in a database, and a protein database can be searched for matches to a query protein sequence.
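To give a feel for how a fast similarity search can avoid comparing the query against every database sequence in full, here is a minimal sketch of a word-lookup step: database sequences are indexed by short words, and candidates are ranked by how many query words they share. This is a simplified illustration, not the BLAST algorithm itself, which goes on to extend and score the matches. The function names, the word size and the sequences are assumptions for the example.

```python
from collections import defaultdict

def build_word_index(database, word_size):
    """Map every word of length word_size to the sequences containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - word_size + 1):
            index[seq[i:i + word_size]].add(name)
    return index

def rank_by_shared_words(query, index, word_size):
    """Count how many distinct query words each database sequence shares."""
    query_words = {query[i:i + word_size]
                   for i in range(len(query) - word_size + 1)}
    hits = defaultdict(int)
    for word in query_words:
        for name in index.get(word, ()):
            hits[name] += 1
    # Highest count first; ties are broken alphabetically by name.
    return sorted(hits.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical nucleotide database (made-up sequences).
database = {
    "seq1": "ACGTACGTGACC",
    "seq2": "TTTTGGGGCCCC",
    "seq3": "GACCTTACGTTT",
}
index = build_word_index(database, word_size=4)
ranking = rank_by_shared_words("ACGTACGA", index, word_size=4)
print(ranking)
```

Sequences that share no word with the query, such as `seq2` here, never appear in the ranking, which is where the speed-up comes from.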

NCBI has also introduced a new queuing system for BLAST (QBLAST) that allows users to retrieve results at their convenience and to format those results multiple times with different formatting options.

EMBOSS



EMBOSS (European Molecular Biology Open Software Suite) is a sequence-analysis software package. It can work with data in a range of formats and can retrieve sequence data transparently from the Web. The package also provides extensive libraries, allowing other scientists to release their software as open source. It provides a set of sequence-analysis programs and supports all UNIX platforms.

ClustalW



ClustalW is a fully automated multiple sequence alignment tool for DNA and protein sequences. It returns the best match over the total length of the input sequences, whether they are proteins or nucleic acids.
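Aligning sequences over their total length is known as global alignment. The sketch below shows the classic dynamic-programming approach (Needleman-Wunsch) for just two sequences, as a building block for understanding what a multiple alignment tool does; it is not ClustalW's own procedure, and the scoring values are arbitrary assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment of two sequences."""
    def pair_score(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):                 # leading gaps in b
        score[i][0] = i * gap
    for j in range(1, cols):                 # leading gaps in a
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[-1][-1]

aligned_a, aligned_b, total = global_align("GATTACA", "GATACA")
print(aligned_a)
print(aligned_b)
print(total)
```

With these scores, the best alignment of the two example sequences has six matches and one gap.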

RasMol



It is a powerful research tool for displaying the structure of DNA, proteins and smaller molecules. Protein Explorer, a derivative of RasMol, is an easier-to-use program.

PROSPECT



PROSPECT (PROtein Structure Prediction and Evaluation Computer ToolKit) is a protein-structure prediction system that employs a computational technique called protein threading to construct a protein’s 3-D model. 

PatternHunter 



PatternHunter, written in Java, can identify all approximate repeats in a complete genome in a short time using little memory on a desktop computer. Its distinguishing features are its patented algorithm and data structures and its Java implementation. The Java version of PatternHunter is just 40 KB, only 1% of the size of BLAST, while offering a large portion of its functionality.

COPIA 



COPIA (COnsensus Pattern Identification and Analysis) is a protein structure analysis tool for discovering motifs (conserved regions) in a family of protein sequences. Such motifs can then be used to determine whether new protein sequences belong to the family, to predict the secondary and tertiary structure and function of proteins, and to study the evolutionary history of the sequences.
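The idea of using a motif to test family membership can be sketched as follows. The example assumes the conserved region has already been located and aligned in each family member, builds a simple pattern with one character class per column, and checks new sequences against it. This is an illustration of the concept only, not COPIA's method; the motif strings and helper names are made up.

```python
import re

def consensus_pattern(aligned_motifs):
    """Build a regular expression with one character class per column."""
    parts = []
    for column in zip(*aligned_motifs):
        residues = sorted(set(column))
        if len(residues) == 1:
            parts.append(residues[0])                    # fully conserved
        else:
            parts.append("[" + "".join(residues) + "]")  # observed variants
    return "".join(parts)

def contains_motif(sequence, pattern):
    """Return True if the motif pattern occurs anywhere in the sequence."""
    return re.search(pattern, sequence) is not None

# Hypothetical aligned occurrences of a motif in three family members.
family_motifs = ["CAHCG", "CSHCG", "CAHCA"]
pattern = consensus_pattern(family_motifs)
print(pattern)

print(contains_motif("MKTCSHCAWQ", pattern))  # carries the motif
print(contains_motif("MKTCWHCGWQ", pattern))  # W is not allowed at position 2
```

A pattern built this way only accepts residues already seen in the family, so it becomes more permissive as more member sequences are added.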

Java in Bioinformatics



Since research centers are scattered all around the globe, ranging from private to academic settings, and a range of hardware and OSs are being used, Java is emerging as a key player in bioinformatics. Physiome Sciences’ computer-based biological simulation technologies and Bioinformatics Solutions’ PatternHunter are two examples of the growing adoption of Java in bioinformatics.

Perl in Bioinformatics

String manipulation, regular expression matching, file parsing and data format interconversion are common text-processing tasks in bioinformatics. Perl excels at such tasks and is used by many developers. There are no standard Perl modules designed specifically for bioinformatics. However, developers have written their own modules for the purpose, several of which have become quite popular and are coordinated by the BioPerl project.
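The text-processing tasks listed above look much the same in any scripting language. The sketch below shows them in Python rather than Perl, purely to keep the examples in this article in one language: it parses FASTA-formatted text, searches for a site with a regular expression, manipulates a string to get the reverse complement, and converts the records to a tab-delimited format. The sequence data and function names are made up for illustration.

```python
import re

# Hypothetical FASTA-formatted input (made-up sequences).
FASTA_TEXT = """\
>seq1 hypothetical example
ACGTTGCA
GGATCC
>seq2 hypothetical example
TTGAATTCAA
"""

def parse_fasta(text):
    """File parsing: return {identifier: sequence} from FASTA text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]       # identifier is the first word
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()    # join wrapped sequence lines
    return records

def reverse_complement(dna):
    """String manipulation: complement each base, then reverse."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sites(sequence, site):
    """Regular expression matching: start positions of a site (0-based)."""
    return [m.start() for m in re.finditer(site, sequence)]

records = parse_fasta(FASTA_TEXT)
for name, seq in records.items():
    print(name, len(seq), find_sites(seq, "GAATTC"), reverse_complement(seq))

# Format interconversion: FASTA to tab-delimited text.
tabular = "\n".join(f"{name}\t{seq}" for name, seq in records.items())
print(tabular)
```

Projects such as BioPerl and BioJava package routines of this kind so that developers do not have to rewrite them for every script.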

BioJava



The BioJava Project is dedicated to providing Java tools for processing biological data, including objects for manipulating sequences, dynamic programming routines, file parsers and simple statistical routines.

BioPerl



The BioPerl project is an international association of developers of Perl tools for bioinformatics and provides an online resource for modules, scripts and web links for developers of Perl-based software.

BioXML



A part of the BioPerl project, this is a resource that gathers XML documentation, DTDs and XML-aware tools for biology in one location.

Rashmi Sahu
