A brief introduction

What it is

“The development and application of computational methods to acquire, store, organize, archive and visualize biological data.” [1]

“Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned” [2]

“Bioinformatics is the application of computer technology to the management and analysis of biological data. The result is that computers are being used to gather, store, analyse and merge biological data.” [3]

“Bioinformatics is conceptualising biology in terms of molecules and applying informatics techniques (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.” [4]

Some typical applications

  • Sequence analysis, gene prediction: One of the oldest areas where bioinformatics developed into a science of its own was the task of analyzing the information encoded in our own genes.
  • Other data analysis: All experiments need (and should get) computational support at some point. From a few calculations in an Excel file to trained classifiers from the field of data mining, there are various ways to get a better understanding of the experimental data.
  • Pathway reconstruction, systems biology: The analysis of various types of experiments hopefully leads to insights into the underlying biological mechanisms. The discovery of how the individual parts (molecules or other entities) act together in a coordinated network is one of the exciting areas of research.
  • Data management and integration: How can the large amounts of data produced by, e.g., high-throughput experiments or the sequencing of the human genome be handled?
    Databases have been developed that either specialize in one type of data (e.g. promoter regions) or try to integrate various other sources in a higher-level meta-database or data warehouse.
    Various technologies and formats have been developed to allow the interchange and integration of data.
    Public databases for molecular biology data play a central role in research.
    A long list of open source projects now provides simple or highly sophisticated ways to access and handle all kinds of data.
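
To give a feel for the most basic kind of sequence analysis, here is a minimal Perl sketch that computes the GC content and the reverse complement of a DNA string. The function names gc_content and reverse_complement and the example sequence are made up for this illustration; real projects would more likely use an existing toolkit.

```perl
use strict;
use warnings;

# Fraction of G and C bases in a DNA sequence (case-insensitive).
# Returns 0 for an empty sequence.
sub gc_content {
    my ($seq) = @_;
    return 0 unless length($seq);
    my $gc = ( $seq =~ tr/GCgc// );    # tr with empty replacement only counts
    return $gc / length($seq);
}

# Reverse complement of a DNA sequence; characters other than
# A, C, G, T (e.g. N) are left unchanged.
sub reverse_complement {
    my ($seq) = @_;
    my $rc = scalar reverse $seq;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# hypothetical example sequence
my $seq = "ATGCGCTA";
printf "GC content: %.2f\n", gc_content($seq);
print "reverse complement: ", reverse_complement($seq), "\n";
```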

Sources

  1. M. Chicurel. Bioinformatics: Bringing it all together. Nature 419, 751-757.
  2. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
  3. BioPlanet website: Links, Jobs, and more.
  4. N.M. Luscombe, D. Greenbaum, M. Gerstein. What is Bioinformatics? A Proposed Definition and Overview of the Field. Method. Inform. Med. 4/2001.


ENSEMBL coding


An introduction to the world of genome information you can
access with the Ensembl annotation system.

The EnsEMBL project was one of the reasons that got me excited about bioinformatics. It not only creates and provides data on genome annotation, it also maintains a powerful interface to work with this data programmatically (via API). The large set of modules written in Perl can for example be used to

  • fetch information on your favorite genomic region
  • get data on specific genes in different organisms
  • annotate the clones on your microarray
  • start your own gene-prediction project

There are also APIs in Java and Ruby, though they are no longer being developed. Of special interest is the BioMart system, which allows complex cross-database queries through a simple interface.
More information can be found on the EnsEMBL pages. For specific coding questions, subscribe to the mailing list.

Code snippets to get you started:

1. Connect to the database

# connect to a defined database using the DBAdaptor
# (used in examples 1-8)

use Bio::EnsEMBL::DBSQL::DBAdaptor;
my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host   => 'ensembldb.ensembl.org',
    -dbname => 'homo_sapiens_core_42_36b',
    -user   => 'anonymous',
);

# OR connect automatically using the Registry
# (used in example 9 ff.)
# please see
# http://www.ensembl.org/info/software/registry/index.html

use Bio::EnsEMBL::Registry;
Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);
2. Fetch a specific chromosome

# [connect (1)]

my $chrom  = "X";
my $slice_adaptor = $db->get_SliceAdaptor;
my $slice = $slice_adaptor->fetch_by_region("chromosome", $chrom);
print "\nhave slice of ".$slice->seq_region_name." ".
      $slice->seq_region_start."-".$slice->seq_region_end;
3. Fetch a specific genomic region

# [connect (1)]

my $chrom  = "X";
my $start  = 100000;
my $end    = 200000;
my $strand = 1;
my $slice_adaptor = $db->get_SliceAdaptor;
my $slice = $slice_adaptor->fetch_by_region(
				"chromosome",
				$chrom,
				$start,
				$end,
				$strand);
print "\nhave slice of ".$slice->seq_region_name." ".
    $slice->seq_region_start."-".$slice->seq_region_end;
4. Fetch all chromosomes

# [connect (1)]

my $slice_adaptor = $db->get_SliceAdaptor;
my @chromosomes;
foreach my $chr ( @{ $slice_adaptor->fetch_all('chromosome') } ) {
   #print out information
   print $chr->seq_region_name.", ".$chr->start." - ".$chr->end."\n";
   #store the names
   push @chromosomes, $chr->seq_region_name;
   #or work with the chromosome...
}
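
The names stored in @chromosomes are plain strings, and I would not rely on them coming back in karyotype order. A minimal sketch for sorting them, numeric names first and then the rest alphabetically, could look like this. The helper sort_chromosomes is a made-up name and not part of the EnsEMBL API.

```perl
use strict;
use warnings;

# Sort chromosome names: purely numeric names in numeric order first,
# all other names (X, Y, MT, ...) alphabetically afterwards.
# Hypothetical helper, not part of the EnsEMBL API.
sub sort_chromosomes {
    my @names = @_;
    return sort {
        my $a_num = ( $a =~ /^\d+$/ );
        my $b_num = ( $b =~ /^\d+$/ );
          $a_num && $b_num ? $a <=> $b
        : $a_num           ? -1
        : $b_num           ? 1
        :                    $a cmp $b;
    } @names;
}

# example input in arbitrary order
my @sorted = sort_chromosomes(qw(X 10 2 MT 1 Y));
print join( ", ", @sorted ), "\n";
```

In example 4 you would call it as sort_chromosomes(@chromosomes) after the loop.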
5. Fetch genes

# [connect (1)]

#all genes from a slice, e.g. a chromosome
# [get slice (2) or (3)]
my @genes = @{ $slice->get_all_Genes() };

#specific gene, using EnsEMBL-ID
my $gene_adaptor = $db->get_GeneAdaptor;
my $gene = $gene_adaptor->fetch_by_stable_id("ENSG00000147892");

#specific gene, using gene symbol (short name);
#the call returns a list, as a symbol can match more than one gene
my @genes_by_symbol =
   @{ $gene_adaptor->fetch_all_by_external_name("ADAMTSL1") };
6. Get more information on the gene

# [connect (1)]

# [get_gene (5)]

#print basic information
print "genomic localization:\t".$gene->seq_region_name.
      "\t".$gene->start."\t".$gene->end.
      "\t".$gene->strand."\t".$gene->biotype."\n";
print "description:\t".$gene->description."\n";
#print further infos, ids
my $symbol = "";
foreach my $dbEntry ( @{ $gene->get_all_DBEntries } ) {
   if( $dbEntry->database eq "HUGO" or
       $dbEntry->database eq "HGNC"){
      #remember the official gene symbol
      $symbol = "symbol:\t".$dbEntry->display_id."\n";
   }
   else{
      print "ID:\t".$dbEntry->database.":\t".
      	    $dbEntry->display_id."\n";
   }
}
#print the symbol collected in the loop (if any)
print $symbol;
7. Work with the gene

# [connect (1)]

# [get_gene (5)]

foreach my $transcript ( @{ $gene->get_all_Transcripts } ) {
   print $transcript->dbID."\t".$transcript->start."\t".
   	 $transcript->end."\t".$transcript->strand."\n";
   foreach my $exon ( @{ $transcript->get_all_Exons } ) {
      print "\t".$exon->dbID."\t".$exon->start."\t".
      	    $exon->end."\n";
   }
}
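
The start and end values printed above are 1-based and inclusive, so the length of a feature is end - start + 1. As a small sketch of what you can do with these numbers, the following sums up the exon lengths of one transcript. The exons are given here as made-up [start, end] pairs instead of real Exon objects, and both function names are hypothetical.

```perl
use strict;
use warnings;

# Length of a feature given 1-based, inclusive start/end coordinates.
sub feature_length {
    my ( $start, $end ) = @_;
    return $end - $start + 1;
}

# Sum of the exon lengths of one transcript; exons are passed as
# hypothetical [start, end] pairs rather than EnsEMBL Exon objects.
sub exonic_length {
    my (@exons) = @_;
    my $total = 0;
    $total += feature_length( @$_ ) for @exons;
    return $total;
}

# made-up coordinates for illustration
my @exons = ( [ 100, 199 ], [ 300, 349 ], [ 500, 500 ] );
print "exonic length: ", exonic_length(@exons), "\n";
```

Inside the exon loop of example 7 you could collect the pairs with push @exons, [ $exon->start, $exon->end ].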
8. Use SQL to fetch information

# [connect (1)]

my $query = "SELECT gene_id, seq_region_id, seq_region_start,
	  seq_region_end, seq_region_strand FROM gene LIMIT 5;";
my $sth = $db->dbc->prepare($query);
$sth->execute();
while(my ($id, $region, $start, $end, $strand) = $sth->fetchrow_array) {
   print "gene $id: $region, $start, $end, $strand\n";
}
9. Get homologous genes using EnsEMBL::Compara

use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

#use Registry file for a simple connection setup,
#please see
#  http://www.ensembl.org/info/software/registry/index.html
use Bio::EnsEMBL::Registry;
Bio::EnsEMBL::Registry->load_registry_from_db(
		-host    => 'ensembldb.ensembl.org',
		-user    => 'anonymous'
		);

#get compara adaptors
my $ma =  Bio::EnsEMBL::Registry->get_adaptor(
			'compara', 'compara', 'Member')
   or die "\n$@\ncan't get adaptor 1.\n";
my $ha =  Bio::EnsEMBL::Registry->get_adaptor(
			'compara', 'compara', 'Homology')
   or die "\n$@\ncan't get adaptor 2.\n";

#fetch human gene from core database
my $query_species = "Homo_sapiens";
my $gene_id       = "ENSG00000147892";

#fetch source gene (die instead of return, as this is not inside a sub)
my $member = $ma->fetch_by_source_stable_id(
		"ENSEMBLGENE", $gene_id)
   or die "\ncan't find a member for $gene_id.\n";
my $sourceGenome = $member->genome_db->dbID;
print "\nsource gene ($query_species): ".$member->stable_id;

#get all homologues from other species
my $other_species = "Mus_musculus";
my $homologies = $ha->fetch_by_Member_paired_species(
			$member, $other_species);

#or from all species
#my $homologies = $ha->fetch_by_Member($member);

#display all results
foreach my $homologie (@$homologies) {
  foreach my $member_attrib (@{$homologie->get_all_Member_Attribute}) {
    my ($newmember, $attrib) = @$member_attrib;

    if ($newmember->genome_db->dbID != $sourceGenome) {
      print "\nhomologue: ".$newmember->stable_id.
      	    " / ".$newmember->taxon_id.
            ": ".$newmember->chr_name.
            " ".$newmember->chr_start.
            "-".$newmember->chr_end;
    }
  }
}
10. Get GeneOntology terms for a gene using EnsEMBL & GO::AppHandle

use Bio::EnsEMBL::DBSQL::DBAdaptor;
# [connect (1)]

#use GO::AppHandle for GO logic if possible!
use GO::AppHandle;
my %args = (
      -dbhost => 'sin.lbl.gov',
      -dbname => 'go',
	);
my $apph = GO::AppHandle->connect( \%args );

# [get_gene (5)]

#get GO infos
if ( $gene->is_known ) {
  foreach my $link ( @{ $gene->get_all_DBLinks } ) {
    if ( $link->database eq "GO" ) {

      #show GO term
      print $link->display_id.": ";

      #get the ancestor terms; the term object is looked up by its
      #accession (assumes get_term accepts an 'acc' constraint)
      my $go = $apph->get_term( { acc => $link->display_id } );
      get_parent($go) if $go;
      print "\n";

    }
  }
}

#fetch all parent terms recursively
sub get_parent {
  my $term = shift;

  #print the ancestors first, then the term itself
  my $parent_terms = $apph->get_parent_terms($term);
  foreach my $parent_term (@$parent_terms) {
    get_parent($parent_term);
  }
  #skip the ontology root terms
  my $name = $term->name();
  if ( ( $name ne "Gene_Ontology" )
        && ( $name ne "molecular_function" )
	&& ( $name ne "cellular_component" )
	&& ( $name ne "biological_process" ) ) {
      print $name."(".$term->type."), ";
  }
}
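
A recursion like get_parent can print the same ancestor several times, because in GO a term can be reached along more than one path. The following is a minimal sketch of the same idea that visits each term only once. It runs on a made-up in-memory parent table (%parents) instead of a GO database, and collect_ancestors is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical toy ontology: each term maps to its direct parents.
# 'root' is reachable from 'leaf' via two different paths.
my %parents = (
    'leaf'  => [ 'mid_a', 'mid_b' ],
    'mid_a' => ['root'],
    'mid_b' => ['root'],
    'root'  => [],
);

# Collect all ancestors of a term recursively (depth-first).
# The %$seen hash makes sure every term is reported only once.
sub collect_ancestors {
    my ( $term, $seen ) = @_;
    $seen ||= {};
    my @ancestors;
    for my $parent ( @{ $parents{$term} || [] } ) {
        next if $seen->{$parent}++;
        push @ancestors, $parent, collect_ancestors( $parent, $seen );
    }
    return @ancestors;
}

print join( ", ", collect_ancestors('leaf') ), "\n";
```

With a real GO connection, the lookup in %parents would be replaced by the call that returns the parent terms, and the hash key would be the term accession.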
11. Get GeneOntology terms for a gene using EnsEMBL only

use Bio::EnsEMBL::DBSQL::DBAdaptor;
use Bio::EnsEMBL::Registry;

# connect with general registry
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
                 '-host'       => 'ensembldb.ensembl.org',
                 '-user'       => 'anonymous',
                 '-db_version' => '58',
                 '-verbose'    => '0',
                );
# get adaptors
my $goa  = $registry->get_adaptor( 'Multi', 'Ontology', 'GOTerm' )
   or die "Can't get GO adaptor\n";
my $ga   = $registry->get_adaptor( 'Human', 'Core', 'Gene' );

my $id = "ENSG00000006062";
my $gene = $ga->fetch_by_stable_id($id);

#get GO infos
foreach my $link ( @{ $gene->get_all_DBLinks } ) {
  if ( $link->database eq "GO" ) {
    my $term_id = $link->display_id;
    my $term_name = '-';
    my $term = $goa->fetch_by_accession($term_id);
    if($term and $term->name){
      $term_name = $term->name;
    }
    print $gene->stable_id.": $term_id ($term_name)\n";

    #fetch complete GO hierarchy (only if the term was found)
    next unless $term;
    foreach my $ancestor_term (@{ $term->ancestors() }){
      print "\t".$ancestor_term->accession." (".$ancestor_term->name.")\n";
    }

  }
}