Bioinformatics

A brief introduction

The development and application of computational methods to acquire, store, organize, archive and visualize biological data [1]

What it is

“Bioinformatics is the field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned” [2]

“Bioinformatics is the application of computer technology to the management and analysis of biological data. The result is that computers are being used to gather, store, analyse and merge biological data.” [3]

“Bioinformatics is conceptualising biology in terms of molecules and applying informatics techniques (derived from disciplines such as applied maths, computer science and statistics) to understand and organise the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.” [4]

Some typical applications

  • Sequence analysis, gene prediction

    One of the oldest areas in which bioinformatics developed into a science of its own is the analysis of the information encoded in our own genes.

  • Other data analysis

    All experiments need (and should get) computational support at some point. From a few calculations in an Excel file to trained classifiers from the field of data mining, there are various ways to gain a better understanding of the experimental data.

  • Pathway reconstruction, systems biology

    The analysis of various types of experiments hopefully leads to insights into the underlying biological mechanisms. The discovery of how the individual parts (molecules or other entities) act together in a coordinated network is one of the most exciting areas of research.

  • Data management and integration

    How can the large amounts of data produced by, e.g., high-throughput experiments or the sequencing of the human genome be handled?
    Databases have been developed that either specialize in one type of data (e.g. promoter regions) or try to integrate various other sources in a higher-level meta-database or data warehouse.
    Various technologies and formats have been developed to allow the interchange and integration of data. The Distributed Annotation System (DAS) is one of the protocols that can be used to exchange and display this type of data. It is easy to set up your own server, e.g. using ProServer (Perl), to stream data from a GFF file or a database. The main BioDAS page provides all the details.
    Public databases for molecular biology data play a central role in research.
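
The GFF files mentioned above are plain tab-separated text, so a feature line can be read with a few lines of Perl. The following is only a minimal sketch: the helper name parse_gff_line and the example feature are made up for illustration, the column names follow the usual nine-column GFF layout, and a real DAS server such as ProServer does considerably more than this.

```perl
use strict;
use warnings;

# Parse one tab-separated GFF feature line into a hash reference.
# Returns undef for comments, blank lines and lines with fewer than 8 columns.
# (parse_gff_line is a hypothetical helper, not part of any library.)
sub parse_gff_line {
    my ($line) = @_;
    chomp $line;
    return undef if $line =~ /^\s*(#|$)/;
    my @f = split /\t/, $line, -1;
    return undef unless @f >= 8;
    my %feature;
    @feature{qw(seqid source type start end score strand phase)} = @f[0 .. 7];
    # the ninth column (attributes) is optional in this sketch
    $feature{attributes} = defined $f[8] ? $f[8] : '';
    return \%feature;
}

# Example with a made-up feature line
my $example = join "\t", 'chrX', 'example', 'gene', 100000, 200000,
                         '.', '+', '.', 'ID=gene1';
my $feature = parse_gff_line($example);
printf "%s %s %d-%d (%s)\n", @{$feature}{qw(seqid type start end strand)};
```

Returning a hash reference per feature keeps the sketch independent of any particular database schema; the attributes column is left unparsed because its syntax differs between GFF versions.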

 

Sources

  1. M. Chicurel: Bioinformatics: Bringing it all together. Nature 419, 751-757.
  2. National Center for Biotechnology Information, http://www.ncbi.nlm.nih.gov/
  3. BioPlanet website: Links, Jobs, and more
  4. N.M. Luscombe, D. Greenbaum, M. Gerstein: What is Bioinformatics? A Proposed Definition and Overview of the Field. Method. Inform. Med. 4/2001.

ENSEMBL coding

An introduction to the world of genome information you can
access with the Ensembl annotation system.

The EnsEMBL project not only creates and provides data on genome annotation, it also maintains a powerful interface to work with this data programmatically (via an API). The large set of modules written in Perl can be used, for example, to

  • fetch information on your favorite genomic region
  • get data on specific genes in different organisms
  • annotate the clones on your microarray
  • start your own gene-prediction project

There is also a Java API, but it is no longer being developed. A Ruby API is being developed externally. Also of special interest is the BioMart system, which allows complex cross-database queries through a simple interface.
More information can be found on the EnsEMBL pages. For specific coding questions, subscribe to the mailing list.

Code snippets to get you started:

  1. Connect to the database

    #connect to a defined database using the DBAdaptor
    # (used in examples 1-8)
    
    use Bio::EnsEMBL::DBSQL::DBAdaptor;
    my $db = Bio::EnsEMBL::DBSQL::DBAdaptor->new(
    -host => 'ensembldb.ensembl.org',
    -dbname => 'homo_sapiens_core_42_36b',
    -user => 'anonymous',
    );
    
    #OR connect automatically using the Registry
    # (used in example 9 ff.)
    #please see
    # http://www.ensembl.org/info/software/registry/index.html
    
    use Bio::EnsEMBL::Registry;
    Bio::EnsEMBL::Registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous'
    );
  2. Fetch a specific chromosome

    # [connect (1)]
    
    my $chrom  = "X";
    my $slice_adaptor = $db->get_SliceAdaptor;
    my $slice = $slice_adaptor->fetch_by_region("chromosome", $chrom);
    print "\nhave slice of ".$slice->seq_region_name." ".
          $slice->seq_region_start."-".$slice->seq_region_end;
  3. Fetch a specific genomic region

    # [connect (1)]
    
    my $chrom  = "X";
    my $start  = 100000;
    my $end    = 200000;
    my $strand = 1;
    my $slice_adaptor = $db->get_SliceAdaptor;
    my $slice = $slice_adaptor->fetch_by_region(
    				"chromosome",
    				$chrom,
    				$start,
    				$end,
    				$strand);
    print "\nhave slice of ".$slice->seq_region_name." ".
        $slice->seq_region_start."-".$slice->seq_region_end;
  4. Fetch all chromosomes

    # [connect (1)]
    
    my $slice_adaptor = $db->get_SliceAdaptor;
    my @chromosomes;
    foreach my $chr ( @{ $slice_adaptor->fetch_all('chromosome') } ) {
       #print out information
       print $chr->seq_region_name.", ".$chr->start." - ".$chr->end."\n";
       #store the names
       push @chromosomes, $chr->seq_region_name;
       #or work with the chromosome...
    }
  5. Fetch genes

    # [connect (1)]
    
    #all genes from a slice, e.g. a chromosome
    # [get_slice]
    my @genes = @{ $slice->get_all_Genes() };
    
    #specific gene, using EnsEMBL-ID
    my $gene_adaptor = $db->get_GeneAdaptor;
    my $gene = $gene_adaptor->fetch_by_stable_id("ENSG00000147892");
    
    #specific gene, using gene symbol (short name);
    #several genes may share a symbol, so a list is returned
    my @named_genes = @{$gene_adaptor->fetch_all_by_external_name("ADAMTSL1")};
    
  6. Get more information on the gene

    # [connect (1)]
    
    # [get_gene (5)]
    
    #print basic information
    print "genomic localization:\t".$gene->seq_region_name.
          "\t".$gene->start."\t".$gene->end.
          "\t".$gene->strand."\t".$gene->biotype."\n";
    print "description:\t".$gene->description."\n";
    #print further information and IDs
    foreach my $dbEntry ( @{ $gene->get_all_DBEntries } ) {
       if( $dbEntry->database eq "HUGO" or
           $dbEntry->database eq "HGNC"){
          print "symbol:\t".$dbEntry->display_id."\n";
       }
       else{
          print "ID:\t".$dbEntry->database.":\t".
                $dbEntry->display_id."\n";
       }
    }
    
  7. Work with the gene

    # [connect (1)]
    
    # [get_gene (5)]
    
    foreach my $transcript ( @{ $gene->get_all_Transcripts } ) {
       print $transcript->dbID."\t".$transcript->start."\t".
       	 $transcript->end."\t".$transcript->strand."\n";
       foreach my $exon ( @{ $transcript->get_all_Exons } ) {
          print "\t".$exon->dbID."\t".$exon->start."\t".
          	    $exon->end."\n";
       }
    }
  8. Use SQL to fetch information

    # [connect (1)]
    
    my $query = "SELECT gene_id, seq_region_id, seq_region_start,
    	  seq_region_end, seq_region_strand FROM gene LIMIT 5;";
    my $sth = $db->dbc->prepare($query);
    $sth->execute();
    while(my ($id, $region, $start, $end, $strand) = $sth->fetchrow_array) {
       print "gene $id: $region, $start, $end, $strand\n";
    }
  9. Get homologous genes using EnsEMBL::Compara

    use Bio::EnsEMBL::DBSQL::DBAdaptor;
    use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;
    
    #use the Registry for a simple connection setup,
    #please see
    # http://www.ensembl.org/info/software/registry/index.html
    use Bio::EnsEMBL::Registry;
    Bio::EnsEMBL::Registry->load_registry_from_db(
    		-host    => 'ensembldb.ensembl.org',
    		-user    => 'anonymous'
    		);
    
    #get compara adaptors
    my $ma =  Bio::EnsEMBL::Registry->get_adaptor(
    			'compara', 'compara', 'Member')
       or die "\n$@\ncan't get adaptor 1.\n";
    my $ha =  Bio::EnsEMBL::Registry->get_adaptor(
    			'compara', 'compara', 'Homology')
       or die "\n$@\ncan't get adaptor 2.\n";
    
    #fetch human gene from core database
    my $query_species = "Homo_sapiens";
    my $gene_id       = "ENSG00000147892";
    
    #fetch source gene
    my $member = $ma->fetch_by_source_stable_id(
    		"ENSEMBLGENE", $gene_id)
       or die "\ncan't find gene $gene_id in compara.\n";
    my $sourceGenome = $member->genome_db->dbID;
    print "\nsource gene ($query_species): ".$member->stable_id;
    
    #get all homologues from other species
    my $other_species = "Mus_musculus";
    my $homologies = $ha->fetch_by_Member_paired_species(
    			$member, $other_species);
    
    #or from all species
    #my $homologies = $ha->fetch_by_Member($member);
    
    #display all results
    foreach my $homologie (@$homologies) {
      foreach my $member_attrib (@{$homologie->get_all_Member_Attribute}) {
        my ($newmember, $attrib) = @$member_attrib;
    
        if ($newmember->genome_db->dbID != $sourceGenome) {
          print "\nhomologue: ".$newmember->stable_id.
          	    " / ".$newmember->taxon_id.
                ": ".$newmember->chr_name.
                " ".$newmember->chr_start.
                "-".$newmember->chr_end;
        }
      }
    }
  10. Get GeneOntology terms for a gene using EnsEMBL & GO::AppHandle

    use Bio::EnsEMBL::DBSQL::DBAdaptor;
    # [connect (1)]
    
    #use GO::AppHandle for GO logic if possible!
    use GO::AppHandle;
    my %args = (
          -dbhost => 'sin.lbl.gov',
          -dbname => 'go',
    	);
    my $apph = GO::AppHandle->connect( \%args );
    
    # [get_gene (5)]
    
    #get GO information
    if ( $gene->is_known ) {
      foreach my $link ( @{ $gene->get_all_DBLinks } ) {
        if ( $link->database eq "GO" ) {
    
          #show GO term
          print $link->display_id.": ";
    
          #look up the term by accession and print its ancestor terms
          #(assumes GO::AppHandle's get_term accepts {acc => ...})
          my $go = $apph->get_term( { acc => $link->display_id } );
          get_parent($go) if $go;
          print "\n";
    
        }
      }
    }
    
    #print all parent terms recursively, followed by the term itself
    sub get_parent {
      my $term = shift;
    
      my $parent_terms = $apph->get_parent_terms($term);
      foreach my $parent_term (@$parent_terms) {
        get_parent($parent_term);
      }
      #skip the ontology root and the three top-level categories
      my $name = $term->name();
      if ( ( $name ne "Gene_Ontology" )
            && ( $name ne "molecular_function" )
            && ( $name ne "cellular_component" )
            && ( $name ne "biological_process" ) ) {
          print $name."(".$term->type."), ";
      }
    }
  11. Get GeneOntology terms for a gene using EnsEMBL only

    use Bio::EnsEMBL::DBSQL::DBAdaptor;
    use Bio::EnsEMBL::Registry;
    
    # connect with general registry
    my $registry = 'Bio::EnsEMBL::Registry';
    $registry->load_registry_from_db(
                     '-host'       => 'ensembldb.ensembl.org',
                     '-user'       => 'anonymous',
                     '-db_version' => '58',
                     '-verbose'    => '0',
                    );
    # get adaptors
    my $goa  = $registry->get_adaptor( 'Multi', 'Ontology', 'GOTerm' ) or die "Can't get GO adaptor\n";
    my $ga   = $registry->get_adaptor( 'Human', 'Core', 'Gene' );
    
    my $id = "ENSG00000006062";
    my $gene = $ga->fetch_by_stable_id($id);
    
    #get GO information
    foreach my $link ( @{ $gene->get_all_DBLinks } ) {
      if ( $link->database eq "GO" ) {
        my $term_id = $link->display_id;
        my $term_name = '-';
        my $term = $goa->fetch_by_accession($term_id);
        if($term and $term->name){
          $term_name = $term->name;
        }
        print $gene->stable_id.": $term_id ($term_name)\n";
    
        #fetch complete GO hierarchy
        #(skip terms that were not found in the ontology database)
        next unless $term;
        foreach my $ancestor_term (@{ $term->ancestors() }){
          print "\t".$ancestor_term->accession." (".$ancestor_term->name.")\n";
        }
    
      }
    }
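
A note on walking the GO hierarchy by hand, as in example 10: GO is a directed acyclic graph, so a term can reach the same ancestor through several parents, and a plain recursive walk may report that ancestor more than once. The sketch below shows one way to collect each ancestor only once. It runs on a made-up parent table with invented term IDs and does not call the EnsEMBL or GO::AppHandle APIs; collect_ancestors is a hypothetical helper.

```perl
use strict;
use warnings;

# Toy parent table (term => list of direct parents); the IDs are made up.
my %parents = (
    'T:4' => [ 'T:2', 'T:3' ],
    'T:2' => ['T:1'],
    'T:3' => ['T:1'],
    'T:1' => [],
);

# Collect every ancestor of $term exactly once, in depth-first order.
# $seen is a hash reference shared across the recursion to skip repeats.
sub collect_ancestors {
    my ( $term, $seen ) = @_;
    $seen ||= {};
    my @found;
    foreach my $parent ( @{ $parents{$term} || [] } ) {
        next if $seen->{$parent}++;
        push @found, $parent, collect_ancestors( $parent, $seen );
    }
    return @found;
}

print join( ", ", collect_ancestors('T:4') ), "\n";
```

With a real API the lookup in %parents would be replaced by the call that returns a term's direct parents; the bookkeeping in $seen stays the same.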