The Biologist and the Internet

by Alfred Simbun

Illustration by Charis Loke

Biology is the study of living things, which are known to be highly complex. To uncover the truth about any domain of interest across a pool of living organisms, biologists use a range of research methods, and technology plays a crucial role in accelerating information retrieval. Computer technology and the internet have evolved alongside the biological sciences. These two disparate disciplines are equally important, and they have become more strongly intertwined since Tim Berners-Lee created the World Wide Web and the Human Genome Project was announced. From that era onwards, biological information became easily accessible and data retrieval became more efficient. Thanks to these widely available data, new biological insights are being made, and this process is transforming the way biologists conceive scientific hypotheses. For instance, online databases like PubMed at the National Center for Biotechnology Information (NCBI) [1], the Protein Data Bank (PDB) [2] and Web of Knowledge (ISI) [3], together with web search engines like Google and Hakia [4], have greatly aided the advancement of genomic and proteomic research.

The pre-internet era was a time of traditional information retrieval, when biologists relied solely on scientific papers published by other scientists. Extracting data from piles of papers was, and still is, a tedious task. Today, technologies like the internet have transformed information retrieval into knowledge discovery, in which computerised methods aim to interpret information much as a human would (this is also known as semantic technology). Scientists no longer present research outcomes only as tabulated data, graphs and charts, but also as web-like data linkages that represent the relationships between connected research outcomes. This has a greater impact on the way we understand living things.

The Internet and the World Wide Web (WWW) have been accessed routinely by scientists since the birth of these technologies, and thousands of online web portals pertaining to biology now exist [5]. These web portals provide useful information, allowing faster virtual interaction among users and the discovery of information that is vital to their scientific research. The network is made up of electronic links that connect related web portals, allowing users to browse from portal to portal. Apart from web browsing, information can be gathered through web search engines such as Google, which permits users to find almost any information on the Internet. Google revolutionised knowledge discovery in the past decade by providing greater access to a wealth of information and delivering it with high precision when a user performs a single keyword search [6].

In the past, information was gathered by consulting books in the library and/or by periodically purchasing published material such as scientific journals [6]. The aforementioned technologies have profoundly changed the way people find the information they are interested in and get things done, anywhere and increasingly while on the go [7]. Instead of visiting the library or going to a nearby bookshop, ‘Googling’, a de facto verb in the English language by mid-2003 that means “to query information through Google.com”, helps to retrieve documents with just a few clicks [2]. Today, the term is widely accepted as a synonym for ‘searching’. The goal of these technologies is to make information publicly available. With the emergence of scientific web portals such as the NCBI and PDB, scientists have gained access to scientific information such as research results, analytical commentaries, and even genomic and proteomic sequences, which other researchers can use for similar or different research purposes. Also, in the past two decades, academic institutions have begun to publish their scientific work on their websites, making up-to-date research findings and announcements of recent discoveries available. A greater paradigm shift can be identified here. In the past decade, the highest level of information was a collection of individually processed data presented in various forms of diagrams and tables. Owing to the impact of Google, NCBI, PDB, ISI, Hakia and many more, the widespread use of ontology-driven data representations has transformed this information into a huge network of knowledge. A well-known example is Facebook, where individual information is shared on a huge scale, creating endless points of knowledge transfer.

These technologies have revolutionised information retrieval in the life sciences, especially in the domain of biotechnology, not just in the application of computer systems but also in the realm of knowledge ontologies [8]. The revolution lies in how biotechnology has been brought to a more advanced level of understanding simply via ‘Googling’, a simple act of information retrieval. Today, scientists increasingly demand an information retrieval platform that can find missing information in order to quench their thirst for scientific wisdom [9]. One solution to this problem is semantic technology, an advanced computer platform that can retrieve missing data, build a network of information and create a reliable knowledge hub with higher result precision than traditional keyword-based search engines [10].
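The contrast between keyword-based and semantic retrieval can be illustrated with a minimal Python sketch. It is only a toy: the ontology entries, the document texts and the function names are all hypothetical assumptions made for illustration, and real semantic search engines are far more sophisticated than this simple query expansion.

```python
# Toy ontology: each term maps to related or broader terms.
# All entries are illustrative assumptions, not taken from a real ontology.
TOY_ONTOLOGY = {
    "microrna": {"small rna", "non-coding rna"},
    "small rna": {"non-coding rna"},
}

# Hypothetical document collection standing in for an online database.
DOCUMENTS = {
    "doc1": "plant microRNA targets in drought response",
    "doc2": "fungal small RNA pathways and gene silencing",
    "doc3": "protein structure prediction for enzymes",
}


def keyword_search(query, documents):
    """Return ids of documents whose text contains the literal query."""
    q = query.lower()
    return sorted(doc_id for doc_id, text in documents.items() if q in text.lower())


def semantic_search(query, documents, ontology):
    """Expand the query with related ontology terms, then match any of them."""
    terms = {query.lower()} | ontology.get(query.lower(), set())
    return sorted(
        doc_id
        for doc_id, text in documents.items()
        if any(term in text.lower() for term in terms)
    )


if __name__ == "__main__":
    print(keyword_search("microRNA", DOCUMENTS))
    print(semantic_search("microRNA", DOCUMENTS, TOY_ONTOLOGY))
```

In this sketch the plain keyword search finds only the document that literally mentions the query, whereas the ontology-expanded search also returns the document about a related concept, which is the kind of "missing" information the semantic approach is meant to surface.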

The sophistication of these retrieval algorithms has paved the way for more people, both within and outside science-minded communities, to gain more information. Online web search engines are now equipped with advanced capabilities such as image retrieval, voice recognition, natural language processing (NLP) and ontology-driven web links that connect a single result to multiple webpages [11]. Because biological information evolves rapidly with the volume of novel discoveries, biologists need advanced solutions that can help them decode biological information (stored in the form of images, tables, diagrams, ontologies, etc.) and understand the meaning of big data (the semantics of connected information). All of this can be done with the aforementioned online web search engines.

Apart from the above-mentioned web portals, there are other specialised online biological databases that provide their own search engine technologies. These have been developed to cater for biologists’ specific needs in retrieving DNA/RNA sequences, protein structures and related information. Before the advent of computerised databases, biologists faced the same issue as other professionals: manual work that consisted of browsing through hundreds, if not thousands, of paper records kept in libraries. Drug discovery projects undertaken by pharmaceutical companies used to take at least a few decades to produce a single candidate drug for a single disease, yet through the use of these intelligent portals it is now possible to shorten the discovery period [12]. This is all thanks to the field of bioinformatics, a life science discipline that helps biologists accelerate research by developing web resources like those of the NCBI and PDB.

Information reliability is still a major concern in the field of knowledge retrieval [13]. When a biologist searches for information using a single keyword, for example ‘microRNA’, online web search engines return thousands of hits (the results associated with the keyword) and rank them from the most accessed results down to the least prominent ones. In the past, results were presented in much the same way as they are today, but without any advanced filtering [14]. Because those search engines were not algorithmically advanced, biologists received thousands of junk results that were not relevant to what was being searched. In such cases information overload can occur, and the integrity of the results is highly questionable. The main expectation for the next generation of information retrieval is that search results will be reliable, which is the most basic requirement for creating a reliable knowledge network.
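The difference that filtering makes to a ranked list of hits can be sketched in a few lines of Python. This is a toy under stated assumptions: scoring by raw keyword count, the `min_score` threshold, the function name `rank_hits` and the sample documents are all hypothetical, and no real search engine ranks results this simply.

```python
def rank_hits(keyword, documents, min_score=1):
    """Rank documents by how often the keyword occurs, dropping weak hits.

    Scoring by raw word count is an illustrative assumption only.
    """
    keyword = keyword.lower()
    scored = []
    for doc_id, text in documents.items():
        # Count exact word matches, ignoring case.
        score = text.lower().split().count(keyword)
        if score >= min_score:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by document id for a stable order.
    return sorted(scored, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical documents standing in for search results.
SAMPLE_DOCS = {
    "a": "microRNA regulates gene expression and microRNA biogenesis",
    "b": "a review of microRNA in plants",
    "c": "cooking recipes for the weekend",
}

if __name__ == "__main__":
    # Without filtering (min_score=0), the irrelevant document is still listed.
    print(rank_hits("microRNA", SAMPLE_DOCS, min_score=0))
    # With the default threshold, only relevant documents remain.
    print(rank_hits("microRNA", SAMPLE_DOCS))
```

Setting `min_score=0` mimics an unfiltered result list in which irrelevant entries still appear at the bottom, while the default threshold removes them.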

The idea behind online web search engines in the context of ‘Web 3.0’ and beyond is to make computers think like real humans. Although it may realistically take decades to fulfil such an ambitious expectation, efforts to upgrade existing technology have already begun. One reason for the difficulty is the amount of biological data generated daily by biologists around the world, as reflected in the number of papers published in scientific journals. Clearly, there is still room for computer technologies to grow in helping to filter this vast amount of information.

About the Author

Alfred Simbun is a Technical Product Specialist at QIAGEN Biotechnology Malaysia and is passionate about bioinformatics and gene networks. He obtained his Diploma in Information Technology in 2004, received his BSc in Bioinformatics (Hons) from the Management & Science University (MSU) Shah Alam, and subsequently earned a Master’s degree in Bioinformatics from the University of Malaya (UM). His research focus includes plant microRNA, fungal small RNA studies, and semantic technology and ontology development. His strong interest in small RNA discovery, fungal comparative genomics and gene networks has led him to be actively involved in fungal bioinformatics analysis. Apart from these, he is also drawn to the social sciences, particularly communicative sciences, medical ethics, environmental sustainability and, last but not least, Catholic social doctrine. He is currently pursuing a Master’s degree in social sciences at UM and can be contacted at [email protected]. Find out more about Alfred by visiting his Scientific Malaysian profile at http://www.scientificmalaysian.com/members/alfredsimbun/.

References

[1] National Center for Biotechnology Information (NCBI). Retrieved on 18th June 2013 from http://www.ncbi.nlm.nih.gov/.

[2] Protein Data Bank (PDB). Retrieved on 18th June 2013 from http://www.rcsb.org/pdb/home/home.do.

[3] Thomson Reuters (formerly ISI) Web of Knowledge. Retrieved on 18th June 2013 from http://wokinfo.com/.

[4] Hakia, a Semantic search engine. Retrieved on 18th June 2013 from http://hakia.com/.

[5] Herve Recipon and Wojciech Makalowski, 1997. The biologist and the World Wide Web: an overview of the search engines technology, current status and future perspectives. Current Opinion in Biotechnology, 8:115-118. DOI 0958-1669-008-00115.

[6] Jan Brophy and David Bawden, 2005. Is Google enough? Comparison of an internet search engine with academic library resources. Aslib Proceedings: New Information Perspectives, vol.57, no.6, pp.498-512. DOI 10.1108/00012530510634235.

[7] Jacques Bughin, Laura Corb, James Manyika, Olivia Nottebohm, Michael Chui, Borja de Muller Barbat and Remi Said. The impact of Internet technologies: Search. McKinsey Global Institute, July 2011.

[8] Robert Stevens, Alan Rector and Duncan Hull, 2010. What is an ontology? Ontogenesis. Retrieved on 12th August 2013 from http://ontogenesis.knowledgeblog.org/66.

[9] Google Blog’s “Technologies behind Google Ranking” by Amit Singhal, 2008. Retrieved on 18th June 2013 from http://googleblog.blogspot.com/2008/07/technologies-behind-google-ranking.html.

[10] Duygu Tümer, Mohammad Ahmed Shah, and Yiltan Bitirim, 2009. An Empirical Evaluation on Semantic Search Performance of Keyword-Based and Semantic Search Engines: Google, Yahoo, Msn and Hakia. 2009 Fourth International Conference on Internet Monitoring and Protection, 978-0-7695-3612-5/09. DOI 10.1109/ICIMP.2009.16.

[11] Hamid R. Jamali and Saeid Asadi, 2010. Google and the scholar: the role of Google in scientists’ information-seeking behaviour. Online Information Review, vol.34, no.2, pp.282-294. DOI 10.1108/14684521011036990.

[12] Zhengwu Lu and Jing Su, 2010. Clinical data management: Current status, challenges, and future directions from industry perspectives. Open Access Journal of Clinical Trials 2010:2 93–105.

[13] Reliability of Information on the Internet by Anton Vedder and Robert Wachbroit. Retrieved on 20th August 2013 from http://arno.uvt.nl/show.cgi?fid=14203.

[14] Martin O. Andago, Teh P.L Phoebe, Bassam A.M Thanoun, 2010. Evaluation of a Semantic Search Engine against a Keyword Search Engine Using First 20 Precision. International Journal for the Advancement of Science and Arts, vol.1, no.2.