HIPS Talk: "Building 'Google search engines' for microbial sequence"

Tuesday, February 4, 2020, at 16:30 s.t. in Bldg E8.1, Seminar Room (Ground Floor)

Prof. Dr. Zamin Iqbal (European Bioinformatics Institute (EMBL-EBI)) will give a presentation entitled
"Building 'Google search engines' for microbial sequence"

When?: Tuesday, February 4, 2020, at 16:30 s.t.
Where?: in Bldg E8.1, Seminar Room (Ground Floor)
Host: Prof. Dr. Olga Kalinina

There is an opportunity to talk with the speaker before the talk.   
For details and to make an appointment, please contact Prof. Dr. Olga Kalinina at 0681-98806-3600 or by email: olga.kalinina@helmholtz-hips.de  
Guests are welcome!

Although we continue to sequence and store progressively more genomes, we have barely scratched the surface of the genetic diversity of microbial life on Earth. One of the unexpected challenges is that we don't have very good access even to the genomes that are publicly available. There are two reasons: first, 80% of publicly archived data is deposited as raw unassembled sequence, which is not (usefully) searchable by BLAST. Second, the full DNA archive is too big for standard tools like BLAST, and is doubling every 18 months.

The ability to query these data for sequence search terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web search to produce a searchable data structure named BItsliced Genomic Signature Index (BIGSI, www.ncbi.nlm.nih.gov/pubmed/30718882). We indexed the entire global corpus of 447,833 bacterial and viral whole-genome sequence datasets using four orders of magnitude less storage than previous methods - you can try the search out at bigsi.io. I will talk about how we applied our BIGSI search function to rapidly find resistance genes MCR-1, MCR-2, and MCR-3, determine the host-range of 2,827 plasmids, and quantify antibiotic resistance in archived datasets.
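
For readers unfamiliar with bitsliced signature indexes, the following toy Python sketch illustrates the general idea: each dataset's k-mers go into a Bloom filter of the same size, the filters are stored as the columns of a bit matrix, and a query looks up only the rows its k-mers hash to and ANDs them together. This is not the BIGSI implementation. The parameters K, M and H, the function names, and the sample sequences are all assumptions chosen to keep the example small.

```python
import hashlib

K = 5    # k-mer length (toy value, an assumption for this sketch)
M = 256  # bits per Bloom filter, i.e. number of matrix rows (toy value)
H = 3    # number of hash functions (toy value)


def kmers(seq, k=K):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positions(kmer, m=M, h=H):
    """Map a k-mer to h bit positions using seeded SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{kmer}".encode()).digest()[:8], "big") % m
        for seed in range(h)
    ]


def build_index(datasets):
    """Build the bit matrix row by row.

    rows[r] is an integer whose bit j is set when dataset j's
    Bloom filter has bit r set (one column per dataset).
    """
    names = list(datasets)
    rows = [0] * M
    for j, name in enumerate(names):
        for kmer in kmers(datasets[name]):
            for r in positions(kmer):
                rows[r] |= 1 << j
    return names, rows


def search(names, rows, query):
    """Return datasets whose filters contain every k-mer of query.

    Bloom filters never miss a k-mer that was inserted, but they can
    report false positives, so hits are candidates, not proofs.
    """
    query_kmers = kmers(query)
    if not query_kmers:  # query shorter than K: nothing to look up
        return []
    hits = (1 << len(names)) - 1  # start with every dataset as a candidate
    for kmer in query_kmers:
        for r in positions(kmer):
            hits &= rows[r]  # keep only datasets with this bit set
    return [name for j, name in enumerate(names) if (hits >> j) & 1]


# Hypothetical toy "datasets"; real inputs would be raw reads or assemblies.
datasets = {
    "sample_a": "ATGGCGTACGTTAGCCTAGGATCC",
    "sample_b": "TTGACCGTAAGCTTGGCATTCGAA",
}
names, rows = build_index(datasets)
result = search(names, rows, "GCGTACGTTAG")
```

Because every k-mer of the query was inserted into the filter for sample_a, that dataset is guaranteed to appear in result. Whether other datasets also appear depends on Bloom filter false positives, which larger filters make rarer.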

These kinds of methods open up valuable resources for science, in just the same way that Google opens up the internet - this is an enabling technology rather than an end in itself. I will also discuss limitations of searching in DNA rather than protein space, and plans to address that.