<html>
	<title>BOSC2002 Talk Abstracts</title>
	<style type="text/css"><!--
		h1 { color: #336; font-style: normal; font-weight: bolder; font-family: Arial, Helvetica }
		h2 { color: #336; font-style: normal; font-weight: bolder; font-family: Arial, Helvetica }
		h3 { color: #336; font-style: normal; font-weight: bolder; font-family: Arial, Helvetica }-->
	</style>

	<body bgcolor="white">
		<table border="0" cellpadding="2" cellspacing="2">
			<tr>
				<td colspan="2" align="left" valign="top">
<a href="index.html"><img src="bosc2002.png" alt="BOSC 2002" width="196" height="196" hspace="10"></a></td>
				<td valign="center" hspace=20>

<img src="title.gif" alt="Bioinformatics Open Source Conference" width="394" height="53">
					<p>
            <font face="Arial,Helvetica,Geneva,Swiss,SunSans-Regular">
	      <font size=+5><B>Talk Abstracts</B></font><br>
	      View <a href="program.html">the program</a><p>
	      Download <a href="slides/">the slides</a><p>
            </font>
</tr>
</table>
<P><HR>

<p>
<a name="birney">
1 Aug, 9:15 - 10:15  <STRONG>Ewan Birney Keynote</STRONG><br>

<P>

Are we growing up? Reports from the open bioinformatics foundation and
the open bioinformatics database access project.

<p>
<hr width="50%">
<a name="bioperl">
1 Aug, 10:15 - 10:40 <strong>Bioperl project report</strong><br>
Jason Stajich / Duke University<br>
View slides: [<a href="slides/2002-08-01-Stajich-Bioperl.pdf">PDF</a>]<br>

<p>
The Bioperl project recently released version 1.0 of our toolkit for life
science programming.  The components include modules which support
sequences, sequence reading and writing, sequence features and annotations
and features with simple and complex locations, multiple sequence
alignments, phylogenetic trees, BLAST &amp; FASTA parsing, building and
accessing local sequence databases, accessing remote sequence databases,
retrieving and manipulating bibliographic references, and interoperating with
BioCORBA and OBDA biological object standards.  The toolkit has been used
in a wide array of settings, from simple laboratory tasks to serving as the
building blocks for enterprise solutions in EnsEMBL and the Generic Model
Organism Database (GMOD).

<p>
The toolkit is built with an easily extensible architecture which can be
used for quickly building perl programs to address specific research
questions.  Several examples of its use to answer real laboratory
questions will be discussed.

<p>
License: Perl Artistic.


<p>
<hr width="50%">
<a name="bioperl-pipeline">
1 Aug, 11:00 - 11:25 <strong>Bioperl-Pipeline System</strong><br>
Shawn Hoon / Fugu Genome Project, Singapore<br>
View slides: [<a href="slides/2002-08-01-Hoon-Pipeline.pdf">PDF</a>]<br>

<p>
The prominence of the in-silico laboratory coupled with the explosion of
comparative genomics has made the nature of computational biological
analysis increasingly complex. This is exacerbated by the plethora of
software that is now available. It is not uncommon for an analysis to
involve large amounts of data from disparate sources and formats,
different programs with specific requirements and output formats that must
be suitable for human interpretation. There thus exists a need for a
flexible workflow framework that will hide such complexity, allowing
scientists to focus on their analysis, while providing bioinformaticians a
coherent methodology with which to extend the system. It was with this in
mind that we developed the bioperl-pipeline system. Largely adapted from
the Ensembl Pipeline Annotation System, some of the features in the
current system include:

<ol>
  <li>  Handling of various input and output data formats from various databases.

  <li>  A bioperl interface to non-specific loadsharing software (LSF, PBS,
etc.) to ensure that the various analysis programs are run in the proper
order and complete successfully, with those that fail being re-run.

  <li>  A flexible pluggable bioperl interface that allows programs to be
      'pipeline-enabled'.
</ol>
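<p>
As an illustration only (this is not code from the bioperl-pipeline
package), the "run in order, re-run on failure" behaviour in the second
item could be sketched in Python 3.9 or later as follows. The function
<code>run_pipeline</code>, the analysis names and the in-process callables
are all hypothetical stand-ins; a real system would submit jobs to LSF or
PBS rather than call functions directly.

<pre>
```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, runners, max_attempts=3):
    """Run each analysis after its prerequisites, retrying failures.

    dependencies maps an analysis name to the set of analyses it needs first.
    runners maps an analysis name to a callable that returns True on success.
    Returns the number of attempts each analysis needed.
    """
    attempts = {}
    # static_order() yields every analysis after all of its prerequisites.
    for name in TopologicalSorter(dependencies).static_order():
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            if runners[name]():
                break  # success, move on to the next analysis
        else:
            # The loop ran out of attempts without a success.
            raise RuntimeError(f"{name} failed after {max_attempts} attempts")
    return attempts


# Hypothetical analyses: blast fails once and succeeds on the second try.
calls = {"blast": 0}


def flaky_blast():
    calls["blast"] += 1
    return calls["blast"] >= 2


dependencies = {
    "repeatmasker": set(),
    "blast": {"repeatmasker"},
    "genscan": {"repeatmasker"},
}
runners = {
    "repeatmasker": lambda: True,
    "blast": flaky_blast,
    "genscan": lambda: True,
}

result = run_pipeline(dependencies, runners)
print(result)
```
</pre>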

<p>
We are currently looking at extending the system in the following ways:

<ol>
  <li>  A 'grid-aware' system that allows jobs to be distributed over a
      bio-cluster network harnessing collective computing power that will
      be especially useful for small groups looking to perform
      compute-intensive analysis.

  <li>  A user-friendly click and drag GUI system to allow easy workflow
      design and job tracking.
</ol>

<p>
We are now applying this framework to our compara system for high
throughput multi-species studies. We will discuss the design and
implementation details of the bioperl-pipeline package.


<p>
<hr width="50%">
<a name="cabio">
1 Aug, 11:25 - 11:35<br>
<strong>Cancer Bioinformatics Infrastructure Objects (caBIO):
An open-source, object oriented API for biomedical informatics</strong><br>
<i>Peter A. Covitz</i>, Himanso Sahni, Scott Gustafson, and Kenneth Buetow<br>
National Cancer Institute Center for Bioinformatics<br>

<p>
The National Cancer Institute has established a Center for
Bioinformatics (NCICB) whose mission is to support the NCI's programs
in basic and clinical cancer research.  The NCICB is aggressively
pursuing a program to develop a core infrastructure and API for
biomedical information management and retrieval.  The initiative
employs industry-standard software engineering methodologies to develop data models,
middleware, vocabularies and ontologies for biomedical research.

<p>
caBIO is the primary programming interface to the core
infrastructure. caBIO objects are implemented using Java and Java Bean
technology, and represent biological and laboratory entities such as
genes, chromosomes, sequences, libraries, clones, pathways, and
ontologies. caBIO provides uniform API access to a variety of genomic,
biological, and clinical data sources including GenBank, Unigene,
LocusLink, Homologene, Ensembl, Golden Path, DAS servers, CGAP, NCI
Enterprise Vocabulary Services, and clinical trials protocols.  Any
client can retrieve HTML and XML from caBIO via HTTP.  Java-based
clients can further communicate with caBIO via the domain objects
provided by the caBIO JAR, while server components can communicate via
Java RMI.  Non-Java applications can communicate via SOAP. RDF is
currently used to advertise services to crawlers and agents, and a
UDDI registry is planned.  For its presentation layer, caBIO uses
servlets and JSPs under Jakarta Tomcat. All caBIO objects can be
transformed into XML, and XSL/XSLT is used to present data in
documents, web pages or other interfaces.

<p>
NCICB makes the caBIO interfaces available on its public servers, and
also makes the underlying software available for use at local
sites.  More information is available at
<a href="https://web.archive.org/web/20040606140611/http://ncicb.nci.nih.gov/core">http://ncicb.nci.nih.gov/core</a>.
The open source license covering caBIO software can be found at
<a href="https://web.archive.org/web/20040606140611/http://ncicb.nci.nih.gov/core/caBIO/developer_resources/core_jar/license">http://ncicb.nci.nih.gov/core/caBIO/developer_resources/core_jar/license</a>.

<p>
<hr width="50%">
<a name="biopython">
1 Aug, 11:35 - 12:00 <strong>Biopython and the Laboratory Scientist</strong><br>
Brad Chapman / University of Georgia<br>
View slides: [<a href="slides/2002-08-01-Chapman-Biopython.pdf">PDF</a>]<br>

<p>
Biopython is a collection of open-source tools in the Python
programming language. Developed by a collection of programmers from
around the world, the Biopython toolkit is designed to provide
re-usable code for anyone answering biological questions using
Python. Biopython has been around since 1999, and has a number of
active contributors and users. In this talk, I will briefly describe
the basic components provided in the Biopython toolkit. From there, I
will describe how Biopython can be used in an academic laboratory
environment, taking examples from my own lab. The emphasis will be on
utilization of Biopython code for automating everyday tasks faced by
wet lab researchers. I will try to show that Python and Biopython can
be used productively by researchers lacking formal training in
computer science.  Finally, I will describe integrating Biopython into
larger bioinformatics projects. Again, this will draw on my own
experience using Biopython and will describe how using Biopython can
help make your coding life easier when approaching a large
project. The aim of the entire talk is to convince you that using open
source libraries like Biopython is worth the time invested in learning
them.

<p>
<hr width="50%">
<a name="contract">
1 Aug, 1:50 -  2:15 <strong>The Open Source Authors' Contract</strong><br>
Steven Brenner / University of California, Berkeley
<p>
Most universities, national laboratories, companies, and other employers
have clauses in their employment contracts that prevent or restrict the
creation and use of open source software.  Indeed, it seems likely that
much of the biological open source software is being produced illegally,
in violation of institutions' terms.  While benign neglect of enforcement
of the institutions' regulations has led to a situation that is generally
acceptable, it is not ideal.

<p>
Several individuals have sought the ability to produce open source
software by seeking exemptions or variations of their institutions'
intellectual property agreements.  However, this is a painstaking process,
and the associated legal fees can be costly.  I propose that a general
contract be drawn up, which has standard terms for individuals to create
open source software without undue constraints.

<p>
Since this idea was first broached a year ago, there has been widespread
discussion regarding regulations governing production of open source
software.  This talk will provide a background to the motivation for the
Authors' contract, as well as recent responses which suggest productive
ways forward.

<p>
<hr width="50%">
<a name="biojava">
1 Aug, 2:15 -  2:40 <strong>BioJava Toolkit Progress</strong><br>
Matthew Pocock / BioJava Consulting Limited<br>

<p>
BioJava is an open-source software project that aims to provide an
industry-quality Java library for common bioinformatics tasks. BioJava
is part of the open-bio foundation. BioJava was started in the autumn of
1998, and now has over 25 developers. In the past two years, the core
development team has expanded from the original team of two to five.
This has brought with it a greater range of views and expertise, as well
as a greater stability. In parallel with this, we are in the process of
integrating unit testing to maintain the quality of the &gt;130,000 lines
of code and documentation in the core library.

<p>
BioJava has taken an active role in participating in the open-bio
hackathons. Representatives have attended both legs of the hackathon
(Tucson, AZ, USA and Cape Town, SA). During this time, several important
interoperable technologies were designed and implemented. These include a
registry file format for biological entities, an SQL schema for storage
of sequences and their annotations, BioCorba-based corba clients and
servers, bibliographic web services, web services for publishing
sequence data and flat file indexing. All of these have been implemented
in BioJava, and interoperate with implementations in the other open-bio
language projects, as well as with some external implementations.

<p>
Over the next year, we hope to mature the library's functionality in
areas related to sequence manipulation, pipeline management, alignments,
Sequence GUIs and file parsers. In parallel, we shall be integrating
code-generation, more flexible transaction management and ontology
representations with the current free-form annotation model and BioJava
interfaces to allow the representation of more fluid data types, and
more maintainable and robust implementation of standard interfaces.

<p>
<hr width="50%">
<a name="goet">
1 Aug, 2:40 -  2:50 <strong>GOET: the General Ontology Editing Tool</strong><br>
John Richter / Berkeley Drosophila Genome Project<br>

<p>
GOET is a Java application designed to facilitate the creation of
ontology schemas and data. GOET allows a user to define DAML+OIL-like
schemas and then populate those schemas with data. Data can be loaded from
and saved to DAML+OIL flat files, as well as numerous other formats.

<p>
GOET is highly customizable via pluggable editor kits. Editor kits are
Java jar files that define a custom user interface for GOET, tailored to a
particular kind of data. Editor kits allow programmers to create the most
efficient user interface for any given ontology. GOET comes with a generic
editor kit that can edit <i>any</i> ontology, making it easy for users to
experiment with new schemas.

<p>
GOET provides a strong toolkit for ontology editing, with automatic
support for history tracking, undo/redo, cycle checking, and other
important graph editing tools. This toolkit makes it easy for programmers
to develop new, powerful editor kits.
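<p>
To illustrate what "cycle checking" means for an ontology graph, here is
a minimal Python sketch of a depth-first search that reports whether any
term can reach itself. It is not GOET code (GOET is a Java application);
the function <code>has_cycle</code>, the dictionary layout and the example
terms are assumptions made for this illustration.

<pre>
```python
def has_cycle(parents):
    """Return True if the term graph contains a cycle.

    parents maps each term to the list of terms it points to
    (for example its is_a parents).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {}

    def visit(term):
        state[term] = GREY
        for parent in parents.get(term, []):
            colour = state.get(parent, WHITE)
            if colour == GREY:
                return True  # reached a term already on the current path
            if colour == WHITE and visit(parent):
                return True
        state[term] = BLACK
        return False

    return any(state.get(t, WHITE) == WHITE and visit(t) for t in list(parents))


# Hypothetical ontologies: the first is a valid DAG, the second loops back.
dag = {"kinase": ["enzyme"], "enzyme": ["protein"], "protein": []}
cyclic = {"a": ["b"], "b": ["c"], "c": ["a"]}

print(has_cycle(dag), has_cycle(cyclic))
```
</pre>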

<p>
Other information:<br>

GOET is being developed as part of the gmod project at
<a href="https://web.archive.org/web/20040606140611/http://sourceforge.net/projects/gmod">http://sourceforge.net/projects/gmod</a>.
<br>
Like all gmod components, GOET is distributed under the terms of the
Artistic License.

<p>
<hr width="50%">
<a name="hmm">
1 Aug, 2:50 -  3:00 <strong>GHMM &amp; HMMEd: A comprehensive HMM toolkit</strong><br>
Alexander Schliep / Max-Planck-Institut for Molecular Genetics<br>

<p>
Hidden Markov Models (HMMs) are one of the most successful tools
for analyzing biological sequences.

<p>
We have developed a graphical editor for HMMs called HMMEd which
allows users to create sophisticated models manually using a graphical
user interface. Hierarchical models are supported (e.g. a three
state model representing a single codon as one 'super state'),
as well as a wide range of HMM extensions and user data associated
with the states of the HMM. Graphical editors for discrete
emission distributions as well as mixtures of continuous pdfs
are integrated.

<p>
For the exchange of HMMs we propose an XML-based format which is
loosely based on GraphML, is hierarchical and also incorporates
necessary extensions for proper graphical display.

<p>
The GNU (pending permission from the FSF) HMM library (GHMM) is a
C-library providing efficient implementations of a comprehensive
collection of algorithms for both discrete and continuous
emission HMMs. Python bindings allow interactive work with HMMs
from the Python command line and, at some later stage, tight
integration with HMMEd, which is also written in Python using
Tkinter.

<p>
HMMEd (pronounced Hammered) and the GHMM are licensed under the LGPL.

<p>
<hr width="50%">
<a name="usability">
1 Aug, 3:00 -  3:10 <strong>Usability</strong><br>
Andrew Dalke / Dalke Scientific Software, LLC<br>
View slides: [<a href="slides/2002-08-01-Dalke/">HTML</a> |
  <a href="slides/2002-08-01-Dalke/usability.sxi">OpenOffice</a> |
  <a href="slides/2002-08-01-Dalke/usability.ppt">PPT</a>]<br>

<p>
Open source software is often said to be "unusable."  On the surface
this doesn't make sense because many of the projects are widely used
to do real work.  But usability isn't a binary value; it refers to
ease of use.  Two packages can be equally featureful, yet one can be much
more usable than the other.

<p>
A lot of research has gone into understanding how to make more usable
software.  This knowledge is starting to make its way into mainstream
software projects, but is still relatively unused in bioinformatics.
I'll discuss several reasons why this might be so, the major one being
that few even know this topic exists.

<p>
In my talk I'll cover some of the standard techniques of usability
design, including testing, persona development, use cases, and paper
prototyping.  These are simple, inexpensive techniques that can be
applied to almost any project to make it more usable and enjoyable.
To keep my presentation grounded, I'll include examples from my
experiences in applying them to real projects.

<p>
<hr width="50%">
<a name="eisen">
1 Aug, 4:00 -  5:00 <strong>Michael Eisen Keynote</strong><br>
View slides: [<a href="slides/2002-08-01-Eisen-BOSC.ppt">PPT</a> |
              <a href="slides/2002-08-01-Eisen-BOSC.pdf">PDF</a>]<br>
<p>

Creating an Electronic Public Library
of Scientific Knowledge.

<p>
<hr width="50%">
<a name="hide">
2 Aug, 9:00 - 10:00 <strong>Winston Hide Keynote</strong><br>
<p>
Dr. Hide will be presenting on the impact of Open Source in the Real World.
<p>
<hr width="50%">
<a name="omnigene">
2 Aug, 10:00 - 10:25 <strong>OmniGene</strong><br>
Brian Gilman / Whitehead Institute<br>
<p>
	The OmniGene project has produced modules to build web services
for data analysis, integration, and visualization. OmniGene accomplishes
this goal through the utilization of Java Enterprise and web service
technologies. The core API consists of modules to: 1) perform queries
across disparate databases, 2) transform queries into commonly used XML
formats, 3) parse the output of these queries into an object graph, 4)
visualize and share knowledge in a client-server or true peer-to-peer
network, 5) dynamically discover another web service, and 6) easily plug in
analysis applications. One major goal of the OmniGene development team is
to abstract away XML parsing, Enterprise Java Bean, and transaction code
from the bioinformatician so that they may concentrate on their data and
data analysis.

<p>
	The OmniGene system will utilize the output and API from the
BioMOBY project to perform dynamic discovery of services. The BioMOBY and
OmniGene developers are now working together to integrate their two
projects. OmniGene is an open source, open standards initiative and is
distributed under the BSD license.


<p>
<hr width="50%">
<a name="browsing">
2 Aug, 10:45 - 11:10 <strong><a href="#generic">Generic Genome Browser</a> &amp; <a href="#apollo">Apollo</a></strong><br>
Joint talk by Lincoln Stein &amp; Nomi Harris
<p>

<a name="generic">
<strong>The Generic Genome Browser: A Building Block for a Model Organism System</strong><br>
<i>Lincoln D. Stein</i>,
Allen Day,
Todd Harris,
Adrian Arva <br>
Cold Spring Harbor Laboratory<br>
<br>
ShengQiang Shu,
Suzanna Lewis,
Christopher Mungall<br>
Berkeley Drosophila Genome Project, Lawrence Berkeley Laboratory<br>
<br>
View slides: [<a href="slides/2002-08-02-Stein-GMOD.ppt">PPT</a>]<br>
<p>
The Generic Model Organism System Database Project (GMOD) seeks to
develop and release reusable software components for model organism
system databases. Here we describe the Generic Genome Browser
(GBrowse), a web-based application for displaying genomic annotations.
For the end user, features of the browser include the ability to
scroll and zoom through arbitrary regions of a genome, to enter a
region of the genome by searching for a landmark, full text search of all
annotations, the ability to enable and disable tracks and change their
relative order and appearance, the ability to upload private
annotations and view them in the context of the public ones, and the
user's ability to publish his own annotations to the community. For
the data provider, features of the browser software include reliance
on readily-available Open Source components, simple installation,
flexible configuration, and easy integration with other components of
a model organism system web site.

<p>
GBrowse is written in the Perl programming language and makes
extensive use of the BioPerl middleware layer. This gives the browser
the flexibility to take advantage of a number of underlying databases
and data sources, including ones based on the Distributed Annotation
System (DAS). For new developers, GBrowse uses a minimal MySQL-based
database called Bio::DB::GFF. Developers requiring a richer database
back end can use the GadFly database, an outgrowth of the Berkeley
Drosophila Genome Sequencing project.

<p>

GBrowse is currently used as the genome browser for the WormBase (<a href="https://web.archive.org/web/20040606140611/http://www.wormbase.org/">www.wormbase.org</a>) and FlyBase (<a href="https://web.archive.org/web/20040606140611/http://www.flybase.org/">www.flybase.org</a>) projects. Its
source code, example data and configuration files, and support are all
available at the GMOD web site,
<a href="https://web.archive.org/web/20040606140611/http://www.gmod.org/">http://www.gmod.org</a>.

<p>
<a name="apollo">
<strong>The Apollo Genome Annotation Tool</strong><br>

<i>Nomi L. Harris</i>, Suzanna E. Lewis, Mark Gibson, Colin Wiel, John Richter<br>
Berkeley Drosophila Genome Project<br>
<br>
Stephen M.J. Searle, Michele E. Clamp<br>
The Sanger Institute and the European Bioinformatics Institute<br>
<br>
View slides: [<a href="slides/2002-08-02-Harris-Apollo.ppt">PPT</a>]<br>

<p>
Apollo is an Open Source genome annotation viewer and editor. It was
developed as a collaboration between the Berkeley Drosophila Genome
Project (part of the FlyBase consortium) and The Sanger Institute in
Cambridge, UK. Apollo allows researchers to explore genomic
annotations at many levels of detail, and to perform expert annotation
curation, all in a graphical environment. It is being used by the
FlyBase biologists to make the final annotations on the finished
<i>Drosophila melanogaster</i> genome, and will also be the primary vehicle
for sharing these annotations with the community.

<p>

An increasing number of research groups are using Apollo as a starting
point for customizing their own annotation visualization tool. Because
Apollo was developed from the beginning to serve the needs of two
groups working on different organisms (fruitfly and human) with different
types of data, it was specifically designed to be flexible and
extensible. The Generic Model Organism Database (GMOD) project, which
aims to provide a complete ready-to-use toolkit for analyzing whole
genomes, has adopted Apollo as its annotation workbench.

<p>

Apollo can read annotation data in a variety of formats (including GAME
XML, GFF, and GenBank) via CGI, CORBA, and flat files, and the data
adapter interface makes it straightforward to add new formats.  The "look
and feel" of the display is also highly configurable. For example, rows
of data (known as "tiers") can be moved with the mouse; the threshold for
displaying results can be changed; and the colors of each result can be
customized, all from within Apollo.

<p>

Apollo is a powerful annotation curation workbench, simplifying the task of
biologists who are poring over thousands of computational results and
deciding how to summarize all the relevant data into a complete and
accurate description of the genome and, ultimately, the proteome. Gene
transcripts are easily created by dragging computational results and
dropping them in the annotation zone. Other tools within Apollo let the
curators edit the annotations in many ways, including adding canned or
customized comments. Validation of annotations is simplified by edge
matching, start/stop codons, and splice site detection.  The new Synteny
Viewer allows visualization of cross-genome comparisons.

<p>

Apollo is available
at SourceForge: <a href="https://web.archive.org/web/20040606140611/http://sourceforge.net/projects/gmod/">http://sourceforge.net/projects/gmod/</a>.
Like all gmod components, it is distributed under the terms of the
Artistic License.

<p>
<hr width="50%">
<a name="biomoby">
2 Aug, 11:10 - 11:35 <strong>BioMOBY</strong><br>
Mark Wilkinson / Plant Biotechnology Institute, NRC Canada <br>

<p>

The BioMOBY project will generate a web services registry for
biological data.  Two main components are required: 1: a defined set
of approximately 200-250 lightweight XSD templates describing basic
biological data objects, 2: a central registry, "MOBY-Central", which
stores the URL and protocol of a web service, as well as the input and
output objects that the service accepts and generates.  At present it
is planned that MOBY-Central, upon client request, will generate a
WSDL service specification on-the-fly and return this to the Client.
The Client then transacts the service directly from the service
provider.  Such a system differs in several ways from the UDDI model;
it simplifies participation of Servers by assuming the task of WSDL
document creation, but most importantly it is "object driven" in that
the Client may request that MOBY-Central report all services that can
accept the object in-hand as input.  This enables the dynamic
discovery of information without necessitating prior Client (user)
knowledge of the existence or location of this information.
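<p>
The "object driven" lookup described above can be illustrated with a
small Python sketch. It is not the MOBY-Central API: the
<code>Service</code> and <code>Registry</code> classes, the object type
names and the example.org URLs are all hypothetical, chosen only to show
a client asking which services accept the object it already holds.

<pre>
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """One registry entry: where a service lives and what it consumes and produces."""
    name: str
    url: str           # hypothetical endpoint
    protocol: str
    input_type: str    # object type the service accepts
    output_type: str   # object type the service generates


class Registry:
    """Toy stand-in for a central registry of web services."""

    def __init__(self):
        self._services = []

    def register(self, service):
        self._services.append(service)

    def services_accepting(self, object_type):
        """Return every registered service whose input matches the object in hand."""
        return [s for s in self._services if s.input_type == object_type]


registry = Registry()
registry.register(Service("fetch_sequence", "http://example.org/seq", "SOAP",
                          "GeneID", "DNASequence"))
registry.register(Service("translate", "http://example.org/translate", "SOAP",
                          "DNASequence", "ProteinSequence"))
registry.register(Service("blast", "http://example.org/blast", "SOAP",
                          "DNASequence", "BlastReport"))

# A client holding a DNASequence object asks what it can do with it.
names = [s.name for s in registry.services_accepting("DNASequence")]
print(names)
```
</pre>

<p>
Because the query is keyed on the object type rather than on a service
name, the client in this sketch needs no prior knowledge of which
services exist.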

<p>

A similar but independent project, OmniGene, stands as proof of
concept for the BioMOBY proposals and the OmniGene and BioMOBY
developers are now working together to create the BioMOBY system.  No
license agreement has yet been decided among the BioMOBY project
participants; however, it is likely that BioMOBY will be released under
the same license as Perl itself.


<p>
<hr width="50%">
<a name="biosql">
2 Aug, 11:35 - 12:00 <strong>BioSQL</strong><br>
Chris Mungall / Berkeley Drosophila Genome Project<br>

<p>

The BioSQL project defines a general bioinformatics schema intended
for use in multiple projects across different database management
systems and programming languages. Adapters exist for Perl, Java,
Python and Ruby.  These adapters provide an API for accessing a BioSQL
database via that language's bio-project object model. The BioSQL
schema and adapter code work with both MySQL and PostgreSQL, and should
be easily adaptable to other DBMSs, e.g. Oracle.

<p>

BioSQL takes a modular approach to schemas. The existing core modules
deal primarily with sequence annotation data. There is also a module
for controlled vocabularies and ontologies, and modules can be added
for dealing with other data classes, e.g. expression data.

<p>

The BioSQL schema is designed to be extensible: individual projects
can extend the data model via controlled property tables, whilst still
conforming to the core schema.

<p>
<hr width="50%">
<a name="kilburn">
2 Aug, After lunch speaker (about 1pm)<br>
<strong>Post-Genomic Challenges for the Bioinformatics Open Source Community</strong><br>
Dan Kilburn / Beyond Genomics<br>
View slides: [<a href="slides/2002-08-02-Kilburn.ppt">PPT</a>]<br>

<p>

A new phase of the work in the field of bioinformatics has begun: the
task of elucidating how living biological systems function.
Identifying how genes function and how their products interact with
other molecules occurs within the context of processes, location and
their assumed roles.  This places new demands for representing and
handling forms of complex information, including pathways and causal
models in biology, and places new demands on bioinformatics scientists
to bridge the gap between computation and biology.  Annotation and
testing of models of biological systems requires the use of knowledge
representations, e.g. ontologies, which can link to large volumes of
diverse experimental data.  Beyond Genomics is developing a systems
biology approach for integrating and analyzing information from
multiple platforms called BioSystematics&trade; that integrates three
parallel data producing platforms: MS/MS proteomics, MS/NMR
metabolomics, and microarray transcriptomics.  This enables us to
build causal and dynamic models for normal and diseased systems,
capturing not only sequence/structural information, but also
process-oriented information relevant to biochemical systems.  An
initial challenge we face is to build an infrastructure utilizing
scalable and distributed solutions that incorporate a number of novel
bioinformatics technologies, and which use open source solutions and
standards to simplify challenges of data integration and interchange.


<p>
<!-- Canceled
<hr width="50%">
<a name="cartwheel">
2 Aug, 1:50 - 2:15 <STRONG>Cartwheel and FamilyJewels: Bioinformatics Toolkits for Genomic Analysis</STRONG><br>

<i>C. Titus Brown</i>, Tristan De Buysscher, Ramon Cendejas, Meredith
L. Howard, Barbara J. Wold, Eric H. Davidson and R. Andrew Cameron<br>
California Institute of Technology<br>

<P>
-->

The Cartwheel Project is an open-source development effort to develop
tools for community analysis and annotation of genomic sequence, and
FamilyJewels is an associated project to develop tools for comparative
sequence analysis. The primary emphasis of these efforts is on the
development of client-server tools that are easy to use with clients
that can run on multiple platforms.

<p>

There are several components to Cartwheel and FamilyJewels that are
fully developed, including a batching and queueing system that wraps
several 3rd-party analysis tools (Cartwheel batchqueue); a Web site
and analysis server for creating personal analyses and annotations of
genomic sequence (canal); a GUI program for viewing large BAC
annotations (SUGAR); a comparative sequence analysis program
(seqcomp); and a comparative sequence analysis visualization tool
(FamilyRelations). The visualization tools are actively being used in
several labs to investigate gene regulatory networks, and the
Cartwheel batchqueue system is being used by the Sea Urchin Genome
Project to do automatic analyses and annotations.

<p>

We are currently extending Cartwheel to provide a mechanism for
community annotation of genomic sequence via the DAS protocol.

<p>

All code is distributed under the GNU General Public License, and the CVS
archives are on SourceForge. All backend components are written in C
or CPython, and the frontend GUIs are written in Jython (Python for
Java). For further information see
<a href="https://web.archive.org/web/20040606140611/http://cartwheel.caltech.edu/">http://cartwheel.caltech.edu/</a>,
<a href="https://web.archive.org/web/20040606140611/http://sea-urchin.caltech.edu/software/">http://sea-urchin.caltech.edu/software/</a>, and
<a href="https://web.archive.org/web/20040606140611/http://family.caltech.edu/">family.caltech.edu</a>.


<p>
<hr width="50%">
<a name="magestk">
2 Aug, 1:50 -  2:15 <strong>MAGEstk - the MAGE software toolkit</strong><br>
Jason Stewart / OpenInformatics.com and OpenInformatics.org <br>

<p>
Location: <a href="https://web.archive.org/web/20040606140611/http://mged.sf.net/">mged.sf.net</a><br>
License: MIT<br>

<p>
The MAGE software toolkit (MAGEstk) project has created a software
infrastructure we believe will be useful to many biologists. It was
originally created to provide a complete software API to the
MicroArray Gene Expression Object Model (MAGE-OM) developed by the OMG
(omg.org) in collaboration with MGED (mged.org). Since its inception,
MAGEstk has evolved into a powerful generic tool for automatically
generating a broad informatics infrastructure from a UML object model.

<p>
When provided with a UML data model the MAGEstk tools are able to
automatically generate the following:

<ul>
<li> a software API in Java, Perl, C++, and (soon) Python
<li> an XML markup language to transmit data, defined by a DTD
<li> software for serialization of data objects to XML, and de-serialization of XML into data objects
<li> a relational DB schema in SQL to store the data objects persistently
<li> (soon) the code for serializing objects to the DB and de-serializing
  data from the DB into objects
</ul>
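<p>
As a rough illustration of this kind of model-driven generation, the following Python sketch derives a SQL table and an XML round trip from a tiny class description. It is not MAGEstk code or MAGEstk output: the <code>MODEL</code> dictionary, the <code>BioSample</code> class, the type mapping and all function names are hypothetical, and a real UML model would carry associations and inheritance that are ignored here.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a UML class model: class name -> typed attributes.
MODEL = {"BioSample": [("name", "string"), ("concentration", "float")]}

# Assumed mapping from model types to SQL column types.
SQL_TYPES = {"string": "VARCHAR(255)", "float": "DOUBLE PRECISION", "int": "INTEGER"}


def to_sql(model):
    """Generate one CREATE TABLE statement per class in the model."""
    statements = []
    for cls, attrs in model.items():
        columns = ["  id INTEGER PRIMARY KEY"]
        columns += [f"  {name} {SQL_TYPES[kind]}" for name, kind in attrs]
        statements.append(f"CREATE TABLE {cls} (\n" + ",\n".join(columns) + "\n);")
    return "\n\n".join(statements)


def to_xml(model, cls, values):
    """Serialize one object to an XML element with one XML attribute per field."""
    element = ET.Element(cls)
    for name, _kind in model[cls]:
        element.set(name, str(values[name]))
    return ET.tostring(element, encoding="unicode")


def from_xml(model, text):
    """De-serialize an XML element back into a dict, restoring field types."""
    element = ET.fromstring(text)
    casts = {"string": str, "float": float, "int": int}
    return {name: casts[kind](element.get(name)) for name, kind in model[element.tag]}
```

Calling <code>to_sql(MODEL)</code> yields the table definition, and <code>from_xml(MODEL, to_xml(MODEL, "BioSample", values))</code> returns the original values, so the schema and the serialization both follow from the one model.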

<p>

By providing this infrastructure with so little effort, MAGEstk lets many
informatics projects get started quickly without getting bogged down
re-implementing basic infrastructure. Because it requires a UML data
model, biologists must spend their time developing a good model for the
data rather than on software
development. This should help improve the overall quality of the final
system. We believe that MAGEstk will prove itself especially useful
for enabling informatics projects to communicate data using emerging
WWW services protocols such as MOBY.

<p>
<hr width="50%">
<a name="freebsd">
2 Aug, 2:15 - 2:22 <strong>The FreeBSD-bio porting project</strong><br>
Johann Visagie / Electric Genetics <br>
<p>

FreeBSD is a free and open source operating system based on 4.4BSD, the final
version of Berkeley's academic operating system which formed the basis for
the development of many of the technologies underlying the Internet,
including the initial incorporation of TCP/IP into a Unix operating system.
As such it has a long and widely published academic track record, and over
this long history it has developed the sort of benchmark-stretching
performance and rock-solid stability required of a Bioinformatics server.

<p>
FreeBSD provides a third party application infrastructure known as the "ports
collection".  Going beyond most of the various binary packaging systems found
in commercial and open source Unices, FreeBSD's ports provide an integrated
infrastructure for downloading, patching, building and installing
applications from source code, as well as for the further maintenance,
upgrading and (if necessary) removal of such applications once installed.

<p>
The ports system extends open source principles from development to system
administration; it allows the experienced user to share his experience in
compiling and installing and configuring a particular application on FreeBSD,
and it allows the novice user to emulate this experience with a single
command.

<p>
The FreeBSD-bio project is a (highly) informal group of volunteers -
co-operating via a mailing list - who share the goal of committing a large
variety of commonly used bioinformatics tools and packages to the FreeBSD
ports collection's CVS tree.  Since FreeBSD ports are packaged in a
pre-compiled binary format on FreeBSD distribution CDs and DVDs, this will
eventually allow even the novice end user to set up a fully functional
bioinformatics server "out of the box" with minimal effort.

<p>
URL (mailing list):  <a href="https://web.archive.org/web/20040606140611/http://www.plig.net/mailman/listinfo/freebsd-bio/">http://www.plig.net/mailman/listinfo/freebsd-bio/</a><br>
Licence:             BSD, of course.  :-)<br>


<p>
<hr width="50%">
<a name="vocabulary">
2 Aug, 2:22 - 2:29 <strong>A controlled vocabulary for gene expression</strong><br>
<i>Johann Visagie</i>[1], Janet Kelso[2], Soraya Bardien-Kruger[2], Alan
Christoffels[2], Tania Hide[1], Winston Hide[2]<br>

[1] Electric Genetics (Pty) Ltd<br>
[2] South African National Bioinformatics Institute<br>

<p>

Electric Genetics and SANBI have developed a tool that integrates
transcript information, genomic sequence, genetic mapping information
and a standardised controlled vocabulary to serve as the basis for a
more complex system that will aid in the identification of disease
gene candidates.

<p>

A controlled vocabulary which consists of a predefined orthogonal set
of hierarchical schemas was constructed.  These currently include
schemas containing terms describing anatomical site, cell type,
developmental stage and pathology.

<p>

Expression data was mapped to the vocabulary by associating 6937
individual cDNA libraries (including dbEST and SAGE) with one or more
terms in as many of these schemas as possible.

<p>

The vocabulary was implemented as a relational database schema.  A
Python API was constructed to provide a number of facilities to mine
the vocabulary database, including a parser for a simple boolean query
language.
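<p>
A boolean query over such annotations could look roughly like the sketch below. This is not the API described in the abstract: the <code>LIBRARIES</code> data, the <code>schema:term</code> labels, the <code>AND</code>/<code>OR</code>/<code>NOT</code> keywords and the <code>search</code> function are all assumptions made for illustration, and the hierarchy within each schema is ignored (a term matches only itself, not its descendants).

```python
import re

# Hypothetical annotations: library id -> set of "schema:term" labels.
LIBRARIES = {
    "lib1": {"anatomy:brain", "pathology:normal"},
    "lib2": {"anatomy:brain", "pathology:tumour"},
    "lib3": {"anatomy:liver", "pathology:tumour"},
}

# A token is a parenthesis or any run of characters without spaces or parentheses.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def search(query, libraries=LIBRARIES):
    """Return the sorted ids of libraries matching a boolean query.

    Precedence, lowest to highest: OR, AND, NOT, parentheses.
    """
    tokens = TOKEN.findall(query)
    pos = 0
    everything = set(libraries)

    def peek():
        return tokens[pos] if len(tokens) > pos else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        token = tokens[pos]
        pos += 1
        return token

    def parse_or():
        result = parse_and()
        while peek() == "OR":
            take()
            result = result.union(parse_and())
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            take()
            result = result.intersection(parse_not())
        return result

    def parse_not():
        if peek() == "NOT":
            take()
            return everything - parse_not()
        return parse_atom()

    def parse_atom():
        token = take()
        if token == "(":
            result = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return result
        # A plain term selects every library annotated with it.
        return {lib for lib, terms in libraries.items() if token in terms}

    matches = parse_or()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]}")
    return sorted(matches)
```

For example, <code>search("anatomy:brain AND NOT pathology:tumour")</code> selects the brain libraries that are not annotated as tumour, which shows how terms from separate schemas combine in one query.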

<p>

The controlled vocabulary demonstrates that mapping expressed
sequences to terms in a number of separate hierarchical schemas allows
detailed mining of gene expression state information.

<p>

Ongoing work includes investigating the utility of DAS
v1.5 to serve as a query front-end for the controlled vocabulary.


<p>
<hr width="50%">
<a name="lumberjack">
2 Aug, 2:29 - 2:36 <strong>LumberJack</strong><br>
<i>Carolyn J. Lawrence</i> [1],
R. Kelly Dawe [1&amp;2],
Russell L. Malmberg [1]<br>

Departments of [1] Plant Biology and [2] Genetics, University of Georgia<br>
<p>

LumberJack is a maximum likelihood (ML) heuristic search tool written in Java that
progressively jackknifes windows from a sequence alignment to generate
multiple neighbor joining trees.  It compares the trees statistically
on the basis of their relative likelihood scores.  This allows the
program not only to identify reasonable phylogenetic trees quickly,
but also to map phylogenetic signal onto the alignment.
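<p>
The window jackknifing step can be pictured with a short Python sketch. It assumes, as one reading of the description above, that each replicate is the alignment with one window of columns removed; the function name and parameters are hypothetical, and tree building and likelihood scoring are left out entirely.

```python
def jackknife_windows(alignment, window, step):
    """Yield (start, reduced_alignment) pairs.

    Each reduced alignment is the input with the columns
    [start, start + window) deleted from every sequence.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    for start in range(0, length - window + 1, step):
        yield start, [seq[:start] + seq[start + window:] for seq in alignment]
```

Each reduced alignment would then be handed to a tree-building routine, and the change in score attributed to the window that was removed.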


<p>
<hr width="50%">
<a name="clustering">
Reviewers' Note: The non-open source licensing restrictions were removed after
submission of the abstract.<br>
2 Aug, 2:36 - 2:43<br>
<strong>Analyzing cDNA microarray data using Python and the C clustering library:<br>
Why scripts are better than GUIs.</strong><br>
<i>Michiel de Hoon</i>, Seiya Imoto, Satoru Miyano<br>
Laboratory of DNA Information Analysis / Human Genome Center / Institute of Medical Science /
  University of Tokyo<br>
<p>

  Gene expression data generated in cDNA microarray experiments are commonly
 analyzed by clustering methods using GUI-based codes such as the Cluster/TreeView
 program (Eisen, 1999) and GeneCluster (Tamayo, 1999). While the former is
 open source, the latter is not.<br>
<br>
  Whereas a GUI makes an analysis tool easier to use initially, analysis
software based on scripting languages such as Python has several advantages.
Script-based codes are generally more flexible and easier to develop. In addition,
scripting languages usually offer a range of useful features, such as text
processing capabilities, database access, and plotting routines that would
have to be programmed from scratch in a GUI-based code. Numerical routines
that would be too slow if implemented directly in a scripting language can
be written in a lower-level language such as C or Fortran. These routines
can then be called from the script. Finally, we note that GUI-based codes
usually rely on a specific (often commercial) compiler, which makes porting
code, open-source development and code improvement very difficult.<br>
<br>
  While script-based code is already commonly being used for sequence analysis
 (e.g., BioPerl and BioPython), analysis of gene expression data is still
dominated by GUI-based codes. We have therefore written a C library of routines
for hierarchical clustering and for Self-Organizing Maps, as well as an advanced
 k-means clustering algorithm. This clustering library can be called from
Python, or linked to other codes. Unlike Cluster/TreeView, no licensed code
was used to develop this library, making the C clustering library truly open
source. The library can be compiled with the GNU C compiler.<br>
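<p>
To give a flavour of a script-driven analysis, the sketch below parses tab-delimited expression values and clusters the genes with a plain k-means loop. It does not use the C clustering library and is not its algorithm: the function names and the input layout are assumptions, the initial centers are simply the first k rows, and in practice the numerical loop is the part that would be delegated to compiled code.

```python
def parse_expression(text):
    """Parse lines of 'gene<TAB>value<TAB>value...' into names and numeric rows."""
    names, rows = [], []
    for line in text.strip().splitlines():
        fields = line.split("\t")
        names.append(fields[0])
        rows.append([float(value) for value in fields[1:]])
    return names, rows


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, iterations=20):
    """Basic k-means; returns one cluster label per row.

    Assumption for reproducibility: the first k rows seed the centers.
    """
    centers = [list(row) for row in rows[:k]]
    labels = []
    for _ in range(iterations):
        # Assign every row to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(row, centers[c]))
            for row in rows
        ]
        # Move each center to the mean of its members.
        for c in range(k):
            members = [row for row, label in zip(rows, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```

Because the whole analysis is a script, changing the input format, the distance measure or the number of clusters is a one-line edit rather than a change to a GUI.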
<br>
  The C clustering library is freely available for academic and non-commercial
 use at<br>
<a href="https://web.archive.org/web/20040606140611/http://bonsai.ims.u-tokyo.ac.jp/%7Emdehoon/software/software.html">
http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/software.html</a>
<br>
<b>  Contact:</b><a class="moz-txt-link-abbreviated" href="https://web.archive.org/web/20040606140611/mailto:mdehoon@ims.u-tokyo.ac.jp"><br>
mdehoon@ims.u-tokyo.ac.jp</a>


<p>
<hr width="50%">
<a name="depend">
2 Aug, 2:43 - 2:50 <strong>Schedule::Depend</strong><br>
Steven Lembark / Workhorse Computing<br>

<p>
One of the more challenging areas of bioinformatics is simply getting
computer work organized. The challenge is complicated by competing
requirements for speed, organization and dynamic schedules. Most existing
tools are ill-suited to dynamic parallel scheduling of interactive
jobs. Traditional tools such as cron are generally too static, make's syntax
for handling dynamic job lists is byzantine at best, and the
commercial packages are frequently too complicated to handle dynamic
schedules easily.

<p>
Schedule::Depend is an OO Perl module designed specifically for this kind
of task. It uses a simple syntax of colon-separated dependencies (pretty
much like make's) with the individual job tags being passed through an
"unalias" method. The default method provided can handle object or class
methods, subroutine references or Perl blocks along with shell paths. This
avoids having to wrap everything in shell code, and simplifies the syntax
enormously. The simplified syntax makes generating dynamic schedules far
simpler than with other tools. The module provides for job tracking,
restarting (i.e., skipping previously completed tasks) and continuing after
an aborted job (much like "make -k").
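<p>
The idea of colon-separated dependencies with restart support can be sketched in a few lines of Python. This is not the Schedule::Depend interface or its exact schedule syntax: the <code>job : prerequisite ...</code> line format, the function names and the <code>completed</code> argument are assumptions, and jobs run one after another here rather than in parallel.

```python
def parse_schedule(text):
    """Parse 'job : prerequisite ...' lines into {job: [prerequisites]}."""
    deps = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        job, _, rest = line.partition(":")
        prerequisites = rest.split()
        deps.setdefault(job.strip(), []).extend(prerequisites)
        # Make sure every prerequisite is itself a known job.
        for name in prerequisites:
            deps.setdefault(name, [])
    return deps


def run_schedule(deps, actions, completed=()):
    """Run each job after its prerequisites, skipping jobs already completed.

    actions maps job names to zero-argument callables.
    Returns the jobs actually run, in execution order.
    """
    done = set(completed)
    in_progress = set()
    ran = []

    def visit(job):
        if job in done:
            return
        if job in in_progress:
            raise ValueError(f"dependency cycle at {job!r}")
        in_progress.add(job)
        for name in deps[job]:
            visit(name)
        in_progress.discard(job)
        actions[job]()
        done.add(job)
        ran.append(job)

    for job in deps:
        visit(job)
    return ran
```

Passing the names of previously finished jobs as <code>completed</code> mimics a restart: those jobs are skipped and only the remaining work is executed.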

<p>
As an example, I will bring in a working schedule used in a bioinformatics
lab, or the schedule from update_wormbase.

<p>
The module itself is on CPAN, released under the same terms as
Perl5.


<p>
<hr width="50%">
<a name="biit">
2 Aug, 2:50 - 2:57 <strong>Biological Information Integration Toolkit</strong><br>
<i>Jeremy Praissman</i>, Dawei Lin, John Rose, Bi-Cheng Wang<br>
Department of Biochemistry and Molecular Biology / University of Georgia

<p>

Rapid progress of genome sequencing projects and the NIH protein
structure initiative has brought about an increased focus on
biological data integration and computationally guided
experimentation. As part of the effort to develop high throughput
structure determination technologies, it is crucial to integrate and
perform large scale analysis of interdisciplinary data. While many
tools exist for parsing biological databases and interfacing with
existing analysis programs, there is currently a lack of support
infrastructure for correlating biological information from disparate
sources. Our toolkit is an attempt to address this need using
mathematical set and combinatorial graph techniques.

<p>

The toolkit, which is implemented in perl and uses some bioperl
modules, has already been applied in analyzing neighbour gene
relatedness for complete bacterial genomes released at NCBI. It will be
extended to support further correlation of genes with other genomic,
structural and functional information.

Current features:
<ul>
<li> calculating basic genome statistics
<li>calculating statistics for genes grouped by intergenic distance
<li> parsing and classifying functional annotation
<li> analyzing annotation within sets of genes grouped by intergenic distance
<li> generating user specified sequence fragments
</ul>
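<p>
Grouping genes by intergenic distance, as in the features listed above, might be sketched in Python as follows. The toolkit itself is written in Perl and its actual interface is not shown here: the function name, the <code>(name, start, end)</code> tuple layout and the <code>max_gap</code> parameter are assumptions, and strand is ignored.

```python
def group_by_intergenic_distance(genes, max_gap):
    """Group genes separated from the preceding gene by at most max_gap bases.

    genes: iterable of (name, start, end) tuples on a single sequence.
    Returns a list of groups, each a list of gene names in positional order.
    """
    groups = []
    previous_end = None
    for name, start, end in sorted(genes, key=lambda gene: gene[1]):
        if previous_end is not None and max_gap >= start - previous_end:
            groups[-1].append(name)
        else:
            groups.append([name])
        # Track the furthest end seen so far so nested genes do not shrink it.
        previous_end = end if previous_end is None else max(previous_end, end)
    return groups
```

Statistics or annotation summaries can then be computed per group, for instance by counting how often genes in the same group share a functional class.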

Planned features:

<ul>
<li> subset generating operations on gene sets using user supplied functions
<li> generic filtering iterators
<li> combinatorial data structures for further correlating biological information
</ul>

<p>

<hr width="50%">
<a name="nblast">
2 Aug, 3:20 - 3:25, <strong>NBLAST: a cluster variant of BLAST for NxN comparisons</strong><br>
<i>Michel Dumontier</i><br>
Samuel Lunenfeld Research Institute, Mt. Sinai Hospital, Toronto, ON
Canada M5G 1X5
<p>

  The BLAST algorithm compares biological sequences to one another in order
to determine shared motifs and common ancestry. However, the comparison of
all non-redundant (NR) sequences against all other NR sequences is a
computationally intensive task. We developed NBLAST as a cluster computer
implementation of the BLAST family of sequence comparison programs for the
purpose of generating pre-computed BLAST alignments and neighbour lists of
NR sequences.

<p>
  NBLAST performs the heuristic BLAST algorithm and generates an exhaustive
database of alignments, but it only computes about half (i.e. the upper
triangle) of a possible N<sup>2</sup> alignments, where N is the number of sequences
to be compared. A task-partitioning algorithm allows for cluster computing
across all cluster nodes and the NBLAST master process produces a BLAST
sequence alignment database and a list of sequence neighbours for each
sequence record. The resulting sequence alignment and neighbour databases
are used to serve the SeqHound query system through a C/C++ and PERL
Application Programming Interface.
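<p>
The saving from the upper triangle, and one simple way to spread the work over cluster nodes, can be shown with a small Python sketch. It is not NBLAST's task-partitioning algorithm: the round-robin split and both function names are assumptions, and self-comparisons are assumed to be excluded, which gives N(N-1)/2 pairs.

```python
from itertools import combinations


def upper_triangle_pairs(n):
    """All unordered index pairs from n sequences, each pair listed once."""
    return list(combinations(range(n), 2))


def partition(pairs, workers):
    """Deal pairs round-robin so every worker gets a near-equal share."""
    return [pairs[w::workers] for w in range(workers)]
```

For five sequences this produces 10 comparisons instead of 25, and splitting them over three workers gives shares of 4, 3 and 3 pairs.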

<p>
  NBLAST offers a local alternative to the NCBI's remote Entrez system for
pre-computed BLAST alignments and neighbour queries. On our 216-processor
450 MHz PIII cluster, NBLAST requires ~24 hrs to compute neighbours for
850000 proteins currently in the non-redundant protein database.  NBLAST
source code and binaries are available at
<a href="https://web.archive.org/web/20040606140611/http://sourceforge.net/projects/slritools">http://sourceforge.net/projects/slritools</a>
and the NBLAST article is freely
available at BioMed Central Bioinformatics
<a href="https://web.archive.org/web/20040606140611/http://www.biomedcentral.com/1471-2105/3/13/">http://www.biomedcentral.com/1471-2105/3/13/</a>.