Sequence Ontology Meeting 2004
This meeting was divided in two to accommodate the people in America and Europe. The main meeting was held in Berkeley on August 19-20, 2004, and a second, shorter meeting was held prior to the Genome Informatics Conference in Hinxton in September.
Part 1 Berkeley
The purpose of this document is to record the highlights of the discussions, and distill the action items from the meeting. Each of the discussion sessions stemmed from either a presentation by an attendee, or a question posed to the group about SO. The questions posed to the group are outlined in detail in the document 'Discussion Topics for SO meeting'. All meeting related documents will be archived on the SO website meetings page.
Some minor changes and some drastic changes have been made to the ontology as a result of this meeting. The minor changes, which include some housekeeping and additions, have been checked into CVS. The drastic changes have been added to the CVS repository under the name SO-meeting.obo for perusal and more rigorous checking by the community.
Discussions
Pseudogenes
The question: does SO allow us to represent pseudogenes in the way that people want to annotate them? Are we missing any key concepts? Are the definitions correct? Rama Balakrishnan presented her views of pseudogenes based on her work at SGD. The consequence of mutation terms can be used to further characterize a pseudogene. The main points of the discussion were:
- Pseudogenes are necessarily non-functional.
- Genes are functional.
- Regardless of whether or not we can be 100% sure that something truly is a pseudogene, it is a term that people want to and need to annotate with.
- Mutations in a gene do not make it a pseudogene.
Bob gave us an example of a problem arising when we look at pseudogenes across genomes:
Mouse genome-------|geneA|-------------\bad_copy_geneA\--------
paralogy
Human genome-------|geneA|-------------\bad_copy_geneA\-------
What happens if the original geneA is lost in mouse? Is the bad copy of geneA a pseudogene or not?
Action Items
- The definition of pseudogene must be changed. Remove the sentence: 'On occasion a pseudogene is functional as a consequence of being "captured" by a non-paralogous gene, it is then known as a "captured_pseudogene"'.
- Remove the term captured_pseudogene from being kind_of pseudogene_attribute.
- Add a term pseudogenic_exon. It is_a pseudogenic_region. This is different to decayed_exon as it will allow annotators to annotate pseudogenes to a deeper level.
- Create a new relationship called 'non_functional_relative_of' to allow us to annotate the relationship to the functional gene.
- Rama has suggested a term blocked_orf which she intends to use to signify an open reading frame with a premature stop.
- There need to be guidelines and a consistent way to annotate pseudogenes. We need to make a general annotation guide.
Genes
'The problem with polycistronic transcripts' and 'what does it mean to be a transcript' ended up being discussed together, along with 'what is a gene?' which spontaneously erupted.
The 2 basic problems were:
- The term polycistronic_transcript causes a conflict in the ontology. When we trace the relationships back using transitivity, the ontology states that polycistronic_mRNA is part_of a gene. This is a conflict because the whole of it is not part of one gene but part of several genes.
- There are parts of processed_transcripts that are not parts of the genomic sequence. They are added later.
The first thing to note about these problems is that they arise because of transitivity and they both involve the part_of relationship. What do we mean by part_of with regards to SO?
Philosophers have defined the properties of part to be:
- Nothing is a proper part of itself (A proper part is part of but not identical to the individual or whole)
- If A is a proper part of B then B is not a part of A
- If A is a part of B and B is a part of C then A is a part of C
So when we apply this to sequence, a part must be contained within (i.e. fall within the coordinates of) the whole. Parts are transitive.
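The properties above can be sketched in a few lines of Python. This is only an illustration, not part of SO: the term names in PART_OF, the (start, end) coordinate tuples and the function names are all assumptions made for the example.

```python
from typing import Dict, Set, Tuple

# Hypothetical direct part_of assertions (part -> wholes); names are illustrative only.
PART_OF: Dict[str, Set[str]] = {
    "exon": {"transcript"},
    "transcript": {"gene"},
}

def all_wholes(term: str, part_of: Dict[str, Set[str]]) -> Set[str]:
    """Every whole that `term` is part of, following part_of transitively."""
    seen: Set[str] = set()
    stack = list(part_of.get(term, ()))
    while stack:
        whole = stack.pop()
        if whole not in seen:
            seen.add(whole)
            stack.extend(part_of.get(whole, ()))
    return seen

def is_within(part: Tuple[int, int], whole: Tuple[int, int]) -> bool:
    """True if the part's (start, end) coordinates fall within the whole's."""
    return whole[0] <= part[0] and part[1] <= whole[1]

def is_proper_part(part: Tuple[int, int], whole: Tuple[int, int]) -> bool:
    """A proper part lies within the whole but is not identical to it."""
    return is_within(part, whole) and part != whole
```

With these assertions, all_wholes("exon", PART_OF) contains both "transcript" and "gene", which is exactly the kind of inference that leads to the polycistronic conflict described above.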
First discussion about transcript: when we change coordinate space, i.e. describe RNA sequence rather than genomic sequence, do we need to use 'derived_from' instead of part_of?
The concept gene is one of the problem areas. There is an idea to have a concept 'transcribed_region' that will encompass the sequence on the genomic sequence that spans the length of what is transcribed. A transcript would therefore be a part of this. This term is proposed to replace gene in the hierarchy.
We struggled to define gene. Michael's definition of gene: if the products share an exon then they are the same gene. Bob pointed out that two different transcripts of the same gene_thing can have different functions. Gene_locus is proposed to replace gene, as it has a 'fuzzy' quality. Other definitions of gene are proposed that involve inheritance. 'A gene is described by particular parts of a gene_locus, and if a part is the CDS, then it does not overlap on the same strand in the same frame with any other gene's CDS.'
The definition of gene is still up for discussion. Two days were not long enough to resolve this problem.
Action items
- Separate gene from the hierarchy of the parts of gene. This resolves the problem of polycistronic transcripts. Create transcribed_region to replace gene in the hierarchy. Gene is associated with 1 or more transcribed_regions. A transcribed_region is associated with one or more genes.
- Define a relationship to describe that between gene and the parts of a gene. "Associated_with" is too vague - but is used as a placeholder.
- Define what we mean by gene.
- Resolve how to represent parts that are added at a later time.
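The many-to-many association between gene and transcribed_region described in the first action item can be sketched as follows. The gene and region identifiers are made up for the example, and "associated_with" is used only as the placeholder relationship named above.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Hypothetical (gene, transcribed_region) pairs for the placeholder
# relationship "associated_with"; identifiers are illustrative only.
ASSOCIATED_WITH = [
    ("geneA", "transcribed_region_1"),
    ("geneB", "transcribed_region_1"),  # one region shared by two genes
    ("geneA", "transcribed_region_2"),
]

def index_associations(
    pairs: Iterable[Tuple[str, str]]
) -> Tuple[Dict[str, Set[str]], Dict[str, Set[str]]]:
    """Build lookups in both directions: gene -> regions and region -> genes."""
    regions_of: Dict[str, Set[str]] = defaultdict(set)
    genes_of: Dict[str, Set[str]] = defaultdict(set)
    for gene, region in pairs:
        regions_of[gene].add(region)
        genes_of[region].add(gene)
    return regions_of, genes_of

regions_of, genes_of = index_associations(ASSOCIATED_WITH)
```

Because the association is stored as pairs rather than as a part_of hierarchy, a region shared by several genes no longer has to be "part of" exactly one of them.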
Similarity/homology question
Instead of creating homologous_region as a locatable_sequence_feature we have decided to create 4 new nested relationships that will allow us to annotate similarity, homology, orthology and paralogy between two locatable_sequence_features.
Action items
- Create the new relationships (similar_to, homologous_to, orthologous_to, paralogous_to)
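One way to read "nested" is that each relationship implies the more general one above it. The sketch below assumes that orthologous_to and paralogous_to both sit under homologous_to, which in turn sits under similar_to; the minutes do not spell out the exact nesting, so treat the IMPLIES table as an assumption.

```python
from typing import Dict, List, Optional

# Assumed nesting: each relationship maps to the more general one it implies.
IMPLIES: Dict[str, Optional[str]] = {
    "orthologous_to": "homologous_to",
    "paralogous_to": "homologous_to",
    "homologous_to": "similar_to",
    "similar_to": None,
}

def implied_relationships(rel: str) -> List[str]:
    """Return `rel` followed by every more general relationship it implies."""
    chain: List[str] = []
    current: Optional[str] = rel
    while current is not None:
        chain.append(current)
        current = IMPLIES[current]
    return chain
```

Under this reading, annotating two features as orthologous_to also lets a query for similar_to find them.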
SO and GO overlap
It is OK for the terms to overlap.
Variation/mutation question
Everyone agrees that mutation is a 'loaded' term and we should stay with variation.
We ran out of time to discuss one of the big topics, which was to look at the ontology and decide if we could label any of the relationships or inverse relationships as necessary.
Part 2 - Cambridge
People interested in SO convened for a couple of hours before the Genome Informatics conference in Hinxton in September, to discuss the outcome of the August SO meeting. The aim of this was to walk through the problems discussed and make sure the proposed solutions were logical.
Discussions
Polycistronic transcripts:
Richard's comment is that the original was not a problem to WormBase because they only annotate processed transcripts in worm, and when they are processed they become single transcripts. Michael pointed out that in fly both cases exist, i.e. there may also be polycistronic processed transcripts.
Geometric operators:
To resolve the problem of some parts being added to transcripts, and therefore not being part of the genomic sequence, we looked at geometric operators, catalogued by Egenhofer, 1989. These operators allow us to relate terms with relationships that have a defined positional meaning. In the same way that we can infer that the coordinates of a part must be located within the coordinates of a whole, these new relationships allow us to infer other things about location. The operator we chose to explore at this time is 'meets': two regions abut each other, with only a junction between. This new relationship allows us to describe the relationship of the polyA tail and the cap to the mRNA. Using this relationship has allowed us to bring our definition of mRNA in line with that of the databank feature tables.
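As a rough sketch of what 'meets' lets us infer about location, the check below treats regions as 1-based inclusive intervals in a single coordinate space. The Region type, the coordinate convention and the example coordinates are assumptions for illustration, not definitions from SO.

```python
from typing import NamedTuple

class Region(NamedTuple):
    start: int  # 1-based, inclusive (assumed convention)
    end: int    # inclusive

def meets(a: Region, b: Region) -> bool:
    """True when a and b abut: no gap and no overlap, only a junction between."""
    return a.end + 1 == b.start or b.end + 1 == a.start

# Hypothetical coordinates in transcript space.
cap = Region(1, 1)
mrna = Region(2, 1500)
poly_a = Region(1501, 1650)
```

Here meets(cap, mrna) and meets(poly_a, mrna) both hold, while neither the cap nor the polyA tail falls within the coordinates of the mRNA, so neither needs to be a part_of it.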
Action items
- Add new relationship 'meets'
- Redefine mRNA
- polyA_sequence meets mRNA
- cap meets mRNA
More about what is a gene:
Richard used to think that a gene was a bag of features, with no extent, but has now changed his mind. A gene must have an extent. A gene may not be a continuous region, for example because of trans-splicing. Also there are parts of genes that may belong to more than one gene, i.e. a regulatory region may belong to more than one gene. The same is true for non-coding exons. A gene has no parts; it is associated with parts. Richard's definition of a gene: an abstract concept that things can be associated_with and that may also have a location. So a gene is a region, and may be located, but everyone was happy to remove it from the hierarchy.
Action items
- Need to define gene more exactly.
- Gene is a region
- Gene associated with regulatory region, transcribed_region, transcript, CDS etc.
- Regulatory region is associated with both gene and transcribed region
DAS meeting comment from Lincoln:
Each term in SO must have a fetchable URL on the SO website.
Action item
- Create URL for each SO term
Pseudogene:
Everything that has a blastx hit is either a gene, a pseudogene or a transposon. There was more discussion about how to define if the gene is functional or not. The consensus remains that we need the term 'pseudogene' to annotate genomic sequence. The group approved the proposed new terms and relationships.
Phase of intron and CDS pieces:
Phase is a property, not a concept. CDS pieces have a phase, but the CDS itself does not.
Action item
- Create property called phase
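To show why phase belongs to the pieces rather than to the whole CDS, the sketch below computes a phase for each piece. It assumes the GFF3 convention, in which phase is the number of bases to skip at the start of a piece to reach the first base of the next complete codon; the minutes do not define phase, so that convention and the function name are assumptions.

```python
from typing import List, Tuple

def cds_piece_phases(pieces: List[Tuple[int, int]]) -> List[int]:
    """Phase of each CDS piece, given as inclusive (start, end) in 5'->3' order.

    Assumes the first piece begins at the first base of a codon (phase 0).
    """
    phases: List[int] = []
    phase = 0
    for start, end in pieces:
        phases.append(phase)
        length = end - start + 1
        # Bases left over after whole codons determine how many bases the
        # next piece must contribute to finish the split codon.
        phase = (3 - (length - phase) % 3) % 3
    return phases
```

For three pieces of 10 bases each, for example cds_piece_phases([(1, 10), (21, 30), (41, 50)]), the result is [0, 2, 1]: the value depends on where each piece starts relative to the codons, which is why a single phase for the whole CDS would carry no information.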
Reading frame
A reading frame has the potential to be translated. Reading frames do not have start and stop codons, which makes them different from a CDS. Richard is concerned that an open reading frame has a start when it shouldn't. He defines it as the in-frame interval between stops. Michael gives the definition as TER(NNN)nTER. There is concern that this will upset the prokaryote organism people. They will need to use CDS.
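The "in-frame interval between stops" definition can be illustrated with a small scan of one frame. The sketch assumes the standard stop codons TAA, TAG and TGA, a DNA string on the forward strand, and at least one codon between the stops; none of these details are fixed by the discussion above.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}  # assumed: standard genetic code, DNA alphabet

def reading_frames(seq: str, frame: int = 0) -> List[Tuple[int, int]]:
    """Intervals between consecutive in-frame stops, i.e. the (NNN)n of TER(NNN)nTER.

    Returns 0-based, half-open (start, end) intervals that exclude both stops.
    Adjacent stops with nothing between them are skipped.
    """
    seq = seq.upper()
    stop_positions = [
        i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOPS
    ]
    intervals: List[Tuple[int, int]] = []
    for left, right in zip(stop_positions, stop_positions[1:]):
        if right > left + 3:
            intervals.append((left + 3, right))
    return intervals
```

Note that nothing in the scan looks for a start codon, which matches Richard's point: an interval is reported whether or not it contains an ATG.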
Action items
- Clarify the definitions of the reading frame terms.
- This should be noted in the annotation guide for use by prokaryote organisms.
GFF3 documentation
The documentation has become out of date. At the last meeting Scott Cain took responsibility for this.
Action item
- Scott is maintaining the document in POD and will provide HTML markup of the GFF3 documentation on the SO site.
Addendum
After much discussion, the relationship between gene and transcript has been defined as a member-collection relationship, which is a subtype of part_of. A gene is a collection_of transcripts. A transcript is a member_of a gene.