Annotating text: general guidelines
Introduction
For our purpose, annotating text is the process of marking stretches, or spans, of text in some way, signifying that the span of text has particular semantics. Typically, an annotator will carry out this process with some tool. The tool will be used to associate annotations with bits of text, describing the semantics of those spans. In addition to annotations being associated directly with a span, other annotations may be added that describe relationships between bits of text. Annotation is about the text: what appears in it and what it means. It is not about building an abstract model of the text: it is grounded in the document itself.
In CLEF, the annotation process can be split into four sub-tasks:
- mark stretches of text as referring to entities, assigning them an entity type (such as locus)
- mark stretches of text as signalling something about an entity (such as the laterality of a locus)
- add other annotations to describe coreference links between those stretches of text that refer to the same entity
- add other annotations to describe the relationship between entities
These guidelines describe how annotators should map from the surface text to annotation:
- Which bits of text should be annotated?
- How should spans of text be mapped to entities and signals?
- When should annotations describing coreference and relationships be created?
- How should special cases be dealt with?
- What information should be recorded for a span of text?
This section gives some general guidelines for annotating text. This is followed by specific guidelines for each entity, signal, and relationship type.
Collaboration between annotators
- Annotators should not collaborate when marking up texts, unless explicitly requested to do so.
- A set of annotations should be the work of a single annotator only.
- There will be a designated phase of the annotation process for the discussion and resolution of differences.
Annotate words, not concepts
- Annotation is primarily about words.
- The presence or absence of the things in the world that those words refer to is only of secondary importance.
- This means that words should be annotated even if the thing they are referring to does not really exist.
- Things that are in the future, are hypothesised, or even speculative, should still be annotated.
- For example,
- "we will need to check his X-rays when he is admitted"
- "X-rays" should be annotated as an investigation, even though they do not yet exist and, if circumstances change, may never exist.
If something appears too complex to annotate, or you are unsure ...
- Then it probably is too complex to annotate, and should be left.
- There is no point in spending lots of time in philosophical knots about how something should be annotated.
- Annotation is not about trying to attach a label to every word.
- The guidelines can never cover every eventuality.
- For example:
- "he has difficulty clearing sputum"
- If you are not sure which of these words is a condition entity, then don't worry.
- This point is particularly pertinent to highly qualified problems and loci. In these cases, just annotate the main word or words.
- For example,
- "mild sudden onset bronchitis" - just annotate "bronchitis".
- "multinodular goitre" - if you are not sure whether "multinodular" is important, ignore it and just annotate "goitre".
Don't base annotation on your own view of what should be in a medical record
- For any annotators with an interest in medical information and records, this may be the hardest guideline to apply.
- Annotation is about finding those things that are listed in the guidelines.
- It is not about your own pre-conceptions.
- It is not about finding those things that you personally think should or should not be in a medical record
- Your own view of what should and should not be in a record may be very well founded and thought through, but it may be different to the view of the guidelines.
- We are interested in consistent annotation based on a single, written down set of instructions.
- We are not trying to collect annotations based on lots of different viewpoints, however expert they may be.
- Please try to justify every annotation against the guidelines.
- If you find yourself puzzling over whether something should or should not be annotated, and trying to squeeze something into the guidelines, then it is probably best not annotated.
- You may find that it is a complex phrase - perhaps you can annotate just the core part of it. See the examples in the previous section.
Overlapping and containment of annotations
- Mentions cannot overlap with other mentions, or be contained within them
- Signals cannot overlap with other signals, or be contained within them
- Mentions and signals cannot overlap each other, or be contained within each other
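Taken together, the three constraints above amount to one rule: no two annotated spans, whether mentions or signals, may share any character. The following is a minimal sketch of such a check, assuming spans are stored as half-open character offsets. The `Span` type and the function names are hypothetical and are not part of any CLEF tool.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Span:
    """Hypothetical annotated span with half-open offsets [start, end)."""
    start: int
    end: int
    kind: str  # "mention" or "signal"


def spans_conflict(a: Span, b: Span) -> bool:
    # Two half-open ranges share a character exactly when each one starts
    # before the other ends. This covers partial overlap and containment.
    return a.start < b.end and b.start < a.end


def find_conflicts(spans: list[Span]) -> list[tuple[Span, Span]]:
    # Compare every pair, regardless of whether it is a mention or a signal.
    return [(a, b) for a, b in combinations(spans, 2) if spans_conflict(a, b)]


text = "mild left groin pain"
left = Span(5, 9, "signal")            # "left"
groin = Span(10, 15, "mention")        # "groin"
pain = Span(16, 20, "mention")         # "pain"
groin_pain = Span(10, 20, "mention")   # "groin pain", contains the two above

print(find_conflicts([left, groin, pain]))                   # []
print(len(find_conflicts([left, groin, pain, groin_pain])))  # 2
```

The first call reports no conflicts. Adding "groin pain" as a further mention produces two, because it contains both "groin" and "pain".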
Breaking down phrases
- A key question when annotating entity mentions is: what is the textual extent of a mention? What does it include, and what does it exclude?
- For example, "mild left groin pain" can be annotated in many ways:
- as "mild left groin pain", a mention of a pain condition
- as "left groin pain", a mention of a pain, with "mild" left unannotated
- as "left", a laterality, and "groin pain", a mention of a condition
- as a laterality, a condition "groin pain" and a locus "groin"
- as a laterality, a locus, and a condition
- etc...
- The general rule will be to break phrases apart into their component entities. Modifiers that are not commonly treated as part of an entity will be ignored.
- So the above example will be annotated as a laterality, a locus, and a condition. The word "mild" will be ignored.
- There are, however, many medical terms that commonly include modifiers as part of the term.
- For example, "full blood count" could be annotated as:
- a count intervention with locus blood, the modifier "full" left unannotated
- an intervention with mention "full blood count"
- In these cases, the term will not be split. It will be annotated as a single mention.
- The decision as to whether a term should be split is left to the judgement of annotators
- General tests that are suggestive of a term that should not be split are:
- "Would the mention be found in a medical dictionary?"
- "Is the mention something that has an acronym in wide use?"
- Can the mention be rearranged syntactically, e.g. by switching words or by introducing a prepositional "of"?
- If it can't, then it may be a term (but not vice versa)
- Annotators should not attempt to assign annotations to every word in complex phrases. Words that are not clearly one of the required annotations can be safely ignored. If there is any doubt, do not annotate a word.
- For example,
- "moderately differentiated adenocarcinoma"
- Only the word "adenocarcinoma" should be annotated
- For example,
- "partial nephrectomy" would not appear in a dictionary, though "nephrectomy" would.
- Just "nephrectomy" should be annotated.
- For the purposes of the dictionary test, the final arbiter will be:
- Stedman's 27th edition. This is available online at http://www.stedmans.com/section.cfm/45
- For terms that may be British English specific, the UK CancerWEB Online Medical Dictionary. This is available online at http://cancerweb.ncl.ac.uk/omd/
- If you are unable to use an online dictionary (for example, if network connections are restricted for confidentiality reasons), a paper one may be provided.
- Some examples:
- "Full blood count" would be found
- Also, "count of full blood" makes no sense
- "Myocardial infarction" would be found
- Also, switching words makes no sense without changing the syntactic category of the words ("infarct of myocardium")
- "mild left groin pain" would not
- "left groin pain" would not
- "groin pain" would not
- "pain" would be found, as would "groin"
Implied entities
- Entities must have at least one mention.
- Only entities that are explicitly mentioned in the text should be annotated.
- Inference using domain knowledge should not be needed to create an entity. If an entity can only be inferred using domain knowledge, then that entity shall not be created.
- Every mention must refer to a piece of text.
- For example:
- "Histology shows ..." implies that the patient had a biopsy
- If the text nowhere mentions that the patient had a biopsy, then an intervention entity for this biopsy must not be created
- Conversely, an entity must not be ignored just because the annotator, having recognised it, thinks that it is unimportant to the narrative. All entities that appear in the text should be annotated, whether or not the reader thinks they are clinically important.
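The first rules in this section can be read as simple consistency checks: an entity needs at least one mention, and every mention must cover real text in the document. The following is a minimal sketch under assumptions. The `Mention` and `Entity` classes, the `check_entity` function, and the entity type names used in the example are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Mention:
    start: int  # offset of the first character of the mention
    end: int    # offset one past the last character


@dataclass
class Entity:
    entity_type: str
    mentions: list[Mention] = field(default_factory=list)


def check_entity(entity: Entity, text: str) -> list[str]:
    """Return a list of problems; an empty list means the entity passes."""
    problems = []
    if not entity.mentions:
        # An entity that exists only by inference has nothing to anchor it.
        problems.append("entity has no mention in the text")
    for m in entity.mentions:
        if not (0 <= m.start < m.end <= len(text)):
            problems.append(f"mention {m.start}-{m.end} does not cover any text")
    return problems


text = "Histology shows ..."
histology = Entity("investigation", [Mention(0, 9)])  # covers "Histology"
biopsy = Entity("intervention")                       # implied only, no mention

print(check_entity(histology, text))  # []
print(check_entity(biopsy, text))     # ['entity has no mention in the text']
```

The implied biopsy fails the check because nothing in the text mentions it, which is the situation described in the example above.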
Relationships and domain knowledge
- In many cases, relationships are explicitly stated in the text.
- For example:
- "Paracetamol was prescribed for his pain"
- A has_indication relationship between "paracetamol" and "pain" must be annotated: it is clearly stated to exist in the text.
- There are lots of other common patterns that signify relationships:
- "Problem in the Locus"
- "Investigation showed Problem"
- "Problem seen on the x-ray"
- "Problem found on examination"
- Occasionally, however, some level of domain knowledge is required to infer that a relationship exists between two entities. These relationships will be annotated.
- For example:
- "He is in pain. Paracetamol was prescribed"
- A has_indication relationship between "paracetamol" and "pain" exists
- It requires (minimal) domain knowledge to infer this, but can also be guessed.
- The relationship will be annotated.
- For example:
- "He is suffering from nausea and severe headaches. Dolasetron was prescribed"
- There is a has_indication relationship between "dolasetron" and "nausea".
- This is not obvious without domain knowledge. But with domain knowledge, it is quite clear.
- The relationship will be annotated.
- Please try to annotate only those relationships that the text is telling you about. Often, such relationships are clearly stated. Sometimes, the text is saying something, but it needs some clinical knowledge to interpret this and to decide on the relationship. This should not mean, however, that you try to deduce every single relationship between every single entity, regardless of whether the text is saying something about it. We are only interested in what the text is telling us.
- The guidelines on relationships should not imply that you should go on a hunt for tenuous and conjectured relationships holding between two entities mentioned several paragraphs apart. Relationships should be intentionally stated in the text, although in practice this might be hard to judge. If a relationship is not obvious with your clinical knowledge, please do not annotate it.
- Relationships should be annotated between the particular spans of text that seem relevant to your reading of a particular paragraph or section, i.e. spans of text that are "in the focus" of your attention.
Signals: modifying entities
Signals are additional words that modify an entity, to provide extra information about it. For example, "_left_ leg", "_no_ metastases", "_upper_ back". Signals always modify an entity that is closely associated with them. Signals are related to their main entity with a "modifies" relationship. So we might create annotations that say "left modifies leg". The modifies relationship is not like other relationships. It is saying something about the linguistic structure of a phrase, and is much less about clinical (domain) knowledge than other relationships.
- Signals are always in the same phrase as the word they are modifying. Never mark a signal as something modifying an entity in some other sentence or phrase.
- For example,
- "Fragmented core and blood clot together. Histological examination shows bone marrow with extensive necrosis"
- Do not annotate "core" as a sublocation modifying "bone marrow". "Core" is not signalling anything about any word in its immediate phrase surroundings.
- Signals almost always come before the word they are modifying. Think hard before using a signal to modify a word that precedes it.
- For example,
- "lower back": "lower" modifies "back"
- "head of the pancreas": "head" should be marked as modifying "pancreas"
- Signals are often adjectives
- Every signal that is annotated should be related to at least one main entity. Signals should not be created that do not modify any entity.
- For example, please do not mark every occurrence of the word "no" as a negation signal. Only those examples that are clearly referring to some condition are signals about conditions.
- For example,
- "no referral has been made"
- There is no need to mark "no" as a negation signal. It is not describing the absence of a condition.
- Signals may modify more than one main entity.
- For example:
- "no consolidation and collapse"
- Mark "no" as negating both "consolidation" and "collapse"
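Two of the rules above lend themselves to a mechanical check: every signal must take part in at least one "modifies" relationship, and one signal may modify several entities. The following is a minimal sketch, assuming signals and entities are referred to by simple string identifiers. The function name and the data layout are hypothetical.

```python
def unattached_signals(signal_ids, modifies):
    """Return the signals that take part in no "modifies" relationship.

    signal_ids: iterable of signal identifiers
    modifies:   iterable of (signal_id, entity_id) pairs
    """
    attached = {signal_id for signal_id, _ in modifies}
    return [s for s in signal_ids if s not in attached]


# "no consolidation and collapse": one negation signal, two conditions.
modifies = [("no", "consolidation"), ("no", "collapse")]
print(unattached_signals(["no"], modifies))  # []

# "no referral has been made": "no" marked as a signal with nothing to modify.
print(unattached_signals(["no"], []))  # ['no']
```

A signal reported by this check should either be linked to the entity it modifies or be removed.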
Metonymy
- Metonymy is where a feature of something is used to stand for that thing.
- Entities and interventions that depend on metonymy will not be annotated.
- For example,
- "we shall see him again in 6 weeks" implies an appointment.
- An intervention for this appointment will not be annotated.
Cross-document inference
- All annotation will be of a single document in isolation, and should not consider other documents for the same patient. In particular, any inference required should only make use of information within the document being annotated.
Plurals, conjunctions and sets
- A single term may appear to refer to more than one entity. For example,
- Two or more lesions: "lytic lesions in the spine and abdomen"
- Two x-rays: "x-ray of the leg and chest"
- One scan: "CT scan of her abdomen and thorax"
- Two scans: "CT scan of her head and neck"
- In all of these cases, a single mention for a single entity will be annotated.
- Sometimes, a plural or a set of things will be mentioned, and then a little later, a single member of that set. For example,
- "Her finger nails show onycholysis. The nail of the left index is bleeding from the bed"
- In such cases, the set (finger "nails"), and the individual ("nail" of the left index), should be annotated as entities.
- They should not, however, be coreferred - see Coreference
Spelling and other mistakes
- Misspellings should be annotated
- For example,
- "lumbar punction"
- is a misspelling of "lumbar puncture"
- it should be annotated as a mention of an investigation entity
- The mention should be recorded as a spelling mistake
- Other changes made after the letter has been dictated and typed, and that alter the way the letter reads, should also be marked as spelling mistakes.
- For example, computer processing may introduce unintentional changes, such as:
- "r******otherapy", where some process has obliterated part of a word
- The word "r******otherapy" should be marked as a spelling mistake.
- Another mistake is where the typist misses a space between two words, running two different mentions together
- For example,
- "T1 G1 adenocarcinoma of the prostate. Presenting PSA8.5."
- In this case, the space has been accidentally omitted between an investigation, "PSA", and its result, "8.5"
- Where this has clearly happened, the annotator should mark those parts of the non-spaced "word" that correspond to each mention type, regardless of the missing space, and additionally mark both as misspelt.
- In the example, the characters "PSA" in "PSA8.5" would be marked as an Investigation, and the characters "8.5" marked as a Result. Both would be marked as misspelt.
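The "PSA8.5" case can be pictured as two adjacent character ranges inside one run-together token. The following is a minimal sketch, assuming annotations are stored as character offsets into the document text. The regular expression, the dictionary keys and the `misspelt` flag are hypothetical and do not describe the format of any particular annotation tool.

```python
import re

text = "T1 G1 adenocarcinoma of the prostate. Presenting PSA8.5."

# Illustrative pattern: "PSA" immediately followed by a number, with no space.
match = re.search(r"(PSA)(\d+(?:\.\d+)?)", text)

annotations = [
    {"type": "Investigation", "start": match.start(1), "end": match.end(1), "misspelt": True},
    {"type": "Result", "start": match.start(2), "end": match.end(2), "misspelt": True},
]

for a in annotations:
    # Print each annotation type with the exact characters it covers.
    print(a["type"], repr(text[a["start"]:a["end"]]))
```

The Investigation span ends exactly where the Result span begins, so the two mentions sit side by side without overlapping. The sentence-final full stop is left outside the Result.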