Annotation
How is the discourse divided up?
In the vast majority of the corpus, the speaker is the storyteller. The storyteller's speech is divided into utterances (encoded 'u' in the TEI). The major categories of speech and thought presentation were used as the basis for dividing the corpus into utterances: most utterances are encoded as sections where the narrator tells the story (NS), as sections of direct discourse introduced by a verb of speech (DD), or as sections of direct discourse where there is no verb of speech (fDD). While these are the dominant categories in this corpus, a number of other forms of speech and thought presentation are also encoded at utterance level and are discussed in detail below. Where the form of speech and thought presentation changes within an utterance (e.g. where the narrator shifts from telling the storyline (NS) into indirect discourse (ID)), the embedded form is encoded in a 'seg' within the utterance.
A small number of utterances are articulated by the audience and encoded accordingly as 'aud'.
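As an illustration of this division, the following minimal Python sketch counts STP categories at utterance level and at 'seg' level. It is a sketch under stated assumptions, not the corpus's documented encoding: the attribute names `ana` and `who`, the value `storyteller`, the absence of a TEI namespace, and the French sample sentences are all hypothetical. The actual tagset is given in the Header of each story file.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample. The attribute names ('ana', 'who') and the sentences
# are illustrative assumptions; real story files may encode these differently
# and will normally declare the TEI namespace.
SAMPLE = """
<div>
  <u who="storyteller" ana="NS">il est parti <seg ana="ID">il a dit qu'il reviendrait</seg></u>
  <u who="storyteller" ana="DD">il a dit je reviens</u>
  <u who="storyteller" ana="fDD">je reviens</u>
  <u who="aud">oui</u>
</div>
"""


def count_stp(xml_text):
    """Return (utterance-level counts, seg-level counts) of STP categories."""
    root = ET.fromstring(xml_text.strip())
    # Utterances without an STP value (e.g. audience turns here) are skipped.
    utterances = Counter(u.get("ana") for u in root.iter("u") if u.get("ana"))
    # Embedded shifts of STP are counted separately from whole utterances.
    embedded = Counter(s.get("ana") for s in root.iter("seg") if s.get("ana"))
    return utterances, embedded


utterance_counts, seg_counts = count_stp(SAMPLE)
print(utterance_counts)
print(seg_counts)
```

Keeping the two counts separate reflects the distinction made above between categories encoded at utterance level and forms embedded within an utterance.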
How have the tagsets been designed?
The four phenomena which have been marked up in the corpus are speech and thought presentation, detachment, inversion and negation. The tagsets for all four phenomena have been designed on the basis of recent and current research in the fields concerned. In the case of speech and thought presentation, this was largely a question of establishing a set of categories suited to this particular project and clear criteria for each category; these categories are detailed in the tagset for speech and thought presentation below. In the case of detachment, inversion and negation, the researcher used recent and current research to establish the linguistic factors which appear to be of most relevance when analysing how these structures work in discourse. Each time a detached, inverted or negative construction occurs, it is marked up in a 'seg' for a series of linguistic factors, which are outlined in more detail in the tagsets for detachment, inversion and negation below.
Tagset for Speech and Thought Presentation
Speech and thought presentation (STP) is marked up for a number of core categories, which are based on recent and current research in the field. The full set of categories and the complete tagset are given in the Header of each story file; the most frequently used tags are also listed here, since STP is fundamental to how the discourse is divided up:
- NS: utterances where the narrator is recounting the story
- NF: utterances where the narrator presents material around the main story, either at the beginning or the end, which is not part of the story itself [This can take a number of forms and is sometimes referred to as 'framing' material. Note, however, that this type of framing material is not connected to framing theory as articulated by Michel Charolles, which involves spatio-temporal adverbials which are detached at the head of independent clauses]
- NA: the narrator addresses the audience directly
- DD: reported discourse in the form of direct discourse
- ID: reported discourse in the form of indirect discourse
- FDD: reported discourse in the form of free direct discourse, i.e. with no verb of speech
- fDD: cases which formally resemble free direct discourse but function in practice as direct discourse, where it is clear that the storyteller is directly representing the speech or thought of a character (including vocal features) but where there is no verb of speech
- FID: reported discourse in the form of free indirect discourse. Encoded examples normally contain at least one linguistic element that is clearly characteristic of free indirect discourse, such as deictics relating to the character's 'here and now', subjective vocabulary or expressives that reflect the character's rather than the narrator's perspective (including questions), or intonation patterns that strongly suggest that the character is the enunciator rather than the narrator.
Utterances (or parts of utterances) that are ambiguous with respect to STP, i.e. cases where the discourse could be read as representing two possible STP categories, are marked up using portmanteau tags, e.g. where it is not clear whether the narrator is recounting the events of the narrative or whether an utterance is the speech or thought of a character through free indirect discourse (NS-FID). All reports of discourse are marked up as 'RD' (i.e. verbs indicating speech or thought processes, normally those introducing or following reported discourse). The overarching principles adopted with respect to reported discourse are:
- that there has to be a segment of discourse that is identifiable as the reported discourse for it to be marked up for STP: structures containing an infinitive rather than a finite verb (e.g. 'il a décidé de partir') are not marked up, nor are speech or thought acts without the reported discourse clause (e.g. 'il a décrit la situation'), representations of discourse without any segments of discourse or reported discourse (e.g. 'il a fait un discours excellent'), or reporting devices (e.g. 'selon lui ...');
- the 'report of discourse' must denote a speech or thought process in context: verbs of cognition or emotion that might be considered to be borderline cases of STP (e.g. 'savoir' or 'vouloir' or 'sentir/ressentir') are not normally included except in exceptional cases where there is a clear speech or thought process;
- negative clauses are included if the speech or thought process takes place, but otherwise not.
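The portmanteau tags described above (e.g. NS-FID) raise a practical question when categories are counted. The following minimal Python sketch shows one possible way of handling them. It assumes that the hyphen is used only to join two candidate categories and that no single category name contains a hyphen. The function names and the `include_ambiguous` option are hypothetical and are not part of the corpus tools.

```python
from collections import Counter


def split_portmanteau(tag):
    """Return the candidate STP categories for a tag, e.g. 'NS-FID' -> ['NS', 'FID'].

    Assumption: a hyphen only ever joins two candidate categories.
    """
    return tag.split("-")


def count_categories(tags, include_ambiguous=True):
    """Count STP categories, optionally leaving ambiguous tags out altogether."""
    counts = Counter()
    for tag in tags:
        parts = split_portmanteau(tag)
        if len(parts) > 1 and not include_ambiguous:
            continue  # skip ambiguous cases for a conservative count
        counts.update(parts)  # an ambiguous tag counts once per candidate reading
    return counts


tags = ["NS", "NS-FID", "DD"]
print(count_categories(tags))
print(count_categories(tags, include_ambiguous=False))
```

Counting an ambiguous utterance under both candidate readings gives an upper bound for each category, while excluding ambiguous cases gives a lower bound. Which is appropriate depends on the analysis.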
In a small number of stories, there are complex examples of embedded narratives. These vary in form and in complexity of speech and thought patterns. In general, the principles adopted are:
- where the embedding occurs at the outset and the embedded narrative is the main one, then the embedded narrative is the one encoded for STP and the introductory material is encoded 'NF';
- where the embedded narrative (or narratives) is quite substantial, but is not the main narrative, then the embedded narrative is 'doubly' encoded so that there is some flexibility around statistical calculations using search tools. For example, some utterances in an embedded narrative could be encoded both as the narrator recounting the story (NS) and as direct discourse (DD). Where direct discourse is embedded within direct discourse, it is encoded as 'DD emb';
- where the embedded section is very short, it is not encoded as a separate narrative.
Tagset for Detachment
Detached structures are encoded as 'segs' within utterances for a number of factors that recent research has shown to be relevant for an analysis of their usage in discourse. The main factors are given below but a full list of categories and tags (with examples) is given in the Header of each story file:
- whether the detached element is to the right or left
- whether it is a pronoun (and if so, which pronoun), lexical noun or other item
- the nature of the 'replacement' pronoun within the main clause
- cases where the replacement pronoun is not straightforwardly co-referential with the detached element
- cases where there is no replacement pronoun but where there is a semantic or pragmatic link between the detached element and the main clause: in the literature on detachment, these links are often labelled 'effets de co-référence'
- cases of double and triple detachment
- cases where there are two detached elements in apposition or repeated detached elements
- cases where material is inserted between the detached element and the main clause
- cases where the detached clause is an interrogative.
[Note that cases of detached pronouns with 'aussi', e.g. 'moi aussi', are not marked up because of the distinctive semantic function of the combination of these two elements.]
Tagset for Inversion
Examples where the subject and verb are inverted in declaratives are encoded in segs within utterances for factors that research on inversion has shown to be significant. The possible 'trigger' for the inversion is marked up, i.e. the syntactic or discourse context, the nature of the element that precedes the inversion, or the presence of a syntactically 'heavy' subject. It is also noted if the inversion occurs at or near the beginning or end of the main narrative or of an embedded narrative. The full set of tags with examples can be found in the Header of each story file.
Tagset for Negation
Negative constructions are encoded as 'segs' within utterances and are marked up to indicate whether the 'ne' is retained or dropped, or whether the status of the 'ne' is ambiguous (i.e. where it is not possible to tell whether 'ne' is retained or not). Negative structures are further marked up for a number of factors that research on negation has shown to be relevant for an analysis of 'ne' deletion. The main factors are given below but a full list of categories with the complete tagset and examples is given in the Header of each story file:
- the grammatical subject of the negative construction (noun, pronoun etc.), including cases where there is no surface subject (e.g. infinitives) and those where the negative is part of a relative clause
- the negative particle involved (pas, rien etc.)
- cases where the negative particle precedes the verb
- intervocalic contexts such as 'tu as'
- contexts of interrogation involving inversion
- cases where the negative follows a subordinating conjunction or a relative
- cases where a non-subject clitic is inserted before the verb
- cases where other material is inserted before the verb
- cases where the verb is one of the modals 'devoir' or 'pouvoir'
- cases where the negative clause is hypothetical
- cases where the negative construction is a 'frequently used expression' such as 'je ne sais pas'.
Negative elements and/or structures which are not relevant for an analysis of 'ne' retention or deletion (e.g. where 'ne' is attested with no negative particle such as 'pas', cases of expletive 'ne', examples where 'ne' is compulsory) are marked up separately as 'nl' so that they can be excluded from key statistics.
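To illustrate why the 'nl' cases are kept apart, the following minimal Python sketch computes a 'ne' retention rate over a list of negative constructions. The status labels 'retained', 'dropped' and 'ambiguous' are hypothetical stand-ins for the corpus's actual tags (only 'nl' comes from the description above). Leaving ambiguous cases out of the denominator is likewise an assumption rather than the project's stated method.

```python
def ne_retention_rate(statuses):
    """Return the proportion of negatives that retain 'ne', or None if undefined.

    statuses: iterable of hypothetical labels 'retained', 'dropped',
    'ambiguous' or 'nl'. Only 'retained' and 'dropped' enter the calculation;
    'nl' (not relevant) and 'ambiguous' cases are excluded (an assumption).
    """
    statuses = list(statuses)
    retained = sum(1 for s in statuses if s == "retained")
    dropped = sum(1 for s in statuses if s == "dropped")
    total = retained + dropped
    if total == 0:
        return None  # no relevant negatives, so the rate is undefined
    return retained / total


sample = ["retained", "dropped", "dropped", "dropped", "nl", "ambiguous"]
print(ne_retention_rate(sample))
```

If the 'nl' cases were included in the denominator, the rate would be distorted by structures in which 'ne' deletion is not a genuine option.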