SRCMF corpus: TIGERSearch web interface

Using the TIGERSearch web interface

Writing a query and browsing the results

In the TigerSearch tab, queries are entered in the top panel, and matching sentences are shown in tree form in the bottom panel. A tutorial on TigerSearch queries may be found in the section “Writing a simple query”.

If the query is well-formed, and if there are matching results in the corpus, the first tree in the forest will appear in the bottom panel.

The central bar gives the number of matches and the position of the sentence in the corpus, in the form sent: [sentence number] [match number] / [total matching sentences]. Note that subgraph navigation is not yet implemented, and the interface does not show the total number of matches, only the number of matching sentences. You can navigate through the forest of matches using the forward and back arrows on this bar. The ‘Export’ button displays the current tree as an .SVG file in the browser, which can be saved and downloaded. The ‘Export Concordance’ button allows matching sentences to be exported in concordance form.

Exporting the results

To export the results of your query, click the ‘Export Concordance’ button. An export window will appear, with the following options:

When you have filled in the form:

After a short delay, a new tab will open in your browser, containing the concordance in plain text tabular format (.csv).

Viewing the concordance

To view and manipulate the concordance, you will need to use a spreadsheet package.

You will need to correctly configure your spreadsheet software to read the file. We recommend using LibreOffice or OpenOffice Calc, which will prompt the user for settings whenever a .csv file is opened. The following settings are required for the import to function:

Troubleshooting likely problems:

Writing a simple query

The following section will enable you to write simple TIGERSearch queries for the SRCMF corpus. It is not comprehensive, and must be read in conjunction with:

Nodes in the TS graph

A TigerSearch graph is made up of two types of nodes: terminal and non-terminal nodes. In the graph viewer, terminal nodes appear at the bottom of the graph, while non-terminal nodes are represented by labelled white ovals, as shown in the example je puis dire.

Example TIGERSearch tree

Each node has a number of features (see section “Tagset used

SRCMF: ‘split’ nodes

In a true dependency graph, words form the only nodes.

In the TigerXML SRCMF corpus, each ‘word’ in the dependency structure is in fact split between a terminal node (which contains the lexical form and the PoS tag of the word itself) and a non-terminal node (which contains the syntactic features of the structure headed by the word). The non-terminal node and the terminal node are linked by an edge labelled ‘L’ (for lexical realization).

In the example tree, an ‘L’ edge links:

A ‘D’ edge links the ‘Snt’ node to the non-terminal nodes ‘SjPer’ and ‘AuxA’: this indicates that the subject je and the ‘auxiliated’ infinitive dire depend on the main verb puis.

SRCMF corpus node features

The SRCMF corpus has the following node features:

Terminal nodes:

Non-terminal nodes:

For simple queries, we will focus mainly on the word, pos and cat features.

Defining the feature specifications of a node

Node feature specifications are written between [square brackets] and take the following form:

where value is a string or

where value is a regular expression. Permitted operators are ‘=’ (equals) and ‘!=’ (does not equal). For example, the following expression identifies all nodes where cat is "SjPer" (personal subject):
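Using the string form, a specification consistent with this description (a sketch built only from the feature name and value given in the text) would be:

```text
[cat = "SjPer"]
```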

If we wish to include impersonal subjects (i.e. "SjPer" and "SjImp") we can use a regular expression:
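One way to write this, assuming that the regular expression is enclosed in slashes and must match the whole value, as in the sample queries at the end of this page:

```text
[cat = /Sj(Per|Imp)/]
```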

We can identify all nodes which are not subjects:
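A possible form combines the ‘!=’ operator with a regular expression. The pattern below assumes that every subject function begins with ‘Sj’, as "SjPer" and "SjImp" do:

```text
[cat != /Sj.*/]
```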

We may also use the conjunction (&) operator within the square brackets to specify several properties. For example, we can search for subordinate clause subjects by requiring the subject to be headed by a finite verb (type is "VFin"):
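A sketch of such a specification, joining the two conditions named above with ‘&’:

```text
[cat = "SjPer" & type = "VFin"]
```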

Assigning a variable name to a node

A variable name may be assigned to the node definition. These are useful to refer to the same node several times in a complex query and are also used to indicate the pivot node to concordance scripts.

Variable definitions adopt the following syntax:

where definition is a feature specification as described above. Note that variable names must begin with hash (#) and are separated from their definition by a colon (:).

For example, we may wish to construct a concordance in which the subject forms the pivot. We define the #pivot variable as follows:
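Following the syntax just described, and taking "SjPer" as the subject function, such a definition might read:

```text
#pivot:[cat = "SjPer"]
```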

Node relations

All but the most simple queries will require more than one node to be defined, and will usually require the relationship between the nodes to be specified.

For example, suppose we wish to identify all subjects headed by the word Tristran. First, we define the subject:

Second, we define the word Tristran as a terminal node:

Finally, we must indicate the relationship between the nodes. The relationship between a non-terminal node and the terminal node representing its lexical content in the TigerSearch graph is one of direct dominance, labelled ‘L’ (lexical).

Direct dominance

In TigerSearch, direct dominance is expressed by using the operator ‘>’ with the following syntax:

where node and node2 are feature specifications or node variables, and label (optional) is a string.

To identify subjects headed by the word Tristran, the relationship between nodes #subject and #tristran is expressed as follows:
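Assembled from the steps above, and assuming that "SjPer" is the subject function and that the word feature holds the lexical form, the complete query might read:

```text
#subject:[cat = "SjPer"]
& #tristran:[word = "Tristran"]
& #subject >L #tristran
```

The first two lines define the two nodes; the third states that #subject directly dominates #tristran through an edge labelled ‘L’.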

Left corner dominance

The ‘>@l’ operator specifies the leftmost terminal node dominated at any depth by a non-terminal node. It has the following syntax:

where node and tnode are feature specifications or node variables, and tnode is a terminal node.

For example, instead of searching for all subjects which are headed by the word Tristran, we may wish to identify all subjects beginning with the word Tristran. This relation would be written as follows:
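A sketch of this relation, reusing the variable names #subject and #tristran and assuming the same feature values as before:

```text
#subject:[cat = "SjPer"] >@l #tristran:[word = "Tristran"]
```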

Note that there is also a right corner dominance operator ‘>@r’.

Precedence

The precedence operator ‘.*’ permits the user to specify the word order of two terminal nodes with the following syntax:

where tnode and tnode2 are feature specifications or node variables representing terminal nodes.

For example, suppose we wish to identify all sentences in which the word Tristran heads the subject and precedes the main clause verb.

We need to add two additional conditions to the query in the previous section. First, we need to identify the terminal node containing the main verb of the sentence: i.e. the lexical realization of the non-terminal node ‘Snt’:
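This condition can be sketched as follows, with #snt and #verb as the variable names used in the discussion below:

```text
#snt:[cat = "Snt"] >L #verb
```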

You may have noticed that #verb has no feature specification. This is perfectly valid in TigerSearch query syntax. In practice, we know that only one node can be linked to #snt by an ‘L’ relation in the corpus. #verb is thus defined by its relation to #snt rather than by its features.

We then need to specify that the word Tristran precedes the verb:
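With the precedence operator, and assuming #tristran names the terminal node for the word Tristran, this might be written:

```text
#tristran .* #verb
```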

Finally, we need to clarify that #subject is the subject of #snt. Otherwise, we risk finding subjects of a subordinate clause which happen to precede the main clause verb:
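Since dependants are attached to their head by a ‘D’ edge, a plausible form of this condition is:

```text
#snt >D #subject
```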

Putting it all together, the query is as follows:
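Combining the four conditions described in this section, the query might read as below. This is a sketch that assumes "SjPer" for the subject, "Snt" for the main clause node and the word feature for the lexical form:

```text
#subject:[cat = "SjPer"] >L #tristran:[word = "Tristran"]
& #snt:[cat = "Snt"] >L #verb
& #tristran .* #verb
& #snt >D #subject
```

The first line finds subjects headed by Tristran, the second identifies the main verb, the third imposes the word order, and the fourth ties the subject to the main clause.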

There is also a direct precedence operator, ‘.’, which specifies that the two terminal nodes must be directly adjacent.

Negation

It is important to learn one (extremely frustrating) golden rule of Tiger query syntax:

In practice, this means that when we write:

we have not found all null subject main clauses. Instead, we have asked for sentences (#snt) which contain a subject node (#subject) which is not the subject of a sentence. TigerSearch will return all sentences with subjects in a subordinate clause.
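A query of the kind discussed in this paragraph might look like the following sketch, which assumes that ‘!>D’ is the negated form of direct dominance over a ‘D’ edge:

```text
#snt:[cat = "Snt"] !>D #subject:[cat = "SjPer"]
```

Both #snt and #subject must exist for the query to match, which is why it cannot express the absence of a subject.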

The SRCMF corpus provides a partial work-around for this problem by using the dom feature. The dom feature of a non-terminal node lists the cat features of all nodes linked to it by a ‘D’ edge in alphabetical order separated by an underscore. For example, the ‘Snt’ node in the example tree has two dependants: SjPer and AuxA. It therefore has a dom property ‘AuxA_SjPer’.

As a result, we can identify all main clauses without subjects by negating the dom feature:
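A sketch of such a query, assuming that the regular expression must match the whole dom value (hence the ‘.*’ on either side):

```text
[cat = "Snt" & dom != /.*Sj.*/]
```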

This will return all ‘Snt’ nodes whose dom property does not contain the characters ‘Sj’: in other words, main clauses without an expressed subject.

Syntactic variation

TigerSearch syntax is quite flexible, and we may express queries in a number of ways. For example, the query identifying all subjects headed by the word Tristran may be expressed using three statements...

... or two statements, e.g.:

... or one statement:

... or without variable names:

Where multiple statements are used, the order of statements is irrelevant. Confusingly for programmers, you may reference variables before assigning a value, e.g.:
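The variants mentioned above can be sketched as follows, assuming "SjPer" as the subject function and the word feature for the lexical form. With three statements:

```text
#subject:[cat = "SjPer"]
& #tristran:[word = "Tristran"]
& #subject >L #tristran
```

With two statements, folding one definition into the relation:

```text
#subject:[cat = "SjPer"] >L #tristran
& #tristran:[word = "Tristran"]
```

As a single statement:

```text
#subject:[cat = "SjPer"] >L #tristran:[word = "Tristran"]
```

Without variable names:

```text
[cat = "SjPer"] >L [word = "Tristran"]
```

With the variables referenced before they are defined:

```text
#subject >L #tristran
& #subject:[cat = "SjPer"]
& #tristran:[word = "Tristran"]
```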

Using concordances

The SRCMF project has developed a number of concordances to present the results of TigerSearch queries in tabular format. Three concordances are currently implemented:

These concordances produce a text CSV file.

Principles

The concordances use the names of variables from the TigerSearch query to identify the syntactic constituents which should form the focus of the table. All concordances require a #pivot variable to be present in the query.

For example, the following query is correct in TigerSearch, but will not produce a concordance:

To produce a concordance, the query must identify a node as the #pivot, for example:
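As an illustration, take a simple search for personal subjects (an assumed example). Written without a variable, it is a valid query but gives the concordance nothing to use as its pivot:

```text
[cat = "SjPer"]
```

Naming the same node #pivot should make it usable for a concordance:

```text
#pivot:[cat = "SjPer"]
```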

Basic concordance

The basic concordance has four columns:

The #pivot can be any node in the syntactic tree, either a single word or a larger structure. Currently, only lexical information (not annotation) can be shown in the basic concordance.

For example, we may wish to create a concordance of all the main clause subjects containing the word ‘Tristran’:
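One possible formulation is sketched below. It assumes that ‘containing’ means dominance at any depth, expressed with ‘>*’, the general TIGERSearch dominance operator, which is not otherwise introduced on this page; #snt is a variable name chosen for illustration:

```text
#snt:[cat = "Snt"] >D #pivot:[cat = "SjPer"]
& #pivot >* [word = "Tristran"]
```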

Note that the #pivot variable is attached to the subject node (cat = "SjPer").

Below is a selection of the results from the concordance:

| ID | contexte gauche | pivot | contexte droite |
| beroul_pb:8_lb:234_1263227636.06 | di por averté Ce saciés vos de verité Atant s' en est Iseut tornee | Tristran | l' a plorant salüee Sor le perron de marbre bis Tristran s' apuie ce |
| beroul_pb:13_lb:415_1264876249.02 | # croiz Einz croiz parole fole et vaine Ma bone foi me fera saine | Tristran tes niés | vint soz cel pin Qui * est laienz en cel jardin Si me manda |
| beroul_pb:134_lb:4365_1268928771.68 | moi le reçoive En sus l' atent s' espee tient Goudoïne autre voie tient | Tristran [remest] a qui * mot poise | Ist du * buison cela part toise Mais por noient quar cil s' esloigne |

Note that the pivot may be one or more words.

What do the square brackets ([]), slashes (/), asterisks (*) and hashes (#) mean?

The third example in the above table contains [square brackets] in the pivot. These are used in all concordances to indicate words which occur between parts of a discontinuous syntactic constituent.

The annotated subject in this sentence is Tristran ... a qui mot poise. The main verb of the sentence, remest, is not part of the subject, but occurs between its two parts. The verb remest is included in the pivot column, but surrounded by square brackets.

This means that:

Slashes (/) indicate division between sentences in the syntactic annotation. These will not correspond to the editor’s division into sentences as shown in the punctuation.

Asterisks (*) indicate that the preceding word has two syntactic functions (e.g. qui in a qui mot poise is both a relator and a subject). They may usually be ignored.

Hashes (#) are related to the representation of coordination, and may always be ignored.

Single word pivot concordance

The single word pivot concordance has a variable number of columns, based on the following structure:

The single word pivot concordance is designed to give as much information as possible about a single word. For example, a concordance could be created around the word "Tristran":
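A sketch of such a query follows. Because the inflected form Tristranz appears among the results, it assumes a regular expression rather than a plain string:

```text
#pivot:[word = /Tristranz?/]
```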

Below is a selection of the results from the concordance (some columns are omitted):

| Left context in sentence | Pivot | Pivot-headed structure | Right context in sentence |
| Sire | Tristran | Tristran | por Deu le roi Si grant pechié avez de moi Qui * me mandez a itel ore |
|  | Tristran | Tristran tes niés | tes niés vint soz cel pin Qui * est laienz en cel jardin |
| # Que por Yseut que por | Tristranz | que por Tristranz | Mervellose joie menoient |

The ‘pivot-headed structure’ gives the noun phrase of which the word Tristran is head. In the second example, for instance, the word Tristran heads the structure Tristran tes niés.

Note that words appearing in the ‘pivot-headed structure’ column are also found in the two context columns. The original sentence may be read across the columns left context — pivot — right context.

Pivot and block concordance

Introduction

The pivot and block concordance is designed to highlight the position of certain constituents, called ‘blocks’ (e.g. the subject) with respect to a pivot (e.g. the verb). The resulting CSV files are complex, with a large number of columns, and are intended as the basis for more detailed analysis in spreadsheet software.

The pivot and block concordance has the following basic structure:

As with the other concordances, TigerSearch queries must define a #pivot variable. In addition, at least one variable whose name begins with ‘#block’ (e.g. #block1) is required, and any number of such variables may be defined.

For example, the following query will generate a pivot and block concordance to show the position of the subject (#block1) with respect to the finite verb (#pivot):
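A query of this kind might be sketched as follows. It assumes that #pivot is attached to the terminal node of the verb through the ‘L’ edge, since the pivot column in the table below contains only the verb form; #clause is a variable name chosen for illustration:

```text
#clause:[type = "VFin"] >L #pivot
& #clause >D #block1:[cat = "SjPer"]
```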

In essence, the central section of the resulting concordance will take the following form:

| Left context | Block | Pivot | Block | Right context |
|  | Li rois | pense |  | que par folie Sire Tristran vos aie amé |
| Si |  | voient | il | # Deu et son reigne |

Where the subject is pre-verbal, it appears in the block column to the left of the pivot. Where it is post-verbal, it appears in the block column to the right of the pivot.

Why are there square brackets ([]) and curly brackets ({}) in the concordance?

As with other concordances, square brackets denote words occurring between two parts of a discontinuous unit. The difference in this concordance is that blocks may be discontinuous, as well as the pivot.

Curly brackets denote words which occur between the block and the pivot (or, in more complex examples, between two blocks).

| Left context | Block | Pivot | Block | Right context |
|  | Vos {n'} | entendez |  | pas la raison |
| Dex qel pitié |  | Faisoit | {a} {mainte} {gent} li chiens |  |
|  | Ta parole [est] [tost] [entendue] Que li rois la roïne prent | est |  | tost entendue Que li rois la roïne prent |
|  | Tuit [s'] [escrïent] la gent du * reigne {s'} | escrïent |  | la gent du * reigne |

In the table above, note the use of curly brackets in the first example to mark the negative adverb n’, which occurs between the subject-block vos and the verb-pivot entendez. In the second example, the prepositional phrase a mainte gent is marked with curly brackets, as it separates the verb-pivot Faisoit from the post-verbal subject-block li chiens.

In the third example, a discontinuous subject Ta parole ... que li rois la roïne prent appears in a pre-verbal block. The pre- or post-verbal position of a block is determined by the position of its first word relative to the pivot. The words est tost entendue, which separate the two parts of the block, are marked with square brackets.

In the fourth example, the word s’ appears (i) in square brackets, between the two halves of a discontinuous subject-block and (ii) in curly brackets, between the first part of the discontinuous subject Tuit and the verb-pivot escrïent.

Why are there so many columns? I only asked for one block!

The pivot and block concordance shows only one result per pivot. Continuing to work with the same example, if a single verb-pivot has multiple subject-blocks (which is quite possible in cases of coordination), each subject occupies a separate column:

| Block3 | Block2 | Block1 | Pivot | Block |
| Ne tor | ne mur | ne fort chastel {Ne} {me} | tendra |  |

However, due to the way the number of columns is calculated, it is possible that some will be empty. These may be deleted in the spreadsheet software, if you wish.

Note that the concordance will never represent the two halves of a single discontinuous block in separate columns. The following representation therefore indicates a coordination:

| Left context | Block | Pivot | Block | Right context |
|  | Tristran {en} | bese | {la} {roïne} {Et} ele | lui par la saisine |

The SRCMF annotation of the sentence in this table identifies two coordinated subjects of the verb bese. One is pre-verbal (Tristran), one is post-verbal (ele); both occupy separate blocks.

Adding annotation information

When a concordance is launched from the TXM-web interface, you may specify which properties of terminal and non-terminal nodes you wish to see in the concordance.

Each added property will be placed in a separate column next to the block or pivot. For example, if the ‘cat’ property is selected for non-terminal nodes, and the ‘pos’ property is selected for terminal nodes, the query above will produce the following concordance:

| Left context | Block | Block Cat | Pivot | Pivot Pos | Block | Block Cat | Right context |
|  | Li rois | SjPer | pense | VERcjg |  |  | que par folie Sire Tristran vos aie amé |
| Si |  |  | voient | VERcjg | il | SjPer | # Deu et son reigne |

Tagset

Non-terminal nodes

Non-terminal nodes have the following properties and values:

cat

Gives the syntactic function of the element. For more details, please refer to the SRCMF website.

type

Gives the syntactic category of the head of the structure.

dom

A ‘dom’ property is added to each non-terminal node in the tree listing the functions of all its dependants and relators in alphabetical order, separated by underscores. For example, if a finite verb has a subject, object and two adjuncts, the property [dom = "Circ_Circ_Obj_SjPer"] will be added.

This resolves to an extent the problem of ‘negative’ queries. Recall that it is impossible to query the non-existence of a node:
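A query of the kind meant here can be sketched with the variable names #clause and #suj. It assumes that ‘!>D’ is the negated form of direct dominance over a ‘D’ edge:

```text
#clause:[type = "VFin"] !>D #suj:[cat = "SjPer"]
```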

Contrary to appearances, this query DOES NOT mean ‘node #suj does not exist’: it means that the node #suj exists, but is not dependent on #clause.

However, it is possible to find all finite verbs without a subject by using the dom property of the finite verb:
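A sketch of such a query, assuming that the regular expression must match the whole dom value:

```text
#clause:[type = "VFin" & dom != /.*SjPer.*/]
```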

The query specifies that we wish to find a node #clause which is a finite verb and does not have the string ‘SjPer’ in the list of dependent nodes given by the dom property.

coord

A ‘coord’ property is added to each non-terminal node in the tree. If the node represents a coordinated structure, [coord = "y"].

For example, in the sentence Sade et douz est quanqu’est de li (gcoin1: p. 3, l. 31), sade and douz are coordinated AtSj. The non-terminal nodes dominating the words sade and douz have the properties [cat = "AtSj" & coord="y"].

The ‘coord’ property exists primarily to allow non-coordinated structures to be identified. In the original format, this is not possible, as it would require a query specifying the non-existence of a node [cat = "Coo"]. However, with the coord property, it is possible to restrict a query to non-coordinated structures only:
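For instance, a search for non-coordinated AtSj structures might be sketched as below. Only the value "y" is documented here, so the sketch negates it rather than guessing the value used for non-coordinated nodes:

```text
[cat = "AtSj" & coord != "y"]
```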

headpos

A ‘headpos’ property is added to each non-terminal node in the tree. If the text is correctly annotated at the deep level, each non-terminal node representing a structure should directly dominate at most one terminal node in the tree, the word representing the lexical content of the head of the structure. If this is the case, the ‘headpos’ property is equal to the ‘pos’ property of the dominated terminal node. Thus:

is equivalent to:
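To illustrate with the tag "NOMcom" (chosen arbitrarily), a headpos condition such as:

```text
[headpos = "NOMcom"]
```

should match the same non-terminal nodes as a query that follows the ‘L’ edge down to the terminal node, where #n is a hypothetical variable name:

```text
#n >L [pos = "NOMcom"]
```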

The headpos property does not improve the usability of the corpus in TigerSearch, but is useful in producing concordances, providing a more detailed morpho-syntactic tag for the head of a structure than the SRCMF ‘NV’ (non-verbal) type tag.

If the non-terminal node directly dominates more than one terminal node, the algorithm generating the headpos property makes a calculated guess as to which word is the head, and inserts the tag of this word as the ‘headpos’. For example, if a non-terminal node dominates a word with pos ‘NOMcom’ and a word with pos ‘DETdef’, the algorithm will guess that the noun is the head, and insert the headpos ‘NOMcom?’.

Note that headpos values which have been ‘guessed’ are always suffixed by a question mark (e.g. NOMcom?). There will be no guessed headpos values in texts with full NP annotation.

Terminal nodes

Terminal nodes have the following properties:

pos

Part-of-speech tag (Cattex). For more information, please refer to the Cattex documentation on the BFM website.

form

Each word has a property “form”. For texts in prose, the value of the “form” tag is always “prose”. For texts in verse, the form tag is:

It is thus possible to formulate a TS query focusing on words at the beginning or end of a line of verse:

In Aucassin and Nicolete, the form tag correctly distinguishes the verse and prose sections of the text.

q

Each word has a property “q”. This is equal to ‘y’ when the word occurs as part of direct discourse, and ‘n’ when it does not. This annotation is automatically generated by the BFM team from the position of quote marks in the text.

Sample queries

The following sample queries may be tested by copying and pasting into the query panel.

Find all main clause verbs:
[cat = "Snt"]

Find all structures introduced by a preposition:
#n >R #relnc:[cat = "RelNC"]
& #relnc >L [pos = /PRE.*/]

Find all post-verbal NP subjects:
#verb:[type = "VFin"] >D #suj:[cat = "SjPer" & type="nV"]
& #suj >L [pos = /NOM.*/]
& #suj >@l #sword
& #verb >L #vword
& #vword .* #sword

Find indefinite subjects introduced by qui:
[type = "VFin"] >D #suj:[cat = "SjPer"]
& #suj >R #relnc:[cat = "RelNC"]
& ( #relnc >L [word = /[QqKk]u?i/]
| #relnc >~dupl [word = /[QqKk]u?i/] )

Find sentences with coordinated subjects:
#coo:[cat = "Coo"] >~coord #sj1:[cat = "SjPer"]
& #coo >~coord #sj2:[cat = "SjPer"]
& #sj1 $ #sj2

Find sentences with possible gapping of the finite verb (i.e. coordination of subject–predicate pairs):
#gpcoo1:[cat = "GpCoo"] >~ #suj1:[cat = "SjPer"]
& #gpcoo1 $.* #gpcoo2:[cat = "GpCoo"]
& #gpcoo2 >~ #suj2:[cat = "SjPer"]
& #gpcoo1 >~ #pred1:[cat = /Cmpl|Obj|AtSj/]
& #gpcoo2 >~ #pred2:[cat = /Cmpl|Obj|AtSj/]

Useful links