SRCMF Tutorial

SRCMF Project, February 2012
1 Writing a simple query
The following section will enable you to write simple TigerSearch queries for the SRCMF corpus
(modified format). It is not comprehensive, and must be read in conjunction with:
• chapter III of the TigerSearch user’s guide (http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/manual_html.html)
• the query reference manual for the SRCMF corpus [TODO].
1.1 Nodes in the TS graph
A TigerSearch graph is made up of two types of nodes: terminal and non-terminal nodes. In the
graph viewer, terminal nodes appear at the bottom of the graph, while non-terminal nodes are
represented by labelled white ovals:
Illustration 1: TigerSearch graph
Each node has a number of features (in TigerSearch, these are listed in the top left-hand panel).
1.1.1 SRCMF: ‘split’ nodes
In a true dependency graph, words form the only nodes.
In the TigerXML SRCMF corpus, each ‘word’ in the dependency structure is in fact split between a
terminal node (which contains the lexical form and the PoS tag of the word itself) and a
non-terminal node (which contains the syntactic features of the structure headed by the word). The
non-terminal node and the terminal node are linked by an edge labelled ‘L’ (for lexical realization).
In figure 1, an ‘L’ edge links:
• the terminal node puis to the non-terminal node ‘Snt’: these nodes represent the finite verb
which heads the sentence;
• the terminal node je to the non-terminal node ‘SjPer’: these nodes represent the subject of
the sentence, je;
• the terminal node dire to the non-terminal node ‘AuxA’: these nodes represent the infinitive
verb dire.
A ‘D’ edge links the ‘Snt’ node to the non-terminal nodes ‘SjPer’ and ‘AuxA’: this indicates that the
subject je and the ‘auxiliated’ infinitive dire depend on the main verb puis.
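The split-node structure of figure 1 can be sketched as a small data structure. This is only an illustration in Python, not the TigerXML format itself: the node identifiers (e.g. "n_snt") are invented for the example, and only the words, cat values and edge labels come from the description above.

```python
# Figure 1 as plain dictionaries. The node ids are invented for this sketch.
terminals = {
    "t_je": {"word": "je"},
    "t_puis": {"word": "puis"},
    "t_dire": {"word": "dire"},
}
nonterminals = {
    "n_snt": {"cat": "Snt"},
    "n_sj": {"cat": "SjPer"},
    "n_aux": {"cat": "AuxA"},
}
# Labelled edges, written as (parent, label, child).
edges = [
    ("n_snt", "L", "t_puis"),   # lexical realization of the sentence head
    ("n_sj", "L", "t_je"),
    ("n_aux", "L", "t_dire"),
    ("n_snt", "D", "n_sj"),     # dependency edges
    ("n_snt", "D", "n_aux"),
]


def lexical_head(node_id):
    """Return the word linked to a non-terminal node by its 'L' edge."""
    for parent, label, child in edges:
        if parent == node_id and label == "L":
            return terminals[child]["word"]
    return None


def dependants(node_id):
    """Return the cat values of the nodes linked to node_id by a 'D' edge."""
    return [
        nonterminals[child]["cat"]
        for parent, label, child in edges
        if parent == node_id and label == "D"
    ]


print(lexical_head("n_snt"))  # puis
print(dependants("n_snt"))    # ['SjPer', 'AuxA']
```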
1.1.2 SRCMF corpus node features
The SRCMF corpus has the following node features:
Terminal nodes
• word: the word form
• pos: part-of-speech tag (Cattex)
• form: whether the text is verse or prose, and the position of the word in the line of verse
Non-terminal nodes
• cat: function of the structure headed by the node
• type: morpho-syntactic category of the node (VFin, VPar, VInf, NV)
• headpos: part-of-speech tag of the head word
• coord: set to ‘y’ if the structure forms part of a coordination
• dom: underscore-separated list of all functions dominated by the node (e.g. "AuxA_SjPer" for the
"Snt" node above)
• nodom: [TODO: remove from corpus?]
For simple queries, we will focus mainly on the word, pos and cat features.
1.1.3 Defining the feature specifications of a node
Node feature specifications are written between [square brackets] and take the following form:
[feature operator "value"]
where value is a string or
[feature operator /value/]
where value is a regular expression. Permitted operators are ‘=’ (equals) and ‘!=’ (does not equal).
For example, the following expression identifies all nodes where cat is "SjPer" (personal subject):
[cat = "SjPer"]
If we wish to include impersonal subjects (i.e. "SjPer" and "SjImp") we can use a regular
expression:
[cat = /Sj.*/]
We can identify all nodes which are not subjects:
[cat != /Sj.*/]
We may also use the conjunction operator (&) [1] within the square brackets to specify several
properties. For example, we can search for subordinate clause subjects (subjects which are
themselves clauses) by requiring the subject to be headed by a finite verb (type is "VFin"):
[cat = /Sj.*/ & type = "VFin"]
[1] There is also a disjunction operator (|) which may be used in the same way, although in practice
the author finds it less useful.
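To make the logic of a feature specification concrete, here is a minimal Python sketch of how a conjunction of conditions could be evaluated against one node. It is not TigerSearch code. It assumes that a regular expression must match the whole feature value (which is why the examples above write /Sj.*/ rather than /Sj/), and it treats a missing feature as an empty string, which is a simplification.

```python
import re


def matches(node, spec):
    """Check a node (a dict of features) against a list of conditions.

    Each condition is a tuple (feature, operator, value, is_regex) and all
    conditions must hold, as with the '&' operator.
    """
    for feature, op, value, is_regex in spec:
        actual = node.get(feature, "")
        if is_regex:
            hit = re.fullmatch(value, actual) is not None
        else:
            hit = actual == value
        # '=' requires a hit, '!=' requires the absence of a hit.
        if (op == "=") != hit:
            return False
    return True


# Equivalent of [cat = /Sj.*/ & type = "VFin"]
clause_subject = [("cat", "=", "Sj.*", True), ("type", "=", "VFin", False)]

print(matches({"cat": "SjPer", "type": "VFin"}, clause_subject))  # True
print(matches({"cat": "Snt", "type": "VFin"}, clause_subject))    # False
```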
1.1.4 Assigning a variable name to a node
A variable name may be assigned to the node definition. These are useful to refer to the same node
several times in a complex query and are also used to indicate the pivot node to concordance scripts.
Variable definitions adopt the following syntax:
#name:[<definition>]
where definition is a feature specification as described above. Note that variable names must begin
with hash (#) and are separated from their definition by a colon (:).
For example, we may wish to construct a concordance in which the subject forms the pivot. We define
the #pivot variable as follows:
#pivot:[cat = /Sj.*/]
1.2 Node relations
All but the most simple queries will require more than one node to be defined, and will usually
require the relationship between the nodes to be specified.
For example, suppose we wish to identify all subjects headed by the word Tristran. First, we define
the subject:
#subject:[cat = /Sj.*/]
Second, we define the word Tristran as a terminal node:
#tristran:[word = "Tristran"]
Finally, we must indicate the relationship between the nodes. The relationship between a
non-terminal node and the terminal node representing its lexical content in the TigerSearch graph
is one of direct dominance, labelled ‘L’ (lexical).
1.2.1 Direct dominance
In TigerSearch, direct dominance is expressed by using the operator ‘>’ with the following syntax:
node >[label] node2
where node and node2 are feature specifications or node variables, and label (optional) is a string.
To identify subjects headed by the word Tristran, the relationship between nodes #subject and
#tristran is expressed as follows:
#subject >L #tristran
1.2.2 Left corner dominance
The ‘>@l’ operator specifies the leftmost terminal node dominated at any depth by a non-terminal
node. It has the following syntax:
node >@l tnode
where node and tnode are feature specifications or node variables, and tnode is a terminal node.
For example, instead of searching for all subjects which are headed by the word Tristran, we may
wish to identify all subjects beginning with the word Tristran. This relation would be written as
follows:
#subject >@l #tristran
Note that there is also a right corner dominance operator ‘>@r’.
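The idea behind the corner operators can be illustrated with a few lines of Python. This is a sketch only: the word list is taken from one of the concordance examples later in this tutorial, and the set of terminals dominated by the subject is written out by hand rather than read from a corpus.

```python
# Word order of the example clause.
words = ["Tristran", "tes", "niés", "vint", "soz", "cel", "pin"]

# Terminal nodes dominated (at any depth) by the subject node, in no particular order.
subject_terminals = {"tes", "Tristran", "niés"}


def left_corner(terminals):
    """Leftmost dominated terminal, i.e. what '>@l' selects."""
    return min(terminals, key=words.index)


def right_corner(terminals):
    """Rightmost dominated terminal, i.e. what '>@r' selects."""
    return max(terminals, key=words.index)


print(left_corner(subject_terminals))   # Tristran
print(right_corner(subject_terminals))  # niés
```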
1.2.3 Precedence
The precedence operator ‘.*’ permits the user to specify the word order of two terminal nodes with
the following syntax:
tnode .* tnode2
where tnode and tnode2 are feature specifications or node variables representing terminal nodes. [2]
For example, suppose we wish to identify all sentences in which the word Tristran heads the
subject and precedes the main clause verb.
We need to add two additional conditions to the query in 1.2.1. First, we need to identify the
terminal node containing the main verb of the sentence, i.e. the lexical realization of the
non-terminal node ‘Snt’:
#snt:[cat = "Snt"] >L #verb
You may have noticed that #verb has no feature specification. This is perfectly valid in TigerSearch
query syntax. In practice, we know that only one node can be linked to #snt by an ‘L’ relation in the
corpus. #verb is thus defined by its relation to #snt rather than by its features.
We then need to specify that the word Tristran precedes the verb:
#tristran .* #verb
Finally, we need to clarify that #subject is the subject of #snt. Otherwise, we risk finding
subjects of a subordinate clause which happen to precede the main clause verb:
#snt >D #subject
Putting it all together, the query is as follows:
#subject:[cat = /Sj.*/] >L #tristran:[word = "Tristran"]
& #snt:[cat = "Snt"] >L #verb
& #tristran .* #verb
& #snt >D #subject
There is also a direct precedence operator, ‘.’, which specifies that the two terminal nodes must be
directly adjacent.
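The difference between the two precedence operators can be sketched in Python by comparing word positions. The positions below are written by hand for the clause "Tristran tes niés vint soz cel pin" and are only an illustration of the logic, not of how TigerSearch stores its data.

```python
# Position of each word in the clause (0 = first word).
position = {"Tristran": 0, "tes": 1, "niés": 2, "vint": 3, "soz": 4, "cel": 5, "pin": 6}


def precedes(a, b):
    """Emulates 'a .* b': a occurs somewhere before b."""
    return position[a] < position[b]


def directly_precedes(a, b):
    """Emulates 'a . b': b is the word immediately after a."""
    return position[b] - position[a] == 1


print(precedes("Tristran", "vint"))           # True
print(directly_precedes("Tristran", "vint"))  # False
print(directly_precedes("niés", "vint"))      # True
```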
1.2.4 Negation
It is important to learn one (extremely frustrating) golden rule of Tiger query syntax:
• you can negate a feature specification (e.g. [cat != "SjPer"]);
• you can negate a relation between nodes (e.g. #subject !>L #tristran);
• but you can’t negate the existence of a node!
In practice, this means that when we write:
#snt:[cat = "Snt"] !>D #subject:[cat = /Sj.*/]
we have not found all null subject main clauses. Instead, we have asked for sentences (#snt) which
contain a subject node (#subject) which is not the subject of a sentence. TigerSearch will return all
sentences with subjects in a subordinate clause.
[2] This is not quite correct: precedence operators can also be used with non-terminal nodes. In this
case, precedence is implicitly based on the position of the left-most terminal node dominated.
However, it is best to avoid ‘implicit’ steps in your query, as they are likely to cause you to make
mistakes!
The SRCMF corpus provides a partial work-around for this problem by using the dom feature. The
dom feature of a non-terminal node lists the cat features of all nodes linked to it by a ‘D’ edge in
alphabetical order separated by an underscore. For example, the ‘Snt’ node in figure 1 has two
dependants: SjPer and AuxA. It therefore has a dom property ‘AuxA_SjPer’.
As a result, we can identify all main clauses without subjects by negating the dom feature:
#snt:[cat = "Snt" & dom != /.*Sj.*/]
This will return all ‘Snt’ nodes whose dom property does not contain the characters ‘Sj’: in other
words, main clauses without an expressed subject.
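The way the dom value is built, and why negating it works, can be sketched in Python. Two assumptions are made here: that Python's default string sort gives the same order as the corpus's alphabetical order, and that a regular expression must match the whole value, as the query above suggests.

```python
import re


def dom_value(dependant_cats):
    """Join the cat values of all 'D' dependants, sorted alphabetically."""
    return "_".join(sorted(dependant_cats))


def has_no_subject(dom):
    """Mirror of the condition dom != /.*Sj.*/."""
    return re.fullmatch(r".*Sj.*", dom) is None


figure_1 = dom_value(["SjPer", "AuxA"])
print(figure_1)                  # AuxA_SjPer
print(has_no_subject(figure_1))  # False
print(has_no_subject("AuxA"))    # True
```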
1.2.5 Syntactic variation
TigerSearch syntax is quite flexible, and we may express queries in a number of ways. For
example, the query identifying all subjects headed by the word Tristran may be expressed using
three statements...
#subject:[cat = /Sj.*/]
& #tristran:[word = "Tristran"]
& #subject >L #tristran
... or two statements, e.g.:
#subject:[cat = /Sj.*/]
& #subject >L #tristran:[word = "Tristran"]
... or one statement:
#subject:[cat = /Sj.*/] >L #tristran:[word = "Tristran"]
... or without variable names:
[cat = /Sj.*/] >L [word = "Tristran"]
Where multiple statements are used, the order of statements is irrelevant. Confusingly for
programmers, you may reference variables before assigning a value, e.g.:
#subject >L #tristran
& #tristran:[word = "Tristran"]
& #subject:[cat = /Sj.*/]
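Because statements are simply joined by ‘&’ and their order does not matter, queries are easy to assemble from reusable parts. The helper below is a hypothetical convenience function written for this tutorial, not part of TigerSearch or of the SRCMF scripts.

```python
def build_query(statements):
    """Join individual statements into one multi-line TigerSearch query."""
    return "\n& ".join(statements)


subject = '#subject:[cat = /Sj.*/]'
tristran = '#tristran:[word = "Tristran"]'
relation = '#subject >L #tristran'

# Any ordering of the three statements describes the same query.
print(build_query([subject, tristran, relation]))
print(build_query([relation, tristran, subject]))
```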
2 Using concordances
The SRCMF project has developed a number of concordances to present the results of TigerSearch
queries in tabular format. Three concordances are currently implemented:
• basic concordance
• single word pivot concordance
• pivot and block concordance
These concordances produce a text CSV file and are integrated into the TXM-Web interface (see
section [reference not found]). If you are not using TXM-Web, concordances can still be
produced, but the procedure is more complex (see section 2.6 below).
2.1 Principles
The concordances use the names of variables from the TigerSearch query to identify the syntactic
constituents which should form the focus of the table. All concordances require a #pivot variable to
be present in the query.
For example, the following query is correct in TigerSearch, but will not produce a concordance:
[word = /Tristr?a[nm][sz]?/]
To produce a concordance, the query must identify a node as the #pivot, for example:
#pivot:[word = /Tristr?a[nm][sz]?/]
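Before launching a concordance, it can be useful to check that a query actually names a #pivot variable. The following Python sketch does this with a regular expression. It assumes that variable names consist of letters, digits and underscores, which the tutorial does not state, and it is not part of the concordance scripts.

```python
import re


def defines_pivot(query):
    """Return True if the variable name #pivot occurs in the query."""
    # The lookahead rejects longer names such as '#pivots'.
    return re.search(r"#pivot(?![A-Za-z0-9_])", query) is not None


print(defines_pivot('[word = /Tristr?a[nm][sz]?/]'))         # False
print(defines_pivot('#pivot:[word = /Tristr?a[nm][sz]?/]'))  # True
```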
2.2 Basic concordance
The basic concordance has four columns:
• sentence ID
• left context
• pivot
• right context
The #pivot can be any node in the syntactic tree, either a single word or a larger structure.
Currently, only lexical information (not annotation) can be shown in the basic concordance.
For example, we may wish to create a concordance of all the main clause subjects containing the
word ‘Tristran’ [modified format query]:
#snt:[cat = "Snt"] >D #pivot:[cat = "SjPer"]
& #pivot >* [word = /Tristr?a[nm][sz]?/]
Note that the #pivot variable is attached to the subject node (cat = "SjPer").
Below is a selection of the results from the concordance:
| ID | Left context | Pivot | Right context |
| beroul_pb:8_lb:234_1263227636.06 | di por averté Ce saciés vos de verité Atant s' en est Iseut tornee | Tristran | l' a plorant salüee Sor le perron de marbre bis Tristran s' apuie ce |
| beroul_pb:13_lb:415_1264876249.02 | # croiz Einz croiz parole fole et vaine Ma bone foi me fera saine | Tristran tes niés | vint soz cel pin Qui * est laienz en cel jardin Si me manda |
| beroul_pb:134_lb:4365_1268928771.68 | moi le reçoive En sus l' atent s' espee tient Goudoïne autre voie tient | Tristran [remest] a qui * mot poise | Ist du * buison cela part toise Mais por noient quar cil s' esloigne |
Table 1: Basic concordance, subject contains "Tristran"
Note that the pivot may be one or more words.
2.3 What do the square brackets ([]), slashes (/), asterisks (*) and hashes (#) mean?
The third example in table 1 contains [square brackets] in the pivot. These are used in all
concordances to indicate words which occur between parts of a discontinuous syntactic
constituent.
The annotated subject in this sentence is Tristran ... a qui mot poise. The main verb of the sentence,
remest, is not part of the subject, but occurs between its two parts. The verb remest is included in
the pivot column, but surrounded by square brackets.
This means that:
• the pivot column contains all parts of discontinuous pivots;
• reading the concordance from left to right will always give the original sentence.
Slashes (/) indicate division between sentences in the syntactic annotation. These will not
correspond to the editor’s division into sentences as shown in the punctuation. [TODO: implement
these in the basic concordance, currently only in the other two.]
Asterisks (*) indicate that the preceding word has two syntactic functions (e.g. qui in a qui mot
poise is both a relator and a subject). They may usually be ignored.
Hashes (#) are related to the representation of coordination, and may always be ignored [TODO:
remove from export?]
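Since reading a basic concordance row from left to right gives the original text, the plain word sequence can be recovered by removing the markers described in this section. The Python sketch below assumes that asterisks, hashes and slashes always appear as separate, space-delimited tokens, as they do in the examples shown in this tutorial.

```python
import re


def original_words(left, pivot, right):
    """Rebuild the running text of one basic concordance row."""
    text = " ".join([left, pivot, right])
    # Words in square brackets belong to the sentence: keep them, drop the brackets.
    text = re.sub(r"[\[\]]", "", text)
    # Drop the annotation markers themselves.
    tokens = [t for t in text.split() if t not in {"*", "#", "/"}]
    return " ".join(tokens)


print(original_words(
    "Goudoïne autre voie tient",
    "Tristran [remest] a qui * mot poise",
    "Ist du * buison cela part toise",
))
# Goudoïne autre voie tient Tristran remest a qui mot poise Ist du buison cela part toise
```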
2.4 Single word pivot concordance
The single word pivot concordance has a variable number of columns, based on the following
structure:
• ID
• Left context outside the SRCMF sentence containing the pivot
• Left context within the SRCMF sentence containing the pivot
• Pivot word followed by all its properties in separate columns (e.g. word, part-of-speech tag,
form)
• Structure headed by the pivot [only correct if modified format corpus used]
• Function of the structure headed by the pivot [only correct if modified format corpus used]
• Right context within the SRCMF sentence containing the pivot
• Right context outside the SRCMF sentence containing the pivot
The single word pivot concordance is designed to give as much information as possible about a
single word. For example, a concordance could be created around the word "Tristran":
#pivot:[word = /Tristr?a[nm][sz]?/]
Below is a selection of the results from the concordance (some columns are omitted):
| Left context in sentence | Pivot | Pivot pos | Pivot form | Pivot-headed structure | Pivot-headed structure function | Right context in sentence |
| Sire | Tristran | NOMpro | vers | Tristran | ModA | por Deu le roi Si grant pechié avez de moi Qui * me mandez a itel ore |
|  | Tristran | NOMpro | vers_first | Tristran tes niés | SjPer | tes niés vint soz cel pin Qui * est laienz en cel jardin |
| # Que por Yseut que por | Tristranz | NOMpro | vers_end | que por Tristranz | Circ | Mervellose joie menoient |
Table 2: Single word pivot concordance, pivot is "Tristran"
The ‘pivot-headed structure’ gives the noun phrase of which the word Tristran is head, and the
following column the syntactic function of the pivot-headed structure. In the second example, for
instance, the word Tristran heads the structure Tristran tes niés, which is the subject of the sentence.
Note that words appearing in the ‘pivot-headed structure’ column are also found in the two context
columns. The original sentence may be read across the columns left context — pivot — right
context.
2.5 Pivot and block concordance
2.5.1 Introduction
The pivot and block concordance is designed to highlight the position of certain constituents, called
‘blocks’ (e.g. the subject) with respect to a pivot (e.g. the verb). The resulting CSV files are
complex, with a large number of columns, and are intended as the basis for more detailed analysis
in spreadsheet software.
The pivot and block concordance has the following basic structure:
• ID
• Left context outside the SRCMF sentence containing the pivot
• Left context within the SRCMF sentence containing the pivot
• Pre-pivot blocks
• Pivot
• Post-pivot blocks
• Right context within the SRCMF sentence containing the pivot
• Right context outside the SRCMF sentence containing the pivot
As with the other concordances, TigerSearch queries must define a #pivot variable. In addition, at
least one variable whose name begins with ‘#block’ (e.g. #block1) is required; any number of such
variables may be defined.
For example, the following query will generate a pivot and block concordance to show the position
of the subject (#block1) with respect to the finite verb (#pivot) [modified format syntax]:
#snt:[cat = "Snt"] >D #block1:[cat = "SjPer"]
& #snt >L #pivot
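To see at a glance which block variables a query defines, the variable names can be extracted with a short Python function. This is a hypothetical helper, not part of the concordance scripts, and it assumes that variable names consist of letters, digits and underscores.

```python
import re


def block_variables(query):
    """Return the sorted names of all variables beginning with '#block'."""
    names = set(re.findall(r"#[A-Za-z0-9_]+", query))
    return sorted(name for name in names if name.startswith("#block"))


query = '#snt:[cat = "Snt"] >D #block1:[cat = "SjPer"]\n& #snt >L #pivot'
print(block_variables(query))  # ['#block1']
```

An empty list means that the query cannot be used for a pivot and block concordance.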
In essence, the central section of the resulting concordance will take the following form:
| Left context | Block | Pivot | Block | Right context |
|  | Li rois | pense |  | que par folie Sire Tristran vos aie amé |
| Si |  | voient | il | # Deu et son reigne |
Table 3: Simplified pivot and block concordance: block is subject, pivot is finite verb.
Where the subject is pre-verbal, it appears in the block column to the left of the pivot. Where it is
post-verbal, it appears in the block column to the right of the pivot.
2.5.2 Adding annotation information
When the concordance is launched from the TXM-web interface, you may specify which properties
of terminal and non-terminal nodes you wish to see in the concordance.
• On the ‘Export Concordance’ form, select ‘pivot and block concordance’. The last two
options will become active;
• Select the features of terminal and non-terminal nodes that you wish to show in the
concordance from the two drop-down lists;
• Click ‘OK’.
Each added property will be placed in a separate column next to the block or pivot. For example, if
the ‘cat’ property is selected for non-terminal nodes, and the ‘pos’ property is selected for terminal
nodes, the query above will produce the following concordance:
| Left Context | Block | Block Cat | Pivot | PivotPos | Block | Block Cat | Right Context |
|  | Li rois | SjPer | pense | VERcjg |  |  | que par folie Sire Tristran vos aie amé |
| Si |  |  | voient | VERcjg | il | SjPer | # Deu et son reigne |
Table 4: Pivot and block concordance with added annotation
2.5.3 Why are there square brackets ([]) and curly brackets ({}) in the concordance?
As with other concordances, square brackets denote words occurring between two parts of a
discontinuous unit. The difference in this concordance is that blocks may be discontinuous, as
well as the pivot.
Curly brackets denote words which occur between the block and the pivot (or, in more complex
examples, between two blocks).
| Left context | Block | Pivot | Block | Right context |
|  | Vos {n'} | entendez |  | pas la raison |
| Dex qel pitié |  | Faisoit | {a} {mainte} {gent} li chiens |  |
|  | Ta parole [est] [tost] [entendue] Que li rois la roïne prent | est |  | tost entendue Que li rois la roïne prent |
|  | Tuit [s'] [escrïent] la gent du * reigne {s'} | escrïent |  | la gent du * reigne |
Table 5: Simplified pivot and block concordance: use of brackets
In table 5, note the use of curly brackets in the first example to mark the negative adverb n’, which
occurs between the subject-block vos and the verb-pivot entendez. In the second example, the
prepositional phrase a mainte gent is marked with curly brackets, as it separates the verb-pivot
Faisoit from the post-verbal subject-block li chiens.
In the third example, a discontinuous subject Ta parole ... que li rois la roïne prent appears in a
pre-verbal block. The pre- or post-verbal position of a block is determined by the position of its
first word relative to the pivot. The words est tost entendue, which separate the two parts of the
block, are marked with square brackets.
In the fourth example, the word s’ appears (i) in square brackets, between the two halves of a
discontinuous subject-block and (ii) in curly brackets, between the first part of the discontinuous
subject Tuit and the verb-pivot escrïent.
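When analysing the exported file outside a spreadsheet, the contents of a block cell can be separated into its three kinds of word. The Python sketch below assumes that each bracketed word is bracketed individually and separated by spaces, as in the examples of table 5; the dictionary keys are names invented for this illustration.

```python
import re


def split_block_cell(cell):
    """Sort the tokens of a block cell by their bracket type."""
    parts = {"block": [], "inside_gap": [], "towards_pivot": []}
    for token in cell.split():
        if token in {"*", "#"}:
            continue  # markers that may be ignored
        if re.fullmatch(r"\[.*\]", token):
            # Word standing between two parts of a discontinuous block.
            parts["inside_gap"].append(token[1:-1])
        elif re.fullmatch(r"\{.*\}", token):
            # Word standing between the block and the pivot (or another block).
            parts["towards_pivot"].append(token[1:-1])
        else:
            parts["block"].append(token)
    return parts


print(split_block_cell("Tuit [s'] [escrïent] la gent du * reigne {s'}"))
```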
2.5.4 Why are there so many columns? I only asked for one block!
The pivot and block concordance shows only one result per pivot. Continuing to work with the
same example, if a single verb-pivot has multiple subject-blocks (which is quite possible in cases of
coordination), each subject occupies a separate column: [3]
| Block3 | Block2 | Block1 | Pivot | Block |
| Ne tor | ne mur | ne fort chastel {Ne} {me} | tendra |  |
Table 6: Coordinated subject-blocks
However, due to the way the number of columns is calculated, it is possible that some will be
empty. These may be deleted in the spreadsheet software, if you wish.
Note that the concordance will never represent the two halves of a single discontinuous block in
separate columns. The following representation therefore indicates a coordination:
| Left context | Block | Pivot | Block | Right context |
|  | Tristran {en} | bese | {la} {roïne} {Et} ele | lui par la saisine |
Table 7: Representation of coordinated subjects
The SRCMF annotation of the sentence in table 7 identifies two coordinated subjects of the verb
bese. One is pre-verbal (Tristran), one is post-verbal (ele); each occupies a separate block.
[3] Those used to TigerSearch will note that the concordance combines multiple matches within a
single sentence where the #pivot variable is the same.
2.6 Using concordances with TigerSearch export
The concordances are also accessible outside the TXM-web interface. Note that it is not possible to
restore the punctuation of the base edition.
2.6.1 Create a TigerXML export with matches
Once you have executed your query in TigerSearch:
• Select ‘Query > Export Matches’ from the toolbar above;
• On the export form:
  ◦ set ‘Export Format’ to XML;
  ◦ set ‘Export to file’ to the output file of your choice;
  ◦ set ‘Export includes’ to ‘Whole corpus’.
• Click ‘Submit’.
Illustration 2: TigerSearch export settings
This will create a Tiger-XML file to which you must apply a concordance script.
2.6.2 Download the concordance scripts
The concordance scripts are found in the SRCMF Tiger-XML svn repository (see section TODO
above).
• Basic concordance: xsl/concordance_simple.xsl
• Single word pivot concordance: xsl/concordance_mot-pivot.xsl
• Pivot and block concordance: scripts/concordance_pivot_blocks.py OR
scripts/concordance_blocks.groovy
2.6.3 Applying XSL concordances
For concordances in XSL format, you will need an XML stylesheet processor (e.g. Oxygen,
xsltproc) to apply the XSL to the Tiger-XML file. The procedure will depend on the tool you use.
The size of the context can be modified by passing the parameter ‘cx’ to the stylesheet.
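As an illustration for xsltproc users, the Python sketch below assembles a possible command line. The options --output and --stringparam are standard xsltproc options, but the file names, the value 10 and the choice of --stringparam for ‘cx’ are assumptions made for this example: check the stylesheet itself for the values it expects.

```python
def xsltproc_command(stylesheet, tiger_xml, output_csv, context_size=None):
    """Build an xsltproc command line as a list of arguments."""
    cmd = ["xsltproc", "--output", output_csv]
    if context_size is not None:
        # 'cx' is the context-size parameter mentioned above; its unit is not documented here.
        cmd += ["--stringparam", "cx", str(context_size)]
    cmd += [stylesheet, tiger_xml]
    return cmd


# Hypothetical file names: export.xml is the TigerSearch export, concordance.csv the result.
cmd = xsltproc_command("xsl/concordance_simple.xsl", "export.xml", "concordance.csv", context_size=10)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) if xsltproc is installed.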
2.6.4 Applying concordance scripts
The .py and .groovy scripts run from the command line, and require you to have installed the
Python or Groovy interpreter. If this is second nature to you, you will find the scripts easy to
execute: simply run the script to view the correct command line syntax. If not, we suggest using the
TXM-web interface.
The scripts have only been tested on Linux, and we offer no guarantees or warranty!