In the followings we present the principles of a parser which departs in many ways from previous approaches to sentence analysis and keeps the following aims in view:
|a) The system is psycholinguistically motivated, meaning that as far as possible it follows the patterns of human language processing, as set forth in (Pléh & Lukács 2014).|
|b) As a performance based system, it attempts to process all linguistic forms that exist in texts (Prószéky et al 2015), not putting a particular emphasis on theoretically grammatical but naturally very rarely occurring phenomena. At the same time, our system tries to treat all texts, including ill-formed or agrammatical ones, as valid linguistic input.|
|c) Utilizes a strictly left-to-right approach, processing the input word by word.|
The parser makes no assumptions about the not yet processed part of the input text and does not refer to it in any way. If a decision cannot be made based on the currently available information, the processing moves on and the decision is deferred until such information becomes available.
|d) The parser has a parallel architecture. As opposed to previous approaches, where analyses were usually the result of a series of processing modules connected as a pipeline, our system uses parallel threads to process the current input word (morphological analyser, modules responsible for different syntactic phenomena, statistical checks, anaphora resolution, focus identification, etc.). These threads communicate and co-operate with each other correcting each other's mistakes, and the final output is a result of the interplay among all these processing units.|
|e) The whole text rather than just a single sentence is the basic unit of representation, making it possible to represent anaphoric relationships inside and between sentences in a uniform manner.|
|f) As a result of this, and the parallel nature of the whole parsing process, the final representation is not necessarily a tree, but a contiguous graph, possibly with different types of arcs.|
Performance based approach
Computational linguists noticed early on that generative models, which have become prevalent in the past few decades of linguistic research, are not particularly efficient at analysing real-world, non-ideal texts. One of the reasons, identified a long time ago, was that transformations as introduced by Chomsky (1957) and utilized in subsequent generative models, are irreversible. This, however, is not the main reason, since we now have many non-transformational generative models (GPSG, LFG, HPSG, TAG, etc.) The problem lies more within the fact that generative theories are centred around competence as opposed to performance and efficient parsing methods are beyond their horizons.
Being performance based means to us first and foremost the requirement that all naturally occurring texts should be analysable by our system. Contrariwise, even if a linguistic unit is theoretically well-formed, but not attested in a sufficiently large text corpus, it is in a sense less important to us. Human language processing works simultaneously with the perception of utterances – in a left-to-right direction, so to speak – and makes use of all available information that is required to make sense of it, regardless of its well-formedness in terms of traditional grammar. Therefore, our system does not allow reference to sentence constituents which have not yet been heard or read, however, it is possible to anticipate certain continuations based on present data and statistical evidence, as long as the input does not end. By stressing the importance of sequential processing, we obviously do not mean to imply that garden path sentences do not exist, backtracking does occur in human language processing, but it would seem that in everyday communication following Gricean maxims (Grice, 1975) people tend to avoid such constructions, and they mostly occur in jokes and intentionally confusing language. We used existing large text corpora to substantiate this claim (Prószéky et al 2015), and we also started building a modern web based corpus collecting all available text on the Hungarian internet to complement the now 10-year-old Webkorpusz (Halácsy et al 2004). Another important goal of the corpus building endeavour is our study of rare linguistic phenomena, which are often examined by modern theorists, but seem to have little relevance in real-life applications according to our statistics (Endrédy & Novák 2013, Ligeti-Nagy 2015).
As Proszéky (2000) points out, our decisions made while choosing among different linguistic constructs during parsing can override the lexicon. Rule-based parsers using databases representing our previous linguistic knowledge, such as the lexicon and syntactic patterns, as well as statistical systems relying on frequency data relating to certain linguistic units obtained from corpora, all of which represent 'past knowledge', do not consider extra-linguistic features (such as the place where the utterance was made). Our initial hypothesis states that the processes invoking previous linguistic knowledge and the ones making decisions based on the current situation are at work simultaneously. The latter is able to process linguistic input in real time even if 'learned' linguistic constructions contradict the data to be processed (e.g. they have incompatible semantic feature sets). See (1) as an example. According to our previous ontological knowledge, it should be the dog that bites, not the postman (at least in relation to a dog).
(1) A postás megharapta a kutyát.
The postman bit the dog.
Psycholinguistically and linguistically motivated approach
It was more than forty years ago that the notion of comprehension strategies was first introduced, which should have been ample time to see them implemented in a computational language processing system. At the time when transformational generative grammar gained impetus, Bever's observation (1970) that the comprehension processes cannot be modelled as a reversal of generative derivations, was already relevant. Instead, he proposed 'comprehension strategies' that work probabilistically and compete with each other. What is more, contrary to the still prevalent view in theoretical linguistics, he assumed these to be non-monotonic processes. Bever already made reference to slower, but sometimes indispensable grammatical rule systems, which are needed to complement more efficient frequency based approaches. Naturally, grammatical rules are also arrived at as an abstracted summation of frequency information.
Bever also introduced the notion of an 'arbiter', which is responsible for making the ultimate decision when background knowledge and linguistic idiosyncrasies are in contradiction. Kimball (1973) algorithmised these ideas and created what can be viewed as a computational strategy for analysing natural language based on seven psycholinguistic principles. Kimball's approach was not independent of the then prevalent generative paradigm and the English language, which was pretty much a synonym for 'human language' at the time. Frazier and Fodor (1978) in their study of difficult to parse sentences and structural ambiguities proposed a strategic approach that deferred decisions to later times whenever a constituent could not be identified as having a certain grammatical function based on the information available at that point. Interactional comprehension models (in: Pléh 1998) appearing in the 1980s introduced some new ideas. One of these new ideas was taking the speed of processing into account. Another important discovery was the fact that non-linguistic knowledge also played a part in low-level processes, since efficient processing depends on making use of all kind of information as early as possible, i.e. different levels of linguistic processing, which until then had been thought to operate hierarchically, are in fact in constant interaction. Pléh (1998), while presenting different comprehension strategies and trying to reconcile them, describes the comprehension process as the combination of three processes: word recognition, sentence comprehension, and text analysis. According to him, any system modelling full comprehension has to provide a solution both to the problem of describing linguistic constructions and to the ordering of their processing, as well as cataloguing the linguistic features that allow the system to identify these constructions in real time.
There are other linguistic phenomena apart from syntactic roles that are being processed by our system concurrently, such as theme-rheme analysis, and anaphoric relationships. The former task, i.e. identifying already introduced and newly presented information is facilitated by word order and stress patterns in Hungarian. Stress is not indicated in writing, so humans rely on their previous experience of similar sentence structures, which strategy is obviously not available to computers processing written text. As a fall-back mechanism, we have to make do with what little information can be gleaned from different parts of the sentence, allowing us to reconstruct the boundaries between the main constituents. For example, the position immediately preceding the main verb can give us a clue, as this structural position usually holds the most salient piece of information conveyed by the sentence. Constituents before this position and the ones following the verb do not have this extra meaning. The focus position can only be determined once the main verb has been identified: this is the position right before the verb. Anything preceding the focus will be the topic of the sentence (2).
Anaphora resolution also happens simultaneously with syntactic and theme-rheme analysis, whether it is a case of identifying pronominal antecedents, synonyms or other kinds of anaphoric relationship (3).
The foundations of psycholinguistically motivated analysis
Our system is based on an analytic grammar, hence the name ANAGRAMMA. Following the principles set forth in Pléh (1999), the parser processes the input tokens, in our case, words, strictly in a left to right direction. At any given step, processing the current token means evaluating information provided by all running threads, then (a) stopping (b) starting (c) continuing threads based on the evaluation. We claim that morphological disambiguation and the prevention of combinatorial explosions happens as a by-product of the interaction among threads responsible for different aspects of linguistic analysis, the latter being a major problem for rule-based systems while parsing long sentences even when morphological disambiguation is provided. The same problem is handled by local optimization in statistical systems, which may result in completely erroneous analyses when faced with rarely occurring constructions.
Our system processes linguistic input as a series of discrete steps. In our model, the basic unit of processing is a word, technically defined as a string of whitespace-separated characters, so we may also think of the series of input words as a clock signal coordinating the work of the processing threads.
The first processing thread to be started on each 'tick' is morphological analysis. For the sake of simplicity, and because of our focus on syntax, we conceptualise this process as a monolithic one, even though in human language processing this process, similarly to all the others, takes place in several steps over a period of time, interacting with other processes. Morphological analysis provides the features which serve as the basis for further processing. It determines the lemma and other morphosyntactic information required by the next steps. Some of these features belong to the demand class, which create threads that look for suitable features that may satisfy them, while others belong to the supply class, which may satisfy the demand created by already existing or future threads.
We will demonstrate the principles of ANAGRAMMA at work through the steps of the following sentence's analysis (4).
The first word processed by the system is beüzemelték ('install/put into operation'). This is a finite verb, serving as the main verb in the clause. This information (FIN) appears as a supply thread, but in most cases this will not be connected to any other node, unless the clause is subordinated to another (main) clause. The verb stem is beüzemel (containing an incorporated verb modifier), which starts two demand threads for a nominative (NOM?) and an accusative (ACC?) case marking, since our lexicon of verb complementation patterns indicates that the verb in question has a subject and an object in normal usage.
So far as the subject is concerned, it is already apparent from the finite ending on the verb that it has to be in the 3rd person, plural and nominative. This semantically underspecified constituent could end up being realized by a noun phrase appearing later in the sentence, which would further specify the semantic content of the subject (e.g. Beüzemelték a gyártók a … 'The manufacturers installed the …), or failing that, it could remain unspecified, being a general subject. We are also looking for an object (ACC?+DEF?), which is definite (as indicated by the DEF? demand). The requirement for a definite noun phrase comes from the verb ending. Hungarian has definite-indefinite agreement between the verb and its object. The indefinite verb form would have been 'Beüzemeltek'.
At the next clock signal the morphological analyser fetches the morphosyntactic lexicon data relating to the definite article 'a'. DET is a part of speech feature, which starts a supply thread, just like DEF, the feature indicating definiteness. Definiteness information will be associated with the head of the noun phrase once the element closing the NP has been identified (case marking or postposition).
At the moment, we have two demands waiting to be satisfied: the two complements of the verb, one of which is optional, since the subject can be realized by the verb ending. We also have a supply thread running (DET) which will be consumed when a demand for a determiner has been found.
The next word is Balaton. This is noun (N) with a zero ending. At this point in the analysis, it is still unknown whether this zero is a nominative ending or an unmarked possessive, but we already know that the subject is plural, so to make this example easier to follow (the real analysis would be somewhat more complicated), we can assume the noun to be part of a possessive construction, which starts a demand for a person marker (PERS?), as possessive markers have person features.
The lemma of the surface form vihar-előrejelző ('strom signalling') is the same, and its part of speech is adjective (ADJ).
The last word is rendszerét ('system-POSS.Sg3-ACC'). The lemma is rendszer, part of speech is N. Nouns start optional demand threads for any preceding ADJ nodes (
The word rendszer offers the feature ACC as supply, which is immediately used to satisfy the demand for an (ACC?) present since the first step. In other words, the noun rendszer (or more precisely the NP headed by rendszer), becomes the object of beüzemel. This object NP may be definite or indefinite depending on the presence of a definite determiner on the left edge of the constituent. The case ending, in this example, the ACC feature, starts a demand thread (DET?), which finds the definite article, as a result of which a DEF supply thread is created. This, in turn, satisfies the demand for a definite feature started earlier in relation to the object of the verb beüzemel. The possessive marker (PERS) appearing between the stem and the case ending, is immediately associated with the demand (PERS?) started by the word Balaton.
The full stop at the end of the word is not part of the lexical form (as it would be in the case of the abbreviation 'stb.', for example), so the sentence is ended. Since no 3rd person plural NP was found in the sentence, the subject remains a general pronoun realized by the verb ending.
Some linguistic phenomena in the ANAGRAMMA system
In our parser, as seen in the previous example, demand and supply arise simultaneously, which lends itself to a parallel solution as opposed to traditional sequential methods. In ANAGRAMMA, not only strictly morphological and syntactic processes can run simultaneously, but threads responsible for statistical information, corpus frequency, ontologies, world descriptions, etc. can be applied side by side.
ANAGRAMMA attempts to avoid combinatorial explosion along the lines of human language processing, i.e. using prior knowledge in the form of usage statistics. Frequently occurring structures may enter the analysis with their internal structure already in place. In computer science we may refer to such solutions as caching, but the process is also well known in psycholinguistics where it is called 'Gestalt' (Pléh & Lukács 2001). In the case of human language comprehension, the term holistic processing is used. This is how we store frequent multi-word sequences such as proper nouns, idioms, and even often used conversational formulas. ANAGRAMMA keeps looking out for the occurrence of such larger units, and whenever the system seems to encounter one of these based on the initial slice currently available, it proceeds to identify them word by word until it can make sure whether the multi-word unit is present in its entirety. For the duration of this process, other threads are suspended, since it would be wasteful to analyse such structures word by word when their complete analysis may already be available. (5) illustrates the processing of such multi-word units, which do not belong in a basic lexicon.
Competence based syntactic models 'by their very nature' do not assume any kind of interaction with non-linguistic information processing subsystems. Performance, however, cannot be separated from the effect of other cognitive processes on language (cf. Frazier & Fodor 1978 and Pléh 1998), therefore, from the very first ANAGRAMMA was designed with the parallel application of extra-linguistic (world knowledge, mood, etc.) modules in mind.
Furthermore, unlike traditional solutions, we do not restrict ourselves to processing just a single sentence at a time. Instead, we analyse whole text units (representing a unit of thought, typically a paragraph), since the presence or absence of a conjunction should not result in drastically different representations of the same semantic content, just because of an almost irrelevant surface distinction such as different sentence boundaries.
The directed edges of structural representations – as seen in the diagrams – are similar to those of dependency grammars (Tesniére 1957), and the operation of the system is not unlike incremental parsers (Menzel 2013).
Since not all processes end at sentence boundaries, connecting partial representations may not be realized within a single sentence. Referential elements, for example, may introduce edges that are part of ANAGRAMMA representations, but are akin to coindexation in traditional generative grammars (7). Identifying the entities bearing different thematic roles and establishing their coreferentiality is important in the final representation, because we need to see which ones are identical in the real world ("who is who?") In other words, we would like to determine correctly if a referent is "new" at a given point in the discourse, and which linguistic units refer to entities already introduced in the text, and whether they have any kind of relationship with previous ones
Another reason why it is important for our parser to look beyond sentence boundaries is ellipsis. We believe that analyses should not stop at sentence boundaries, because the natural basic units of human communication are longer text units, not isolated sentences. The topic of successive sentences is often the same, so it is quite possible, in fact it happens very often, that certain constituents are left out (i.e. ellipsis occurs), which poses a serious difficulty for most traditional parsers, even when the ellipsis is sentence-internal (8). Our system aims to process coherent texts consisting of a few contiguous sentences, larger bodies of text are beyond our scope at present.
Ellipsis is related to the problem of coordinate structures, which can only be identified when a coordinator is encountered in the input stream. The coordinator may be an actual conjunction (e.g. and, or), or it may be a simple comma. These elements introduce the next constituent forming the coordinate structure (9). When the parser recognizes such an element (but not earlier), the representation of the last constituent has to be modified according to the new structure, since at the time of its creation it was still unknown that it would later become the first conjunct.
Coordination structures are treated as single units, without taking positions on whether such constructs have heads.
Connecting sentence parts based on referentiality (resolving relative pronouns and anaphora, etc.), and displaying theme-rheme distinction, as well as syntactic dependencies yields a special graph. The output contains information other than just mere syntactic analysis, since our goal is identifying all participants in the given linguistic situation, as well as the events they take part in, together with their relationships to each other. The final representation produced by our system is a contiguous directed graph depicting a sentence or a paragraph, which allows us to answer questions like 'Who did what when and where?' (10)
Sub-projects supporting the development of the ANAGRAMMA parser
Automatic corpus building
When collecting texts from the internet in order to build a text corpus we face many different problems. One such problem is the presence of repeated superfluous elements in the data, also known as 'boilerplate'. To strip the data of such elements, we created the GoldMiner algorithm (Endrédy 2014). This software is also able to differentiate between different types of text, such as edited articles and the comment section. Different text types have different errors or idiosyncrasies, and keeping them separate allows us to normalize them more effectively. For any given text chunk we need to decide if it is in Hungarian, if it has character encoding problems, or if diacritics need to be reconstructed.
Automatic correction of non-standard text
We have two different solutions for restoring missing diacritics. One is an SMT based approach (Novák & Siklósi 2015), while the other is a rule based system utilizing a morphological analyser (Endrédy-Novák 2015).
Misspellings are also automatically corrected with an SMT based spelling checker (Siklósi, Novák & Prószéky 2015), where the original text containing typical misspellings is treated as the source language, and the corrected text is the target language. This approach allows us to correct spelling mistakes that would not have been possible using traditional methods, which operate at the word level. Using SMT to 'translate' from the original to the corrected version of the text makes it possible to correct mistakes that involve more than one word, (e.g. compounds mistakenly written as two words, false positives that can be disambiguated using context.) Once spelling mistakes have been corrected and diacritics have been restored, the corpus is ready for further processing. If necessary, the modules can also be used concurrently with the parser thanks to its parallel architecture.
Before performing normalization, we can estimate the overall quality of the input using our text quality assessment tool, originally developed to evaluate the output of SMT (Yang. Laki & Prószéky 2015). Our parser expects its input to be more or less well-formed, not the kind of noisy data that is often present in web based corpora. In the event of non-standard input text being fed to the parser, it has to decide whether to try and cope with it, or if it should perform automatic normalization first. This decision is made by the text quality assessment tool, which is the first module to process the input, and the resulting quality score is used to determine the necessity of automatic normalisation.
NP chunking using automatically corrected tagging and finding sentence patterns
We made an inventory of possible internal NP structures using the InfoRádió corpus (Ligeti-Nagy 2014). In order to be able to find NPs with a high degree of confidence, we had to make the morphological annotation more accurate (Prószéky & Kis 1999). Minimal NPs do not contain participial phrases, coordination and possessives, while maximal NPs are made up of minimal NPs using these three operations. We examined syntactic patterns in text corpora (Endrédy & Novák 2013). We obtained skeletal sentence structures by collapsing maximal NPs into single nodes. Now that we have a database of real-life sentence structures, parsing sentences becomes a lot easier, since we can match the string of words on the input with "pre-existing" sentence structures.
The parser needs to fill in the empty spaces in the skeletal sentence structures with actual data from the input to arrive at the final representation.
Multilevel n-gram models
We created a language model reminiscent of factored LMs, which contains statistics of frequent word sequences taking into account several features (in our initial prototype these features were lemmas and POS), in the form of a multilevel n-gram model.
The n-grams, which in many cases are much longer than 3-grams, allow us to recognize frequently occurring constructions (in preparation: Indig & Laki). Patterns obtained using this method are often dependent on some of the features only. For example, the idiom 'esik szó valamiről' ('sth is mentioned') is represented by the following 'mixed' pattern: es-* szó *-SUB, where the first element is a verb stem (which may appear in any tense), the second element is a surface form, and the last element is any word with a particular case ending.
It is unnecessary to store all sequences matching this general pattern, as is usual in traditional n-gram models. Since only frequent contiguous sequences are taken into account, which are specified to the necessary degree, we can use much longer n-grams to build our LM than standard solutions usually do, and we may use these without the risk of combinatorial explosion. This property separates our language model sharply from traditional factored LMs.
Predictive POS disambiguation
It follows from our basic principles that we cannot use traditional POS disambiguation methods that use all available information in the sentence to determine the POS of a given word. The ANAGRAMMA parser cannot refer to elements to the right of the currently processed sentence position, because those have not been read. Therefore, we use an n-gram model of the word's left context , which assigns probabilities to possible POS tags solely on the basis of the words preceding it. This module is a modified version of PurePOS, tailored to the needs of the ANAGRAMMA project (Orosz 2014).
Creating a database of verb complementation patterns
Initially we created the rules by hand, but using the statistical modules available to us, we made the process more efficient by utilizing existing databases. While creating our parser, we relied on the comprehensive rule system developed for the MetaMorpho project (Prószéky & Tihanyi 2002), containing detailed descriptions of more than 34,000 Hungarian verbal complementation patterns. The mechanics of identifying argument structure are the following. A supply thread adds basic syntactic and semantic features to possible arguments using its lexicon of more than 118,000 words and multi-word expressions. Finite verbs appearing in the input stream cause relevant verbal complementation patterns to be fetched from the database and demand threads are created to look for suitable arguments. More than one VP pattern may find its demands satisfied, at the end of the sentence all VP patterns with unsatisfied argument requirements are discarded. Using HuWordNet (Prószéky & Miháltz 2008) we may further classify arguments according to the hypernyms of their constituent words, also allowing us to refine semantic restrictions a verb arguments, resulting in VP patterns such as the following:
- Iszik (drink) ACC folyadék (liquid)
- Kigombol (unbutton) ACC ruha (clothing)
- Olvas (read) ACC könyv (book)
- Ül (sit) SUP ülőbútor (sitting furniture)
- Vádol (charge) INS bűncselekmény (crime)
- Megold (solve) ACC nehézség (problem)
The basic architecture of our parser is derived from the GIT version control system, which is being used in more and more fields quite different from its original purpose, for example as a database management system. The basic philosophy of GIT and our parser show a number of similarities. It is linear in nature, focuses on storing changes only in a very memory-efficient way, a commit operation being comparable to one of our 'clock signals'. Changes committed are easily restored, different states can be traversed back and forth when backtracking becomes necessary. GIT supports branching, so at any point during analysis we can easily create parallel alternative representations of the input, so the parser is not forced to make decisions immediately when more than one structure seems probable. These points should suffice to demonstrate that GIT is adaptable to the needs of our parser. GIT stores data very efficiently, but is optimized for files and folders which have a hierarchical structure, more alike to trees created by phrase structure grammars than dependency graphs. Because of the way our parser represents data, we had to depart from the original operation of GIT, but it is applicable to any kind of DAG, so it is able to handle dependency trees as well. The important feature for our purposes is that GIT stores new objects only once and then keeps track of changes going forward in time in such a way that any previous state of the system is easily accessible starting from certain special entry points, without modifying the system significantly.
The goal of our research is developing a psycholinguistically motivated, performance-based, parallel parser. While reviewing the literature relevant to our endeavour, we found that currently there are no parsers capable of simulating the aspects of human language processing that we considered important. This makes it impossible to compare our results to those of similar systems. The only algorithmic approach known to us that aimed at full text analysis and parallel operation was Word Expert Parser (Small 1983). In WEP, individual words were represented as separate modules of then popular expert systems. The interaction between these small program modules was similar to the cooperation of threads in ANAGRAMMA. However, WEP mostly operated on semantic or conceptual information, consequently morphological and syntactic analysis was entirely missing from it. We hope to have given some insight into the inner workings of a completely novel approach to computerized linguistic analysis, ANAGRAMMA.