NL1 Exam

Name: Alexander Weichart Matrikelnummer: 2170749

1

1.1

Structural ambiguity: “The bank robbers escaped in a car.” It is not clear whether the bank robbers used the car to escape or whether they were inside a car while escaping.

Semantic ambiguity: “The bank robbers are all either white or black.” It is not clear whether “white” and “black” are being used as colour adjectives or as racial categories.

1.2

The regular expression matches 1-1 but not 1+1. That is because the second group, ([0-1\-\(\)]*), states that any of the characters 0, 1, -, ( and ) will be matched until none are left to be matched; a plus sign is not part of this character class. The first group, (\(?\+?[0-1]*\)?)?, only matches a + sign at the beginning of the string, not later on. That is because this group is enclosed in () brackets, which impose an order on their contents, unlike [] brackets, which match any single one of the characters listed inside them.
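As a quick check, here is a minimal Python sketch. The full pattern is reconstructed from the two groups quoted above (an assumption, since only the groups are given), and re.fullmatch is used so the whole input has to be consumed.

```python
import re

# Reconstructed pattern (assumption): the two groups quoted above, concatenated.
pattern = re.compile(r"(\(?\+?[0-1]*\)?)?([0-1\-\(\)]*)")

for candidate in ["1-1", "1+1", "+1-1"]:
    # fullmatch requires the whole string to be covered by the pattern
    print(candidate, bool(pattern.fullmatch(candidate)))
# "1-1" and "+1-1" match; "1+1" does not, because "+" is only allowed
# (optionally) at the start of the first group.
```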

1.3

One problem might be that an apostrophe (') can be used to indicate contractions like “I’m” or possessives like “Bob’s”. Therefore, if the tokenizer does not realize that this is the case and confuses it with, for example, the start of a quotation, it will split tokens that belong together into separate ones. The result will be that tokens like “m” or “s” appear in the output.
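A small sketch of this effect (both patterns are made up for the example and are not a real tokenizer):

```python
import re

text = "I'm sure Bob's alarm is set."

# Naive tokenizer: treats the apostrophe as a delimiter, so contractions
# and possessives are torn apart.
print(re.findall(r"[A-Za-z]+", text))
# ['I', 'm', 'sure', 'Bob', 's', 'alarm', 'is', 'set']

# Slightly smarter tokenizer: keeps an apostrophe that sits between letters.
print(re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text))
# ["I'm", 'sure', "Bob's", 'alarm', 'is', 'set']
```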

1.4

  1. Compared to sparse vectors, they are easier to use in machine learning because they have far fewer dimensions, thus requiring less storage and computing time.
  2. They also generalize better, since they carry more information per dimension and are more resistant to overfitting.
  3. Dense vectors are relatively easy to train, and many tools exist for doing so; one popular example is word2vec (see the sketch below).
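As a rough illustration of point 3, a minimal word2vec training sketch using gensim (parameter names follow gensim 4.x; the tiny corpus is made up for the example):

```python
from gensim.models import Word2Vec

# Toy corpus, purely for illustration.
corpus = [
    ["book", "a", "dentist's", "appointment", "for", "tomorrow"],
    ["restart", "the", "alarm"],
    ["stop", "the", "alarm"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,  # dimensionality of the dense vectors
    window=2,        # context window size
    min_count=1,     # keep every word of the toy corpus
    epochs=50,
)

print(model.wv["alarm"].shape)         # (50,) -> a dense vector
print(model.wv.most_similar("alarm"))  # nearest neighbours in vector space
```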

2

2.1

S → VP NP
S → CMD
VP → V NP
VP → V NP PREP ADV
NP → DET N
NP → DET ADJ N
DET → a, the
PREP → for
ADV → tomorrow
ADJ → dentist’s
N → alarm, appointment
V → book, stop
V → CMD
CMD → restart

2.2

A context-free grammar is right-linear if each production rule’s right-hand side contains at most one non-terminal symbol, which must also be the rightmost symbol. Right-linear grammars generate regular languages, which form a subset of the context-free languages in the Chomsky hierarchy. The question is therefore whether my context-free grammar can also be represented as a regular grammar. In this case it can, because no recursion occurs in the grammar, so no stack-like storage is required to parse it.

2.3

  1. Check S
    1. Check first rule S → VP NP
      1. Check VP
        1. Check first rule VP → V NP
          1. Check V
            1. Check first rule V → book, stop for input “restart the alarm” - failed
            2. Check second rule V → CMD
              1. Check CMD → restart
                • success match “restart” as CMD
              • success match “restart” as V
            • success match “restart” as V
          2. Check NP
            1. Check first rule NP → DET N
              1. Check DET → a, the for input “the alarm”
                • success match “the” as DET
              2. Check N → alarm, appointment for input “alarm”
                • success match “alarm” as N
              • success match “the alarm” as NP
            • success match “the alarm” as NP
          • success match “restart the alarm” as VP
        • success match “restart the alarm” as VP
      • success match “restart the alarm” as VP
    • success match “restart the alarm” as S
  • success match “restart the alarm” as S

→ parse success
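To make the procedure concrete, here is a minimal backtracking recursive-descent recogniser for the grammar from 2.1, written as a sketch: the rules are encoded as data, terminals are lower-cased tokens, and it is demonstrated on the VP constituent that the trace above derives for “restart the alarm”.

```python
# Grammar from 2.1; symbols that are not keys are treated as terminals.
GRAMMAR = {
    "S":    [["VP", "NP"], ["CMD"]],
    "VP":   [["V", "NP"], ["V", "NP", "PREP", "ADV"]],
    "NP":   [["DET", "N"], ["DET", "ADJ", "N"]],
    "DET":  [["a"], ["the"]],
    "PREP": [["for"]],
    "ADV":  [["tomorrow"]],
    "ADJ":  [["dentist's"]],
    "N":    [["alarm"], ["appointment"]],
    "V":    [["book"], ["stop"], ["CMD"]],
    "CMD":  [["restart"]],
}

def derive(symbol, tokens, pos):
    """Yield every position reachable after deriving `symbol` from tokens[pos:]."""
    if symbol not in GRAMMAR:                        # terminal: must match the next token
        if pos < len(tokens) and tokens[pos] == symbol:
            yield pos + 1
        return
    for rule in GRAMMAR[symbol]:                     # try each alternative in order
        positions = [pos]
        for part in rule:
            positions = [q for p in positions for q in derive(part, tokens, p)]
        yield from positions

def recognises(symbol, sentence):
    tokens = sentence.lower().split()
    return len(tokens) in derive(symbol, tokens, 0)

print(recognises("VP", "restart the alarm"))                          # True
print(recognises("VP", "book a dentist's appointment for tomorrow"))  # True
```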

3

3.1

First of all, we need to specify which type of n-grams to use. For this example I will use bi-grams and tri-grams: bi-grams look at pairs of words, while tri-grams look at word triples. A minimal sketch of both predictors follows the two lists.

Example bi-gram:

  1. Training
    1. Calculate the likelihood of each word as its relative frequency: P(w_i) = word_occurrences / total_num_words
    2. Calculate the probabilities of all bi-grams in the text as P(w_i+1 | w_i)
  2. Predicting
    1. Take the last word that has been typed in, in this case “WILL”
    2. Check all bi-gram probabilities P(w | “will”) and find the w with the highest likelihood
    3. Suggest that w as the next word

Example tri-gram:

  1. Training
    1. Calculate the likelihood of each word as its relative frequency: P(w_i) = word_occurrences / total_num_words
    2. Calculate the probabilities of all tri-grams in the text as P(w_i+2 | w_i, w_i+1)
  2. Predicting
    1. Take the last two words that have been typed in, in this case “WE WILL”
    2. Check all tri-gram probabilities P(w | “we”, “will”) and find the w with the highest likelihood
    3. Suggest that w as the next word
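A minimal sketch of both predictors (the toy training text is made up for illustration):

```python
from collections import Counter, defaultdict

def train_ngrams(tokens, n):
    """Count n-gram continuations: (n-1)-word context -> Counter of next words."""
    continuations = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        continuations[context][tokens[i + n - 1]] += 1
    return continuations

def predict_next(continuations, context):
    """Return the word w maximising P(w | context) as a relative frequency."""
    counts = continuations.get(tuple(context))
    if not counts:
        return None
    total = sum(counts.values())
    word, count = counts.most_common(1)[0]
    return word, count / total

# Toy training text, purely for illustration.
tokens = "we will meet tomorrow and we will book an appointment".split()

bigrams = train_ngrams(tokens, 2)
trigrams = train_ngrams(tokens, 3)

print(predict_next(bigrams, ["will"]))         # e.g. ('meet', 0.5)
print(predict_next(trigrams, ["we", "will"]))  # e.g. ('meet', 0.5)
```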

3.2

Preprocessing is the process of manipulating raw input so that it can be used for NLP by machines. Techniques include, for example, tokenization, sentence splitting and text normalization. Preprocessing improves a system’s performance massively. For the example above, tokenization is absolutely crucial; otherwise you would not be able to generate n-grams at all. The tokenization also has to be done well, because otherwise tokens will be mistakenly split up, thus worsening the predictions. Text normalization, for example lemmatization/stemming, is also important in this case. Oftentimes the same word appears in different forms, e.g. “university”, “universities”. If such a word does not get normalized by reducing it to its stem, the n-grams will not represent the true frequency of that word.
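A small preprocessing sketch along these lines (the suffix stripping is only a crude stand-in for a real stemmer or lemmatizer such as NLTK’s PorterStemmer):

```python
import re

def preprocess(text):
    """Lower-case, tokenise and crudely normalise a raw string."""
    tokens = re.findall(r"[a-z']+", text.lower())  # keep apostrophes inside tokens
    normalised = []
    for tok in tokens:
        if tok.endswith("ies"):
            tok = tok[:-3] + "y"                   # universities -> university
        elif tok.endswith("s") and not tok.endswith("ss"):
            tok = tok[:-1]                         # alarms -> alarm
        normalised.append(tok)
    return normalised

print(preprocess("The universities restarted their alarms."))
# ['the', 'university', 'restarted', 'their', 'alarm']
```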

3.3

Intrinsic evaluation: I would define word suggestions for specific contexts, which are counted as valid (a gold standard). The system is then tested with these contexts and it is evaluated how many of the gold-standard results are being predicted. One can then create a confusion matrix, measure accuracy/precision/recall and evaluate the system (sketched below).
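A minimal sketch of such an intrinsic evaluation, with made-up gold-standard contexts and a dummy predictor; the precision/recall definitions here assume the system may abstain (return None):

```python
# Gold standard: context -> expected next word (made up for illustration).
gold = {
    ("we", "will"): "meet",
    ("the",): "alarm",
    ("book", "a"): "appointment",
}

def evaluate(predict, gold):
    suggested = correct = 0
    for context, expected in gold.items():
        prediction = predict(context)
        if prediction is not None:
            suggested += 1
            if prediction == expected:
                correct += 1
    accuracy = correct / len(gold)                          # correct over all test contexts
    precision = correct / suggested if suggested else 0.0   # correct over suggestions made
    recall = correct / len(gold)                            # correct over all gold items
    return accuracy, precision, recall

# Dummy predictor that always proposes "alarm": 1 of 3 contexts is correct.
print(evaluate(lambda context: "alarm", gold))  # approx. (0.33, 0.33, 0.33)
```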

Extrinsic evaluation: I would let users test the system in an application and let them share their impressions with me, either via qualitative data collection like interviews or surveys, or via quantitative measures like typing speed. One could ask users, for example, how “human-like” the predictions were or how usable the system was.