This blog is an informal introduction to the ideas of language modeling and its machine-learning aspects.
Imagine this: I give you a sentence, ‘I saw a beautiful girl from the window eating an ice cream.’ Read the sentence again and answer this —
Who is eating the ice cream?
It’s obvious, right? It’s the girl. Too simple? Let’s up the difficulty… Take this sentence, ‘I saw a beautiful girl looking at me through the window’. Now, who is looking at whom through the window? Confused? Expected. This is not uncommon. In science, we refer to these as ambiguous sentences. Take one more sentence, for example —
The chicken is ready to eat.
Got it? Looking closely, one can infer (at least) 2 meanings from this simple sentence. One refers to a cooked chicken that is ready to be eaten, and the other to a chicken that is ready to be fed.
You might ask: why am I asking this at all? Does it really matter? I mean, the sentences we use in real life are trivial, and however ambiguous they might be, the ambiguity does not seem to bother us. And I’ll give it to you, you’re right. For the human mind, it does not really matter, because we always have more information from our surroundings and experiences, which helps us disambiguate such scenarios. For example — take the previous sentence, ‘The chicken is ready to eat.’ If you say it at the dinner table, it invokes a different meaning than if you say it in a henhouse.
Ok, so what am I getting at? It might seem very easy for us humans to pass these sentences off as not-so-ambiguous. But when you look at them through the lens of science, it can be very difficult to pin down the cause of the ambiguity. What exactly in the sentence ‘The chicken is ready to eat’ makes it ambiguous? Go on, think.
Do you have an answer? I don’t think so. But this is an important problem to solve, for much of our day-to-day speech consists of ambiguous sentences. Some examples are —
- The man saw the boy with the telescope.
- They are hunting dogs.
- Visiting relatives can be a nuisance.
- I saw her duck.
Therefore, if chatbots and AI machines are going to take over your day-to-day menial jobs, they have to understand this language. And since we are the ones designing them, we gotta do our homework first.
It’s perhaps why scientists had to think of another way to tackle this problem of ambiguity, especially in computer applications. After all, the computer never knows whether you are at the dinner table or in a henhouse. The solution is to somehow give the computer more information about the surroundings, so that it can disambiguate a bit better. But is it that simple?
Experience precedes language
It turns out that when we humans talk about any experience, we tend to transmit only a fraction of the experience via words, while the rest of it is just a guess on the listener’s part. That’s why different people react differently to the same sentence. This idea that words are never enough to describe human experience is quite debatable, and I’ll not go deep into it here. Hard to follow? Consider 2 sentences —
- The bank is closed today.
- The bank is eroding away.
Pay attention to the word bank. In the first sentence, it means the financial institution, while in the second sentence, it means a riverbank. How do you know this? Even though the words are the same, how do you know the meaning or sense is different? If you really think about it, the answer is simple — The surrounding words do the job of disambiguating the different senses.
It seems that words in any sentence have an intimate relationship with each other, which, in a synergistic way, gives rise to the meaning of the entire sentence. An exercise for the reader: take any sentence and try to find the function of each word in it. It may be adding an object or an action, or it may be there just for continuity of meaning. But surely, every word is there for a reason. Using this idea, scientists have tried to formulate some innate ideas about how words combine to give rise to a meaningful sentence. You can even think of words as having fixed valencies, so that only a few other words can combine with them in a meaningful way. Ideas like this came up and culminated in something quite remarkable — that words never have any meaning of their own but assume one when used in the company of others. Read this statement again and try to capture the essence of it. It might seem like a very simple thing, but in reality, this is probably why ChatGPT exists.
A great simplification here was that scientists no longer had to worry about representing meaning per se; the surrounding words could represent it for them. Another way of saying this is that rather than trying to find an accurate measure of meaning, it’s better to just represent it using the surrounding words that give rise to that meaning. Read it again if it is still not clear.
From this idea of representing words by their surroundings was born the great world of embeddings. You might’ve heard the word embedding before and perhaps wondered what it really means. In our context, it’s simple. It’s just a quick-fix way to represent words/meanings using other words/meanings. With this, you never need to know the absolute definition of a word’s meaning, only relative definitions in terms of other words/meanings.
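To make this quick fix a bit more concrete, here is a minimal sketch in plain Python (the toy corpus and the window size are my own choices, purely for illustration): each word gets described by nothing more than counts of the words that show up around it.

```python
from collections import Counter, defaultdict

# Toy corpus -- invented for illustration only
corpus = [
    "the bank is closed today",
    "the bank approved my loan",
    "the bank is eroding away",
    "the river bank is muddy",
]

window = 2  # how many neighbours on each side count as "surrounding"
cooc = defaultdict(Counter)

for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                cooc[w][words[j]] += 1

# 'bank' is now defined entirely by the company it keeps: neighbours like
# 'closed'/'loan' vs 'eroding'/'river' pull it toward different senses
print(cooc["bank"])
```

Real embeddings (word2vec, GloVe, and friends) are learned dense vectors rather than raw counts, but the underlying bet is the same: the neighbours are the meaning.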
This brings us to our next topic of discussion (and perhaps, the most important :)) — Attention.
Attention
Read this sentence once, and try to answer the following questions —
The football match ended late at night. It was a good match.
- What does ‘It’ here refer to?
- Which words, if I remove them, do not really alter the meaning of the sentence?
- Which words did you actually pay ‘attention’ to before you got the meaning of the sentence?
Think about these questions seriously; you should be able to answer them easily. But what I want is not the correct answers but for you to realize that the language we speak is terribly unorganized.
‘It’ here refers to the football match. How do you know this? Is there a rule that always maps ‘It’ to ‘the football match’, the way we map symbols in other sciences? Why are we writing some seemingly useless words (the, a, late) that only lengthen the sentence but contribute next to nothing to its meaning? It might seem that our vocabulary is kinda flawed. (Perhaps Newspeak would be better :))
I think it is clear by now what I was trying to say… that we humans never really pay attention to all the words in a sentence, but only to some, and from there we use them to form the meaning of the sentence. (Technically, this type of eye movement from keyword to keyword is called a saccade.)
The question is, how do you find those words, and, more importantly, how do you teach a computer to do it?
Think about this — how do you represent a word in mathematical form? There are multiple ways to do it, but we’ll stick to what’s important here: the embedding (discussed above). As I said, an embedding is just a representation of a word with respect to some other words. Therefore, when you place that embedding vector among a set of other embedding vectors, certain regions activate and others are suppressed. Hence, you get a more pronounced, specific meaning from the sentence. This idea of activating certain regions is an important one and can be very helpful in disambiguating the meanings/senses of sentences. If we could define this idea mathematically, a lot of our work would be done.
An attention mechanism essentially does this. It takes a sentence (or several) as input. Each word comes with an embedding that defines it, and when the words are combined in the sentence, the embedding vectors act together and give out certain values that help generate the meaning of the sentence.
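To give a flavour of what that looks like mathematically, here is a minimal sketch of scaled dot-product self-attention over a handful of toy embedding vectors. The embeddings are random and the separate query/key/value projections that real models learn are deliberately skipped, so treat this as a cartoon of the idea rather than the actual recipe.

```python
import numpy as np

def self_attention(X):
    """X holds one embedding vector per word (shape: n_words x dim).

    Every word scores every other word, the scores become weights via a
    softmax, and each word is re-expressed as a weighted mix of all words.
    """
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # pairwise relatedness
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ X                               # context-aware word vectors

# 5 words, 8-dimensional toy embeddings (random here; learned in practice)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
print(self_attention(X).shape)  # (5, 8): same words, now coloured by their neighbours
```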
This sounds complex, I know, and I’ll not go into the mathematical details of how things work there. That is not what matters here. What I believe to be important is the idea, and an appreciation of the fact, that our English language is not only biased but also far from universal. A great example of the inconsistencies in language comes from the domain of Machine Translation, and that is perhaps the best way I can explain the idea of Attention.
Machine Translation
Here’s a sentence — ‘Today is a good day’. Now translate it into German — ‘Heute ist ein guter Tag’. It’s simple, right? Now a simple exercise for you — can you map every word in the English sentence to its German counterpart?
Easy, right? And you might think this is how all Machine Translation is done, i.e., create a dictionary of all English-German word pairs and store them.
But what about this sentence: ‘We are playing soccer today’? Its German counterpart is — Wir spielen heute Fußball. Do you see something odd here?
Needless to say, the word order has changed. In the English sentence, ‘today’ comes after ‘soccer’. But in the German one, it’s the reverse. Why did that happen? Perhaps a better question is, how did that happen?
There is no definite answer to this question. Asking ‘how did language evolve?’ is almost synonymous with asking ‘how did humans evolve?’ — there is no definite, singular answer.
But the question is, how do you model this? After all, you cannot say that just because you don’t understand the generative process underlying the creation of a language, you cannot model it. Isaac Newton modeled most of the classical dynamics we see today, but needless to say, he himself never understood what exactly gravity consists of, or how the planets brought themselves to their current places. Therefore, no matter what research you do, at the end of the day the science we look at rests on fundamental questions that are still unsolved. We assume that mathematics, our system of numbers, is good enough to describe natural phenomena. Is that always true? Is nature obliged to follow our system of representing its workings? We assume it is — or else we cannot do any science at all. Perhaps that is why Max Planck, the German theoretical physicist, once said, in effect, ‘To enter the temple of science, you need faith.’
Now, back to the question. A quick fix for these kinds of problems, one that gained prominence during the late 17th century, is the idea of Empiricism, upon which the whole world of ML/AI stands. Empiricism states, simply: ‘You don’t need to know the generative process behind a phenomenon. Just conduct some experiments and learn the patterns from them.’ Sounds a lot like training a neural net XD.
And this is what is going to help us in our quest to understand language and, more importantly, word interrelationships and word order.
As always, an attention-based model takes in a huge number of sentences to train itself. What it learns is the strength of the relationship between any 2 words in a sentence. For example, if I feed in the sentences —
I love football. It’s a good game.
It will return a matrix that tells me how strongly each word in the sentence relates to every other word. Look at the figure below for a better visualization.
[Figure: word-to-word attention matrix. The numbers are there for representation purposes only.]
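If you are curious where a matrix like the one in the figure comes from, here is a tiny, purely illustrative version for our example sentence. The embeddings are random, so the actual numbers are meaningless; the point is only the shape of the thing — one row and one column per word, with each row summing to 1.

```python
import numpy as np

words = "I love football It's a good game".split()
rng = np.random.default_rng(1)
E = rng.normal(size=(len(words), 8))            # one toy embedding per word

scores = E @ E.T / np.sqrt(E.shape[-1])         # pairwise relatedness scores
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1

# Print the matrix with word labels, in the spirit of the figure above
print(" " * 10 + "".join(f"{w:>10}" for w in words))
for w, row in zip(words, weights):
    print(f"{w:>10}" + "".join(f"{v:10.2f}" for v in row))
```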
And the same goes for machine translation. When I feed in a corpus of sentence pairs from 2 different languages, the model learns the relationships between the words of both languages. As for the word-order conundrum: the model infers the next word by looking at the previous words it has produced. Perhaps that’s a bit confusing, but what I mean to say is — translation is a sequential task, i.e., the model never really outputs the final sentence all at once. It is always one word after another, and before releasing any output word, it checks against the previous output so that the sentence stays coherent. Therefore, if I train the model on German-English sentence pairs, it will understand 2 things —
- That today translates to heute in German.
- That heute comes before Fußball in German (the reverse of the English order).
The English side of the sentence pairs is only needed to check off the first point. The second point is taken care of by the German sentences in the training data themselves: as long as the German corpus consistently puts heute before Fußball, we are fine.
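And to make the ‘one word after another’ part concrete, here is a purely illustrative sketch of greedy decoding. The `next_word_probabilities` function is a made-up stand-in for a trained translation model (not any real library’s API): given the source sentence and the words produced so far, it scores candidate next words.

```python
def translate(source_words, next_word_probabilities, max_len=20):
    """Generate a translation one word at a time (greedy decoding)."""
    output = ["<start>"]
    while len(output) < max_len:
        probs = next_word_probabilities(source_words, output)
        best = max(probs, key=probs.get)   # pick the likeliest next word
        if best == "<end>":
            break
        output.append(best)                # the next step sees everything emitted so far
    return output[1:]


# A toy stand-in that just walks through a canned German sentence, so the loop
# has something to call; a real model would score its whole vocabulary each step.
canned = ["Wir", "spielen", "heute", "Fußball", "<end>"]

def toy_model(source_words, generated_so_far):
    return {canned[len(generated_so_far) - 1]: 1.0}

print(translate("we are playing soccer today".split(), toy_model))
# -> ['Wir', 'spielen', 'heute', 'Fußball']
```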
Hopefully, it is somewhat clear? If not, read it again, and if it still doesn’t click, well, I tried :).
Now you might be tempted to ask, ‘Ok, this all makes sense, but does it work?’ The answer is simple — ChatGPT. ChatGPT uses an underlying model architecture called a Transformer (perhaps I’ll write about it sometime), which basically deals with what I mentioned above, but of course in a more technical and high-sounding way (as always with the latest tech :)). The encoder in the Transformer architecture takes care of the word-relatedness part (hence, attention!!) and the decoder takes care of the sequential word-generation task. Look at the figure below to understand the 2 parts, but don’t bother reading all the text inside it (confusing af).
[Figure: Typical Transformer architecture]
But before I close off this blog, there is something else I want to discuss. This, I believe, is probably the strongest yet most latent undercurrent in today’s world of ML/AI — Empiricism.
Empiricism
Imagine this — I have an unbiased coin. I toss it in the air and ask you — what will be the outcome of the toss? Just by looking at the word unbiased, you might blurt out — ‘50% chance heads and 50% chance tails’. You gave me a very ambiguous answer, but in reality, everything was non-ambiguous. The coin was there, gravity was acting as always, the wind was blowing, etc. Everything in that moment seemed deterministic, with no room for multiple answers. Then why did the outcome become a multi-outcome (or, in other words, probabilistic) answer? This is an important point to note. When/where in the entire coin-toss experiment did the answer shift from deterministic to probabilistic? Think…
If you really think about the question, you might arrive at an answer like this — in the entire experiment, everything was controlled and recorded up until the point the coin was tossed; from there on, we stopped caring about how the coin was rotating/revolving, because it is either too fast or too cumbersome to record. Due to this ignorance, we never come to a concrete answer about what causes the coin to land heads or tails.
The fact that you ignored the generative process behind the heads/tails outcome fills you with uncertainty. So what do you do? After all, you made a bet on the coin toss, and now you want to know how much money you are going to lose/gain.
It’s problems of this sort that Empiricism was developed to solve. Where the generative process is too difficult to study, we rely on past data and the patterns hidden in it to infer the output.
In our case, we know the probabilities of heads/tails to be 0.5 because someone has done this experiment numerous times and found that half the time it’s heads and the other half tails.
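That, in miniature, is the whole empiricist move, and it fits in a few lines of Python — simulated tosses standing in for the ‘numerous experiments’, with no model of the physics anywhere in sight.

```python
import random

random.seed(7)

# "Someone has done this experiment numerous times"
tosses = [random.choice(["heads", "tails"]) for _ in range(100_000)]

# No physics, no generative model of the toss -- just count what happened
p_heads = tosses.count("heads") / len(tosses)
print(f"empirical P(heads) ~ {p_heads:.3f}")   # hovers around 0.5
```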
Neural networks and most other machine learning algorithms work in the same way. When you train a neural net to identify cancer in a patient using X-ray data, you only teach it the shallow patterns that appear in the X-ray images, never the reason those patterns appear. The same goes for financial forecasts. You might guess tomorrow’s price of AAPL correctly using a neural net, but you will never know the root reason it predicted a 52-week high.
This is the basic idea of Empiricism. Is it bad that we are ignoring the latent generative knowledge in a phenomenon and instead relying on patterns and numbers? Perhaps not, because in our everyday human life we probably never need to estimate anything with 100% certainty. As long as it involves humans, errors are inevitable :). But the downside is perhaps assuming this is the only way to do science and push the frontier of knowledge. We need to understand that we humans have in no way conquered language just because we developed some mind-blowing chatbots.
Conclusion
This was perhaps quite a lengthy discourse. Maybe it was helpful. And if you did not like it, well, tough luck :).

