Learning Language

Written by David Gulajan | Feb 23, 2021 2:50:05 PM

Are you looking for more clarity on the ever-evolving field of Natural Language Processing? Good news! Avineon is launching a series of Natural Language Processing (NLP) blog posts to share our successes with Machine Learning and Deep Learning.

Over the coming weeks, this series will cover a range of NLP topics.

These topics are geared towards those with little to no prior knowledge in the field, but they will aim to engage readers of all knowledge levels. Whether you know only a few terms or have built your own model, we look forward to embarking on this series with you!

Language is Learned

Now that we have discussed the value and ubiquity of unstructured data in last week’s post, I would like to dive into the nuts and bolts of Natural Language Processing (NLP). Most of the time, the architecture for NLP solutions is “language agnostic”: it does not care which language is used, as long as the general structure is similar. It is this general underlying structure that NLP solutions interpret and learn.

Avineon’s work surrounding NLP started in 2018 and was centered around the question, “How can we help our clients engage with their text?” This sparked development of a question answering system that allows a user to pose questions against large text corpora and receive instantaneous responses rooted in the text. This would need to go beyond a simple search function in a document. How do you get a computer to understand what you mean by your question, even if you do not fully know how to ask it?
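To make the idea concrete, here is a minimal sketch of what an extractive question answering system looks like from the outside. It uses the open-source Hugging Face transformers library rather than Avineon’s own system, and the example text is invented purely for illustration.

```python
# Minimal sketch of extractive question answering using the open-source
# Hugging Face "transformers" library. This is NOT Avineon's system; the
# example text below is invented purely for illustration.
from transformers import pipeline

# Loads a general-purpose, pretrained question-answering model (downloaded on first use).
qa = pipeline("question-answering")

context = (
    "Avineon began its Natural Language Processing work in 2018, building a "
    "system that lets users ask questions of large text corpora and receive "
    "answers rooted in the text."
)

# The model returns the span of the context most likely to answer the question.
result = qa(question="When did the NLP work begin?", context=context)
print(result["answer"], result["score"])
```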


The Role of Word Embeddings

One answer to this question is word embeddings, a foundation of how computers process language. Word embeddings are numerical representations of the meanings of words. The idea is easiest to understand with words that have a noticeably clear distinction between them. An illustrative example is the difference between “language” and “song.” The obvious linguistic distinction is how the words are communicated, but there are other underlying differences in their usage too, rooted in historical context and social norms. A computer needs to capture all those differences and, more importantly, be able to compute with them mathematically.
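As a toy illustration of what “numerical representations” means, the sketch below invents three-number vectors for a few words and measures how similar two words are with basic vector math. The numbers are made up for this example; real embeddings are learned from large amounts of text.

```python
# Toy illustration of word embeddings: each word becomes a vector of numbers.
# The values below are invented for this example -- real embeddings are learned
# from large volumes of text and typically have hundreds of dimensions.
import numpy as np

embeddings = {
    "language": np.array([0.9, 0.8, 0.1]),
    "song":     np.array([0.7, 0.2, 0.9]),
    "melody":   np.array([0.1, 0.1, 0.8]),
}

def cosine_similarity(a, b):
    """Similarity between two word vectors: closer to 1.0 means more alike."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "language" and "song" share some meaning but differ along other dimensions.
print(cosine_similarity(embeddings["language"], embeddings["song"]))
print(cosine_similarity(embeddings["song"], embeddings["melody"]))
```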

For example:

Language - Syntax + Melody ≈ Song

Somehow, that math equation can make intuitive sense to us. It is this type of equation that needs to exist between all the words in a language for computers to be able to process it. That is the role of word embeddings, and they accomplish this goal by acting as a certain number of “sliders” for the wide range of differences that make up words.


Linear Algebra

These sliders are a representation of the linear algebra that is occurring under the hood. Each slider represents a dimension that the word can move across. So when we take the sliders for “Language,” subtract the sliders for “Syntax,” and add the sliders for “Melody,” we find that the result is close to what we would expect for “song.”

In practice, instead of three sliders for a given word, there are hundreds, and we let the computer figure out what the categories are. Once the computer has made sure that all the words in the language fit these mathematical relationships well enough, we can start building a model that processes all of these word embeddings.
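For readers who want to try this on real embeddings, the sketch below uses the open-source gensim library and one publicly available set of pretrained GloVe vectors. The model name is just one example, downloading it requires an internet connection, and the exact words returned depend on the embeddings used, so “song” is hoped for rather than guaranteed.

```python
# Word-vector arithmetic on pretrained embeddings using the open-source gensim
# library. "glove-wiki-gigaword-100" is one publicly available embedding set
# (downloaded on first use); results vary with the embeddings chosen.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # roughly 100 "sliders" per word

# language - syntax + melody -> ideally something close to "song"
print(vectors.most_similar(positive=["language", "melody"], negative=["syntax"], topn=5))
```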

Most of that topic is outside the scope of this blog, but it can be understood using the slider metaphor. The machine learning model takes the sliders for every word in the text input and passes them through a blender of complex mathematical equations that are fine-tuned on large volumes of sample text. The output is determined by the task at hand and the data set provided.
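As a hedged sketch of what that “blender” can look like in code, here is a minimal text classifier written with the open-source Keras API: an embedding layer turns each word into its sliders, the rest of the network mixes them together, and training on labeled sample text fine-tunes all of those numbers. The vocabulary size, document length, layer sizes, and number of categories are arbitrary placeholders.

```python
# Minimal sketch of a model that consumes word embeddings, using the open-source
# Keras API (bundled with TensorFlow). All sizes below are arbitrary placeholders.
import tensorflow as tf

VOCAB_SIZE = 10_000  # number of distinct words the model knows
EMBED_DIM = 128      # number of "sliders" per word

model = tf.keras.Sequential([
    tf.keras.Input(shape=(200,), dtype="int32"),       # a document as 200 word IDs (padded)
    tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),  # turn each word ID into its sliders
    tf.keras.layers.GlobalAveragePooling1D(),          # blend all the words' sliders together
    tf.keras.layers.Dense(64, activation="relu"),      # the "complex mathematical equations"
    tf.keras.layers.Dense(3, activation="softmax"),    # one score per output category
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```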


Natural Language Processing

Avineon’s NLP solutions have utilized this methodology to solve problems ranging from categorizing statements of work into NAICS codes by having the machine read the documents, to providing insightful answers rooted in text to challenging, domain-specific questions. No matter the challenge, it all begins by breaking down the complex problem into smaller, bite-sized pieces, which we will discuss further in next week’s blog on Topic Modeling.
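For a flavor of the first kind of problem, the sketch below shows one simple way to train a document categorizer with the open-source scikit-learn library. It uses a bag-of-words representation rather than embeddings to keep the example short, it is not Avineon’s actual pipeline, and the documents and NAICS codes are invented placeholders.

```python
# One simple way to categorize documents (such as statements of work) into codes,
# using the open-source scikit-learn library. NOT Avineon's pipeline; the example
# documents and NAICS codes below are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

documents = [
    "Provide GIS mapping and spatial data conversion services.",
    "Develop custom software for enterprise asset management.",
    "Survey and map utility infrastructure in the field.",
    "Build and maintain enterprise web applications.",
]
labels = ["541370", "541511", "541370", "541511"]  # placeholder NAICS codes

# Turn each document into numbers, then fit a simple classifier on top of them.
classifier = make_pipeline(TfidfVectorizer(), LogisticRegression())
classifier.fit(documents, labels)

print(classifier.predict(["Design a web portal for managing utility assets."]))
```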