Comparing Machine Translation to Native Language Analysis

Megaputer Intelligence · Published in Geek Culture · 6 min read · May 10, 2019

Which one to choose? (Image by Megaputer)

As our world becomes increasingly global, so does our data. Being an analyst today often means working with text data that contains multiple languages. So what do you do?

Essentially, there are two options we may consider: machine translation or native language analysis.

  1. With machine translation, we actually create a new dataset where the text has all been translated into a single language before we do the analysis. This makes the subsequent analysis much easier, as we only need to use a single language grammar module for the analysis.
  2. Native language analysis means that we keep documents in their original languages and perform a separate analysis for each language with the corresponding grammar module.

To demonstrate how these options work, let’s imagine we are working with a dataset that has records (or rows) of textual responses that are either in English or Spanish. Regardless of whether we want to use machine translation or to process individual documents in their native languages, we first need to split the dataset up by language.

Translation services like Google and Microsoft are fully capable of identifying the language of texts on a record-by-record basis, but they will charge you a fee even if you ask them to translate English to English. Therefore, it is best to do the split beforehand and send them only the texts that really need translation. Some of you may even go so far as to split individual texts into multiple records when that text contains multiple languages. This will minimize translation costs.
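To make the splitting step concrete, here is a minimal sketch of a pre-translation split in Python. It assumes a pandas DataFrame with a “response” column and uses the open-source langdetect package for per-record language identification; the package, column names, and sample texts are illustrative choices, not something any particular translation service requires.

```python
# A minimal sketch of splitting a dataset by language before translation.
# Assumptions: pandas for the tabular data and the open-source langdetect
# package for per-record language identification (both illustrative choices).
import pandas as pd
from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # make langdetect's results reproducible

def detect_language(text: str) -> str:
    """Return an ISO 639-1 code such as 'en' or 'es', or 'unknown'."""
    try:
        return detect(text)
    except LangDetectException:
        return "unknown"  # empty or non-linguistic text

df = pd.DataFrame({
    "response": [
        "The product stopped working after two days.",
        "El envío llegó tarde y la caja estaba dañada.",
    ]
})

df["lang"] = df["response"].apply(detect_language)

# Send only the non-English records to the translation service, so we
# are not charged for "translating" English into English.
needs_translation = df[df["lang"] != "en"]
already_english = df[df["lang"] == "en"]
```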

Once we’ve separated our data based on the language, it’s time to decide what approach to use: Machine Translation (MT) or Native Language Analysis (NLA). Let’s review some pros and cons of each of these approaches.

Machine Translation

Machine translation is relatively cheap, simple, and easy to maintain compared to native language analysis. But as with all things, these advantages come with a tradeoff. In this case, the tradeoff is an additional source of error.

The accuracy of machine translation is still relatively low. Even with manual translation, some things like sarcasm and figures of speech may not translate well. For example, if you translated “break a leg” into another language, the meaning of “I wish you good luck” would likely be lost. Additionally, different languages may be more or less difficult to translate to or from. Going from Spanish to English will likely result in a reasonable translation, but most translation services tested by our analysts at Megaputer performed relatively poorly when working with languages like Japanese, Chinese, and Korean. Anyone looking to use machine translation will need to run some tests to see whether the accuracy is at a level that meets the output goals.

Here is an example of poor translation from Google Translate. As you may know, Japanese is a highly contextual language, which makes machine translation difficult.

Original Text: 一生懸命指でまぶたを広げて目薬を差しました。
Google Translate: I spread my eyelids with my fingers and put on my eyes.
Manual Translation: With great effort I held his eyelids open with my fingers and dropped in the eye medicine.

Native Language Analysis

Native language analysis generally produces much more accurate results. This is, of course, dependent on the analyst. But there will also be costs for hiring additional analysts to cover different languages. And those analysts will be more expensive, since they need not only the skills of an analyst but also the skills of a polyglot linguist.

An analysis in several languages is also more difficult to maintain. When models and algorithms need to be adjusted, the work is multiplied by the number of languages being worked with. Additionally, when consumers of the results review the analysis, they may not be able to independently read some of the records or understand the supporting information.

Other Options and Common Questions

There is actually a third option, which is the most expensive choice. You can use native language analysis for the model building and analysis to ensure high accuracy of the results, and then also use machine translation so that end users reading the report can get a general idea of what each record says. However, it may not be entirely clear to them why a record was processed as it was, since the analysis was done on the untranslated text.

As for which option you should choose, there is no way to know except to do a small-scale test analysis. Try some MT analysis, and if the accuracy is acceptable, then that might be the better choice for you. How much accuracy are you willing to sacrifice for cost? The answer to that will vary from company to company and task to task.

Another common question is, “Which translation service should I use?” Here is a suggestion on how to decide when your goals involve a categorization task. Suppose you have a dataset of customer complaints and you want to categorize each text based on what complaints were made, allowing you to get the count for each complaint type. It is recommended that you do a small-scale analysis for each MT service you are considering. Make sure that the analyst-driven portion of the NLA is rock solid, then compare the results of each MT service to your NLA results. To make this comparison, treat the NLA as 100% correct and calculate the precision, recall, and F score of the post-MT analysis. This means counting how many categorizations were made incorrectly, and how many categorizations the post-MT analysis failed to make that it should have made.

For Example

  • Your NLA made 100 categorizations.
  • You use translation service “A”.
  • When you run your algorithm on the machine translated data from “A”, it makes 70 categorizations, 10 of which were not made by the NLA (and therefore are incorrect), and 60 of which are identical to the NLA results.

Precision is the number of correct results divided by the number of all returned results. In this case, we had 60 correct results out of 70 returned categorizations.

P = 60/70
P = .857

Recall is the number of correct results divided by the number of results that should have been returned. In this case, we had 60 correct results found out of the possible 100 correct results.

R = 60/100
R = .6

F score is the harmonic mean of recall and precision. The harmonic mean is the preferred method for averaging ratios, and the F score is a good measure of how “correct” the categorization results are.

F = (2 × Precision × Recall) / (Precision + Recall)
F = (2 × .857 × .6) / (.857 + .6)
F = .706
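If you script this comparison, the bookkeeping is just set arithmetic. Below is a small Python sketch that reproduces the numbers from the example above; the (record, category) pairs are hypothetical stand-ins for whatever identifiers your own analysis produces.

```python
# A small sketch of the service comparison, treating the NLA
# categorizations as ground truth. Each categorization is represented
# as a (record_id, category) pair; the counts mirror the example above.

def evaluate_mt_against_nla(nla_cats: set, mt_cats: set) -> dict:
    """Precision, recall, and F score of post-MT categorizations
    relative to the native-language-analysis baseline."""
    correct = len(mt_cats & nla_cats)          # made by both
    precision = correct / len(mt_cats)         # correct / all returned
    recall = correct / len(nla_cats)           # correct / all expected
    f_score = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f_score": f_score}

# Hypothetical data shaped to match the worked example:
# 100 NLA categorizations; service "A" yields 70, of which 60 overlap.
nla = {(i, "complaint") for i in range(100)}
service_a = ({(i, "complaint") for i in range(60)}
             | {(i, "wrong_category") for i in range(10)})

print(evaluate_mt_against_nla(nla, service_a))
# {'precision': 0.857..., 'recall': 0.6, 'f_score': 0.705...}
```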

Then you try translation service “B” and perform a similar calculation of precision, recall, and F score for the corresponding categorization results. Let’s assume the F score on the machine-translated data from “B” is .75. Since service “B” facilitated a higher F score (.75 > .706), we conclude that service “B” provides a more accurate machine translation than service “A”. So, all other things being equal, you should go with service “B”.

All that being said, there is at least one other factor to consider. If your data is highly sensitive, remember that services like Microsoft and Google have rules about keeping a sample of your data, which they can use for improving their algorithms. SDL, on the other hand, does not keep any of your data. Unfortunately for those with highly sensitive data, Microsoft and Google appear to have put the additional data they receive to good use: in the tests Megaputer staff have run, these services tend to outperform SDL in terms of accuracy.

Originally published at https://www.megaputer.com on May 10, 2019.
