Cleansing unstructured data

Megaputer Intelligence
4 min read · Feb 25, 2019

A brief introduction to working with text data

Data cleansing is critical in data analysis. The quality of data cleansing has a direct impact on the accuracy of the derived models and conclusions. In practice, data cleaning typically accounts for between 50% and 80% of the analysis process.

Traditional data cleansing methods are mainly used to process structured data, including completing missing values, fixing format and content errors, and removing unwanted records. Resources on these methods are widely available. For example, big data engineer Kin Lim Lee published an article that introduces eight commonly used Python snippets for data cleaning, each written as a function that can be applied directly without changing parameters. For each snippet, Lee explains its purpose and provides comments in the code. You can bookmark that article and use it as a toolbox.

However, the amount of unstructured (textual) data in the world has grown exponentially in recent years. Today, more than 80% of data is unstructured. It has now become important to have high-precision text data cleansing capabilities built into your analysis platform. In fact, this should be a fundamental requirement when assessing text analysis software.

In this article we will discuss three main steps of text data cleansing: spell check, abbreviation expansion, and identification of abnormalities. Along the way, we will introduce a few advanced natural language processing techniques that improve the precision of the data cleansing process.

Spell Check:

The spell check process includes finding slang terms, splitting merged words, and correcting spelling errors. Social media is full of slang, and these words should be converted to standard ones when working with free text: “luv” or “looooveee” becomes “love”, and “Helo” becomes “Hello”. Text also sometimes contains words that are merged together, such as RainyDay and PlayingInTheCold. Spelling errors can also result from simple transposition, as in “sepll cehck”. One difficulty in spell checking is finding semantic errors, where a word is spelled correctly but does not fit the context. For example, in “there are some parts of the word where even now people cannot write”, “world” was mistakenly typed as “word”. Without context, this type of error is hard to find.
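
To make this concrete, here is a minimal Python sketch of the simpler cases: mapping slang, collapsing repeated letters, and correcting typos against a dictionary. The tiny lexicon and slang map are assumptions made for the example; a real spell checker would load a full lexicon for the domain and use context-aware models to catch semantic errors like the “word”/“world” case above.

```python
import difflib
import re

# Illustrative lexicon and slang map (assumptions for this sketch); a real
# spell checker would load a full dictionary for the target language/domain.
LEXICON = {"love", "hello", "spell", "check", "world"}
SLANG = {"luv": "love"}

def normalize_word(word):
    """Map slang, collapse repeated letters, then fuzzy-match against the lexicon."""
    w = word.lower()
    if w in SLANG:
        return SLANG[w]
    # "looooveee" -> "love": collapse any run of 3+ identical letters to one
    w = re.sub(r"(.)\1{2,}", r"\1", w)
    if w in LEXICON:
        return w
    # Handle simple typos and transpositions such as "Helo" or "sepll cehck"
    match = difflib.get_close_matches(w, LEXICON, n=1, cutoff=0.75)
    return match[0] if match else w

print([normalize_word(w) for w in ["luv", "looooveee", "Helo", "sepll", "cehck"]])
# -> ['love', 'love', 'hello', 'spell', 'check']
```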

Abbreviation Expansion:

Another important part of text data cleansing is abbreviation expansion. Abbreviations take many forms. Common forms omit the final or intermediate letters of a word, or take the first letter of each word in a phrase: “Rd” for “road”, “tele” for “television”, and “ASAP” for “as soon as possible”. Some abbreviations simply reflect everyday usage habits, like “tsp” for “teaspoon”. Others are borrowed from another language: “lb” is the Latin abbreviation of “libra pondo”, which we commonly use in English to abbreviate “pound”. One problem with abbreviation expansion is that the same abbreviation may have different expansions depending on the topic being discussed. For example, “ASA” can stand for “American Society of Anesthesiologists”, “Acoustical Society of America”, “American Standards Association”, or “acetylsalicylic acid” (a.k.a. Aspirin).
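
As a simple illustration, the sketch below expands abbreviations with a lookup table and resolves ambiguous ones by comparing the surrounding words against per-expansion hint words. The tables and hint lists here are assumptions made for the example; a real system would build them from domain dictionaries or corpora.

```python
# Minimal dictionary-based abbreviation expansion with naive context
# disambiguation. The entries below are illustrative assumptions.
EXPANSIONS = {
    "rd": ["road"],
    "tsp": ["teaspoon"],
    "asap": ["as soon as possible"],
    "asa": [
        "American Society of Anesthesiologists",
        "Acoustical Society of America",
        "American Standards Association",
        "acetylsalicylic acid",
    ],
}

# Context keywords used to choose among competing expansions (assumed lists).
CONTEXT_HINTS = {
    "American Society of Anesthesiologists": {"surgery", "anesthesia", "patient"},
    "Acoustical Society of America": {"sound", "acoustics", "noise"},
    "American Standards Association": {"standard", "specification"},
    "acetylsalicylic acid": {"dose", "tablet", "aspirin", "mg"},
}

def expand(abbrev, context_words):
    """Return the expansion whose hint words overlap the context the most."""
    candidates = EXPANSIONS.get(abbrev.lower())
    if not candidates:
        return abbrev  # unknown abbreviation: leave unchanged
    if len(candidates) == 1:
        return candidates[0]
    context = {w.lower() for w in context_words}
    return max(candidates, key=lambda c: len(CONTEXT_HINTS.get(c, set()) & context))

print(expand("ASA", "the patient received a 100 mg ASA tablet daily".split()))
# -> 'acetylsalicylic acid'
```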

If an abbreviation cannot be deciphered from the text or expanded with a dictionary, a machine learning approach must be used to learn patterns of abbreviation/acronym formation from the local context. For example, when dealing with English medical abbreviations, a customized abbreviation expansion model can be trained on the English-language PubMed database, and the same algorithm can be trained to work with abbreviations from any other thematic area.
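
Training such a model is beyond the scope of this article, but the core idea of inferring an expansion from the local context can be illustrated with a much simpler pattern-based heuristic (in the spirit of the Schwartz-Hearst algorithm): when an acronym appears in parentheses, check whether the initials of the preceding words spell it out. The stop-word list below is an assumption made for the sketch.

```python
import re

STOP_WORDS = {"of", "the", "and", "for"}  # assumed skip list for this sketch

def find_acronym_definitions(text):
    """Collect 'Long Form (ABBR)' pairs whose word initials spell the acronym."""
    pairs = {}
    for m in re.finditer(r"\(([A-Z]{2,})\)", text):
        acronym = m.group(1)
        words = re.findall(r"\w+", text[:m.start()])
        i, j = len(words) - 1, len(acronym) - 1
        # Walk backwards from the parenthesis, matching acronym letters to
        # word initials and skipping short function words such as "of".
        while i >= 0 and j >= 0:
            if words[i][0].upper() == acronym[j]:
                j -= 1
            elif words[i].lower() not in STOP_WORDS:
                break
            i -= 1
        if j < 0:  # every acronym letter was matched
            pairs[acronym] = " ".join(words[i + 1:])
    return pairs

sentence = "The American Society of Anesthesiologists (ASA) issued new guidance."
print(find_acronym_definitions(sentence))
# -> {'ASA': 'American Society of Anesthesiologists'}
```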

Identify Abnormalities:

Abnormal values may be caused by system errors or by human input errors. A common way to detect them is to define a normal range and flag records that fall outside it. For example, if records show employees younger than 18 years old, or with an annual income over 5 million dollars, we can label them as suspicious. In transactional data, if we find transactions that were made before the customer account was created, we can label them as abnormal.
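
These checks are straightforward to express as rules over the data. Below is a minimal pandas sketch; the column names, thresholds, and sample records are assumptions made for illustration.

```python
import pandas as pd

# Illustrative employee records (assumed data).
employees = pd.DataFrame({
    "employee_id": [1, 2, 3],
    "age": [34, 16, 45],
    "annual_income": [72_000, 58_000, 6_500_000],
})

# Range check: flag records outside the defined normal range.
suspicious = employees[(employees["age"] < 18) | (employees["annual_income"] > 5_000_000)]
print(suspicious)

# Illustrative transaction records (assumed data).
transactions = pd.DataFrame({
    "account_created": pd.to_datetime(["2018-03-01", "2018-06-15"]),
    "transaction_date": pd.to_datetime(["2018-02-20", "2018-07-01"]),
})

# Consistency check: a transaction cannot precede the creation of the account.
abnormal = transactions[transactions["transaction_date"] < transactions["account_created"]]
print(abnormal)
```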

Conclusion:

Data cleaning is indeed a key part of data analysis. Adding text content increases the difficulty of data mining, but the gain is obvious: it brings much more information. Proper handling of text data will help us obtain more insightful and enriched conclusions from our data analysis.

Originally published at https://www.megaputer.com on February 25, 2019.
