Introduction to survey data cleaning and why it matters
Messy data hides its own value until it is cleaned. For survey creators and researchers to get valid data and the results they intend, survey data cleaning is needed to filter out the errors and inconsistencies in responses. Survey data cleaning is the process of correcting or removing errors in survey responses to ensure the consistency, validity and reliability of the data. It is usually carried out on raw data before the main analysis.
Why does it matter? It matters for the following reasons:
- It ensures data accuracy and validity.
- It helps businesses and companies make better decisions, which tends to improve their services.
- It helps prevent biased analysis and conclusions.
- It also helps researchers, survey creators and businesses save costs.
What messy survey data looks like in real projects
Many factors contribute to messy survey data, including inconsistent formatting, duplicate entries and missing values. Here is what messy survey data looks like in real projects:
- Duplicate responses: Respondents submit the same response repeatedly.
- Inconsistent responses: Answers within one response that contradict each other, e.g. an age of 50 in one question and a birth year of 1970 in another, when the two do not agree for the survey year.
- Missing data: Some respondents abandon surveys halfway, which leaves missing data. Sensitive questions such as “When did you last test for an STI?” lead to high nonresponse rates and blank columns.
- Inconsistent format: The same variable is recorded in different formats, e.g. 10/04/1980, 10th April, 1980 or 10-04-1980.
- Speeding through surveys: A survey meant to take 10 minutes that is completed in 2 minutes often contains random answers.
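The format and speeding problems above are easy to check in code. The following is a minimal Python sketch, not a definitive implementation. It assumes the dates are day-first, as in the examples above, and the function names, the format list and the 30% speed threshold are all hypothetical choices made for illustration.

```python
import re
from datetime import date, datetime

# Assumed day-first formats, matching the examples above; extend for your own data.
KNOWN_FORMATS = ("%d/%m/%Y", "%d-%m-%Y", "%d %B, %Y", "%d %B %Y")

def normalize_date(raw):
    """Parse a date written in any known format; return None if none match."""
    # Drop ordinal suffixes so "10th April, 1980" becomes "10 April, 1980".
    cleaned = re.sub(r"(\d+)(st|nd|rd|th)\b", r"\1", raw.strip())
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None

def is_speeder(duration_seconds, expected_seconds, threshold=0.3):
    """Flag responses finished in less than `threshold` of the expected time."""
    return duration_seconds < expected_seconds * threshold

raw_dates = ["10/04/1980", "10th April, 1980", "10-04-1980"]
parsed = [normalize_date(d) for d in raw_dates]
print(all(d == date(1980, 4, 10) for d in parsed))  # True
print(is_speeder(120, 600))  # True: 2 minutes against an expected 10
```

Values that return None are better reviewed by hand than guessed, since a date like 04/10/1980 is ambiguous without knowing the respondent's convention.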
Common sources of errors in survey responses
- Response error: This occurs when respondents give an inaccurate or socially desirable answer, which undermines the validity of the research data.
- Nonresponse error: This arises when participants do not respond to a survey, which can be caused by busy schedules, poor internet or a survey that is not mobile friendly.
- Sampling error: This is the difference between a result obtained from a sample and the true value for the whole population. It arises because only part of the population is surveyed.
- Measurement error: When the survey instruments adopted (e.g. questionnaires, scales) do not measure what the study needs, the data is rendered invalid.
- Coverage error: This occurs when the group that can be sampled does not cover the population that is needed, e.g. surveying the working class in urban areas when the whole working class in the country is needed. The data then misses part of the demographic required.
- Interviewer bias: This occurs when an interviewer’s gestures, tone or body language influence the respondent’s answers.
- Data error: This occurs during data collation and entry and distorts the dataset, which in turn leads to inaccurate analysis.
- Question wording: Leading or loaded questions distort data, and ambiguous or unclear wording also contributes to this error.
Understanding duplicate responses and how they happen
Duplicate responses happen for various reasons. They can originate with researchers, survey creators or respondents, and they can frustrate researchers who are collating data. It is essential to know what they are and how they happen. What then are duplicate responses?
Duplicate responses are responses that appear twice or more in a survey dataset. They can stem from repeated submissions, slightly different resubmissions, or respondents using different devices or identities to attempt the same survey. They can be accidental or intentional, especially when the survey offers incentives.
How do they happen? They can occur in any of the following ways:
- Survey design or technical issues: Poor survey navigation (e.g. an unclear submit button) and unstable internet connections cause respondents to submit twice.
- Incentive rewards: Respondents submit multiple entries to collect the incentive attached to the survey.
- User behaviour: Respondents forget whether they have already submitted, or return to continue a survey they previously stopped, creating a second entry.
- Shared devices: Several respondents using the same device can appear as a single respondent, especially when they do not realise that the email address field at the top of the survey should be changed for each person.
- Accidental submission: Submitting an entry twice or refreshing the page creates duplicate entries.
How to identify and fix incomplete responses
Incomplete responses do not need to be discarded outright because they still hold the information respondents did provide. To identify incomplete responses, take the following steps:
- Filter the responses: Use the finished status to filter out incomplete responses. Survey platforms such as Qualtrics let you filter on this status and focus on the incomplete ones.
- Analyze the progress: Check the point at which respondents dropped off using the recorded progress, and use completion times to identify speeders who rushed through their responses.
- Identify drop-off points: Look for the questions where most respondents dropped off, which can point to confusing or unclear wording.
- Same-answer responses: Check for respondents who give the same answer throughout the survey, which indicates a lack of interest or attention.
- Use metadata: Use IP addresses, timestamps and device IDs to identify duplicate and bot-generated responses.
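Several of the identification steps above can be sketched in a few lines of Python. This is an illustration only: the `finished` flag, the question ids and the sample rows are hypothetical, and a real export from a survey platform will use its own column names.

```python
from collections import Counter

QUESTIONS = ["q1", "q2", "q3", "q4"]  # hypothetical question ids, in survey order

def is_blank(value):
    return value is None or value == ""

def first_unanswered(response):
    """Return the first question left blank, or None if all were answered."""
    for q in QUESTIONS:
        if is_blank(response.get(q)):
            return q
    return None

def is_straightliner(response):
    """True when a fully answered response gives the same answer to every question."""
    answers = [response.get(q) for q in QUESTIONS]
    return not any(is_blank(a) for a in answers) and len(set(answers)) == 1

responses = [
    {"id": "r1", "finished": True, "q1": 4, "q2": 2, "q3": 5, "q4": 3},
    {"id": "r2", "finished": False, "q1": 3, "q2": 4, "q3": None, "q4": None},
    {"id": "r3", "finished": True, "q1": 3, "q2": 3, "q3": 3, "q4": 3},
    {"id": "r4", "finished": False, "q1": 5, "q2": 1, "q3": None, "q4": None},
]

incomplete = [r for r in responses if not r["finished"]]
drop_offs = Counter(first_unanswered(r) for r in incomplete)
straightliners = [r["id"] for r in responses if is_straightliner(r)]

print([r["id"] for r in incomplete])  # ['r2', 'r4']
print(drop_offs.most_common(1))       # [('q3', 2)]
print(straightliners)                 # ['r3']
```

A straightlined response is only a warning sign. On a short survey, identical answers can be genuine, so it is safer to flag these rows for review than to delete them automatically.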
To fix incomplete responses, take the following steps:
- Identify partial responses: To fix a problem, you first have to identify it. The first step in fixing incomplete responses is to separate them from the rest of the data.
- Analyze drop-off points: If most respondents drop off at a particular question, that question is likely the problem, and it is best to fix it for future surveys.
- Impute missing data: Fill missing answers with a value calculated from other respondents’ answers, such as the mean or median.
- Send a follow-up reminder: Email a personalised follow-up to respondents who started the survey, reminding and encouraging them to complete it.
- Re-evaluate your data usage: If respondents answered the important and critical questions while missing a few others, you can keep their responses in case they are needed.
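The imputation step above can be sketched as follows. This is a minimal example that assumes median imputation on a numeric rating. The `satisfaction` field and the `_imputed` flag are hypothetical names, and the median is only one of several reasonable choices.

```python
from statistics import median

def impute_median(responses, field):
    """Return copies of the responses with missing `field` values set to the median."""
    observed = [r[field] for r in responses if r.get(field) is not None]
    if not observed:
        return [dict(r) for r in responses]  # nothing to impute from
    fill = median(observed)
    result = []
    for r in responses:
        missing = r.get(field) is None
        row = dict(r)  # copy, so the raw data stays untouched
        if missing:
            row[field] = fill
        row[field + "_imputed"] = missing  # record which values were filled in
        result.append(row)
    return result

responses = [
    {"id": "r1", "satisfaction": 4},
    {"id": "r2", "satisfaction": None},
    {"id": "r3", "satisfaction": 2},
    {"id": "r4", "satisfaction": 5},
]
cleaned = impute_median(responses, "satisfaction")
print(cleaned[1])  # {'id': 'r2', 'satisfaction': 4, 'satisfaction_imputed': True}
```

Keeping a flag for imputed values means the analysis can later be rerun with and without them to see how much the filled-in answers influence the result.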
Handling inconsistent and invalid answers in surveys
Inconsistent and invalid answers can greatly affect the validity and credibility of a survey, and if they are not handled properly they skew the data. They can be handled through the following steps:
- Define the problem: Be clear about what each term means. An inconsistent response is one whose answers contradict each other, while an invalid response is one whose answer does not make logical sense.
- Identify the problem: Apply those definitions to the data to find the affected responses. Telling the two apart makes it possible to treat each one properly.
- Handle the data: Review each response carefully and delete responses from participants who did not engage fully in the survey. For missing responses, fill in a value calculated from other respondents’ answers.
- Delete clearly bad responses: Remove impossible entries and bot-generated responses.
- Prevent the problem: Use clear, simple terms and avoid loaded or leading questions, which influence respondents’ answers. Use skip logic so respondents only answer questions that apply to them, and pilot the survey with a small group to catch technical issues and improve the user experience.
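The difference between an inconsistent and an invalid answer can be expressed as two separate checks. The sketch below is illustrative and rests on assumptions: the survey year is known, the age limits of 18 and 100 are hypothetical, and a one-year tolerance allows for respondents whose birthday has not yet passed.

```python
def is_valid_age(age, minimum=18, maximum=100):
    """Invalid answer check: the value must fall in a plausible range."""
    return minimum <= age <= maximum

def age_is_consistent(age, birth_year, survey_year, tolerance=1):
    """Inconsistent answer check: stated age must agree with the birth year."""
    return abs((survey_year - birth_year) - age) <= tolerance

print(is_valid_age(250))                    # False: impossible entry
print(age_is_consistent(50, 1970, 2020))    # True: 2020 - 1970 = 50
print(age_is_consistent(50, 1990, 2020))    # False: birth year implies 30
```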
Best ways to detect duplicate survey entries
Detecting duplicate entries starts with knowing what counts as a duplicate. They can be detected in the following ways:
- Unique identifiers: Look at emails, addresses, contact details and unique IDs. When one identifier has multiple entries, they are likely duplicates.
- IP addresses: Multiple entries from one IP address suggest duplicate responses, although several people can share one IP address, especially in workplace settings.
- Timestamps: Compare the timestamps of similar entries. Entries submitted very close together are usually caused by double submission or a page refresh.
- Answer patterns: Compare the answer patterns across entries. Identical patterns suggest duplication, which is most common in anonymous surveys.
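The identifier and answer-pattern checks above both come down to grouping entries by a key and looking for groups with more than one member. Here is a minimal Python sketch with hypothetical field names (`id`, `email`, `answers`) and made-up sample data.

```python
from collections import defaultdict

def find_duplicate_groups(responses, key):
    """Group response ids by `key(response)`; keep groups with more than one entry."""
    groups = defaultdict(list)
    for r in responses:
        groups[key(r)].append(r["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

def by_email(r):
    # Normalise case and stray spaces so trivial variations still match.
    return r["email"].strip().lower()

def by_answers(r):
    # A tuple of answers acts as a fingerprint for anonymous surveys.
    return tuple(r["answers"])

responses = [
    {"id": 1, "email": "Ada@example.com", "answers": [1, 2, 3]},
    {"id": 2, "email": "ada@example.com ", "answers": [1, 2, 3]},
    {"id": 3, "email": "bo@example.com", "answers": [3, 3, 1]},
]

print(find_duplicate_groups(responses, by_email))    # {'ada@example.com': [1, 2]}
print(find_duplicate_groups(responses, by_answers))  # {(1, 2, 3): [1, 2]}
```

The same function works for IP addresses or device IDs by passing a different key. A match on answers alone is weaker evidence on short surveys, where two people can honestly give the same answers.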
Techniques for removing duplicates without losing valuable data
Removing duplicates is not just about erasing inconsistent or wrong rows and columns. It is about vetting your data so that you keep the best version of it. The following techniques help remove duplicates without the survey data losing its value:
- Define the duplicates: Set the criteria you will use to decide what counts as a duplicate in your data, e.g. same answers, same email address.
- Unique identifiers: Assign a unique identifier (UID) to each respondent so that no two people share the same one, and use it to track each participant across the dataset. Platforms such as Qualtrics and SurveyMonkey can store this directly.
- Keep the best entry: Do not simply keep the first entry. Vet each response and select the best one, e.g. the most complete or the most accurate.
- Merge duplicate responses: This method is underused by most researchers but is very useful. Duplicate entries that each hold valuable information should be merged.
- Use timestamp logic: Compare the timestamps of similar entries. If they are within a few minutes of each other, the duplicate is likely accidental. If they are hours apart, the later entry could be an update to the first.
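The keep-the-best, merge and timestamp techniques above can be sketched in Python as follows. The entry layout, the tie-breaking rule and the five-minute window are assumptions made for this example, and each should be adapted to the study.

```python
from datetime import datetime, timedelta

def answered_count(entry):
    return sum(1 for v in entry["answers"].values() if v not in (None, ""))

def keep_most_complete(entries):
    """Pick the entry with the most answers; ties go to the latest submission."""
    return max(entries, key=lambda e: (answered_count(e), e["submitted_at"]))

def merge_entries(entries):
    """Combine entries in submission order; later non-empty answers win."""
    merged = {}
    for entry in sorted(entries, key=lambda e: e["submitted_at"]):
        for question, value in entry["answers"].items():
            if value not in (None, ""):
                merged[question] = value
            else:
                merged.setdefault(question, value)  # keep any earlier answer
    return merged

def classify_gap(first, second, accidental_window=timedelta(minutes=5)):
    """Label a pair of duplicates by the time between submissions."""
    gap = abs(second["submitted_at"] - first["submitted_at"])
    return "likely accidental" if gap <= accidental_window else "possible update"

a = {"submitted_at": datetime(2024, 1, 1, 10, 0),
     "answers": {"q1": 4, "q2": None, "q3": "yes"}}
b = {"submitted_at": datetime(2024, 1, 1, 10, 2),
     "answers": {"q1": 4, "q2": 3, "q3": None}}

print(merge_entries([a, b]))  # {'q1': 4, 'q2': 3, 'q3': 'yes'}
print(classify_gap(a, b))     # likely accidental
```

Merging keeps the answer to q3 that would be lost by keeping only the later entry, which is the main argument for merging over deleting.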
Dealing with outliers and suspicious responses
Outliers and suspicious responses can both skew data, and if they are not handled carefully they affect the overall results. Below are ways to deal with them:
- Not all outliers are bad: Learn to identify responses that seem unusual but are valid when evaluated within the study context and target population.
- Suspicious responses: Look for duplicated responses, the same answer selected across all questions, and surveys slated for 10 minutes completed in only a couple of minutes.
- Invalid responses: Where a respondent has multiple entries, remove the ones that contain invalid responses.
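This article does not prescribe a particular method for finding outliers. One common rule of thumb is to flag values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to made-up ages and only flags values, leaving the decision to keep or remove them to a reviewer.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)  # first quartile, median, third quartile
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

ages = [22, 25, 27, 30, 31, 33, 35, 38, 41, 250]  # hypothetical responses
print(iqr_outliers(ages))  # [250]
```

An age of 250 is impossible and can be removed, but a flagged value of, say, 85 in a survey of working adults may be unusual and still valid, so each flag needs to be read against the study context.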
Survey data cleaning tools and software to consider
Cleaning a physical space goes best with the right tools, and the same is true of surveys. When survey data gets messy and mixed up, good tools and software help produce valid and accurate data. Consider the following:
- Microsoft Excel: It is a spreadsheet application used to remove duplicates, split data and prepare datasets for analysis.
- Google Sheets: It is a cloud-based spreadsheet used to organize, sort, filter and analyse large datasets.
- OpenRefine: It is a free open source desktop application designed for cleaning, transforming and exploring large datasets.
- Tableau Prep: It is a visual data preparation tool that allows users to clean, make adjustments and combine data before visualisation.
- Displayr: It is a cloud-based, AI-powered data analysis and reporting platform that helps detect missing data and bad responses.
- SurveyMonkey: It is a widely used online survey platform designed to create, distribute and analyse survey forms and polls to gather feedback and data. It also provides tools for data analysis.
- QuestionPro: It is a cloud-based survey and research platform that is widely used for market research. It helps detect fraud and clean data.
Best practices for maintaining clean survey data over time
Maintaining clean survey data is not child’s play, but it helps serve respondents better and increases the validity and credibility of the data. Clean survey data can be maintained through the following steps:
- Design your survey: Design your survey carefully and write in clear, simple terms. Include skip logic so that respondents only see questions that apply to them.
- Standardize response formats: Use options like multiple choice, drop-downs and fixed formats instead of open text. This keeps responses consistent and reduces the need for heavy cleaning.
- Monitor data: Regularly review your responses as they come in to identify errors and missing data early. This helps minimise low-quality data in the analysis.
- Set a cleaning protocol: Set clear rules for identifying outliers and other irregularities, and apply the same rules across all data to keep the analysis valid and reliable.
- Document your process: Record every step you take so that it can be applied to future projects.
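A cleaning protocol and its documentation can live in the same place if the rules are written as code. The sketch below is one possible arrangement, with hypothetical rule names, field names and a 180-second minimum duration. Every response passes through the same named rules, and a log records how many rows each rule removed.

```python
def apply_protocol(responses, rules):
    """Apply every rule to every response; return kept rows and a removal log."""
    kept = []
    log = {name: 0 for name, _ in rules}
    for r in responses:
        # Name of the first rule this response fails, or None if it passes all.
        failed = next((name for name, check in rules if not check(r)), None)
        if failed is None:
            kept.append(r)
        else:
            log[failed] += 1
    return kept, log

# Hypothetical protocol: rule names and thresholds are illustrative only.
RULES = [
    ("finished", lambda r: r["finished"]),
    ("not_speeder", lambda r: r["duration_seconds"] >= 180),
]

responses = [
    {"id": "r1", "finished": True, "duration_seconds": 540},
    {"id": "r2", "finished": False, "duration_seconds": 300},
    {"id": "r3", "finished": True, "duration_seconds": 95},
]

kept, log = apply_protocol(responses, RULES)
print([r["id"] for r in kept])  # ['r1']
print(log)                      # {'finished': 1, 'not_speeder': 1}
```

Saving the rule list and the log alongside the dataset gives future projects both the protocol and a record of its effect.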
Conclusion
Survey data cleaning is not a one-day job. It is a process that takes patience and deliberate filtering of errors without discarding important data. To clean survey data effectively, study the different kinds of errors that usually surface in surveys. Once this is done, it becomes easier to carry out the cleaning without compromising the validity and reliability of the overall data. Clean surveys produce clean data.
