Coding data for qualitative analysis

Ross Woods, 2022-24

Thematic coding is a common method of analyzing documentary data, usually transcripts of interviews or focus groups with open-ended questions. As a qualitative methodology, it gives researchers a way to interpret and analyze data.

Check the purpose of the research:

If the purpose is to understand respondents' perceptions of X, your evidence should reflect their perceptions and attitudes.
If the purpose is to understand X itself, you are looking for facts and evidence.

Thematic coding is derived from a research approach called grounded theory. In essence, this is a method of using a comprehensive set of examples to identify patterns, from which the researcher can create a theory. The theory is justified by the range of real examples.

Thematic coding is really just a systematic way of analyzing data to reach a conclusion. It has become increasingly popular in recent years, partly because it looks like a set of steps. However, the later stages are less procedural and require more thought. It is not actually a set of steps, but is more like a set of phases that can overlap. For example, you can start transcribing and analyzing data as soon as it is collected.

Thematic coding has several advantages. First, the researcher simply has to follow the method. Second, it gives a way to systematically analyze lots of data, such as when writing a longer thesis or a dissertation. Third, the researcher can use it in stages, giving an opportunity to adapt the method as needed, and perhaps hold more interviews. Fourth, it is easier if the researcher uses voice-to-text software to transcribe interviews.

It also has several disadvantages. Although it is quite flexible, it probably doesn’t allow much scope for innovation. If you do not use transcription software, it is very time-consuming to transcribe interviews by hand, or quite expensive if you have to pay someone else to do it for you.

A series of stages

Stage 1: Continue keeping a diary

You should already be keeping a written diary of your methodology, including what you did, why you did it, your methods, and your observations. Your description is essential to your accountability. In principle, it must be detailed enough to enable someone else to follow your method. (In the Grounded Theory literature, keeping the diary is often called keeping a memo or memoing.)

Write your notes in full sentences so that you can understand them even after you've forgotten the actual situation in which you wrote them. (A list of unexplained topics is not helpful.)

Add records of your reflections to your diary. You can start to informally analyze data as soon as you start collecting it. You should take note of anything relevant to your research question, for example:

Questions on what seems to be important or curious
Observations
Emerging patterns
Questions that aren’t answered yet
Outliers and anomalies
Specific “turns of phrase,” interesting quotations that best encapsulate a strand of meaning in the data
Contradictions and things that don’t seem to make sense
Suspicions that things are not quite what they appear to be

In your diary, you should also write down the reasons why you interpreted the data the way you did. The description will probably be quite simple at first, but any later changes or elaborations will be significant because they indicate a better interpretation of the data.

💡 If you are writing a dissertation ...

Follow the methodology plan in your proposal. (Consequently, you could eventually copy the relevant parts of your diary into your methodology plan, and edit it into a harmonious, flowing document in past tense. It is then the analysis chapter in your dissertation.)
Bring your diary to meetings with your dissertation supervisor and be prepared to discuss anything in it.

Stage 2: Decide whether you will use inductive or deductive themes

You can start formulating deductive (a priori) themes very early in the whole process, even before you collect any data. It is quite permissible to derive a set of themes from your literature review or your statement of the research question. However, it would be a mistake to use deductive themes exclusively because other unexpected but significant themes might emerge in the data later on.

The other alternative is to use inductive codes and themes, that is, those that emerge in the data during your analysis. This has the advantage of including those that you could not anticipate in your original plan. You can even modify your system of inductive codes and themes during data-gathering and analysis in order to get themes that better represent your data.

Stage 3: Let your system evolve

Some of your ongoing analysis might affect your data gathering. Qualitative research is often iterative, and this method allows you to improve your data collection and analysis as you progress:

If you use free informal interviews with lots of open-ended questions, you are not limited to a previously-defined list of questions. As you hold more interviews, you might find ways to improve your questions or add more questions. For example, you might want to add follow-up questions on themes that occur frequently and appear to be more significant. You might also be able to delete questions that show no promise of rich, useful data.
As you continue coding, you should be able to improve your definitions of each theme and to improve your system of themes. Don’t let that disappoint you; it means you have a better understanding of the data. If your themes are more specific and better defined, they might occur less often, but your data will be more finely grained and will probably help you reach better conclusions. You might add other themes, or combine or separate themes. If you delete themes, the reason should be that you better understand the data and can better determine whether data is relevant or irrelevant to your topic. If you abandon a topic, write down the reason for doing so.

Stage 4: When to stop collecting data

You have enough data when any more would not improve, strengthen, or add to your conclusions. This point is called “data saturation.” If you use interviews, data saturation is usually reached with fewer than 20 interviews. One of these methods will help you decide when to stop collecting data:

As a rule of thumb, you have reached data saturation when you have conducted three consecutive interviews that do not produce any new themes.* However, this depends on the quality of data. You have enough rich data when you can identify patterns in the data that are useful as conclusions, and can confirm them based on the data available. Some researchers compile quasi-statistical records of occurrences of themes, but the richness of the data is more important.
Some supervisors or institutions have rigid rules requiring a specific minimum number of interviews, often either 20 or 25. This is not helpful if you reach saturation with far fewer interviews, because any subsequent interviews are normally a waste of time.
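
The rule of thumb above can be sketched in code. This is only a minimal illustration in Python, assuming you have already listed the themes found in each interview in the order the interviews were held. The function name `saturation_point`, the `window` parameter, and the sample themes are hypothetical. It counts new themes only; it cannot judge the richness of the data, which still depends on your own reading.

```python
def saturation_point(interview_themes, window=3):
    """Return the 1-based number of the interview at which `window`
    consecutive interviews have produced no new themes, or None if
    that has not happened yet."""
    seen = set()
    run_without_new = 0
    for number, themes in enumerate(interview_themes, start=1):
        new_themes = set(themes) - seen
        seen |= set(themes)
        if new_themes:
            run_without_new = 0
        else:
            run_without_new += 1
        if run_without_new >= window:
            return number
    return None


# Hypothetical example: themes identified in each interview, in order.
interviews = [
    {"workload", "support"},
    {"workload", "isolation"},
    {"support"},
    {"isolation", "workload"},
    {"support", "isolation"},
]
print(saturation_point(interviews))  # prints 5
```

In this invented example, interviews 3, 4, and 5 add no new themes, so the rule of thumb is met at the fifth interview.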

In some kinds of research, such as ethnography, it is usually possible to keep collecting more and more data. In these cases, the criterion is your research question. In other cases, you can consider stopping when you have obtained data from everybody in your sample.

Stage 5: Transcription

You can start transcription as soon as you have collected data. Transcribe it word-for-word into documents, although you might be able to exclude anything clearly irrelevant to your research purpose. Warning: Some things that look irrelevant at first might appear more relevant later on when you understand the data better.

Most researchers prefer to use voice-to-text transcription software or external services to do transcriptions. A few, however, prefer to do it manually because it brings them very close to the data, even though it is horrendously time-consuming.

Stage 6: Familiarize yourself with the data

Start reading and re-reading all your data while it is still coming in, and make diary notes of any other questions arising. (If you transcribe manually, this will come very easily.)

When you become very familiar with your data, it might look like very little, even when you have enough. Don’t worry.

Stage 7: Select quotations

You might already have started collecting quotations, but you can now treat this as a separate stage. Using direct quotations from respondents in your final report has two particular benefits:

Your readers will see the lives of real people whom you have interviewed. This personalizes your research report and makes it easier to read.
It shortens the distance between respondents and your readers. This helps to prevent an analysis that is largely an artificial construct that you have created.

Stage 8: Coding

Coding is a way of simplifying and breaking down a large amount of real data into smaller, more useful pieces.

Codes have the following advantages:

They are directly relevant to your research question,
They are small and manageable,
They are a more theoretical form than the original data, and,
They can then be categorized and analyzed for patterns.

If possible, start coding as soon as you have transcriptions and are familiar with the texts, while you are still collecting data. Mark all parts of the text that are relevant to your research question with a color-code or symbol. These might be “recurring patterns, terms, or visual elements.” (Naeem et al. p. 2.) On each part of the text that you marked, put a brief label of a single word or a short phrase that says what is going on. These labels are your codes. Coding is itself part of analysis, because you are sorting raw data into structured meaning.

When you have finished coding, you will have a patchwork of the meanings of everything in your data that is relevant to your topic, and it will help you to develop theory. It is also simpler and briefer than the full text of raw data.
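
If you code in software or a spreadsheet rather than on paper, each marked passage can be kept as a small record. The sketch below is one possible layout in Python, not a required format. The class name `CodedSegment`, its field names, and the helper `segments_for` are hypothetical, and the quotations are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CodedSegment:
    interview_id: str   # which transcript the passage comes from
    text: str           # the passage exactly as transcribed
    code: str           # your brief label for what is going on


# Hypothetical examples; the quotations are invented for illustration.
segments = [
    CodedSegment("int01", "I never know who to ask for help.", "lack of support"),
    CodedSegment("int01", "The deadlines just pile up.", "workload"),
    CodedSegment("int02", "Nobody checks in on us.", "lack of support"),
]


def segments_for(code, coded):
    """Return every marked passage that carries the given code."""
    return [s for s in coded if s.code == code]


for s in segments_for("lack of support", segments):
    print(s.interview_id, "-", s.text)
```

Keeping the original wording next to each code makes it easy to return to the respondent's own words when you later select quotations.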

💡 It is good practice to have someone else check your coding; it will help prevent or minimize personal bias in interpreting data.
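
One simple way to see how closely a second person's coding matches yours is to calculate the share of passages to which you both gave the same code. The sketch below assumes that both coders labeled the same passages in the same order. The function name `percent_agreement` and the sample labels are hypothetical, and this measure is only one possible check, not a prescribed part of the method.

```python
def percent_agreement(codes_a, codes_b):
    """Share of passages (0.0 to 1.0) given the same code by both coders."""
    if len(codes_a) != len(codes_b):
        raise ValueError("Both coders must label the same passages.")
    if not codes_a:
        raise ValueError("No passages to compare.")
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)


# Hypothetical labels for the same five passages.
mine = ["workload", "support", "isolation", "workload", "support"]
checker = ["workload", "support", "workload", "workload", "support"]

print(percent_agreement(mine, checker))  # prints 0.8

# List the positions of passages where the two coders disagree.
disagreements = [i for i, (a, b) in enumerate(zip(mine, checker)) if a != b]
print(disagreements)  # prints [2]
```

The disagreements are usually more useful than the percentage, because discussing them is what exposes personal bias in interpretation.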

💡 The simplest way is to color-code documents by hand, usually in a word processor, but paper might be easier for some people. Although time-consuming, hand-coding is still a good option because you get a better idea of what is going on in the data as you work through the details. Otherwise, you can use software, like Zotero, which is free and online; some institutions use it as their standard method. Just check that it will do what you want for your particular research project.

⚠ Some mistakes are easy to make if you make incorrect assumptions about your respondents:

A common comment might represent many respondents, but it might not be very insightful.
Similarly, a rarely occurring comment might be very insightful.
A short comment might be valuable and a long comment might not be.
An inarticulate respondent might be very insightful.
An unintelligent or poorly educated respondent might still make a very helpful contribution.
An unexpected response or theme does not mean it is an aberration. On the contrary, it might be valuable.

Stage 9: Assign themes

Group related codes together and represent them with a theme, that is, an overarching idea that represents what is happening. Themes are a higher level of abstraction.
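
Grouping codes under themes can be recorded as a simple mapping. The sketch below is a minimal Python illustration, assuming each code belongs to exactly one theme. The codes, themes, and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from each code to the theme it belongs to.
code_to_theme = {
    "lack of support": "organizational climate",
    "no feedback": "organizational climate",
    "deadlines": "workload",
    "overtime": "workload",
}


def group_by_theme(mapping):
    """Invert a code-to-theme mapping into theme -> sorted list of codes."""
    themes = defaultdict(list)
    for code, theme in mapping.items():
        themes[theme].append(code)
    return {theme: sorted(codes) for theme, codes in themes.items()}


def unassigned(codes_in_use, mapping):
    """Codes that appear in the data but have no theme yet."""
    return sorted(set(codes_in_use) - set(mapping))


themes = group_by_theme(code_to_theme)
for theme, codes in themes.items():
    print(theme, "<-", ", ".join(codes))
print(unassigned(["deadlines", "commuting"], code_to_theme))  # prints ['commuting']
```

Listing the unassigned codes is a quick way to find data that your current system of themes does not yet represent.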

Stage 10: Check your themes

Do your themes accurately represent the theoretical ideas in your data and codes?

Stage 11: Develop a Conceptual Model

When you have created a system of themes, compare different occurrences and look for patterns in the data. By this stage, you should be able to see patterns; the sooner you spot the patterns and confirm them, the faster you make progress. You will find that you read the transcripts again and again, and become very familiar with them.

What are the relationships between codes and themes? You can use diagrams or models to represent the relationships among these concepts. (Naeem et al. p. 4.) Can you accurately define those relationships and demonstrate them from your data?
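
One possible starting point for finding relationships is to count how often two themes appear in the same interview. The sketch below is a Python illustration under that assumption. The function name `theme_cooccurrence` and the sample themes are hypothetical. A high count only suggests a relationship worth examining; you still need to define it and demonstrate it from the transcripts.

```python
from collections import Counter
from itertools import combinations


def theme_cooccurrence(interview_themes):
    """Count, for each pair of themes, the interviews containing both."""
    pairs = Counter()
    for themes in interview_themes:
        # Sort so that each pair is always stored in the same order.
        for pair in combinations(sorted(set(themes)), 2):
            pairs[pair] += 1
    return pairs


# Hypothetical themes found in four interviews.
interviews = [
    {"workload", "isolation"},
    {"workload", "isolation", "support"},
    {"support"},
    {"workload", "support"},
]

for (a, b), n in theme_cooccurrence(interviews).most_common():
    print(f"{a} + {b}: {n}")
```

The pairs with the highest counts are candidates for the links you draw in a diagram or model.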

If your questions addressed your research question and purpose, you will find an answer in the data, even if it is not the answer you expected.

A Conceptual Model: Method 1

This approach is primarily expository:

Introductory paragraph or section
Theme A

Exposition of a group of codes related to theme A, with quotations
Exposition of a group of codes related to theme A, with quotations
Exposition of a group of codes related to theme A, with quotations

Theme B

Exposition of a group of codes related to theme B, with quotations
Exposition of a group of codes related to theme B, with quotations
Exposition of a group of codes related to theme B, with quotations

Theme C

Exposition of a group of codes related to theme C, with quotations
Exposition of a group of codes related to theme C, with quotations
Exposition of a group of codes related to theme C, with quotations

And so on …
Closing paragraph or section

A Conceptual Model: Method 2

This approach is best for making sense (i.e. creating a theory) of an unusual or counterintuitive phenomenon. However, don't treat it as a rigid set of steps that will meet all your conceptualization needs:**

Think of the object of your research as a phenomenon that is not yet understood.
Think of your data as a set of examples of the phenomenon.
Ensure that you really have only one phenomenon, and not multiple different phenomena that should be kept separate. (This is the eidetic question.)
Sift your data to answer the following questions:

What events or conditions actually caused this phenomenon?
What intervening factors determine the path of events and cause variations in outcomes?
In what contexts does it occur? Describe it as a specific set of properties.
What interactions occurred?
Were there changes in the phenomenon over time?
Were there changes in the whole process over time?
What were the particular aims or purposes of the phenomenon?
Were there occurrences of the phenomenon that failed to achieve their purpose?
What are the results of the phenomenon?

Compare examples to resolve apparent contradictions.
If necessary, get more data to answer all the above questions accurately.

Questions

How many themes?

There is no rule about specific numbers of themes. The principle is that you need enough to represent the data accurately and to help you reach sound conclusions. If the number of codes hinders and confuses your analysis, you should ask whether the number of them is the cause of the difficulty.

The data saturation level indicates that between ten and twenty themes is probably enough if your interviews are well-focussed on your topic. If a smaller number of themes accurately represents both the data and the research problem, then you might not need more.

If you have a large number of themes, some will probably have very few occurrences and they will not tend to be helpful. However, a large number of themes is not a bad thing in some circumstances. First, the research might have a problem of diversity. For example, the phenomena in your research problem might have a wide variety of causes, manifestations, or symptoms. Second, a small number of occurrences (the outliers) might be significant, and you cannot presume that the bulk of data is always the best data. For example, a treatment might be quite safe for 98% of subjects, but a 2% death rate might be unacceptably high.
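
A simple tally can show which themes have very few occurrences. The sketch below is a minimal Python illustration, assuming one theme label per coded passage. The function names, the `threshold` parameter, and the sample labels are hypothetical. It flags rare themes for review rather than deleting them, because an outlier might be significant.

```python
from collections import Counter


def theme_counts(coded_themes):
    """Count how many coded passages fall under each theme."""
    return Counter(coded_themes)


def rare_themes(counts, threshold=2):
    """Themes with fewer than `threshold` occurrences, for manual review."""
    return sorted(t for t, n in counts.items() if n < threshold)


# Hypothetical theme labels, one per coded passage.
labels = ["workload", "workload", "support", "workload", "support", "safety"]

counts = theme_counts(labels)
print(counts.most_common())
print(rare_themes(counts))  # prints ['safety']
```

In this invented example, "safety" occurs only once, which is exactly the kind of theme to reread in context before deciding whether it matters.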

How can I code qualitative data from my interviews so that I work smarter, not harder?***

Organizing large amounts of data is possible with a computer, but it might not be the best way for everybody. Besides, if you make a mistake with a computer you might not notice it or might not be able to reverse it. Advice so far:

Many students use software to do coding and this works for them. If so, keep backups of your original data files. Make progressive backups every time you make major changes.
Many students like to print their data out on paper, use color coding, and perhaps even put the printouts on a wall as a huge chart.

Some people simply prefer to work with paper.
Printouts help them to visualize the entire dataset. It might also help them to feel more familiar with the data and perhaps to identify patterns in it. At the very least, seeing the whole thing at once will make you feel very satisfied.
Paper printouts might even be necessary if you still cannot identify patterns in the data after coding with software.
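
If you work with software, progressive backups can be as simple as saving a dated copy of your data file before each major change. The sketch below is one way to do it in Python using only the standard library. The function name `backup_file` and any file names you pass to it are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_file(source, backup_dir):
    """Copy `source` into `backup_dir` under a timestamped name and
    return the path of the copy. The original file is left untouched."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{source.stem}_{stamp}{source.suffix}"
    shutil.copy2(source, target)
    return target
```

For example, calling `backup_file("coded_interviews.csv", "backups")` before recoding would save a copy with the date and time in its name. Because the timestamp is accurate only to the second, two backups of the same file made within the same second would share a name and the later one would replace the earlier one.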

__________
* A mathematical proof of data saturation is unlikely because qualitative data is not appropriate for a mathematical proof.
** Ross Woods, 2020, '24, derived from Strauss and Corbin, 1990, pp. 99-107.
*** With thanks to Richard Scott Baskas, Rainee Bryant, Lynda Davis.

Muhammad Naeem, Wilson Ozuem, Kerry Howell, and Silvia Ranfagni. 2023. A Step-by-Step Process of Thematic Analysis to Develop a Conceptual Model in Qualitative Research. International Journal of Qualitative Methods 22: 1–18. DOI: 10.1177/16094069231205789

Ross Woods, 2020, '24. Toolkit of research methods.

Anselm Strauss and Juliet Corbin. 1990. Basics of Qualitative Research: Grounded Theory Procedures and Techniques (Newbury Park, Ca.: Sage Publications).

Appendix: Tom Granoff and a priori coding

Tom Granoff offered another way of looking at it. If the literature is fairly mature and you already have a quite good idea of what most of the top ten responses will be, give interviewees a checklist of all those responses for them to endorse those that are relevant to them. Then follow it with an open-ended question like, “Please comment on the responses that are most important to you.” This approach has several advantages:

The checklist is generated directly from the literature, so content validity is clearly established.
This is much easier for respondents to complete. Think about classroom testing situations: It is generally easier to “recognize an answer (multiple choices or matching)” than to “recall an answer (fill-ins or essay exam).”
When compared to strict open-ended questions, this method makes for easier coding, loading of responses into the computer, analyzing the data, and comparing the findings to the literature in the Discussion Chapter.
Most respondents will not invest the time or energy to give long thoughtful answers to open-ended questions on a survey. This results in under-reporting. However, most will check off possible answers and then add some comments once they have been given some reminders as to what to say.
It helps respondents to understand clearly what you are looking for so that they can respond correctly.
Because the Checked/Not Checked format yields a dichotomous variable, these answers can be easily correlated with any demographic variables to see who is more likely to endorse that characteristic.
It increases reporting of potentially embarrassing behaviors because the list normalizes a wide range of behavior.
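
Because each checklist item is either checked or not checked, comparing endorsement across demographic groups is straightforward. The sketch below is a minimal Python illustration that reports the share of each group who checked one item. The function name `endorsement_rates`, the field names, and the sample responses are hypothetical and invented for illustration. A formal correlation would need a statistics package and enough respondents to be meaningful.

```python
from collections import defaultdict


def endorsement_rates(responses, item, group_field):
    """For one checklist item, return the share of respondents in each
    demographic group who checked it."""
    checked = defaultdict(int)
    total = defaultdict(int)
    for r in responses:
        group = r[group_field]
        total[group] += 1
        if item in r["checked"]:
            checked[group] += 1
    return {g: checked[g] / total[g] for g in total}


# Hypothetical responses; the items and groups are invented.
responses = [
    {"role": "teacher", "checked": {"workload", "isolation"}},
    {"role": "teacher", "checked": {"workload"}},
    {"role": "administrator", "checked": {"isolation"}},
    {"role": "administrator", "checked": set()},
]

print(endorsement_rates(responses, "workload", "role"))
# prints {'teacher': 1.0, 'administrator': 0.0}
```

A large difference between groups, as in this invented example, points to an item worth following up in the open-ended comments.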

This is better than surveys with open-ended questions where the majority of respondents give either no answer or an answer of fewer than five words, which is basically useless.

With thanks to Tom Granoff. He thinks he didn't invent it, but doesn't know its origins. It might have been the use of symptom checklists as a way to do quick clinical assessments, and he adapted it to dissertation work.