Coding data for qualitative analysis

Ross Woods, 2022-24

Thematic coding is a common method of analyzing documentary data, usually transcripts of interviews or focus groups with open-ended questions. As a qualitative methodology, it gives researchers a way to interpret and analyze data.

You might be able to develop a coding system during the preparation stage. Compare these ways of doing it:

Thematic coding is derived from a research approach called grounded theory. In essence, this is a method of using a comprehensive set of examples to identify patterns, from which the researcher can create a theory. The theory is justified by the range of real examples.

Thematic coding is really just a systematic way of analyzing data to reach a conclusion. It has become increasingly popular in recent years, partly because it looks like a set of steps. However, the later stages are less procedural and require more thought. It is not actually a set of steps, but is more like a set of phases that can overlap. For example, you can start transcribing and analyzing data as soon as it is collected.

Thematic coding has several advantages. First, the researcher simply has to follow the method. Second, it gives a way to systematically analyze lots of data, such as when writing a longer thesis or a dissertation. Third, the researcher can use it in stages, giving an opportunity to adapt the method as needed, and perhaps hold more interviews. Fourth, it is easier if the researcher uses voice-to-text software to transcribe interviews.

It also has several disadvantages. Although it is quite flexible, it probably doesn’t allow much scope for innovation. If you do not use transcription software, it is very time-consuming to transcribe interviews by hand, or quite expensive if you have to pay someone else to do it for you.

A series of stages

Stage 1: Continue keeping a diary

You should already be keeping a written diary of your methodology, including what you did, why you did it, your methods, and your observations. Your description is essential to your accountability. In principle, it must be detailed enough to enable someone else to follow your method.

Write your notes in full sentences so that you can understand them even after you've forgotten the actual situation in which you wrote them. (A list of unexplained topics is not helpful.)

Add records of your reflections to your diary. You can start to informally analyze data as soon as you start collecting it. You should take note of anything relevant to your research question, for example:

  1. Questions on what seems to be important or curious
  2. Observations
  3. Emerging patterns
  4. Questions that aren’t answered yet
  5. Outliers and anomalies
  6. Specific “turns of phrase,” interesting quotations that best encapsulate a strand of meaning in the data
  7. Contradictions and things that don’t seem to make sense
  8. Suspicions that things are not quite what they appear to be

In your diary, you should also write down the reasons why you interpreted the data the way you did. The description will probably be quite simple at first, but any later changes or elaborations will be significant because they indicate a better interpretation of the data.

💡 If you are writing a dissertation ...

Stage 2: Decide whether or not you will use deductive themes

You can start formulating deductive themes very early in the whole process, even before you collect any data. It is quite permissible to derive a set of deductive themes from your literature review or your statement of the research question. However, it would be a mistake to use deductive themes exclusively because other unexpected but significant themes will probably emerge later on in the data.

The alternative is to use inductive themes, that is, themes that emerge from the data and that you identify as you assign codes. You can even modify your system of inductive themes during data-gathering and analysis in order to get themes that better represent your data.

Stage 3: Let your system evolve

Some of your ongoing analysis might affect your data gathering. Qualitative research is often iterative, and this method allows you to improve your data collection and analysis as you progress.

Stage 4: When to stop collecting data

One of these methods will help you decide when to stop collecting data:

  1. You have reached “data saturation” when you have gathered data such that gathering more would not improve, strengthen, or add to your conclusions.
  2. If you use interviews, data saturation is usually reached with fewer than 20 interviews, although some supervisors or institutions have rigid rules requiring a specific minimum number of interviews, often either 20 or 25.
  3. As a common rule of thumb, you have reached data saturation when you have conducted three consecutive interviews that do not produce any new themes.* However, this depends on the quality of data. You have enough rich data when you can identify patterns in the data that are useful as conclusions, and can confirm them based on the data available. Some researchers compile quasi-statistical records of occurrences of themes, but the richness of the data is more important.
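
The rule of thumb in item 3 can be sketched in a few lines of Python. This is only an illustration, assuming that each interview has already been coded into a set of theme names. The function name `saturation_point` and the sample themes are hypothetical. The check counts new themes only; it says nothing about the richness of the data, which matters more.

```python
def saturation_point(interview_themes, window=3):
    """Return the 1-based number of the interview at which `window`
    consecutive interviews have produced no new themes, or None if
    that has not happened yet."""
    seen = set()
    run = 0  # consecutive interviews so far with no new themes
    for number, themes in enumerate(interview_themes, start=1):
        new = set(themes) - seen
        seen |= set(themes)
        run = 0 if new else run + 1
        if run >= window:
            return number
    return None


# Hypothetical themes identified in five interviews, in order.
interviews = [
    {"workload", "isolation"},
    {"workload", "mentoring"},
    {"isolation"},
    {"mentoring", "workload"},
    {"isolation", "mentoring"},
]

print(saturation_point(interviews))
```

If the function returns None, the rule of thumb suggests that more interviews might still produce new themes.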

Stage 5: Transcription

You can start transcription as soon as you have collected data. Transcribe it word-for-word into documents, although you might be able to exclude anything clearly irrelevant to your research purpose. Warning: Some things that look irrelevant at first might appear more relevant later on when you understand the data better.

Most researchers prefer to use voice-to-text transcription software or external services to do transcriptions. A few, however, prefer to do it manually because it brings them very close to the data, even though it is horrendously time-consuming.

Stage 6: Familiarize yourself with the data

Start reading and re-reading all your data while it is still coming in, and make diary notes of any other questions arising. (If you transcribe manually, this will come very easily.)

When you become very familiar with your data, it might look like very little even when you have enough. Don’t worry.

Stage 7: Select quotations

You might have started collecting quotations, but now you can treat it as an extra stage. Using direct quotations from respondents in your final report has two particular benefits:

  1. Your readers will see the lives of real people whom you have interviewed. This personalizes your research report and makes it easier to read.
  2. It shortens the distance between respondents and your readers. This helps to prevent an analysis that is largely an artificial construct that you have created.

Stage 8: Coding

If possible, start coding as soon as you have transcriptions and are familiar with the texts, while you are still collecting data.

Mark all parts of the text that are relevant to your research question with a color-code or symbol. These might be “recurring patterns, terms, or visual elements.” (Naeem et al. p. 2.) On each part of the text that you marked, put a brief label of a single word or a short phrase that says what is going on. These labels are your codes. Coding is itself part of analysis, because you are sorting raw data into structured meaning.

You now have a patchwork of the meanings of everything in your data that is relevant to theory development. It is also simpler and briefer than the full text of raw data.

💡 It is good practice to have someone else check your coding; it will help prevent or minimize personal bias in interpreting data.

💡 The simplest way is to color-code documents by hand, usually in a word processor, but paper might be easier for some people. Although time-consuming, hand-coding is still a good option because you get a better idea of what is going on in the data as you work through the details. Otherwise, you can use software, like Zotero, which is free and online; some institutions use it as their standard method. Just check that it will do what you want for your particular research project.

Stage 9: Assign themes

Group related codes together and represent them with a theme, that is, an overarching idea that represents what is happening. Themes are a higher level of abstraction.
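
If you keep your codes electronically, grouping them into themes can be sketched as below. This is a minimal illustration, not a required procedure. The segments, code labels, theme names, and the function `group_by_theme` are all hypothetical. Deciding which codes belong together is still your own analytical judgment; the code only records the decision.

```python
from collections import defaultdict

# Hypothetical coded segments: (interview id, code label, quotation).
segments = [
    ("int01", "long hours", "I never leave before eight."),
    ("int01", "no feedback", "Nobody tells me how I'm doing."),
    ("int02", "weekend work", "Saturdays are just another workday."),
    ("int02", "no feedback", "I only hear when something goes wrong."),
]

# Hypothetical mapping from each code to the theme you assigned it to.
code_to_theme = {
    "long hours": "workload",
    "weekend work": "workload",
    "no feedback": "supervision",
}


def group_by_theme(segments, code_to_theme):
    """Collect coded segments under their theme. Codes with no theme
    yet are kept under 'unassigned' so that nothing is lost."""
    themes = defaultdict(list)
    for interview, code, quote in segments:
        theme = code_to_theme.get(code, "unassigned")
        themes[theme].append((interview, code, quote))
    return dict(themes)


themes = group_by_theme(segments, code_to_theme)
for theme, items in themes.items():
    print(theme, len(items))
```

Keeping the quotation with each code means that you can always go back from a theme to the original words of the respondent.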

Stage 10: Check your themes

Do your themes accurately represent the theoretical ideas in your data and codes?

Stage 11: Develop a Conceptual Model

When you have created a system of themes, compare different occurrences and look for patterns in the data. By this stage, you should be able to see patterns; the sooner you spot the patterns and confirm them, the faster you make progress. You will find that you read the transcripts again and again, and become very familiar with them.

What are the relationships between codes and themes? You can use diagrams or models to represent the relationships among these concepts. (Naeem et al. p. 4.) Can you accurately define those relationships and demonstrate them from your data?

You can try this approach as long as you don't treat it as a rigid set of steps that will meet all your conceptualization needs**:

  1. Think of the object of your research as a phenomenon that is not yet understood.
  2. Think of your data as a set of examples of the phenomenon.
  3. Ensure that you really have only one phenomenon, and not multiple different phenomena that should be kept separate. (This is the eidetic question.)
  4. Sift your data to answer the following questions:
    1. What events or conditions actually caused this phenomenon?
    2. What intervening factors determine the path of events and cause variations in outcomes?
    3. In what contexts does it occur? Describe it as a specific set of properties.
    4. What interactions occurred?
    5. Were there changes in the phenomenon over time?
    6. Were there changes in the whole process over time?
    7. What were the particular aims or purposes of the phenomenon?
    8. Were there occurrences of the phenomenon that failed to achieve their purpose?
    9. What are the results of the phenomenon?
  5. Compare examples to resolve apparent contradictions.
  6. If necessary, get more data to answer all the above questions accurately.


How many themes?

There is no rule about specific numbers of themes. The principle is that you need enough to represent the data accurately and to help you reach sound conclusions. If the number of codes hinders and confuses your analysis, you should ask whether the number of them is the cause of the difficulty.

The data saturation level indicates that between ten and twenty themes is probably enough if your interviews are well focused on your topic. If a smaller number of themes accurately represents both the data and the research problem, then you might not need more.

If you have a large number of themes, some will probably have very few occurrences and they will not tend to be helpful. However, a large number of themes is not a bad thing in some circumstances. First, the research might have a problem of diversity. For example, the phenomena in your research problem might have a wide variety of causes, manifestations, or symptoms. Second, a small number of occurrences (the outliers) might be significant, and you cannot presume that the bulk of data is always the best data. For example, a treatment might be quite safe for 98% of subjects, but a 2% death rate might be unacceptably high.
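
A simple tally can show which themes have very few occurrences. The sketch below is an illustration only; the data, the function names `theme_counts` and `rare_themes`, and the threshold of one occurrence are assumptions. It lists rare themes for review rather than deleting them, because an outlier might be significant.

```python
from collections import Counter

# Hypothetical (interview id, theme) pairs, one per coded segment.
occurrences = [
    ("int01", "workload"), ("int01", "workload"), ("int02", "workload"),
    ("int02", "supervision"), ("int03", "supervision"),
    ("int03", "safety incident"),
]


def theme_counts(occurrences):
    """Count how many coded segments fall under each theme."""
    return Counter(theme for _, theme in occurrences)


def rare_themes(counts, threshold=1):
    """List themes with `threshold` occurrences or fewer, for review.
    The threshold is an arbitrary assumption, not a rule."""
    return sorted(t for t, n in counts.items() if n <= threshold)


counts = theme_counts(occurrences)
print(counts.most_common())
print(rare_themes(counts))
```

In this made-up example, the only rare theme is a safety incident, which is exactly the kind of outlier that might deserve more attention, not less.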

How can I code qualitative data from my interviews so that I work smarter, not harder?***

Organizing large amounts of data is possible with a computer, but it might not be the best way for everybody. Besides, if you make a mistake with a computer you might not notice it or might not be able to reverse it. Advice so far:

  1. Many students use software to do coding and this works for them. If so, keep backups of your original data files. Make progressive backups every time you make major changes.
  2. Many students like to print their data out on paper, use color coding, and perhaps even put the printouts on a wall as a huge chart.
    1. Some people simply prefer to work with paper.
    2. Printouts help them to visualize the entire dataset. It might also help them to feel more familiar with the data and perhaps to identify patterns in it. At the very least, seeing the whole thing at once will make you feel very satisfied.
    3. Paper printouts might even be necessary if you still cannot identify patterns in the data after coding with software.

* A mathematical proof of data saturation is unlikely because qualitative data is not appropriate for a mathematical proof.
** Ross Woods, 2020, '24, derived from Strauss and Corbin, 1990, pp. 99-107.
*** With thanks to Richard Scott Baskas, Rainee Bryant, Lynda Davis.

Muhammad Naeem, Wilson Ozuem, Kerry Howell, and Silvia Ranfagni. 2023. A Step-by-Step Process of Thematic Analysis to Develop a Conceptual Model in Qualitative Research. International Journal of Qualitative Methods 22: 1–18. DOI: 10.1177/16094069231205789

Ross Woods, 2020, '24. Toolkit of research methods.

Anselm Strauss and Juliet Corbin. 1990. Basics of Qualitative Research: Grounded Theory Procedures and Techniques (Newbury Park, Ca.: Sage Publications).

Appendix: Tom Granoff and a priori coding

Tom Granoff offered another way of looking at it. If the literature is fairly mature and you already have a quite good idea of what most of the top ten responses will be, give interviewees a checklist of all those responses for them to endorse those that are relevant to them. Then follow it with an open-ended question like, “Please comment on the responses that are most important to you.” This approach has several advantages:

  1. The checklist is generated directly from the literature, so content validity is clearly established.
  2. This is much easier for respondents to complete. Think about classroom testing situations: It is generally easier to “recognize an answer (multiple choices or matching)” than to “recall an answer (fill-ins or essay exam).”
  3. When compared to strict open-ended questions, this method makes for easier coding, loading of responses into the computer, analyzing the data, and comparing the findings to the literature in the Discussion Chapter.
  4. Most respondents will not invest the time or energy to give long thoughtful answers to open-ended questions on a survey. This results in under-reporting. However, most will check off possible answers and then add some comments once they have been given some reminders as to what to say.
  5. It helps respondents to clearly understand what you are looking for so they can respond correctly.
  6. Because the Checked/Not Checked format yields a dichotomous variable, these answers can be easily correlated with any demographic variables to see who is more likely to endorse that characteristic.
  7. It increases reporting of potentially embarrassing behaviors because the list normalizes a wide range of behavior.

This is better than surveys with open-ended questions where the majority of respondents give either no answer or an answer of less than five words, which is basically useless.
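
Advantage 6 can be illustrated with a short Python sketch. It assumes that each response records one demographic group and the set of checklist items that the respondent checked. The groups, items, and the function `endorsement_rates` are hypothetical. It compares the proportion of each group that checked an item, which is a simpler comparison than a correlation coefficient.

```python
from collections import defaultdict

# Hypothetical checklist responses with one demographic variable.
responses = [
    {"group": "full-time", "checked": {"workload", "isolation"}},
    {"group": "full-time", "checked": {"workload"}},
    {"group": "part-time", "checked": {"isolation"}},
    {"group": "part-time", "checked": set()},
]


def endorsement_rates(responses, item):
    """Return the proportion of each group that checked `item`.
    Checked/Not Checked is treated as a 1/0 (dichotomous) variable."""
    totals = defaultdict(int)
    checked = defaultdict(int)
    for response in responses:
        group = response["group"]
        totals[group] += 1
        checked[group] += int(item in response["checked"])
    return {group: checked[group] / totals[group] for group in totals}


rates = endorsement_rates(responses, "workload")
print(rates)
```

With real data, you would need enough respondents in each group before reading anything into the differences.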

With thanks to Tom Granoff. He thinks he didn't "invent" it, but doesn't know its origins. It might have been the use of symptom checklists as a way to do quick, efficient clinical assessments, which he adapted to dissertation work.