Additional file 2. A Python program calculating the three descriptive missing-data indicators. This custom-made Python 2.7 program was written for the analysis of the example datasets of the paper by Wærsted et al. In the user-input part, all the specifications (infile, outfile, delim, nfup, ncolskip, npers, nrowskip, colind_ID, yes_string, no_string) have to be changed according to the actual dataset to be analyzed. The version reproduced below analyzes the Norwegian dataset (see example in Figure 1 in Wærsted et al.).

# Program to calculate statistics of attendance of subjects to a scientific study with N follow-ups:
# Last response, Retention and Dispersion (in %).
# The statistic Dispersion was invented by Morten Waersted.
# This Python 2.7 program was written by Eivind G. Waersted.
# Version of 6. June 2016
# Instructions:
# - Modify the user input parameters below to fit the file format and location of your input data, before running the program.
# - The program reads the 'infile', calculates the statistics 'last response', 'retention' and 'dispersion', and then
#   writes a new file 'outfile' where these statistics are included.
# Requirements for the input file:
# - The infile should be in .csv (ascii) format (comma-separated values), which is normally a possible format to save to from Microsoft Excel.
#   The outfile will also be in this format, which can then be opened in Excel again. The delimiter does not have to be a comma;
#   the important point is that the delimiter character must never be used inside any of the cells.
# - The infile should have one row for each person; these rows must be consecutive (no empty rows). One or more header rows before the data are allowed.
# - Entirely empty rows and columns should be avoided, as these may cause wrong behaviour of the code.
# - The infile must include one column for each follow-up, in which the cells only have two different values, one indicating that the person attended
#   and one indicating not attended. These two values can be any symbol, as long as they are used consistently throughout the file.
# - The infile should also include a column giving the ID of each person; this can be any of the columns.
# - Additional columns in the infile are allowed, as long as all the follow-ups come chronologically and in consecutive columns.

# --------- start user input -----------
# Provide infile name with path (string)
infile = "/Users/eivindwaersted/src_python/data/mams.csv"
# Provide outfile name with path, to be created by this program (string)
outfile = "/Users/eivindwaersted/src_python/data/mams_output.csv"
# The delimiter character that is used to separate the columns in the input (and output) file (string)
delim = ','
# The number of follow-ups (not including the fictitious one) (int). NB: The follow-ups must be in consecutive columns in the input file
nfup = 20
# The number of columns in the input file before the column with data from the first follow-up (int)
ncolskip = 2
# The number of persons in the study (nr of rows with data in the input file) (int)
npers = 420
# The number of lines in the input file before the data begins (including header lines) (int)
nrowskip = 1
# The column where the ID of the persons is written: provide here the number of columns before this column (e.g. 0 if ID is the first column) (int)
colind_ID = 0
# Provide the encoding of the data (strings):
# The string that indicates that a follow-up was attended
yes_string = 'x'
# The string that indicates that a follow-up was not attended
no_string = '.'
# ---------- end user input -------------

# Import required python packages
import numpy as np
import sys

#######################
### READ INPUT DATA ###
#######################

# Define the lists for input data
fupdata = []  # list of boolean arrays, one per person, holding the nfup True/False attendance values
IDs = []      # ID of each person (same order as fupdata!)
# Read the data of follow-up responses
fh = open(infile, mode="r")
# Skip header lines
for i in range(nrowskip):
    fh.readline()
# Read each person's response (attended or not) at every follow-up
for i in range(npers):
    try:
        words = fh.readline().strip().split(delim)
        datai = np.empty(nfup, dtype=bool)  # array to contain yes/no info about attendance of this person at each follow-up
        IDi = words[colind_ID]  # ID of this person
        for j in range(nfup):  # go through all follow-ups
            symbol = words[j+ncolskip].strip()
            if symbol == no_string:  # the person did not attend this follow-up
                datai[j] = False
            elif symbol == yes_string:  # the person attended this follow-up
                datai[j] = True
            else:  # invalid data for this follow-up
                print "Error: Invalid string '%s' in follow-up %d of the person with ID '%s'!"%(symbol,j+1,IDi)
                sys.exit(1)
    except IndexError:
        print "Error: Data dimensions exceeded the size of the input. Please verify that your input parameters are consistent with the input file!"
        sys.exit(1)
    # Store the data read from this person
    IDs.append(IDi)
    fupdata.append(datai)
fh.close()
# -------------------------------------------

####################
### CALCULATIONS ###
####################

# Arrays to store the statistics (in %) for each person, initialized to NaN
retention = np.full(npers, np.nan)
lastrep = np.full(npers, np.nan)
dispersion = np.full(npers, np.nan)

# For each person, calculate the statistics
for i in range(npers):
    ### CALCULATION OF LAST RESPONSE AND RETENTION ###
    retention[i] = np.sum(fupdata[i])*100.0/nfup
    if retention[i] == 0:
        lastrep[i] = 0
    else:
        lastrep[i] = (np.where(fupdata[i])[0][-1] + 1)*100.0/nfup

    ### CALCULATION OF DISPERSION ###
    # 1. Assign a variable for each scheduled data collection point (baseline and follow-ups) indicating whether the subject
    #    attended or not. All subjects will have yes for baseline.
    idata = np.concatenate(( [True], fupdata[i] ))  # add baseline to the follow-up data vector of the ith person
    # 2. Subjects that have attended all data collection points shall have a dispersion indicator score of 100.
    if np.all(idata):
        dispersion[i] = 100
    # 3. Subjects attending only baseline shall have a dispersion indicator of zero.
    elif not np.any(idata[1:]):
        dispersion[i] = 0
    # 4. For all other subjects do the calculation steps 5-15.
    else:
        # 5. Before doing the calculations, add an extra "data collection point" after the final follow-up of the project period. Regard this
        #    fictitious time point as real in all the following calculations and assign it the value no for all subjects.
        idata = np.concatenate(( idata, [False] ))  # add the fictitious follow-up to the data vector
        # 6. Count the number of attended data points (including baseline and the fictitious follow-up)
        A = int(np.sum(idata))
        # 7. Count the number of unattended follow-ups (including the fictitious follow-up)
        B = int(np.sum(~idata))
        # 8. Calculate: B / A = C
        C = B*1.0 / A
        # 9. For each attended data point (baseline and follow-ups), count the number of unattended follow-ups until the next attended follow-up = D.
        #    For the last attended follow-up, count the number of data collection points until the end of the project period (including the fictitious time point).
        D = []
        unatt_count = 0  # counter to keep track of the number of unattended follow-ups since the last attended data point
        # Go through all follow-ups to determine the spacings between the attended ones
        for j in range(1, len(idata)):
            if idata[j]:
                # This follow-up was attended -> save the count from the previous attended one, then reset the counter
                D.append(unatt_count)
                unatt_count = 0
            else:
                # This follow-up was not attended -> increase the count by one
                unatt_count = unatt_count + 1
        # Finally, save the spacing after the last attended follow-up
        D.append(unatt_count)
        D = np.array(D)  # convert to an array so that the element-wise arithmetic below works
        # 10. Calculate: (D-C)^2 = E
        E = (D - C)**2
        # 11. Sum E over all attended data collection points of that subject = F
        F = sum(E)
        # 12. Calculate the lowest possible value of F for the actual value of A, by doing calculation steps 9-11 when the values
        #     in D differ by at most one = Fmin
        Cfloor = int(np.floor(C))  # C rounded down to a whole number
        # Calculate the number of attended data points that should be followed by Cfloor+1 unattended follow-ups to get Fmin.
        # The rest of the attended data points will be followed by Cfloor unattended follow-ups.
        nbigdists_Fmin = B - A*Cfloor
        Dmin = np.concatenate(( nbigdists_Fmin*[Cfloor+1], (A-nbigdists_Fmin)*[Cfloor] ))  # the values of D that give Fmin
        Fmin = sum((Dmin - C)**2)
        # 13. Calculate the highest possible value of F for the actual value of A, by doing calculation steps 9-11 when all attended follow-ups come
        #     consecutively after baseline, followed by all the missing follow-ups = Fmax
        Fmax = (A-1)*C**2 + (B - C)**2
        # 14. Calculate: (F - Fmin) / (Fmax - Fmin) x 100 = G
        G = (F - Fmin)*100.0 / (Fmax - Fmin)
        # 15. Calculate the Dispersion indicator score = 100 - G
        dispersion[i] = 100 - G
# ----------------------------------------------------

############################
### WRITE OUTPUT TO FILE ###
############################

# Save to file (columns will be ID, base, fups..., lastrep, retention, dispersion)
fh = open(outfile, mode='w')
# Write header
strh = "ID" + delim + "b"  # ID and baseline
for j in range(nfup):  # follow-ups
    strh = strh + delim + str(j+1)  # count from 1, not 0
strh = strh + delim + "last-response" + delim + "retention" + delim + "dispersion"  # statistics
fh.write(strh)
# Write data and statistics for each person
for i in range(npers):
    stri = "\n" + IDs[i] + delim + yes_string  # ID and baseline
    for j in range(nfup):  # follow-ups
        if fupdata[i][j]:
            stri = stri + delim + yes_string
        else:
            stri = stri + delim + no_string
    stri = stri + delim + "%.2f"%lastrep[i] + delim + "%.2f"%retention[i] + delim + "%.2f"%dispersion[i]  # statistics
    fh.write(stri)
fh.close()
print "Statistics were written to '%s'."%outfile
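As a quick check of the algorithm, the dispersion calculation (steps 1-15 above) can be condensed into a standalone function. This sketch is not part of the original program: the function name dispersion_score and the example attendance vectors are illustrative only. Unlike the listing above, it is written to run under both Python 2.7 and Python 3.

```python
import numpy as np

def dispersion_score(fup):
    """Dispersion indicator (in %) for one subject, following steps 1-15.
    fup: sequence of booleans, True where the follow-up was attended."""
    # Step 1: prepend the baseline, which every subject attends
    idata = np.concatenate(([True], np.asarray(fup, dtype=bool)))
    # Steps 2-3: the two trivial cases
    if np.all(idata):
        return 100.0
    if not np.any(idata[1:]):
        return 0.0
    # Step 5: append the fictitious, always-unattended data collection point
    idata = np.concatenate((idata, [False]))
    # Steps 6-8
    A = int(np.sum(idata))    # attended data points (incl. baseline)
    B = int(np.sum(~idata))   # unattended follow-ups (incl. fictitious one)
    C = float(B) / A
    # Step 9: gap of unattended follow-ups after each attended data point
    D, gap = [], 0
    for attended in idata[1:]:
        if attended:
            D.append(gap)
            gap = 0
        else:
            gap += 1
    D.append(gap)
    # Steps 10-11: squared deviations of the gaps from C, summed
    F = float(np.sum((np.asarray(D) - C) ** 2))
    # Step 12: the most even spread of the gaps gives Fmin
    cf = int(np.floor(C))
    nbig = B - A * cf         # number of gaps equal to cf+1 (the rest are cf)
    Dmin = np.asarray([cf + 1] * nbig + [cf] * (A - nbig), dtype=float)
    Fmin = float(np.sum((Dmin - C) ** 2))
    # Step 13: all attendance bunched at the start gives Fmax
    Fmax = (A - 1) * C ** 2 + (B - C) ** 2
    # Steps 14-15
    return 100.0 - (F - Fmin) * 100.0 / (Fmax - Fmin)
```

For example, with four follow-ups, the alternating pattern [True, False, True, False] scores about 66.7, the perfectly even pattern [False, True, False, True] scores 100.0, and the front-loaded pattern [True, True, False, False] scores 0.0, matching the intent that dispersion measures how evenly the attended follow-ups are spread, independently of retention.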