Saturday, April 25, 2009

Data Patterns in the Mediterranean Diet Score

The original construction of the edifice known as the Mediterranean Diet began with a paper which used a scoring system to handle the mass of data that results when you give thousands of people food frequency questionnaires. The data manipulation goes roughly like this: people say how often and how much they eat of typical dishes and foodstuffs; these quantities and frequencies are converted to daily consumption of each food group, and each food group is given a score. The result: massive amounts of data reduced to one number.

Let's recall the basics. The food groups were: vegetables, fruits & nuts, legumes, meat & poultry, fish, dairy products, cereals, the ratio of monounsaturated to saturated fat intake, and alcohol intake. On each group, a participant scored 0 (bad) for eating more than the median of a bad group (meat & poultry, dairy, alcohol), for having a low mono to saturated fat ratio, or for not eating enough of a good group (vegetables, fruit & nuts, legumes, fish, cereals), and scored 1 (good) for doing the opposite. When these nine scores are added up, the lowest possible total (bad) is 0 and the highest (good) is 9.
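To make the scoring rule concrete, here is a minimal Python sketch of it. The group names, the `diet_score` function and the treatment of ties (an intake exactly at the median counts as "high") are my own assumptions for illustration; the paper's exact cut-off rules are not reproduced here.

```python
# Sketch of the 0/1 scoring described above. All names are hypothetical.
# Groups where eating MORE than the median earns the point:
HIGHER_IS_GOOD = ("vegetables", "fruits_nuts", "legumes", "fish",
                  "cereals", "mono_sat_ratio")
# Groups where eating LESS than the median earns the point:
HIGHER_IS_BAD = ("meat_poultry", "dairy", "alcohol")


def diet_score(intake, medians):
    """Return a 0-9 score from daily intakes and the sample medians.

    Assumption: an intake exactly equal to the median counts as 'high'.
    """
    score = 0
    for group in HIGHER_IS_GOOD:
        if intake[group] >= medians[group]:
            score += 1
    for group in HIGHER_IS_BAD:
        if intake[group] < medians[group]:
            score += 1
    return score
```

Feeding in one participant's intakes and the medians for the whole sample returns that participant's single number between 0 and 9.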

Now, it turns out that the pattern of scores expected can be modelled by a mathematical probability distribution known as the binomial distribution. Strictly speaking, to adopt this model of the situation we need to make two assumptions about the behaviour of the participants. Firstly, we assume that each participant chooses foodstuffs and quantities independently of (i.e. not influenced by) every other participant; this is quite likely. Secondly, we need to assume that a participant's score on each food group is independent of (i.e. not influenced by) their scores on the other food groups. This second assumption is not entirely true. For example, it is clear that there would be some correlation between consumption of dairy products and of meat & poultry on the one hand and the monounsaturated/saturated fat ratio on the other. However, for now, let's make this assumption and then check whether the actual data support it.

With this model for the scoring process, it becomes possible, given the total number of participants, to calculate the expected numbers with different scores. It is quite easy to see that, given the way the scoring system has been constructed, there will be a full spread of scores, because each food group is scored relative to the median, so on every group there is a 50-50 chance of scoring 0. Another point to consider is that, with the exception of scores 0 and 9, there are multiple ways of obtaining each score. For example, a total of 1 is obtained by scoring 1 on one and only one of the 9 food groups, which means there are 9 different ways of getting this score. A total of 2 is obtained by scoring 1 on exactly 2 of the 9 groups, and there are 36 ways of doing this. For the most likely scores of 4 and 5, it can be shown that there are 126 different ways to obtain each of these scores.
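The counting above can be checked with a few lines of Python. This is only a sketch under the model's assumptions (9 independent groups, each with a 50-50 chance of scoring 1); the function names are mine.

```python
from math import comb

N_GROUPS = 9  # number of food groups in the score


def ways(score, n_groups=N_GROUPS):
    """Number of distinct sets of food groups that give this total score."""
    return comb(n_groups, score)


def score_probability(score, n_groups=N_GROUPS, p=0.5):
    """Binomial probability of a given total score."""
    return comb(n_groups, score) * p**score * (1 - p) ** (n_groups - score)


# Print score, number of ways, and probability for every possible total.
for s in range(N_GROUPS + 1):
    print(s, ways(s), round(score_probability(s), 4))
```

Multiplying each probability by the total number of participants gives the expected number of people with that score.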

Table 2 in the paper (and you should look at this table while reading the next bit) shows the individual food group scores versus the Mediterranean diet score. This table is important because it gives some insight into the raw results of the scoring process, which are otherwise obscured because the individuals' scores are grouped into three categories: low diet score (0-3), medium (4-5) and high (6-9). This grouping is distinctly unhelpful and does not let us see how many got each individual score. However, we can see, within each diet score category, how many people scored 0 or 1 for a particular food group (i.e. how many people ate more than the median amount and how many ate less than it, or vice versa).

Using the binomial distribution and the total number of individuals, it is possible to predict how many people should score 0 or 1 for each food group in the three diet score categories under the assumptions outlined above. (But note that we will get the same prediction for each and every food group because the model makes no distinction between them.) For example, for men in the category of low diet score (0-3), we would expect about 2257 individuals, with only 643 (28%) scoring 1 on a given food group and 1616 (72%) scoring 0. In the category of medium diet score (4-5), 4378 individuals are expected, with equal numbers scoring 0 and 1. In the category of high diet score (6-9), we would again expect about 2257 individuals, with 643 (28%) scoring 0 and 1616 (72%) scoring 1.
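For anyone who wants to reproduce these percentages, here is a rough Python sketch of the calculation. It assumes the same binomial model (9 independent groups, 50-50 on each); the function names and the `n_people` parameter are mine, and the number of men in the study has to be supplied by the reader since I have not restated it here.

```python
from math import comb

N_GROUPS = 9
# Diet score categories as used in Table 2 of the paper.
CATEGORIES = {"low": range(0, 4), "medium": range(4, 6), "high": range(6, 10)}


def category_probability(scores, n_groups=N_GROUPS):
    """Probability that a total score falls in the given category."""
    return sum(comb(n_groups, s) for s in scores) / 2**n_groups


def share_scoring_one(scores, n_groups=N_GROUPS):
    """Within a category, the fraction scoring 1 on any one chosen group."""
    # Fix the chosen group at 1; the remaining groups must supply s - 1 points.
    with_one = sum(comb(n_groups - 1, s - 1) for s in scores if s >= 1)
    total = sum(comb(n_groups, s) for s in scores)
    return with_one / total


def expected_counts(n_people, scores):
    """Expected (people in category, scoring 1, scoring 0) for one group."""
    in_category = n_people * category_probability(scores)
    ones = in_category * share_scoring_one(scores)
    return in_category, ones, in_category - ones


for name, scores in CATEGORIES.items():
    print(name, round(category_probability(scores), 4),
          round(100 * share_scoring_one(scores)), "% scoring 1")
```

The shares scoring 1 come out at 37/130 (about 28%) in the low category, exactly one half in the medium category and 93/130 (about 72%) in the high category, whichever food group is chosen.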

How does the prediction compare to the actual values? Quite well, in fact, for legumes and fruit & nuts (both 23%/77% in the low and high categories and 50/50 in the medium score category) and for dairy products (31%/69% in the low and high categories and 50/50 in the medium category). It does not do quite so well for fish, vegetables and fat ratio, where the ratios are actually 'more extreme' than predicted (18%/82% or 20%/80% in the low and high score categories). Cereals (36%/64%) and meat & poultry show the worst correspondence, with ratios that are 'less extreme' than predicted. Overall, we slightly overestimate the number in the medium score category (actual number 3808) and underestimate the numbers in the low and high categories.

There are some foodstuffs included in Table 2 which are not included in the calculation of the Mediterranean diet score: eggs, potatoes and sweets. As they are not used to calculate the score, we would expect a participant in any diet score group to be equally likely to be above or below the median consumption of these items, giving approximately a 50%/50% split in each diet score category. A sizeable departure from these figures would suggest that participants' consumption of these non-scoring foodstuffs is in some way dependent on or linked to consumption of a scoring food group. However, this is clearly not the case, except possibly for potatoes, which show about a 16% deviation from the expected 50-50 split in favour of (unsurprisingly) vegetables.

This leads to the most notable finding in this table, a point which went completely unmentioned in the original article. The distribution of meat and poultry consumption appears to be essentially independent of the Mediterranean diet score. Within each scoring category, the distribution of above and below median consumption for meat and poultry is much more like that of eggs, potatoes and sweets than that of the other items making up the diet score. Whatever the diet score group, there is close to a 50-50 split in the distribution of individuals' meat consumption. On a chi-squared test of the actual versus the expected values, the meat and poultry item shows the strongest result for independence, in common with the foodstuffs which are independent of the diet score (e.g. eggs, potatoes, sweets) because they are not used in its calculation. It is interesting that this result is not remarked upon in the paper. In fact, to the contrary, when giving an example based on the link between a 2-point increment in the diet score and improved survival, the paper mentions that such an increment could be achieved by 'making a substantial reduction in meat intake', despite the evidence that many high scorers score highly in spite of an above median meat intake!
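For readers who want to try this kind of check themselves, here is a small Python sketch of a chi-squared test of independence on a 2 x 3 table (score 0 or 1 on one food group, by low/medium/high diet score category). This is one plausible way of setting up the test, not necessarily the exact calculation used for this post, and the function names are mine. The first table holds the counts predicted by the binomial model above; the second is a made-up table with a 50-50 split in every category, standing in for a food that is unrelated to the score.

```python
from math import exp


def chi_squared_statistic(table):
    """Pearson chi-squared statistic for independence of rows and columns."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_two_df(stat):
    """Upper-tail p-value; valid ONLY for 2 degrees of freedom (2 x 3 table)."""
    return exp(-stat / 2)


# Rows: score 0, score 1. Columns: low, medium, high diet score.
model_prediction = [[1616, 2189, 643],
                    [643, 2189, 1616]]
even_split = [[100, 200, 100],   # made-up counts, for illustration only
              [100, 200, 100]]

for name, table in (("model", model_prediction), ("even", even_split)):
    stat = chi_squared_statistic(table)
    print(name, round(stat, 1), p_value_two_df(stat))
```

A statistic close to zero, as with the evenly split table, is what independence from the diet score looks like; the large value for the model's own prediction is what a food group that genuinely contributes to the score should produce.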
