Too much choice can be a bad thing. You have a dataset and you need to fit a regression model to predict something, but you have possible predictor variables coming out of your ears. How are you going to decide which predictors to leave in and which to leave out? That's a really important question, so let's look at some guiding principles to steer your course.

A good start is to read the existing relevant literature. Studies in high-profile peer-reviewed journals are more likely to have been done well than those in little-known journals from unscrupulous publishing houses that will accept anything if you pay them. You can also ask experts, if you know any. These sources will give you a few suggestions of what to include, but they probably won't do the whole job for you.

Before going any further, let's be clear about what your model is trying to do. You want to predict a patient outcome with enough accuracy to be useful and realistic, but you don't want a model so complicated that you can't interpret its coefficients (which for logistic regression are odds ratios) when the time comes to explain them. You also need your model to be robust, meaning that it should still work well when you apply it to another dataset with different patients.

So let's first consider the pros and cons of a model with only one predictor, and then of one with a hundred. I always like to exaggerate to illustrate a point; it's fun. Say your model has a single predictor, which you've selected based on your reading of the relevant literature. To make it really easy, let's say it's gender, recorded as either male or female. This model has a grand total of two parameters: one for one gender, say female, and one for the intercept, which captures the odds for the other gender, males. This model has some obvious advantages: it's quick to run, even on a slow computer, and simple to interpret and explain to other people.
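To make those two parameters concrete, here's a minimal sketch with made-up counts. A handy fact: with a single binary predictor, the logistic regression maximum-likelihood fit reproduces the observed log-odds exactly, so we can compute both parameters straight from a 2x2 table without any fitting machinery. The patient counts below are invented purely for illustration.

```python
import math

# Hypothetical 2x2 counts: outcome by gender (assumed data, not real).
events_male, total_male = 100, 500      # 100 of 500 males had the outcome
events_female, total_female = 150, 500  # 150 of 500 females did

# Odds = events / non-events within each group.
odds_male = events_male / (total_male - events_male)          # 100/400 = 0.25
odds_female = events_female / (total_female - events_female)  # 150/350

# The intercept is the log-odds in the reference group (males);
# the female coefficient is the log of the odds ratio.
intercept = math.log(odds_male)
beta_female = math.log(odds_female / odds_male)
odds_ratio = math.exp(beta_female)

print(f"intercept = {intercept:.3f}")
print(f"odds ratio (female vs male) = {odds_ratio:.3f}")
```

So exponentiating the intercept gives you back the odds for males, and exponentiating the female coefficient gives the odds ratio: that's all there is to interpret in this tiny model.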
The parameters of the model, that's the intercept and the odds ratio for the effect of being female compared with being male, will have nice narrow confidence intervals, because each is based on a lot of patients. If your dataset contains 1,000 patients and the gender split is 50:50, you've got 500 patients to estimate the odds for each gender; that's a lot. This model will be robust, but the outcome is hardly likely to be due only to the patient's gender, so the model's predictive power would be poor.

To get better prediction, you'll need more predictors, so let's consider a model with a hundred of them. Say you've just thrown them all in together. The discrimination of the model, as measured by the C-statistic, may well be high: let's say it's now 0.85, whereas with just gender in it, it was only 0.53, so a huge improvement. But this model will have taken much longer to run, and you've got a lot of interpreting and explaining to do. Some predictors will have low p-values, but many won't. Some will have large standard errors and wide confidence intervals, meaning that the estimated odds ratios for those predictors carry a lot of uncertainty about their real values; they are unstable. This model is not robust, and its output probably can't be trusted. If you fitted the same model to a different set of patients, you'd probably get some very different odds ratios. This is called overfitting, which I'll explain in more detail separately.

So what should you do? You need to prune the model and clear out the junk. There are some exotically named technical tricks for this that can be used in regression but are also considered machine learning methods; these are beyond the scope of this course. If prior knowledge isn't enough to help, there are some other commonly used approaches. Commonly used, but really smelly, so smelly that I can barely bring myself to describe them, but I must, because they are so widespread. The first is forward selection.
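Since the C-statistic does so much work in that comparison, it's worth seeing what it actually measures: the probability that a randomly chosen patient who had the outcome gets a higher predicted score than a randomly chosen patient who didn't (ties counting as half). This is a small sketch with toy scores, not any particular library's implementation:

```python
def c_statistic(scores, outcomes):
    """C-statistic (equivalently, area under the ROC curve):
    the proportion of case/non-case pairs in which the case
    has the higher predicted score, ties counted as 0.5."""
    cases = [s for s, y in zip(scores, outcomes) if y == 1]
    controls = [s for s, y in zip(scores, outcomes) if y == 0]
    concordant = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                concordant += 1.0
            elif c == k:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))

# Toy predicted probabilities for four patients (invented numbers).
perfect = c_statistic([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])  # every case outranks every non-case
useless = c_statistic([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])  # all ties: no discrimination
print(perfect, useless)
```

A value of 1.0 means perfect discrimination and 0.5 means the model is no better than a coin flip, which is why 0.53 for the gender-only model is so unimpressive and 0.85 looks so tempting.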
Here, you start with no predictors in the model and then try them one at a time, beginning with the one with the lowest p-value. You keep adding variables until none of the remaining ones are significant, often defined as p less than 0.1 so that you don't miss important ones. This is horrible. You might think you're keeping only those predictors where p is less than 0.1, but all this testing of this and testing of that means you've no idea what the real p-values are, and the confidence intervals don't make sense in this situation either. It's not robust.

A variant on this is stepwise selection, which allows you to drop variables if they become non-significant when you add a new one. This is also horrible, for the same reasons.

Thirdly, there's backwards elimination. Here, you put all the possible predictors in at once and drop the non-significant ones, beginning with the least significant, the one with the highest p-value. This is the least bad of the three; I use it sometimes, though with caution.

Choosing which predictors to keep in your model is a vital task and an art, but it can be fraught with danger. Using prior knowledge is good, and backwards elimination is useful when applied carefully, but forward selection and stepwise selection are too smelly even to be considered.
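The backwards elimination loop itself is simple enough to sketch. In real use you would refit the regression at every step (for example with a statistics package) and read off fresh p-values, because they change each time a variable is dropped; here the `fit_pvalues` function and its predictor names are hypothetical stand-ins that just return fixed values, purely to show the shape of the loop:

```python
def backward_eliminate(predictors, fit_pvalues, threshold=0.05):
    """Repeatedly refit the model and drop the least significant
    predictor until every remaining p-value is below `threshold`."""
    kept = list(predictors)
    while kept:
        pvals = fit_pvalues(kept)                   # refit on current set
        worst = max(kept, key=lambda v: pvals[v])   # highest p-value
        if pvals[worst] < threshold:
            break                                   # all significant: stop
        kept.remove(worst)                          # drop it and refit
    return kept

# Hypothetical predictors with invented p-values (a real refit would
# return different p-values after each drop; this is a toy stand-in).
fake_p = {"age": 0.001, "gender": 0.02, "shoe_size": 0.73, "eye_colour": 0.41}
result = backward_eliminate(list(fake_p), lambda kept: fake_p)
print(result)  # ['age', 'gender']
```

Note the order of removal: "shoe_size" goes first because it has the largest p-value, then "eye_colour", and the loop stops once everything left clears the threshold, which is exactly the "least significant first" rule described above.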