When fitting fails, it fails for basically two reasons:

- bad initial conditions
- bad model (ill-conditioning/multicollinearity)

##### Initial conditions.

90% of the time it's the initial conditions. Finding good initial conditions is usually one of the hardest parts of fitting. You expect the fitter to do all the work, but it's really not that good: it just refines solutions once you're already in the vicinity of the answer.

As a RULE, always plot these three things overlaid:

- Plot the data you want to fit, AFTER any transforms you might apply (for example, if you're going to fit the log-transformed data, PLOT the log-transformed data).
- Plot the predictions based on the initial guesses for the parameters.
- Plot the predictions based on the fit parameters.
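A minimal sketch of this three-plot check, using matplotlib and a made-up logistic dataset. The numbers here are invented for illustration, and the "fitted" curve is a stand-in (no real fit is run):

```python
import math
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt

# Toy data from a logistic curve y = K / (1 + P*exp(-r*x))
xs = [i * 0.5 for i in range(21)]
true = lambda x: 10 / (1 + 9 * math.exp(-1.0 * x))
ys = [true(x) for x in xs]

guess = lambda x: 8 / (1 + 5 * math.exp(-0.5 * x))  # initial-guess curve
fitted = true                                       # stand-in for the fitted curve

fig, ax = plt.subplots()
ax.plot(xs, ys, "o", label="data (post-transform)")
ax.plot(xs, [guess(x) for x in xs], "--", label="initial guess")
ax.plot(xs, [fitted(x) for x in xs], "-", label="fitted")
ax.legend()
fig.savefig("fit_check.png")
```

If the dashed "initial guess" curve is nowhere near the points, the fitter has little hope; that's the failure this check catches before you ever call the solver.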

How do we obtain good initial guesses?

- One option is to guess some parameters (guided by what they mean in your model) and check whether they yield predictions anywhere near the data they're supposed to fit. This often works fine if you're only fitting one dataset. But what if you've got to fit 100 datasets? Then it's unlikely that a single initial estimate will work for all of them, and we'll need to come up with initial estimates for each one.
- A better option is to use heuristics that get you to ballpark estimates, correct to perhaps an order of magnitude. Examples of heuristics:
  - If I were fitting a straight line, e.g. y = mx + b (note that this is just a toy example and you'd never use an iterative solver to fit this), I might get an initial guess for the slope parameter m by first sorting my data in order of ascending x, and then using (y_last - y_first)/(x_last - x_first).
  - If I were fitting a logistic function, y = K / (1 + P*exp(-r*x)), I might choose K = max(y), because in my model, K is the maximum value y ever attains. For an initial guess for P: since y = K/(1+P) at x = 0, I might see if I have a datapoint around x = 0. If I do, call it (x*, y*); then a decent initial guess for P is P = K/y* - 1, i.e. max(y)/y* - 1.
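The two heuristics above can be sketched in a few lines of Python. The function names are my own, and the rough guess for r (one over the x-range) is an extra assumption not covered above, included only so the function returns a full parameter set:

```python
import math

def slope_guess(xs, ys):
    # Sort by x, then use rise over run between the extreme points.
    pts = sorted(zip(xs, ys))
    (x0, y0), (x1, y1) = pts[0], pts[-1]
    return (y1 - y0) / (x1 - x0)

def logistic_guesses(xs, ys):
    """Ballpark (K, P, r) for y = K / (1 + P*exp(-r*x))."""
    # K: the plateau, approximated by the largest observed y.
    K = max(ys)
    # P: from y(0) = K / (1 + P), using the datapoint closest to x = 0.
    x_star, y_star = min(zip(xs, ys), key=lambda p: abs(p[0]))
    P = K / y_star - 1
    # r: crude order-of-magnitude guess (my assumption): one characteristic
    # rise over the observed x-range.
    r = 1.0 / (max(xs) - min(xs))
    return K, P, r
```

These won't be accurate, and they don't need to be; they just have to land close enough that the solver's local refinement can take over.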

##### Fitting on a log or linear scale (and transforms more generally):

While I can fit models many ways, one of the most common issues I've had arise is whether to fit them on a log or linear scale. These do NOT give the same result - e.g. fitting y = mx + b versus fitting log(y) = log(mx + b). Why? Because when we fit, we're implicitly trying to minimize the overall difference between the left-hand side and the right-hand side of the equation. Strictly, most fitters default to minimizing the sum of squared errors between the left- and right-hand sides (observed data and predicted data, respectively). On a linear scale, the difference between 10 and 100 is a lot more than the difference between 1 and 20, whereas on a log scale, the latter is the larger discrepancy.
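A quick toy computation of those two gaps makes the point concrete:

```python
import math

pairs = [(10, 100), (1, 20)]
for a, b in pairs:
    lin = abs(b - a)                           # absolute (linear-scale) gap
    dec = abs(math.log10(b) - math.log10(a))   # gap in decades (log scale)
    print(f"{a} vs {b}: linear gap {lin}, log10 gap {dec:.2f}")
# 10 vs 100: linear gap 90, log10 gap 1.00
# 1 vs 20: linear gap 19, log10 gap 1.30
```

A squared-error fitter on the linear scale would weight the first pair's mismatch about (90/19)^2 ≈ 22x more heavily; on the log scale, the second pair matters more.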

- Practically speaking, the way to choose whether to fit your data on a linear or log scale is to ask: do I care about absolute deviations, or relative deviations? Suppose I have some data (y) that span four orders of magnitude, from 0.01 to 100. On a linear scale, the fit will be dominated by the points near 100; on a log scale, every decade contributes comparably.
- Equivalently: do I believe the errors in my data are additive or multiplicative? If additive, fit the data on a linear scale. If multiplicative, fit it on a log scale.
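As a sketch of the multiplicative-error case (the exponential model and function name here are my own illustration, not from the text above): fitting y = A*exp(r*x) on a log scale reduces to ordinary least squares on log(y), since log(y) = log(A) + r*x is linear in x:

```python
import math

def fit_exponential_log_scale(xs, ys):
    """Fit y = A * exp(r*x) by ordinary least squares on log(y).

    Appropriate when errors are multiplicative (roughly constant relative
    error), since taking logs turns them into additive errors.
    """
    n = len(xs)
    lys = [math.log(y) for y in ys]
    mx = sum(xs) / n
    my = sum(lys) / n
    # Closed-form simple linear regression of log(y) on x.
    r = sum((x - mx) * (ly - my) for x, ly in zip(xs, lys)) / \
        sum((x - mx) ** 2 for x in xs)
    A = math.exp(my - r * mx)
    return A, r
```

Fitting the same model on the linear scale (e.g. with a nonlinear least-squares solver) would instead weight absolute deviations, and the largest y-values would dominate the answer.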

- Instead of reaching for a canned fitting function, it often helps to formulate the problem explicitly as an optimization (a maximization or minimization problem); this forces you to think clearly about what fitting actually means. Typical fitting functions (e.g. lm() or nls() in R) minimize the sum of squared errors between the data and the predictions.
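Writing the optimization out by hand makes the objective explicit. Here's a minimal sketch - the function names are mine, and plain finite-difference gradient descent is used only because it's the simplest thing that works on a toy problem, not because it's what lm() or nls() do internally:

```python
def sse(predict, params, xs, ys):
    # The objective being minimized: sum of squared errors
    # between observed ys and model predictions.
    return sum((y - predict(x, *params)) ** 2 for x, y in zip(xs, ys))

def fit_by_descent(predict, params, xs, ys, lr=0.01, steps=2000, h=1e-6):
    """Minimize the SSE by gradient descent with finite-difference gradients."""
    params = list(params)
    for _ in range(steps):
        base = sse(predict, params, xs, ys)
        grads = []
        for i in range(len(params)):
            bumped = params[:]
            bumped[i] += h  # forward difference in parameter i
            grads.append((sse(predict, bumped, xs, ys) - base) / h)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```

For example, `fit_by_descent(lambda x, m, b: m * x + b, [0.0, 0.0], xs, ys)` recovers the slope and intercept of a toy line. The point isn't the optimizer; it's that "fitting" is nothing more than minimizing this explicitly written loss, and swapping `sse` for another objective changes what "best fit" means.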