This article is intended to help students understand the concept of a coverage probability involving confidence intervals. Mathematica is used as a language for describing an algorithm to compute the coverage probability for a simple confidence interval based on the binomial distribution. Then, higher-level functions are used to compute probabilities of expressions in order to obtain coverage probabilities. Several examples are presented: two confidence intervals for a population proportion based on the binomial distribution, an asymptotic confidence interval for the mean of the Poisson distribution, and an asymptotic confidence interval for a population proportion based on the negative binomial distribution.
### 1. Introduction

### 2. A Population Proportion and the Binomial Distribution

#### The Simplest Confidence Interval

#### A Better Confidence Interval for a Population Proportion

### 3. The Mean of the Poisson Distribution

### 4. The Population Proportion and the Negative Binomial Distribution

### 5. Summary

### Acknowledgments

### References

### About the Author

Introductory courses in mathematical statistics present the rudimentary concepts behind confidence intervals. The creation of confidence intervals often involves the use of maximum likelihood estimation and the central limit theorem along with estimated standard errors; this is described in Casella and Berger [1], p. 497. Consequently, the level of confidence is often only approximate. This is particularly the case when continuous probability models are used to approximate discrete probabilities. The probability that the interval surrounds the unknown parameter depends on the value of the unknown parameter. Such a probability is called a *coverage probability*. Confidence is defined as the infimum of the coverage probabilities. The following definitions can be found in Casella and Berger [1], p. 418.

**Definition (Coverage and Confidence)**

Let $\mathbf{X} = (X_1, \ldots, X_n)$, where the $X_i$ are all independent from a distribution with probability density (or discrete mass) function given by $f(x \mid \theta)$. The support of each $X_i$ is $\mathcal{X}$ and the parameter space is $\Theta$. Let $L(\mathbf{X})$ and $U(\mathbf{X})$ be the lower and upper limits of a confidence interval. Then the coverage probability of the interval evaluated at $\theta \in \Theta$ is $P_{\theta}\left(L(\mathbf{X}) < \theta < U(\mathbf{X})\right)$. The level of confidence is $\inf_{\theta \in \Theta} P_{\theta}\left(L(\mathbf{X}) < \theta < U(\mathbf{X})\right)$.

Students are often confused about how to compute coverage probabilities. This tutorial is intended to help students understand them. We give a detailed explanation of calculating one particular coverage probability, which allows one to perform the calculations with a minimum of distraction involving programming. We then compute coverage probabilities using the higher-level functions Probability and NProbability, which allow specifying a function of a random variable along with its distribution. In both cases these functions allow one to focus on the higher-level ideas rather than the low-level nuts and bolts of programming.

Coverage probabilities are best calculated by computer. This necessitates the choice of a programming language and programming environment. Statisticians are generally familiar with one or more statistical programming languages such as SAS, R and so on. Such languages are necessary productivity tools due to their significant data handling capabilities as well as their statistical methods. They are indispensable to the statistician. However, they are not as useful as languages for describing algorithms: small “bookkeeping” matters often obscure the algorithm or method being computed. This tutorial uses Mathematica as a language to describe the computation of coverage probabilities. With a little additional effort, one can produce graphs of coverage probabilities as well as dynamic demonstrations that use a slider to illustrate the effect of the sample size on the graph. The Wolfram Demonstrations Project website contains numerous Demonstrations involving a wide variety of topics. One such Demonstration provided by Heiner and Wagon [2] involves coverage probabilities for a population proportion using a Wald approach as well as a Bayesian approach. This article takes a different approach than Heiner and Wagon.

We illustrate the idea of coverage (and hence confidence) with several examples.

Section 2 describes two asymptotically justified confidence intervals for estimating a population proportion based on the binomial distribution. The first confidence interval is a simple hand-calculation interval contained in many textbooks. We present a step-by-step algorithm for computing the coverage probability for one specific value of the population parameter. We stress clarity of computation rather than efficiency. The approach is adequate for a population described by a discrete distribution with a finite number of possible values. We then compute the coverage probability using a much higher-level function, Probability, to automatically compute the probability associated with an inequality. We also use it for subsequent calculations. We produce a typical graph of coverage probabilities found in some textbooks. The second confidence interval for a population proportion (again based on the binomial distribution) is more complicated but has gained popularity. Naturally, it will be seen that coverage probabilities generally differ from the nominal level of confidence when approximations are used to create a confidence interval. This is illustrated in the examples below.

Section 3 presents an asymptotically justified confidence interval for the mean of a population described by a Poisson distribution. The Poisson distribution has infinitely many possible observable values. The function used to evaluate coverage probabilities automatically takes this into account.

Section 4 presents a graph of coverage probabilities based on an asymptotically justified confidence interval for estimating a population proportion based on the negative binomial distribution.

Section 5 presents a summary.

A population has a proportion $p$ of members with a given characteristic. In order to estimate $p$, one randomly selects $n$ members of the population with replacement, say $X_1, \ldots, X_n$, where the $X_i$ are independent and identically distributed random variables, each with a Bernoulli distribution with parameter $p$. If $Y$ is the number of members in the random sample possessing the target characteristic, that is, $Y = \sum_{i=1}^{n} X_i$, then $Y$ has a binomial distribution with parameters $n$ and $p$. The sample proportion of members with the characteristic is $\hat{p} = Y/n$. Two large sample confidence intervals for $p$ are typically given. We start with the simplest. A large sample confidence interval of size $1 - \alpha$ for $p$ is given by

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \tag{1}$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ point of the standard normal distribution.

Let $\widehat{\mathrm{se}}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n}$, the estimated standard error of $\hat{p}$. So, we may shorten (1) by writing it as

$$\hat{p} \pm z_{\alpha/2}\, \widehat{\mathrm{se}}(\hat{p}) \tag{2}$$

One can find the confidence interval in expression (2) in virtually any statistics book; in particular, see Devore and Berk [3] p. 396. Also, coverage probabilities for this confidence interval are described in Brown, Cai and DasGupta [4]. The derivation of the interval leads one to believe that the level of confidence is $1 - \alpha$. However, two approximations are used to derive the interval in expression (2). One approximation uses the central limit theorem. A second approximation uses an estimated variance for the sampling distribution of the sample proportion $\hat{p}$. We want to compute the actual coverage probability for any possible value of the true population proportion $p$. The coverage probability is

$$\operatorname{coverage}(p) = P\left( \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} < p < \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \right) \tag{3}$$

where $\hat{p} = Y/n$. Books are sometimes vague about whether or not to include the endpoints in the inequality. We exclude the endpoints in order to be consistent with typical hypothesis testing methods.

The definition of coverage confuses many students. For a given value of $p$ with $0 < p < 1$, one must determine the set of values of $Y$ satisfying the inequality in expression (3) and compute the probability of observing such values of $Y$. We will describe how to determine the set of values and then compute their probability. Once we know what is actually being computed, we will move on to higher-level functions that perform the computations automatically.

We use an example with $n = 10$. A plot will show how bad the approximation can be and also displays the output of each step of the algorithm. We will compute the coverage probability for $p = 0.5$. The input and output are presented in a conversational style with some editorial comments along the way.

We wish to determine the upper $\alpha/2 = 0.025$ percentage point of the standard normal distribution. This value, $z_{\alpha/2}$, is often called a critical value of the standard normal distribution. The result of this calculation is a floating-point number, which restricts the accuracy and precision of all calculations that use it.
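The stripped input can be reconstructed along the following lines (a sketch; the variable name z is an assumption):

```mathematica
(* upper 2.5% point of the standard normal; the machine-number
   argument 0.975 forces a floating-point result *)
z = Quantile[NormalDistribution[0, 1], 0.975]
(* 1.95996 *)
```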

Define $n = 10$.

Here is the confidence interval inequality for sample size 10 and general $p$.
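A sketch of the inequality, assuming the symbol y for the binomial count and p for the true proportion (both names are assumptions):

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
n = 10;
phat = y/n;  (* sample proportion as a function of the count y *)
ineq = phat - z Sqrt[phat (1 - phat)/n] < p < phat + z Sqrt[phat (1 - phat)/n]
```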

The support of the random variable $Y$ is the set of values for which the probability mass function is positive; these are also the observable values of $Y$ for a discrete random variable. We represent the support of $Y$ with a programming variable.

This tests whether the inequality is true for each value of $Y$ when $p = 0.5$.
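A sketch of this step, including the support from the previous step (the names support and truth are assumptions):

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
n = 10;
support = Range[0, n];  (* observable values of Y *)
(* test the interval inequality at p = 0.5 for every value in the support *)
truth = Table[
  With[{phat = k/n},
   phat - z Sqrt[phat (1 - phat)/n] < 0.5 < phat + z Sqrt[phat (1 - phat)/n]],
  {k, support}]
(* {False, False, False, True, True, True, True, True, False, False, False} *)
```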

These are the positions that yield True; Flatten eliminates one level of braces. We wish to compute the probabilities of $Y$ at those positions and sum them.

These are the appropriate values of the variable $Y$.

Now one computes the probabilities for the individual values of $Y$ satisfying the inequality.

Finally, the values of the individual probabilities are summed to create the actual coverage probability for $p = 0.5$.
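The remaining steps (locating the True positions, extracting the corresponding values of $Y$ and summing their binomial probabilities) can be sketched as one self-contained pass; the names are assumptions:

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
n = 10;
support = Range[0, n];
truth = Table[
   With[{phat = k/n},
    phat - z Sqrt[phat (1 - phat)/n] < 0.5 < phat + z Sqrt[phat (1 - phat)/n]],
   {k, support}];
positions = Flatten[Position[truth, True]];  (* positions yielding True *)
values = support[[positions]]                (* {3, 4, 5, 6, 7} *)
probs = PDF[BinomialDistribution[n, 0.5], #] & /@ values;
Total[probs]
(* 0.890625 *)
```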

The steps have been broken down so that students can easily understand what is needed. A large sample justification leads us to believe that this number should be about 0.95. The coverage probability is about 0.89 rather than 0.95.

Here is a much more transparent way to compute the coverage probability. We may use the system function Probability for evaluating the probability of expressions of a random variable. The function automatically tests each possible value of the random variable to determine the ones that satisfy the inequality. (This works quite well for a discrete random variable with a finite number of observable values.) The relevant probabilities are then summed. This approach is not efficient in cases with infinitely many observable values of a random variable. However, it is straightforward and easy for a student to understand. We evaluate the probability of an expression involving the binomial random variable; the expression is the confidence interval inequality.

Let us define a function that constructs the inequality more explicitly.

Define the function that computes the coverage.
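A sketch of the two definitions, using the built-in Probability function (the symbol names are assumptions):

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
(* the confidence interval inequality as a function of the count y,
   sample size n and true proportion p *)
ineq[y_, n_, p_] := With[{phat = y/n},
  phat - z Sqrt[phat (1 - phat)/n] < p < phat + z Sqrt[phat (1 - phat)/n]]
(* coverage: probability of the inequality under the binomial distribution *)
coverage[n_, p_?NumericQ] :=
 Probability[ineq[y, n, p], y \[Distributed] BinomialDistribution[n, p]]
coverage[10, 0.5]
(* 0.890625 *)
```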

We now plot the coverage probabilities for a range of values of $p$ in Figure 1 below. We also create a horizontal line at a level of 0.95 for comparison purposes. The graph is symmetric due to properties of the binomial distribution and the large sample approximation involved in the confidence interval justification.
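The plot might have been produced along the following lines (a sketch; the plotting options are guesses):

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
coverage[n_, p_?NumericQ] := Probability[
   y/n - z Sqrt[(y/n) (1 - y/n)/n] < p < y/n + z Sqrt[(y/n) (1 - y/n)/n],
   y \[Distributed] BinomialDistribution[n, p]];
(* coverage curve for n = 10 with a reference line at 0.95 *)
Plot[{coverage[10, p], 0.95}, {p, 0.001, 0.999},
 PlotRange -> {0, 1}, AxesLabel -> {"p", "coverage"}]
```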

**Figure 1.** Coverage plot for the first binomial confidence interval, $n = 10$.

Examining Figure 1 reveals several points. First, the coverage probabilities are in general not equal to the nominal level of confidence, namely 0.95. Moreover, coverage probabilities near $p = 0$ and $p = 1$ are effectively zero. Finally, the coverage probability function is discontinuous. All this is obtained with a minimal amount of programming. In fact, the programming statements presented are simply a good description of the algorithm.

More is available. We wish to be able to change the plot by varying the sample size with a slider. A dynamic demonstration can easily be created with the Manipulate function. The Manipulate variable is the sample size $n$, which you can vary with a slider from 5 to 100.
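A sketch of the dynamic version; the initial slider value shown here is an assumption:

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
coverage[n_, p_?NumericQ] := Probability[
   y/n - z Sqrt[(y/n) (1 - y/n)/n] < p < y/n + z Sqrt[(y/n) (1 - y/n)/n],
   y \[Distributed] BinomialDistribution[n, p]];
(* slider for the sample size n, from 5 to 100 *)
Manipulate[
 Plot[{coverage[n, p], 0.95}, {p, 0.001, 0.999}, PlotRange -> {0, 1}],
 {{n, 10}, 5, 100, 1}]
```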

The graph is in Figure 2. The computer processing time increases with the value of the sample size because the inequality must be tested for each possible value of $Y$.

**Figure 2.** Coverage plot as a function of sample size.

A larger sample size improves the coverage probabilities, as one expects; after all, the confidence interval formula is justified by a large sample argument. However, it is very clear that the coverage probability is small when $p$ is close to either 0 or 1, even with the largest sample size on the slider. For some sample sizes it is even more obvious that this function contains discontinuities.

This subsection presents coverage probabilities for an improved confidence interval for a population proportion. The improvement makes coverage probabilities generally larger.

Devore and Berk [3, p. 395] give a better large sample confidence interval for a population proportion. Based on the same assumptions as expression (1), a large sample confidence interval of size $1 - \alpha$ for a population proportion $p$ is given by

$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n} + \dfrac{z_{\alpha/2}^2}{4n^2}}}{1 + z_{\alpha/2}^2/n} \tag{4}$$

This confidence interval is based on solving the following inequality for $p$:

$$-z_{\alpha/2} < \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} < z_{\alpha/2} \tag{5}$$

This defines the new inequality accordingly.

Just as with the previous kind of inequality, define the corresponding coverage function.
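Since interval (4) is obtained by solving inequality (5) for $p$, the coverage can be computed directly from (5); squaring removes the two-sided form (a sketch, with assumed names):

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
(* inequality (5), squared: (phat - p)^2 < z^2 p (1 - p)/n *)
coverage2[n_, p_?NumericQ] := Probability[
   (y/n - p)^2 < z^2 p (1 - p)/n,
   y \[Distributed] BinomialDistribution[n, p]]
coverage2[10, 0.5]
(* 0.978516 *)
```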

Figure 3 is the corresponding plot, again with $n = 10$.

**Figure 3.** Coverage plot for the better confidence interval, $n = 10$.

The inequality in (5) is supposed to have a probability of approximately $1 - \alpha$ before sampling the population. We can of course compute the true probability with respect to the correct binomial distribution. The Mathematica code follows, along with a dynamic graph in Figure 4.

**Figure 4.** Coverage probabilities for the superior asymptotic confidence interval for a population proportion.

Figure 5 contains the code and plot for the dynamic version of the plot. This plot allows for an easy comparison of the coverage probabilities for the two types of confidence intervals.

**Figure 5.** A comparison of coverage probabilities for the two binomial intervals.

The coverage probabilities for this improved confidence interval for a population proportion are indeed superior to the simpler interval. In particular, the coverage probabilities are quite large when $p$ is close to 0 or 1. One can see this even with a sample size of $n = 10$, for which the large sample approximation is not appropriate. The difference in coverage probabilities between the simple interval (displayed in Figure 2) and this improved interval is striking.

We now turn our attention to the Poisson distribution.

The book by Devore and Berk [3, p. 400] presents a homework exercise for determining a confidence interval of size $1 - \alpha$ for the mean of a population described by a Poisson distribution. Let $X_1, \ldots, X_n$, where the $X_i$ are independent and identically distributed with a Poisson distribution with parameter (mean) $\mu$. Ideally, we must solve the inequality

$$-z_{\alpha/2} < \frac{\bar{X} - \mu}{\sqrt{\mu/n}} < z_{\alpha/2} \tag{6}$$

to obtain the desired confidence interval. However, if we have a large enough sample, we may replace the true standard error in the denominator with its estimate. Again, this produces a less than ideal result.

The resulting simple confidence interval of approximate size $1 - \alpha$ for the mean $\mu$ is given by

$$\bar{X} \pm z_{\alpha/2} \sqrt{\frac{\bar{X}}{n}} \tag{7}$$

which has an approximate level of confidence of $1 - \alpha$. The parameter $\mu$ in the denominator was replaced by the sample mean. Figure 6 contains the code and graph for the coverage probabilities. We let $\bar{X} = T/n$, where $T = \sum_{i=1}^{n} X_i$ has a Poisson distribution with mean $n\mu$. In principle, the inequality must be tested for each of the infinitely many possible values of $T$. Coverage probabilities are evaluated at a discrete set of points in order to save computational time.
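A sketch of the computation; since the Poisson support is infinite, the numerical NProbability is the practical choice. The sample size and the grid of $\mu$ values here are assumptions for illustration:

```mathematica
z = Quantile[NormalDistribution[0, 1], 0.975];
n = 30;  (* assumed sample size *)
(* T = X1 + ... + Xn has a Poisson distribution with mean n*mu; xbar = T/n *)
coveragePoisson[mu_?NumericQ] := NProbability[
   With[{xbar = t/n},
    xbar - z Sqrt[xbar/n] < mu < xbar + z Sqrt[xbar/n]],
   t \[Distributed] PoissonDistribution[n mu]]
(* evaluate on a discrete grid of mu values and join the points *)
ListLinePlot[Table[{mu, coveragePoisson[mu]}, {mu, 0.1, 5, 0.1}]]
```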

**Figure 6.** Coverage probabilities for the confidence interval for the Poisson mean $\mu$.

Unless $\mu$ is close to zero, this large sample approximation is quite good, which is easily seen in Figure 6. Given the two approximations used, it is not surprising that the coverage probability is small when $\mu$ is close to zero.

This section addresses the situation of estimating a population proportion when the negative binomial distribution is appropriate.

Let $Y = \sum_{i=1}^{n} W_i$, where the $W_i$ are independent and identically distributed with a geometric distribution with parameter $p$. It is well known that $Y$ has a negative binomial distribution with parameters $n$ and $p$ (see [5], p. 127). Consequently, we use the negative binomial distribution for estimating a population proportion. There are many ways to define the negative binomial distribution. We use the version described in Kinney [5, p. 125]. Conduct independent success/failure trials, each with a probability of success $p$. Let $Y$ be the total number of trials needed to obtain $n$ successes. The probability mass function for $Y$ is given by

$$P(Y = y) = \binom{y-1}{n-1} p^{n} (1-p)^{y-n} \tag{8}$$

where $y = n, n+1, n+2, \ldots$.

Some authors count the number of trials before the $n$th success. Other authors count the number of failures before the $n$th success. There are still other possibilities. Mathematica uses $Z$, the number of failures before the $n$th success. Consequently, $Y = Z + n$.

Casella and Berger [1, p. 496] describe large sample confidence intervals based on maximum likelihood. It is easily shown that the maximum likelihood estimator for $p$ is $\hat{p} = n/Y$. Moreover, the asymptotic variance of this estimator is the reciprocal of the Fisher information, $p^2(1-p)/n$. Fisher information is described in Casella and Berger [1, p. 388]. This variance expression is not useful for creating a confidence interval for $p$ since it depends on $p$. So, we estimate the large sample variance by replacing $p$ with $\hat{p}$. This leads to the large sample confidence interval:

$$\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}^2 (1-\hat{p})}{n}} \tag{9}$$

In order to conveniently perform the calculations, we note that $\hat{p} = n/Y = n/(Z+n)$, where $Z$ is the number of failures before the $n$th success. We evaluate the coverage probability for $p$ in steps of 0.01. The calculation can take some time depending on the computer. When $p$ is small, the individual values of $Z$ (or $Y$) each have extremely small probability. This makes the internal algorithm take quite a while. We can help speed up the calculations by using NProbability rather than the symbolic Probability. The speedup occurs by reducing the required number of digits in calculations. Even so, this calculation takes some time (about four minutes on the author’s computer). A graph of the coverage probabilities is contained in Figure 7.
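A sketch of the coverage computation, using Mathematica's failure-count parametrization; the number of successes used here is an assumption, and the critical value is named zc to avoid clashing with the failure count z:

```mathematica
zc = Quantile[NormalDistribution[0, 1], 0.975];
n = 20;  (* assumed number of successes *)
(* phat = n/Y = n/(z + n), where z counts failures before the n-th success *)
coverageNB[p_?NumericQ] := NProbability[
   With[{phat = n/(z + n)},
    phat - zc Sqrt[phat^2 (1 - phat)/n] < p <
     phat + zc Sqrt[phat^2 (1 - phat)/n]],
   z \[Distributed] NegativeBinomialDistribution[n, p]]
ListLinePlot[Table[{p, coverageNB[p]}, {p, 0.01, 0.99, 0.01}]]
```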

**Figure 7.** Coverage probabilities for the confidence interval for a population proportion based on the negative binomial distribution.

We see from Figure 7 that the approximation is quite good for values of $p$ close to 0.2. We infer that the approximation is also quite good if $p$ is close to 0. The approximation generally gets worse as $p$ increases (though not monotonically). A large sample approximation was used. Also, an approximate standard error was used. One sees that the coverage probability is essentially zero when $p$ is close to 1.

Large sample confidence intervals are often quite easy to derive. This is particularly true when using an estimate for the standard error of an estimator. However, the actual probability of surrounding the parameter value (coverage) can be quite different from the nominal value. It is helpful to graph the coverage probabilities to see this. Mathematica is particularly useful in performing these calculations and providing a language for describing the algorithms.

The author wishes to thank the anonymous reviewer and the editor for their help in improving this article.

[1] G. Casella and R. Berger, *Statistical Inference*, 2nd ed., United States: Brooks/Cole Cengage Learning, 2002.

[2] K. Heiner and S. Wagon. “Wald and Bayesian Confidence Intervals” from the Wolfram Demonstrations Project—A Wolfram Web Resource. www.demonstrations.wolfram.com/WaldAndBayesianConfidenceIntervals.

[3] J. Devore and K. Berk, *Modern Mathematical Statistics with Applications*, 2nd ed., New York: Springer, 2012.

[4] L. D. Brown, T. T. Cai and A. DasGupta, “Confidence Intervals for a Binomial Proportion and Asymptotic Expansions,” *The Annals of Statistics*, 30(1), 2002 pp. 160–201. www.jstor.org/stable/2700007.

[5] J. Kinney, *Probability: An Introduction with Statistical Applications*, New York: John Wiley and Sons, 1997.

P. Cook, “Coverage versus Confidence,” *The Mathematica Journal*, 2021. https://doi.org/10.3888/tmj.23-1.

Peyton Cook earned a B.A. in Psychology, B.S. in Mathematics, and an M.S. and Ph.D. in Statistics. He is an Associate Professor at The University of Tulsa.

**Peyton Cook**

*Department of Mathematics
The University of Tulsa
800 Tucker Drive
Tulsa, Oklahoma 74104
pcook@utulsa.edu*

Structural equation modeling is a very popular statistical technique in the social sciences, as it is very flexible and includes factor analysis, path analysis and others as special cases. While usually done with specialized programs, the same can be achieved in Mathematica, which has the benefit of allowing control of any aspect of the calculation. Moreover, a second, more flexible, approach to calculating these models is described that is conceptually much easier yet potentially more powerful. This second approach is used to describe a solution of the attenuation problem of regression.

Linear structural equation modeling (SEM) is a technique that has found widespread use in many sciences in the last decades. An early foundational work is Bollen [1]; a more recent overview is provided by Hoyle [2]. The basic idea is to model the linear structure of observed variables of cases (observations, subjects) by linear equations that may involve latent variables. These variables are not measured directly but inferred from the observed variables by their linear relation to the observed variables.

Many commercial programs (including LISREL, Amos, Mplus) and free ones (including lavaan, sem, OpenMX) have been developed to carry out the estimation procedure. From my perspective, the R package lavaan [3, 4] by Yves Rosseel is the most reliable and convenient one among the free programs. I use it as the gold standard to judge results of my own code.

This article first gives a quick overview of the standard SEM theory, then shows how to perform the calculations in Mathematica. In the last section, a second approach is discussed.

There is a standard example due to Bollen that is also used in the lavaan manual. The dataset consists of observations of 11 manifest variables $y_1, \ldots, y_8$ and $x_1, x_2, x_3$. SEM models are usually depicted graphically. In the lavaan documentation, this is displayed as in Figure 1.

**Figure 1.** Bollen’s democracy model (image from lavaan documentation [4]).

The variables $x_1$, $x_2$, $x_3$ are observed variables that measure the construct of industrialization in 1960, which is described by the latent variable $\text{ind60}$. This means that the level of industrialization is assumed to be representable by one number for each country, but this number cannot be measured directly; it has to be inferred from its linear relation to gross national product $x_1$, energy consumption per capita $x_2$ and share of industrial workers $x_3$. Next, $\text{dem60}$ and $\text{dem65}$ are the democracy levels in 1960 and 1965, measured by $y_1$, $y_2$, $y_3$, $y_4$ and $y_5$, $y_6$, $y_7$, $y_8$ (these indicators are freedom of the press, etc.). The data matrix consists of these 11 numbers for each of 75 countries (cases). The data is delivered with the lavaan package for R. The aim of estimating the model is twofold. First, the weights of the linear connections (represented in the picture by arrows) are estimated. These arrows encode linear equations by the rule that all arrows that end in a variable indicate a linear combination that yields the value of this variable plus some error term variable. To bring this mysterious language down to earth, here are the equations represented in Figure 1:

$$x_1 = \lambda_1\,\text{ind60} + \delta_1, \quad x_2 = \lambda_2\,\text{ind60} + \delta_2, \quad x_3 = \lambda_3\,\text{ind60} + \delta_3,$$

$$y_1 = \lambda_4\,\text{dem60} + \varepsilon_1, \quad \ldots, \quad y_4 = \lambda_7\,\text{dem60} + \varepsilon_4,$$

$$y_5 = \lambda_8\,\text{dem65} + \varepsilon_5, \quad \ldots, \quad y_8 = \lambda_{11}\,\text{dem65} + \varepsilon_8,$$

$$\text{dem60} = \gamma_1\,\text{ind60} + \zeta_1,$$

$$\text{dem65} = \gamma_2\,\text{ind60} + \beta\,\text{dem60} + \zeta_2.$$

The variable $\text{ind60}$ is called an exogenous latent variable because no arrow ends there. It has no associated error variable. However, its manifest (measured) indicator variables $x_1$, $x_2$, $x_3$ have associated error variables (they are called $\delta$ in [1]). The indicator variables $y_1$, $y_2$, $y_3$, $y_4$ and $y_5$, $y_6$, $y_7$, $y_8$ of the two endogenous latent variables (those latent variables where arrows end) have error variables (called $\varepsilon$ in [1]). The equations that relate latent and manifest variables define the measurement part of the model. The two equations (coming from three arrows) between the latent variables are the structure model, usually of most interest. Fitting the model to the data gives estimates for the weights of the arrows, the $\lambda_i$, $\gamma_1$, $\gamma_2$ and $\beta$. The second goal of SEM modeling is to check how well the structure of the model fits the data; that is, SEM is also a hypothesis-testing method.

The equations given do not yet identify all variables. Assume we have a solution; then for any number $c \neq 0$, multiplying a latent variable by $c$ and dividing the corresponding loadings by $c$ yields a solution, too. To avoid this problem, we either fix the variance of the latent variables to be 1 or we fix some of the weights to be 1. The latter is the default in lavaan and we adopt it here, hence $\lambda_1 = \lambda_4 = \lambda_8 = 1$.

Ever since SEM’s invention, SEM models have been estimated by calculating the model’s covariance matrix. From the data, we get the empirical covariance matrix $S$. On the other hand, from the model, we can calculate a theoretical covariance matrix $\Sigma$ between the observed variables. ($\Sigma$ depends on the model and thus on the parameters.) For example, one entry in this matrix would be the covariance of a pair of observed variables such as $\mathrm{Cov}(x_1, y_1)$. Using linearity and other properties of the covariance, this boils down to a matrix with entries that are polynomials in the model parameters and the covariances and variances between latent variables and error variables. However, without further assumptions, this gives a lot of covariances (e.g. between error variables) that are not determined by the model and hence must be estimated. As this usually leads to too much freedom, the broad assumption is that most error variables are uncorrelated. Only some covariances between error variables are not assumed to be 0; those are marked in the diagram by two-headed arrows between the observed variables. For every pair of observed variables, we calculate the covariance by using the model equations given above as replacement rules and applying linearity and independence assumptions. In the end, we get a covariance matrix that depends on the model parameters and on the variances of the latent variables and the covariances of error variables that are not assumed to be 0. Details can be found in Bollen [1].

To fit the empirical covariance matrix $S$ and the theoretical covariance matrix $\Sigma$, we have to choose these parameters to minimize some distance function. The three most common are uniform least-square, $F_{\mathrm{ULS}} = \tfrac{1}{2}\operatorname{tr}\!\left[(S - \Sigma)^2\right]$; generalized least-square, $F_{\mathrm{GLS}} = \tfrac{1}{2}\operatorname{tr}\!\left[(I - \Sigma S^{-1})^2\right]$ ($I$ is the identity matrix); and maximum likelihood, $F_{\mathrm{ML}} = \log\det\Sigma - \log\det S + \operatorname{tr}(S \Sigma^{-1}) - p$ (here $p$ is the number of manifest variables).
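In Mathematica these three discrepancy functions might be coded as follows (a sketch; the function names are assumptions, with S the empirical and Sigma the model-implied covariance matrix as numeric matrices):

```mathematica
(* uniform least-square: half the trace of the squared difference *)
fULS[S_, Sigma_] := Tr[(S - Sigma).(S - Sigma)]/2
(* generalized least-square *)
fGLS[S_, Sigma_] := With[{id = IdentityMatrix[Length[S]]},
  Tr[(id - Sigma.Inverse[S]).(id - Sigma.Inverse[S])]/2]
(* maximum likelihood discrepancy; Length[S] is the number of manifest variables *)
fML[S_, Sigma_] :=
 Log[Det[Sigma]] - Log[Det[S]] + Tr[S.Inverse[Sigma]] - Length[S]
```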

Now we are in the position to define a Mathematica function that performs SEM. First, we define a helper function that gets all variables contained in an expression, in such a way that, for example, a subscripted symbol counts as one variable.

Here is an example.

The method will be explained with Bollen’s democracy dataset, so first, we need to load this dataset. The file bollen.csv contains headers (the names of the variables are saved in a list) and a first column numbering the cases, which is dropped.

The data has 75 rows.

Here is the first row of 11 numbers.

The model itself has to be specified as a list of replacement rules that mirror the model equations discussed.

The code for the estimation function includes some utilities. For example, it defines its own covariance and variance functions that take into account which variables are assumed to be uncorrelated. The input of the estimation function is the data matrix, a matrix of numerical values with one row per case. The structural equations are given in the format detailed in the previous section, “The Standard Example.” Moreover, the function needs:

• the list of free parameters (e.g. path weights)

• the endogenous latent variables

• the exogenous latent variables

• the list of error variables of latent variables

• the errors of exogenous manifest variables

• the errors of endogenous manifest variables

• a list of pairs of error variables specifying which error variables are allowed to be correlated

The code after the covariance definitions can be omitted on a first reading; it is only needed to calculate some fit indices, if requested by the fit index (FI) option; a similar option controls whether the maximum likelihood estimation is performed. The estimation is done at the end of the function.

The goal of the first half of the program is the definition of the covariance function that takes into account the SEM assumptions: that most error variables are uncorrelated (except those specified to be correlated), leaving variances of latent variables as symbolic entities to be estimated.

This function is then used to calculate the model implied covariance matrix. Applying the model equation rules repeatedly gives a matrix that depends only on parameters, variances of latent variables and error variables, and some allowed covariances of error variables. The code from the line defining the degrees of freedom onward is only important for getting fit indices. If we are only interested in estimating the model parameters, the next interesting lines are where the numerical minimization is applied to estimate the model. As described in the introduction, there are several strategies to measure the deviation of covariance matrices; for example, the uniform least-square objective is a straightforward coding for minimizing the trace of the squared difference of the empirical and model covariance matrices.

Let us run the code on Bollen’s model in a simplified version where no correlation of error variables is assumed. This may take several minutes.

The result combines the parameter, variance and covariance estimations according to the various estimating strategies. To judge how well the model fits the data, you can set the fit-index option to request some fit indices:

• RMSEA is the root-mean-square error of approximation

• CFI is the comparable fit index

• TLI is the Tucker–Lewis fit index

• NFI is the normed fit index

RMSEA should be less than 0.1 (or better, less than 0.05), and the last three should all be greater than 0.9 or 0.95 for good model fit.

The results of estimating using the three different methods differ somewhat. This is not a bug in our program; lavaan determines the same numbers up to several decimal places. There are results in the literature about which methods are equivalent under which conditions. For these fit indices to be interpretable, we need to assume that the data is multivariate normally distributed. If this assumption is violated, then we should judge model fit by other indices, which is beyond the scope of this article; however, they could be calculated based on the current approach as well. The book edited by Hoyle [2] gives some information on these methods.

For the original model that allows some covariances between error variables, the runtime gets worse, especially for maximum likelihood estimation. Hence, this is turned off in the following code.

The results of both models are exactly the same as calculated with lavaan.

When I first learned about SEM, I was puzzled by the many notions (e.g. exogenous, endogenous) and the assumptions needed. For example, I felt that correlation of error variables should be calculated by the estimation algorithm and not be set at will when specifying the model. However, these difficulties seem to play no large role in practice and there are thousands of research papers (mainly) in the social sciences that use these methods with great success. Yet, there are some reasons why the standard approach to SEM via covariance matrices can be criticized (a more detailed discussion is given in [5]). Traditional SEM:

• is well suited only for linear models (there are some nonlinear extensions, but they have not yet become mainstream)

• does not give estimates of the values of latent variables for each case (Bayesian variants can do this)

• requires the covariance matrix of observed data to be nonsingular; however, improving the measurement methods for the indicators of a latent variable may result in highly correlated measures (in the extreme case, identical vectors of measured values), and hence their covariance matrix will be almost singular

• has resulting estimations for parameters that depend a lot on the estimation method used

• forbids certain linear models that are not identified in this approach, even though the model itself is sensible and well defined (e.g. the number of covariances of error variables allowed to be nonzero is limited, although in practice there may be correlations)

You may then wonder why the covariance matrix–based approach is so popular. I suppose that more than 40 years ago, computers were not powerful enough to deal with a full dataset, so the information reduction achieved by calculating the covariance matrix was essential. Since then, many powerful programs have been developed and research has been carried out that gave a good understanding of conditions under which the method works well. Moreover, the psychometric community reached a consensus on how model fit should be judged, and thus studies using this method faced no problem being published.

After this discussion of pros and cons, it is time to present the following case-based approach to SEM estimation, which is very easy (one may even call it naive) to implement, yet very flexible and, with today’s computing power, feasible in many real-world situations.

Hence, I propose to do case-based SEM by least-squares optimization of the defects of the equations. Assume we have observations (cases) of variables , . A general equational model consists of equations , , which involve the data, latent variables , , and parameters . Then the latent variables and the parameters are estimated by minimizing .

Another twist is needed to get the best results, however. The objective function above gives all equations the same weight, but it turned out (by working with simulated data, where it is clear which parameters should be found) that we get better results by multiplying by a factor that gives the equations different weights, that is, . The factor can be modified by an option in the code that follows. Best results are obtained for , where is the number of latent variables in . The idea behind this choice is that an equation involving only one latent variable links that variable directly to the manifest data and thus should have a high weight. In contrast, equations with many latent variables are not so close to the manifest observations and are thus more hypothetical, so they should have a lower weight.
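The case-based idea can be made concrete with a deliberately tiny sketch. The following Python code (not the article’s Mathematica function SEM2; the one-latent-variable model, the fixed loading of 1 and all names are illustrative assumptions) estimates a single latent variable with two indicator equations by alternating least squares on the summed squared defects:

```python
import random

# Simulate: a latent variable t drives two indicators.  The loading of
# y1 is fixed to 1 to anchor the latent scale; the loading b of y2
# (true value 2) is to be estimated.
rng = random.Random(0)
n = 200
t_true = [rng.uniform(-1, 1) for _ in range(n)]
y1 = [t + rng.gauss(0, 0.1) for t in t_true]
y2 = [2.0 * t + rng.gauss(0, 0.1) for t in t_true]

# Case-based least squares: minimize
#   sum_i (y1_i - t_i)^2 + (y2_i - b*t_i)^2
# jointly over the latent values t_i and the parameter b, by alternating
# between the two closed-form partial minimizations.
b = 1.0
for _ in range(100):
    # optimal latent values for fixed b
    t = [(u + b * v) / (1 + b * b) for u, v in zip(y1, y2)]
    # optimal loading for fixed latent values
    b = sum(ti * vi for ti, vi in zip(t, y2)) / sum(ti * ti for ti in t)

print(round(b, 2))  # close to the true loading 2
```

The alternation has the fixed point b = 2 for noise-free data and converges to a nearby value here; it also yields estimates of the latent values themselves, one of the advantages of the case-based approach noted above.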

The model equations are not formulated as rules as for the first SEM, but as equations with the name of the error variable attached to each equation. Moreover, the dataset is not normalized, so there are nonzero intercepts in the linear equations. In the first approach this had no consequences, because such additive values are eliminated by calculating the covariance matrix, but in the SEM2 approach, intercepts must be modeled explicitly (and we have the benefit of getting estimates for them as well).

The function SEM2 that carries out the model estimation takes as input and the names of the manifest () and latent variables (). At the technical heart of the function is the subroutine . This function takes an equation involving latent variables (e.g. ) and adds to the objective function the appropriate term for each case (i.e. with values from the data replacing the names of manifest variables):

There is one option.

This code estimates Bollen’s model.

As mentioned, there is a version that weights equations according to the number of latent variables they have.

The results for the estimates differ from what is calculated in the traditional covariance matrix–based approach given for . A simulation study that compares the two approaches [5] showed that in many situations the case-based approach gives better results, especially when the assumption of independent errors is violated. Moreover, the case-based approach is easily applied to nonlinear equations. However, in certain situations it may be necessary to perform the minimization with higher accuracy than provided by standard hardware floating-point numbers.

In standard linear regression , one assumes that the independent variables are measured exactly, while the dependent variable has an error that is ideally normally distributed. If the independent variables are measured with error too, standard linear regression underestimates the regression coefficient. This is the famous attenuation problem and I will show how to solve it. Let us first simulate a dataset with error on both variables.

Then linear regression underestimates the slope, which should be 0.5.
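The attenuation effect itself is easy to reproduce; here is a hypothetical Python simulation (the article’s own code is in Mathematica; the sample sizes and error variances below are invented for illustration) with true slope 0.5 and measurement error on the regressor:

```python
import random

rng = random.Random(1)
n = 1000
x_true = [rng.gauss(0, 1) for _ in range(n)]
# both variables are measured with error
x_obs = [x + rng.gauss(0, 1) for x in x_true]
y_obs = [0.5 * x + rng.gauss(0, 0.2) for x in x_true]

# ordinary least-squares slope of y on the error-laden x
mx = sum(x_obs) / n
my = sum(y_obs) / n
sxy = sum((x - mx) * (y - my) for x, y in zip(x_obs, y_obs))
sxx = sum((x - mx) ** 2 for x in x_obs)
slope = sxy / sxx

# attenuation: the expected OLS slope is
#   0.5 * var(x_true) / (var(x_true) + var(error)) = 0.25,
# half the true slope.
print(round(slope, 2))
```

With these variances the estimated slope comes out near 0.25 rather than 0.5, which is the bias that the latent-variable model corrects.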

When using case-based modeling, several strategies are possible. We may use one or two latent variables for the true values. As the true dependent variable is just , the following code uses just one latent variable. Another twist is that the equations are divided by the empirical standard deviations to put them on an equal footing.

This example shows both the power of this method and the responsibility of the modeler to set up sensible equations. If we are sure that the errors are uncorrelated, we may add as another constraint to further improve the estimate. This may also be done automatically with an extended version of SEM2, which will be published when its development is completed.

Two methods for the estimation of structural equation models are presented. One uses the traditional covariance matrix–based approach and is therefore restricted to linear equations, while the other approach is more general but not yet established in practice. Estimating the models is rather easy in Mathematica, but the numerical problems that arise can be demanding. The new case-based approach is very flexible and promising in certain situations where the standard approach shows limitations.

Case-based calculation of SEM looks very promising given the numerical power of today’s computers and might give insight into situations where the restrictions of the traditional approach urge researchers into making assumptions that may not be warranted.

It is my pleasure to thank Ed Merkle and Yves Rosseel for many explanations of SEM.

[1] K. A. Bollen, Structural Equations with Latent Variables, New York: Wiley, 1989.

[2] R. H. Hoyle (ed.), Handbook of Structural Equation Modeling, New York: Guilford Press, 2012.

[3] K. Gana and G. Broc, Structural Equation Modeling with lavaan, Hoboken: John Wiley & Sons, 2019.

[4] Y. Rosseel. “lavaan.” (Aug 25, 2019) https://lavaan.ugent.be.

[5] R. Oldenburg, “Case-based vs. Covariance-based SEM,” forthcoming.

R. Oldenburg, “Structural Equation Modeling,” The Mathematica Journal, 2020. https://doi.org/10.3888/tmj.22-5.

Reinhard Oldenburg has studied physics and mathematics and received a PhD in algebra. He has been a high-school teacher and now holds a professorship in Mathematics Education at Augsburg University. His research interests are computer algebra, the logic of elementary algebra and real-world applications.

**Reinhard Oldenburg**

*Augsburg University
Mathematics Department
Universitätsstraße 14
86159 Augsburg, Germany*

A method of generating minimally unsatisfiable conjunctive normal forms is introduced. A conjunctive normal form (CNF) is minimally unsatisfiable if it is unsatisfiable and such that removing any one of its clauses results in a satisfiable CNF.

Ivor Spence [1] introduced a method for producing small unsatisfiable formulas of propositional logic that were difficult to solve by most SAT solvers at the time, which we believe was because they were usually minimally (i.e. just barely) unsatisfiable. Kullmann and Zhao [2] claim that minimally unsatisfiable formulas are “the hardest examples for proof systems.” We will generalize Spence’s construction and show that it can be used to generate minimally unsatisfiable propositional formulas in conjunctive normal form, that is, formulas that are unsatisfiable but such that the removal of even a single clause produces a satisfiable formula. In addition to increasing our understanding of the satisfiability problem, these formulas have important connections to other combinatorial problems [3].

We assume the reader has at least a minimal acquaintance with propositional logic and truth tables. An *interpretation* of a propositional formula is an assignment of truth values to its propositional variables. A propositional formula is *satisfiable* if there is an interpretation that makes it true when evaluated using the usual truth table rules. A *literal* is a propositional variable or a negated propositional variable. A *clause* is a disjunction of literals; if it contains exactly literals, we call it a *-clause*. A *conjunctive normal form* (or *CNF*) is a conjunction of clauses. A -CNF is a conjunction of -clauses.

For example, is a 3-CNF. It is often convenient to think of CNFs as a list of lists of literals; in this format, the 3-CNF example would be written as . This way of writing CNFs is quite common in computer science and is the approach that we took in [4], where we showed how the famous Davis–Putnam algorithm for satisfiability testing could be easily programmed in Mathematica. , Mathematica’s built-in function for satisfiability testing, requires replacing “¬” by “!”, “∨” by “||” and “∧” by “&&”; so in Mathematica the 3-CNF example is written as . In this article, we adopt the “list of lists” approach for programming purposes and then convert to Mathematica’s format when testing for satisfiability with .
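Outside Mathematica, the list-of-lists format is often realized with signed integers, as in the DIMACS convention. Here is a hypothetical Python sketch of that representation (the example clauses are invented) together with a brute-force satisfiability test:

```python
from itertools import product

# a 3-CNF as a list of clauses; the literal v stands for the variable v
# and the literal -v for its negation
cnf = [[1, 2, 3], [-1, -2, 3], [1, -2, -3], [-1, 2, -3]]

def satisfiable(cnf):
    """Brute-force test: try every truth assignment of the variables."""
    vars_ = sorted({abs(l) for clause in cnf for l in clause})
    for bits in product([False, True], repeat=len(vars_)):
        val = dict(zip(vars_, bits))
        # a clause is true if some literal is true; the CNF needs all clauses
        if all(any(val[abs(l)] == (l > 0) for l in clause) for clause in cnf):
            return True
    return False

print(satisfiable(cnf))  # True
```

Brute force is exponential in the number of variables, of course, which is exactly why the short-but-hard formulas of the next section are interesting.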

In this section, we show how to generalize a construction of Ivor Spence [1] that produces unsatisfiable 3-CNFs that are relatively short but take standard computer programs a relatively long time to verify as unsatisfiable, even though, as we shall show, it is relatively easy for a human to demonstrate that they are unsatisfiable. (Perhaps humans are not replaceable by computers, after all!)

Given positive integers and , suppose the propositional variables , , …, are partitioned in order into sets of size and one set of size . For each cell of the partition, form all -clauses from the ’s in that cell and let be the conjunction of all these -clauses. If is to be true under an interpretation , no more than of the -variables from each partition cell can be false, since if were false, the -clause containing exactly these -variables would be false, as would their conjunction . Thus no more than of the -variables can be false under .

Next let , , …, be a random permutation of the ’s and partition these ’s just as the ’s were partitioned. However, this time, for each cell of the partition, form all -clauses from the *negated* -variables in that cell and let be the conjunction of all these -clauses. Reasoning as before, no more than -variables can be true under .

Let . If is to be true under some interpretation , both and must be true under ; thus no more than -variables can be false and no more than -variables can be true under . Since the -variables are permuted -variables, it follows that no more than -variables can be true under . However, , the number of -variables in ! Thus is an unsatisfiable CNF, because there is no interpretation of all its -variables that makes true.

Suppose next that we drop one of the clauses in , say for example, ; let and let . Let be an interpretation that assigns false to , , …, and true to the remaining variables in the first -cell. As long as no more than -variables in each of the remaining cells of the partition of the ’s are assigned the value false, would be true under . Whether or not and hence are true under depends on whether also has the property that at most -variables in each cell of the partition of the randomly permuted variables (the ’s) are assigned the value true under . While this is unlikely for any given interpretation , there are so many interpretations satisfying that it is most likely that some such interpretation has this property and the reduced CNF will then be true under .

We will investigate this intuitive argument. For to be true, the -variables , , …, can now be false in the first cell, as long as each of the remaining cells in the -partition has at most false variables; thus -variables can be false and, as above, -variables can be true. However, , and the argument showing to be unsatisfiable cannot be applied to .

First we allow for different choices for the parameters and . Initially we set and . The next several steps serve to introduce the variables and partition them into cells.

Define the partition of the -variables.

Here is the -partition for our example.

Next we generate, negate and partition the -variables.

We join and and form all -sets from the result.

puts these steps together. The argument is a permutation of .

For the experiments that follow, leaving out the third argument uses a random permutation.

Because of , the negated pieces can change from one run to the next.

The function transforms a list of clauses into an expression that allows for satisfiability testing, as described in the section Definitions.

Equivalently, here is a longer form.

To test that a CNF is minimally unsatisfiable, we must show both that it is unsatisfiable and that the removal of any one clause always results in a satisfiable CNF. tests the satisfiability of a CNF in list form by converting it to a logical expression with and applying the built-in function ; then it tests if all the formulas with one clause deleted are indeed satisfiable.
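The test just described can be sketched in Python as well (a brute-force stand-in for the article’s Mathematica functions, with illustrative names; the satisfiability helper is redefined so the snippet is self-contained):

```python
from itertools import product

def satisfiable(cnf):
    """Brute-force satisfiability test for a CNF in list-of-lists form."""
    vars_ = sorted({abs(l) for clause in cnf for l in clause})
    for bits in product([False, True], repeat=len(vars_)):
        val = dict(zip(vars_, bits))
        if all(any(val[abs(l)] == (l > 0) for l in clause) for clause in cnf):
            return True
    return False

def minimally_unsatisfiable(cnf):
    """Unsatisfiable, but deleting any single clause makes it satisfiable."""
    if satisfiable(cnf):
        return False
    return all(satisfiable(cnf[:i] + cnf[i + 1:]) for i in range(len(cnf)))

# the smallest example: one variable both asserted and denied
print(minimally_unsatisfiable([[1], [-1]]))  # True
```

Each of the clause-deleted tests is itself a full satisfiability check, so this is only practical for small formulas; the point is the logic of the definition, not efficiency.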

This tests whether C3 is minimally unsatisfiable.

Most of the unsatisfiable formulas generated in this way are, as we shall see, minimally unsatisfiable, but not always. However, we conjecture that the bigger we take and , the more likely we are to get a minimally unsatisfiable formula. We have done some experiments on this conjecture and discuss them in the next section.

If we remove two different clauses instead of one from our unsatisfiable formulas, the result is almost always a satisfiable formula. We define the function and experiment with it in the next section.

We run some experiments to investigate the frequency of minimally unsatisfiable CNFs obtained with our Spence generalizations.

The table below summarizes some larger computer experiments we have conducted to test our conjecture that most of the formulas constructed by the above methods are minimally unsatisfiable. The third column, “”, stands for the number of clauses in the CNF defined by `SpenceCNF@SpenceList[k,g]`. The fourth column gives the percentage of these formulas that were minimally unsatisfiable.

Each line in the table represents results on 500 formulas. For example, the first line constructs 500 3-CNFs, based on variables and clauses in each. It turned out that 62% were minimally unsatisfiable.

Next we look at what happens if we remove two different clauses from our Spence formulas. With , , there are 92 clauses in the Spence CNF; hence distinct ways to remove two different clauses. We count the number of times the resulting CNF is satisfiable in 100 trials.

In this trial, we are very close to all Spence CNFs becoming satisfiable after removing two different clauses. We believe this is true in general.

In this section we modify our construction to generate only derangements of the -variables. A *derangement* is a permutation that leaves no number in its original position. There are three resource functions that deal with derangements.

It is known that derangements constitute slightly more than a third of the permutations (see [5]), as the following calculation illustrates.
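The ratio can also be checked with the subfactorial recurrence !n = (n−1)(!(n−1) + !(n−2)); a hypothetical Python version of the calculation:

```python
from math import factorial, e

def subfactorial(n):
    """Number of derangements of n elements, via the standard recurrence
    !n = (n - 1) * (!(n - 1) + !(n - 2)), with !0 = 1 and !1 = 0."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    a, b = 1, 0  # !0 and !1
    for k in range(2, n + 1):
        a, b = b, (k - 1) * (a + b)
    return b

ratio = subfactorial(10) / factorial(10)
print(round(ratio, 4))  # 0.3679, essentially 1/e
```

Already at n = 10 the ratio agrees with 1/e ≈ 0.3679 to many decimal places, confirming that derangements make up slightly more than a third of all permutations.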

We define the function that does “derangement” experiments.

On the basis of these experiments we conjecture that the derangements are about as likely to produce minimally unsatisfiable formulas as permutations in general.

We have adapted a method of I. Spence [1] to easily obtain large numbers of unsatisfiable CNFs that are usually but not always minimally unsatisfiable. We also ran some experiments to indicate what percentages would be minimally unsatisfiable. In addition, our experiments suggest that if two different clauses are removed rather than one, the resulting formula will almost always be satisfiable. Finally we restricted the random permutations in our construction by requiring them to be derangements and saw that this gave similar percentages of minimally unsatisfiable formulas.

I am grateful to the referee for his advice and to the editor, George Beck, for greatly improving my Mathematica coding throughout the paper.

[1] I. Spence, “sgen1: A Generator of Small but Difficult Satisfiability Benchmarks,” ACM Journal of Experimental Algorithmics, 15, 2010 pp. 1.1–1.15. doi:10.1145/1671970.1671972.

[2] O. Kullmann and X. Zhao, “On Davis–Putnam Reductions for Minimally Unsatisfiable Clause-Sets,” Theoretical Computer Science, 492, 2013 pp. 70–87. doi:10.1007/978-3-642-31612-8_21.

[3] R. Aharoni and N. Linial, “Minimal Non-Two-Colorable Hypergraphs and Minimal Unsatisfiable Formulas,” Journal of Combinatorial Theory, Series A, 43(2), 1986 pp. 196–204. doi:10.1016/0097-3165(86)90060-9.

[4] R. Cowen, M. Huq and W. MacDonald, “Implementing the Davis–Putnam Algorithm in Mathematica,” Mathematica in Education and Research, 10, 2005 pp. 46–55. www.researchgate.net/publication/246429822_Implementing_the_Davis-Putnam_Algorithm_in_Mathematica.

[5] Wikipedia. “Derangement.” (Jul 10, 2010) en.wikipedia.org/wiki/Derangement#Limit_of_ratio_of_derangement_to_permutation_as_n_approaches_∞.

R. Cowen, “Generating Minimally Unsatisfiable Conjunctive Normal Forms,” The Mathematica Journal, 2020. https://doi.org/10.3888/tmj.22-4.

Robert Cowen is a Professor Emeritus at Queens College, CUNY. His main research interests are logic and combinatorics. He has enjoyed teaching students how to use Mathematica to do research in mathematics for many years.

**Robert Cowen**

*16422 75th Avenue
Fresh Meadows, NY 11366*

Given a rationally parameterized curve in or , where the and are polynomials, we find the dimension of the smallest linear subset of containing the curve. If all the and are of degree or less, then it is known abstractly that this dimension is or less and *rational normal curves* play a key role in the argument. We consider this from a computational point of view with playing an essential part in the discussion.

The ancients were confused about the concepts of degree and dimension. As late as 1545, in his famous book *Ars Magna* [1], Cardano, who did not hesitate to invent imaginary numbers, gives the following disclaimer in reference to his assistant Ferrari’s solution of the quartic:

Although a long series of rules might be added and a long discourse given about them, we conclude our detailed consideration with the cubic, others being merely mentioned, even if generally, in passing. For as the first power refers to a line, the square to a surface, and the cube to a solid body, it would be very foolish to go beyond this point. Nature does not permit it.

The distinction between degree and dimension was later resolved by Descartes’s algebraic notation. But, in the context of parametric curves, I recently noticed a simple linear algebra proof of the following theorem:

**Theorem A**

Let , be a curve in or where the coordinate functions are polynomials of degree or less. Then for any , the curve lies in a linear subset of or of dimension .

This theorem, as well as many of the other facts in this article, is given in Joe Harris’s book [2] from a projective geometry point of view. He also considers the degree versus dimension issue in a number of other situations. We give the linear algebra proof in Section 2.

Unfortunately, projective geometry is not computationally friendly. Instead we can view these results from an affine point of view using the built-in function [3], which we discuss in Section 3.

We then generalize and rephrase our result in Section 4 as Theorem B. The generalization is to rational curves and we can give the dimensions of the smallest linear space containing the curve. Theorem B does clarify that, while the degree bounds the size of a linear set, the curve may lie in a smaller dimensional linear set.

In Section 5 we observe that the *rational normal curve* in or , , is universal for rational curves. That is, every rational curve is a transform of a normal curve. This is very easily seen via the . This lets us rephrase Theorem B in another useful form, where the can be found directly from the expression of a rational function in the form

where and the common denominator are all polynomials of degree or less written in descending degree. To simplify notation we generally work with coefficients in the real numbers , but it should be understood that one could work in any subfield of the complex numbers as well. But, as immediately below, in some cases we must consider parameter values in the algebraic closure of the subfield.

Sections 5 and 6 give two applications.

The first discusses the recognition problem: given a point , is for some ? This is equivalent to the well-studied problem of finding a common solution of a family of univariate polynomials, which we do not consider here. We show that modulo the linear , the recognition problem can often be solved in a linear space of smaller dimension.

The second example is the implicitization problem for rational functions, which is to find an implicit system that describes the ideal of the rational curve. We only sketch this, as there is no room to carefully describe the routines in [4].

In fact, this article was motivated by the author’s work on implicitization of parametric curves. I noticed that an unexpectedly large number of linear equations appeared in the implicit systems.

In this article a *linear subset of * is a set defined by a system of linear equations, not necessarily homogeneous. A linear subset is distinguished from a *linear subspace*, which is a subspace of the vector space and defined with homogeneous equations. The big difference is that a subspace contains the origin . A linear subset is a coset of a linear subspace under the operation of vector addition.

A polynomial parametric curve is a function where each coordinate function is a polynomial that we write in descending degree:

where is the *degree* of the coordinate polynomial. The largest such degree is the degree of the parameterization. Note that . This constant acts merely as a basepoint; a different basepoint gives a curve that is a translation of the first. Thus the basepoint does not affect the geometry. We say our parameterization is *stripped* if (or alternatively if each ). Each polynomial parameterized curve is then a translate of a stripped curve, so we first consider those. We *strip* a polynomial parameterized curve by dropping all the constant terms.

We now create a stripped coefficient matrix from the stripped polynomial. If is the degree of the polynomial, is the matrix with rows . Consider the following equation where points are column vectors.

This shows that every point on the parameterized curve is in the vector space spanned by the columns of the coefficient matrix. So Theorem A is true for a stripped parameterization, but adding back the constant simply moves this subspace to a linear subset.
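The column-space argument is easy to verify numerically. Here is a small Python check (the curve and all numbers are invented for illustration): a stripped degree-two curve whose coefficient matrix has rank two, so the curve lies in a plane through the origin.

```python
# A stripped degree-2 curve in R^3: f(t) = (t + t^2, 2t - t^2, 3t + 3t^2).
# Its stripped coefficient matrix has columns c1 = (1, 2, 3) (coefficients
# of t) and c2 = (1, -1, 3) (coefficients of t^2), so every point
# f(t) = t*c1 + t^2*c2 lies in the plane spanned by c1 and c2.  The cross
# product c1 x c2 = (9, 0, -3) gives a normal, i.e. the plane 3x - z = 0.

def f(t):
    return (t + t**2, 2 * t - t**2, 3 * t + 3 * t**2)

normal = (3, 0, -1)
for t in [-2, -1, 0, 0.5, 1, 3]:
    point = f(t)
    assert abs(sum(n * c for n, c in zip(normal, point))) < 1e-9

print("every sample point satisfies 3x - z = 0")
```

Because the coefficient matrix has rank two rather than three, the curve sits in a two-dimensional linear subspace, exactly the phenomenon Theorem A bounds.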

To describe the smallest linear set containing a finite set of points in terms of a system of equations, here is a short routine.

A longer version of this with error detecting is in [4].

Example 1:

We know a linear set containing this curve must be of dimension no greater than three, since this set is contained in , so it is generated as a linear set by four or fewer points. Therefore it is enough to take four random points on this curve and calculate the smallest linear set containing them.

Here are the four random points.

Here is the linear expression for the linear set.

A linear set defined by one linear equation in three variables is of dimension two. This curve lies in the linear set defined by setting the linear expression to zero.

The central concept in this article is the built-in Wolfram Language function . When we say *transformation function*, we mean a function given by . Basically these are affine versions of *projective linear* transformations, which can include translations along with the usual transformations of linear algebra. They appeared in Lecture 2 of Abhyankar [5] and much of the author’s work [4, 6] as *fractional linear transformations*; they are also known in the literature as *linear fractional transformations*. Our major use of these transformations is to be able to access projective geometry, where points are cosets of -tuples, while working in affine geometry, where points are merely -tuples, which are easy to manipulate computationally.

A transformation function can be described by an matrix. The matrix of the associated projective linear transformation is called the *transformation matrix* in the Wolfram Language. Thus the of an matrix takes an affine -tuple, appends 1 to represent this in projective -space, applies the projective linear transformation defined by and then specializes by dividing by the component.
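The recipe in this paragraph translates directly into code. Here is a hypothetical Python version (`transform` is an illustrative name, not the Wolfram Language function): append 1, multiply by the matrix, divide by the last component.

```python
def transform(matrix, point):
    """Apply an (n+1) x (n+1) transformation matrix to an affine n-tuple:
    append a 1, multiply by the matrix, then divide through by the last
    component of the result (the projective specialization)."""
    v = list(point) + [1]
    image = [sum(a * b for a, b in zip(row, v)) for row in matrix]
    w = image[-1]
    if w == 0:
        raise ValueError("point is not in the domain of the transformation")
    return tuple(c / w for c in image[:-1])

# an affine example: last row (0, 0, 1), so just a translation by (5, 0)
print(transform([[1, 0, 5], [0, 1, 0], [0, 0, 1]], (1, 2)))  # (6.0, 2.0)

# a projective example: last row (1, 0, 1) excludes the line x = -1
print(transform([[1, 0, 0], [0, 1, 0], [1, 0, 1]], (1, 2)))  # (0.5, 1.0)
```

The two examples mirror the affine/projective distinction drawn below: the affine matrix is defined everywhere, while the projective one has a forbidden hyperplane where the last component vanishes.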

Here is an example.

These transformations in the special case are discussed in detail in Chapter 6 of my book [4].

A transformation function is *affine* if the last row is ; the denominators are always 1, the upper-left submatrix gives a linear transformation and the first entries of the last column describe a translation.

In particular, the domain of an affine transformation is all of . Otherwise we call the transformation function *projective*. If the last row of the transformation matrix is , then the hyperplane of given by is not in the domain of the transformation function. In the context of an affine transformation, it is understood that the equation defines the empty set.

In this article we assume that a rational parametric curve has coordinates that are quotients of two polynomials in . We insist that the parametric curve be given with a common denominator , so, for example, is of the form

(1) |

for polynomials . The degrees of may be greater than, equal to or less than the degree of . In particular, could be the constant polynomial 1, in which case is a polynomial curve that we can treat as a special case of a rational curve. The degree of is the largest degree of .

The advantage of writing polynomials in the parameter in descending degree is that writing a transformation matrix for a rational function is easy. Suppose in equation (1) that for , where we write . Then the transformation matrix for is

(2) |

Example 2.

Both [2] and [5] mention the fact that every rationally parameterized curve is a projective transformation applied to a polynomially parameterized curve. In particular, [2] notes that this polynomial curve can be the rational normal curve of degree

Before we state Theorem B, we note that every linear transformation can be factored into a projection on some coordinates followed by an embedding. This is accomplished in a special way using Mathematica by the following matrix reduction algorithm we call . This takes an matrix of rank and outputs an matrix and an matrix consisting of rows of such that . This implies that rows of are what the Wolfram Language calls ; that is, contains an identity matrix as a submatrix.

In the code, the functions and defined in the statements invert the lists and viewed as functions from their index sets. The tests whether is in the domain of .
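The factorization described here can be sketched in Python with exact rational arithmetic (the article’s own Mathematica routine is not shown; `project_split`, `express` and the details below are illustrative assumptions). Given a rank- matrix A, it returns E and B with A = E·B, where B consists of linearly independent rows of A and E contains the corresponding identity rows.

```python
from fractions import Fraction

def express(row, B):
    """Solve x . B == row for x; return None if row is outside span(B)."""
    r, n = len(B), len(row)
    if r == 0:
        return [] if not any(row) else None
    # one equation per column c:  sum_j x_j * B[j][c] = row[c]
    aug = [[B[j][c] for j in range(r)] + [row[c]] for c in range(n)]
    rr = 0
    for col in range(r):
        piv = next((k for k in range(rr, n) if aug[k][col] != 0), None)
        if piv is None:
            continue  # cannot happen while B's rows stay independent
        aug[rr], aug[piv] = aug[piv], aug[rr]
        pv = aug[rr][col]
        aug[rr] = [v / pv for v in aug[rr]]
        for k in range(n):
            if k != rr and aug[k][col] != 0:
                f = aug[k][col]
                aug[k] = [a - f * b for a, b in zip(aug[k], aug[rr])]
        rr += 1
    if any(aug[k][r] != 0 for k in range(rr, n)):
        return None  # inconsistent: row is not in the row space of B
    return [aug[j][r] for j in range(r)]

def project_split(A):
    """Factor A = E . B with B made of linearly independent rows of A and
    E containing identity rows at the positions of those basis rows."""
    A = [[Fraction(x) for x in row] for row in A]
    basis, B, E = [], [], []
    for i, row in enumerate(A):
        coeffs = express(row, B)
        if coeffs is None:        # this row enlarges the row space
            basis.append(i)
            B.append(row)
            for e in E:           # pad earlier rows for the new basis vector
                e.append(Fraction(0))
            coeffs = [Fraction(0)] * (len(B) - 1) + [Fraction(1)]
        E.append(coeffs)
    return E, B, basis

A = [[1, 2], [2, 4], [0, 1]]
E, B, basis = project_split(A)
print(basis)  # [0, 2]: rows 0 and 2 of A form a basis of its row space
```

Rows 0 and 2 of E are identity rows, so E restricted to the basis positions is the identity, which is the property the theorem below exploits.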

We can now state and prove our main theorem; we write . It may seem counterintuitive that we can strip the constant off the denominator, in particular for polynomially parameterized curves (so stripping it gives ). But projectively the denominator is just another coordinate so we can still do that. So if is the matrix from the previous section and , where is the of , then the *projective stripped coefficient matrix* of is just the submatrix of with the last column removed.

**Theorem B**

Let be a parametric curve in of degree . Suppose the projective stripped coefficient matrix of has rank . Then there are components of defining a stripped polynomial parametric curve in and a transformation function taking to .

**Proof**

We apply the algorithm to the projective stripped coefficient matrix of , obtaining a list of rows forming a basis of the row space of and a matrix of size , where the rows corresponding to this basis are replaced by rows of the identity matrix. Multiplying by the vector gives the parametric function . Appending a last column to with the constant terms of the original gives a transformation matrix . By the above comments it is easy to see that the defined by takes to .

One can paraphrase this theorem as: *Given a parametric curve of degree , there is an , a stripped parametric polynomial curve in and a so that the following diagram commutes*.

We ask the reader not to take this diagram literally in the case of a rational parameterization, as the domains of , , may not be the full spaces indicated. But if is a polynomial parameterization, then the domains are the full spaces and is an embedding.

Example 3: We illustrate this proof by fully working out the following degree-two curve in .

The decomposition can be easily done by hand.

So we add the constant row; remember that the constant in the last row is 1.

Theorem B tells us the composition of and is .

So this curve is contained in a plane.

Example 2 (continued): We now consider the rational parameterization of example 2.

We check that this lies in a two-dimensional plane in .

The step in the proof of Theorem B where we use to obtain the curve from the rational normal curve can also be done by using an affine transformation function obtained by adding the row and column to . In Example 2 we have the following.

This gives:

**Theorem C**

Let be a rational (or polynomial) curve parameterization of degree . Suppose the projective stripped coefficient matrix of has rank . Then the transformation function in Theorem B can be decomposed into transformation functions as in the following diagram.

Here is an affine transformation function of onto and is a possibly projective transformation of into . In particular, the parametric curve given by lies in a linear subset of of dimension less than or equal to the minimum of , , .

**Construction**

As in Theorem B, we let be the projective stripped matrix of and apply to to get of sizes and , respectively. Appending a row of zeros and then a column of zeros with last component 1 to make into an affine transformation matrix of size , let be the of . Appending the column of constants to , we get a transformation matrix of size . Then is the of . One can check that .

This recovers the known result [2] that every rational parameterization is a projective linear transformation of the rational normal curve, but here we have a constructive approach.

Example 4: For an easy but nontrivial (i.e. not conic) example we use the *piriform* [7].

Here , . Here is the stripped projective matrix.

A trivial application of in that is of full rank gives the following.

Notice here that , and . In this case, the curve lies in , a two-dimensional space. The numbers , , are important values in describing a rational parameterized curve. Even though the transformation matrix for contains the identity matrix, it is not injective, which is typical in the case of a rational parameterization, even when , but this does not occur for a polynomial parameterization.

The recognition problem is: *given a parameterized curve ** and a point ** in *,* is ** in the curve*; that is, does there exist with ?

There are two obvious methods to solve this problem. The first is to directly solve the over-determined system using . This works surprisingly well, failing mostly with poorly conditioned systems for which the other methods following may not work well either. The biggest problem with this approach is that when it does not work, it gives a false negative to the recognition problem. One can, of course, solve component by component and see if any solutions are numerically close.

Example 2 continued.

So the first point is on the curve but the second point is not. In general, finding a common zero of a set of polynomial or rational equations is an interesting problem, but we do not consider that here.

The second method is to find a system of equations whose solution set is the Zariski closure of the point set . All that then needs to be done, in principle, is to evaluate this system at and check that the value is 0. We consider this issue in Section 5.

As we have seen, a parameterized curve in may lie in a linear subset of dimension less than . Using Theorem C and the algorithm, we can get some additional information about the problem and perhaps reduce this to a problem in a smaller .

Example 5.

We would like to find out which, if any, of the following points are on this curve.

We first find the transformation functions. Here is the projective stripped coefficient matrix.

Apply .

Augment these matrices to get transformation matrices.

Generate some random points.

This says is not contained in any proper subspace, but the image of lies in a three-dimensional subspace of .

The points and do lie in the image of , so may be points on the curve, but we can eliminate . We find the fibers (preimage) of and in .

These conveniently are singleton points. Thus we have reduced this rational recognition problem in to a polynomial recognition problem in .

So is on the curve, but is not.

As mentioned, the motivation for this article is my work on implicitization of rational parametric space curves. In this section I only sketch my algorithms; details are in [4]. The key here is that by the material discussed, especially Theorem C, every such curve is simply a fractional linear transformation of the rational normal curve.

By *implicitization* I mean describing these parametric curves by way of algebraic equations. A problem that arises is that while one expects a curve in to be given by equations, this is often not enough to describe the curve fully, either pointwise or algebraically. The standard counterexample is the *twisted cubic*, which is just the rational normal curve of degree three, . A system of three equations in the variables x, y, z describing the twisted cubic, given in [2], is x z - y^2, z - x y, y - x^2. An exercise in [2] is to show that the zero set of any pair of these three equations contains not only the twisted cubic but also a line; note that the extra line in the last pair lies in the infinite plane of projective three-space.
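As a quick sanity check, these quadrics can be verified numerically. The Python sketch below assumes the standard affine forms x z − y², z − x y and y − x² from Harris's exercise and confirms that they vanish at every point (t, t², t³):

```python
# The three quadrics cutting out the affine twisted cubic (assumed
# standard forms): x*z - y^2, z - x*y, y - x^2.
def quadrics(x, y, z):
    return (x * z - y**2, z - x * y, y - x**2)

# Every point (t, t^2, t^3) must make all three vanish.
for t in (-2.0, -0.5, 0.0, 1.0, 3.0):
    assert all(abs(v) < 1e-12 for v in quadrics(t, t**2, t**3))
print("all three quadrics vanish along the curve")
```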

Any implicitization problem has infinitely many possible answers, but the best answers are systems of equations that form an H-basis. This idea goes back to F. S. Macaulay in 1916, who was studying homogeneous equations, hence the “H”; in our context it basically means that any equation of total degree containing the parametric curve in its zero set is a polynomial combination , where the are in the H-basis and the are polynomials such that each term has total degree at most . Thus, for an H-basis, the ideal membership problem reduces to linear algebra.

If one has a system with zero set describing the parametric curve , then the Gröbner basis with respect to a degree ordering is an H-basis, perhaps larger than necessary. In practical terms one can simply use the following format.

In the case of the rational normal curve of degree , Harris [2] claims that quadratic equations are sufficient, so we can proceed as follows. We first give a procedure for finding the total degree of a polynomial in several variables.
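Such a procedure just maximizes the sum of exponents over the terms. A Python stand-in (with polynomials represented as dicts from exponent tuples to coefficients, an encoding chosen here purely for illustration) might look like this:

```python
# Total degree of a multivariate polynomial, encoded (for illustration)
# as a dict mapping exponent tuples to nonzero coefficients.
def total_degree(poly):
    return max(sum(exponents) for exponents, c in poly.items() if c != 0)

# x*z - y^2 in variables (x, y, z): both terms have total degree 2.
print(total_degree({(1, 0, 1): 1, (0, 2, 0): -1}))  # 2
# 2*x^3*y + 5 in variables (x, y): total degree 4.
print(total_degree({(3, 1): 2, (0, 0): 5}))         # 4
```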

Then we use the following code, say for .

This defines .

Likewise we get the following for .

The size of the H-basis is , which gets much larger than . The numbers are binomial coefficients and can be enumerated recursively; however, does not contain , so there is no obvious recursive construction of these bases.

In [4] I construct, for the fractional linear transformation given by an transformation matrix A, a transformation . This takes the system in with variables given by the list X to a system in with variable list Y such that if is a solution of , then is a solution of . This works only numerically, and the user must provide a number that bounds the degrees of the polynomials used and a small tolerance ; but for an appropriate choice of these parameters, the system is often an H-basis if is.

Thus, a possible method for finding an implicit system describing the rational parameterized curve is to write it in the form , where is a , and use .

We use Example 4 to illustrate this.

Example 4 continued; define and so on.

Then we get the implicitization directly using a related function (fractional linear transformation, i.e. ) that takes equation systems to equation systems rather than points to points. In [6] this is simple, because all the transformation matrices used are invertible. In this context the 3×5 transformation matrix is not invertible, so finding the equation system for the image of a transformation function becomes quite involved; essentially this is the subject of all of Chapter 2 of [4]. For instance, in this case we are compacting six equations into one.

The following non-executable code and result are copied from [4, Section 3.1]. For executable code, see the GlobalFunctionsMD.nb notebook of [4].

Once this is done we can check that this works.

We have shown how the Wolfram function simplifies the study of rational parameterized curves.

[1] | G. Cardano, The Great Art or the Rules of Algebra (T. R. Witmer, trans.), Cambridge, MA: MIT Press, 1968. |

[2] | J. Harris, Algebraic Geometry, A First Course, Springer Graduate Texts in Mathematics 133, New York: Springer, 1992. |

[3] | S. Wolfram, An Elementary Introduction to the Wolfram Language, Champaign, IL: Wolfram Media, 2015. See also reference.wolfram.com/language. |

[4] | B. H. Dayton. “Space Curve Book.” http://barryhdayton.space/SpaceCurves/spindex.html. (Sep 4, 2020). Code in barryhdayton.space/SpaceCurves/GlobalFunctionsMD.nb. |

[5] | S. Abhyankar, Algebraic Geometry for Scientists and Engineers, Providence, RI: American Mathematical Society, 1990. |

[6] | B. H. Dayton, A Numerical Approach to Real Algebraic Curves with the Wolfram Language, Champaign, IL: Wolfram Media, 2018. |

[7] | E. W. Weisstein. “Piriform Curve” from Wolfram MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/PiriformCurve.html. |

B. Dayton, “Degree versus Dimension for Rational Parametric Curves,” The Mathematica Journal, 2020. https://doi.org/10.3888/tmj.22-3.

Barry Dayton is the author of *A Numerical Approach to Real Algebraic Curves with the Wolfram Language* and is Professor Emeritus at Northeastern Illinois University in Chicago, IL. He lives in Ridgefield, CT.

**Barry H. Dayton**

Department of Mathematics

Northeastern Illinois University

Chicago, Illinois 60625-4699

*barryhdayton.space*

*barry@barryhdayton.us*

The Wolfram Language has numerous knowledge-based built-in functions to support financial computations. This article introduces many built-in and other financial functions based on concepts and models covered in undergraduate-level finance courses. Examples are taken from a wide range of finance areas and emphasize importing and visualizing data from many sources, valuation, capital budgeting, analysis of stock returns, portfolio optimization and analysis of bonds and stock options. The examples provide a set of tools for engaging with real-world financial data and solving practical problems. Because data is retrieved automatically from online sources, all results are reproducible without modifying the code. We hope this feature will attract new users from the finance community.

Finance is computational in nature and often involves the analysis and visualization of complex data, optimization, simulation and the use of data for risk management. Without the proper use of technology, it is almost impossible to carry out these functions of modern finance. Moreover, the field of finance has become far more driven by data and technology since the 1990s, which has made large-scale data analysis the norm. Data-driven decision making and predictive modeling are now at the heart of every strategic financial decision. Since the publication of Varian [1, 2], Shaw [3] and Stojanovic [4], there have been many updates and new functions, but no new articles or books have covered wide areas of computational finance. This article provides a comprehensive overview of finance-related functions and introduces many that are useful for real-world financial data analysis using Mathematica 12.

We have provided all the custom functions in the text so that users can make changes as they learn how to program in the Wolfram Language. Furthermore, we minimize the explanation of any financial concepts in this article as our focus is on introducing financial application of the Wolfram Language.

We begin by defining some symbols that are frequently used as input arguments of the custom functions in this article: or , , and . Most of these symbols are used as input to the built-in function , and all must be specified in a format acceptable to that function. We use or to represent one or more company stock ticker symbols, as a string or a list of strings. The symbol represents the start date of the sample period and represents the last date of the analysis period; both must be specified as date objects in any date format supported by . Similarly, represents the data frequency, which may be , , or . When these symbols appear as arguments in later functions, we do not describe them again.

The article is organized into 13 sections:

1. this Introduction

2. importing and visualizing data from different sources

3. capital budgeting and business valuation

4. functions for the analysis of security returns

5. rolling-window performance analysis

6. financial application of optimization

7. decomposing the risk of a portfolio into its components

8. importing factor data and running factor models

9. computing different types of portfolio performance measures

10. technical analysis of stock prices

11. bond analysis

12. analyzing derivative products

13. concluding remarks

The most commonly used built-in functions for retrieving company-specific financial data are , and . For example,

imports Facebook’s financial statement data and

imports Facebook’s price-related data.

Similarly, the function can be used to get data about stocks and other financial instruments. or can be used to chart prices against time. or can be used to make interactive plots with the additional feature of adding different technical indicators. Other functions such as , , or can also be used to visualize financial data. In the rest of this section, we show how to import data from different sources and visualize it.

We download Apple’s return on assets (ROA), return on equity (ROE) and revenue growth over the period January 1, 2001, to January 1, 2019, and plot them.

Similarly, we define the function to compare any specified property of different companies. The function takes a list of stock symbols, beginning period, end period and a property to consider as its arguments.

We plot the revenue growth of Apple, Facebook, Walmart and Bank of America over the period January 1, 2000, to January 1, 2019.

Second, we import and visualize data from the Federal Reserve Bank of St. Louis, as it is one of the most important data sources when it comes to economic data. The built-in function can be used to request the data from the Federal Reserve Economic Data API. Its argument structure is:

where is a series ID or a list of IDs. It returns a time series containing data for the specified series. It is often of interest to plot the economic time series with the recession dates.

The function downloads and plots the selected series along with the shaded recession period. The function takes series ID, start date, end date and title as inputs and returns a graph. It uses recession indicators based on USREC (US recession) data from the National Bureau of Economic Research (NBER) for the United States from the period following the peak through the trough to indicate the recession period. The Federal Reserve Bank of St. Louis may require the API key to download its data. The API key can be obtained freely by creating a user account at https://fred.stlouisfed.org (click “my account” and follow the instructions).

Now we can download any series and plot it. For example, we download and plot the Leading Index for the United States (USSLIND) over the period January 30, 1990, to January 30, 2019. Please use the API key 207071a5f2e90e7816259d3c32c1ab81 if needed. The shaded regions indicate recession periods.

We download and plot the historical real S&P 500 prices by month (MULTPL/SP500_REAL_PRICE _MONTH) over the period March 31, 1975, to May 30, 2019.

Finally, we show how to create a dataset. The built-in function is very useful for organizing large or small sets of data.

The function can be used to get a company’s fundamental data. After the data is stored, we can organize and analyze the data.

For example, we download return on assets (), return on equity () and revenue growth () for Apple Inc. (AAPL) over the period January 1, 2000, to January 1, 2019, and make a dataset. After the dataset is constructed, we can pull data and do further analysis using a rich set of built-in Wolfram knowledge.

Basic concepts used in common financial decision making are important for learning and understanding the finance discipline. Many functions such as , , , and are directly related to finance. Other functions such as and can be used to find one of the unknowns when the relevant information is given. All these functions are useful for solving time value of money, capital budgeting and business valuation problems. The Mathematica documentation provides numerous examples of how to use these functions. In this section, we focus on a few examples concerning loan amortization, capital budgeting and business valuation.

A loan amortization table is often used to visualize the periodic payments of a loan, the loan balance and the breakdown of each payment into principal and interest. The function returns an amortization table given its input arguments. It takes four required arguments and an optional fifth:

: current value of loan amount

: loan term in years

: annual percentage interest rate

: frequency of loan payment per year: 12 for monthly payment, 1 for annual payment, and so on

: (an optional argument) future value of the loan amount; if no value for is provided, the future value of the loan is assumed to be zero

Using this function, we compute an amortization table for a $40,000 loan with a one-year term and 5% APR, paid monthly.
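The underlying computation follows the standard level-payment annuity formula. The following Python sketch (its name and argument order are illustrative, not the article's Mathematica function) reproduces the schedule for this loan:

```python
# Amortization schedule sketch: a level payment from the annuity formula,
# with each payment split into interest (rate * balance) and principal.
def amortization_table(principal, years, apr, freq=12):
    r = apr / freq                     # periodic interest rate
    n = years * freq                   # total number of payments
    payment = principal * r / (1 - (1 + r) ** -n)
    rows, balance = [], principal
    for period in range(1, n + 1):
        interest = balance * r
        repaid = payment - interest
        balance -= repaid
        rows.append((period, payment, interest, repaid, balance))
    return rows

table = amortization_table(40000, 1, 0.05)
print(round(table[0][1], 2))     # level monthly payment
print(round(table[-1][-1], 6))   # ending balance, essentially zero
```

Early payments are mostly interest and later payments mostly principal, exactly the breakdown the table visualizes.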

The most commonly used decision tools in capital budgeting are net present value (NPV), internal rate of return (IRR), modified internal rate of return (MIRR) and profitability index (PI). These are defined in terms of the cash flows , , the discount rate and the reinvestment rate by:

We use the built-in Mathematica functions , and in the function to compute these measures. It takes cash flows (a list), discount rate and reinvestment rate as its arguments.

We illustrate the use of the function with an example. Say a project requires a $50,000 initial investment and is expected to produce, after tax, a cash flow of $15,000, $8,000, $10,000, $12,000, $14,000 and $16,000 over the next six years. The discount rate is 10% and the reinvestment rate is 11%. We compute the project’s NPV, IRR, MIRR and PI.
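For readers who want to check these figures by hand, the four measures can be reproduced in a few lines of Python (a sketch of the formulas above, with the IRR found by simple bisection; this is not the article's implementation):

```python
# Capital-budgeting measures for a cash-flow list cf[0..n], where cf[0]
# is the (negative) initial investment.
def npv(rate, cf):
    return sum(c / (1 + rate) ** t for t, c in enumerate(cf))

def irr(cf, lo=-0.99, hi=10.0, iters=200):
    # Bisection on NPV; assumes one sign change of NPV on [lo, hi].
    for _ in range(iters):
        mid = (lo + hi) / 2
        if npv(lo, cf) * npv(mid, cf) <= 0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2

def mirr(cf, finance_rate, reinvest_rate):
    n = len(cf) - 1
    fv_in = sum(c * (1 + reinvest_rate) ** (n - t)
                for t, c in enumerate(cf) if c > 0)
    pv_out = -sum(c / (1 + finance_rate) ** t
                  for t, c in enumerate(cf) if c < 0)
    return (fv_in / pv_out) ** (1 / n) - 1

def pi(rate, cf):
    # Profitability index: PV of inflows over the initial investment.
    return npv(rate, [0] + cf[1:]) / -cf[0]

cf = [-50000, 15000, 8000, 10000, 12000, 14000, 16000]
print(round(npv(0.10, cf), 2))          # NPV at 10%
print(round(irr(cf), 4))                # IRR, roughly 12.4%
print(round(mirr(cf, 0.10, 0.11), 4))   # MIRR with 11% reinvestment
print(round(pi(0.10, cf), 4))           # PI, just above 1
```

A positive NPV, an IRR above the 10% discount rate and a PI above 1 all point to accepting the project.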

One of the most widely used business valuation models is the discounted cash flow model, in which the value of any asset is obtained by discounting the expected cash flows on that asset at a rate that reflects its riskiness. In its most general form, the value of a company is the present value of the expected free cash flows the company can generate in perpetuity. Because we cannot estimate free cash flows (FCF) in perpetuity, we generally allow for a period where FCF can grow at extraordinary rates, but we allow for closure in the model by assuming that the growth rate will decline to a stable rate that can be sustained forever at some point in the future. If we assume that the discount rate is the weighted average cost of capital (WACC), FCF grows at the rate of per year and that the last year’s free cash flow is , then the value of the firm can be defined as

If we assume that FCF grows at the rate for the next years and at the rate thereafter, then the value of the firm can be written as

(1) |

We implement formula (1) with that takes five arguments:

1. , last year’s free cash flows

2. , the annual growth rate of free cash flows in the first growth period

3. , the number of years in the first growth period

4. , the stable growth rate

5. , the weighted average cost of capital

Suppose that a company has $100 as its past year’s FCF, its FCF is expected to grow at 5% for the next five years and 2.5% thereafter, and its WACC is 9.5%. We compute its value.
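The two-stage formula can be sketched directly in Python (an illustration of formula (1), not the article's function):

```python
# Two-stage discounted cash flow value: FCF grows at g for n years,
# then at gs forever; all flows discounted at wacc (requires wacc > gs).
def firm_value(fcf0, g, n, gs, wacc):
    stage1 = sum(fcf0 * (1 + g) ** t / (1 + wacc) ** t
                 for t in range(1, n + 1))
    terminal = fcf0 * (1 + g) ** n * (1 + gs) / (wacc - gs)
    return stage1 + terminal / (1 + wacc) ** n

# $100 FCF, 5% growth for 5 years, 2.5% thereafter, 9.5% WACC.
print(round(firm_value(100, 0.05, 5, 0.025, 0.095), 2))
```

Most of the value comes from the discounted terminal (stable-growth) term, which is typical of such valuations.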

The most practical approach is to assume that a company goes through three phases: growth, transition and maturity. First, a company’s growth increases. In the transition phase, the growth rate decreases. In the mature phase, a company grows at the same rate as that of the overall economy.

Assume that the FCF is positive and grew at the rate last year. Assume further that it grows at the higher rate each year for the next years and that after years, it declines at the rate each year for the next years. Finally, assume the company grows at the stable positive rate per year after years and that the cost of capital is represented by WACC. Then the value of the firm can be written as

(2) |

We implement formula (2) with that takes eight arguments:

1. , last year’s free cash flows

2. , last year’s FCF growth rate

3. , the incremental growth rate in the high growth period

4. , the number of years in the high growth period

5. , the declining growth rate in the transitional growth period

6. , the number of years in the transitional growth period

7. , the stable growth rate in the maturity growth period

8. , the weighted average cost of capital

To apply the function, consider a company in the early stage of its life cycle, and assume that it experienced 10 percent growth in the past year. The company is expected to grow by 8% more each year for the next 7 years, and its growth will then decline by 5% each year for 5 years. After 12 years, the company is expected to grow at the same rate as the overall economy, 2.5% per year. Suppose the past year’s FCF was $100 million and the weighted average cost of capital is 9.5%. We compute the value of the company.

When using , we assume that the growth rate in the stable phase is never negative. Therefore, do not set the declining growth rate too high; otherwise, the growth rate of FCF in the maturity phase may be negative.

There are various methods of analyzing historical stock price and return data. We can use different kinds of charts and graphs as well as descriptive statistics. It is also important to understand whether the distribution of stock returns is normal. In the next two subsections, we explain some of the most common charts and descriptive measures.

Commonly used charts for historical performance analysis are time series plots of prices, normalized prices (historical prices divided by the price at the beginning of the period), continuous drawdowns (cumulative continuous returns) and cumulative returns. The function takes four arguments as defined in Section 1 and returns four plots: historical prices, normalized prices, continuous drawdowns and cumulative returns.

We can apply the function to any symbol and period. For example, we plot the historical stock prices and returns of Walmart Inc. (WMT) over the period October 10, 2018, to June 7, 2019.

A histogram and an empirical plot of kernel density estimates are often used to describe the general shape of the data. The function takes four arguments as defined in Section 1 and returns density and histogram plots of returns.

Using the function, we download the daily closing price of the S&P 500 index over the period October 1, 2000, to October 1, 2019, and plot the histogram, the empirical density function and the density function of a normal distribution with the same mean and variance.

Many built-in functions can be used to compute different descriptive statistics. These descriptive statistics describe properties of distributions, such as location, dispersion and shape. The most common measures are computed by :

• holding period return

• average return

• geometric mean return

• cumulative returns

• standard deviation

• minimum return

• maximum return

• skewness

• kurtosis

• historical value at risk

• historical conditional value at risk

For example, we compute the descriptive statistics for Walmart Inc. (WMT) using monthly returns over the period October 10, 2010, to June 6, 2019.
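A few of the listed statistics can be sketched in plain Python (toy data here; the article's function computes them from downloaded prices):

```python
# Toy sketches of a few of the statistics listed above, for a list of
# periodic returns (synthetic data, not downloaded prices).
def holding_period_return(returns):
    total = 1.0
    for r in returns:
        total *= 1 + r
    return total - 1

def geometric_mean_return(returns):
    return (1 + holding_period_return(returns)) ** (1 / len(returns)) - 1

def historical_var(returns, level=0.05):
    # Historical value at risk: loss at the level-quantile of returns.
    return -sorted(returns)[int(level * len(returns))]

rets = [0.02, -0.01, 0.03, -0.04, 0.01, 0.05, -0.02, 0.00, 0.04, -0.03]
print(round(holding_period_return(rets), 4))
print(round(geometric_mean_return(rets), 4))
print(round(historical_var(rets, 0.10), 4))   # 0.03
```

The geometric mean is the constant per-period return that compounds to the same holding period return, which is why it sits slightly below the arithmetic mean for volatile series.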

It is also informative to examine the historical performance of an individual stock. In such an analysis, we calculate statistics from daily returns and report them on a monthly basis. We define the function to download historical stock prices, compute the desired statistics and return a dataset. The function takes four arguments: stock ticker symbol, start date, end date and a statistical function such as , or . For the function to work, it requires more than two years of data.

For example, we compute the monthly cumulative returns for Walmart Inc. (WMT) using daily returns over the period October 1, 2010, to June 30, 2019.

Once we compute the statistics, we can take specific columns by specifying their names.

We are often interested in knowing whether returns data follows a normal distribution, because understanding whether stock returns are normal is very important in investment management. One way to check is to compare the empirical quantiles of the data with those of a normal distribution. The function can be used to produce quantile-quantile plots. Many other built-in functions can help assess whether returns are normally distributed.

The function can be used to test whether data is normally distributed and can also be used to assess the goodness of fit of data to any distribution.

As the majority of financial data is multivariate, it is advantageous to perform comparative analysis of multiple security returns. In some cases, one has to compare one series with another; in other cases, many variables might have to be measured simultaneously to capture the complex relationships among them. Examining these relationships gives the analyst a more detailed account of the selected returns, allowing for a better interpretation of their values and behavior. In this section, we first compare the performance of one asset with another using graphs, then compute descriptive statistics as well as correlation matrices.

The two most commonly used graphs for comparing the historical performance of more than one stock or ETF are time series plots of normalized prices and cumulative returns. The next two functions take four arguments as defined in Section 1 and compute normalized prices and cumulative returns.

We get normalized prices and cumulative returns and plot them for three stocks (Facebook, Inc. (FB), Costco Wholesale Corporation (COST) and Walmart Inc. (WMT)) over the period May 1, 2000, to May 30, 2019.

Besides graphs, we can also compute descriptive statistics and compare their performance. We define a function for that purpose. It takes four arguments as defined in Section 1 and returns a table with different types of descriptive statistics.

For example, we download historical data and compute different descriptive statistics for three stocks (Walmart Inc. (WMT), Apple Inc. (AAPL) and Microsoft Corporation (MSFT)) using monthly data over the period January 1, 2010, to March 30, 2019.

Similarly to how we calculated an individual stock’s monthly statistics in Section 4.1, we define the function to compute monthly statistics for more than one stock given the arguments: stock ticker symbols, start date, end date and a statistical function such as , or .

For example, we compute the monthly cumulative returns for four stocks (Walmart Inc. (WMT), Apple Inc. (AAPL), Microsoft Corporation (MSFT) and Netflix, Inc. (NFLX)) over the period January 1, 2010, to June 30, 2019, and create a dataset. The first column represents year and month, the first four digits for the year and the last two digits for the month.

Similarly, box-and-whisker charts, paired histograms, paired smooth histograms and matrix scatterplots are often used to examine multivariate data. The function can be used to make a box plot that gives a glimpse of the distribution of the given dataset. You can see the statistical information by hovering over the boxes in the plot. The and functions are used to create paired histogram and smooth distribution plots. They can be used to compare how two datasets are distributed. The function from the Statistical Plots package can be used to make scatter plots of multivariate data. It creates scatter plots comparing the data in each column against other columns. More complex analysis of multivariate data can be done using functions from the Multivariate Statistics package. The package contains functions to compute descriptive statistics for multivariate data and distributions derived from the multivariate normal distribution. All these functions are well explained in the official documentation.

Rolling-window performance analysis is a simple technique to assess the variability of statistical performance measures. For example, to assess the stability of the mean or standard deviation of returns on a stock over time, we can choose a rolling window size (the number of consecutive observations per window), estimate the mean or standard deviation in each window and plot the series of estimates. Small fluctuations are normal, but large fluctuations indicate a shift in the value of the estimate. Built-in functions such as , and are useful for rolling-window performance analysis. In this section, we show a few examples of how to compute rolling-window performance statistics.
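The core of the computation is just applying a function to each window of consecutive observations; a minimal Python sketch (toy series, not the article's function):

```python
# Rolling-window statistic: apply a function to every window of w
# consecutive observations (toy series; the article uses stock returns).
def rolling_apply(data, w, func):
    return [func(data[i:i + w]) for i in range(len(data) - w + 1)]

def mean(xs):
    return sum(xs) / len(xs)

series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(rolling_apply(series, 3, mean))   # [2.0, 3.0, 4.0, 5.0]
```

Any statistic (standard deviation, a ratio of statistics, a correlation on paired windows) can be passed in place of `mean`.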

We define that can be used to plot rolling-window statistics given its five inputs: stock ticker symbol, start date, end date, size of window in days and function to apply, which can be any built-in or user-defined function.

For example, we compute and plot the 90-day rolling mean to standard deviation ratio on Walmart’s daily stock returns over the period January 1, 2001, to March 30, 2019.

Similarly, we define the function to compute the rolling correlation of two series and apply it together with to the desired data.

We use the function to plot the 90-day rolling correlation of daily returns on two stocks, WMT and COST, for the period from March 30, 2009, to March 30, 2019.

Sometimes, it is also useful to store these time-varying descriptive statistics as a dataset so that we can use them in the subsequent analysis. The function , given its input, computes the geometric mean, standard deviation and the ratio of the arithmetic mean to the standard deviation on a rolling-window basis.

For example, we compute the 90-day rolling-window geometric mean (GM), standard deviation (Std. Dev.) and arithmetic mean to standard deviation ratio (AM/Std. Dev.) of Walmart’s daily stock returns over the period July 1, 2018, to October 30, 2019. You can scroll through the dataset.

Mean-variance analysis is one of the foundations of financial economics. Portfolio optimization is essential, whether it be in professional or personal financial planning. In this section, we are going to show how to implement the most commonly used optimization techniques in finance using historical returns. We want to point out that future returns on investment depend on expected returns and other conditioning information, not on the past returns. Past returns are used only for illustration and do not guarantee future returns.

Define the following variables:

• is the risk-free rate

• is the proportion of wealth invested in security

• is the average return on security

• is the variance of security

• is the covariance between securities and

• is the correlation between securities and

Then we define the vectors of mean returns and weights and the covariance matrix:

The formulas for the portfolio mean and variance are and , respectively. The corresponding Mathematica code is and .
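In any language the two formulas amount to one dot product and one quadratic form; here is a small Python sketch with a hypothetical two-asset example:

```python
# Portfolio mean (w . mu) and variance (w^T Sigma w), with a
# hypothetical two-asset example.
def portfolio_mean(w, mu):
    return sum(wi * mi for wi, mi in zip(w, mu))

def portfolio_variance(w, sigma):
    n = len(w)
    return sum(w[i] * sigma[i][j] * w[j]
               for i in range(n) for j in range(n))

w = [0.6, 0.4]
mu = [0.01, 0.02]
sigma = [[0.04, 0.01],
         [0.01, 0.09]]
print(round(portfolio_mean(w, mu), 6))         # 0.014
print(round(portfolio_variance(w, sigma), 6))  # 0.0336
```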

In order to compute portfolio statistics, we need returns data. We can use the function to download historical returns data. It takes four arguments as defined in Section 1 and gives a matrix of returns. Most functions in this section use the function, so run it before you run other functions.

To compute basic portfolio statistics such as portfolio mean, variance, standard deviation and Sharpe ratio, we can use , which takes six arguments. The first four arguments are as defined in Section 1 and the other two are a list of weights and the optional risk-free rate.

For example, we compute the portfolio mean, variance, standard deviation and Sharpe ratio for the portfolio that consists of the stock returns of five companies: Apple (AAPL), Walmart (WMT), Boeing (BA), 3M (MMM) and Exxon Mobil (XOM), using monthly returns over the period January 1, 2009, to May 30, 2019.

The function plots the Markowitz portfolio frontier; it takes a matrix of returns obtained from as its only argument. The function uses the concept that any two efficient portfolios are enough to establish the whole portfolio frontier, as first proved by Black [5]. It accepts any option.

For example, we plot the portfolio frontier for the portfolio that consists of the stock returns of five companies: Apple (AAPL), Walmart (WMT), Boeing (BA), 3M (MMM) and Exxon Mobil (XOM), using monthly returns over the period January 1, 2009, to May 30, 2019.

Next, we solve two kinds of portfolio problems: global minimum variance portfolio and tangency portfolio. In terms of the notation defined earlier in this section, the global minimum variance portfolio can be obtained by minimizing subject to and solving for . Its solution can be obtained with the built-in Mathematica function .
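For a small instance, the minimization has the well-known closed form w = Σ⁻¹1 / (1ᵀΣ⁻¹1). The following Python sketch applies it to a hypothetical two-asset covariance matrix, inverting Σ by hand:

```python
# Global minimum variance weights via the closed form
# w = Sigma^{-1} 1 / (1' Sigma^{-1} 1), for a hypothetical
# two-asset covariance matrix (inverted by hand).
def gmv_weights_2(sigma):
    (a, b), (c, d) = sigma
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]
    z = [sum(row) for row in inv]    # Sigma^{-1} 1
    total = sum(z)                   # 1' Sigma^{-1} 1
    return [x / total for x in z]

sigma = [[0.04, 0.01],
         [0.01, 0.09]]
print([round(x, 4) for x in gmv_weights_2(sigma)])  # [0.7273, 0.2727]
```

As expected, the lower-variance asset receives the larger weight.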

The function computes weights, returning the portfolio allocation on stocks considered for a global minimum variance portfolio.

We compute the global minimum variance portfolio weights using the monthly stock returns of five companies: Apple (AAPL), Walmart (WMT), Boeing (BA), 3M (MMM) and Exxon Mobil (XOM), over the period January 1, 2009, to May 30, 2019.

Similarly, the tangency portfolio can be obtained by maximizing , where is the risk-free rate (a constant in this case), subject to and solving for . The solution uses the built-in Mathematica function . The function computes the tangency portfolio weights given its five inputs, four as defined in the Section 1 and the risk-free rate.
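The tangency weights likewise have a standard closed form: w is proportional to Σ⁻¹(μ − r_f 1), rescaled to sum to one. A Python sketch with hypothetical two-asset inputs:

```python
# Tangency portfolio weights: proportional to Sigma^{-1} (mu - rf),
# rescaled to sum to one. Two hypothetical assets; Sigma inverted by hand.
def tangency_weights_2(mu, rf, sigma):
    (a, b), (c, d) = sigma
    det = a * d - b * c
    e = [m - rf for m in mu]               # excess mean returns
    z = [(d * e[0] - b * e[1]) / det,
         (-c * e[0] + a * e[1]) / det]     # Sigma^{-1} (mu - rf)
    total = sum(z)
    return [x / total for x in z]

mu = [0.01, 0.02]
sigma = [[0.04, 0.01],
         [0.01, 0.09]]
w = tangency_weights_2(mu, 0.001667, sigma)   # monthly rf ~ 0.1667%
print([round(x, 4) for x in w])
```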

Assuming a monthly risk-free rate of 0.1667 percent and using monthly data over the period January 1, 2009, to May 30, 2019, we calculate the tangency portfolio weights for our portfolio of five stocks: Apple (AAPL), Walmart (WMT), Boeing (BA), 3M (MMM) and Exxon Mobil (XOM).

Portfolio optimization using the Wolfram Language is very flexible. We can formulate any kind of portfolio and use built-in functions such as , or to get numerical solutions to the portfolio problem.

In this section, we concentrate on how to decompose a measure of portfolio risk (portfolio standard deviation) into risk contribution from individual assets included in the portfolio. It helps to see how individual assets influence portfolio risk. When risk is measured by standard deviation, we can use Euler’s theorem for decomposing risk into asset-specific risk contribution. Euler’s theorem provides an additive decomposition of a homogeneous function. For reference, see Campolieti and Makarov [6]. Using Euler’s theorem, we can define the percentage contribution to portfolio standard deviation of an asset as , where is the marginal contribution of the asset, is the weight of the asset, and is the portfolio standard deviation.
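The decomposition itself is a short computation: with Σw in hand, the marginal contribution of asset i is (Σw)_i/σ_p, and the weighted contributions sum back to σ_p. A Python sketch with hypothetical inputs:

```python
import math

# Euler decomposition of portfolio standard deviation: marginal
# contribution (Sigma w)_i / sigma_p, asset contribution w_i * marginal,
# and percentage contribution; the contributions sum back to sigma_p.
def risk_decomposition(w, sigma):
    n = len(w)
    sw = [sum(sigma[i][j] * w[j] for j in range(n)) for i in range(n)]
    sigma_p = math.sqrt(sum(w[i] * sw[i] for i in range(n)))
    contrib = [w[i] * sw[i] / sigma_p for i in range(n)]
    pct = [c / sigma_p for c in contrib]
    return sigma_p, contrib, pct

w = [0.6, 0.4]
sigma = [[0.04, 0.01],
         [0.01, 0.09]]
sigma_p, contrib, pct = risk_decomposition(w, sigma)
print(round(sigma_p, 4))             # portfolio standard deviation
print([round(p, 4) for p in pct])    # shares of risk, summing to 1
```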

We define the function for portfolio risk decomposition. It takes five arguments, four arguments as defined in Section 1 and a list of portfolio weights; it returns a bar chart representing the individual asset’s risk contribution to the portfolio standard deviation.

We calculate the risk contribution of each asset in a portfolio that consists of five stocks using the historical monthly returns over the period January 30, 2010, to May 30, 2019 and make a bar chart.

Currently, factor models are widely accepted and used in finance to construct portfolios, to evaluate portfolio performance and for risk analysis. Factor models are regression models, so we can use the built-in regression function to estimate them and evaluate their appropriateness. In addition, we can download all factor data directly from Prof. Kenneth French’s data library. Before we apply factor models to real-world data, we need the data in regression form: a matrix of values of the independent variables (the factors) and a vector of values of the dependent variable (the returns).

Commonly used factor models are summarized in Table 1. We can find more about the factor models in Fama and French [7]. We use the following notation:

• MKT: excess return on the value-weighted market portfolio

• SMB: return on a diversified portfolio of small-capitalization stocks minus the return on a diversified portfolio of large-capitalization stocks

• HML: difference in the returns on diversified portfolios of high-book-to-market and low-book-to-market stocks

• MOM: difference in returns on diversified portfolios of the prior year’s winners and losers

• RMW: difference between the returns on diversified portfolios of stocks with robust and weak profitability

• CMA: difference between the returns on diversified portfolios of the stocks of low and high investment firms

The remaining symbols are the excess return on the security or portfolio, its risk-adjusted return and the betas or factor loadings on each factor.

Before we estimate these models using real data, we need factors data. We download five factors and the momentum factor data from Kenneth French’s website (http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html) and define variables to store them.

The function takes only two arguments, the start date and end date, and returns time series of all factors data. The start date and end date must be specified as date objects.

Similarly, the function can be used to get a stock’s monthly returns data. It takes three arguments, a ticker symbol of any publicly traded company, and the start and end dates for the analysis period.

Using these two functions, we define a function to combine the factors and returns data. It takes four arguments (the symbol of the stock for which we want to estimate the factor model, the start date, the end date and an integer that represents the number of factors) and returns a data matrix suitable for model fitting.

Next, we estimate different factor models for Apple’s stock using monthly data from October 1, 2008, to March 30, 2019. We estimate the capital asset pricing model (CAPM) with market factor (MKT).
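
Since the CAPM is an ordinary least-squares regression of the stock's excess return on the market factor, the estimation can be sketched outside Mathematica as well. The following Python sketch uses synthetic data in place of the downloaded factor series; all numbers are hypothetical.

```python
# One-factor (CAPM) estimation by simple OLS on synthetic monthly data.
import random
random.seed(0)

n = 120
mkt = [random.gauss(0.006, 0.04) for _ in range(n)]      # excess market return
alpha_true, beta_true = 0.002, 1.2
ri = [alpha_true + beta_true * m + random.gauss(0, 0.02) for m in mkt]

# Closed-form simple regression: beta = cov(x, y)/var(x), alpha = ybar - beta*xbar
xbar = sum(mkt) / n
ybar = sum(ri) / n
sxx = sum((x - xbar) ** 2 for x in mkt)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mkt, ri))
beta = sxy / sxx
alpha = ybar - beta * xbar
print(alpha, beta)
```

The estimates recover the true alpha and beta up to sampling noise; the multi-factor models below are the same idea with more regressors.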

We estimate the Fama–French three-factor model with market, size and value factors (MKT, SMB, HML).

Similarly, we estimate the Carhart four-factor model with market, size, value and momentum factors (MKT, SMB, HML, MOM).

Finally, the Fama–French five-factor model with market, size, value, profitability and investment factors (MKT, SMB, HML, RMW, CMA) is estimated as follows.

Once a model is estimated, we can access various properties of the data and the fitted model. To assess how well the model fits the data and how well it meets the regression assumptions, there are many built-in functions. To learn more about obtaining diagnostic information, see the list of model properties.

There are various measures to evaluate the performance and risk of portfolios. Most of these measures are used to evaluate a portfolio of interest against a chosen benchmark, by taking a snapshot of the past or considering the entire historical picture. We compute some common metrics often employed by investors when analyzing performance. All the computations are based on the formulas developed in Bacon [8]. The most common measures are summarized in a function that takes four arguments:

• ticker symbols for the test assets and the benchmark asset

• periodic risk-free rate

• time period

• frequency of data
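
Two representative measures from Bacon [8] can be sketched directly (plain Python, not the article's table-producing function; the return series below are made-up monthly numbers).

```python
# Sharpe ratio and information ratio for a portfolio (rp) against a
# benchmark (rb), with a periodic risk-free rate rf. Illustrative data only.
import math

def sharpe(rp, rf):
    ex = [r - rf for r in rp]
    mean = sum(ex) / len(ex)
    sd = math.sqrt(sum((e - mean) ** 2 for e in ex) / (len(ex) - 1))
    return mean / sd

def information_ratio(rp, rb):
    active = [p - b for p, b in zip(rp, rb)]
    mean = sum(active) / len(active)
    te = math.sqrt(sum((a - mean) ** 2 for a in active) / (len(active) - 1))
    return mean / te   # mean active return over tracking error

rp = [0.02, -0.01, 0.03, 0.015, 0.005, 0.02]
rb = [0.015, -0.005, 0.02, 0.01, 0.0, 0.018]
print(sharpe(rp, 0.0016), information_ratio(rp, rb))
```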

We repeat the definition of the function given earlier to make this section self-contained.

The next example uses stocks, although these measures are also used to evaluate the performance of portfolios, mutual funds and exchange-traded funds. We evaluate the performance of Walmart Inc. (WMT), Apple Inc. (AAPL) and Microsoft Corporation (MSFT) against the S&P 500 index (^SPX), using 0.0016 as the monthly risk-free rate and monthly data over the period January 1, 1995, to March 30, 2019.

computePortfolioPerformanceTable[{"WMT","AAPL","MSFT"},"^SPX",0.0016, {1995,01,01},{2019,03,30},"Month"]//Text

Short-term traders commonly use interactive graphics and technical indicators of stock prices to profit from stocks that may be overbought or oversold. Much is based on market sentiment, but also market timing. When a stock is oversold, the price is low and people want to buy. In comparison, when a stock is overbought, the price is above its normal range, and people do not want to buy, or may want to sell short. Many technical indicators are used to determine a given stock’s peak or bottom price and how to take advantage of that information. Three functions are particularly useful for technical analysis of stocks. The documentation provides a comprehensive set of examples on how to use them.

We show one example of how to use the function and one example of how to use .

You can choose the chart type and from over 100 technical indicators, which are divided into eight groups:

• basic

• moving average

• market strength

• momentum

• trend

• volatility

• band

• statistical

displays all the available technical indicators.

Here is the basic format:

Alternatively, use:

The time series data must be of the form:

The historical open, high, low, close and volume data retrieved from can also be used as data input. The function has many options that can be used to enhance graphics. We produce a chart using historical prices of Apple’s stock and volume over the period January 1, 2018, to March 30, 2019.

The top of the chart shows a plot of historical prices and 50-day and 200-day moving averages. The second part shows the historical volume. The last two parts show plots of two indicators, the commodity channel index () and the relative strength index ().

A good introduction to technical indicators can be found in standard references, including at Fidelity Learning Center [9].

The function provides a point-and-click interactive chart, with a similar setup:

Alternatively, use:

For example, we make a chart showing prices, volume and indicators for historical data of Apple’s stock over the period January 1, 2018, to March 30, 2019. The function provides a user-friendly environment where you can drag a slider to view different parts of the chart or you can choose different indicators with point-and-click.

A bond is a long-term debt instrument in which a borrower agrees to make payments of principal and interest, on specific dates, to the holders of the bond. When it comes to analysis and pricing of bonds and computing returns, convexity and duration are important concepts. When a bond is traded between coupon payment dates, its price has two parts: the quoted price and the accrued interest. The quoted price is net of accrued interest and does not include it. Accrued interest is the seller's proportional share of the next coupon payment. The full price is the price of a bond including any interest that has accrued since the most recent coupon payment. Similarly, yield to maturity is the rate of return earned on a bond if it is held to maturity. Duration is a measure of the average length of time for which money is invested in a coupon bond. Convexity estimates the change in the bond price given a change in the yield to maturity, assuming a nonlinear relationship between the two. The built-in functions can be used to compute various properties, including the value of the bond, accrued interest, yield, duration, modified duration, convexity, and so on. This section provides a few examples of their use and of how the concepts of bond convexity and duration can be applied in bond portfolio management.

We discuss zero-coupon bonds first. A zero-coupon bond does not make coupon payments; the only cash payment is the face value of the bond on the maturity date. The yield to maturity y of a zero-coupon bond with n periods to maturity, current price P and face value F can be obtained by solving P = F/(1 + y)^n.

For example, we compute the yield to maturity of a zero-coupon bond with a $10,000 face value, time to maturity 4 years and current price $9,662 using and .
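
Solving the zero-coupon relation for y gives y = (F/P)^(1/n) − 1; with the example's numbers the computation is one line (a plain Python sketch, not the article's Mathematica code):

```python
# Zero-coupon yield to maturity from P = F/(1+y)^n, using the example's
# numbers: face value $10,000, price $9,662, n = 4 years.
F, P, n = 10_000, 9_662, 4
y = (F / P) ** (1 / n) - 1
print(round(100 * y, 4), "percent per year")
```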

Similarly, can also be used to compute the yield to maturity of a nonzero coupon payment bond.

For example, we compute the yield to maturity of a $1,000 par value 10-year bond with 5% semiannual coupons issued on June 20, 2013, with maturity date of June 20, 2023, selling for $920 on September 15, 2018.

can also be used to compute the price, duration, modified duration and convexity of a bond. For example, we compute those values for a bond with 8% yield, 8% annual coupons, 10-year maturity and $1,000 face value.

There are different approaches to bond portfolio management. We concentrate here on a liability-driven portfolio strategy, in which the characteristics of the bonds that are held in the portfolio are coordinated with those of the liabilities the investor is obligated to pay. The matching techniques can range from an attempt to exactly match the levels and timing of the required cash payments to more general approaches that focus on other investment characteristics, such as setting the average duration or convexity of the bond portfolio equal to that of the underlying liabilities. One specific example would be to construct the portfolio so that the duration of the bond portfolio is equal to the duration of cash obligation and the total money invested in the bond portfolio today is equal to the present value of the future cash obligations.

To illustrate the concept of bond portfolio management, assume that we have an obligation to pay $1,000,000 in 10 years and there are two bonds available for investment. The first bond matures in 30 years with $100 face value and annual coupon payment of 6%. The second bond matures in 10 years with $100 face value and annual coupon payment of 5%. The yield to maturity is 9% on both bonds. We can decide on how much to invest in each bond so that the overall portfolio is immunized against changes in the interest rate.

We compute the duration of each bond, which gives a duration of 11.88 for bond 1 and 6.75 for bond 2. Assuming that the proportions of money invested in bonds 1 and 2 are w1 and w2, the immunized portfolio is found by solving the simultaneous equations w1 + w2 = 1 and 11.88 w1 + 6.75 w2 = 10, the second of which matches the portfolio duration to the 10-year duration of the obligation:

These two equations can be solved using .
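
With only two unknowns, the system also has a closed-form solution, sketched here in Python with the durations computed above (D1 = 11.88, D2 = 6.75) and the 10-year duration target:

```python
# Immunization weights: w1 + w2 = 1 and w1*D1 + w2*D2 = target
# imply w1 = (target - D2)/(D1 - D2).
D1, D2, target = 11.88, 6.75, 10.0
w1 = (target - D2) / (D1 - D2)
w2 = 1 - w1
print(w1, w2)
```

About 63% of the money goes into the longer bond and 37% into the shorter one.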

The result shows how much money should be allocated to each bond. More examples can be found in Benninga [10]. More general bond portfolio management problems can be solved using linear programming, but introducing linear programming is beyond the scope of this article.

The most popular option pricing models are the binomial model and the Black–Scholes–Merton formulas for European options. In the next two subsections, we discuss these models and their implementation.

Following the notation from Hull [11], define:

In terms of these variables, we can define:

• time per period

• up factor

• down factor

• probability of an up move

• probability of a down move

• stock price at node

• payoff from a European call

• payoff from a European put

In the risk-neutral world, the price of the call using the n-period binomial options pricing model can be computed as the discounted risk-neutral expectation of the terminal payoff,

c = e^(−rT) Σ_{j=0}^{n} C(n, j) p^j (1 − p)^(n−j) max(S0 u^j d^(n−j) − K, 0),

and the put price is obtained by replacing the payoff with max(K − S0 u^j d^(n−j), 0).

The functions and calculate the prices of European call and put options; they output the option price. Each function takes six arguments as defined at the beginning of this subsection.

We use these functions to find the prices of call and put options when the current stock price is $50, the strike price is $45, the annual volatility is 40%, the risk-free rate is 10%, the time to maturity is half a year and the total number of up and down moves is 500.
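
The same computation can be sketched in Python (an independent Cox–Ross–Rubinstein implementation, not the article's Mathematica functions), using the stated parameters: S0 = 50, K = 45, σ = 0.4, r = 0.1, T = 0.5 and n = 500 periods.

```python
# n-period binomial (CRR) pricing of European calls and puts.
import math

def crr_price(s0, k, sigma, r, t, n, kind="call"):
    dt = t / n                             # time per period
    u = math.exp(sigma * math.sqrt(dt))    # up factor
    d = 1 / u                              # down factor
    p = (math.exp(r * dt) - d) / (u - d)   # risk-neutral up probability
    price = 0.0
    comb = 1.0                             # binomial coefficient C(n, j), built iteratively
    for j in range(n + 1):
        st = s0 * (u ** j) * (d ** (n - j))
        payoff = max(st - k, 0.0) if kind == "call" else max(k - st, 0.0)
        price += comb * (p ** j) * ((1 - p) ** (n - j)) * payoff
        comb = comb * (n - j) / (j + 1)
    return math.exp(-r * t) * price

c = crr_price(50, 45, 0.4, 0.1, 0.5, 500, "call")
p = crr_price(50, 45, 0.4, 0.1, 0.5, 500, "put")
print(c, p)
```

As a sanity check, the two prices satisfy put-call parity, c − p = S0 − K e^(−rT), exactly up to floating-point error.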

Similar to the binomial option pricing formula defined in the previous subsection, we follow Hull [11] to explain the Black–Scholes–Merton option pricing formulas. Define the variables:

Furthermore, assume that φ(x) = (1/√(2π)) e^(−x²/2) is the standard normal density function (where e is the base of the natural logarithms) and let N be the standard normal cumulative distribution function, so that N(x) denotes the probability that a random variable drawn from a standard normal distribution is less than x. Then the call and put values can be computed as

c = S0 N(d1) − K e^(−rT) N(d2),

p = K e^(−rT) N(−d2) − S0 N(−d1),

where

d1 = (ln(S0/K) + (r + σ²/2) T) / (σ √T),

d2 = d1 − σ √T.

Table 2 summarizes the price sensitivity measures of call and put options (denoted by Greek symbols) with respect to their major price determinants; here V stands for the value of the option.

The built-in Mathematica function computes the values and other price sensitivity measures for common types of derivative contracts. The function can compute the value of an option, any of delta, gamma, theta and vega, as well as the implied volatility of the contract. The Mathematica documentation provides many examples of how to use . Here are the first 10 of a list of 101 available contracts.

For example, we compute the price and Greeks of the European-style put option with strike price $50, expiration date 0.3846 years, interest rate 5%, annual volatility 20%, no annual dividend and current price $49.
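
As a cross-check on the put value, the closed-form formulas above can be evaluated directly. This Python sketch is independent of the article's built-in Mathematica function and uses the example's parameters (S0 = 49, K = 50, r = 5%, σ = 20%, T = 0.3846 years, no dividends).

```python
# Black-Scholes-Merton closed-form prices for a European call and put.
import math

def norm_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def bsm(s0, k, r, sigma, t):
    d1 = (math.log(s0 / k) + (r + sigma ** 2 / 2) * t) / (sigma * math.sqrt(t))
    d2 = d1 - sigma * math.sqrt(t)
    call = s0 * norm_cdf(d1) - k * math.exp(-r * t) * norm_cdf(d2)
    put = k * math.exp(-r * t) * norm_cdf(-d2) - s0 * norm_cdf(-d1)
    return call, put

call, put = bsm(49, 50, 0.05, 0.2, 0.3846)
print(round(call, 2), round(put, 2))  # call about 2.40, put about 2.45
```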

Similarly, we compute the implied volatility of an American-style call option with the same values of the parameters.

One interesting application is to get real-world data and compute measures related to options. We define a function that computes the theoretical value of options and their Greeks; it takes five arguments:

• ticker symbol ()

• strike price ()

• expiration date ()

• risk-free rate ()

• either or ()

We compute the European-style option parameters for Boeing (BA), assuming that the option expires on December 28, 2020, with exercise price $145 and risk-free rate 0.0187. Make sure that the expiration date is later than the current date, since the function uses historical data.

We strongly encourage you to explore the built-in or online documentation for the powerful function.

This article provides a brief overview of built-in functions and introduces many functions especially designed for the analysis of financial data. In particular, we have focused on functions that are most relevant to introductory computational finance. We emphasize importing and visualizing company fundamentals, analyzing individual stock and portfolio returns, estimating factor models, and using built-in functions for bond and financial derivative analysis. The functions we have provided are just a few examples; the Wolfram Language can do much more than what we have shown in this article. Interested readers can start exploring the Wolfram Language via Mathematica’s extensive documentation.

[1] H. R. Varian, ed., Economic and Financial Modeling with Mathematica, New York: Springer-Verlag, 1993.

[2] H. R. Varian, ed., Computational Economics and Finance: Modeling and Analysis with Mathematica, New York: Springer-Verlag, 1996.

[3] W. Shaw, Modelling Financial Derivatives with Mathematica, Cambridge, UK: Cambridge University Press, 1998.

[4] S. Stojanovic, Computational Financial Mathematics Using Mathematica: Optimal Trading in Stocks and Options, Boston: Birkhäuser, 2003.

[5] F. Black, “Capital Market Equilibrium with Restricted Borrowing,” Journal of Business, 45(3), 1972 pp. 444–455. www.jstor.org/stable/2351499.

[6] G. Campolieti and R. Makarov, Financial Mathematics: A Comprehensive Treatment, London: Chapman and Hall/CRC Press, 2014.

[7] E. F. Fama and K. R. French, “A Five-Factor Asset Pricing Model,” Journal of Financial Economics, 116(1), 2015 pp. 1–22. doi:10.1016/j.jfineco.2014.10.010.

[8] C. R. Bacon, Practical Risk-Adjusted Performance Measurement, 2nd ed., Hoboken: Wiley, 2013.

[9] Fidelity Learning Center. “Technical Indicator Guide.” (Jul 29, 2020) www.fidelity.com/learning-center/trading-investing/technical-analysis/technical-indicator-guide/overview.

[10] S. Benninga, Financial Modeling, 4th ed., Cambridge, MA: MIT Press, 2014.

[11] J. C. Hull, Options, Futures and Other Derivatives, 10th ed., New York: Pearson, 2018.

R. Adhikari, “Foundations of Computational Finance,” The Mathematica Journal, 2020. https://doi.org/10.3888/tmj.22-2.

Ramesh Adhikari is an assistant professor of finance at Humboldt State University. Prior to coming to HSU, he taught undergraduate and graduate students at Tribhuvan University and worked at the Central Bank of Nepal. He was also a research fellow at Osaka Sangyo University in Osaka, Japan. He earned a Ph.D. in financial economics from the University of New Orleans. He is interested in the areas of computational finance and high-dimensional statistics.

**Ramesh Adhikari**

*School of Business, Humboldt State University
1 Harpst Street
Arcata, CA 95521*

The metric structure on a Riemannian or pseudo-Riemannian manifold is entirely determined by its metric tensor, which has a matrix representation in any given chart. Encoded in this metric is the sectional curvature, which is often of interest to mathematical physicists, differential geometers and geometric group theorists alike. In this article, we provide a function to compute the sectional curvature for a Riemannian manifold given its metric tensor. We also define a function to obtain the Ricci tensor, a closely related object.

A *Riemannian manifold* is a differentiable manifold together with a Riemannian metric tensor that takes any point in the manifold to a positive-definite inner product function on its *tangent space*, which is a vector space representing geodesic directions from that point [1]. We can treat this tensor as a symmetric matrix with entries denoted by g_ij, representing the relationship between tangent vectors at a point in the manifold, once a system of local coordinates has been chosen [2, 3]. In the case of a parameterized surface, we can use the parameters to compute the full metric tensor.

A classical parametrization of a surface is the standard parameterization of the sphere. We compute the metric tensor of the standard sphere below.

This also works for more complicated surfaces. The following is an example taken from [4].

Denoting the coordinates by x_i, we can then define the line element ds² = g_ij dx_i dx_j, where the g_ij are functions of the coordinates x_i; this definition uses Einstein notation, which will also apply wherever applicable in the following. From this surprisingly dense description of distance, we can extract many properties of a given Riemannian manifold, including *sectional curvature*, which will be given an explicit formula later. In particular, two-dimensional manifolds, also called *surfaces*, carry a value that measures at any given point how far they are from being flat. This value can be positive, negative or zero. For intuition, we give examples of each of these types of behavior.

The sphere is the prototypical example of a surface of positive curvature.

Any convex subspace of Euclidean space has zero curvature everywhere.

The monkey saddle is an example of a two-dimensional figure with negative curvature.

Sectional curvature is a locally defined value that gives the curvature of a special type of two-dimensional subspace at a point, where the two dimensions defining the surface are input as tangent vectors. Manifolds may have points that admit sections of both negative and positive curvature simultaneously, as is the case for the Schwarzschild metric discussed in the section “Applications in Physics.” An important property of sectional curvature is that on a Riemannian manifold it varies smoothly with respect to both the point in the manifold being considered and the choice of tangent vectors.

Sectional curvature is given by

K(u, v) = ⟨R(u, v) v, u⟩ / (|u|² |v|² − ⟨u, v⟩²),

where |u|² = ⟨u, u⟩ is the norm induced by the metric.

In this formula, R represents the purely covariant Riemannian curvature tensor, a function on tangent vectors that is completely determined by the metric components g_ij. Both R and the g_ij are treated more thoroughly in the following section, as well as in [1]. Some immediate properties of the curvature formula are that K is symmetric in its two entries, is undefined if the vectors u and v are linearly dependent, and does not change when either vector is scaled. Moreover, any two tangent vectors that define the same subspace of the tangent space give the same value. This is important because curvature should depend only on the embedded surface itself and not on how it was determined.

While we are primarily concerned with Riemannian manifolds, it is worth noting that all calculations are valid for pseudo-Riemannian manifolds, in which the assumption that the metric tensor is positive-definite is dropped. This generalization is especially important in areas such as general relativity, where the metric tensors that represent spacetime have a different signature than that of traditional Riemannian manifolds. We explore this connection more in the section “Applications in Physics.”

For a differentiable manifold, an *atlas* is a collection of homeomorphisms, called *charts*, from open sets in Euclidean space to the manifold, such that overlapping charts can be made compatible by a differentiable transition map between them. Via these homeomorphisms, we can define coordinates in an open set around any point by adopting the coordinates in the corresponding Euclidean neighborhood. By convention, these coordinates are labelled x_i, and unless it is important, we omit the point giving rise to the coordinates. In some cases of interest, it is possible to adopt a coordinate system that is valid over the whole manifold.

From such a coordinate system, whether local or global, we can define a basis for the tangent space using a *coordinate frame* [5]. This will be the basis consisting of the partial derivative operators in each of the coordinate directions, that is, ∂/∂x_1, …, ∂/∂x_n. Considering the tangent space as a vector space, this set is sometimes referred to in mathematical physics as a *holonomic basis* for the manifold. We then use this basis to define the symmetric matrix g by the following expression for g_ij:

g_ij = ⟨∂/∂x_i, ∂/∂x_j⟩.

From here, we define one more tensor of interest for the purposes of calculating curvature. Using Einstein notation, the Riemannian curvature tensor is

R^l_ijk = ∂_i Γ^l_jk − ∂_j Γ^l_ik + Γ^l_is Γ^s_jk − Γ^l_js Γ^s_ik.

The various Γ^l_ij are the *Christoffel symbols*, for which code is presented in the next section. In light of these definitions, we recall sectional curvature once again from the introduction, now considering the special case of tangent vectors chosen in coordinate directions:

K(∂_i, ∂_j) = ⟨R(∂_i, ∂_j) ∂_j, ∂_i⟩ / (|∂_i|² |∂_j|² − ⟨∂_i, ∂_j⟩²).

The norm in the denominator is the norm of the tangent vector associated to that partial derivative in the holonomic basis, induced by the inner product coming from the metric g.

We now create functions to compute these tensors and sectional curvature itself. These values depend on a set of coordinates and a Riemannian metric tensor, so that is the information that serves as input for these functions. The coordinates should be given as a list of coordinate names, and the metric should be a square symmetric matrix whose size matches the length of the coordinate list. Some not inconsiderable inspiration for the first half of this code was taken from Professor Leonard Parker’s Mathematica notebook “Curvature and the Einstein Equation,” which is available online as a supplement to [6].

We can now define a function for the Christoffel symbols from the previous section. This calculation consists of taking partial derivatives of the metric tensor components and one tensor operation. In Mathematica, the dot product, typically used for vectors and matrices, is also able to take tensors and contract indices.
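
The article's version is symbolic Mathematica code; as an independent cross-check, the same calculation can be approximated numerically. The sketch below implements Γ^k_ij = (1/2) g^{kl} (∂_i g_jl + ∂_j g_il − ∂_l g_ij) with central finite differences and checks it on the round unit sphere, whose metric is diag(1, sin²θ).

```python
# Numerical Christoffel symbols for a 2D metric given as a function of coordinates.
import math

def inv2(g):
    """Inverse of a 2x2 matrix (sufficient for the surface examples here)."""
    (a, b), (c, d) = g
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def christoffel(metric, point, h=1e-5):
    """Gamma[k][i][j] at `point`, with metric derivatives by central differences."""
    dim = len(point)
    ginv = inv2(metric(point))
    dg = []  # dg[l][i][j] = d g_ij / d x^l
    for l in range(dim):
        p1 = list(point); p1[l] += h
        p2 = list(point); p2[l] -= h
        g1, g2 = metric(p1), metric(p2)
        dg.append([[(g1[i][j] - g2[i][j]) / (2 * h) for j in range(dim)]
                   for i in range(dim)])
    return [[[0.5 * sum(ginv[k][l] * (dg[i][j][l] + dg[j][i][l] - dg[l][i][j])
                        for l in range(dim))
              for j in range(dim)] for i in range(dim)] for k in range(dim)]

def sphere(p):
    theta, phi = p
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]

gamma = christoffel(sphere, [1.0, 0.5])
print(gamma[0][1][1])  # Gamma^theta_{phi,phi} = -sin(theta)cos(theta)
print(gamma[1][0][1])  # Gamma^phi_{theta,phi} = cot(theta)
```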

We can now use the formulas stated in the second section to define both the covariant and contravariant forms of the Riemannian curvature tensor.

We perform one more tensor operation using the dot product to transform our partially contravariant tensor into one that is purely covariant. Both of these will be called at various points later.

The full function to return the sectional curvatures consists of computing a scaled version of the covariant Riemannian metric tensor.
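
The whole pipeline, from metric to Christoffel symbols to curvature tensor to sectional curvature, can also be sketched numerically (again in Python rather than the article's symbolic Mathematica code, and self-contained, so the Christoffel helper is repeated). It recovers the two classical constant-curvature surfaces: K = +1 for the unit sphere and K = −1 for the hyperbolic plane.

```python
# Numerical sectional curvature K(d_i, d_j) for 2D metrics via finite differences.
import math

def inv2(g):
    (a, b), (c, d) = g
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def christoffel(metric, point, h=1e-5):
    dim = len(point)
    ginv = inv2(metric(point))
    dg = []
    for l in range(dim):
        p1 = list(point); p1[l] += h
        p2 = list(point); p2[l] -= h
        g1, g2 = metric(p1), metric(p2)
        dg.append([[(g1[i][j] - g2[i][j]) / (2 * h) for j in range(dim)]
                   for i in range(dim)])
    return [[[0.5 * sum(ginv[k][l] * (dg[i][j][l] + dg[j][i][l] - dg[l][i][j])
                        for l in range(dim))
              for j in range(dim)] for i in range(dim)] for k in range(dim)]

def sectional_curvature(metric, point, i, j, h=1e-4):
    dim = len(point)
    g = metric(point)
    gam = christoffel(metric, point)
    dgam = []  # dgam[l][k][m][n] = d Gamma^k_mn / d x^l
    for l in range(dim):
        p1 = list(point); p1[l] += h
        p2 = list(point); p2[l] -= h
        a, b = christoffel(metric, p1), christoffel(metric, p2)
        dgam.append([[[(a[k][m][n] - b[k][m][n]) / (2 * h) for n in range(dim)]
                      for m in range(dim)] for k in range(dim)])
    # contravariant components R^l_{jij}, then lower the upper index with g
    rup = [dgam[i][l][j][j] - dgam[j][l][i][j]
           + sum(gam[l][i][m] * gam[m][j][j] - gam[l][j][m] * gam[m][i][j]
                 for m in range(dim))
           for l in range(dim)]
    r_ijij = sum(g[i][l] * rup[l] for l in range(dim))
    return r_ijij / (g[i][i] * g[j][j] - g[i][j] ** 2)

def sphere(p):        # round unit sphere: K = +1 everywhere
    theta, phi = p
    return [[1.0, 0.0], [0.0, math.sin(theta) ** 2]]

def half_plane(p):    # hyperbolic upper half-plane: K = -1 everywhere
    x, y = p
    return [[1.0 / y ** 2, 0.0], [0.0, 1.0 / y ** 2]]

print(sectional_curvature(sphere, [1.0, 0.5], 0, 1))
print(sectional_curvature(half_plane, [0.3, 2.0], 0, 1))
```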

The output consists of a symmetric matrix with zero diagonal entries representing curvatures in the coordinate directions. These diagonal values should not be taken literally, as curvature is undefined given two linearly dependent directions. While this of course does not give all possible sectional curvatures, one may perform a linear transformation on the basis in order to obtain a new metric tensor with arbitrary (linearly independent) vectors as basis elements. From here, the new tensor may be used for computation.

Here is an example with diagonal entries that are functions of the last coordinate.

Any good computation in mathematics must stand up to scrutiny against known cases, so we evaluate our function with the input of hyperbolic 3-space. The two in the exponent should be read as the squaring of the exponential function.

Checking with [7] verifies that this is indeed a global metric tensor for hyperbolic 3-space. As such, we know that it has constant sectional curvature −1 (recall the diagonal entries do not represent any curvature information).

Continuing with the hyperbolic space metric tensor, it is a well-known result in hyperbolic geometry that one is able to scale these first two dimensions to vary the curvature and produce a *pinched curvature* manifold.

If we allow for new constant coefficients in the exponents for positive real numbers and , then we should see explicit bounds on the curvatures.

In this vein, the Riemannian structure for *complex hyperbolic space* is similar to the real case, except for a modification to allow for complex variables.

In this setting, a formula for the metric tensor valid over the entire manifold is available from [8], among other places.

One can verify that, although not constant, the entries in the upper-left block are always bounded between and for positive . This result agrees with sectional curvature in complex hyperbolic space, and so serves as an example of sectional curvature computation where the underlying tensor is not diagonal. A careful review of [8] reminds us that this metric is only well-defined up to rescaling, which can change the values of the sectional curvature. What does not change, however, is the ratio of the largest and smallest curvatures, which is always exactly 4. The introduction in [9] takes considerable care to remind us that definitions change between curvatures in , and even .

Perhaps the most interesting applications of differentiable manifolds and curvature to physics lie in the area of relativity. This discipline uses the idea of a *Lorentzian manifold*, which is defined as a manifold equipped with a Lorentzian metric that has signature (3, 1) instead of the signature (4, 0) of four-dimensional Riemannian manifolds. As noted in the introduction, however, this has no impact on the computations of sectional curvature. Examples of such Lorentzian metrics include the *Minkowski flat spacetime metric*; here c is the familiar constant, the speed of light.

Justifying the name of *flat* spacetime, our curvature calculation guarantees all sectional curvatures are identically zero.

More generic Lorentzian manifolds may have nonzero curvature. To this end, we examine the *Schwarzschild metric*, which describes spacetime outside a spherical mass such that the gravitational field outside the mass satisfies Einstein’s field equations. This is most commonly viewed in the context of a black hole and how spacetime behaves nearby. More details on the following tensor can be found in [10].

In the following, r, θ and φ are standard spherical coordinates for three-dimensional space and t represents time. With this setup, we can calculate the sectional curvature of spacetime for regions outside such a spherical mass.

This result indicates that the sectional curvature is directly proportional to the mass and inversely proportional to the distance from the object. In particular, there is a singularity at r = 0, indicating that curvature “blows up” near the center of the mass. Indeed, these results are in line with Flamm’s paraboloid, the graphical representation of a constant-time equatorial slice of the Schwarzschild metric, whose details can be found in [11].

In fact, the calculations we have done already allow us to compute one further object of interest for a Riemannian or pseudo-Riemannian manifold: the Ricci curvature. The Ricci curvature is a tensor that contracts the curvature tensor and is computable when one has the contravariant Riemannian curvature tensor. Below we use a built-in function for tensors to contract the first and third indices of the contravariant Riemannian curvature tensor to obtain a matrix containing condensed curvature information (see [12] for more information).

The values 1 and 3 above refer to the dimensions we are contracting. In general, the corresponding indices must vary over sets of the same size; here all dimensions have indices that vary over a set whose size is the number of coordinates. We compute the Ricci curvature for some of the previous examples.

The fact that the Ricci curvature vanishes for the above solution to the Einstein field equation is a consequence of its symmetries. In general, the Ricci curvature for other solutions is nonzero. Notice that for this example (and, trivially, for the flat one), all the information from the Ricci tensor is contained in the diagonal elements. This is always the case for a diagonal metric tensor [12]. As such, we may sometimes be interested only in these values, so we take the diagonal in such a case.

The supervising author would like to thank Dr. Nicolas Robles for suggesting the submission of this article to *The Mathematica Journal*. We would also like to thank Leonard Parker, who authored the notebook file available at [6], which greatly illuminated some of the calculations. We are also very grateful to the referee and especially the editor, whose contributions have made this article much more accurate, legible and efficient.

[1] M. do Carmo, Differential Geometry of Curves & Surfaces, Mineola, NY: Dover Publications, Inc., 2018.

[2] J. M. Lee, Introduction to Smooth Manifolds, Graduate Texts in Mathematics, 218, New York: Springer, 2003.

[3] C. Stover and E. W. Weisstein, “Metric Tensor” from MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/MetricTensor.html.

[4] “ParametricPlot3D” from Wolfram Language & System Documentation Center—A Wolfram Web Resource. reference.wolfram.com/language/ref/ParametricPlot3D.html.

[5] F. Catoni, D. Boccaletti, R. Cannata, V. Catoni, E. Nichelatti and P. Zampetti, The Mathematics of Minkowski Space-Time, Frontiers in Mathematics, Basel: Birkhäuser Verlag, 2008.

[6] J. B. Hartle, Gravity: An Introduction to Einstein’s General Relativity, San Francisco: Addison-Wesley, 2003. web.physics.ucsb.edu/~gravitybook/math/curvature.pdf.

[7] J. G. Ratcliffe, Foundations of Hyperbolic Manifolds, 2nd ed., Graduate Texts in Mathematics, 149, New York: Springer, 2006.

[8] J. Parker, “Notes on Complex Hyperbolic Geometry” (Jan 10, 2020). maths.dur.ac.uk/~dma0jrp/img/NCHG.pdf.

[9] W. M. Goldman, Complex Hyperbolic Geometry, Oxford Mathematical Monographs, Oxford Science Publications, New York: Oxford University Press, 1999.

[10] R. Adler, M. Bazin and M. Schiffer, Introduction to General Relativity, New York: McGraw-Hill, 1965.

[11] R. T. Eufrasio, N. A. Mecholsky and L. Resca, “Curved Space, Curved Time, and Curved Space-Time in Schwarzschild Geodetic Geometry,” General Relativity and Gravitation, 50(159), 2018. doi:10.1007/s10714-018-2481-2.

[12] L. A. Sidorov, “Ricci Tensor,” Encyclopedia of Mathematics (M. Hazewinkel, ed.), Netherlands: Springer, 1990. www.encyclopediaofmath.org/index.php/Ricci_tensor.

E. Fairchild, F. Owen and B. Burns Healy, “Sectional Curvature in Riemannian Manifolds,” The Mathematica Journal, 2020. https://doi.org/10.3888/tmj.22-1.

Elliott Fairchild is a high-school student at Cedarburg High School. He particularly enjoys problems in analysis, and is always looking for more research opportunities.

Francis Owen is an undergraduate student at the University of Wisconsin-Milwaukee. His major is Applied Mathematics and Computer Science, and he is eager to find new programming opportunities.

Brendan Burns Healy is a Visiting Assistant Professor at the University of Wisconsin-Milwaukee. Though a geometric group theorist and low-dimensional topologist by training, he also enjoys problems of computation and coding.

**Elliott Fairchild**

*Department of Mathematical Sciences
University of Wisconsin-Milwaukee
3200 N. Cramer St.
Milwaukee, WI 53211*

**Francis Owen**

*Department of Mathematical Sciences
University of Wisconsin-Milwaukee
3200 N. Cramer St.
Milwaukee, WI 53211*

**Brendan Burns Healy, PhD**

Department of Mathematical Sciences

University of Wisconsin-Milwaukee

3200 N. Cramer St.

Milwaukee, WI 53211

*www.burnshealy.com*

We study the distribution of eigenspectra for operators of the form with self-adjoint boundary conditions on both bounded and unbounded interval domains. With integrable potentials , we explore computational methods for calculating spectral density functions involving cases of discrete and continuous spectra where discrete eigenvalue distributions approach a continuous limit as the domain becomes unbounded. We develop methods from classic texts in ODE analysis and spectral theory in a concrete, visually oriented way as a supplement to introductory literature on spectral analysis. As a main result of this study, we develop a routine for computing eigenvalues as an alternative to , resulting in fast approximations to implement in our demonstrations of spectral distribution.

We follow methods of the texts by Coddington and Levinson [1] and by Titchmarsh [2] (both publicly available online via archive.org) in our study of the operator and the associated problem

(1)

where on the interval with real parameter and boundary condition

(2)

for fixed , where . For continuous (the set of absolutely integrable functions on ), we study the spectral function associated with (1) and (2) using two main methods: First, following [1], we approximate by step functions associated with related eigenvalue problems on finite intervals for some sufficiently large positive ; then, we apply asymptotic solution estimates along with an explicit formula for spectral density [2]. For some motivation and clarification of terms, we recall a major application: For certain solutions of (1) and (2) and for any (the set of square-integrable functions on ), a corresponding solution to (1) may take the form

where

(in a sense described in Theorem 3.1 of Chapter 9 [1]); here, is said to be a spectral transform of . By way of such spectral transforms, the differential operator may be represented alternatively in the integral form

where induces a measure by which (roughly, the set of square-integrable functions when integrated against ) and by which Parseval’s equality holds. Typical examples are the complete set of orthogonal eigenfunctions for and the corresponding Fourier sine transform in the limiting case (cf. Chapter 9, Section 1 [1]).

For a fixed, large finite interval , we consider the problem (1), (2) along with the boundary condition

(3)

(), which together admit an eigensystem with correspondence

where the eigenvalues satisfy and the eigenfunctions form a complete basis for . Since the associated spectral function is a step function with jumps at the various , we first estimate these by way of a related equation arising from Prüfer (phase-space) variables and compute the corresponding jumps .

Then, we use interpolation to approximate the continuous spectral function using data from a case of large at points and using

(4)

imposing the condition for all .
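The passage from the step function to its continuous limit can be checked concretely in the free case, where the limit is that of the Fourier sine transform mentioned earlier. The sketch below is a Python stand-in (the article's Mathematica code is not reproduced here); it assumes the standard normalization with the Dirichlet solution satisfying phi(0) = 0, phi'(0) = 1, for which the jump at the eigenvalue lambda_n = (n*pi/L)^2 is 1/∫ phi^2 = 2*lambda_n/L and the limiting density is sqrt(lambda)/pi.

```python
import math

def rho_step(lam, L):
    """Step spectral function for -y'' = lam*y on [0, L], y(0) = y(L) = 0,
    with phi(x, lam) = sin(sqrt(lam) x)/sqrt(lam) (so phi(0)=0, phi'(0)=1):
    eigenvalues lam_n = (n pi / L)^2 carry jumps r_n = 2*lam_n/L."""
    total, n = 0.0, 1
    while True:
        lam_n = (n * math.pi / L) ** 2
        if lam_n > lam:
            return total
        total += 2.0 * lam_n / L
        n += 1

def rho_continuous(lam):
    """Limiting spectral function of the Fourier sine transform:
    d rho = (sqrt(lam)/pi) d lam, integrated from 0."""
    return 2.0 / (3.0 * math.pi) * lam ** 1.5
```

For a moderately large interval the step function already tracks the continuous limit closely; for example, `rho_step(10, 200)` agrees with `rho_continuous(10)` to well under a percent.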

We compare our results with those of a well-known formula [2] appropriate to our case on , which we outline as follows: For fixed , let be the solution to (1) with boundary values

for which the asymptotic formula

(5)

holds as . Then we have

(6)

from Section 3.5 [2].
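For orientation, the asymptotic-fit idea can be sketched numerically. The following is a Python stand-in, not the article's code: it assumes the Dirichlet solution with phi(0) = 0, phi'(0) = 1, reads off the asymptotic amplitude A from the identity A^2 = phi^2 + phi'^2/lambda at a large x (exact for the free equation, asymptotic for short-range potentials), and uses the standard density formula rho'(lambda) = 1/(pi sqrt(lambda) A^2), which in the free case reproduces sqrt(lambda)/pi.

```python
import math

def phi_and_deriv(lam, q, x_end, n_steps=8000):
    """RK4 for -y'' + q(x) y = lam*y with y(0) = 0, y'(0) = 1."""
    h = x_end / n_steps
    x, y, yp = 0.0, 0.0, 1.0
    def f(x, y, yp):
        return (yp, (q(x) - lam) * y)
    for _ in range(n_steps):
        k1 = f(x, y, yp)
        k2 = f(x + h/2, y + h*k1[0]/2, yp + h*k1[1]/2)
        k3 = f(x + h/2, y + h*k2[0]/2, yp + h*k2[1]/2)
        k4 = f(x + h, y + h*k3[0], yp + h*k3[1])
        y  += h * (k1[0] + 2*k2[0] + 2*k3[0] + k4[0]) / 6
        yp += h * (k1[1] + 2*k2[1] + 2*k3[1] + k4[1]) / 6
        x  += h
    return y, yp

def spectral_density(lam, q, x_fit=40.0):
    """Asymptotic fit: phi ~ A sin(sqrt(lam) x + eta) for large x, so
    A^2 = phi^2 + phi'^2/lam there, and rho'(lam) = 1/(pi sqrt(lam) A^2)."""
    y, yp = phi_and_deriv(lam, q, x_fit)
    A2 = y * y + yp * yp / lam
    return 1.0 / (math.pi * math.sqrt(lam) * A2)
```

With `q = lambda x: 0.0`, `spectral_density(4.0, q)` returns approximately `2/pi`, as the free-case formula predicts.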

Finally, in the last section, we apply the above techniques to extend our study to operators on large domains and on , where spectral matrices take the place of spectral functions as a matrix analog of spectral transforms on these types of intervals (cf. equation (5.5) [1]). The techniques are described in detail below, but it is of particular interest that our computations uncover a striking pattern in a discrete-spectrum case, as we are forced to reformulate our approach according to certain eigen-subspaces involved: our desired spectral approximations are resolved by way of an averaging procedure in forming Riemann sums.

Various sections of Chapters 7–9 [1] (see also [3] and related articles) present useful introductory discussion applied to material presented in this article; yet, with our focus on equations (1)–(6), one may proceed given basic understanding of Riemann–Stieltjes integration along with knowledge of ordinary differential equations and linear algebra, commensurate with (say) the use of and .

We compute eigenvalues by first computing solutions on to the following, arising from Prüfer variables (equation 2.4, Chapter 8 [1]):

(7)

Here, , where is a nontrivial solution to (1), (2) and (3) and satisfies

(8)

for positive integers . We interpolate to approximate such solutions as an efficient means to invert (8) in the variable . We use the following function, based on (7), throughout this article.
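Since the article's Mathematica function is not reproduced here, the Prüfer-based eigenvalue routine can be sketched in Python. This sketch assumes Dirichlet conditions at both ends, for which theta(0) = 0 and the eigenvalues are characterized by theta(b, lambda_n) = n*pi, and takes q = 0 on [0, pi] so that the exact eigenvalues n^2 are available as a check; theta(b, .) is strictly increasing in lambda, so bisection inverts (8).

```python
import math

def prufer_theta(lam, q, b, n_steps=2000):
    """RK4 integration of the Pruefer phase equation
    theta' = cos(theta)^2 + (lam - q(x)) sin(theta)^2, theta(0) = 0."""
    h = b / n_steps
    x, th = 0.0, 0.0
    f = lambda x, th: math.cos(th)**2 + (lam - q(x)) * math.sin(th)**2
    for _ in range(n_steps):
        k1 = f(x, th)
        k2 = f(x + h/2, th + h*k1/2)
        k3 = f(x + h/2, th + h*k2/2)
        k4 = f(x + h, th + h*k3)
        th += h * (k1 + 2*k2 + 2*k3 + k4) / 6
        x += h
    return th

def eigenvalue(n, q, b, lam_lo=0.0, lam_hi=30.0, tol=1e-8):
    """Invert theta(b, lam) = n*pi by bisection (theta(b, .) increases in lam)."""
    target = n * math.pi
    while lam_hi - lam_lo > tol:
        mid = 0.5 * (lam_lo + lam_hi)
        if prufer_theta(mid, q, b) < target:
            lam_lo = mid
        else:
            lam_hi = mid
    return 0.5 * (lam_lo + lam_hi)

# Free case on [0, pi] with Dirichlet ends: exact eigenvalues are n^2.
eigs = [eigenvalue(n, lambda x: 0.0, math.pi) for n in (1, 2, 3, 4)]
```

The returned values agree with 1, 4, 9, 16 to well within the integration tolerance.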

Consider an example with , , and potential for parameter with , , in the case , .

We create an interpolation approximation for eigenvalues .

It is instructive to graphically demonstrate the theory behind this method. Here, we consider the eigenvalues as those values of where the graph of intersects the various lines as we use to find (or ), our maximum index , depending on .

We choose these boundary conditions so that we may compare our results with those of applied to the corresponding problem (1) and (2) using .

We now compare and contrast the methods in this case. The percent differences of the corresponding eigenvalues are all less than 0.2%, well within our limits of accuracy.

In contrast, our interpolation method allows some direct control of which eigenvalues are to be computed, whereas (in the default setting) outputs a list of up to 39 values, starting from the first. Moreover, our method admits nonhomogeneous boundary conditions, whereas admits only homogeneous conditions, Dirichlet or Neumann.

We proceed to build our approximate spectral density function for the problem (1) and (2) on with the same potential as above. We compute eigenvalues likewise but now on a larger interval for and with nonhomogeneous boundary conditions, say given by , (albeit does not depend on ).

We compute eigenvalues via our interpolation method and compute a minimum (or ) as well as a maximum index so as to admit only positive eigenvalues; is supported on and negative eigenvalues result in dubious approximations by .

We now compute the values .

We now apply the method of [2] as outlined in equation (6). We use to include data from an interval near the endpoint that includes at least one half-period of the period of the fitting functions and .

The function may return non-numerical results among the first few, in which case we recommend that either or be readjusted or that be set large enough to disregard such results.

We now compare our results of the discrete and continuous (asymptotic fit) spectral density approximations.

We compare the results by plotting percent differences, all being less than 0.1%.

We chose as above because, in part, the solutions can be computed in terms of well-known (modified Bessel) functions. Replacing by , for , the solutions are linear combinations of

(9)

From asymptotic estimates (cf. equation 9.6.7 [4]), we see that the former is dominant and the latter is recessive as when . Then, from Chapter 9 [1], equation 2.13 and Theorem 3.1, we obtain the density function by computing

(10)

where is a solution as above and is a solution with boundary values , . (Here, is commonly known as the Titchmarsh–Weyl -function.) In the following code, we produce the density function in exact form by replacing functions from (9), the dominant by 1 and the recessive by 0, to compute the inside limit and thereafter simply allowing to be real.

We likewise compare the exact formula for the continuous spectrum with the discrete results, noting that the exact graph appears to essentially be the same as that obtained by our asymptotic fitting method (not generally expecting the fits to be accurate for small !).

For the operator we now extend our study to large domains in the discrete-spectrum case and to the domain in the continuous-limit case. We choose an odd function potential of the form for positive constants , . We focus on the spectral density associated with specific boundary values at and an associated pair of solutions to (1): namely, we consider expansions in the pair and such that

(11)

We apply the above computational methods to the analytical constructs from Chapter 5 [1] in both the discrete and continuous cases. First, for the discrete case, we compute spectral matrices associated with self-adjoint boundary-value problems and the pair as in (11): We estimate eigenvalues for an alternative two-point boundary-value problem on for (moderately) large to compute the familiar jumps of the various components . These components induce measures that appear in the following form of Parseval’s equality for square-integrable functions on (taken in a certain limiting sense):

(real-valued case). Second, we compute the various densities as limits as by the formulas

(12)

where and are certain limits of -functions, related to equation (10), but for our ODE problem on domains and , respectively. The densities are computed by procedures more elaborate than (6), as discussed later. Then, we compare results of the discrete case as in (4), approximating

(13)

After choosing (self-adjoint) boundary conditions (of which the limits happen to be independent)

(14)

on an interval , we estimate eigenvalues and compute coefficients , from the linear combinations

for the associated orthonormal (complete) set of eigenfunctions ; , whereby

(real-valued case). Here, the functions result by normalizing eigenfunctions satisfying (14) so that we obtain

We are ready to demonstrate. Let us choose , and , (arbitrary). Much of the procedure follows as above, with minor modification, as we include to obtain the values and (the next result may take around three minutes on a laptop).

We now approximate the density functions by plotting where

(15)

(for certain ) as we compute the difference quotients at the various jumps, over even and odd indices separately, and assign the corresponding sums to the midpoints of corresponding intervals .

We give the plots below, in comparison with those of the continuous spectra, and give a heuristic argument in the Appendix as to why this approach works.

First, we apply the asymptotic fitting method using the solutions and . Here, we have to compute full complex-valued formulas for the corresponding -functions (cf. Section 5.7 [2]), where a slight modification of the derivation of , via a change of variables and a complex conjugation, results in (see Appendix).

We now compare the result of the discrete and asymptotic fitting methods for the elements .

We have deferred some discussion on our use of , comparison of eigenvalue computations, discrete eigenspace decomposition and Weyl -functions to this section.

First, we have used to suppress messages warning that some solutions may not be found. From Chapter 8 [1], we expect unique solutions since the functions are strictly increasing. We have also used to suppress various messages from and other related functions regarding small values of to be expected with short-range potentials and large domains.

Second, our formulation of and the midpoints as in (15) arises from a decomposition of the eigenspace by even and odd indices. We motivate this decomposition by an example plot of the values , where the dichotomous behavior is quite pronounced, certainly for large .

We are thus inspired to compute the quotients over even and odd indices separately. Then, we consider, say, a relevant expression from Parseval’s equality: for appropriate Fourier coefficients , , associated with respective solutions , we write

We suppose that and for the corresponding transforms in the limit . Of course, a rigorous argument is beyond the scope of this article.

Finally, we elaborate on the calculations of the -functions and : Given the asymptotic expressions

as (resp.), we follow Section 5.7 of [2], making changes as needed, with a modification via complex conjugation (, say) for to arrive at

The author would like to thank the members of MAST for helpful and motivating discussions concerning preliminary results of this work in particular and Mathematica computing in general.

[1] E. A. Coddington and N. Levinson, Theory of Ordinary Differential Equations, New York: McGraw-Hill, 1955. archive.org/details/theoryofordinary00codd.

[2] E. C. Titchmarsh, Eigenfunction Expansions Associated with Second-Order Differential Equations, 2nd ed., London: Oxford University Press, 1962. archive.org/details/eigenfunctionexp0000titc.

[3] E. W. Weisstein, “Operator Spectrum” from MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/OperatorSpectrum.html.

[4] M. Abramowitz and I. A. Stegun, eds., Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, New York: Wiley, 1972.

C. Winfield, “From Discrete to Continuous Spectra,” The Mathematica Journal, 2019. https://doi.org/10.3888/tmj.21-3.

C. Winfield holds an MS in physics and a PhD in mathematics and is a member of the Madison Area Science and Technology amateur science organization, based in Madison, WI.

**Christopher J. Winfield**

Madison Area Science and Technology

3783 US Hwy. 45

Conover, WI 54519

*cjwinfield2005@yahoo.com*

H. S. M. Coxeter wrote several geometry film scripts that were produced between 1965 and 1971 [1]. In 1992, Coxeter gave George Beck mimeographs of two scripts that had not been made. Beck wrote Mathematica code for the stills and animations. This material was added to the third edition of Coxeter’s *The Real Projective Plane* [2]. This article updates the Mathematica code.

The example of a thermometer makes it easy to see how the real numbers (positive, zero and negative) can be represented by the points of a straight line.

On the axis of ordinary analytic geometry, the number is represented by the point .

Given any two such numbers, and , we can set up geometrical constructions for their sum, difference, product, and quotient.

However, these constructions require a scaffolding of extra points and lines. It is by no means obvious that a different choice of scaffolding would yield the same final results.

The object of the present program is to make use of a circle (or any other conic) instead of the line, so that the constructions can all be performed with a straight edge, and the only arbitrariness is in the choice of the positions of three of the numbers (for instance, 0, 1 and 2).

Although this is strictly a chapter in projective geometry, let us begin with a prologue in which the scale of abscissas on the axis is transferred to a circle by the familiar process of stereographic projection.

A circle of any radius (say 1, for convenience) rests on the axis at the origin 0, and the numbers are transferred from this axis to the circle by lines drawn through the opposite point.

That is, the point at the top. In this manner, a definite number is assigned to every point on the circle except the topmost point itself.

The numbers come closer and closer to this point on one side, and the numbers come closer and closer on the other side.

So it is natural to assign the special symbol (infinity) to this exceptional point: the only point for which no proper number is available.

The tangent at this exceptional point is, of course, parallel to the axis; that is, parallel to the tangent at the point 0.

Having transferred all the numbers to the circle, we can forget about the axis; but the tangent at the point infinity will play an important role in the construction of sums.

For instance, there is one point on this tangent that lies on the line joining points 1 and 2, also on the line joining 0 and 3, and on the line joining −1 and 4. We notice that these pairs of numbers all have the same sum: .
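These concurrences are easy to verify in coordinates. The sketch below (Python, exact rational arithmetic; the coordinates assume the unit circle x^2 + (y-1)^2 = 1 resting on the axis at 0, projected from the top point (0, 2), as a model of the construction described above) shows that the chord joining the points numbered a and b meets the tangent at infinity, the line y = 2, at x = 4/(a+b), which depends only on the sum.

```python
from fractions import Fraction

def point(t):
    """Stereographic image of the number t on the circle x^2 + (y-1)^2 = 1,
    projecting the x axis from the top point (0, 2)."""
    t = Fraction(t)
    d = t * t + 4
    return (4 * t / d, 2 * t * t / d)

def meet_top_tangent(a, b):
    """x coordinate where the join of point(a) and point(b) meets the
    tangent at infinity, the line y = 2."""
    (xa, ya), (xb, yb) = point(a), point(b)
    s = (2 - ya) / (yb - ya)          # chord parameter at height y = 2
    return xa + s * (xb - xa)

# The pairs 1 & 2, 0 & 3, -1 & 4 all have sum 3 and concur on the tangent:
meets = [meet_top_tangent(1, 2), meet_top_tangent(0, 3), meet_top_tangent(-1, 4)]
# each equals 4/3
```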

Similarly, the tangent at 1 meets the tangent at infinity in a point that lies on the lines joining 0 and 2, −1 and 3, −2 and 4, in accordance with the equations .

These results could all be verified by elementary analytic geometry, but there is no need to do this, because we shall see later that a general principle is involved.

Having finished the Euclidean prologue, let us see how far we can go with the methods of projective geometry. Let symbols 0, 1, infinity be assigned to any three distinct points on a given conic.

There is a certain line through 0 concurrent with the tangents at infinity and 1; let this line meet the conic again in 2.

(Alternatively, if we had been given 0, 1, 2 instead of 0, 1, infinity, we could have reconstructed infinity as the point of contact of the remaining tangent from the point where the tangent at 1 meets the line 02.)

We now have the beginning of a geometrical interpretation of all the real numbers.

To obtain 3, we join 1 and 2, see where this line meets the tangent at infinity, join this point of intersection to 0, and assign the symbol 3 to the point where this line meets the conic again. Thus the line joining 0 and 3 and the line joining 1 and 2 both meet the tangent at infinity in the same point.

More generally, we define addition in such a way that two pairs of points have the same sum if their joins are concurrent with the tangent at the point infinity.

In other words, we define the sum of any two points and to be the remaining point of intersection of the conic with the line joining 0 to the point where the tangent at infinity meets the join of and .

To justify this definition, we must make sure that it agrees with our usual requirements for the addition of numbers: the commutative law

a unique solution for every equation of the form

and the associative law

The commutative law is satisfied immediately, as our definition for involves and symmetrically.

The equation is solved by choosing so that and have the same sum as and .

Thus the only possible cause of trouble is the associative law; we must make sure that for any three points , , (not necessarily distinct), the sum of and is the same as the sum of and .

For this purpose, we make use of a special case of Pascal’s theorem, which says that if is a hexagon inscribed in a conic, the pairs of opposite sides (namely and , and , and ) meet in three points that lie on a line, called the Pascal line of the given hexagon.
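Pascal's theorem itself can be checked exactly in coordinates. In the sketch below (Python, exact rational arithmetic; the hexagon's vertices are taken on the unit circle via the rational parametrization ((1 - t^2)/(1 + t^2), 2t/(1 + t^2)), and the six parameter values are arbitrary choices), the three intersections of opposite sides are computed homogeneously and tested for collinearity.

```python
from fractions import Fraction

def circle_point(t):
    """Rational point on the unit circle, parametrized by t."""
    t = Fraction(t)
    d = 1 + t * t
    return ((1 - t * t) / d, 2 * t / d)

def line_through(p, q):
    """Homogeneous coordinates of the line through two affine points."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)

def meet(l1, l2):
    """Homogeneous intersection point of two lines (cross product)."""
    (a1, b1, c1), (a2, b2, c2) = l1, l2
    return (b1 * c2 - c1 * b2, c1 * a2 - a1 * c2, a1 * b2 - b1 * a2)

def collinear(p, q, r):
    """Exact determinant test for three homogeneous points."""
    return (p[0] * (q[1] * r[2] - q[2] * r[1])
          - p[1] * (q[0] * r[2] - q[2] * r[0])
          + p[2] * (q[0] * r[1] - q[1] * r[0])) == 0

# Hexagon ABCDEF inscribed in the circle; opposite sides AB/DE, BC/EF, CD/FA.
A, B, C, D, E, F = (circle_point(t) for t in (0, 1, 3, -2, Fraction(1, 2), 5))
P1 = meet(line_through(A, B), line_through(D, E))
P2 = meet(line_through(B, C), line_through(E, F))
P3 = meet(line_through(C, D), line_through(F, A))
```

The homogeneous formulation also covers the case where an intersection is a point at infinity.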

In 1639, when Blaise Pascal was sixteen years old, he discovered this theorem as a property of a circle.

He then deduced the general result by joining the circle to a point outside the plane by a cone and then considering the section of this cone by an arbitrary plane.

We do not know how he proved this property of a hexagon inscribed in a circle, because his original treatise was lost, but we do know how he might have done it, using only the first three books of Euclid’s *Elements*. In our own time, an easier proof can be found in any textbook on projective geometry.

Each hexagon has its own Pascal line. If we fix five of the six vertices and let the sixth vertex run round the conic, we see the Pascal line rotating about a fixed point.

If this fixed point is outside the conic, we can stop the motion at a stage when the Pascal line is a tangent. This is the special case that concerns us in the geometrical theory of addition.

The hexagon shows that the sum of and is equal to the sum of and .

Beginning with 0, 1 and infinity, we can now construct the remaining positive integers

and so on.

We can also construct the negative integers , given by

and so on.

Alternatively, we can construct the negative integers using

and so on.

By fixing while letting vary, we obtain a vivid picture of the transformation that adds to every number . The points and chase each other round the conic, irrespective of whether happens to be positive or negative.

In our construction for the point 2, we tacitly assumed that the tangent at 1 can be regarded as the join of 1 and 1.

More generally, the join of and meets the tangent at infinity in a point from which the remaining tangent has, for its point of contact, a point such that , namely, , which is the arithmetic mean (or average) of and .

This result holds not only when is even but also when is odd; for instance, when and are consecutive integers. In this way we can interpolate 1/2 between 0 and 1, 1 1/2 between 1 and 2 and so on.
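This midpoint interpolation can be checked in the same circle model (a Python sketch in exact arithmetic; the circle x^2 + (y-1)^2 = 1 with projection from (0, 2) is an assumed coordinate model, not part of the synthetic argument): the join of a and b meets the tangent at infinity in a point T, and the tangent at the point numbered (a+b)/2 indeed passes through T.

```python
from fractions import Fraction

def point(t):
    """Number t on the circle x^2 + (y-1)^2 = 1, projected from the top (0, 2)."""
    t = Fraction(t)
    d = t * t + 4
    return (4 * t / d, 2 * t * t / d)

def meet_top_tangent(a, b):
    """Point where the join of point(a), point(b) meets the tangent at
    infinity, the line y = 2."""
    (xa, ya), (xb, yb) = point(a), point(b)
    s = (2 - ya) / (yb - ya)
    return (xa + s * (xb - xa), Fraction(2))

def is_tangent_at(T, m):
    """True if the tangent to the circle at point(m) passes through T;
    the tangent at (xm, ym) is x*xm + (y - 1)*(ym - 1) = 1."""
    xm, ym = point(m)
    x, y = T
    return x * xm + (y - 1) * (ym - 1) == 1

# The join of 0 and 1 meets the tangent at infinity at T; the second tangent
# from T touches the conic at the midpoint 1/2.
T = meet_top_tangent(0, 1)
```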

We shall find it convenient to work in the scale of 2 (or binary scale), so that the number 2 itself is written as 10, one half as 0.1, one quarter as 0.01, three quarters as 0.11 and so on.

We can now interpolate

1.1 between 1 and 10, …

1.01 between 1 and 1.1, …

… and so on to the eighths between 1 and 10.

In fact, we can construct a point for every number that can be expressed as a terminating “decimal” in the binary scale. By a limiting process, we can thus theoretically assign a position to every real number.

For instance, the square root of two, being (in the binary scale)

is the limit of a sequence of constructible numbers, namely its truncations to successively more binary places:
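The truncations of the binary expansion give exactly such a sequence of constructible (dyadic) numbers; a quick sketch in Python (plain floating point is adequate at this precision):

```python
import math

def dyadic_truncations(x, k_max):
    """Truncate x to k binary places, k = 0..k_max: floor(x * 2^k) / 2^k.
    Each value is a terminating binary "decimal", hence constructible."""
    return [math.floor(x * 2**k) / 2**k for k in range(k_max + 1)]

approx = dyadic_truncations(math.sqrt(2), 10)
# approx begins 1.0, 1.0, 1.25, 1.375, 1.375, 1.40625, ...
```

The sequence is non-decreasing, and the k-th term is within 2^(-k) of the square root of two.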

Conversely, by a process of repeated bisection, we can assign a binary “decimal” to every point on the conic but one. (The “but one” is, of course, the point to which we arbitrarily assigned the symbol infinity.)

We can now define multiplication in terms of the same three points 0, 1 and infinity.

Two pairs of points have the same product if their joins are concurrent with the line joining 0 and infinity.

The geometrical theory of projectivities is somewhat too complicated to describe here, so let us be content to remark that, if we pursued it, we could prove that our definition for addition is consistent with this definition for multiplication.

The product is positive if the point of concurrence is outside, negative if it is inside the conic.

In other words, we define the product of any two points and on the conic to be the remaining point of intersection of the conic with the line joining 1 to the point where the line joining 0 and infinity meets the line joining and .
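As with sums, this definition can be checked in the circle model used earlier (a Python sketch in exact rational arithmetic; the circle x^2 + (y-1)^2 = 1 with 0 at the bottom and infinity at the top is an assumed coordinate model): the join of a and b meets the line joining 0 and infinity, here the y axis, at a height depending only on the product ab. The sign rule quoted above is visible too, since the interior of the circle along that line is 0 < y < 2.

```python
from fractions import Fraction

def point(t):
    """Number t on the circle x^2 + (y-1)^2 = 1, projected from the top (0, 2)."""
    t = Fraction(t)
    d = t * t + 4
    return (4 * t / d, 2 * t * t / d)

def meet_zero_infinity(a, b):
    """y coordinate where the join of point(a) and point(b) meets the line
    through 0 and infinity (the y axis)."""
    (xa, ya), (xb, yb) = point(a), point(b)
    s = -xa / (xb - xa)
    return ya + s * (yb - ya)

# Pairs with product 6 concur on the y axis, outside the circle:
heights = [meet_zero_infinity(1, 6), meet_zero_infinity(2, 3), meet_zero_infinity(-2, -3)]
# each equals 6
inside = meet_zero_infinity(1, -2)   # product -2: the meeting point lies inside
```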

Of course, the question arises as to whether this definition agrees with our usual requirements for the multiplication of numbers:

- the commutative law
- a unique solution for every equation of the form (with )
- the associative law

The commutative law is satisfied immediately, as our definition for involves and symmetrically.

The equation is solved by choosing so that and have the same product as 1 and .

Finally, another application of Pascal’s theorem suffices to show the associative law.

That is, for any three points , , , the product of and is equal to the product of and . In fact, the appropriate hexagon is .

By fixing while letting vary, we obtain a vivid picture of the transformation that multiplies every number by . If is positive, the points and chase each other round the conic.

But if is negative, they go round in opposite directions.

The familiar identity is illustrated by the concurrence of the tangent at 2 with the line joining 1 and 4 and the line joining 0 and infinity.

More generally, if and are any two numbers having the same sign, the join of the corresponding points meets the line joining 0 to infinity in a point from which the two tangents have, for their points of contact, points such that , namely , where the square root of is the geometric mean of and .

Setting and , we obtain a construction for the square root of two without having recourse to any limiting process. In fact, we have finite constructions for all the “quadratic” numbers commonly associated with Euclid’s straight-edge and compass.
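In the same coordinate model, this finite square-root construction can be carried out numerically (a Python sketch; the chord-of-contact formula for the circle x^2 + (y-1)^2 = 1 is an assumption of the model, and a and b are taken positive with ab not equal to 4, since in that exceptional case the join is parallel to the axis): from the point where the join of a and b meets the 0-infinity line, the tangents touch the conic at the numbers whose square is ab.

```python
import math

def point(t):
    """Number t on the circle x^2 + (y-1)^2 = 1, projected from the top (0, 2)."""
    d = t * t + 4.0
    return (4.0 * t / d, 2.0 * t * t / d)

def meet_zero_infinity(a, b):
    """y coordinate where the join of point(a), point(b) meets the y axis."""
    (xa, ya), (xb, yb) = point(a), point(b)
    s = -xa / (xb - xa)
    return ya + s * (yb - ya)

def geometric_mean_on_conic(a, b):
    """Number at the point of contact of a tangent drawn from where the join
    of a and b meets the 0-infinity line (a, b > 0, a*b != 4 assumed)."""
    y0 = meet_zero_infinity(a, b)
    yc = 1.0 + 1.0 / (y0 - 1.0)      # height of the chord of contact of (0, y0)
    return math.sqrt(4.0 * yc / (2.0 - yc))
```

For instance, starting from 1 and 2 the construction lands on the square root of two, with no limiting process involved.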

One of the most fruitful ideas of the nineteenth century is that of one-to-one correspondence. It is well illustrated by the example of cups and saucers. Suppose we have about a hundred cups and about a hundred saucers and wish to know whether the number of cups is actually equal to the number of saucers. This can be determined, without counting, by the simple device of putting each cup on a saucer, that is, by establishing a one-to-one correspondence between the cups and saucers.

In our first application of this idea to plane geometry, the cups are points, the saucers are lines and the relation “cup on saucer” is incidence. As we know, a line is determined by any two of its points and is of unlimited extent. We say that a point and a line are “incident” if the point lies on the line; that is, if the line passes through the point. It is natural to ask whether the number of points on a line is actually equal to the number of lines through a point. In ordinary geometry both numbers are infinite, but this fact need not trouble us: if we can establish a one-to-one correspondence between the points and lines, there are equally many of each.

The set of all points on a line is called a range and the set of all lines through a point is called a pencil. If the line and the point are not incident, we can establish an elementary correspondence between the range and the pencil by means of the relation of incidence. Each point of the range lies on a corresponding line of the pencil. The range is a section of the pencil (namely the section by the line ) and the pencil projects the range (from the point ).

In our picture, the range is represented by a red point moving along a fixed line (which, for convenience, is taken to be horizontal) and the pencil is represented by a green line rotating around a fixed point .

There is evidently a green line for each position of the red point. But we must admit that for some positions of the green line the red point cannot be seen because it is too far away; in fact, when the green line is parallel to (that is, horizontal), the red point is one of the ideal “points at infinity” that we agree to add to the ordinary plane so as to make the projective plane. Without this ideal point, our elementary correspondence would not be one-to-one: the number of points in the range would be one less than the number of lines in the pencil. In other words, the postulation of ideal points makes it possible for us to express the axioms for the projective plane in such a way that they remain valid when we consistently interchange the words “point” and “line” (and consequently also certain other pairs of words such as “join” and “meet”, “on” and “through”, “collinear” and “concurrent” and so forth). It follows that the same kind of interchange can be made in all the theorems that can be deduced from the axioms.

This principle of duality is characteristic of projective geometry. In the plane we interchange points and lines. In space, the same principle enables us to interchange points and planes, while lines remain lines.

When we regard the elementary correspondence as taking us from the point to the line , we write the capital before the small , as . The inverse correspondence, from to , is denoted by the same sign with the small before the capital , as . If , , , … are particular positions of , and , , , … of , we write all these letters before and after the sign, taking care to keep them in their corresponding order (which need not be the order in which they appear to occur in the figure), .

This notation enables us to exhibit the principle of duality as the possibility of consistently interchanging capital and small letters.

By combining two elementary correspondences, one relating a range to a pencil and the other a pencil to a range, we obtain a perspectivity. This either relates two ranges that are different sections of one pencil, or two pencils that project one range from different centers.

In the former case, two of the symbols with one bar , or can be abbreviated to one with two bars, or, if we wish to specify the point that carries the pencil, we put above the two bars, as .

In the latter case (when two pencils project one range from different centers), the two symbols with one bar are again abbreviated to one with two bars, and if we wish to specify the line that carries the range, we put above the bars.

We can easily go on to combine three or more elementary correspondences. But then we prefer not to increase the complication of the symbols. Instead, we retain the simple symbol (with just one bar) for the product of any number of elementary correspondences. Such a transformation is called a projectivity. Thus elementary correspondences and perspectivities are the two simplest instances of a projectivity.

The product of three elementary correspondences is the simplest instance of a correspondence relating a range to a pencil in such a way that the range is not merely a section of the pencil.

The product of four elementary correspondences, being the product of two perspectivities, shares with a simple perspectivity the property of relating a range to a range or a pencil to a pencil. Now there is the interesting possibility that the initial and final range (or pencil) may be on the same line (or through the same point). We see two moving red points and , on , related by perspectivities from and to an auxiliary red point on . When reaches , on , we have another invariant point; the three red points all come together.

Such a projectivity, having two distinct invariant points, is said to be hyperbolic.

On the other hand, the three lines , and may all meet in a single point , so that coincides with and there is only one invariant point. Such a projectivity is said to be parabolic.

A third possibility is an elliptic projectivity that has no invariant point, but this is more complicated, requiring three perspectivities (i.e., six elementary correspondences). The centers of the three perspectivities are , and . The green lines, rotating around these points, yield four red points. Two of the red points and chase each other along the bottom line .

These two points are related by the elliptic projectivity.

However, this is not the most general elliptic projectivity.

There is a special feature arising from the fact that the points , , lie on the sides of the green triangle. When one of the two red points is at , the other is at , and vice versa: the projectivity interchanges and and is consequently called an involution. Thus we are watching an elliptic involution.

Looking closely, we see that it not only interchanges and but also interchanges every pair of related points. For instance, it interchanges with (on ). An important theorem tells us that for any four collinear points , , , , there is just one involution that interchanges with and with .

We denote it by . At any instant, the two red points are a pair belonging to this involution. Call them and . We now have three pairs of points, , , , on the bottom dark blue line, all belonging to one involution. The other lines form the six sides of a complete quadrangle , which consists of four points (no three collinear) and the six lines that join them in pairs. Two sides are said to be opposite if their point of intersection is not a vertex; for instance, and are a pair of opposite sides.

We see now that the six points named on the bottom dark blue line are sections of the six sides of the quadrangle, and that each related pair comes from a pair of opposite sides. Accordingly the six points, paired in this particular way, are said to form a quadrangular set. Here is another version of the quadrangle and the corresponding quadrangular set , , . As before, is a pair of the involution .

This remains true when we move the bottom dark blue line to a new position so that coincides with and with . Now and are invariant points, and we have a hyperbolic involution , which still interchanges and .

The quadrangular set of six points has become a harmonic set of four points. We say that and are harmonic conjugates of each other with respect to and , and that the four points satisfy the relation .

This means that there is a quadrangle having two opposite sides through and two opposite sides through , while one of the remaining two sides passes through and the other through .

Given , and , we can construct by drawing a triangle whose sides pass through these three points.

Let meet in ; then meets in . Of course, the hyperbolic involution can still be constructed as the product of three perspectivities (with centers , , ).

But the invariant points and enable us to replace these three perspectivities by two, with centers (where meets ) and .
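In coordinates, the harmonic relation is easy to check: if collinear points are represented by numbers on a line, then the fourth point is the harmonic conjugate of the third with respect to the first two exactly when the cross-ratio of the four points equals −1. A small Python sketch (the coordinates are illustrative; the construction in the text is purely synthetic):

```python
from fractions import Fraction

def cross_ratio(a, b, c, d):
    # cross-ratio (A,B; C,D) of four collinear points given by coordinates
    return ((c - a) * (d - b)) / ((c - b) * (d - a))

def harmonic_conjugate(a, b, c):
    # solve (A,B; C,D) = -1 for d; undefined when C is the midpoint of AB,
    # in which case D is the point at infinity of the line
    return (c * (a + b) - 2 * a * b) / (2 * c - a - b)

a, b, c = Fraction(0), Fraction(1), Fraction(1, 3)
d = harmonic_conjugate(a, b, c)
print(d)                        # -1
print(cross_ratio(a, b, c, d))  # -1, confirming the harmonic relation
```

Exact rational arithmetic keeps the check free of rounding error, and swapping the last two arguments of `cross_ratio` confirms that the relation is symmetric: the third point is in turn the harmonic conjugate of the fourth.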

Another product of two perspectivities relates ranges on two distinct lines. The fundamental theorem of projective geometry tells us that a projectivity relating ranges on two such lines is uniquely determined by any three points of the first range and the corresponding three points of the second. There are, of course, many ways to construct the projectivity as the product of two or more perspectivities, but the final result will always be the same.

For instance, there is a unique projectivity relating on the first line to on the second. This means that for any point on there is a definite point on .

The simplest way to construct this projectivity is by means of perspectivities from and , so that is first related to on and then to on . We can regard as a variable triangle whose vertices run along fixed lines , , while the two sides and rotate around fixed points and . The third side joins the projectively related points and .

This construction remains valid when and are of general position, instead of lying on the lines that carry the related ranges. Let meet in and in . Now we have a construction for the unique projectivity that relates to .

As before, the vertices of the variable triangle run along fixed lines , , while the two sides and rotate around the fixed points and . The possible positions for the third side include, in turn, each of the five sides of the pentagon .

Carefully watching this line , we see that it envelops a beautiful curve.

This is the same kind of curve that was constructed quite differently by Menaechmus about 340 BC. Since that time it has been known everywhere as a conic. One important property is that a conic is uniquely determined by any five of its tangents, and that these may be any five lines of which no three are concurrent.

Since the possible positions for our variable line include, in turn, each side of the pentagon , we call its envelope the conic inscribed in this pentagon.

To sum up: Let be a variable point on the diagonal of a given pentagon . Then the point , where meets , and the point , where meets , determine a line whose envelope is the inscribed conic.

For any particular position of (on ), we see a hexagon whose six sides all touch the conic. The three lines , , , which join pairs of opposite vertices, are naturally called diagonals of the hexagon. Thus, if the diagonals of a hexagon are concurrent, the six sides all touch a conic. Conversely, if all the sides of a hexagon touch a conic, five of them can be identified with the lines , , , , . Since the given conic is the only one that touches these fixed lines, the sixth side must coincide with one of the lines that we have constructed. We thus have Brianchon’s theorem: If a hexagon is circumscribed about a conic, the three diagonals are concurrent.

All these results can, of course, be dualized. (Now all the letters that we use are lowercase, representing lines.)

For any pentagon whose vertex is joined to by and to by , there is a unique projectivity relating to .

The sides of the variable triangle rotate about fixed points , , while the two vertices and run along the fixed lines and . The possible positions for the third vertex include, in turn, each of the five vertices of the pentagon.

Carefully watching this moving point , we see that it traces out a curve through these five fixed points (no three concurrent).

What is this curve, the dual of a conic?

One of the many possible definitions for a conic exhibits it as a self-dual figure, with the interesting result that the dual of a conic (regarded as the envelope of its tangents) is again a conic (regarded as the locus of the points of contact of these tangents).

Thus the locus of the point is a conic, and this is the only conic that can be drawn through the five vertices of the pentagon.

To sum up: Let be a variable line through the intersection of two non-adjacent sides of a given pentagon . Then the line , which joins to , and the line , which joins to , determine a point whose locus is the circumscribed conic.

The hexagon , which, for convenience, we rename , yields the dual of Brianchon’s theorem, namely Pascal’s theorem: If is a hexagon inscribed in a conic, the points , , (where pairs of opposite sides meet) are collinear.
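Pascal's theorem is easy to test numerically in the circle case, which is how it was first stated. The Python sketch below (hexagon angles chosen arbitrarily) inscribes a hexagon in the unit circle, forms joins and meets in homogeneous coordinates via cross products, and checks that the three meets of opposite sides are collinear:

```python
import math

def cross(u, v):
    # cross product of 3-vectors: the join of two points, or the
    # meet of two lines, in homogeneous coordinates
    return (u[1]*v[2] - u[2]*v[1],
            u[2]*v[0] - u[0]*v[2],
            u[0]*v[1] - u[1]*v[0])

def circle_point(theta):
    # a point on the unit circle, in homogeneous coordinates
    return (math.cos(theta), math.sin(theta), 1.0)

def normalize(p):
    n = math.sqrt(sum(x * x for x in p))
    return tuple(x / n for x in p)

# hexagon A B C D E F inscribed in the circle (angles are arbitrary)
A, B, C, D, E, F = (circle_point(t) for t in (0.1, 1.0, 2.2, 3.0, 4.1, 5.3))

# the meets of the three pairs of opposite sides
P = normalize(cross(cross(A, B), cross(D, E)))
Q = normalize(cross(cross(B, C), cross(E, F)))
R = normalize(cross(cross(C, D), cross(F, A)))

# P, Q, R are collinear iff the scalar triple product vanishes
det = sum(P[i] * cross(Q, R)[i] for i in range(3))
print(abs(det) < 1e-9)  # True
```

Replacing the circle by any nondegenerate conic (for instance, scaling the second coordinate to get an ellipse) leaves the test passing, since incidence and collinearity are projectively invariant.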

The hexagon that we see is, perhaps, unusual, because its sides cross one another. From the standpoint of projective geometry, this feature is irrelevant. A convex hexagon would serve just as well, but the “diagonal points” would be inconveniently far away. Another natural observation is that our conic looks like the familiar circle. In fact, this famous theorem was first proved for a circle in 1639, when its discoverer, Blaise Pascal, was only sixteen years old. Nobody knows just how he did it, because his original treatise has been lost.

But there is no possible doubt about how he deduced the analogous property of the general conic. He joined the circle and lines to a point outside the plane, obtaining a cone and planes. Then he took the section of this solid figure by an arbitrary plane.

We change the position of the points of the hexagon.

In this way the conic appears in one of its most ancient aspects: as the section of a circular cone by a plane of general position.


Thanks to Gregory Robbins, who sparked this update and was able to read the files from an old diskette.

[1] | College Geometry Project (1963–71). (Dec. 19, 2018) archive.org/details/CollegeGeometry. |

[2] | H. S. M. Coxeter, The Real Projective Plane, 3rd ed., New York: Springer, 1993. |

H. S. M. Coxeter and G. Beck, “The Arithmetic of Points on a Conic and Projectivities,” The Mathematica Journal, 2018. https://doi.org/10.3888/tmj.21-2.

H. S. M. Coxeter (1907–2003) was a Canadian geometer. For an extensive biography, see mathworld.wolfram.com/news/2003-04-02/coxeter.

George Beck earned a B.Sc. (Honours Math) from McGill University and an MA in math from the University of British Columbia. He has been the managing editor of *The Mathematica Journal* since 1997. He has worked for Wolfram Research, Inc. since 1993 in a variety of roles.

**George Beck**

*102-1944 Riverside Drive
Courtenay, B.C., V9N 0E5
Canada*

A comprehensive discussion is presented of the closed-form solutions for the responses of single-degree-of-freedom systems subject to swept-frequency harmonic excitation. The closed-form solutions for linear and octave swept-frequency excitation are presented and these are compared to results obtained by direct numerical integration of the equations of motion. Included is an in-depth discussion of the numerical difficulties associated with the complex error functions and incomplete gamma functions, which are part of the closed-form solutions, and how these difficulties were overcome by employing exact arithmetic. The closed-form solutions allowed the in-depth study of several interesting phenomena. These include the scalloped behavior of the peak response (with multiple discontinuities in the derivative), the significant attenuation of the peak response if the sweep frequency is started at frequencies near or above the natural frequency, and the fact that the swept-excitation response could exceed the steady-state harmonic response.
### Notation

### 1. Introduction

### 2. Equations of Motion

### 3. Closed-Form Solution: Linear Sweep

### 4. Closed-Form Solution: Octave Sweep

### 5. Challenges in Separating Real and Imaginary Parts of Closed-Form Solutions

### 6. Challenges in Numerical Evaluation of the Exact Closed-Form Solutions

### 7. Comparison of Exact and Numerical Solutions

### 8. Construction of Peak Response Curves

### 9. Peak Response Curves for Linear Sweep

#### 9.1 Discontinuities in Derivative of Peak Response Curves at Low Frequencies

### 10. Peak Response Curves for Octave Sweep

#### 10.1 Numerical Integration in the Domain

#### 10.2 Numerical Optimization to Identify Peak Response

#### 10.3 Peak Response Curves for Octave Sweep

### Conclusion

### Acknowledgments

### References

### About the Authors

complex variable

complex variable

dimensionless composite parameter

complex variable

error function

imaginary error function

linear sweep rate in Hz per minute

nonzero start frequency for an octave sweep rate

natural frequency in Hz

complex variable

complex variable

octave sweep rate in octaves per minute

complex variable

time (also used as a dummy integration variable)

upper limit of search for peak values

time at which instantaneous frequency of excitation for linear sweep equals

time at which instantaneous frequency of excitation for octave sweep equals

new independent variable for octave sweep

value at which instantaneous frequency of excitation for octave sweep equals

single-degree-of-freedom system displacement response

single-degree-of-freedom system velocity response

single-degree-of-freedom system acceleration response

initial displacement

initial velocity

complex variable

octave sweep rate in octaves per second

composite parameter for closed-form solution for linear sweep

composite parameter for closed-form solution for linear sweep

composite parameter for closed-form solution for linear sweep

composite parameter for closed-form solution for octave sweep

general phase function

composite parameter for closed-form solution for octave sweep

initial phase value

incomplete gamma function

dummy integration variable

composite variable proportional to

composite parameter for closed-form solution for linear sweep

composite parameter for closed-form solution for linear sweep

natural frequency in radians per second

linear sweep rate in radians per second per second

generalized sweep forcing function

multiplication

composite parameter for closed-form solution for octave sweep

composite parameter for closed-form solution for octave sweep

critical damping ratio

dummy integration variable

Harmonic excitation is a fact of life in systems with rotating machinery, such as liquid rocket engine turbopumps, spacecraft momentum wheels, aircraft turbojet engines, electric plant steam turbines and liquid-transport turbine compressor trains. Associated with high performance are high shaft speeds and the resulting excitation caused by imbalances in the rotating components and imperfections in the shafts and ball bearings. Furthermore, phenomena such as shaft whirl and rotor dynamic instability are critical design aspects. Although performance requirements dictate design parameters such as shaft speed, avoiding certain speeds due to dynamic interactions within the system is also a critical design consideration. Completely avoiding critical speeds may not be possible. For example, if the critical speeds are below the operational shaft speed, then at startup and shutdown, the rotation rate sweeps through them. The magnitude of the response is a function of the sweep rate, system damping and modal gains at the excitation and response locations. In addition, bearing imperfection can produce excitation above and below the operational frequency, and responses to these imperfections are also a function of the sweep rate associated with the startup and shutdown of the system. In addition to rotating machinery considerations, frequency sweep effects are a critical aspect of harmonic base shake vibration testing, as employed in the aerospace industry, for example. Therefore, it has been recognized that being able to predict the vibration response of systems to swept-frequency excitation is critical (e.g. [1–7]).

In 1932 Lewis presented the first response of a single-degree-of-freedom system to linear frequency sweep excitation [1]. He derived an expression for the envelope functions that contain the peak values. The limited quantitative results presented by Lewis were obtained by graphical integration for various levels of damping and sweep rate. Lewis concluded that the greater the sweep rate, the larger the attenuation relative to the steady-state response, and the higher the instantaneous frequency of excitation at which the peak envelope response occurs. In 1967, Fearn [2] developed an algebraic expression for the time at which the peak displacement response of a single-degree-of-freedom system subjected to a linear frequency sweep occurs, along with an approximation to the magnitude of that response. Until Cronin’s dissertation [3], published in 1965, analytical studies were generally restricted to linear frequency sweep, and exponential sweep-excitation studies were mostly experimental in nature. Cronin did provide results for relatively slow sweep rates; his work included analog studies involving linear and exponential excitation frequency sweeps. In addition to spring-mass single-degree-of-freedom systems, work has also been done on unbalanced flexible rotors whose spin rates sweep through their critical speeds, e.g., [4]. In these types of systems, the modes of vibration are a function of the spin rate and the resulting gyroscopic moments. In 1964 Hawkes [5] described an approach for obtaining the envelope function of the response of single-degree-of-freedom systems subjected to octave sweep rates. He credits the solution approach to an unpublished document written in April 1961 by T. J. Harvey. From the publication, it is unclear how all of the required initial conditions were obtained for the resulting differential equations, which were solved by numerical integration. The results, however, are consistent with subsequent work published by Lollock [6], who extended the work for both linear and octave sweep rates to useful damping and natural frequency ranges.

In approaches where the envelope function is used to identify the peak response, several factors need to be considered. First, the peak of the envelope function may not coincide with the peak of the time history response; this can lead to an incorrect estimate of the instantaneous excitation frequency at the peak response. The discrepancy is greatest for low-frequency systems and decreases as the natural frequency increases relative to the starting frequency of the sweep. Another peculiar feature of this approach is that, whereas the original equation of motion is a second-order differential equation with two initial conditions (on the function and its derivative), the envelope equations are two coupled second-order differential equations, each requiring two initial conditions. Physical arguments can be made about what these four initial conditions ought to be, but there does not appear to be any way to derive them mathematically from the original two.

It is the purpose of this article to extend and complement previously published work by proposing explicit closed-form solutions to both linear and octave frequency-sweep excitation. This allows the computation of the peak response, not just the peak of the envelope function. The closed-form solutions involve error functions and incomplete gamma functions of complex arguments, computations of which require numerical precision exceeding that which today’s computers can provide. The approach used to overcome this will be described. The closed-form solutions are compared to solutions obtained by numerical integration of the equations of motion. Having the ability to compute closed-form solutions, studies were performed to explore the impact of the frequency separation between the start frequency of the sweeps and the natural frequency of the system. In addition, results are presented showing the fine structure of the peak response in relation to the steady-state resonance response as a function of natural frequencies and critical damping ratios. This includes some unexpected results, in that the peak response curves exhibit highly nonlinear behavior with discontinuities in the derivative.

The differential equation for the motion of a single-degree-of-freedom system driven by harmonic excitation with a linear frequency sweep is given by

$$\ddot{x} + 2\zeta\omega_n\dot{x} + \omega_n^2 x = \sin\!\left(\tfrac{1}{2}\alpha t^2\right) \qquad (1)$$

where $\zeta$ is the critical damping ratio, $\omega_n$ is the natural frequency in radians per second, $\alpha$ is the sweep rate in radians per second per second, and the dots indicate differentiation with respect to time. Assume, without any loss of generality, a sweep starting frequency of zero, a force magnitude equal to the mass of the system and initial conditions of $x(0)=0$ and $\dot{x}(0)=0$. The differential equation of motion of a single-degree-of-freedom system driven by harmonic excitation with an octave frequency sweep is

$$\ddot{x} + 2\zeta\omega_n\dot{x} + \omega_n^2 x = \sin\!\left(\frac{2\pi f_1}{\beta\ln 2}\left(2^{\beta t}-1\right)\right) \qquad (2)$$

where $\beta$ is the octave sweep rate in octaves/sec and $f_1$ is the nonzero start frequency of the sweep, so that the instantaneous frequency of excitation is $f_1 2^{\beta t}$. As for the linear sweep case, assume a force magnitude equal to the mass of the system and initial conditions of $x(0)=0$ and $\dot{x}(0)=0$.
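As a consistency check on the two forcing terms, the phase functions can be differentiated numerically and compared with the instantaneous frequencies. A minimal Python sketch (the article itself works in Mathematica; the phase forms used below, $\alpha t^2/2$ for a linear sweep starting at zero frequency and $2\pi f_1(2^{\beta t}-1)/(\beta\ln 2)$ for an octave sweep, are the standard ones and are assumptions here):

```python
import math

def linear_phase(t, alpha):
    # assumed linear-sweep phase (zero start frequency): theta = alpha*t^2/2
    return 0.5 * alpha * t * t

def octave_phase(t, f1, beta):
    # assumed octave-sweep phase: theta = 2*pi*f1*(2**(beta*t) - 1)/(beta*ln 2)
    return 2 * math.pi * f1 * (2 ** (beta * t) - 1) / (beta * math.log(2))

# d(theta)/dt should equal the instantaneous circular frequency
alpha = 5 * math.pi         # rad/s^2 (equivalent to 150 Hz/min)
f1, beta = 0.125, 0.5 / 60  # Hz, octaves/s (0.5 octave/min)
t, h = 3.0, 1e-6

d_lin = (linear_phase(t + h, alpha) - linear_phase(t - h, alpha)) / (2 * h)
print(abs(d_lin - alpha * t) < 1e-6)  # instantaneous frequency is alpha*t

d_oct = (octave_phase(t + h, f1, beta) - octave_phase(t - h, f1, beta)) / (2 * h)
print(abs(d_oct - 2 * math.pi * f1 * 2 ** (beta * t)) < 1e-6)  # 2*pi*f1*2^(beta*t)
```

Both checks print `True`: the linear sweep's instantaneous circular frequency grows linearly in time, while the octave sweep's instantaneous frequency doubles once per $1/\beta$ seconds, starting from $f_1$.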

It is helpful to also write both the linear sweep and the octave sweep equations in the following form:

$$\ddot{x} + 2\zeta\omega_n\dot{x} + \omega_n^2 x = \sin(\theta(t) + \theta_0) \qquad (3)$$

where $\theta(t)$ is a general phase function and $\theta_0$ is the initial phase. Both the linear and octave sweep equations of motion can be put into the following more general form, which will be useful for constructing closed-form solutions:

(4) |

The solution to equation (4) can be expressed as

(5) |

For linear sweep, this becomes

(6) |

If the sine terms are expanded in terms of complex exponentials, then the resulting integrals can be computed in terms of the error function, $\operatorname{erf}$, and the imaginary error function, $\operatorname{erfi}$, each evaluated at complex arguments. Conceptually, the process proceeds as follows:

- After converting the sine terms to complex exponentials, expand out the products of sums of exponentials, splitting the integral accordingly into a sum of several integrals of exponentials and pulling the parts of each integrand that do not depend on the integration variable outside the integral; the resulting integrals will all have the form.
- With some algebraic manipulation, these integrals can be put into the form or , where , and are, in general, complex valued.
- Choosing as the new integration variable, the first of these integrals becomes:

An identical procedure can be applied to the second of these integrals, leading to an expression involving imaginary error functions. Performing the indicated calculations (including the associated algebra) gives the following closed-form solution for the linear frequency-sweep excitation case. In the interests of compactness, it is helpful to first introduce the following auxiliary parameters:

Then the closed-form solution for the linear sweep case can be written as

(7) |

In order to verify that this equation for $x(t)$ does in fact satisfy the equation of motion, we make use of the fact that the derivatives of the error function and the imaginary error function are given by $\frac{d}{dz}\operatorname{erf}(z)=\frac{2}{\sqrt{\pi}}e^{-z^2}$ and $\frac{d}{dz}\operatorname{erfi}(z)=\frac{2}{\sqrt{\pi}}e^{z^2}$. Then substituting equation (7) and its derivatives into the equation of motion yields an expression involving all of the original $\operatorname{erf}$ and $\operatorname{erfi}$ functions, plus a number of terms that do not contain any error functions. Collecting terms with respect to the various error functions, which is relatively straightforward although algebraically tedious, verifies that the coefficient of each of the error functions is zero, and that the terms that do not contain any error functions sum to the forcing function on the right-hand side of the equation. Since we are interested in the peak acceleration response, the second derivative of the solution, equation (7), is the sought-after response time history.

For the case of octave sweep, it is helpful to make a change of independent variable in equation (2) and let , where is the octave sweep rate in octaves per minute. With this change of variable, the equation of motion for octave sweep in the domain becomes

(8) |

The initial conditions become and . Similarly, the expression for the second derivative with respect to time becomes (in terms of derivatives with respect to ),

(9) |

where we define . The advantage of making this change of variable, from the perspective of numerical integration, is *the absence of exponential functions of time in the forcing function* in equation (8); rather, the forcing function is a constant-frequency sine wave, and the coefficients in the equation are at most quadratic in . This greatly improves the stability and reliability of the numerical integration.

It is helpful to write equation (8) for the octave sweep in the following more general form:

(10) |

where . Then using the variation of parameters method, we obtain the following expression for :

(11) |

Substituting for the phase and then expanding the sines in terms of complex exponentials yields integrals that, after algebraic transformation, are readily expressed in terms of incomplete gamma functions. The incomplete gamma function is given by $\Gamma(a,z)=\int_z^{\infty}t^{a-1}e^{-t}\,dt$. For compactness, it is helpful to first introduce four auxiliary parameters. Then the resulting expression reduces to

(12) |

Substituting yields the corresponding solution in the time domain:

(13) |

Computing the first and second derivatives of equation (13) and substituting them into the original equation of motion, equation (2), one discovers, after some algebra and collecting terms with respect to the various incomplete gamma functions, that the resulting equation can be put into the form . Since we are interested in oscillatory motion, which implies , it follows that reduces to zero, thereby showing that equation (13) does indeed satisfy equation (2).

The sought-after solutions are the real parts of equation (7) and equation (13). For the linear sweep, series expressions exist for the real and imaginary parts of both $\operatorname{erf}$ and $\operatorname{erfi}$: functions.wolfram.com/GammaBetaErf/Erf/19 and functions.wolfram.com/GammaBetaErf/Erfi/19 contain series expressions in terms of Hermite polynomials as well as hypergeometric functions. In practice, these series have very slow and highly nonmonotonic convergence, with the partial sums fluctuating over many orders of magnitude as successive terms are added. Furthermore, numerical evaluation of these partial sums using exact numbers as inputs is extremely slow, with computation time increasing nonlinearly with the number of terms, while evaluation using finite-precision numbers yields erroneous results. Since one does not know ahead of time how many terms will be needed for an accurate computation, this approach is impractical. As with the error function, there are similar numerical challenges in computing the incomplete gamma function of complex arguments. Accordingly, the closed-form solutions will be computed using equations (7) and (13) directly.

There are also numerical challenges associated with the exact solutions because of the complex arguments of the error and gamma functions. Recall that the error function is given by $\operatorname{erf}(z)=\frac{2}{\sqrt{\pi}}\int_0^z e^{-t^2}\,dt$. For real arguments the integrand is bounded by 1, but once the argument becomes complex, with $z=x+iy$, we must integrate expressions of the form $e^{-(x+iy)^2}=e^{y^2-x^2}e^{-2ixy}$, and the presence of the $y^2$ term in the exponent means that the magnitude of the integrand grows very quickly with $y$, that is, as $e^{y^2}$. Since $e^{-z^2}$ is analytic in the complex plane, we can use the Cauchy integral theorem for line integrals [8] to break the integral from $0$ to the complex point $z$ into two parts: the integral from $0$ to $x$ plus the integral from $x$ to $x+iy$. In the second part we are in effect integrating in the imaginary direction, where the magnitude of the integrand grows as $e^{y^2}$. Thus, the magnitudes of both $\operatorname{erf}$ and $\operatorname{erfi}$ increase very quickly, as shown in the plots in Figures 1 and 2. In order for the combinations of $\operatorname{erf}$ and $\operatorname{erfi}$ that appear in the exact solution to sum to an oscillatory function, very precise cancellations are needed, meaning that extremely high precision is required to do the numerical evaluations correctly.

**Figure 1.** Plot of . Observe that over the range and , the magnitude of increases to about .

**Figure 2.** Plot of . The behavior of is similar to that of .
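The scale of the required cancellation can be estimated from the large-argument asymptotic form $\operatorname{erfi}(x)\sim e^{x^2}/(x\sqrt{\pi})$ for real $x$. A quick Python estimate (illustrative only) of how many decimal digits must cancel before an order-one oscillatory value remains:

```python
import math

# asymptotically, erfi(x) ~ exp(x^2)/(x*sqrt(pi)) for large real x
x = 10.0
erfi_magnitude = math.exp(x * x) / (x * math.sqrt(math.pi))
digits = math.floor(math.log10(erfi_magnitude))
print(digits)  # 42
```

Terms of size roughly $10^{42}$ must cancel down to an order-one result, so some 45 or more significant digits are needed in the intermediate arithmetic, far beyond machine double precision (about 16 digits). Since the exponent grows as $x^2$, no fixed working precision suffices for all inputs.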

Because of the extremely high numerical precision requirements, Mathematica, which implements arbitrary-precision arithmetic, was chosen to compute the closed-form solutions. This made it possible to experiment with different levels of computational precision. Some results were computed with hundreds or thousands of digits of precision. Depending on the values of the input parameters (sweep rate, natural frequency, damping coefficient, etc.), it was found that different levels of precision were needed in order to get reliable results—not a very attractive idea, since it is impossible to know ahead of time how much precision would be needed for any particular set of inputs. Fortunately, Mathematica also allows exact arithmetic (using rational and/or exact symbolic numbers as inputs), and this made it possible to use the exact analytic solutions in a computationally tractable form. More specifically, one can evaluate functions numerically using exact arithmetic by means of the following steps:

- Convert all of the inputs to integers, rational numbers or exact symbolic numbers such as $\pi$ or $e$, or rational multiples thereof, all of which are treated as having infinite precision.
- Set the global variable `$MaxExtraPrecision`, which specifies the maximum number of extra digits of precision that may be used internally, to `Infinity`. This enables as much extra precision as possible.
- Evaluate the function of interest with the desired (exact) inputs. This will, in general, yield a very complicated exact expression.
- Evaluate this exact value to the desired number of digits of precision for the output in order to get a recognizable numerical value, with the understanding that any imaginary “dust” arising from this numerical truncation will be ignored. (For the results presented later, we used 30 digits of output precision.)
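The steps above are specific to Mathematica, but the underlying idea, keeping every input exact so that rounding happens only once at the final truncation of the output, can be illustrated in any language with exact rationals. A toy Python contrast (unrelated to the actual error-function computations, but showing the same failure mode):

```python
from fractions import Fraction

# floating point: the "+ 1" is rounded away on input, so the
# subsequent cancellation leaves nothing
a = 1e16
print((a + 1.0) - a)       # 0.0

# exact rationals: every digit is kept, and the correct difference
# survives the final conversion to a finite-precision number
b = Fraction(10) ** 16
print(float((b + 1) - b))  # 1.0
```

In the floating-point version the information is destroyed before the subtraction ever happens; with exact inputs, the cancellation is resolved correctly and precision is spent only on the final numerical answer.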

To build confidence in the closed-form solution, the equation of motion was also solved by direct numerical integration. For the linear sweep case, the results presented herein were obtained from the closed-form solution, equation (7), as well as direct integration of the differential equation of motion, equation (1). The *closed-form* solution was evaluated by first rationalizing all of the inputs to (7) (other than integers, rational numbers and multiples of $\pi$ and $e$) using `Rationalize` (which converts any number to rational form), and then evaluating the real part of the result (to eliminate any very small imaginary numbers) to the desired number of digits of precision (typically 30, 50 or 100) with the `N` function. The *numerical* solution was obtained by integrating the equation of motion (1) with `NDSolve` out to some desired maximum time (typically some time after the sweep frequency hits the natural frequency of the system).

Figure 3 shows the response time histories for a system with a natural frequency of 5 Hz and a critical damping ratio of 1%. The sweep frequency was started at zero Hz and the sweep rate was 150 Hz/min, or $5\pi\approx 15.7$ radians per second per second. In the figure, the dashed orange line is the closed-form solution and the dotted blue line is the direct numerical integration solution. Clearly, the differences are imperceptible. Table 1 shows the numerical values for both solutions for a randomly selected subset of the time points used in plotting Figure 3. Again, it is evident that for all practical purposes, the solutions are identical.

**Figure 3.** Acceleration response time histories of a single-degree-of-freedom system, , excited by a harmonic force with a linear sweep rate frequency of 150 Hz/min.
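The qualitative behavior can be reproduced approximately in any environment. Below is a minimal fixed-step RK4 sketch in Python (not the article's Mathematica workflow), assuming the standard linear-sweep equation of motion $\ddot{x}+2\zeta\omega_n\dot{x}+\omega_n^2 x=\sin(\alpha t^2/2)$ with the Figure 3 parameters: $\zeta=0.01$, a 5 Hz natural frequency, a 150 Hz/min sweep from zero and zero initial conditions:

```python
import math

zeta, fn = 0.01, 5.0            # damping ratio, natural frequency (Hz)
wn = 2 * math.pi * fn           # natural frequency (rad/s)
alpha = 2 * math.pi * 150 / 60  # 150 Hz/min as rad/s^2 (= 5*pi)

def accel(t, x, v):
    # x'' = sin(alpha*t^2/2) - 2*zeta*wn*x' - wn^2*x (assumed equation of motion)
    return math.sin(0.5 * alpha * t * t) - 2 * zeta * wn * v - wn * wn * x

# classical RK4 on the first-order system (x, v), zero initial conditions
t, x, v, dt = 0.0, 0.0, 0.0, 1e-3
t_peak, a_peak = 0.0, 0.0
while t < 4.0:
    k1x, k1v = v, accel(t, x, v)
    k2x, k2v = v + 0.5*dt*k1v, accel(t + 0.5*dt, x + 0.5*dt*k1x, v + 0.5*dt*k1v)
    k3x, k3v = v + 0.5*dt*k2v, accel(t + 0.5*dt, x + 0.5*dt*k2x, v + 0.5*dt*k2v)
    k4x, k4v = v + dt*k3v, accel(t + dt, x + dt*k3x, v + dt*k3v)
    x += dt * (k1x + 2*k2x + 2*k3x + k4x) / 6
    v += dt * (k1v + 2*k2v + 2*k3v + k4v) / 6
    t += dt
    a = abs(accel(t, x, v))
    if a > a_peak:
        t_peak, a_peak = t, a

# the sweep crosses the natural frequency at t = wn/alpha = 2 s;
# the peak acceleration occurs after the crossing, well above the unit forcing
print(t_peak > 2.0, a_peak > 5.0)
```

With these parameters the peak acceleration response occurs after the 2 s frequency crossing and is strongly amplified relative to the unit forcing, consistent with the lag and amplification behavior discussed in the text.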

For the octave sweep case, the results in Table 2 were obtained from the closed-form solution, equation (13), as well as by direct integration of the differential equation of motion in the transformed domain, equation (8). The procedure for evaluating the *closed-form* solution for the octave sweep case was identical to that described for the closed-form solution in the linear sweep case. The *numerical* solution was obtained by integrating equation (8) in the transformed domain out to some desired maximum value of the transformed variable (typically corresponding to some time beyond the time at which the sweep frequency hits the natural frequency of the system), with the transformed initial conditions, and then using equation (9) to transform the acceleration back to the time domain. Figure 4 shows the response time histories for a system with a natural frequency of 1/4 Hz and a critical damping ratio of 0.01. The sweep frequency was started at 1/8 Hz and the sweep rate was 1/2 octave/min. The orange dashed line is the closed-form solution and the blue dotted line is the direct numerical integration solution. Again, the differences are imperceptible. Table 2 provides the numerical values for both solutions for a randomly selected subset of the time points used in plotting Figure 4; for all practical purposes, the results are identical.

**Figure 4.** Acceleration response time histories of a single-degree-of-freedom system, , excited by a harmonic force with an octave sweep rate frequency of 0.5 octaves/min.

The construction of the peak response curves involved two steps. First, the times at which the peak of the absolute value of the acceleration occurred were obtained via numerical integration for the desired combinations of , and for linear sweep or for octave sweep. These times were then used as the starting points for a very fine-grained search of the exact analytical solutions in order to determine the peak acceleration in each case. Development of a generic algorithm to accomplish this was not trivial, as will be discussed. However, the effort was made easier by previously published results that indicate that the peak envelope values, which would contain the peak response values, would occur *after* the instantaneous frequency of excitation was equal to the natural frequency of the system. Hence, the search for the peaks was started at the point in the response time history where the instantaneous frequency of excitation was equal to the circular natural frequency of the system. For the linear sweep excitation, the time was computed as

$$t_1 = \frac{\omega_n}{\alpha} \qquad (14)$$

and for the octave sweep excitation, the value was computed as

(15) |

In the case of the numerical approach, we sorted the list of computed acceleration values generated via integration, starting at in order to find an initial approximation to the peak acceleration, and then did a more refined local search around this peak using standard local optimization techniques. In the case of the analytical approach, much smaller increments in were used in order to get a sharper picture of some unusual phenomena that emerge at low frequencies. Accordingly, interpolations were generated of the times at which the *numerically generated *peak responses occurred, as a function of for combinations of and for linear sweep, or for octave sweep. Thus, for any value of , we could use this interpolated time value as the starting point for a refined numerical search that involved evaluating the *exact analytical solution* at very closely spaced time points in a neighborhood of this time. For this, we chose time points that were equally spaced in the phase of the forcing function, that is, at 0.25° phase increments, which provided precise, although computationally intense results. In addition, care was taken to search for the peak sufficiently past the start of the search, given by equation (14) or equation (15), to guarantee that the global maximum peak had been found.
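The refinement step can be sketched generically. The Python fragment below is a sketch only; a toy sinusoid stands in for the exact closed-form solution, which the article evaluates quite differently. It samples the absolute response at 0.25° phase increments of the local excitation frequency around a coarse peak estimate:

```python
import math

def refine_peak(f, t_coarse, freq_hz, half_window_cycles=0.25, phase_step_deg=0.25):
    # evaluate |f| on a grid spaced at phase_step_deg of the local
    # excitation frequency, centered on the coarse peak estimate
    dt = (phase_step_deg / 360.0) / freq_hz
    n = int((half_window_cycles / freq_hz) / dt)
    best_t, best_val = t_coarse, abs(f(t_coarse))
    for k in range(-n, n + 1):
        t = t_coarse + k * dt
        if abs(f(t)) > best_val:
            best_t, best_val = t, abs(f(t))
    return best_t, best_val

# toy stand-in: a 5 Hz sinusoid whose true peak nearest t = 0.04 s is at t = 0.05 s
g = lambda t: math.sin(2 * math.pi * 5 * t)
t_peak, val = refine_peak(g, 0.04, 5.0)
print(abs(t_peak - 0.05) < 1e-3, abs(val - 1.0) < 1e-4)
```

In the article, the coarse estimate comes from the numerically integrated response and the fine grid is applied to the exact closed-form solution; the search window here is deliberately small, whereas the article extends the search well past the coarse peak to guard against a later peak produced by beating.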

Associated with the question of where in a time history to start the search for the peak value, that is, and , is the question of how far past this point the search should be conducted to guarantee that the global peak has been identified. Unfortunately, the only reliable way we found to accomplish this was through trial and error. For linear sweep, we found experimentally that it was very helpful to divide the range into two parts: and .

For relatively low natural frequencies, that is , it was found experimentally that evaluating the function out to gives reliable results in most cases, with the peak response typically occurring about 20% of the way out to . At low values of , however, the peak response sometimes occurred about 45 to 50% of the way out to . Although, with hindsight, we could have obtained the peak response without going out so far in time, we wanted to be sure that the peak response found was in fact the true global peak response. We observed that in some cases, what looked like a global peak value was eventually “dethroned” by a peak that occurred quite a few cycles later, due to the beating of the frequencies involved. Thus, all of the low-frequency responses, as well as a subset of the high-frequency responses, were monitored graphically, and if any peak responses were found at times more than 50% or so of the way out to , then the coefficient of for was increased accordingly. For higher natural frequencies, that is, , it was found that generally gave reliable results for high sweep rates (~150–200 Hz/min), while gave reliable results for lower sweep rates (~10–20 Hz/min).

In view of the oscillatory nature of the system, it was important to constrain the maximum integration step size to be at most a small fraction of a cycle. Based on prior experience with similar computations, we chose the maximum step size to be 1/40 of a cycle of the largest frequency of interest, which was the sweep frequency at the value described previously. For simplicity, we deliberately chose to constrain the maximum step size based on the largest frequency of interest, encountered at time , rather than attempt to change the maximum step size as the frequency changed. For the low sweep frequencies encountered in the early parts of a sweep, this step size was much smaller than 1/40 of a cycle, but this did not create any problems. The numerical integrator used () employs an adaptive algorithm that adjusts the step size as needed, subject to any user-prescribed constraints. In addition, we used fifth-order interpolation in the numerical integrator so that the acceleration would be a third-order interpolating function. Finally, in view of the progressive increase of sweep frequency with time, we found it useful to specify a maximum of 100,000,000 integration time steps (considerably more than the integrator’s default value), as in some cases a smaller maximum number of time steps (such as 10,000,000) did not allow the adaptive integrator to reach the global peak response.
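The integrate-then-search process described above can be sketched as follows. This is a minimal Python illustration, not the authors' Mathematica code: the equation normalization (unit-amplitude forcing), the fixed-step RK4 integrator standing in for the adaptive integrator, and all parameter values are our assumptions.

```python
import math

def sweep_peak(fn=2.0, zeta=0.05, fs=0.125, R=1.0, t_end=8.0):
    """Fixed-step RK4 integration of x'' + 2*zeta*wn*x' + wn**2*x = sin(phi(t))
    under a linear sweep phi(t) = 2*pi*(fs*t + 0.5*R*t**2), followed by a
    coarse search for the largest |x''| starting at t1, the time at which the
    sweep frequency equals fn. A refined search (e.g. at fine phase
    increments) would follow this coarse step."""
    wn = 2.0 * math.pi * fn

    def deriv(t, x, v):
        f = math.sin(2.0 * math.pi * (fs * t + 0.5 * R * t * t))
        return v, f - 2.0 * zeta * wn * v - wn * wn * x

    # Constrain the step to 1/40 of a cycle of the highest frequency of
    # interest, i.e. the sweep frequency at the end of the integration.
    dt = 1.0 / (40.0 * (fs + R * t_end))
    t1 = (fn - fs) / R                  # start of the peak search
    t, x, v = 0.0, 0.0, 0.0
    best_t, best_a = t1, 0.0
    while t < t_end:
        k1x, k1v = deriv(t, x, v)
        k2x, k2v = deriv(t + 0.5 * dt, x + 0.5 * dt * k1x, v + 0.5 * dt * k1v)
        k3x, k3v = deriv(t + 0.5 * dt, x + 0.5 * dt * k2x, v + 0.5 * dt * k2v)
        k4x, k4v = deriv(t + dt, x + dt * k3x, v + dt * k3v)
        x += dt / 6.0 * (k1x + 2 * k2x + 2 * k3x + k4x)
        v += dt / 6.0 * (k1v + 2 * k2v + 2 * k3v + k4v)
        t += dt
        a = abs(deriv(t, x, v)[1])      # acceleration magnitude
        if t >= t1 and a > best_a:
            best_t, best_a = t, a
    return best_t, best_a
```

In the article's actual computations, the adaptive integrator handles the step-size refinement subject to the 1/40-cycle cap; the fixed step here is only for transparency of the sketch.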

It was also required that the closed-form solution be evaluated at very closely spaced time increments in order to reliably find the peak acceleration. This strategy leveraged the previously computed numerical solutions, that is, the times at which the numerically obtained peak values occurred, in order to do a very fine-grained search (with the closed-form solution) in the neighborhood of the numerically computed peak value. Although a global list of search points could have been generated in other ways without the use of the numerical solution, using the points generated by the numerical integrator seemed like the most efficient approach. The strategy then was to use the numerically generated estimate of when the peak acceleration occurs and search within plus or minus some number of cycles of this time, at equally spaced increments in the forcing function phase. We found that searching within ±60 cycles with 1,440 phase increments per cycle (i.e. at 0.25° phase increments) yielded reliable results.
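The phase-spaced fine search can be sketched in Python. A stand-in response function takes the place of the closed-form solution, and the linear-sweep phase law phi(t) = 2π(fₛt + Rt²/2) together with the default parameters are our assumptions:

```python
import math

def phase_spaced_times(t_est, fs=0.125, R=1.0, cycles=60, steps_per_cycle=1440):
    """Times around t_est equally spaced in the phase of a linear-sweep forcing
    phi(t) = 2*pi*(fs*t + 0.5*R*t**2); 1,440 points per cycle corresponds to
    0.25-degree phase increments, covering +/- `cycles` cycles of forcing."""
    dphi = 2.0 * math.pi / steps_per_cycle
    fwd, t = [], t_est
    for _ in range(cycles * steps_per_cycle):      # forward half of the window
        t += dphi / (2.0 * math.pi * (fs + R * t))  # dt = dphi / phi'(t)
        fwd.append(t)
    back, t = [], t_est
    for _ in range(cycles * steps_per_cycle):      # backward half of the window
        t -= dphi / (2.0 * math.pi * (fs + R * t))
        if t <= 0.0:
            break                                  # do not search before t = 0
        back.append(t)
    return back[::-1] + [t_est] + fwd

def refine_peak(response, t_est, **kw):
    """Evaluate `response` (standing in for the closed-form solution) on the
    fine phase-spaced grid and return the time of the largest magnitude."""
    return max(phase_spaced_times(t_est, **kw), key=lambda t: abs(response(t)))
```

The grid spacing shrinks automatically as the instantaneous sweep frequency rises, since equal phase increments correspond to shorter time increments at higher frequency.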

Figure 5 shows the peak response (from the exact solution) normalized by , the steady-state resonant response when the excitation frequency is equal to the undamped natural frequency of the system, plotted against the natural frequency of the system for three linear excitation sweep rates, , and Hz/min. The system has a critical damping ratio of and its natural frequency was varied from 0.25 Hz to 10 Hz in steps of 0.01 Hz. Each of the (almost 1,000) peak response values on each of these curves was computed via the process for computing peak acceleration (from the exact solution) described in Sections 7 and 8, that is, searching within ±60 cycles of the numerically generated estimate of when the peak acceleration occurs, with 1,440 phase increments per cycle. As can be seen, the attenuation of the peak response relative to the resonant steady-state response is significant for systems with low natural frequencies. As the natural frequency increases, which allows a greater number of response cycles during any given excitation frequency range, the attenuation decreases. These results are consistent with those published by others [6]. What is not consistent is the scalloped behavior of the peak curves at the lower frequencies. This behavior was obtained with both the numerically integrated results and the closed-form solution. Figure 6 shows an expanded close-up view of the lower-frequency range of Figure 5 and was generated by simply changing the horizontal plot range in Figure 5. The details visible in Figure 6 will be discussed in more detail later.

**Figure 5.** Normalized peak response plotted against natural frequency for several linear excitation sweep rates. Left-to-right curves correspond to top to bottom in key.

**Figure 6.** Close-up of the low-frequency range of Figure 5. Left-to-right curves correspond to top to bottom in key.

Another observation is that the peak response during a frequency sweep can exceed the steady-state resonant response. This is shown in Figure 7, where the normalized peak responses are shown for two sweep rates (Figure 7 was obtained from Figure 5 by simply adjusting the vertical plot range to focus on the overshoot portion of the response). This might seem counterintuitive, since the frequency of excitation is sweeping through the natural frequency and therefore does not dwell. However, the sweep causes a response that is at the natural frequency of the system and that decays as a function of the system damping. Once the sweep frequency passes the natural frequency, the total response is the response due to the excitation plus the free-decay response of the system at its natural frequency. This is what causes the beating in the response once the sweep frequency passes the natural frequency. The decaying free response plus the transient response to the swept excitation can combine to produce higher peak responses than the resonant response caused by harmonic dwell at the natural frequency. The overshoot observed here is consistent with the overshoot observed by Cronin [3].

**Figure 7.** Close-up of overshoot phenomenon observed in Figure 5. Left-to-right curves correspond to top to bottom in key.

Figure 8 shows the normalized peak response for various sweep rates plotted against the natural frequency squared divided by the linear sweep rate; this normalization allows comparison to results presented in the literature. The critical damping ratio for this system is . The data used in Figure 8 is the same as the data used in Figure 5, only plotted differently. Observe that the curves merge into one, as explained by Hawkes [5].

**Figure 8.** Normalized peak response for several linear sweep rates plotted against , where is the natural frequency and is the sweep rate.

In Figures 5 through 8, one observes periodic discontinuities in the derivative of the peak response curve. Moreover, the curve does not increase monotonically; sometimes it starts to dip down before hitting a discontinuity in slope and resuming its upward trend. One also observes that at very low frequencies the discontinuities in the derivative are not very regular, but as the natural frequency is gradually increased, they take on a much more regular nature. These discontinuities are best understood in terms of what we will call the “competing peaks” phenomenon, which can be most clearly explained by taking several observations into account:

- The peak response always occurs some time after the sweep frequency reaches the natural frequency of the system.
- As the natural frequency of the system is increased, the time at which the sweep frequency reaches the natural frequency occurs later and later, since for these problems the sweep frequency always started at 1/8 Hz.
- Thus the time at which the peak acceleration occurs can be expected to increase as the natural frequency is increased.
- In the array of plots shown in Figure 9, which shows the response time histories for several very closely spaced values of , one observes that as the time at which the peak acceleration is reached increases, the “dominant” peak (i.e. the largest global peak) is eventually overtaken (from one value of to the next, i.e. from one plot to the next) by the secondary peak (i.e. the second-largest global peak), which has been increasing all along. When this happens, the rate of change of the global peak suddenly changes, since it is now associated with a different peak, and thus there is a discontinuity in the slope of the peak response curve. These peak responses as a function of frequency are summarized in the plot inset at the lower-right corner of Figure 10.

**Figure 9.** Evolution of peak acceleration as natural frequency is increased (left to right, then top to bottom).

**Figure 10.** Evolution of peak acceleration as secondary peak overtakes the dominant peak. The first six points come from the preceding plots.

In principle, there are actually three possible types of behavior that can lead to discontinuities in the derivative of the peak response curve, and all can be understood in terms of the preceding logic:

- A decreasing peak is overtaken by an increasing peak (this is the case described in the preceding).
- An increasing peak is overtaken by a more rapidly increasing peak.
- A decreasing peak passes a more slowly decreasing peak so that the more slowly decreasing peak is now the dominant peak (possible in principle, but not observed in this example).

The later the peak (i.e. the larger the natural frequency of the system), the longer the system has to build up to a steady-state-like response, so that successive peak accelerations (corresponding to successively higher natural frequencies) attain higher and higher values, hence the overall upward general trend of the peak response curve. For this same reason, at high sweep frequencies successive peaks in the response versus time curve all have very similar amplitudes, so that when the natural frequency is changed slightly and one peak overtakes another, the difference in the rates at which the dominant and secondary peaks are increasing is extremely small and barely noticeable. Thus the peak response curve appears to be smooth at high frequencies.
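The competing-peaks mechanism can be illustrated with a toy model. The amplitude laws below are entirely hypothetical, not the article's data; they serve only to show that taking the maximum of two crossing curves produces a slope discontinuity in the reported peak-response curve:

```python
# Hypothetical amplitude laws for two competing response peaks as functions of
# the natural frequency fn (illustration only):
def dominant(fn):
    return 1.0 + 0.10 * fn    # peak that grows slowly with fn

def secondary(fn):
    return 0.5 + 0.25 * fn    # peak that grows faster and eventually overtakes

def peak_response(fn):
    # The reported peak is the larger of the two, so its slope jumps from
    # 0.10 to 0.25 at the crossover fn = 10/3, producing a kink in the curve.
    return max(dominant(fn), secondary(fn))
```

Below the crossover the curve follows the dominant peak's rate of change; above it, the secondary peak's. At high natural frequencies, where successive peaks have nearly equal amplitudes and growth rates, the kink becomes imperceptible, which is why the peak response curve appears smooth there.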

As described earlier, in the octave sweep case, it is extremely helpful to first make a change of independent variable by letting . The resulting differential equation for then has a constant-frequency forcing term in the domain (at the expense of coefficients in the equation that are at most quadratic in time). The resulting differential equation for , equation (8), was solved both analytically (equation (13)) and numerically, and then transformed back to the time domain.

The time at which the sweep frequency equals the oscillator’s resonant frequency is given by equation (15). However, since the integration is being done in the domain, the corresponding expression for the value at which the system’s resonant frequency is reached becomes

(16) |

Since in the domain the forcing function is a constant-frequency sine wave, we found experimentally that in most cases it was sufficient to integrate to a maximum value of 1.5 , although occasionally it was necessary to go up to 3 or 4 times . In some cases it is possible for the value of to become less than 1, and so we also imposed a lower bound of 1.05 on the maximum value of . We again used fifth-order interpolation for computing derivatives in and again allowed the integration to go for a maximum of 100,000,000 time steps: recall that for octave sweep we used the substitution , so increases exponentially with , and thus the number of steps in the domain can become much larger than the number of time steps in the domain.
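A minimal sketch of this integration bound, under our reading of the substitution: we assume it maps the sweep start to τ = 1 and resonance to τ₁ = fₙ/fₛ (the article's equation (16) gives the exact expression), with the stated 1.5× factor and 1.05 floor:

```python
def tau_max(fn, fs, factor=1.5, floor=1.05):
    """Upper integration limit in the tau domain. Assumes the change of
    variable maps t = 0 to tau = 1 and resonance to tau1 = fn/fs (our
    assumption); the floor keeps the limit above the start when fn < fs,
    i.e. when tau1 would fall below 1."""
    tau1 = fn / fs
    return max(floor, factor * tau1)
```

For a sweep starting at 1/8 Hz and a 4 Hz oscillator this gives τ_max = 48, while for a 0.5 Hz oscillator with a 1 Hz start frequency the 1.05 floor applies.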

Once the differential equation (for a given set of , , and values) had been solved, the following procedure for finding the peak response was followed:

1. Create a list of the values generated via numerical integration.
2. Use equation (9) to evaluate (in the time domain) at each value and then from this list select the largest response.
3. Use the data from steps (1) and (2) to create an interpolating function for as a function of .
4. Having found this initial estimate of the peak value, use the interpolation function returned by to do a local optimization (via the function) around this initial peak, using this peak as a starting point.
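These four steps can be sketched in Python. The quadratic-fit refinement below stands in for the interpolating function and the local optimizer used in the article; the function names and the uniform-grid assumption are ours:

```python
def octave_peak(accel, taus):
    """Steps (1)-(4): from samples of the acceleration magnitude at the
    integrator's tau values (assumed uniformly spaced), take the largest,
    then refine by fitting a parabola through the three samples around it,
    a stand-in for interpolation followed by local optimization."""
    vals = [abs(accel(tau)) for tau in taus]                  # steps (1)-(2)
    i = max(range(1, len(vals) - 1), key=lambda k: vals[k])   # coarse peak
    t0, t1, t2 = taus[i - 1], taus[i], taus[i + 1]            # step (3)
    y0, y1, y2 = vals[i - 1], vals[i], vals[i + 1]
    # Step (4): vertex of the parabola through the three points.
    h = t1 - t0
    denom = y0 - 2.0 * y1 + y2
    if denom == 0.0:
        return t1, y1                                         # flat: no refinement
    dt = 0.5 * h * (y0 - y2) / denom
    return t1 + dt, y1 - 0.125 * (y0 - y2) ** 2 / denom
```

For a response that is locally parabolic near its peak, the refinement recovers the off-grid peak location and value exactly.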

Figure 11 shows the normalized peak response to various octave sweep rates. Each of the (almost 1,000) peak response values on each of these curves was computed via the process for computing peak acceleration (from the exact solution) described in Sections 7 and 8, that is, searching within ±60 cycles of the numerically generated estimate of when the peak acceleration occurs, with 1,440 phase increments per cycle. As expected, the slower the sweep rate, the lower the attenuation. In addition, the scalloped behavior in the peak response curves that was observed for the linear sweeps is also present here, although not as pronounced. This is because the octave sweep increases in frequency more rapidly than the linear sweep.

**Figure 11.** Normalized peak responses with and several values of octave sweep rate (octaves/minute). At low natural frequencies, , the peak response was computed in increments of 0.002 Hz. Left-to-right curves correspond to top to bottom in key.

Figure 12 shows an expanded view of Figure 11 corresponding to the lower frequency systems so that the scalloped behavior can be better seen. Figure 12 was obtained from Figure 11 by simply adjusting the vertical and horizontal plot ranges.

**Figure 12.** Expanded view of peak response curves with at low natural frequencies for various octave sweep rates (octaves/minute). Left-to-right curves correspond to top to bottom in key.

Figure 13 shows the results from Figure 11 normalized by the octave sweep rate, as suggested by Hawkes [5]. The data used in Figure 13 is the same as the data used in Figure 11, only plotted differently. As in the case with the linear sweep rate and its normalization factor, the octave sweep rate results also merge into a single curve for systems with the same critical damping ratio.

**Figure 13.** Normalized peak response curves for and various octave sweep rates plotted against , where is in Hz and is in octaves/minute.

Figure 14 shows comparable results to those in Figure 13 for systems with a critical damping ratio of .

**Figure 14.** Normalized peak response curves for and various sweep rates with , plotted against ; is in Hz and is in octaves/minute.

Figures 15 and 16 show the severe attenuation that occurs when the start frequency of the sweep is close to the natural frequency. In both figures, the sweeps were started at 1 Hz. As can be ascertained, the attenuation is significant for systems with natural frequencies close to or below 1 Hz, as would be expected. Hence, the attenuation is not only a function of the natural frequency, damping and sweep rate, but also of the proximity of the start frequency of the sweep to the natural frequency. As with Figure 11, each of the peak response values on each of the curves in Figures 14–16 was computed via the process for computing peak acceleration (from the exact solution) described in Sections 7 and 8.

**Figure 15.** Normalized peak response curves for octave sweep with and (instead of 1/8 Hz). Left-to-right curves correspond to top to bottom in key.

**Figure 16.** Normalized peak response curves for octave sweep with and (instead of 1/8 Hz). Left-to-right curves correspond to top to bottom in key.

The derivation of closed-form solutions for the responses of single-degree-of-freedom systems subject to linear and octave swept-frequency harmonic excitation was presented. The closed-form solutions were compared to results obtained by direct numerical integration of the equations of motion with excellent agreement obtained. In addition, an in-depth discussion was presented on the numerical difficulties associated with the gamma and error functions of complex arguments that are part of the closed-form solutions, and how these difficulties were overcome by employing exact arithmetic with infinite-precision numbers, that is, rational and/or exact symbolic numbers. This included a study of precision requirements by performing computations with numerical precision exceeding what is available on today’s computers. The closed-form solutions allowed the in-depth study of several interesting phenomena including: (a) computation of the peak response instead of the peak of the envelope function; (b) scalloped behavior of the peak response with frequent discontinuities in the derivative; (c) the significant attenuation of the peak response if the sweep frequency is started at frequencies near or above the natural frequency; and (d) the fact that the swept-excitation response could exceed the steady-state harmonic response when the system is excited at its natural frequency.

We are grateful to Luke Titus of Wolfram Research for his valuable suggestions on exact numerical computation.

This work was supported by contract # FA8802-14-C-0001.

[1] F. M. Lewis, “Vibration during Acceleration through a Critical Speed,” Transactions of the American Society of Mechanical Engineers, 54(1), 1932 pp. 253–261.

[2] R. L. Fearn and K. Millsaps, “Constant Acceleration of an Undamped Simple Vibrator through Resonance,” The Aeronautical Journal, 71(680), August 1967 pp. 567–569. https://doi.org/10.1017/S0001924000055007.

[3] D. L. Cronin, Response of Linear, Viscous Damped Systems to Excitations Having Time-Varying Frequency, Ph.D. thesis, Dynamics Laboratory, California Institute of Technology, Pasadena, California, 1965. https://authors.library.caltech.edu/26518.

[4] R. Gasch, R. Markert and H. Pfutzner, “Acceleration of Unbalanced Flexible Rotors through the Critical Speeds,” Journal of Sound and Vibration, 63(3), 1979 pp. 393–409. https://doi.org/10.1016/0022-460X(79)90682-5.

[5] P. E. Hawkes, “Response of a Single-Degree-of-Freedom System to Exponential Sweep Rates,” Shock, Vibration and Associated Environments, Part II, Bulletin No. 33, February 1964 pp. 296–304. https://apps.dtic.mil/dtic/tr/fulltext/u2/432931.pdf.

[6] J. A. Lollock, “The Effect of Swept Sinusoidal Excitation on the Response of a Single-Degree-of-Freedom Oscillator,” in 43rd AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, 2002, Denver, CO. https://doi.org/10.2514/6.2002-1230.

[7] R. Markert and M. Seidler, “Analytically Based Estimation of the Maximum Amplitude during Passage through Resonance,” International Journal of Solids and Structures, 38(10–13), 2001 pp. 1975–1992. https://doi.org/10.1016/S0020-7683(00)00147-5.

[8] L. Ahlfors, Complex Analysis: An Introduction to the Theory of Analytic Functions of One Complex Variable, New York: McGraw-Hill, 2000.

C. C. Reed and A. M. Kabe, “Peak Response of Single-Degree-of-Freedom Systems to Swept-Frequency Excitation,” The Mathematica Journal, 2018. https://doi.org/10.3888/tmj.21-1.

Dr. Chris Reed is a Senior Engineering Specialist in the Structures Department at The Aerospace Corporation. As an applied mathematician, his work has encompassed mechanical vibrations, structural deformation, space-based sensor system performance, satellite system design optimization, flight termination system interference, fluid sloshing, electrostatic discharges, dielectric degradation on satellites and queueing systems. He has two patents and received a Wolfram Innovator award in 2017. His B.S. is from the California Institute of Technology and his M.S. and Ph.D. degrees are from Cornell University.

Dr. Alvar M. Kabe is the Principal Director of the Structural Mechanics Subdivision of The Aerospace Corporation. He has made notable contributions to the state of the art of launch vehicle and spacecraft structural dynamics. He has published numerous papers, is an Associate Fellow of the AIAA, and has received The Aerospace Corporation’s Trustees’ Distinguished Achievement Award and the Aerospace President’s Achievement Award. His B.S., M.S. and Ph.D. degrees are from UCLA.

**C. Christopher Reed**

*Senior Engineering Specialist
Structures Department
M4-912
The Aerospace Corporation
P.O. Box 92957
Los Angeles, CA 90009-2957*

**Alvar M. Kabe**

*Principal Director
Structural Mechanics Subdivision
M4-899
The Aerospace Corporation
P.O. Box 92957
Los Angeles, CA 90009-2957*

dx.doi.org/doi:10.3888/tmj.20-8

This article presents a numerical pseudo-dynamic approach to solve a nonlinear stationary partial differential equation (PDE) with bifurcations by passing from to a pseudo-time-dependent PDE . The equation is constructed so that the desired nontrivial solution of represents a fixed point of . The numeric solution of is then obtained as the solution of at a high enough value of the pseudo-time.

### 1. Introduction: Soft Bifurcation of a Stationary Nonlinear PDE

### 2. Numerical Description of a Soft Bifurcation: A Problem and a Workaround

#### 2.1. A Pseudo-Dynamic Equation

#### 2.2. A Critical Slowing Down

### 3. Example: A 1D Ginzburg–Landau Equation

### 4. Numerical Solution of the Ginzburg–Landau Equation

#### 4.1. Pseudo-Time-Dependent Equation

#### 4.2. Solution within a Finite Domain

#### 4.3. The Solution Norm and the Convergence Control

#### 4.4. The Critical Slowing Down in the Numeric Process

#### 4.5. In Search of the Bifurcation Point

#### 4.6. Varying the Boundary

### 5. Discussion

#### 5.1. Nonzero Boundary Conditions

#### 5.2. Dimensionality

#### 5.3. A Supercritical (Soft) versus a Subcritical (Hard) Bifurcation

### 6. Summary

### References

### About the Author


The method described here can be applied to solve PDEs coming from different domains. However, it was initially developed to get the numerical solution of a stationary nonlinear PDE with a *bifurcation*. The method’s application to a broader class of equations is briefly discussed at the end of the article.

The term “bifurcation” describes a phenomenon that occurs in some nonlinear equations that depend on one or several parameters. These equations can be algebraic, differential, integral or integro-differential. At some values of a parameter, such an equation may exhibit a fixed number of solutions. However, as soon as the parameter exceeds a critical value (referred to as the *bifurcation point*), the number of solutions changes and either new solutions emerge or some old ones disappear. To be specific, we discuss the case of dependence on a single parameter .

The new solutions can emerge continuously at the bifurcation point. The norm of the solution exhibits a continuous though nonsmooth dependence on the parameter at the bifurcation point (left, Figure 1). An explicit example is in Section 4.5. A bifurcation at which the solution is continuous at the bifurcation point is referred to as *supercritical* or *soft*.

The behavior of the solution in the case of a *subcritical* or *hard* bifurcation is different: the norm of the solution is finite at the bifurcation point but has a jump discontinuity there (right, Figure 1).

**Figure 1.** Soft versus hard bifurcation. In the case of a soft bifurcation, the solution norm depends continuously on the control parameter , with a kink at the bifurcation point . In contrast, in the case of a hard bifurcation, the solution norm is discontinuous at the bifurcation point.

In this article, we focus only on the case of a nonlinear PDE with soft bifurcations; some peculiarities of hard bifurcations are briefly discussed in Section 5.3.

In the most general form, a nonlinear PDE can be written as:

(1) |

Here so that (1) indicates a system of nonlinear PDEs; is an -dimensional vector representing the dependent variable. The subscript indicates that is the solution of a stationary equation. Further, is a -dimensional vector. Finally, is a real numerical parameter. The system of equations (1) is analyzed in a domain subject to zero Dirichlet boundary conditions:

(2) |

Also assume that

(3) |

and thus represents a trivial solution of (1, 2).

It is convenient to separate out the linear part of the operator (1), which is often (though not always) representable in the form and to write it down in the following form:

(4) |

Here is a linear differential operator (such as, for example, the Laplace operator). Further, is the nonlinear part of the operator . The assumption that solves equation (1) implies that .

In its explicit form, we use the representation (4) only in Section 2.2, where we derive the critical slowing-down phenomenon. In all other cases, a general form of the dependence of equation (4) on is valid: and . Nevertheless, we stick to the form (4) for simplicity, while the generalization is straightforward.

Let us also consider an auxiliary equation

(5) |

that yields the linear part of the nonlinear equation (4). Equation (5) represents the eigenvalue problem, where the are its eigenfunctions and the are its eigenvalues, indexed by the discrete variable , provided the discrete spectrum of (5) exists. Let us assume that at least a part of the spectrum of (5) is discrete. We assume here that starts from zero: . The state with is referred to as the *ground state*.

Without proofs, we recall a few facts from bifurcation theory [1] valid for soft bifurcations of such equations.

Assume that the trivial solution is stable for some values of . As soon as the parameter becomes equal to the smallest discrete eigenvalue of the auxiliary equation (5), this solution becomes unstable. As a result, a nontrivial solution branches off from the trivial one. In the close vicinity of the bifurcation point , this solution has the asymptotics

(6) |

where is the set of eigenfunctions of the equation (5) belonging to the eigenvalue . The vector is the set of amplitudes. The scalar product stands for the expression . Here the index (where ) enumerates the eigenfunctions in the -dimensional subspace of the functional space where (5) has a nonzero solution. The exponent exceeds unity: .

There are a few methods available to determine . Listing them is out of the scope of this article. However, the simplest of these methods can be applied if there exists a generating functional enabling one to obtain the system of equations (1) as its minimum condition:

(7) |

where is the variational derivative. This functional we refer to as *energy* in analogy with physics. Substituting the representation (6) into the energy functional and integrating out the spatial coordinates, one finds the energy as a function of the amplitudes and parameter . Minimizing the energy with respect to the amplitudes yields the system of equations for the amplitudes, referred to as the *ramification equation*:

(8) |

The solution of these equations is accurate only close to the bifurcation point . Assuming that the bifurcation takes place with decreasing (as is the case in the following example), one finds the typical solution for the amplitudes,

(9) |

where and are real numbers to be determined using the original equation. One of the methods to analytically find these parameters is discussed in Section 3. Further analytical methods may be found in [1]. This article focuses on finding these parameters numerically (Section 4.5).
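A minimal sketch of this minimization for a generic quartic normal form of the energy (our choice of illustrative model, not the article's equations): with E(a) = α(λ − λ_c)a²/2 + γa⁴/4, the stationarity condition E′(a) = 0 gives a nontrivial amplitude a* = √(α(λ_c − λ)/γ) below the bifurcation, reproducing the square-root scaling of equation (9):

```python
import math

def amplitude(lam, lam_c=1.0, alpha=1.0, gamma=0.25):
    """Minimizer of the model energy E(a) = alpha*(lam - lam_c)*a**2/2
    + gamma*a**4/4 (hypothetical quartic normal form). Below the bifurcation
    (lam < lam_c) a nontrivial minimum branches off continuously from a = 0;
    above it, only the trivial solution a = 0 survives."""
    if lam >= lam_c:
        return 0.0
    return math.sqrt(alpha * (lam_c - lam) / gamma)
```

The amplitude grows continuously from zero as λ decreases through λ_c, with a kink in the branch at the bifurcation point, the signature of a soft bifurcation.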

All theorems and proofs for the preceding statements, along with more general methods of the derivation of the ramification equation, can be found in [1].

The bifurcation theory formulated so far is quite general: equation (1) can be differential, integral or integro-differential [1]. In what follows, we focus only on a more specific class of nonlinear partial differential equations.

The solution of the spectral system of equations (5) yields the bifurcation point ; the solutions (6) and (9) are only valid very close to this point. With increasing , the solution soon deviates from the correct behavior quantitatively, and the solution often fails to resemble (6) even qualitatively. For this reason, to get the solution at some finite that would be correct both qualitatively and quantitatively, one needs to solve (1) numerically.

In the case of a hard bifurcation, none of the machinery of the theory of soft bifurcations described so far works. Studying the bifurcation numerically often becomes the only possibility.

However, the direct numerical solution of nonlinear equations like (1) and (4) with some nonlinear solvers only returns the trivial solution for equation (4), even at the values of the parameter at which the trivial solution is unstable and a stable nontrivial solution already exists.

A plausible reason may be as follows: the solver starts to construct the PDE solution from the boundary. Here, however, the boundary condition is already part of the trivial solution. Thus the solver appears to be placed at the true solution of the equation and is then unable to climb down from it.

To find a nontrivial solution, one needs to use a method that would start from some initial approximation that, even if rough, should be quite different from the trivial solution. Furthermore, this method should converge to the nontrivial solution by a chain of successive steps.

One can do this with the pseudo-dynamic approach formulated in the present article.

Let us introduce pseudo-time . The word “pseudo” indicates that is not real time. It just represents a technical trick that helps with the simulation. Assume now that the dependent variable is a function of both the set of spatial coordinates and the pseudo-time: . Instead of the stationary equation (1), let us study the behavior of the pseudo-time-dependent equation:

(10) |

One solves equation (10) with a suitable nonzero initial condition . Let us stress that the solution of the time-dependent equations (10) is not the same as the solution of the stationary equation (1).
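As a concrete illustration of pseudo-time stepping (our own minimal example: the 1D Ginzburg–Landau equation named in Section 3 serves as the stationary equation, and the discretization, coefficients and sign conventions are our assumptions, not the article's code):

```python
import math

def pseudo_time_solve(lam=2.0, n=50, length=math.pi, dt=5e-4, t_max=20.0):
    """Explicit-Euler pseudo-time stepping for u_t = u_xx + lam*u - u**3 with
    zero Dirichlet boundary conditions on [0, length]. Starting from a small
    nonzero profile, the solution relaxes to a nontrivial fixed point when
    lam exceeds the smallest eigenvalue (pi/length)**2 of -u_xx; below that
    threshold it decays to the trivial solution."""
    dx = length / n
    x = [i * dx for i in range(n + 1)]
    # Nonzero initial condition compatible with the boundary conditions,
    # chosen away from the trivial solution u = 0.
    u = [0.1 * math.sin(math.pi * xi / length) for xi in x]
    for _ in range(int(t_max / dt)):
        new = u[:]                       # endpoints stay pinned at zero
        for i in range(1, n):
            uxx = (u[i - 1] - 2.0 * u[i] + u[i + 1]) / dx ** 2
            new[i] = u[i] + dt * (uxx + lam * u[i] - u[i] ** 3)
        u = new
    return u
```

The explicit step must respect the diffusive stability bound dt ≤ dx²/2; a production computation would instead hand the pseudo-time equation to an adaptive PDE integrator and monitor the solution norm for convergence.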

One could also construct the pseudo-time-dependent equation as follows: , that is, with a minus sign in front of . The idea of such an extension is that either or exhibits a fixed point, so that , while the other diverges as . By trial and error, one chooses the equation whose solution converges to the fixed point .

The operator has not yet been specified; for definiteness let us assume that the fixed point at takes place for equation (10), that is, with the plus sign in front of .

The convergence of the solution of the dynamic equation to the fixed point enables one to apply the following strategy. Instead of the static equation (1), which is difficult to solve numerically, one simulates the quasi-dynamic equation (10) using a suitable time-stepping algorithm.

The advantage of this approach is in the possibility of starting the simulation from an arbitrary distribution chosen as the initial condition, provided it agrees with the boundary conditions. From the very beginning, such a choice takes one away from the trivial solution. The time-stepping process takes the initial condition for each step from the previous solution. The solution starting from any function gradually converges to with time if belongs to its attraction basin.

After having obtained the solution of the pseudo-time-dependent equation, one approximates the stationary solution by the pseudo-time-dependent solution taken at a large enough value of the pseudo-time. The meaning of the words “large enough” is clarified in Section 4.3.

The approach can be given a pictorial interpretation (Figure 2). In the infinite-dimensional functional space, choose an infinite set of basis functions. Then the solution can be represented as an expansion over this basis:

(11)

**Figure 2.** Schematic view of the 3D projection of the infinite-dimensional functional space with a trajectory from the initial state (blue dot) to the fixed point (red dot).

The trajectory in this space goes from the initial state to the final state, as shown by the two dots.

The pseudo-time derivative of the solution represents the velocity of the motion of a point through this space, while the stationary operator can be regarded as a force driving this point. Thus equation (10) can be interpreted as describing the driven motion of a massless point particle with viscous friction through the functional space. In these terms, condition (1) means that the driving force is equal to zero at some point of the space; that point is the location of the fixed point of the nonlinear equation (10).

If the energy functional for equation (1) exists, one can take the interpretation one step further (Figure 3).

**Figure 3.** Schematic view of the energy functional as the function of the coordinate in the functional space (A) above and (B) below the bifurcation point. The cross section of the infinite-dimensional space along a single coordinate is shown. The points show initial positions of the particle, while the arrows indicate its motion to the nearest minimum of the potential well.

Indeed, according to the definition given, equation (1) delivers a minimum to the energy functional. In this case, one can regard the dynamic equation (10) as describing a viscous motion of the massless point particle along a hypersurface in the functional space, the surface forming a potential well. The motion goes from some initial position to the minimum of the potential well, as shown schematically in Figure 3. Above the bifurcation, the only minimum corresponds to the trivial solution (A). Below the bifurcation, the energy hypersurface exhibits a new configuration with new minima, while the previous minimum vanishes. As a result, below the bifurcation, the point particle moves from the initial position (shown by dots in Figure 3) to one of the newly formed minima (as the red and green arrows show in B). The functional space has infinite dimension, and essential features of the numerical process may involve several dimensions. The one-dimensional cross section displayed in Figure 3 is therefore oversimplified and only partially represents the bifurcation phenomenon.

Equation (10) can be rewritten as:

(12)

Though Mathematica at present lacks a dedicated stationary nonlinear solver, its NDSolve function with the method-of-lines option applies efficiently to dynamic equations like (12). This method is used everywhere in the rest of this article.

The evident penalty of this approach is that the computation time can become large, especially in the vicinity of the bifurcation point; this peculiarity is discussed next.

Close to the critical point, the relaxation of the solution to the fixed point slows down dramatically. This is referred to as *critical slowing down*. Its origin is illustrated in Section 4. To simplify the argument, let us consider a single equation with a one-component dependent variable that still depends on the multidimensional coordinate. The generalization to a system of equations is straightforward, though a bit cumbersome.

According to (6), close to the bifurcation point, one can look for the solution of equation (12) in the form:

(13)

Ignoring the higher-order terms on the assumption that the amplitude is small, substitute (13) into equation (12) and linearize it. Here one should distinguish between the case above the critical point, where the linearization should be done around the trivial solution, and the case below it, where one linearizes around the nontrivial solution (the second line of equation (9)). In the former case, one finds

Making use of (5), one finally obtains the dynamic equation for the amplitude above the critical point:

(14)

implying that the amplitude relaxes exponentially, with a relaxation time that grows as the critical point is approached.

Below the critical point, analogous but somewhat lengthier arguments give a characteristic time twice as small as that above the critical point. One comes to the relation:

(15)

One can see that the relaxation time diverges as the critical point is approached from either side. From the practical point of view, this suggests increasing the simulation time according to (15) near the critical point.

The result (15) is valid for equation (12), in which the control parameter enters the linear part of the pseudo-dynamic equation only linearly, multiplying the dependent variable. In the general case, one still finds a diverging relaxation time, though the numerical factors above and below the bifurcation point may be different.

The phenomenon of critical slowing down was first discussed in the framework of the kinetics of phase transitions [2].
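The divergence of the relaxation time can be demonstrated on a one-mode caricature of the amplitude equation. This is a hedged Python sketch, not the article's code: the amplitude obeys `dA/dtau = (lam_c - lam)*A - A**3` under a hypothetical normalization with `lam_c = 1`, so the nontrivial fixed point is `A* = sqrt(lam_c - lam)` for `lam < lam_c`.

```python
import math

# Measure the pseudo-time needed for the amplitude to come within 1% of its
# nontrivial fixed point A* = sqrt(lam_c - lam), marching explicit Euler
# steps of dA/dtau = (lam_c - lam)*A - A**3 from a small nonzero start.

def time_to_converge(lam, lam_c=1.0, A0=1e-3, dt=0.01, t_max=5000.0):
    mu = lam_c - lam
    A_star = math.sqrt(mu)
    A, t = A0, 0.0
    while t < t_max:
        if abs(A - A_star) < 0.01 * A_star:
            return t
        A += dt * (mu * A - A**3)   # explicit Euler pseudo-time step
        t += dt
    return t_max

t_far = time_to_converge(lam=0.9)    # farther from the critical point
t_near = time_to_converge(lam=0.99)  # close to the critical point
print(t_far, t_near)  # the second time is several times larger
```

As the control parameter approaches its critical value, the measured convergence time grows roughly like the inverse distance to the critical point, which is the critical slowing down discussed above.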

As an example, let us study the 1D PDE:

(16)

where the dependent variable is a function of the single coordinate. This equation exhibits a cubic nonlinearity. A classical Ginzburg–Landau equation has only constant coefficients in its linear and cubic terms. In contrast, equation (16) possesses an inhomogeneous potential term with

(17)

shown by the solid line in Figure 4. It thus represents a nonhomogeneous version of the Ginzburg–Landau equation. One can see that (16) has the trivial, identically zero solution.

**Figure 4.** The potential from equation (17) (solid, red) and the solution of the auxiliary equation (18) (dashed, blue).

Equations (16) and (17) play an important role in the theory of the transformation of types of domain walls into one another [3].

The auxiliary equation (5) in this case takes the following form:

(18)

where the index enumerates the eigenvalues and eigenfunctions belonging to the discrete spectrum. One can see that equation (18) represents the Schrödinger equation [4] with the potential well (17), the spectral parameter playing the role of the energy.

The exact solution of the auxiliary equation (18) is known [3, 4]. It has two discrete eigenvalues, and the ground-state solution has the form

(19)

which can be easily checked by direct substitution.
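Such a direct-substitution check is easy to automate. Since the article's exact coefficients are not reproduced in this text, the following Python sketch uses the classic Pöschl–Teller well as a hypothetical stand-in: the potential `U(x) = -2/cosh(x)^2` has the exact ground state `psi0(x) = 1/cosh(x)` with eigenvalue `E0 = -1`, and the residual of the Schrödinger equation is checked numerically.

```python
import numpy as np

# Verify by substitution that psi0 = sech(x) solves
# -psi'' + U(x) psi = E psi with U(x) = -2/cosh(x)^2 and E = -1.
x = np.linspace(-10.0, 10.0, 20001)
h = x[1] - x[0]
psi = 1.0 / np.cosh(x)
U = -2.0 / np.cosh(x) ** 2

# Central-difference second derivative on the interior points.
d2psi = (psi[2:] - 2.0 * psi[1:-1] + psi[:-2]) / h**2
residual = -d2psi + U[1:-1] * psi[1:-1] - (-1.0) * psi[1:-1]
print(np.max(np.abs(residual)))  # only the O(h^2) discretization error remains
```

The residual is zero up to the discretization error of the finite-difference derivative, confirming the eigenpair.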

The energy functional generating the Ginzburg–Landau equation (16, 17) has the form:

(20)

Substituting the representation (6) into equation (20) for the energy, eliminating the term with the derivative using equation (18) and applying the Gauss theorem, one finds the energy as a function of the amplitude:

(21)

The *ramification* equation takes the form:

(22)

with the following solution for the amplitude:

(23)

Let us now look for the numerical solution of equation (16). The problem to be solved is to find the point of bifurcation and the overcritical solution beyond it. The pseudo-time-dependent equation can be written as:

(24)

The choice of the initial condition is not critical, provided it is nonzero. The method of lines employed in the following is relatively insensitive to whether or not the initial condition precisely matches the boundary conditions. We demonstrate the solution with three different initial conditions in the next section.

The method of lines is applied here since it can solve nonlinear PDEs, provided these equations are dynamic, which is exactly the case within the pseudo-time-dependent approach.

To address the problem numerically, let us impose the boundary conditions at a finite distance, rather than at infinity. This distance must be greater than the characteristic dimension of the equation, that is, the distance over which the solution exhibits a considerable variation. For the Ginzburg–Landau equation (16), the characteristic dimension is defined by the width of the potential well (17), which is about 1. Accordingly, let us place the boundaries at a distance several times larger than this width. We check the quality of the result obtained with such a boundary later.

To obtain a precise enough solution, one needs a spatial discretization with a step small compared to the characteristic dimension of the equation, which we just saw is of order 1. A step a few times smaller than this dimension proves to be sufficient.

The following code solves the equation, with the discretization step chosen a few times smaller than the characteristic dimension of the equation.

To avoid conflicts with variables that may have been previously set, this notebook has the setting Evaluation ▶ Notebook’s Default Context ▶ Unique to This Notebook.

According to Section 2, the time-dependent solution obtained converges to the solution of the stationary problem as the pseudo-time tends to infinity. In practice, however, one can instead stop at some finite value of the pseudo-time, provided that it is large enough.

We solve the pseudo-dynamic equation (24) with each of the three initial conditions stated before.
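The article's Mathematica code is not reproduced in this text, so the following is a hedged Python sketch of the same method-of-lines strategy on a hypothetical model problem of the same Ginzburg–Landau type: `u_tau = u_xx + (2/cosh(x)^2 - lam) u - u^3` on a finite interval with zero boundary conditions. For this stand-in the linearization loses stability at `lam_c = 1` (the Pöschl–Teller ground state), so the solution is nontrivial below 1 and trivial above it.

```python
import numpy as np

# Method-of-lines sketch: discretize space with step h and march the
# pseudo-dynamic equation with explicit Euler steps (dt small enough for
# stability of the discrete Laplacian, dt < h^2/2).
L, h = 10.0, 0.1
x = np.arange(-L, L + h / 2, h)

def solve(lam, u0, t_max, dt=0.002):
    u = u0.copy()
    for _ in range(int(t_max / dt)):
        uxx = np.zeros_like(u)
        uxx[1:-1] = (u[2:] - 2.0 * u[1:-1] + u[:-2]) / h**2
        u += dt * (uxx + (2.0 / np.cosh(x) ** 2 - lam) * u - u**3)
        u[0] = u[-1] = 0.0          # zero (Dirichlet) boundary conditions
    return u

u0 = 0.1 * np.exp(-x**2)            # any nonzero start inside the basin
u_sub = solve(lam=0.5, u0=u0, t_max=60.0)   # below lam_c: nontrivial
u_sup = solve(lam=1.5, u0=u0, t_max=60.0)   # above lam_c: decays to zero
print(np.max(np.abs(u_sub)), np.max(np.abs(u_sup)))
```

Below the bifurcation the marching saturates at a bell-shaped nontrivial profile; above it the same initial condition decays to the trivial solution, mirroring the behavior described in the article.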

Further, in order to give a feeling for the method, we visualize and animate the solution, varying the control parameter as well as the initial conditions. This requires a few comments. As discussed in Section 2.2, the maximum simulation time strongly depends on the control parameter. This is accounted for by choosing the maximum time according to (15), with a prefactor chosen by trial so that the simulation does not last too long, yet always ensures convergence for any combination of parameter value and initial condition.

In the simulations, you can observe two essential features of the present method.

First, near the fixed point, the solution converges more slowly and the curve gradually appears to stop changing.

Second, near the critical point, critical slowing down (see Section 2.2) takes place, so that considerably longer times are required to approach the fixed point. In the animation, the curve evolves much more slowly for parameter values close to the critical one, and the convergence therefore requires much more time.

In the animation, choose one of the three initial conditions and a value of the control parameter. Click the button with the arrow to start the animation. The value of the current time is shown at the top-left corner. At the initial time, the blue curve corresponds to the initial condition, and the animation then shows its further evolution.

For each of the three initial conditions, the solution converges to the same bell-shaped curve. One can verify that for low values of the control parameter, the solution is nonzero, while for values greater than about 0.5, the solution is trivial.

To get an accurate solution, one needs to control the convergence as the pseudo-time increases. Here we control the convergence by analyzing the behavior of the integral

(25)

(the norm of the solution in Hilbert space) at a fixed value of the control parameter, as a function of the simulation time. The norm is zero above the bifurcation point but nonzero below it.

We show how the norm depends on the time limit at three fixed values of the control parameter, all below the bifurcation point.

The following code makes a nested list containing three sublists corresponding to the three parameter values. Each sublist consists of pairs of the simulation time and the corresponding norm, with the time increasing from 10 to approximately 3000. The exponential rate of increase is chosen so as to make the points on a semilogarithmic plot look equally spaced (Figure 5).
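The same convergence monitoring can be sketched in Python on the one-mode caricature `dA/dtau = mu*A - A**3` (a hypothetical normalization with `mu` the distance to the critical point), recording the squared norm at exponentially spaced simulation times.

```python
import numpy as np

# Record the squared Hilbert norm (here simply A^2 for the single mode)
# at simulation times spaced exponentially from 10 to 3000, so that the
# points look equidistant on a semilogarithmic plot.

def norm_at(t_max, mu=0.05, A0=1e-3, dt=0.01):
    A, t = A0, 0.0
    while t < t_max:
        A += dt * (mu * A - A**3)   # explicit Euler pseudo-time step
        t += dt
    return A * A

t_values = np.geomspace(10, 3000, 9)   # log-spaced simulation times
norms = [norm_at(t) for t in t_values]
print(norms)  # grows, then saturates at the fixed-point value mu
```

The norm first grows and then saturates at its fixed-point value; the time at which the plateau is reached is the simulation time that guarantees a converged solution at this value of the control parameter.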

**Figure 5.** Semilogarithmic plots of the Hilbert norm of the solution for three values of the control parameter (disks, squares and diamonds), depending on the simulation time.

There is convergence for all three parameter values. However, the simulation time for which the convergence is satisfactory depends on the control parameter. For the value farthest from the bifurcation point, the solution is already near convergence at times only slightly exceeding 100; with a somewhat larger time, one can thus be sure that the solution is satisfactory. We use this in Section 4.4 to determine the expression for the simulation time accounting for the critical slowing down.

In contrast, the solution for the value closest to the bifurcation point still shows some evolution even at the largest simulation times.

As we showed in Section 2.2, the simulation time that gives satisfactory convergence depends on the control parameter. To get an accurate solution, the simulation time must considerably exceed the relaxation time. For example, in the calculation shown in Figure 5, the convergence only becomes good enough at a simulation time about eight times greater than the relaxation time obtained from (15). This implies that to find an accurate solution in the close vicinity of the bifurcation point, one has to define the simulation time as a function of the control parameter by

(26)

where the additional constant is a regularization parameter.
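A schedule of this kind can be sketched in one line of Python. The constants `t0` (the regularization floor) and `c` (the multiple of the diverging relaxation-time estimate) are hypothetical tuning parameters, not values from the article.

```python
# Simulation-time schedule in the spirit of (26): a constant floor t0 plus
# a multiple c of the relaxation-time estimate 1/|lam - lam_c|, which
# diverges at the bifurcation point.

def t_max(lam, lam_c, t0=100.0, c=8.0):
    return t0 + c / abs(lam - lam_c)

print(t_max(0.4, 0.5), t_max(0.49, 0.5))  # grows sharply near lam_c
```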

The bifurcation point can be found by analyzing the same integral, now calculated with the simulation time defined by (26). This time we study the integral as a function of the control parameter.

The transition from the nontrivial solution to the trivial one occurs at the bifurcation point. Accordingly, the integral changes there from a nonzero value to zero.

To find the critical point, note that bifurcation theory (23) predicts the norm to have the form:

(27)

We find the constant parameters in (27) by fitting.

We now find the numerical solution of equation (16) as a function of the control parameter and compute the norm of each solution. We vary the parameter from 0.45 up to the vicinity of the critical point to create a list of parameter–norm pairs. The dependence is most critical close to the critical point, so the points there are taken to be about 10 times denser. This list is fitted to the function (27) and plotted together with the analytic function obtained by fitting (Figure 6).
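The simulate-then-fit pipeline can be sketched in Python on the one-mode caricature `dA/dtau = (lam_c - lam)*A - A**3` with a hypothetical `lam_c = 1`: simulate to a large pseudo-time for several parameter values below the bifurcation, record the norm, and fit the square-root law of (27) to recover the bifurcation point.

```python
import numpy as np
from scipy.optimize import curve_fit

# Converged norm of the one-mode caricature at control parameter lam.
def norm(lam, A0=1e-3, dt=0.01, t_max=600.0):
    A = A0
    for _ in range(int(t_max / dt)):
        A += dt * ((1.0 - lam) * A - A**3)
    return A

lams = np.linspace(0.45, 0.95, 11)          # all below the bifurcation
norms = np.array([norm(l) for l in lams])

# Square-root law of bifurcation theory: a * sqrt(lam_c - lam), clipped
# to zero above the candidate bifurcation point.
def model(lam, a, lam_c):
    return a * np.sqrt(np.clip(lam_c - lam, 0.0, None))

(a_fit, lam_c_fit), _ = curve_fit(model, lams, norms, p0=[1.0, 1.1])
print(a_fit, lam_c_fit)  # the fit recovers lam_c close to 1
```

The initial guess for the bifurcation point is deliberately placed above all data points, so that the fit starts in the smooth region of the model; the fitted value then converges to the true bifurcation point of the caricature.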

**Figure 6.** Behavior of the Hilbert norm of the solution in the vicinity of the bifurcation point. Dots show the integrals (25), while the solid line indicates the result of fitting with the relation (27), yielding the bifurcation point.

The values of the integrals at various parameter values are shown by the red dots in Figure 6, while the fitting curve is shown by the solid blue line. The fit yields the value of the bifurcation point.

We used equation (26) for the simulation time in the solution. However, this equation depends on the critical spectral value. In the present case, that value was known, which considerably simplifies the task. In general, the critical value is only established in the course of the fitting procedure, requiring an iterative approach. For the first simulation, we fix some large enough simulation time independent of the control parameter and obtain a fit. This fit gives a first guess for the critical value, which can then be used for the simulation with the time given by equation (26). This procedure can be repeated until a satisfactory value is achieved.

To check how the choice of the boundary affects the results, we solve the problem by gradually increasing the boundary distance (Figure 7). (This takes some time.)

**Figure 7.** A double-logarithmic plot showing the convergence of the bifurcation point with increasing boundary distance.

Figure 7 displays the error in the spectral value obtained by the numerical process. As one could have expected, the error decreases as the boundary distance increases.

The preceding example has shown the application of the pseudo-dynamic approach for solving a 1D nonlinear PDE with zero boundary conditions that exhibits a supercritical (soft) bifurcation. That simple problem was chosen to keep the processing time as short as possible. Now possible extensions are discussed.

Recall that zero boundary conditions often (if not always) represent a problem for a nonlinear solver. Starting from zero along the boundary, such a solver often returns only the trivial solution, since zero is, indeed, a solution of the equation considered here. For this reason, a problem like the one discussed in this article necessarily requires some specific approach that can converge to a nontrivial solution. It is for this type of equation that the approach presented here has been developed.

One should, however, make two comments.

First, there are numerous problems where the bifurcation takes place from a solution that is nonzero, so that the boundary condition is nonzero as well. A trivial observation shows that one comes back to the original problem by shifting the dependent variable by that solution.

Second, the approach formulated here can be applied to nonlinear equations with no bifurcation. These equations can have boundary conditions that are either zero or nonzero. Indeed, such equations can often be solved by a nonlinear solver if one is available. Among other approaches, the present one can be applied; the nonzero boundary conditions are not an obstacle for the transition to the pseudo-time-dependent equation.

Though the present approach takes longer, in certain cases it is preferable; for example, when the nonlinear solvers fail due to a strong nonlinearity. The solver moves along the pseudo-time in small steps, gradually passing from the initial condition to the final solution. Such slow ramping can be stable.

The space dimensionality does not limit the application of our approach (for 2D examples, see [5, 6]).

In the case of a soft bifurcation, the energy can have only one type of minimum, as shown in Figure 3, describing the convergence to either the trivial or the nontrivial solution. The trajectory always flows into the minimum along the steepest slope of the energy hypersurface. The minimum is a fixed point.

An essentially different situation occurs for a hard bifurcation, when the hypersurface may have multiple minima. Figure 8 (A) shows a schematic cross section of the infinite-dimensional functional space along a single plane, leaving out all other dimensions. This cross section shows a situation with minima of different types, one of which is more pronounced than the others. The arrows schematically indicate trajectories in the functional space. These start from the initial conditions displayed by the dots in Figure 8 (A, B) and converge to the minima (Figure 8 A). The green arrow shows convergence to the principal minimum, while the red one shows convergence to a secondary minimum.

**Figure 8.** Schematic view of the energy functional along a direction of the functional space, where it exhibits a metastable minimum (A). The green point schematically indicates the initial condition starting from which the solution converges to the one corresponding to the principal energy minimum (green arrow), while the red dot shows the initial condition leading to the convergence to the secondary minimum. (B) The trajectory ends at an inflection point.

As a result, depending on the choice of initial condition, some solution trajectories may end up at a fixed point that is a secondary minimum rather than the main one.

Also, keep in mind that the functional space has infinite dimension and can harbor many unobvious secondary minima.

There can also be inflection and saddle points of the energy hypersurface (Figure 8 B). The trajectory completely stops at such a point.

It is a fundamental question whether or not such secondary fixed points as well as the inflection points belong to the problem under study. The answer is not straightforward. One should look for such an answer based on the origin of the equation.

Let us also mention possible gently sloping valleys in the energy relief. In this case, the motion along such a shallow slope may appear practically indistinguishable from an asymptotic falling into a fixed point during the numerical process.

This article offers an approach to solve nonlinear stationary partial differential equations numerically. It is especially useful in the case of equations with zero boundary conditions that have both a trivial solution and nontrivial solutions. The approach is based on solving a pseudo-time-dependent equation instead of the stationary one, the initial condition being different from zero. Then the solver can avoid sticking to the trivial solution and is able to converge to a nontrivial solution. However, the penalty is increased simulation time.

[1] | M. M. Vainberg and V. A. Trenogin, Theory of Branching of Solutions of Non-linear Equations, Leyden, Netherlands: Noordhoff International Publishing, 1974. |

[2] | E. M. Lifshitz and L. P. Pitaevskii, Physical Kinetics: Course of Theoretical Physics, Vol. 10, Oxford, UK: Pergamon, 1981, Chapter 101. |

[3] | A. A. Bullbich and Yu. M. Gufan, “Phase Transitions in Domain Walls,” Ferroelectrics, 98(1), 1989, pp. 277–290. doi:10.1080/00150198908217589. |

[4] | L. D. Landau and E. M. Lifshitz, Quantum Mechanics: Course of Theoretical Physics, Vol. 3, 3rd ed., Oxford, UK: Butterworth-Heinemann, 2003. |

[5] | A. Boulbitch and A. L. Korzhenevskii, “Field-Theoretical Description of the Formation of a Crack Tip Process Zone,” European Physical Journal B, 89(261), 2016, pp. 1–18. doi:10.1140/epjb/e2016-70426-6. |

[6] | A. Boulbitch, Yu. M. Gufan and A. L. Korzhenevskii, “Crack-Tip Process Zone as a Bifurcation Problem,” Physical Review E, 96(013005), 2017, pp. 1–19. doi:10.1103/PhysRevE.96.013005. |

A. Boulbitch, “Pseudo-Dynamic Approach to the Numerical Solution of Nonlinear Stationary Partial Differential Equations,” The Mathematica Journal, 2018. doi:10.3888/tmj.20-8.

Alexei Boulbitch graduated from Rostov University (USSR) in 1980 and obtained his Ph.D. in theoretical solid-state physics in 1988 from this university. In 1990 he moved to the University of Picardie (France) and later to the Technical University of Munich (Germany). The Technical University of Munich granted him his habilitation degree in theoretical biophysics in 2001. His areas of interest are bacteria, biomembranes, cells, defects in crystals, phase transitions, physics of fracture (currently active), polymers and sensors (currently active). He presently works in industrial physics with a focus on sensors and gives lectures at the University of Luxembourg.

**Alexei Boulbitch**

*Zum Waldeskühl 12
54298 Igel
Germany*