The action of Möbius transformations with real coefficients preserves the hyperbolic metric in the upper half-plane model of the hyperbolic plane. The modular group is an interesting group of hyperbolic isometries generated by two Möbius transformations: an element of order two and an element of infinite order. Viewing the action of the group elements on a model of the hyperbolic plane provides insight into the structure of hyperbolic 2-space. Animations provide dynamic illustrations of this action.

This article updates an earlier article [1].

Transformations of spaces have long been objects of study. Many of the early examples of formal group theory were groups of transformations of spaces. Among the most important transformations are the *isometries*, those transformations that preserve lengths. Euclidean isometries are translations, rotations and reflections. The groups and subgroups of Euclidean isometries of the plane are so familiar to us that we may not think of them as revealing much about the space they transform. In hyperbolic space, however, light, or even a person, traveling along a hyperbolic shortest-distance path tends to veer away from the boundary. Thus, the geometry is unusual enough that viewing the actions of isometries of hyperbolic 2-space reveals some of the shape of that space. Two-dimensional hyperbolic space is referred to as the *hyperbolic plane*.

Here are graphic building blocks used for all of the animations.

Figure 1 shows four cyan and white regions, each bounded by some combination of three arcs or rays. Any two adjacent regions make up a *fundamental region*. The two fundamental regions shown on either side of the imaginary axis (each with one white and one cyan half) are related by the function z ↦ −1/z, which is an inversion over the unit circle composed with a reflection in the imaginary axis.

**Figure 1. **Two fundamental domains on either side of the imaginary axis.

This article examines how elements of the modular group rearrange the triangular-shaped regions shown in Figure 2. The curved paths are arcs of circles orthogonal to the real axis. Arcs on these circles are hyperbolic geodesics, that is, shortest-distance paths in hyperbolic 2-space. In Euclidean space, the shortest-distance paths lie on straight lines. In hyperbolic space, shortest paths lie on circles that intersect the boundary of the space at right angles. Hyperbolic distances are computed as if there is a penalty to pay for traveling near the plane’s boundary. Thus, the shortest-distance paths between two points must bend away from the boundary.

**Figure 2. **The upper half-plane model of the hyperbolic plane.

In the animations that follow, it is instructive to focus on the action that a transformation takes on the family of circles that meet the real axis at right angles. The transformations that we consider, namely members of the modular group, preserve this family of circles. The circles in the family are shuffled among themselves, but no new circles are created and none are taken away. One could say that in the context of hyperbolic geometry, the transformations preserve the family of all shortest-distance paths. Indeed, this is an excellent thing for an isometry to do!

The context of this article is described in Chapter 2 of [2]. In this small text, one can find illustrations that inspired our animations. The formulas, which made coding the animations much simpler than one might expect, are given there and justified in detail.

First consider a class of functions known as Möbius transformations. These transformations are named after the same mathematician with whom we associate the one-sided, half-twisted Möbius band. Möbius transformations are defined by

T(z) = (a z + b)/(c z + d), where a, b, c, d ∈ ℂ and a d − b c ≠ 0.

Here ℂ stands for the complex numbers. Over the reals, a Möbius transformation with real coefficients falls into one of two categories: either c = 0, and the graph is a straight line, or c ≠ 0, and the graph is a hyperbola. A representation of this latter type of function is shown in Figure 3.
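The article's own code is written in the Wolfram Language; as a language-neutral illustration, the defining formula can be sketched in Python (the function name `mobius` is ours, not from the article):

```python
def mobius(a, b, c, d, z):
    """Apply the Moebius transformation z -> (a*z + b)/(c*z + d).

    The coefficient matrix must be nonsingular (a*d - b*c != 0);
    otherwise the map collapses to a constant.
    """
    if a * d - b * c == 0:
        raise ValueError("a*d - b*c must be nonzero")
    return (a * z + b) / (c * z + d)

# With c = 0 the restriction to the reals is the line y = (a*x + b)/d;
# with c != 0 it is a hyperbola with a pole at x = -d/c.
print(mobius(1, 2, 0, 1, 3.0))   # affine case: 3 + 2 = 5.0
print(mobius(0, -1, 1, 0, 2.0))  # z -> -1/z at z = 2 gives -0.5
```

The same function works for complex arguments, which is how it is used throughout the rest of the article.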

**Figure 3. **Graph of a Möbius transformation with real coefficients and c ≠ 0, shown with dashed asymptotes.

Our purpose is to investigate how Möbius transformations stretch and twist regions in the *extended complex plane*. The *complex plane* is the usual Euclidean plane with each point identified as a complex number, namely z = x + i y. The *extended complex plane* ℂ ∪ {∞} is formed from the complex plane by adding the point at infinity. A Möbius transformation is one-to-one (injective) on the extended complex plane.

When a Möbius transformation T acts on a complex number z, we may view the action as moving the point z to the point T(z). Importantly, a Möbius transformation maps the set of circles and lines in the extended complex plane back to circles and lines. A comprehensive proof of this fact may be found in most elementary texts on complex variables, for example, in [3], p. 158.

The figures of our animations live in the extended complex plane. Each point of a figure, taken as a complex number, is acted on by the Möbius transformations. These transformations spin hyperbolic 2-space about a fixed point or shift the space in one direction or another.

The *modular group* is a special class of Möbius transformations:

Γ = { T(z) = (a z + b)/(c z + d) : a, b, c, d ∈ ℤ, a d − b c = 1 }.

That is, if T ∈ Γ, the coefficients of T are integers and the coefficient matrix has determinant equal to one.
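Membership in the modular group is a simple check on the coefficients. This Python predicate (our own naming, not the article's) tests both conditions:

```python
def in_modular_group(a, b, c, d):
    """True if z -> (a*z + b)/(c*z + d) belongs to the modular group:
    integer coefficients and determinant a*d - b*c equal to one."""
    return (all(isinstance(k, int) for k in (a, b, c, d))
            and a * d - b * c == 1)

print(in_modular_group(0, -1, 1, 0))  # z -> -1/z: True
print(in_modular_group(1, 1, 0, 1))   # z -> z + 1: True
print(in_modular_group(2, 0, 0, 1))   # determinant 2: False
```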

What is a group? Recall that a *group* is a set G together with a binary operation satisfying certain properties: (1) the set must be closed under the operation; (2) the operation must be associative; and (3) there must be an identity element for the operation, and all inverses of elements in G must themselves be elements of G. The proof that the modular group is, in fact, a group under the operation of function composition is a standard exercise in a course on complex analysis. (See, for example, [3], pp. 277–278.)

We take as established that the elements of the modular group do indeed form a group and investigate some of the interesting subgroups.

One of our main goals is to investigate how the elements of the modular group act on fundamental regions, that is, how the regions are stretched and bent when we view them as Euclidean objects. As hyperbolic objects, the regions are all carbon copies of each other, in much the same way that the squares on a checkerboard are all identical in ordinary Euclidean geometry.

In general, a group of one-to-one transformations acting on a topological space partitions that space into *fundamental regions*. For a collection of sets {R_α} to be a collection of fundamental regions, certain properties must hold. First and foremost, the R_α must be pairwise disjoint. Second, given any transformation T in the group other than the identity, R_α and T(R_α) are disjoint. Finally, given any two regions R_α and R_β, there exists some transformation T in the group such that T(R_α) = R_β.

Generally, in order to cover the entire space without overlapping, each fundamental region must contain some but not all of its boundary points. This technicality is set aside for the purposes of this article.

In fact, in this article we relax the definition to include *all* of the boundary points for a particular fundamental region. Thus, adjacent fundamental regions can only overlap on their boundaries. The essential feature remains that there is no *area* in the intersection of adjacent regions.

A group of transformations does not necessarily yield a unique partition of the space into fundamental regions. Thus, the fundamental regions we view are merely representative fundamental regions.

Figure 4 shows a fundamental region of the modular group with some parts highlighted.

**Figure 4. **A fundamental region with vertices marked and a pair of tangents.

Each fundamental region contains four vertices that can be fixed by elements of the modular group. (A point z is *fixed* by a transformation T if T(z) = z.) Tangents are drawn at one vertex; the angle there is 60 degrees. The vertex at the top has a straight, 180-degree angle. The vertex at the bottom has a zero-degree angle because the tangents to the intersecting arcs coincide there. Any hyperbolic polygon with a vertex on the boundary of the space (the real axis in this case of the upper half-plane) has a zero-degree angle at that vertex. The corresponding four angles in each fundamental region have the same measures as those indicated here. Each vertex can be fixed by some element in the modular group. Further, each fundamental region can be mapped onto any other fundamental region by an element of the modular group.

A classic view of the matter is to see the upper half-plane as tessellated (or tiled) by triangular-shaped regions, as in Figure 2. A checkerboard tessellation of the Euclidean plane can be constructed by sliding copies of a square to the left, right, up and down. Eventually, the plane is covered with square tiles. The modular group tessellates the hyperbolic plane in an analogous way. The elements of the group move copies of a fundamental region until triangular-shaped tiles cover the upper half-plane model of the hyperbolic plane. Of course, these tiles do not appear to be identical to our eyes, trained to match shapes and lengths in Euclidean geometry. However, the triangular-shaped tiles are all identical if measured using the hyperbolic metric. In the tiling process, all areas in the upper half-plane are covered by tiles and no two tiles have any overlapping area. In fact, this procedure is precisely how the hyperbolic plane illustration was constructed. The boundary points for a single fundamental region were acted on by elements of the modular group, and the resulting points were drawn as boundary lines in the illustration.

It helps to note that each transformation in the modular group has at least one *fixed point* in the extended complex plane. Some transformations have two fixed points. Only the identity map has more than two. In the illustrations that follow, we observe the placement of fixed points and the way transformations map fundamental regions near them.

The hyperbolic metric is a rather curious metric that challenges our notion of distance. Under the hyperbolic metric in the upper half-plane, the shortest path between two points lies along a vertical line or an arc of a circle perpendicular to the boundary (the real axis). For example, the shortest hyperbolic path between the two points shown in Figure 5 is the top arc of the circle that passes through both points and is perpendicular to the real axis.

**Figure 5. **The shortest hyperbolic path between two points in the upper half-plane.
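To make the construction in Figure 5 concrete: a circle centered on the real axis automatically meets it at right angles, so the geodesic through two points with different real parts has a real center equidistant from both. A short Python sketch (the helper `geodesic_circle` is ours, not from the article):

```python
def geodesic_circle(z1, z2):
    """Center (a real number) and radius of the circle through z1 and z2
    that is perpendicular to the real axis; requires Re(z1) != Re(z2),
    since points stacked vertically are joined by a vertical line instead.
    """
    # Solve |z1 - c|^2 = |z2 - c|^2 for the real center c.
    c = (abs(z1) ** 2 - abs(z2) ** 2) / (2 * (z1.real - z2.real))
    return c, abs(z1 - c)

c, r = geodesic_circle(1j, 1 + 1j)
print(c, r)  # center 0.5 on the real axis, radius sqrt(1.25)
```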

Without discussing precisely how hyperbolic lengths and areas are measured, we state that every image of a fundamental region under a transformation in the modular group is *congruent* to the original under the hyperbolic metric. Thus, all of the fundamental regions shown in the animations are actually the same size in the hyperbolic metric. For a discussion of hyperbolic metrics, [4] is a good place to start.

We structure our investigation of the modular group by considering four cyclic subgroups. Recall that a *cyclic subgroup* is generated by computing all integer powers of a single group element. The four cyclic subgroups we present are representative of the four possible types of subgroups found in the modular group.

For the first subgroup, consider the function z ↦ −1/z; it is a Möbius transformation with coefficients a = 0, b = −1, c = 1, d = 0 and coefficient matrix ((0, −1), (1, 0)). This function has order two in the modular group; applying it twice returns every point to its original position, so the function is its own inverse. In this case, it generates a subgroup with only two elements: itself and the identity.
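The order-two behavior is easy to confirm numerically; a quick Python check, assuming the generator z ↦ −1/z as described above:

```python
def a(z):
    """The order-two element z -> -1/z (coefficient matrix ((0,-1),(1,0)))."""
    return -1 / z

z = 2.0 + 3.0j
print(a(a(z)))  # applying the map twice returns the original point
print(a(1j))    # i is the fixed point in the upper half-plane
```

Squaring the coefficient matrix gives −I, which acts as the identity Möbius transformation, matching the order-two claim.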

In this article, we adopt the standard notation that angle brackets ⟨ ⟩ indicate the group generated by taking products of the elements enclosed by the brackets. Curly braces { }, on the other hand, enclose the delineated list of elements in a set.

The order-two Möbius transformation, together with auxiliary functions used to render its motion as continuous, produces the animation in Figure 6.

**Figure 6. **Action of the order-two element.

The animation depicts the way in which the order-two element maps the two fundamental regions shown in Figure 2 onto one another. In fact, its action on the fundamental regions is to hyperbolically rotate them 180° onto each other about the central fixed point i. The actual mapping is performed instantaneously, without rotation. In particular, only the first, middle and final frames contain illustrations of fundamental regions. However, the sequence of intermediate mappings illustrates through animation the mapping properties of the transformation. In a later section, we discuss how each illustrated function was broken into a composition of functions so that its hyperbolic motion could be rendered as continuous.

This example highlights the fact that vertical lines are paths of least distance in the upper half-plane model of the hyperbolic plane. Indeed, it is usual to view straight lines as circles that have infinite radius and pass through the point at infinity. With this bending of the definition of a circle, a vertical line has all the characteristics required of a geodesic in the hyperbolic plane. Like the circles, a vertical line is perpendicular to the real axis, which is the boundary of the upper half-plane model of the hyperbolic plane. A vertical line is the limit of a sequence of geodesic circles.

The second example (see Figure 7) is a subgroup of infinite order generated by the linear shift (or translation) z ↦ z + 1; it is a Möbius transformation with coefficients a = b = d = 1, c = 0 and matrix ((1, 1), (0, 1)). The translation has infinite order: no number of repeated applications gives the identity, and no point of the complex plane ever returns to its original position no matter how many times the shift is applied, though the point at infinity stays fixed in the extended plane. The subgroup produced by taking all powers of the shift and its inverse is infinite cyclic.

Every point in the plane shifts one unit to the right under the action of the translation. The infinite half-strips in the following are images of each other under powers of the translation. For contrast, we also provide images of these infinite half-strip regions under the order-two map z ↦ −1/z. These images are bunched in a flower-like arrangement attached to the real axis at the origin. As the blue infinite regions are pushed from left to right, their magenta images echo their motion in a counterclockwise direction. These two actions are not produced by a single transformation. The two transformations that cause these actions are closely related to each other as algebraic conjugates, but more on that in a later section.

**Figure 7. **This animation shows copies of fundamental regions moving back and forth, with corresponding regions anchored at the origin.

The hyperbolic isometry z ↦ z + 1 is notable among the elements of the modular group because it is also a Euclidean isometry. Under the hyperbolic metric, the magenta regions are each congruent to the half-strip regions in blue.

The third cyclic subgroup that we consider is generated by the composition of the first two functions, z ↦ −1/z and z ↦ z + 1. The subgroup generated by this composition has order 3.

Here is the fixed point of the order-three element.

Define a Möbius transformation and its inverse.

In Figure 8, one function moves the fixed point to the origin and the other moves it back.

The function is an order-three hyperbolic rotation made continuous; it is used in Figures 8 and 11.

**Figure 8. **Action of the order-three element.

A red fundamental region and a green fundamental region are shown associated with the blue fundamental region attached to the origin in the animation’s first frame. We include these in order to provide a better orientation for the scene. Of special interest is how the point of the blue region on the real axis moves as the rotation takes place. The point begins at the origin and slides toward the right along the positive real axis. The blue lines of the cluster become vertical precisely when that point arrives at the point at infinity! The point continues by sliding along the negative real axis to arrive back at the origin. It is fair to say that the motion of a point as it passes through the origin is a “mirror image” of the motion of the point as it passes through the point at infinity. The function used here is a composition of Möbius transformations that is described and demonstrated in Figure 12.

The rotations we saw in the first and third examples are of orders two and three, respectively. That is to say, after the rotation is repeated a number of times, all points are back to their original positions. In contrast, the translation z ↦ z + 1 generates an infinite subgroup. When we iterate the translation, the right shifts accumulate at the point at infinity. Points in the left half-plane are repelled by infinity, while points in the right half-plane are attracted to infinity. Of course, since all points in the left half-plane eventually map to points in the right half-plane, all points are, in some sense, simultaneously attracted to and repelled by infinity under the action of the translation. Indeed, the point at infinity is the single fixed point for its action.

The next transformation in the modular group generates an infinite subgroup that differs from the translation subgroup in the sense that its generator has two distinct fixed points, an attractor and a repeller.

Define the well-known golden ratio φ = (1 + √5)/2; its reciprocal is 1/φ = φ − 1 = (√5 − 1)/2.
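A quick numerical check of these identities in Python:

```python
import math

phi = (1 + math.sqrt(5)) / 2  # the golden ratio

print(phi)                      # 1.618033988749895
print(1 / phi)                  # 0.618033988749895
print(abs(1 / phi - (phi - 1))) # the reciprocal equals phi - 1
```

The defining property φ² = φ + 1 is what makes the reciprocal identity work.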

This defines a hyperbolic element with fixed points φ and −1/φ.

Define a transformation to move the fixed points φ and −1/φ to infinity and zero, respectively.

Define its inverse, which sends infinity back to φ and zero back to −1/φ.

For Figure 9, a composition of functions makes the hyperbolic translation continuous. The scaling ratio determines the length of the hyperbolic translation.

**Figure 9. **Action of the hyperbolic element.

The animation depicts the action of the hyperbolic element on fundamental regions in the plane. All points exterior to the red circle on the left are mapped to the interior of the green circle on the right. The animation begins with regions that lie exterior to the red and green circles. These regions are all mapped to the area between the green circle and the smaller cyan circle. If the action were repeated, the regions would be mapped into the interiors of increasingly small circles inside the smallest (cyan) circle shown. The attracting fixed point lies within these shrinking, nested circles.

The rotations and translations we have seen as examples are intimately related to Euclidean rotations and translations, as discussed in Section 6. The transformation is related in a similar way to a Euclidean dilation, which turns a figure into a similar but not congruent image figure. A curious characteristic of hyperbolic space is that the distinction between similarity and congruence disappears. In the hyperbolic plane, it is enough for two figures to have the same angles to guarantee congruence. In marked contrast to Euclidean space, equal angles guarantee that corresponding side lengths are equal in the hyperbolic metric.

The entire group can be generated by the two functions z ↦ −1/z and z ↦ z + 1. Establishing this fact requires tools from linear algebra about which we make only a few brief comments. The group of 2×2 matrices with real number entries and with nonzero determinants is denoted by GL(2, ℝ). This group has been studied extensively and much is known about it. Thus, there are great advantages for any group that can be represented as a subgroup of GL(2, ℝ). While the modular group cannot be represented in exactly this way, it almost can be. For instance, a matrix M and its negative −M are considered distinct in GL(2, ℝ). On the other hand, the actions of the two associated Möbius transformations are identical, since (−a z − b)/(−c z − d) = (a z + b)/(c z + d). In general, for every element in the modular group, there are two associated elements in GL(2, ℝ).

A remarkable feature of Möbius transformations is that the group operation of composition produces coefficients that are identical to the results of matrix multiplication. To see this, consider two Möbius transformations f(z) = (a z + b)/(c z + d) and g(z) = (p z + q)/(r z + s). First, multiply the associated coefficient matrices.

Second, carry out the composition of the two functions.

The coefficients and the four matrix entries are the same!
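This correspondence is easy to verify numerically. The following Python sketch composes two transformations pointwise and compares the result with the transformation built from the matrix product (the helper names `mobius` and `matmul` are ours):

```python
def mobius(m, z):
    """Apply the Moebius transformation with coefficient matrix m."""
    (a, b), (c, d) = m
    return (a * z + b) / (c * z + d)

def matmul(m1, m2):
    """2x2 matrix product."""
    (a, b), (c, d) = m1
    (e, f), (g, h) = m2
    return ((a * e + b * g, a * f + b * h),
            (c * e + d * g, c * f + d * h))

F = ((1, 1), (0, 1))   # z -> z + 1
G = ((0, -1), (1, 0))  # z -> -1/z

z = 2.0 + 3.0j
print(mobius(F, mobius(G, z)))   # the composition f(g(z))
print(mobius(matmul(F, G), z))   # the same point from the matrix product
```

Both lines print the same complex number, and the product matrix again has determinant one, so the composition stays in the modular group.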

In this way, the group operation of composition of functions in the modular group can be replaced with the group operation of matrix multiplication in GL(2, ℝ). It is down this path we would travel if we were to present a complete proof of the claim that the modular group is generated by the two elements.

A major part of this claim is that any element of the modular group can be written as a composition of the two generators z ↦ −1/z and z ↦ z + 1. Consider the following examples of compositions.

Each of these functions has a coefficient matrix with determinant equal to one. A worthy exercise for undergraduate mathematics students is to verify by direct computations that each equality holds for the indicated compositions.

The four examples of cyclic subgroups outlined in Section 4 give a complete description of the four types of subgroups possible in the modular group. Any subgroup of the modular group is *conjugate* to a subgroup generated by an order-two element, an order-three element, an iterate of the translation or an iterate of a hyperbolic element of the same type as the golden-ratio example.

Recall that in a group G, a subgroup H′ is *conjugate* to a subgroup H if there exists an element g in G such that the entire subgroup H′ can be generated by computing g h g⁻¹ for every h in H. More compactly, we write H′ = g H g⁻¹.

Here we define the hyperbolic translation that relates two order-three elements.

Consider a function in the modular group that generates an order-three subgroup (Figure 10).

**Figure 10. **The action of the order-three function on a selection of fundamental regions.

We view side by side the actions of the original order-three function on the right and its conjugate on the left. For the figure on the left, first the hyperbolic translation moves the fixed point of one order-three element onto the fixed point of the other. Then the rotation turns the attached fundamental regions, as we have seen it do before, while at the same time the conjugate function acts on the right-hand figure. Finally, the inverse of the translation returns the fixed point and associated regions to their original position, except that the fundamental regions have been rotated in the same way as those on the right. Thus, the final results are the same in both cases.

The function is used for the continuous motion in Figure 11.

**Figure 11. **The actions of two conjugate order-three elements.

This animation demonstrates what it means for two functions to be conjugate equivalent.

All functions in the modular group act discontinuously; that is, their actions move triangular regions onto other regions all in one jump. The ability to produce transformations that seem continuous is due to the following fact.

Every element of the modular group is conjugate equivalent to one of three Euclidean transformations, namely a rotation about the origin, a scaling from the origin or a rigid translation of the entire plane ([2], pp. 12–20).

These Euclidean transformations have very simple continuous forms:

Rotation: R_t(z) = e^{iθt} z, 0 ≤ t ≤ 1.

Scaling: S_t(z) = k^t z, 0 ≤ t ≤ 1.

Translation: T_t(z) = z + t b, 0 ≤ t ≤ 1.
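These continuous forms can be sketched in Python, with the rotation angle θ, the scaling factor k and the translation step b as parameters (our naming); at t = 0 each family is the identity, and at t = 1 it is the full transformation:

```python
import cmath

def rotation(t, theta, z):
    """Euclidean rotation about the origin through the angle t*theta."""
    return cmath.exp(1j * t * theta) * z

def scaling(t, k, z):
    """Euclidean scaling from the origin by the factor k**t (k > 0)."""
    return (k ** t) * z

def translation(t, b, z):
    """Rigid translation of the plane by t*b."""
    return z + t * b

z = 1.0 + 1.0j
print(rotation(0, cmath.pi, z))  # t = 0: the identity
print(rotation(1, cmath.pi, z))  # t = 1: a full half-turn, -1 - 1j
```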

The left-hand side of Figure 12 shows the fixed point of the order-three element translated to the origin. Following this transformation, all circles that passed through the original fixed point become straight lines passing through the origin. A Euclidean rotation about the origin accomplishes the desired rearrangement of the regions. Finally, translating the fixed points back to their original positions maps the fundamental regions to their proper, final positions. We see that the final results are the same for the right and left animations.

Indeed, each frame in the right-hand animation was computed by composing the functions that are explicitly portrayed in the left-hand animation.

**Figure 12. **Conjugation with rotation of 120°.

In this way, the action of the hyperbolic motions can be animated as continuous because the Euclidean rotations, translations and dilations can all be coded as continuous functions.
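As a concrete instance of this conjugation trick, the order-two element z ↦ −1/z fixes i, and the Cayley-type map c(z) = (z − i)/(z + i) carries i to the origin of the unit disk (this particular conjugating map is our choice for illustration; the article's construction may differ). Rotating by the angle πt in the disk and mapping back gives a continuous motion that is the identity at t = 0 and lands exactly on −1/z at t = 1:

```python
import cmath

def c(z):
    """Map the upper half-plane to the unit disk, sending i to 0."""
    return (z - 1j) / (z + 1j)

def c_inv(w):
    """Inverse of c, mapping the unit disk back to the upper half-plane."""
    return 1j * (1 + w) / (1 - w)

def half_turn(t, z):
    """Continuous hyperbolic rotation about i: identity at t = 0,
    the modular-group element z -> -1/z at t = 1."""
    return c_inv(cmath.exp(1j * cmath.pi * t) * c(z))

z = 2.0 + 1.0j
print(half_turn(0.0, z))  # the original point
print(half_turn(1.0, z))  # equals -1/z
print(-1 / z)
```

Sampling t between 0 and 1 yields the intermediate frames of an animation like Figure 6.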

[1] P. McCreary, T. J. Murphy and C. Carter, “The Modular Group,” The Mathematica Journal, 9(3), 2005. www.mathematica-journal.com/issue/v9i3.

[2] L. Ford, Automorphic Functions, New York: McGraw-Hill, 1929.

[3] N. Levinson and R. M. Redheffer, Complex Variables, San Francisco: Holden-Day, 1970.

[4] E. W. Weisstein. “Hyperbolic Metric” from Wolfram MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/HyperbolicMetric.html.

P. R. McCreary, T. J. Murphy and C. Carter, “The Modular Group,” The Mathematica Journal, 2018. dx.doi.org/doi:10.3888/tmj.20-3.

**Paul R. McCreary**

*The Evergreen State College-Tacoma
Tacoma, WA*

**Teri Jo Murphy**

*Department of Mathematics & Statistics
Northern Kentucky University
Highland Heights, KY*

**Christan Carter**

*Department of Mathematics
Xavier University of Louisiana
New Orleans, LA*

We propose and implement an algorithm for solving an overdetermined system of partial differential equations in one unknown. Our approach relies on the Bour–Mayer method to determine compatibility conditions via Jacobi–Mayer brackets. We solve compatible systems recursively by imitating what one would do with pen and paper: Solve one equation, substitute its solution into the remaining equations, and iterate the process until the equations of the system are exhausted. The method we employ for assessing the consistency of the underlying system differs from the traditional use of differential Gröbner bases, yet seems more efficient and straightforward to implement.

The search for solutions of many problems leads to overdetermined systems of partial differential equations (PDEs). These problems include the computation of discrete symmetries of differential equations [1], the calculation of differential invariants [2] and the determination of generalized Casimir operators of a finite-dimensional Lie algebra [3]. In this article, we focus solely on the integration of simultaneous systems of scalar first-order PDEs; that is, our systems have at least two equations, one dependent variable (the unknown function) and several independent variables. Our ultimate goal is to automate the search for general symbolic solutions of these systems. The approach we adopt uses the Bour–Mayer method [4] to find compatibility conditions (i.e., obstructions to integrability) of the underlying system of PDEs and to iteratively prepend these compatibility conditions to the system until a consistent or an inconsistent system is found. This differs from the traditional approach, which uses differential Gröbner bases [5] to discover compatibility conditions. When applicable, it has the advantage of being easy to implement and efficient. Recently, using machinery from differential geometry, Kruglikov and Lychagin [6] extended the Bour–Mayer method to systems of PDEs in several dependent and independent variables of mixed orders (i.e., the orders of the individual equations in the system can be different). In our approach, when the completion process leads to a consistent system, we solve the latter by imitating what one would do with pen and paper: Solve one equation, substitute its solution into the next equation, and continue the process until the equations of the system are exhausted.

To fix ideas, consider a system of PDEs

F_i(x_1, …, x_n, u, p_1, …, p_n) = 0,  i = 1, …, m,  (1)

where x_1 to x_n are the independent variables, p_i is the partial derivative of the unknown function u with respect to x_i, and the rank of the Jacobian matrix of the F_i with respect to the p_j is m. In the sequel, we will say that a property holds locally if it is true on an open ball of its domain of validity. The system of equations (1) is integrable (i.e., admits a locally smooth solution) provided the expressions p_1 to p_n derived from it locally satisfy the conditions

∂p_i/∂x_j = ∂p_j/∂x_i,  i, j = 1, …, n.  (2)

To see this, consider a solution u of the system of equations (1). Then, locally, du = p_1 dx_1 + ⋯ + p_n dx_n. Thus, the latter differential form is locally exact. So, in particular, it is locally closed. Therefore, its exterior differential vanishes; that is, d(p_1 dx_1 + ⋯ + p_n dx_n) = 0, or equivalently, after some calculations, the sum of the terms (∂p_j/∂x_i − ∂p_i/∂x_j) dx_i ∧ dx_j over i < j vanishes, which implies (2). Conversely, if the system of equations (2) is locally satisfied, then the differential form p_1 dx_1 + ⋯ + p_n dx_n is locally closed and by Poincaré’s lemma, it is also locally exact. Hence, p_1 dx_1 + ⋯ + p_n dx_n = dv for some locally smooth function v. Therefore u is locally defined by u = v + C, where C is an arbitrary constant.
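The closedness condition (2) is easy to test numerically for an explicit system. Below is a small Python sketch (our own helper, using central finite differences) for two independent variables, where the system has been solved for p₁ = u_x and p₂ = u_y: the pair (p₁, p₂) = (y, x) is compatible (with solution u = x y + C), while (y, −x) is not:

```python
def compatible(p1, p2, x, y, h=1e-5, tol=1e-6):
    """Check d(p1)/dy == d(p2)/dx at (x, y) by central differences.

    p1 and p2 are the right-hand sides of u_x = p1(x, y), u_y = p2(x, y).
    """
    dp1_dy = (p1(x, y + h) - p1(x, y - h)) / (2 * h)
    dp2_dx = (p2(x + h, y) - p2(x - h, y)) / (2 * h)
    return abs(dp1_dy - dp2_dx) < tol

print(compatible(lambda x, y: y, lambda x, y: x, 0.3, 0.7))   # True
print(compatible(lambda x, y: y, lambda x, y: -x, 0.3, 0.7))  # False
```

This is only a pointwise numerical check; the article's implementation works symbolically, so its conditions hold identically rather than at sample points.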

Bour and Mayer (see e.g. [4]) showed that (1), subject to the stated condition on the Jacobian matrix of the F_i with respect to the p_j, is integrable if and only if the Jacobi–Mayer brackets satisfy

[F_i, F_j] = Σ_{k=1}^{n} [ (∂F_i/∂p_k)(∂F_j/∂x_k + p_k ∂F_j/∂u) − (∂F_j/∂p_k)(∂F_i/∂x_k + p_k ∂F_i/∂u) ] = 0  (3)

whenever (1) is satisfied. From now on, abbreviate the phrase “[F_i, F_j] = 0 whenever (1) is satisfied” to “[F_i, F_j] vanishes on (1).”

For a given system of equations (1) satisfying the nondegeneracy condition mentioned, four cases arise.

The first case is when m = n and all the Jacobi–Mayer brackets vanish on (1). In this case, we can solve (1) for p_1 to p_n. The solution of the system is then obtained by integrating the exact differential form p_1 dx_1 + ⋯ + p_n dx_n.

The second case is when there are distinct indices i and j such that the bracket [F_i, F_j] reduces on (1) to an expression that cannot vanish. Then (1) is incompatible and there are no solutions.

In the third case, m < n and all the Jacobi–Mayer brackets vanish on (1). We must supplement (1) with additional equations until we reach the first or second case. These equations are obtained by solving an auxiliary system of linear first-order PDEs; each of its solutions, set equal to an arbitrary constant, supplies one additional equation compatible with (1). The solution of the completed system depends on several arbitrary constants. We obtain the general solution of the initial system of equations (1) by expressing one of the arbitrary constants as a function of the remaining ones, then eliminating the remaining constants between the resulting equations and their first-order partial derivatives with respect to the arbitrary constants.

In the fourth and final case, some brackets are zero on (1) and the other brackets involve expressions that depend at least on some of the p_k. In this case, we must prepend the equations obtained by setting these brackets to zero to the equations in (1) and proceed as in the third case.

The procedure just described is the essence of the Bour–Mayer approach to the solution of (1). One has to solve overdetermined systems of linear scalar PDEs and ensure that the equations one adds to the initial system are compatible with them and that the equations of the resulting systems are linearly independent. In our implementation of the Bour–Mayer approach, we complete the initial system of equations (1) by prepending to it the appropriate compatibility constraints prescribed by the Jacobi–Mayer brackets until we obtain either a compatible or an incompatible system. Starting from the compatibility constraints, we then iteratively solve the resulting compatible system. The remainder of this article is devoted to the implementation and testing of this approach.

Here we focus on the coding of the algorithm described in the introduction. Specifically, we start by iteratively solving a system of consistent first-order PDEs in one dependent variable. Then we implement the test of consistency of a system of first-order PDEs in one unknown. Finally, we couple the last two programs in such a way that a single function is used to compute the general solution of the input system when it exists or to indicate that it is inconsistent.

Our program for iteratively solving a compatible system of scalar first-order PDEs is made of a main function and three helper functions.

The main function is recursive. It takes as input the system to be solved, the dependent variable, the list of independent variables, a container for the list of successive solutions, a list of equations that could not be solved, a string that is used as a root to form the names of intermediate dependent variables, and a variable that is used to count and name intermediate dependent variables. Its output is a list of rules and a list of unsolved equations.

The function mimics what one would do by hand when solving a system of first-order PDEs in one unknown: Solve an equation, substitute its solution into the remaining equations, and continue as long as possible. At each stage, the number of independent variables is reduced by one and it is necessary to rename the variables before proceeding. Also, the dependent variables are curried functions that must be uncurried to ensure that the chain rule is applied properly during substitution into the remaining PDEs. This is perhaps the trickiest part of our implementation.

The next function takes the output of the main function and converts it into the solution of the system to be solved. A helper function converts an expression depending on several variables into a pure function of these variables. Finally, a top-level function composes the previous two to solve a compatible system of scalar PDEs. Its inputs and output are formatted like those of the main function.

This subsection implements the compatibility test provided by the Bour–Mayer method as described in the introduction using . The input to is the underlying system of PDEs , the dependent variable and the list of independent variables ; the output of is a pair: the first element indicates whether the system is compatible and the second gives the completed system.

The function computes the pairwise Jacobi–Mayer brackets of a system of PDEs according to equation (3) and, in these brackets, replaces first-order partial derivatives of the unknown function by expressions obtained from the underlying system of PDEs. The function checks whether an expression contains a derivative of the unknown function.

Here we use the functions defined so far to solve an overdetermined system of first-order PDEs in one unknown. The function takes as arguments the system to be solved, , and its dependent and independent variables, and . The function verifies whether a given rule gives a solution of a system of first-order PDEs in one unknown.

This subsection is chiefly concerned with examples taken from various sources. For convenience, warnings are suppressed with the built-in function . Undefined global variables (, , , etc.) are used, so make sure there are no conflicting definitions left over from your own session.

The examples presented here arise in the search for differential invariants of hyperbolic PDEs [2].

Except for example 9, gives for all systems, so this output is shown only once here.

Examples 5 and 6 come from Saltykow [7].

The two systems of PDEs treated here are in Mansion [4].

The second entry of shows that there are two PDEs that were not solved. It is straightforward to separate these PDEs with respect to and to obtain new PDEs that are easily solved using the built-in function . The separation can be done automatically through the following one-liner.

The last example is due to Boole [8].

This article has introduced and implemented an algorithm based on the Bour–Mayer method for solving an overdetermined system of PDEs in one unknown. We have demonstrated the effectiveness of our approach on 13 examples.

I gratefully acknowledge partial financial support from the DST-NRF Centre of Excellence in Mathematical and Statistical Sciences, School of Computer Science and Applied Mathematics, University of the Witwatersrand, Johannesburg, South Africa. I thank Prof. F. M. Mahomed for securing the necessary funds and his team for the hospitality during my visit last summer. This article is dedicated to my daughter Katlego on her sixteenth birthday.

[1] | P. E. Hydon, “How to Construct the Discrete Symmetries of Partial Differential Equations,” European Journal of Applied Mathematics, 11(5), 2000 pp. 515–527. |

[2] | I. K. Johnpillai, F. M. Mahomed and C. Wafo Soh, “Basis of Joint Invariants for () Linear Hyperbolic Equations,” Journal of Nonlinear Mathematical Physics, 9(Supplement 2), 2002 pp. 49–59. doi:10.2991/jnmp.2002.9.s2.5. |

[3] | J. C. Ndogmo and P. Winternitz, “Generalized Casimir Operators of Solvable Lie Algebras with Abelian Nilradicals,” Journal of Physics A: Mathematical and General, 27(8), 1994 pp. 2787–2800. iopscience.iop.org/article/10.1088/0305-4470/27/8/016/meta. |

[4] | P. Mansion, Théorie des équations aux dérivées partielles du premier ordre, Paris: Gauthier-Villars, 1875. |

[5] | B. Buchberger and F. Winkler, Gröbner Bases and Applications, Cambridge: Cambridge University Press, 1998. |

[6] | B. Kruglikov and V. Lychagin, “Compatibility, Multi-brackets and Integrability of Systems of PDEs,” Acta Applicandæ Mathematicæ, 109(1), 2010 pp. 151–196. doi:10.1007/s10440-009-9446-0. |

[7] | N. Saltykow, “Méthodes classiques d’intégration des équations aux dérivées partielles du premier ordre à une fonction inconnue,” Mémorial des sciences mathématiques, 50, 1931 pp. 1–72. www.numdam.org/item?id=MSM_1931__50__1_0. |

[8] | G. Boole, “On Simultaneous Differential Equations of the First Order in Which the Number of the Variables Exceeds by More Than One the Number of the Equations,” Philosophical Transactions of the Royal Society of London, 152(5), 1862 pp. 437–454. doi:10.1098/rstl.1862.0023. |

C. W. Soh, “Symbolic Solutions of Simultaneous First-Order PDEs in One Unknown,” The Mathematica Journal, 2018. dx.doi.org/doi:10.3888/tmj.20-2.

Dr. C. Wafo Soh is currently an associate professor of mathematics at Jackson State University and a visiting associate professor of applied mathematics at the University of the Witwatersrand. He is the cofounder of the South African startup Recursive Thinking Consulting, which specializes in process mining.

**Célestin Wafo Soh**

*Department of Mathematics and Statistical Science
JSU Box 1760, Jackson State University
1400 JR Lynch Street
Jackson, MS 39217*

*DST-NRF Centre of Excellence in Mathematical and Statistical Sciences
School of Computer Science and Applied Mathematics, University of the Witwatersrand
Johannesburg, Wits 2050, South Africa*

We simultaneously introduce effective techniques for solving Sudoku puzzles and explain how to implement them in Mathematica. The hardest puzzles require some guessing, and we include a simple backtracking technique that solves even the hardest puzzles. The programming skills required are kept at a minimum.

Sudoku, for those unfamiliar with the puzzle, consists of a 9×9 square grid divided into nine 3×3 subgrids. The 81 entries are to be filled with the integers 1 to 9 in such a way that each row, column and subgrid contains all nine integers. Some of the entries are already given, and the final solution must retain these initial entries. Here is a sample puzzle.

The input for this puzzle is a list of nine lists consisting of blanks (shown as □) or integers between 1 and 9. A list of lists of the same length is regarded as a matrix in Mathematica, so we input for the puzzle and then show it in matrix form.

We can also display this in Sudoku format by drawing column and row lines and a frame.

In attempting a solution, a blank gets replaced with a list of candidate entries shown compactly without braces or commas.

Each element of a Sudoku matrix is obtainable as , where and are the row and column of the element in the matrix.

To obtain the entire row in `X` that contains the entry , we evaluate ; so, for example, this gives row 3, which contains element .

To obtain the column in `X` that contains is a little trickier. One could first “transpose” the matrix , that is, interchange rows and columns and then find row of the transposed matrix; the command for transposing a matrix is . However, Mathematica has a faster way using the option . We just enter to get column of `X`. For example, the column that contains , that is, column 5 of , can be obtained by entering .

The function displays that list vertically.

It is more difficult to obtain the block to which belongs. To do this we define a function that gives a list of the entries that comprise the block of in `X`.

For example, this is the block containing , the sixth entry in row 1, □.

To get a single list of these entries by removing the inner parentheses, we use .
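The article's helper is written in Mathematica and is not reproduced here; as an illustration, this is a minimal Python sketch of the same block-extraction idea (the name `block` and the 1-based indexing convention are assumptions of this sketch, not the article's code):

```python
def block(i, j, X):
    """Return, as a flat list, the nine entries of the 3x3 block of the
    9x9 matrix X that contains position (i, j) (1-based indices)."""
    r0 = 3 * ((i - 1) // 3)  # topmost row of the block (0-based)
    c0 = 3 * ((j - 1) // 3)  # leftmost column of the block (0-based)
    return [X[r][c] for r in range(r0, r0 + 3) for c in range(c0, c0 + 3)]
```

For the entry in row 1, column 6, this returns the nine entries of the upper-middle block.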

Our first step is to replace each □ in (using ) by the list of the nine numbers, , which are possible candidates to occupy that position in .

Our next task is to start eliminating candidate values in the entries that are lists of numbers in `X`, proceeding one entry at a time. We start with in order to be able to redefine entries.

Since no entry can appear more than once in any row, column or block, we let be the set of integers in the row, column and block containing .

If the entry is a list rather than an integer, we redefine by removing the entries that also belong to .

Finally, if is a list of one element, we redefine it to be that element.
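In Python, one pass of this elimination step might look like the following sketch (illustrative only; `eliminate_once` is a hypothetical name, and the article's actual Mathematica code is not shown here):

```python
def eliminate_once(X):
    """One candidate-elimination pass over a 9x9 grid X whose entries are
    either integers (solved cells) or lists of remaining candidates.
    Candidates already used in the same row, column or block are removed,
    and a singleton candidate list is replaced by its element."""
    for i in range(9):
        for j in range(9):
            e = X[i][j]
            if isinstance(e, list):
                r0, c0 = 3 * (i // 3), 3 * (j // 3)  # block origin
                blk = [X[r][c] for r in range(r0, r0 + 3)
                       for c in range(c0, c0 + 3)]
                col = [X[r][j] for r in range(9)]
                used = {v for v in X[i] + col + blk if isinstance(v, int)}
                cand = [v for v in e if v not in used]
                X[i][j] = cand[0] if len(cand) == 1 else cand
    return X
```

Iterating this pass until the grid stops changing mirrors the fixed-point iteration used in the article.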

To apply again and again to until the result no longer changes, we use . The puzzle simplifies, but we see that we are still not done!

However, the first block has three entries (colored red) that are all sublists of .

While we do not know the exact value of any of the red entries, we know that the three numbers 5, 6 and 8 will be used up filling them; thus we can remove 5, 6 and 8 from the *other* entries in this block (colored green).

Similarly, in the first row, there are three entries that are sublists of , so we remove 5, 6 and 8 from at the end of row 1; this defines .

Then we use again and display the result. We are done! We explore this technique further in the next section.

If any row, column or block contains the pair twice, both and must be used up in the two entries containing , even though we do not know which pair contains and which contains . Hence, no other entry in that row, column or block can contain either or . This obvious fact is surprisingly useful in solving Sudoku puzzles.

To use it, we define the function .

1. We select the set of pairs (the lists of length two).

2. The twins are the identical pairs.

3. The numbers in the twins are the numbers to prune.

4. The lists are pruned.

5. Any singleton list is changed to its element.
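The five steps above can be sketched in Python as follows (a hedged illustration; `prune_twins` is a hypothetical name, and the function acts on one row whose entries are integers or candidate lists):

```python
def prune_twins(row):
    """If an identical candidate pair occurs exactly twice in the row,
    remove its two numbers from every other candidate list; any resulting
    singleton list becomes its element."""
    pairs = [tuple(e) for e in row if isinstance(e, list) and len(e) == 2]
    twins = {p for p in pairs if pairs.count(p) == 2}
    out = []
    for e in row:
        if isinstance(e, list) and tuple(e) not in twins:
            e = [v for v in e if not any(v in t for t in twins)]
            e = e[0] if len(e) == 1 else e
        out.append(e)
    return out
```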

Here are some examples using , which was defined before. The twins in this row are and .

Hence 5 and 8 are removed from the other lists in the row.

The twins in row 3 are and .

We can map over all the rows of a matrix.

We now use starting with until the result does not change.

We are done. It was only necessary to use on the rows.

It is easy to apply on the columns: transpose, apply to the rows, and transpose back.

The blocks are more complicated. We make use of a general theorem: transform an matrix by taking the elements of each block in order as the rows of a new matrix ; then (i.e. is an involution). Here stands for block transpose.

This verifies the theorem in the case.
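The block-transpose operation and its involution property can also be checked with a short Python sketch (the name `block_transpose` is an assumption of this illustration):

```python
def block_transpose(X):
    """For a 9x9 matrix X, take the nine entries of each 3x3 block
    (blocks and entries both in row-major order) as the rows of a new
    9x9 matrix.  Applying the operation twice restores the original."""
    return [[X[3 * (b // 3) + k // 3][3 * (b % 3) + k % 3]
             for k in range(9)] for b in range(9)]
```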

To construct the new kind of transposed matrix, we define the function .

Here is the transformed matrix .

Finally, we look at the matrix .

It is the same as the original matrix ; here is a direct check that they are equal.

It is a common technique in problem solving to first transform the problem, solve the transformed problem and then transform back. As an example, we apply , followed by , followed by , to the matrix defined earlier; blocks , and change.

This makes a function out of that line of code.

The function puts together the discussion so far.

We apply it to the matrix and get a solution as before.

We generalize the function to to deal with triples and quadruplets as well as twins.

Just as with , we want to use on rows, columns and blocks of a matrix and then combine them in .

We had already solved with , but let us apply for triples as a check.

combines the three solvers into one.

Consider the puzzle .

Unfortunately, does not solve the puzzle.

However, there are entries that are pairs.

We propose to replace the pair by 1 and to try to solve the modified puzzle; if that leads to a contradiction, then .

We introduce the functions and via the helper function . If there are any pairs in `X`, replaces the first such pair with its left entry and applies ; the function replaces with its right entry .

The two blank entries indicate a contradiction.

Therefore, the alternative must solve the puzzle, and it does.

We have just seen that guessing between two alternatives quickly led us to a solution. However, if a solution is not obtained with the first alternative, it might be necessary to guess again between two alternatives, and so on. If there are always just two alternatives, this leads to a binary tree with the root at the top.

It is not clear how many levels or guesses are needed before reaching a solution. Also, it may not be necessary to generate the entire tree before a solution is reached. There is a systematic and efficient way to search such a tree, usually referred to as backtracking.

Here is the method: start at the root and go left as long as there is no contradiction. If there is a contradiction, go back one level and go right. Then resume going left as far as possible. If there is a contradiction after going right, go back through all the right branches traversed so far; then go back through an additional left branch and go right.

For example, assume that contradictions exist at all nodes on level 4 except for the last one, node 15. The labels in the following tree indicate how to backtrack to the solution at node 15.

The binary choice in the Sudoku situation is to go either left or right. A path through the tree corresponds to a sequence of such choices; for example, the path (1, 2, 6, 8) generates the sequence: , from which a composition of functions can be built.

Here is how the built-in function works with undefined functions.

This example shows a clear contradiction, since there are two blank entries.

The function , when given a nonempty sequence of and functions, drops the last in the sequence (if any) until there are none, drops the last , and finally appends a .

Here is an example.
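A Python rendition of this sequence-rewriting step might look like the following (a hedged sketch; the strings "L" and "R" stand in for the article's left and right choice functions):

```python
def backstep(seq):
    """Compute the next choice sequence after a contradiction: drop any
    trailing 'R's, drop the final 'L', then append an 'R'."""
    while seq and seq[-1] == "R":
        seq = seq[:-1]
    return seq[:-1] + ["R"]
```

For example, after the failing path L, L, R, the next sequence to try is L, R.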

In the next two functions, this kind of code tests for a list of lists of integers, a necessary condition for a Sudoku solution.

The function goes left if applying the sequence (a global variable with entries and ) to the matrix with does not contain an empty list ; otherwise it backtracks using . If the new sequence applied to the matrix contains only numbers, throws the matrix to the nearest containing .

We now use inside the function , which initializes the global variable .

Consider the following Sudoku puzzle.

We have failed so far to solve the puzzle using ; so we try the backtracking technique.

To see what sequence solved this puzzle, we only have to enter .

We next try on defined in the previous section.

If we now enter , we can see what sequence solved this puzzle.

Next is Evil Puzzle 8,076,199,743 from Web Sudoku [1].

Again, the solving sequence is given by .

This final puzzle was created by Arto Inkala, a mathematician based in Finland; it is claimed to be the world’s hardest Sudoku puzzle [2].

Here is how this puzzle was solved.

There are many other techniques known to experienced Sudoku solvers that could be added to our programs; also backtracking could obviously be extended to triples, and so on.

Sudoku provides a superb opportunity to introduce useful programming techniques to students of Mathematica. Backtracking is one such technique that is largely absent from standard discussions of Mathematica programming but, as we have shown, is easily implemented in Mathematica when needed.

[1] | Web Sudoku. (Jan 18, 2018) www.websudoku.com. |

[2] | Efamol. “Introducing the World’s Hardest Sudoku.” (Jan 18, 2018) www.efamol.com/efamol-news/news-item.php?id=43. |

R. Cowen, “A Beginner’s Guide to Solving Sudoku Puzzles by Computer,” The Mathematica Journal, 2018. dx.doi.org/doi:10.3888/tmj.20-1.

Robert Cowen is Professor Emeritus in Mathematics, Queens College, CUNY. He does research in logic, combinatorics and set theory. He taught a course in Mathematica programming for many years, emphasizing discovery of mathematics, and is currently working on a text on learning Mathematica through discovery with John Kennedy. His website is sites.google.com/site/robertcowen.

**Robert Cowen**

*16422 75th Avenue
Fresh Meadows, NY 11366*

Rubik’s cube has a natural extension to four-dimensional space. This article constructs the basic concepts of the puzzle and implements it in a program. The well-known three-dimensional Rubik’s cube consists of 27 unit subcubes. Each face of determines a set of nine subcubes that have a face in the same plane as . The set can be rotated around the normal through the center of . Rubik’s 4-cube (or 4D hypercube) consists of 81 unit 4-subcubes, each containing eight 3D subcubes. Each 3-face of determines a set of 27 4-subcubes that have a cube in the same hyperplane as . The set can be rotated around the normal (a plane) through the center of . Projecting the whole 4D configuration to 3D exhibits Rubik’s 4-cube as a four-dimensional extension of Rubik’s cube. Starting from a random coloring of the 4-cube, the goal of the puzzle is to return to the initial coloring of the 3-faces.

To understand the 4D hypercube, it helps to first see how its lower-dimensional analogs relate to each other. The zero-dimensional hypercube (or 0-cube) is a point, with one vertex. The 1D hypercube (or 1-cube) is a segment, with two vertices and one edge. The 2D hypercube (or 2-cube) is a square, with four vertices, four edges and one face (the square including its interior). The 3D hypercube is a cube (or 3-cube), with eight vertices, twelve edges, six square faces and one volume. Going up a dimension doubles the number of vertices. More generally, the number of -cubes (points, segments, squares, …) in an -cube, , is .
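The standard count of k-dimensional faces of an n-cube, 2^(n−k)·C(n, k), can be checked with a few lines of Python (a verification sketch added for illustration, not part of the original article):

```python
from math import comb

def faces(n, k):
    """Number of k-dimensional faces of an n-cube: choose the k coordinates
    that vary, then fix each remaining coordinate at one of two values."""
    return 2 ** (n - k) * comb(n, k)
```

For n = 3 this gives 8 vertices, 12 edges and 6 faces; for n = 4 it gives 16 vertices, 32 edges, 24 squares and 8 cells.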

The 3D cube can be represented in a 2D plane using central projection, defined by taking the intersection of the plane with the line joining the two points and . This projection maps the point to . Choose to obtain the projection shown on the right in Figure 1. Five of the faces overlap with the sixth face, the price to pay for the loss of one dimension.

**Figure 1.** A cube and its image under a central projection.

Overall, the 4D Rubik puzzle is a 4-cube [1] (or 4D hypercube or tesseract), with 16 vertices, 32 edges, 24 squares, eight cubes and one 4-cube. The eight cubes are called *cells*, which are like the six square faces of a 3D cube. The proper faces of the 4-cube are its vertices, edges, squares and cells.

Each point of a proper face is on the 3D hypersurface of the 4-cube. No point of a proper face is strictly in the interior of the 4-cube; that is, a small hypersphere centered at such a point contains points both inside and outside the 4-cube. In particular, no interior point of a cell as a 3D object is in the interior of the hypercube; all the points of a cell are on the boundary of the 4-cube.

The 16 vertices of a 4-cube can be defined as lists of length four of all possible combinations of and .

The 24 squares of the 4-cube are described in terms of their vertex indices.

Besides the 4-cube, there are five other regular polytopes in four dimensions. The .csv and .m files containing information for these polytopes are provided at [2]: the positions of the vertices, vertex indices for the proper faces and which faces are neighbors.

To display the 4-cube in 3D, central projection from 4D to 3D is analogous to central projection from 3D to 2D; the function is the natural extension of ; see Figure 2.

**Figure 2.** Projected image of a 4-cube by means of center projection. The larger outer cube is one of the cells of the 4-cube.

An axis of rotation in 3D is a fixed line; in 4D, an axis of rotation is a fixed plane [3]. For example, the rotation matrix about the - plane is defined by:

(1) | R(θ) = ( (1, 0, 0, 0), (0, 1, 0, 0), (0, 0, cos θ, −sin θ), (0, 0, sin θ, cos θ) )

There are six planes of rotation spanned by pairs of coordinate axes, namely -, -, -, -, -, -.

Here is the first one, for example, which leaves points in the - plane fixed.
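As an illustration (pure Python written for this exposition, not taken from the article), a rotation that fixes the x-y coordinate plane acts only on the remaining two coordinates:

```python
from math import cos, sin, pi

def rot_xy(theta):
    """4x4 rotation matrix fixing the x-y plane and rotating the z-w plane."""
    c, s = cos(theta), sin(theta)
    return [[1, 0, 0, 0],
            [0, 1, 0, 0],
            [0, 0, c, -s],
            [0, 0, s, c]]

def apply(M, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(M[i][j] * v[j] for j in range(4)) for i in range(4)]
```

Points in the x-y plane, such as (1, 2, 0, 0), are left fixed for every angle.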

This animation shows two successive rotations of the 4-cube projected to 3D.

Consider a 4-cube with center at the origin , side length 3, and with all proper faces of positive dimension parallel to the coordinate axes. Then its 16 vertices are:

(2) | (±3/2, ±3/2, ±3/2, ±3/2)

The eight cells of the initial 4-cube are colored differently. The word “initial” means that no rotations have been applied. The coloring touches every point of a cell, including its 3D interior points.

Just as the faces of Rubik’s cube are divided into nine squares by dividing each edge into three, the edges of Rubik’s 4-cube are also divided into three. Then the initial 4-cube is divided into 81 small 4-cubes, each with edge length 1. The boundary (a hypersurface) of a small 4-cube contains eight small cubes, its cells.

The Rubik 3-cube has 27 subcubes in ; no square of the center cube is colored and some squares of other cubes are colored. These 26 subcubes are classified into three types according to whether they are at a corner, at an edge or at the center of a face of the larger cube. Figure 3 shows one of each type.

**Figure 3.** Three types of small cubes: in the center of a square face, at an edge and at a vertex, with one, two or three colored squares.

Analogously, the 81 small 4-cubes of include the uncolored one at the center and 80 partially colored small 4-cubes. These are classified into four types according to the dimension of their intersection with . The type of a small 4-cube does not change after rotation. Table 1 summarizes the numbers for each type for and .

**Table 1.** The numbers of each type of small cube or cell.

The number of colored small squares for Rubik’s cube is calculated using the data in Table 1:

Another way is to count the number of faces times the number of squares per face: 6 × 9 = 54.

The small 4-cubes whose centers have at least one nonzero coordinate form the hypersurface of . In particular, a small 4-cube whose center has four nonzero coordinates contains a vertex of . Again from Table 1, the number of colored small cells is:

This number can also be obtained as the number of cells of times the number of small cells per cell of : 8 × 27 = 216.
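These counts can be checked directly by enumerating the centers of the 81 small 4-cubes, whose coordinates lie in {−1, 0, 1}; each nonzero coordinate of a center contributes one outward-facing colored cell (a verification sketch added for illustration):

```python
from itertools import product

centers = list(product([-1, 0, 1], repeat=4))           # 81 small 4-cubes
colored = sum(sum(1 for x in c if x != 0) for c in centers)
corner = sum(1 for c in centers if all(x != 0 for x in c))
```

This gives 216 colored cells in total and 16 corner 4-cubes, one per vertex of the hypercube.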

We define several global variables to be used here and later. Figure 4 shows the divided 4-cube with 216 colored small cells.

**Figure 4.** Center projection of a hypercube consisting of 216 colored small cells.

Each edge is divided into three parts, so that the edge length of the 4-subcubes is 1. Consider a 4-subcube with center at . Then the vertices of the 4-subcube are . Each coordinate of the center of a 4-subcube is −1, 0 or 1. When a coordinate is nonzero ( or ), the 4-subcube faces outward in the corresponding direction. In other words, the nonzero center coordinates mark the outward-facing 4-subcubes.

For the initial state, the colors of the cells are set according to the coordinates of their centers:

For example, in the small 4-cube with center , the two small cells with vertices and are colored because both the and coordinate values are nonzero.

The geometry of the 216 small colored cells is used to manage the puzzle. Each element of the datasets consists of four elements: (1) the vertices of six squares; (2) the location of the center of the small 4-cube to which the small cell belongs; (3) color; and (4) the location of the center of the small cell. The vertices of the six squares are used for drawing the subcubes, and the locations of the centers of the subcubes are used to judge the completeness of the puzzle. The dataset of the initial state is obtained by the following procedures. First, the vertex numbers of the squares making up each small cell are defined.

Next, the 216 small cells are selected by checking all possible small cells.

This list contains 216 entries and each entry contains four components corresponding to a small cell.

For example, here is entry 123 of . The components for this small cell are its six square faces, center, color and current position.

sets up a 4-cube for drawing.

Here is an example.

sets up a cell (with 27 4-cubes) for drawing.

Figure 5 shows the initial state of Rubik’s 4-cube.

**Figure 5.** Center projection of initial state of the Rubik 4-cube with its eight cells.

In the 3D case of Rubik’s cube , a *block* is a set of nine small cubes whose centers have one coordinate that is constant: −1, 0 or 1. There are nine blocks, three per coordinate axis. A natural technique to rotate a middle block, for example the one cut by the plane , is to rotate the block above by , the block below by , and then the whole cube by .

In the 4D case, a *block* is a set of 27 small 4-cubes whose centers have one coordinate that is constant. There are 12 blocks: four coordinate axes times three choices of the constant coordinate −1, 0 or 1. Under a rotation, the small 4-cubes in a block change position simultaneously. Each block is a four-dimensional *hyperprism* with height 1.

Figure 6 shows an example of the block ; the cell opposite the orange cell is not colored.

**Figure 6.** Example of a block of 27 small 4-cubes (orange, ).

A block is rotated by , or around an axis, which is a fixed plane. Therefore, the information needed for an action on the Rubik 4-cube is (1) the block to be rotated; (2) the axis of rotation; and (3) the angle. For Rubik’s cube , the axis of rotation is automatically determined by selecting a block. But for , two coordinate axes must be chosen to determine the fixed plane. One is the constant coordinate axis used to select the block, and the other must be chosen from the remaining three coordinate axes. There are 108 possible actions on : 12 choices of block, three choices for the second coordinate axis and three choices of angle: . Therefore, 108 buttons are required for the rotations in the Rubik 4-cube computer program. Table 2 lists the properties of Rubik’s cube and Rubik’s 4-cube.

The program to realize Rubik’s 4-cube in 3D relies on central projection of a hypercube and rotation matrices in 4D. The program is shown in the next section.

Implementing an interface consists of three parts: constructing the buttons for the rotations, displaying the current state and judging whether the puzzle is complete.

The buttons for the rotations are placed in grids. The player can rotate a block by clicking one of the buttons. The rows correspond to the selection of the coordinate axis of the block and the columns correspond to the coordinate values for that axis. The player selects a block by choosing one of the rows and one of the columns. For example, clicking the button where row crosses column chooses the block on . For each block, the other three axes are listed. The selection of a second axis is then required to specify the plane of rotation. Finally, one of the three buttons (up, diagonal and down) must be chosen to determine the rotation angle of (▲), (■) and (▼). (The 0 rows can be ignored; the player can perform an equivalent pair of actions instead in the parallel blocks.)

When the colors of the 27 subcubes on a cell are all the same, that cell is complete. The puzzle is solved when all the cells are complete.

Although we succeeded in implementing Rubik’s 4-cube, some problems remain to be addressed. We aimed for ease of implementation rather than efficiency. Therefore, in the future, we should consider enhancing the application to get a more effective visualization method and an intuitive interface.

The program redraws 1,296 squares after each rotation, so efficient coding is important. There is a great deal of redundancy in calculating the vertices of the 4-subcubes for each rotation. The most effective method of handling the vertices, using subcubes or 4-subcubes, remains to be clarified. Note that we must transfer the vertices of the 4-subcubes to those of the subcubes when we handle the vertices of 4-subcubes as a dataset rather than as subcubes.

Effective visualization is a common problem for four-dimensional geometry. In this article, we used central projection to represent 4-cubes. However, the proposed projection does not completely represent the features of the puzzle. Although there are other projections for representing a 4-cube, the most suitable method is not yet clear.

Another possible improvement would be to animate the rotation. The animation of the rotation of the colored small 4-cubes would help the player intuitively understand their rearrangement.

An intuitive interface is important for playing this puzzle. The interface and visualization issues are related, and their development may provide a new method for understanding four-dimensional space.

[1] | H. S. M. Coxeter, Introduction to Geometry, 2nd ed., Hoboken: Wiley, 1989. |

[2] | T. Yoshino. “Activities of Dr. Takashi Yoshino.” (Dec 11, 2017) takashiyoshino.random-walk.org/basic-data-of-4d-regular-polytopes. |

[3] | K. Miyazaki, M. Ishii and S. Yamaguchi, Science of Higher-Dimensional Shape and Symmetry, Kyoto: Kyoto University Press, 2005 (In Japanese). |

T. Yoshino, “Rubik’s 4-Cube,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-8.

Profession: Science of Form. Fields of interest: skeletal structure of plankton, non-Euclidean geometry, hyperspace, pattern formation.

**Takashi Yoshino**

*Toyo University
Kujirai 2100, Kawagoe, 350-8585, Japan*

Given a finite vertex set, one can construct every connected spanning hypergraph by first choosing a spanning hypertree, then choosing a blob on each of its edges.

If is a finite vertex set and is a collection of finite subsets (called edges), none of which is a subset of another, we recursively define the *swell* of , , to be the collection of all sets that either:

- belong to
- are the union of some pair of *overlapping* sets, both already belonging to

For example, if

then
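A direct Python transcription of this closure operation might read as follows (the name `swell` is hypothetical for this sketch, and a set system is represented as a collection of frozensets):

```python
def swell(H):
    """Close the set system H under unions of overlapping pairs: repeatedly
    add the union of any two already-present sets that intersect."""
    S = {frozenset(e) for e in H}
    changed = True
    while changed:
        changed = False
        for a in list(S):
            for b in list(S):
                if a & b and (a | b) not in S:
                    S.add(a | b)
                    changed = True
    return S
```

For instance, swelling {{1, 2}, {2, 3}, {4, 5}} adds only {1, 2, 3}, since {4, 5} overlaps nothing else.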

If we also have , then the set system is called a *clutter*. This condition means that (except in the case ) each edge contains at least two vertices, and the hypergraph spanned by the edge set is connected. Here the hypergraph spanned by a set of edges is defined to have vertex set and edge set . (There is no agreed-upon definition of “hypergraph.” For some authors it is any set system; for others it is a simplicial complex; for others it is an antichain of sets.)

Here is a larger clutter.

The number of clutters spanning vertices is given by A048143 (oeis.org/A048143).

This sequence varies as , so the number of digits required roughly doubles with each consecutive term. Our main example is just one of some 56 sextillion members of .

This normalizing function is a universal invariant for the species of labeled clutters, meaning two clutters are isomorphic iff they have the same image.

Here is a list of nonisomorphic representatives for all clutters with up to four vertices, corresponding to “unlabeled” clutters. This brute-force enumeration may not work for .

A *kernel* of is a clutter (the restriction of to edges that are subsets of ) for some .

Define .

A *set partition* is a set of disjoint sets with .

Suppose is a set partition of such that each block is a kernel of (i.e. and ). Since would imply , it follows that the set of unions is itself a clutter, which we call a *cap* of .

Equivalently, a cap of is a clutter satisfying both:

- every edge of is a subset of exactly one edge of

To see that this does *not* establish a partial order of clutters with a vertex set, observe that

is a nontransitive chain of caps. The following is the set of all set partitions of the edge set indices corresponding to each cap of the clutter.

In these plots of clutter partitions, the filled squares correspond to all pairs of a vertex and an edge such that the vertex belongs to the edge; these squares are then shaded according to which block of the partition the edge belongs to.

The *density* of a clutter is

where the sum is over all edges .

A clutter with two or more edges is a *tree* iff . This is equivalent to the usual definition of a spanning hypertree [1].

A clutter is a *blob* iff no cap of is a tree.

The trees and blobs among the caps and kernels (respectively) of our running example are as follows.

Suppose a clutter decomposes into a cap and corresponding set of kernels . Then

where the sum is over all . In particular, , and iff every is a tree. Using this simple identity, one easily proves the following.

**Lemma**

The following is also straightforward.

**Proposition**

We now come to the main result.

For our running example, the theorem corresponds to the following decomposition into a tree of blobs.

**Theorem**

**Proof**

First we show that any blob (kernel) is contained within a single branch of any tree (cap). Suppose that is a kernel of and is a blob, and that is a cap of that is a tree. Let be the subtree of contributing to the set partition of *non-empty* intersections for each branch . The set of unions forms a clutter that is obtained from by deleting in turn all vertices not in , a process that weakly decreases density. Let be the set partition comprised of maximal kernels (i.e. connected components) contained in blocks of . Then is a cap of and . Since is a connected clutter, we have , and therefore . But since is a blob, cannot be a tree, hence it must be a maximal cap (viz. , ).

Next we show that . If any two blobs overlap, both blobs must be contained entirely in whatever branch (of any given tree) contains their intersection. This implies that there is another blob containing their union, and hence that the maximal blobs are disjoint. Since every singleton is also a blob, we conclude that is a cap of .

Finally, if any kernel of were a blob, so would be the restriction of to its union, contradicting maximality of . This proves that the set of unions of is a tree. ■

The following are the decompositions for each nonisomorphic clutter with four vertices.

Let be the set of all kernels of . If is itself a (connected) clutter with vertex set , then there exists a unique subset-minimal upper bound satisfying both

- for all

In general, we can only define uniquely for a *connected* set of kernels, so is not strictly a join operation for the poset of subsets . But if is not connected as a clutter, then letting be its (maximal) connected components, we say that is a *connected set of kernels* iff is connected as a set of kernels, in which case the join is given by

In practice, the verification of connectedness and the computation of may require several iterations of constructing joins of connected components. For example, consider the following connected set of kernels.

It has the following sequence of joins of connected components.
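The merging of overlapping components that underlies these joins is a standard union-find-style computation. As an illustration (a plain Python sketch with hypothetical names, not the Wolfram Language code of this article), a collection of kernels, represented as vertex sets, can be grouped into maximal connected components as follows:

```python
def connected_components(kernels):
    """Group a collection of vertex sets into maximal connected
    components, where two sets are connected iff they intersect."""
    components = []  # each entry: (union of vertices, member kernels)
    for k in map(frozenset, kernels):
        merged_vertices, merged_members = set(k), [k]
        remaining = []
        for vertices, members in components:
            if vertices & merged_vertices:   # overlaps: absorb this component
                merged_vertices |= vertices
                merged_members += members
            else:
                remaining.append((vertices, members))
        components = remaining + [(merged_vertices, merged_members)]
    return [sorted(map(sorted, members)) for _, members in components]
```

A set of kernels is connected precisely when this returns a single component.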

**One Problem**

If is a connected set of kernels, we define its *compression* to be the number of iterations in the computation of by constructing consecutive joins of connected components. For the previous example, we have . Although it seems unlikely that is a bounded invariant, we do not know how to construct an example with compression greater than .

- For which positive numbers does there exist a connected set of kernels such that ?
- Does there exist an infinite chain of connected sets of kernels such that for all ?

Define an invariant by

for all , where the sum is over all clutter partitions . Here denotes the indicator function for a proposition , equal to 1 or 0 depending on whether is true or false, respectively.

**Theorem**

**Proof**

Let be the set of all clutter partitions . What we have essentially shown above is that , regarded as a subposet of the lattice of set partitions ordered by refinement, is a lattice. We have the simple enumerative identity

where the product is over all non-singleton kernels , here regarded as elements of whose only non-singleton block is . Expanding the right-hand side gives

where the sum is over all sets of non-singleton kernels , again regarded as lattice elements. Here is algorithmically the same operation as the connected-join operation on . Expanding and factoring accordingly, this becomes

where the outer sum is over all , the product is over all , and the inner sum is over all connected sets of kernels spanning . For any kernel , define

where the outer sum is over all clutter partitions , and where the product and inner sum are as before. Letting be the set partition of whose only non-singleton block is , we have shown that

Hence our theorized expansion does indeed satisfy the defining identity of . ■

Note that it is sufficient in the preceding theorem and proof to consider only connected sets of *subset-minimal* non-singleton kernels, and it is often practical to do so. The hypergraph whose edges are minimal non-singleton kernels is also of some interest. The well-known Möbius function of a hypergraph is defined on the lattice of connected set partitions, and in this context an element of may be called a *pseudo-kernel*. In comparison, however, our invariant , which is defined on essentially all clutters, seems to be more interesting; we do not know if it has been studied before.

A *semi-clutter* is any anti-chain of subsets . For each finite set , let be the set of semi-clutters spanning . A *species* [2] is an endofunctor on the category of finite sets and bijections, so here we have defined a species of semi-clutters. The compound semi-clutter of a decomposition , as defined by Billera [3], is obtained as a disjoint “sum” of Cartesian “products.” Interpreted in the language of species theory, this is a certain natural transformation

where denotes the composition operation on species, a generalization of composition of exponential formal power series. Let be the set of (connected) clutters spanning , let be the set containing only the maximal clutter on , and let be the set of clutters having no expression as a compound of a proper decomposition (i.e. is the species of “prime” clutters). Billera’s main theorem (attributed to Shapley) establishes a unique reduced compound representation, which is itself a species of decompositions

From this it is evident that can ultimately be reduced to a nested compound expression using only trivial and prime clutters. Hence the problem of enumerating semi-clutters on a vertex set is reduced to the problem of constructing, for any connected clutter, its “maximal proper committees,” which is the nontrivial step in applying [2] to the enumeration of prime clutters. This is a particularly interesting application of formal species.

[1] R. Bacher, “On the Enumeration of Labelled Hypertrees and of Labelled Bipartite Trees.” arxiv.org/abs/1102.2708.

[2] A. Joyal, “Une théorie combinatoire des séries formelles,” Advances in Mathematics, 42(1), 1981 pp. 1–82. doi:10.1016/0001-8708(81)90052-9.

[3] L. J. Billera, “On the Composition and Decomposition of Clutters,” Journal of Combinatorial Theory, Series B, 11(3), 1971 pp. 234–245. doi:10.1016/0095-8956(71)90033-5.

G. Wiseman, “Every Clutter Is a Tree of Blobs,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-7.

The author is a former graduate student of pure mathematics at the University of California, Davis. He is interested in categorical technology applied to discrete mathematics.

**Gus Wiseman**

*gus@nafindix.com*

We present a Mathematica implementation of an algorithm for computing new closed-form evaluations for classes of trig-logarithmic and hyperbolic-logarithmic definite integrals based on the substitution of logarithmic functions into the Maclaurin series expansions of trigonometric and hyperbolic functions. Using this algorithm, we offer new closed-form evaluations for a variety of trig-logarithmic integrals that state-of-the-art computer algebra systems cannot evaluate directly. We also show how this algorithm may be used to evaluate interesting infinite series and products.

Although there are many well-known techniques for the symbolic evaluation of definite integrals, such as Slater’s convolution method, there are many open issues concerning symbolic definite integration using computer algebra systems [1–4]. Computing a closed-form expression for a definite integral often easily reduces to the evaluation of the corresponding indefinite integral, but there are many natural definite integrals of elementary functions that cannot be directly evaluated following that procedure [5]. This article considers the general problem of the symbolic computation of trig-logarithmic and hyperbolic-logarithmic definite integrals. Integrals of this form are interesting in part because they often have surprising, simple and elegant closed-form evaluations involving special functions such as the gamma function and the generalized Riemann zeta function. They may also be used to evaluate infinite series:

In this article, we are primarily concerned with the evaluation of integrals involving an expression of the form or , where is a complex number and is a variable. Integrals of this form emerge in a natural context within physics and engineering, since the Cauchy–Euler differential equation yields the basic solutions and . For example, there are applications based on integrals involving these basic solutions related to the Einstein–Barber field equations [6, 7].

Integrals with an integrand involving a trigonometric function composed with a logarithmic function as a factor are also interesting because there are many integrals of this form that cannot be directly evaluated by state-of-the-art computer algebra systems. For example, consider the definite integral . Mathematica 11.2 is able to evaluate neither this definite integral nor the underlying indefinite integral directly.

Here is a numerical approximation.

A natural way to evaluate the symbolic definite integral would be to expand the integrand by substituting into the Maclaurin series for the sine function and then use Mathematica to evaluate the corresponding infinite series.

Evaluate the expression using Mathematica. This integral would occur in the term-by-term expansion.
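Log-power integrals of this type have a classical closed form: substituting $x = e^{-t}$ turns $\int_0^1 (\ln x)^n\, dx$ into $(-1)^n \int_0^\infty t^n e^{-t}\, dt = (-1)^n\, n!$. As a standalone cross-check (a Python sketch, independent of the article's package), the closed form agrees with direct numerical quadrature:

```python
import math

def log_power_integral(n):
    # Closed form of ∫_0^1 (ln x)^n dx for a nonnegative integer n:
    # the substitution x = e^(-t) gives (-1)^n ∫_0^∞ t^n e^(-t) dt = (-1)^n n!.
    return (-1) ** n * math.factorial(n)

def log_power_integral_numeric(n, steps=100000, cutoff=50.0):
    # Composite midpoint rule for ∫_0^∞ (-t)^n e^(-t) dt, truncated at `cutoff`.
    h = cutoff / steps
    return h * sum((-((i + 0.5) * h)) ** n * math.exp(-(i + 0.5) * h)
                   for i in range(steps))
```

For example, `log_power_integral(3)` evaluates to `-6`.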

The Maclaurin series substitution technique applied to this integral gives the series . Here is a numerical approximation.

Mathematica is able to evaluate the series involving the Riemann zeta function, leading to the elegant closed-form evaluation for the definite integral that we have previously noted [8].

In this article, we generalize the strategy just illustrated, giving a Mathematica program for a function that extends the built-in Wolfram Language function with respect to definite integrals, making it possible to find new evaluations for a large variety of trig-logarithmic and hyperbolic-logarithmic integrals that Mathematica 11.2 is not otherwise able to evaluate directly. The underlying algorithm of the function is based on the substitution of logarithmic functions into the Maclaurin series expansions of certain trigonometric/hyperbolic expressions within numerators of integrands, as illustrated in the preceding example. Since Mathematica is able to evaluate a large variety of integrals involving powers of logarithmic functions, the function is able to evaluate a large variety of interesting and natural definite integrals. Also, the function may be used to prove new evaluations for interesting infinite series:

We begin by offering a variety of illustrations of applications of this algorithm. The integration results given in this article are new in the sense that Mathematica 11.2 is not able to evaluate the definite integrals. The function is documented in the Mathematica package corresponding to this article.

In Section 4, we discuss some nuances concerning . In Section 5, we summarize the main capabilities of this function. In Section 6, we discuss some avenues for future research related to the function.

In this article, Mathematica 11.2 was used to generate results; earlier versions may be too slow.

Given an integral involving an expression of the form , where is a complex number, the function generalizes the strategy outlined in Section 1 to attempt to evaluate the corresponding integral. The function uses the Maclaurin series for the sine function, substitutes a logarithmic function into this power series and integrates it term by term. The resulting infinite series often involves the generalized Riemann zeta function. For example, consider the definite integral .
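To make the term-by-term procedure concrete, here is a self-contained Python sketch for an illustrative integral of our own choosing (not necessarily the article's example). For $\int_0^1 x \sin(\ln x)\, dx$, substituting $\ln x$ into the sine series and using the closed form $\int_0^1 x (\ln x)^n\, dx = (-1)^n\, n!/2^{n+1}$ gives a rapidly convergent series:

```python
import math

def trig_log_integral(terms=25):
    """Evaluate ∫_0^1 x·sin(ln x) dx by substituting ln x into the
    Maclaurin series of sine and integrating term by term, using the
    closed form ∫_0^1 x·(ln x)^n dx = (-1)^n · n! / 2^(n+1)."""
    total = 0.0
    for k in range(terms):
        n = 2 * k + 1                      # odd powers from the sine series
        log_power = (-1) ** n * math.factorial(n) / 2 ** (n + 1)
        total += (-1) ** k * log_power / math.factorial(n)
    return total
```

The partial sums converge to $-1/5$, which matches the value obtained by the elementary substitution $x = e^{-t}$.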

The function is also able to compute new evaluations of integrals involving expressions such as .

is also able to compute new evaluations of integrals involving a factor such as .

Integration results such as those in Section 2 may be used to prove interesting infinite series formulas using this identity:

(1)

This elegant infinite series formula is easily verified using equation (1) together with the function:

(2)

To prove it, use (1) with the Maclaurin series expansion of the expression within the integrand of . It is known that [9], where several elegant proofs of this formula are given. Other new infinite series formulas, such as the following, may be proven in the same way.

**Proposition 1**

**Proof**

Begin by evaluating .

Expand the factor in the integrand as a Maclaurin series, . The desired result follows by integrating both sides of this equality using the evaluation and simplifying the resultant summand. □

**Proposition 2**

**Proof**

First evaluate .

We have that . Integrating both sides of this equality using the evaluation finishes the proof. □

**Proposition 3**

**Proof**

Evaluate .

To evaluate the preceding infinite series, expand the factor within the integrand, integrate the resultant summand and use the evaluation of the integral. □

**Proposition 4**

**Proof**

Evaluate .

Expand the factor and follow the technique used in the previous proofs. □

The function is able to evaluate a variety of definite integrals involving expressions of the form , where is a complex number.

The evaluation of this trig-logarithmic integral involves Catalan’s constant.

This one involves the Glaisher–Kinkelin constant.

Here .

We prove the following well-known infinite product formula following the strategy used in Section 2.1: evaluate , substitute the Maclaurin series for into this integral and integrate term by term.

Similarly, we prove the following new infinite series formulas.

**Proposition 5**

**Proof**

First evaluate .

Expanding in the preceding integrand,

Integrating both sides of this equation using the integral evaluation and simplifying yields the desired result. □

**Proposition 6**

**Proof**

Evaluate .

To evaluate the series, expand in the integrand, integrate the resultant summand and simplify, and use the integral evaluation. □

**Proposition 7**

**Proof**

Evaluate .

The series may then be evaluated like the others. □

As discussed in Section 1, the function is based on the substitution of logarithmic functions into Maclaurin series. Our implementation is in many ways based on the use of string manipulation through Wolfram Language string operations such as . Given an integrand as input, the function converts the numerator of to a string and determines whether the numerator of contains trigonometric, hyperbolic or logarithmic expressions. If contains a suitable combination of functional expressions, such as a sine function and a logarithmic function, uses string manipulation to determine whether Mathematica 11.2 is able to evaluate a definite integral involving logarithmic powers that is needed to evaluate an infinite series, following the technique described in Section 1. If Mathematica 11.2 is able to compute a closed-form evaluation for this log-power integral, then the function uses such an evaluation to evaluate the infinite series.

The function is an extension of the function in terms of definite integrals, in the sense that if the function is able to evaluate a definite integral, then returns the same evaluation. However, does not extend with respect to indefinite integrals. Also, there are certain classes of definite integrals that neither the function nor can evaluate (as an infinite series or otherwise).

The function is able to directly compute new evaluations for a variety of integrals of the following forms that cannot be directly computed by state-of-the-art computer algebra systems (here , , , etc. denote complex numbers).

where

We have shown how may be used to construct new evaluations of a variety of interesting infinite series, such as series involving the inverse tangent function. is also able to directly compute closed-form evaluations of definite integrals involving products of trig-logarithmic and hyperbolic-logarithmic expressions.

We currently leave it as an open problem to generalize . There are many natural ways to generalize it. Many definite integrals have an integrand with a factor consisting of the composition of two elementary transcendental functions that may be evaluated using Maclaurin series, but that cannot be directly evaluated using state-of-the-art computer algebra systems. For example, Mathematica 11.2 is not able to compute the following integral.

Using the Maclaurin series for the square of the inverse sine, it is easily seen that

Implementing a generalization of this strategy to evaluate using Mathematica may be difficult, since Mathematica 11.2 is not able to compute integrals such as .

We currently leave it as an open problem to construct an analog of the function that is able to directly evaluate infinite series such as the summation evaluated in Section 2.

A natural way to generalize the function would be to construct a program for directly evaluating more general classes of logarithmic-power integrals. For example, these logarithmic-power integrals have simple closed-form evaluations, but Mathematica 11.2 is not able to directly evaluate general formulas for these integrals.

The function is designed to evaluate definite integrals with an integrand with a *factor* in the form of a trig-logarithmic or hyperbolic-logarithmic function. However, it is clear that the general strategy outlined in Section 1 may be applied more generally. That is, this technique may be applied to certain types of definite integrals with factors other than or . We currently leave it as an open problem to generalize the function so as to be able to apply this Maclaurin series substitution technique to integrands without such factors.

The author would like to thank a reviewer for many useful comments and suggestions.

[1] A. A. Adams, H. Gottliebsen, S. A. Linton and U. Martin, “VSDITLU: A Verifiable Symbolic Definite Integral Table Look-Up,” in CADE-16: Proceedings of the 16th International Conference on Automated Deduction: Automated Deduction, H. Ganzinger (ed.), Trento, Italy, July 7–10, 1999, London: Springer, 1999 pp. 112–126. doi:10.1007/3-540-48660-7_8.

[2] D. Lichtblau, “Symbolic Definite Integration: Methods and Open Issues,” talk given at 2005 Wolfram Technology Conference, Champaign IL, 2005. library.wolfram.com/infocenter/Conferences/5832.

[3] D. Stoutemyer, “Crimes and Misdemeanors in the Computer Algebra Trade,” Notices of the American Mathematical Society, 38(7), 1991 pp. 778–785. www.researchgate.net/publication/235941737_Crimes_and_Misdemeanors_in_the_Computer_Algebra_Trade.

[4] B. Terelius, “Symbolic Integration,” Master’s thesis, Royal Institute of Technology, School of Computer Science and Communication, Stockholm, Sweden, 2009. www.nada.kth.se/utbildning/grukth/exjobb/rapportlistor/2009/rapporter09/terelius_bjorn_09095.pdf.

[5] P. S.-H. Wang, Evaluation of Definite Integrals by Symbolic Manipulation, Technical report MIT/LCS/TR-92, Project MAC, Massachusetts Institute of Technology, Cambridge, MA, 1971. publications.csail.mit.edu/lcs/pubs/pdf/MIT-LCS-TR-092.pdf.

[6] K. S. Adhav, A. S. Nimkar, V. G. Mete and M. V. Dawande, “Axially Symmetric Bianchi Type-I Model with Massless Scalar Field and Cosmic Strings in Barber’s Self-Creation Cosmology,” International Journal of Theoretical Physics, 49(5), 2010 pp. 1127–1132. doi.org/10.1007/s10773-010-0293-6.

[7] D. D. Pawar, S. N. Bayaskar and A. G. Deshmukh, “String Cosmological Model in Presence of Massless Scalar Field in Modified Theory of General Relativity,” Romanian Journal of Physics, 56(5–6), 2011 pp. 842–848. www.nipne.ro/rjp/2011_56_5-6/0842_0848.pdf.

[8] N. J. A. Sloane, “The On-Line Encyclopedia of Integer Sequences,” sequence A265011. oeis.org.

[9] Namagiri, “A Closed Form for the Infinite Series .” Mathematics Stack Exchange. (Apr 30, 2014) math.stackexchange.com/questions/776182.

J. M. Campbell, “An Algorithm for Trigonometric-Logarithmic Definite Integrals,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-6.

Additional electronic files:

Available at: www.mathematica-journal.com/data/uploads/2017/10/TIntegratePackage.m

John M. Campbell received an M. Math. degree from the University of Waterloo and graduated first class with distinction with a Specialized Honours Bachelor of Science degree from York University. He has worked as a research assistant at the Fields Institute for Research in Mathematical Sciences and at York University.

**John M. Campbell**

*York University
4700 Keele Street
Toronto, ON M3J 1P3*

Rice–Ramsperger–Kassel–Marcus (RRKM) theory calculates an energy-dependent microcanonical unimolecular rate constant for a chemical reaction from a sum and density of vibrational quantum states. This article demonstrates how to program the Beyer–Swinehart direct count of the sum and density of states for harmonic oscillators, as well as the Stein–Rabinovitch extension for anharmonic oscillators. Microcanonical rate constants are calculated for the decomposition of vinyl cyanide () into , and as an example.

The essential framework for our understanding of unimolecular reaction rates was developed largely in the first half of the twentieth century [1]. The *Lindemann–Hinshelwood mechanism*, proposed in 1922 by Lindemann and later expanded upon by Hinshelwood [2], posited that gas-phase unimolecular reactions are initiated by bimolecular collisions. This allowed the rate of the reaction to be calculated as a function of internal energy, regardless of the method of activation. Later that decade, Rice and Ramsperger and, independently, Kassel developed an improvement on the model known as the *Rice–Ramsperger–Kassel (RRK)* model [3, 4]. The RRK model viewed the molecule as a system of identical harmonic oscillators and introduced the idea of activation energy: that sufficient energy must be deposited into specific modes of motion in order for the reaction to occur. In 1952, Rice and Marcus extended this into what is now known as *Rice–Ramsperger–Kassel–Marcus* theory *(RRKM)*, which incorporates a more complete quantum mechanical description of the molecule and utilizes the concept of the transition state, a specific conformation that the molecule must adopt in order for the reaction to proceed [5]. The RRKM model remains the prevailing scientific description of unimolecular chemical kinetics to this day.

Chemical transformations require bonds to break in the reactants and new bonds to form as products are generated. The potential energy associated with these transformations defines a $(3N-6)$-dimensional *potential energy surface (PES)*, where $N$ is the number of atoms [6]. The *reaction pathway* is a one-dimensional cross section of the PES composed of the precursor, intermediates, transition states and products of a particular reaction [6]. Reactants and intermediates of a reaction correspond to minima, whereas transition states are located at maxima along the reaction pathway.

The rate of the reaction depends on the probability of the molecule adopting the conformations along the reaction pathway, which, in turn, stems from the probability that a sufficient amount of energy is partitioned into the necessary modes of motion. The RRKM formalism assumes that: (1) once the molecule adopts the transition state orientation, the reaction proceeds to products; and (2) the *internal vibrational energy redistribution (IVR)* is fast with respect to the timescale of the reaction. Under these constraints, the RRKM microcanonical rate constant, , is given by

$$k(E) = \frac{\sigma\, N^{\ddagger}(E - E_0)}{h\, \rho(E)} \tag{1}$$

where $E$ is the internal energy of the system, $E_0$ is the activation energy for the reaction, $N^{\ddagger}(E - E_0)$ is the sum of states of the transition state from $0$ to $E - E_0$, $\rho(E)$ is the density of states of the precursor at energy $E$, $h$ is Planck’s constant, and $\sigma$ is the degeneracy of the reaction pathway [3, 4, 5]. A state is defined as any unique vibrational configuration that determines the internal energy of the molecule. The sum of states, $N(E)$, is the total number of states within a specific energy range [6]. The density of states is the number of states per energy level. Equivalently, the density of states is the derivative of the sum of states with respect to energy [6]. The density is expressed in units of inverse energy, while the sum of states is a dimensionless quantity. Thus, the microcanonical rate constant of equation (1) has units of inverse time.
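On a discrete energy grid, equation (1) amounts to one lookup in each table of state counts. Here is a minimal Python sketch (hypothetical names; $\sigma$, $h$ and the grid spacing are placeholders to be supplied in consistent units):

```python
def rrkm_rate(sum_states_ts, density, energy, e0, sigma=1, h=1.0):
    """Microcanonical rate constant k(E) = sigma * N(E - E0) / (h * rho(E))
    on an integer energy grid: sum_states_ts[i] is the transition-state
    sum of states at energy i above the barrier, density[i] is the
    precursor density of states at energy i, and e0 is the activation
    energy of the pathway."""
    if energy < e0 or density[energy] == 0:
        return 0.0                # below the barrier, or no precursor states
    return sigma * sum_states_ts[energy - e0] / (h * density[energy])
```

Energies below the activation energy give a rate of zero, since no transition-state levels are accessible there.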

There are $3N - 6$ vibrational frequencies for a nonlinear polyatomic molecule with $N$ atoms and $3N - 7$ vibrational frequencies for the corresponding transition state. It is the values of these frequencies, which typically range from , that dictate how the internal energy is distributed in the molecule. The calculation for the sum and density of states depends on these frequencies, which can be measured or looked up in standard data tables. However, for transition states, these values must be estimated or, more likely, calculated using theoretical packages.

This article is divided into the following sections: In Section 2, the sum and density of states for vinyl cyanide () are determined through a direct count method analogous to that developed by Beyer and Swinehart [7]. In Section 3, the method is extended to anharmonic oscillators using the Stein–Rabinovitch algorithm [8]. In Section 4, the RRKM microcanonical rate constants are calculated for the unimolecular dissociation of vinyl cyanide and are used to predict various dynamic properties of the chemical reaction.

*Beyer and Swinehart* (*BS*) developed a surprisingly simple yet computationally intensive direct count method to calculate the sum and density of states at defined energies of a harmonic oscillator [7]. The advantage of conducting such calculations in Mathematica is that the user can leverage high-quality numerical ODE solvers and interactive graphical features to visualize dynamic properties of the chemical reaction. This is demonstrated in Section 4, where the temporal development of nine different species is plotted and automatically updated as the user sets the internal energy of the system with a slider. All of this can be done in real time after some initial computations, which are exact in Mathematica.

There are a number of equivalent ways to implement the BS direct count of states. To ease the exposition, we first illustrate the task with a loop. The sum and density of states are determined by the vibrational frequencies of the species of interest, which we define in the list . In more complex situations, such as biomolecules or reactions at very high energies, to lessen the computational burden it may be necessary to divide the value of the total energy into larger packets and modify the vibrational frequencies to be multiples of the new packet size [8]. In this case, the packet size is 1 and the vibrational frequencies for vinyl cyanide are taken from reference [9].

The final result for the sum or density of states is a list with elements. For the sum of states, this list is initialized with all values set to 1. The sum of states can then be calculated using a nested (or doubly indexed) loop.

Because the density of states is the derivative of the sum of states, it can also be determined from the sum of states table by setting the element of the density of states to the difference between the and elements of the sum of states table. Alternatively and equivalently, this can be done to the initial sum of states table to derive the initial density of states table, which is a list with elements with the initial element set to 1 and all other elements set to 0. This can be conveniently done with .
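The loops just described can be restated compactly; here is a minimal Python sketch of the BS direct count (integer energy grid, frequencies in grid units; an illustration, not the article's Mathematica code):

```python
def beyer_swinehart(frequencies, e_max):
    """Beyer-Swinehart direct count for harmonic oscillators on an
    integer energy grid 0..e_max (frequencies in grid units).
    Returns (sum_of_states, density_of_states)."""
    sum_states = [1] * (e_max + 1)          # initial sum-of-states table
    density = [1] + [0] * e_max             # initial density-of-states table
    for f in frequencies:
        for i in range(f, e_max + 1):       # in-place cumulative update
            sum_states[i] += sum_states[i - f]
            density[i] += density[i - f]
    return sum_states, density
```

Each pass over a frequency adds, to every energy bin, the count from one quantum lower in that mode; running the same update on the two different initial tables yields the sum and the density of states, respectively.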

A visual comparison of the sum and density of states, calculated for vinyl cyanide (), shows the behavior of the two quantities over the energy range. Since the general trend of the graphic is clear, we subsample the lists of values to avoid plotting millions of points, which would needlessly increase file size.

Thus, when vinyl cyanide has an internal energy of , the sum and density of quantum states are approximately and .

Notice that the loops for computing the sum and density of states perform the same basic process—the BS direct count method—on different initial lists. As we proceed, adopting a functional programming approach eases the exposition and vastly improves code clarity. We do this by introducing two new functions, and . The first updates the lists of sums and densities for a single vibrational frequency, and the second uses to iteratively update for each of the vibrational frequencies in the table.

We then use these to create the two functions and , which take a frequency list and total energy as arguments and do just as their names imply. The last argument to these functions, , is used in Section 4.

For completeness, we recompute the sum and density of states using these functions to illustrate how that can be done. It is readily verified that these compute the same quantities.

The harmonic oscillator model assumes that the difference between successive vibrational energy levels remains constant. In real oscillators, this difference decreases as the vibrational excitation increases. These anharmonic effects increase both the sum and density of states and thus often cancel when taking the ratio to determine the RRKM rate constant in equation (1). However, there are some cases where this effect does not completely cancel or where highly resolved measurements necessitate a more accurate description of the vibrational energy.

The *Stein–Rabinovitch (SR)* algorithm extends the BS method to incorporate the effects of anharmonicity into the RRKM rate constant [8]. Following [10], the anharmonic effects are incorporated into the energy level expression through the equation

$$E_v = \omega_e\left(v + \tfrac{1}{2}\right) - \omega_e x_e\left(v + \tfrac{1}{2}\right)^2 \tag{2}$$

where $E_v$ is the energy of the $v$th vibrational level, $\omega_e$ is the vibrational frequency, and $x_e$ is the anharmonicity constant. The subscript $e$ stands for equilibrium. It is chosen because this would be the vibrational frequency if the oscillator vibrated harmonically about its equilibrium position (the minimum in the well). The SR extension counts each energy level of the anharmonic model to calculate the sum and density of states in a way that is functionally similar to the BS method; however, the input table is modified to include the anharmonicity constant for each vibrational mode. For this example, we have organized the anharmonicity constants, , into the table . Using the vibrational frequencies of vinyl cyanide from [9], and assigning 0.01 as the anharmonicity constant for each vibration, the energy for each level of each mode can be calculated.

With these we can calculate the sum and density of states, initializing the lists in the same manner as the BS method (a list of 1s for the sum of states, and a 1 followed by 0s for the density of states). The process is functionally similar to that used to calculate the sum and density of states for a set of harmonic oscillators. In the BS method, the elements of a single table are modified with each iteration of the inner loop. By contrast, in the SR extension, the elements of the table are updated through separable computations combined after each complete loop for each oscillator.

The implementations can be verified by replicating the calculations in the appendix of [8], which shows the calculation of the density of states up to 10 units of energy for a model system with two oscillators. The energies of the first oscillator are and the energies of the second are .
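As a language-neutral restatement (a Python sketch with made-up level energies, not the appendix data), the SR count convolves the density table with the level spectrum of each oscillator in turn:

```python
def stein_rabinovitch(level_energies, e_max):
    """Stein-Rabinovitch direct count: each oscillator is given by its
    full sorted list of level energies (level 0 at energy 0), so
    anharmonic spacings are handled directly. Returns the density of
    states on the integer energy grid 0..e_max."""
    density = [1] + [0] * e_max
    for levels in level_energies:
        new = [0] * (e_max + 1)
        for e in levels:                    # convolve with this oscillator
            if e > e_max:
                break
            for i in range(e, e_max + 1):
                new[i] += density[i - e]
        density = new
    return density
```

Each oscillator is described by its own list of level energies, so arbitrary (anharmonic) spacings are handled directly; with evenly spaced levels the result reduces to the BS harmonic count.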

With these, we can define SR analogs to and .

We can then use them to compute the SR sum and density of states.

Here are two plots comparing the harmonic and anharmonic models for the density and sum of states calculated for vinyl cyanide.

As a more substantive example, we calculate rate constants for the decomposition of vinyl cyanide () into hydrogen cyanide (), hydrogen isocyanide () and acetylene (), compare them to published values, and interactively visualize the temporal dependence of the various species. In [9], Homayoon and colleagues considered the following kinetic model that involved seven different paths, three different intermediates, 13 rate constants and three different products:

Each step of each path passes through a transition state to yield either an intermediate or a final product. The following differential rate equations characterize the temporal dependence of the amounts of each species:
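The article solves the full seven-path system with Mathematica's differential equation machinery. As a hedged, language-neutral illustration of integrating such rate equations, here is a classical fourth-order Runge–Kutta step applied to a toy consecutive scheme A → B → C (not the vinyl cyanide model):

```python
def rk4_step(deriv, state, dt):
    """One classical Runge-Kutta (RK4) step for dy/dt = deriv(y)."""
    k1 = deriv(state)
    k2 = deriv([y + 0.5 * dt * k for y, k in zip(state, k1)])
    k3 = deriv([y + 0.5 * dt * k for y, k in zip(state, k2)])
    k4 = deriv([y + dt * k for y, k in zip(state, k3)])
    return [y + dt / 6.0 * (a + 2 * b + 2 * c + d)
            for y, a, b, c, d in zip(state, k1, k2, k3, k4)]

def consecutive_reaction(k1, k2, t_end, steps=1000):
    """Integrate the toy scheme A -> B -> C with rate constants k1, k2,
    starting from a unit amount of A."""
    def deriv(s):
        a, b, c = s
        return [-k1 * a, k1 * a - k2 * b, k2 * b]
    state, dt = [1.0, 0.0, 0.0], t_end / steps
    for _ in range(steps):
        state = rk4_step(deriv, state, dt)
    return state
```

Mass balance (the three concentrations sum to the initial amount) is preserved to numerical precision, and the precursor decays as $e^{-k_1 t}$.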

The determination of the sum and density of states requires the vibrational frequencies for each species located at minima and maxima along the reaction pathway. For the decomposition reaction of vinyl cyanide, there are six minima (the geometry of along with each intermediate) and 11 maxima (each unique transition state). These 17 sets of frequencies are taken from reference [9] and are provided in Table 1. In the following definitions, we have chosen the same notation as in reference [9]. TS indicates a transition state, the particular path is indicated by a Roman numeral, and the minimum energy of each species relative to the minimum energy of vinyl cyanide has the prefix `re`.

We now turn to the calculation of the sum and density of states and then use the values to calculate each rate constant. We begin by defining the vibrational constants.

Because the internal energy is defined relative to the well depth of vinyl cyanide, the sums and densities of states for the other species must be calculated using energy adjusted by the lowest energy of each species relative to the lowest energy of vinyl cyanide (peak height for the transition states and well depth for the intermediates).

Using and from Section 1, we can now compute the sums and densities of states in parallel.

Using equation (1), we now compute the 13 rate constants, taking care to avoid indeterminate values such as 0/0.
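The guard against indeterminate values can be sketched as follows. The function assumes the standard microcanonical RRKM form k(E) = W‡(E − E0)/(h ρ(E)) on a grained energy ladder, with hypothetical array arguments; below the barrier the sum of states vanishes, so the function returns zero rather than evaluating 0/0.

```python
H_PLANCK = 6.62607015e-34  # Planck constant, J s

def rrkm_k(e, e0, sum_ts, density_re, grain):
    """Microcanonical RRKM rate constant on a grained energy ladder.

    e, e0      : internal energy and critical energy, in grain units
    sum_ts     : sum of states of the transition state, indexed by energy
    density_re : density of states of the reactant, states per grain
    grain      : grain size in joules
    """
    if e < e0 or density_re[e] == 0:
        return 0.0                      # no flux below the barrier: avoid 0/0
    rho = density_re[e] / grain         # convert to states per joule
    return sum_ts[e - e0] / (H_PLANCK * rho)
```

With toy arrays, energies below the barrier give a rate of exactly zero, and energies above it give a finite positive rate.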

The rate constants can now be visualized. Note that as energy increases, all the reaction rates also increase.

The temporal dependence of the relative concentrations of each stable species (precursor, intermediates and products) can be determined using the rate equations and . The reaction dynamics of the decay of the precursor vinyl cyanide, the subsequent buildup and decay of the intermediates, and the formation of products, each normalized with respect to , are plotted at a given internal energy determined by a slider.

A number of qualitative and quantitative results can be gleaned by simple inspection of this dynamic plot. One example is the amount of time it takes for the reaction to run to completion as energy increases. The plot also indicates that there is never a significant amount of any intermediate apart from that in path III. This is not surprising, given the significantly lower energy for the first transition state in path III relative to the first transition state in pathways IV and V, and given that three pathways share .

A final element determined from this analysis is the exit channel ratios (or product distribution ratios) determined at reaction completion and for a specific system internal energy. This is important because many experimental measurements are only capable of measuring product ratios as a function of precursor internal energy. Thus, experimental scientists require such an RRKM analysis to provide a dynamic and atomistic picture of the chemical reaction. The following code determines and plots the final to ratio at internal energies up to .

This determines the energy at which HNC reaches its maximum relative abundance.

The last plot indicates that production is favored at all system internal energies. Isocyanide () only becomes a noticeable product at a system internal energy above . This is due to paths IV–VII becoming competitive with path I at higher energies. Surprisingly, the product ratio decreases at energies greater than (equivalently, there is a maximum in the product distribution curve). This result would be difficult to predict, even qualitatively, with just the calculated rate constants, demonstrating the usefulness of the powerful tools available in Mathematica.

[1] F. Di Giacomo, “A Short Account of RRKM Theory of Unimolecular Reactions and of Marcus Theory of Electron Transfer in a Historical Perspective,” Journal of Chemical Education, 92(3), 2015 pp. 476–481. doi:10.1021/ed5001312.

[2] F. A. Lindemann, S. Arrhenius, I. Langmuir, N. R. Dhar, J. Perrin and W. C. McC. Lewis, “Discussion on ‘The Radiation Theory of Chemical Action’,” Transactions of the Faraday Society, 17, 1922 pp. 598–606. doi:10.1039/TF9221700598.

[3] O. K. Rice and H. C. Ramsperger, “Theories of Unimolecular Gas Reactions at Low Pressures,” Journal of the American Chemical Society, 49(7), 1927 pp. 1617–1629. doi:10.1021/ja01406a001.

[4] L. S. Kassel, “Studies in Homogeneous Gas Reactions I,” The Journal of Physical Chemistry, 32(2), 1928 pp. 225–242. doi:10.1021/j150284a007.

[5] R. A. Marcus, “Unimolecular Dissociations and Free Radical Recombination Reactions,” The Journal of Chemical Physics, 20(3), 1952 pp. 359–364. doi:10.1063/1.1700424.

[6] IUPAC, Compendium of Chemical Terminology, 2nd ed. (the “Gold Book”) (compiled by A. D. McNaught and A. Wilkinson), Oxford: Blackwell Scientific Publications, 1997. goldbook.iupac.org (XML online corrected version, created by M. Nic, J. Jirat and B. Kosata; updates compiled by A. Jenkins).

[7] T. Beyer and D. F. Swinehart, “Algorithm 448: Number of Multiply-Restricted Partitions,” Communications of the ACM, 16(6), 1973 p. 379. doi:10.1145/362248.362275.

[8] S. E. Stein and B. S. Rabinovitch, “Accurate Evaluation of Internal Energy Level Sums and Densities Including Anharmonic Oscillators and Hindered Rotors,” The Journal of Chemical Physics, 58(6), 1973 pp. 2438–2445. doi:10.1063/1.1679522.

[9] Z. Homayoon, S. A. Vázquez, R. Rodríguez-Fernández and E. Martínez-Núñez, “Ab Initio and RRKM Study of the HCN/HNC Elimination Channels from Vinyl Cyanide,” The Journal of Physical Chemistry A, 115(6), 2011 pp. 979–985. doi:10.1021/jp109843a.

[10] P. M. Morse, “Diatomic Molecules According to the Wave Mechanics. II. Vibrational Levels,” Physical Review, 34(1), 1929 pp. 57–64. doi:10.1103/PhysRev.34.57.

A. C. Mansell, D. J. Kahle and D. J. Bellert, “Calculating RRKM Rate Constants from Vibrational Frequencies and Their Dynamic Interpretation,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-5.

Adam C. Mansell is a Ph.D. candidate in the Bellert Group of the Department of Chemistry and Biochemistry at Baylor University. He holds a B.A. in chemistry from Concordia University Irvine.

David J. Kahle is a computational statistician and assistant professor of statistics at Baylor University. He holds a B.A. in mathematics from the University of Richmond and M.A. and Ph.D. degrees in statistics from Rice University.

Darrin J. Bellert is an associate professor of chemistry at Baylor University. He received his B.S. from Wright State University in Ohio, and his Ph.D. from Florida State University.

**Adam C. Mansell**

*Department of Chemistry and Biochemistry
Baylor University
One Bear Place #97348
Waco, TX 76798*

**David J. Kahle**

*Department of Statistical Science
Baylor University
One Bear Place #97140
Waco, TX 76798*

**Darrin J. Bellert**

*Department of Chemistry and Biochemistry
Baylor University
One Bear Place #97348
Waco, TX 76798*

This article continues the presentation of a variety of applications around the theme of inversion: quandles, inversion of one circle into another and inverting a pair of circles into congruent or concentric circles.

Recent decades have seen a rebirth of geometry as an important subject both in the curriculum and in mathematical and computational research. Dynamic geometry programs have met the demand for visual, specialized computational tools that help bridge the gap between purely visual and algebraic methods. This development has also extended the understanding of the theoretical and computational foundations of geometry, which in turn has stimulated the proliferation of several new branches of geometry, producing a more mature and modern discipline.

In this spirit, these articles [1, 2] have been written to be useful as additional material in a teaching environment on computational geometry, following the practice of the author in teaching the subject at the beginning university level. This third article includes a section on *quandles* (algebraic generalizations of inversion) that describes their properties and generates all finite quandles up to order five. Also, we include the construction of a circle inverting one circle into another, followed by a section on the construction of a circle inverting two circles, or a circle and a line, into a pair of concentric circles.

Let mean the inverse of the object in the circle with center and radius , drawn as a red dashed circle.

We repeat the definitions of the functions , , and from the previous article [2].

The function computes the square of the Euclidean distance between two given points. (It is more convenient to use the following definition than the built-in Mathematica function .)

The function tests whether three given points are collinear. When exactly two of them are equal, it gives , and when all three are equal, it gives , because there is no unique line through them.

The function computes the unique circle passing through three given points; if they are collinear, then the function is applied first.

The function computes the circle passing through the points , and . If the points are collinear, it gives the line through them; if all three points are the same, it returns an error message, as there is no meaningful definition of inversion in a circle of zero radius.

The function computes the inverse of in a circle or line . The object can be a point (including the special point that inverts to the center of ), a circle or a line (specified by two points).
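For the point case, the geometry is simple enough to state in a few lines. The following Python sketch (the article's implementation is in Mathematica) maps a point at distance d from the center to the point on the same ray at distance r²/d.

```python
def invert_point(p, center, r):
    """Inverse of the point p in the circle with the given center and radius."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    d2 = dx*dx + dy*dy
    if d2 == 0:
        raise ValueError("the center of inversion has no finite inverse")
    s = r*r / d2                        # |p' - center| = r^2 / |p - center|
    return (center[0] + s*dx, center[1] + s*dy)
```

Applying the map twice returns the original point, and points on the circle itself are fixed, which are the defining properties of inversion.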

The geometric definition of inversion of circles can be formalized algebraically and thus be generalized. Let denote the result of inverting in . Quandles arise mostly in knot theory and group theory and are characterized by the following axioms [3]:

The first two axioms correspond to well-known properties of inversion.

The following figure illustrates the third axiom. Red arrows go from the center of the circle to be inverted to the center of its inversion.

A set equipped with a binary operation whose elements satisfy these three axioms is called an *involutory quandle*, or *quandle* for short. The operation is neither commutative nor associative. Inversive geometry applied to generalized circles is an example of an infinite quandle. There are other sets that also satisfy the axioms; for example, if , the operation is a quandle.

Finite quandles are somewhat curious; for instance, the following is the operation matrix corresponding to a six-element quandle (due to Takasaki).

This verifies that this structure generalizes to a quandle under any modulus.
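The claim can be checked by brute force. The sketch below, in Python rather than the article's Mathematica, assumes the Takasaki operation on the integers mod n is x ◦ y = (2y − x) mod n, and tests the three involutory-quandle axioms directly.

```python
def is_involutory_quandle(n, op):
    """Check the three involutory-quandle axioms on {0, ..., n-1}."""
    r = range(n)
    idempotent = all(op(x, x) == x for x in r)                    # x * x == x
    involutory = all(op(op(x, y), y) == x for x in r for y in r)  # (x*y)*y == x
    distributive = all(op(op(x, y), z) == op(op(x, z), op(y, z))  # right self-
                       for x in r for y in r for z in r)          # distributivity
    return idempotent and involutory and distributive

# Takasaki operation for the six-element example.
takasaki = lambda x, y, n=6: (2*y - x) % n
```

Running the check for a range of moduli confirms that the structure is a quandle for every n, since (x◦y)◦z and (x◦z)◦(y◦z) both reduce to 2z − 2y + x mod n.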

A matrix corresponding to a finite quandle has different elements appearing in its main diagonal and also different elements in each of its columns (i.e. for all elements , , there exists a unique such that ). Also, for every triple of indices, we must have

Is there an arrangement of generalized circles that forms a finite quandle under mutual inversion, that is, one that is closed under inversion? Any two orthogonal circles form a two-element quandle. Also, consider a set of lines equally spaced, passing through the origin. This set is closed under reflection. Taking and labeling the lines from to gives the Takasaki matrix. A circle centered at the origin and lines equally spaced produce a set closed under inversion; if we label the circle as , the matrix in this case has all elements in the last row equal to . Let us now generate all finite quandles of size .

Computing the number of finite quandles by this method is computationally expensive, as the variable is of length . With , using the previous code, it took about an hour, reporting 404 instances (timed on a 3.1 GHz Mac Pro with 16 GB of RAM).

The following is an example of a set of four circles closed under inversion (that is, any circle in the set inverted in any other circle results in a circle in the set), also called the *inversive group of three points* [3]. The function computes a disk or a line passing through point such that points and form an inverse pair under inversion in . In the following , you can drag a locator to modify one of the four circles; the others are computed accordingly.

The function slightly varies three points that are coincident or collinear.

For more on quandles, see [4].

Throughout this section, let and be circles with , and let the inversion circle be , such that . Call such an the *midcircle* of and . There are three cases, depending on the relative positions of and .

If , a reflection takes into , so assume from now on.

**Theorem 1**

Let and not intersect and be external to each other; say is to the left of and assume . Then is at a distance from along the line , is to the right of , and .

Additionally, is orthogonal to every circle tangent to both and , and is orthogonal to every circle orthogonal to and .

Assume without loss of generality that and . Draw two parallel radii from and defining variable points and . Extend the lines and to intersect at point . From similar triangles, , and we easily conclude that .

Extend the lines and to intersect at the point . Construct the circle , which is tangent to and .

The first part of the following output checks that the circle inverts into , with .

The second part checks that , so that and invert to each other in .

The circle separates and , while connects them, so intersects in two points and , each of which is fixed under inversion in . Since , and on invert into , and , also on , inverts to itself (though not pointwise), and and are orthogonal.

Let be orthogonal to and , so that both and are invariant under inversion in . If inverts into in , then and invert into each other in , so , and and are orthogonal.

This notation is followed in the next . (A circle orthogonal to and is not drawn.)
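The theorem is also easy to check numerically with the standard formula for the inverse of a circle. The Python sketch below uses an assumed concrete pair of external circles, and places the midcircle's center at their external center of similitude with power k² = (r_B/r_A)(d² − r_A²); both choices are assumptions consistent with the construction above, not the article's code.

```python
def invert_circle(c, r, p, k2):
    """Image of the circle (center c, radius r) under inversion with center p
    and power k2, assuming p does not lie on the circle."""
    dx, dy = c[0] - p[0], c[1] - p[1]
    t = k2 / (dx*dx + dy*dy - r*r)
    return ((p[0] + t*dx, p[1] + t*dy), abs(t)*r)

# Assumed example: two nonintersecting circles external to each other.
ca, ra = (0.0, 0.0), 1.0
cb, rb = (5.0, 0.0), 2.0

# Midcircle center: the external center of similitude of the two circles.
p = ((ra*cb[0] - rb*ca[0]) / (ra - rb), 0.0)
d2 = (ca[0] - p[0])**2 + (ca[1] - p[1])**2
k2 = (rb/ra) * (d2 - ra*ra)          # power of the swapping inversion

image_center, image_radius = invert_circle(ca, ra, p, k2)
# -> ((5.0, 0.0), 2.0): the midcircle swaps the two circles
```

The image of the first circle lands exactly on the second, so the circle centered at p with radius √k2 is indeed a midcircle of the pair.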

**Theorem 2**

Suppose and do not intersect, and let be inside . Then is at a distance from along the line joining the centers, and .

To verify that inverts the circle into the circle , proceed as in the previous section.

The next follows the previous construction of the midcircle of and .

**Theorem 3**

Let and intersect. Then there are two midcircles taking to , corresponding to the midcircles in theorems 1 and 2.

To verify this, is inverted in the two circles from theorems 1 and 2.

The following shows the construction of both midcircles following the previous notation.

**Theorem 4**

**Proof**

Let be a circle centered on the midcircle . Inverting in , we obtain a line (drawn in black in the following ). As separates and , must separate and ; in fact, must invert into . As is a line, inversion in is reflection, hence and are congruent. □

The following shows additionally that the brown line joining the centers of and inverts into a brown circle orthogonal to the blue line and to the congruent circles and , as was expected.

For more on the geometry of circles and inversion, see [5, 6, 7].

The radical axis of two circles is the locus of points from which the tangents to the two circles have equal length. It is always a straight line perpendicular to the line joining the centers of the two circles. If the circles intersect, the radical axis is the line through the points of intersection. If the circles are tangent, it is their common tangent. We will need the following property of the radical axis of two circles [8, 9, 10].

**Theorem 5**

For and , let the point , where . Construct the line at perpendicular to the line AB. Then is the radical axis of and .

This checks that any point on has tangents to and of equal length.
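A numerical version of the same check can be sketched in Python with assumed example circles; the standard formula d_A = (d² + r_A² − r_B²)/(2d) locates the foot of the radical axis on the line of centers.

```python
import math

def radical_axis_foot(ca, ra, cb, rb):
    """Distance from center A, along the line AB, to the radical axis:
    the standard formula (d^2 + ra^2 - rb^2) / (2 d)."""
    d = math.dist(ca, cb)
    return (d*d + ra*ra - rb*rb) / (2*d)

def tangent_length(p, c, r):
    """Length of a tangent from the point p to the circle (c, r)."""
    return math.sqrt(math.dist(p, c)**2 - r*r)

# Assumed example circles.
ca, ra = (0.0, 0.0), 1.0
cb, rb = (4.0, 0.0), 2.0
x = radical_axis_foot(ca, ra, cb, rb)   # 13/8 for this pair
p = (x, 3.0)                            # an arbitrary point on the axis
# tangent_length(p, ca, ra) == tangent_length(p, cb, rb)
```

Any point with the same first coordinate x gives the same agreement, since the tangent-length condition is exactly what defines the axis.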

**Theorem 6**

A few words on the assumptions in the following `Simplify` to verify theorem 6. The first three, , , , hold in general. The fourth, , is for the `Solve` that computes to find a solution. The next two, and , are for the inversion of to be feasible, and the last two are for the inversion of to be feasible.

**Theorem 7**

Two nonconcentric, nonintersecting circles can be inverted into two concentric circles [11].

**Proof**

Let and be the given circles. Choose a circle centered at a point on the radical axis with radius equal to the length of the tangent from to . This circle intersects the line in two points , . Any circle with or as center inverts and into concentric circles. □

The next shows the circles in blue and in yellow, their radical axis, a point on the radical axis, the circle now moving freely on the radical axis, the circle of inversion centered at one of the intersections of and , and two concentric circles obtained by inverting and in . Only one of the inversive circles is shown; the other is a mirror image in the radical axis.

The center of the inversive circle does not depend on the position of . This checks that the inverses of the circles and are concentric.
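The whole construction can also be checked numerically. The following Python sketch (with assumed example circles, not the article's interactive code) builds the tangent-length circle on the radical axis, takes one of its intersections with the line of centers as the inversion center, and verifies that the two images share a center.

```python
import math

def invert_circle(c, r, p, k2=1.0):
    """Image of the circle (c, r) under inversion with center p and power k2."""
    dx, dy = c[0] - p[0], c[1] - p[1]
    t = k2 / (dx*dx + dy*dy - r*r)
    return ((p[0] + t*dx, p[1] + t*dy), abs(t)*r)

# Assumed example: two nonintersecting circles.
ca, ra = (0.0, 0.0), 1.0
cb, rb = (4.0, 0.0), 2.0

# Foot of the radical axis on the line AB, and the tangent length from there to a.
d = cb[0] - ca[0]
sx = (d*d + ra*ra - rb*rb) / (2*d)
tangent = math.sqrt(sx*sx - ra*ra)       # radius of the orthogonal circle sigma

# sigma meets the line AB at the two candidate inversion centers; take one.
center = (sx + tangent, 0.0)

(c1, r1) = invert_circle(ca, ra, center)
(c2, r2) = invert_circle(cb, rb, center)
# c1 == c2 (up to rounding): the images are concentric
```

The chosen point is a limit point of the coaxial family determined by the two circles, which is why inversion about it sends both circles to concentric ones; the power of the inversion only rescales the picture.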

Let be the center of the other inversive circle.

**Theorem 8**

A line and a circle that do not intersect or a pair of nonintersecting circles can be inverted into two concentric circles. The key is to obtain a common orthogonal circle and then choose the inversion center at its intersection with a particular line [11].

**Theorem 9**

Let the circle and the nonintersecting line be such that on is the nearest point to , making perpendicular to the line . Then the circle , where , is orthogonal to both and . Either of the two intersections of with the line can serve as the center of a circle of inversion inverting and into concentric circles.

I am grateful to the Center of Investigation and Advanced Studies in Mexico City for the use of its extensive library and to Wolfram Research for providing an ideal environment to develop this series of articles.

[1] J. Rangel-Mondragón, “Inversive Geometry: Part 1,” The Mathematica Journal, 15, 2013. doi:10.3888/tmj.15-7.

[2] J. Rangel-Mondragón, “Inversive Geometry: Part 2,” The Mathematica Journal, 18, 2014. doi:10.3888/tmj.18-5.

[3] F. Morley and F. V. Morley, Inversive Geometry, New York: Ginn and Company, 1933.

[4] B. Ho and S. Nelson, “Matrices and Finite Quandles,” Homology, Homotopy and Applications, 7(1), 2005 pp. 197–208. doi:10.4310/HHA.2005.v7.n1.a11.

[5] H. S. M. Coxeter, “Inversive Geometry,” Educational Studies in Mathematics, 3(3), 1971 pp. 310–321. doi:10.1007/BF00302300.

[6] D. Pedoe, Geometry, A Comprehensive Course, New York: Dover Publications Inc., 1988.

[7] H. S. M. Coxeter and S. L. Greitzer, Geometry Revisited, New York: Random House, 1967.

[8] E. W. Weisstein. “Radical Line” from MathWorld–A Wolfram Web Resource. mathworld.wolfram.com/RadicalLine.html.

[9] S. R. Murthy. “Radical Axis and Radical Center” from the Wolfram Demonstrations Project–A Wolfram Web Resource. demonstrations.wolfram.com/RadicalAxisAndRadicalCenter.

[10] J. Rangel-Mondragón, “The Arbelos,” The Mathematica Journal, 16, 2014. doi:10.3888/tmj.16-5.

[11] D. E. Blair, Inversion Theory and Conformal Mapping, Providence, RI: American Mathematical Society, 2000.

J. Rangel-Mondragón, “Inversive Geometry: Part 3,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-4.

Jaime Rangel-Mondragón received M.Sc. and Ph.D. degrees in Applied Mathematics and Computation from the School of Mathematics and Computer Science at the University College of North Wales in Bangor, UK. He was a visiting scholar at Wolfram Research, Inc. and held positions in the Faculty of Informatics at UCNW, the Center of Literary and Linguistic Studies at the College of Mexico, the Department of Electrical Engineering at the Center of Research and Advanced Studies, the Center of Computational Engineering (of which he was director) at the Monterrey Institute of Technology, the Department of Mechatronics at the Queretaro Institute of Technology and the Autonomous University of Queretaro in Mexico, where he was a member of the Department of Informatics and in charge of the Academic Body of Algorithms, Computation and Networks. His research included combinatorics, the theory of computing, computational geometry and recreational mathematics. Jaime Rangel-Mondragón died in 2015.

Input shaping is an established technique to generate prefilters so that flexible mechanical systems move with minimal residual vibration. Many examples of such systems are found in engineering—for example, space structures, robots, cranes and so on. The problem of vibration control is serious when precise motion is required in the presence of structural flexibility. In a wind turbine blade, untreated flapwise vibrations may reduce the life of the blade and unexpected vibrations can spread to the supporting structure. This article investigates one of the tools available to control vibrations within flexible mechanical systems using the input shaping technique.

Among other choices [1, 2] for reducing vibrations in flexible systems, input shaping control is an open-loop control technique that is implemented by convolving a sequence of impulses with a desired command. The amplitudes and time locations of the impulses are determined from the system’s natural frequency and damping ratio by solving a set of constraint equations. Historically, input shaping dates from the late 1950s. Originally named “Posicast Control,” the initial development of input shaping is largely credited to Smith [3, 4], with one notable precursor due to Calvert and Gimpel [5]. All three works proposed a simple technique to generate a non-oscillatory response from a lightly damped system subjected to a step input, which was motivated by a simple wave cancellation concept for the elimination of the oscillatory motion of the under-damped system. The early forms of command generators suffered from poor robustness properties, as they were sensitive to modeling errors of natural frequencies and damping ratios. Since this initial work, there have been many developments in the area of input shaping control, with one of the pacing elements being the progress in microprocessor technology to implement the concept. More recent robust command generators have proven beneficial for real systems with, for example, Swigert [6] proposing techniques for the determination of torque profiles that considered the sensitivity of the terminal states to variations in the model parameters. Other examples of input shapers have been developed that are robust to natural frequency modeling errors, the first of which was called the Zero Vibration and Derivative (ZVD) shaper [7], which improved robustness to modeling errors by adding additional constraints on the derivative of the residual vibration magnitudes. 
With this robustness present, input shaping has been implemented in a variety of systems, including movement of cranes [8, 9], precise movement of disk drives [10], flexible spacecraft [11, 12], industrial robots [13, 14] and coordinate measuring machines [15]. There have also been developments using hybrid input shaping [16] and three-step input shaping techniques [17].

Many types of solutions are possible for the problem of flexible dynamics—for example, feedback control, command shaping or redesigning the physical geometry [18]. A simple example of this challenging area is that of an overhead traveling crane, as shown in Figure 1, which consists of a point mass for the moveable structure (crab or crane), a point mass for the payload and a non-extensible load-carrying rope (cable) of length .

**Figure 1.** Schematic of overhead gantry crane.

The equations of motion for such a system can be set up either directly from Newtonian mechanics or indirectly using Lagrangian methods. Using either results in the nonlinear system of equations for the motion as

(1) |

When the gantry crane is accelerating or retarding, then the hanging cable starts to vibrate. The code for equation (1) is shown in Figure 2, with the force set as positive, and the movement of the cable in particular is shown when the crane is accelerating. A result for retardation can easily be found by simply assigning a negative value for in the code for .

**Figure 2.** Overhead gantry crane accelerating.
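Equation (1) is not reproduced here, but the behavior it describes can be illustrated with a standard cart–pendulum model of the crane (an assumption about the exact form of the equations, consistent with the description above). The Python sketch below uses explicit Euler integration and hypothetical masses, rope length and force.

```python
import math

# Hypothetical parameters: cart mass M, payload mass m, rope length l (SI units).
M, m, l, g = 10.0, 1.0, 2.0, 9.81

def accelerations(theta, omega, F):
    """Cart acceleration x'' and swing acceleration theta'' for applied force F,
    from the standard cart-pendulum equations of motion."""
    s, c = math.sin(theta), math.cos(theta)
    x_dd = (F + m*s*(g*c + l*omega*omega)) / (M + m*s*s)
    th_dd = -(x_dd*c + g*s) / l
    return x_dd, th_dd

def simulate(F, t_end=5.0, h=1e-3):
    """Explicit Euler integration of the swing angle under a constant force F,
    starting from rest with the cable hanging straight down."""
    theta = omega = x = v = 0.0
    for _ in range(int(t_end / h)):
        x_dd, th_dd = accelerations(theta, omega, F)
        x, v = x + h*v, v + h*x_dd
        theta, omega = theta + h*omega, omega + h*th_dd
    return theta
```

With zero force the cable never starts swinging, while any constant accelerating force makes it swing away from the direction of motion, matching the behavior shown in Figure 2.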

A special case is when the gantry crane is moving at a constant velocity; that is, is set to zero. Then the hanging cable for lifting is not swinging, and the crane and pendulum move as in Figure 3.

**Figure 3.** Overhead gantry crane moving with constant speed.

The upper end of the cable is attached to a trolley that travels along a rail to position the payload. Cranes are usually controlled by a human operator who moves levers or presses buttons to cause the trolley to move. If the operator presses the control button for a finite time period, then the trolley will move a finite distance and come to rest. However, the payload usually oscillates about its support on the trolley due to the trolley motion, as shown in Figure 4 by the uncontrolled oscillation. The crane driver can smooth this situation by suitably pressing the button multiple times. The payload motion for this scenario could be as shown in Figure 4, labeled “Operator controlled.”

**Figure 4.** Payload response for uncontrolled and operator controlled.

Here we start with the simplest commands to move systems without vibration. An impulse applied to a system usually causes it to vibrate, but that can be canceled by a second impulse. This concept is shown in Figure 5, where each input is piecewise constant and the system being considered is purely oscillatory with no damping. As can be seen, the response functions add together to give zero.

**Figure 5.** Simple cancellation of a vibration.

Next, Figure 6 shows the response of a typical forced damped system to a two-impulse command.

**Figure 6.** Typical spring-damped system.

For the preceding system, the equations of motion are

(2) |

where and are coefficients due to drag and spring stiffness, respectively.

Again, each input is piecewise constant, but the equation of motion has an additional damping term dependent on the speed of motion.

It is instructive to derive the amplitudes and time locations of the two-impulse command shown in Figure 7.

**Figure 7.** Two-impulse response with damping.

If a reasonable estimate of the system’s natural frequency and damping ratio is known, then the residual vibration that results from a sequence of impulses can be described [10] using the expression

(3) |

where

(4) |

and are the amplitudes and time locations of the impulses, is the number of impulses in the impulse sequence, and .

Equation (3) is actually the percentage of residual vibration, which is a measure of the amount of vibration a sequence of impulses will cause relative to the vibration caused by a single impulse with unit magnitude. On setting equation (3) equal to zero and avoiding a trivial solution, values for the impulse amplitudes and time locations that would lead to a zero residual vibration can be found. To avoid the zero-valued trivial solution and to obtain a normalized result, the impulses are required to sum to one; that is,

(5) |

However, impulses can still satisfy equation (5) by taking very large numbers, both positive and negative. To alleviate this, a bounded solution is imposed that limits the values of to positive values

(6) |

For a two-impulse sequence, there are four unknowns, , , , . Without loss of generality, we can set the time location of the first impulse equal to zero. For equation (3) to be satisfied, the expressions in equation (4) must both be equal to zero. Therefore, we get

(7) |

The second of these two expressions can be satisfied nontrivially by setting the sine term equal to zero. This occurs when

(8) |

where is the damped period of vibration. This of course means that there are an infinite number of possible values for the location of the second impulse, but to cancel the vibration in the shortest amount of time, the smallest value of is chosen:

(9) |

For this case, the amplitude constraint given in equation (5) reduces to

(10) |

Using the expression for the damped natural frequency and substituting equations (9) and (10) into the first expression of equation (7) gives

(11) |

The sequence of two impulses that leads to zero residual vibration can be summarized as

(12) |

where .
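Equation (12) is straightforward to implement and to check against the residual-vibration expression of equation (3). Here is a Python sketch (the article's own code is in Mathematica):

```python
import math

def zv_shaper(wn, zeta):
    """Two-impulse ZV shaper for natural frequency wn and damping ratio zeta."""
    wd = wn * math.sqrt(1 - zeta**2)                 # damped frequency
    K = math.exp(-zeta * math.pi / math.sqrt(1 - zeta**2))
    amps = [1/(1 + K), K/(1 + K)]                    # amplitudes sum to one
    times = [0.0, math.pi / wd]                      # half the damped period
    return amps, times

def residual_vibration(amps, times, wn, zeta):
    """Residual vibration of equation (3) for an impulse sequence."""
    wd = wn * math.sqrt(1 - zeta**2)
    C = sum(a * math.exp(zeta*wn*t) * math.cos(wd*t) for a, t in zip(amps, times))
    S = sum(a * math.exp(zeta*wn*t) * math.sin(wd*t) for a, t in zip(amps, times))
    return math.exp(-zeta*wn*times[-1]) * math.hypot(C, S)
```

When the model is exact, the two impulses cancel completely: the residual vibration evaluates to zero (up to floating-point rounding) at the design frequency and damping ratio.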

A zero-vibration (ZV) input shaper, as just described, is useful in situations where the parameters of the system are known with a high degree of accuracy. Even if little faith is held in the input shaping approach, applying it will never increase vibration beyond the level present before shaping [19]. It has been pointed out [20] that previous articles on input shaping have confused the issue of natural frequency, even if the conceptual explanation when using the method is generally acceptable. Kang [20] differentiates between (a variable), which is the actual value of the undamped natural frequency of the system, and (a constant), which is the “modeled” value of the undamped natural frequency . Kang [20] also proves that vibration approaches zero as . The article shows clearly that for a vibratory system, a modeling frequency is chosen such that at the modeling frequency .

The following code generates the sensitivity curve (Figure 8) for a ZV shaper by plotting the amplitude of residual vibration as a function of the system parameters. In this case, the modeling frequency was set as rad/s and the damping ratio as 0.0.

**Figure 8.** Sensitivity curve for ZV input shaper.

The amplitudes and time locations of the impulses depend on the system parameters and . If there are errors in these values (and there always are [18]), then the impulse sequence will not result in zero vibration. A Zero Vibration and Derivative (ZVD) shaper is a command generation scheme designed to make the input shaping process more robust to these modeling errors. To increase robustness to modeling error, the ZVD input shaper adds two constraints [20], the derivatives

(13) |

The sequence for the ZVD shaper can be summarized as

(14) |

where .
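The robustness gain of the ZVD sequence over the ZV sequence can be quantified with the residual-vibration expression of equation (3). The Python sketch below (with assumed example parameters) evaluates both shapers at a 10% frequency error.

```python
import math

def shaper(wn, zeta, kind="ZVD"):
    """ZV and ZVD impulse sequences (equations (12) and (14))."""
    wd = wn * math.sqrt(1 - zeta**2)
    K = math.exp(-zeta * math.pi / math.sqrt(1 - zeta**2))
    half = math.pi / wd
    if kind == "ZV":
        amps, times = [1, K], [0.0, half]
    else:  # ZVD: binomial amplitudes, three impulses over one damped period
        amps, times = [1, 2*K, K*K], [0.0, half, 2*half]
    s = sum(amps)
    return [a/s for a in amps], times            # normalize to sum to one

def residual(amps, times, wn, zeta):
    """Residual vibration of equation (3) at the actual system parameters."""
    wd = wn * math.sqrt(1 - zeta**2)
    C = sum(a*math.exp(zeta*wn*t)*math.cos(wd*t) for a, t in zip(amps, times))
    S = sum(a*math.exp(zeta*wn*t)*math.sin(wd*t) for a, t in zip(amps, times))
    return math.exp(-zeta*wn*times[-1]) * math.hypot(C, S)

wn, zeta = 10.0, 0.05     # assumed modeling frequency and damping ratio
w_err = 1.1 * wn          # the actual frequency is 10% higher than modeled
zv = residual(*shaper(wn, zeta, "ZV"), w_err, zeta)
zvd = residual(*shaper(wn, zeta, "ZVD"), w_err, zeta)
# zvd << zv: the extra impulse flattens the sensitivity curve
```

Both shapers give zero residual vibration at the design point; away from it, the ZVD residual is a small fraction of the ZV residual, illustrating the flattened notch in the sensitivity curve.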

An alternative to the ZV shaper is the ZVD shaper, which is much more robust than the ZV shaper, as shown in Figure 9. However, the ZVD shaper has a time duration equal to one period of the vibration frequency, as opposed to the one-half period length of the ZV shaper. This tradeoff is typical of the input shaper design process; that is, increasing insensitivity usually requires increasing the length of the input shaper. An input shaper with even more insensitivity than the ZVD can be obtained by setting the second derivative of equation (3) with respect to equal to zero. This is called the ZVDD shaper. The algorithm can be extended indefinitely with repeated differentiation of equation (3). Closed-form solutions of the ZV, ZVD and ZVDD shapers for damped systems exist [7]. An alternative procedure for increasing insensitivity using extra-insensitive constraints has been derived [21]. Instead of forcing the residual vibration to zero at the modeling frequency, the residual vibration is limited to a low level of . The width of the notch in the sensitivity curve is then maximized by forcing the vibration to zero at two frequencies, one lower than the modeling frequency and the other higher. Figure 9 indicates that there are two inner maxima, say at frequencies and , where the vibration must equal as defined in equation (15) and the derivative must equal zero. These two constraints translate to

(15) |

and

(16) |

where and is the difference between and . Note that represents the frequency shift from the modeling frequency to the frequency that corresponds to the first hump in the sensitivity curve; depends on and does not appear in the final formula for the shaper. Other conditions are that the impulse amplitudes must sum to one, and following the hypothesis that the shaper contains four evenly spaced impulses with a duration of one and a half periods to form the sensitivity curve [21], then

(17) |

Using these conditions, it can be shown that

(18) |

Expanding equations (15) and (16), combining terms and using equation (18) gives

(19) |

and

(20) |

Equation (19) can be solved for :

(21) |

Substituting equation (21) into equation (20) yields

(22) |

where

The two-hump shaper for an undamped system can now be summarized as

(23) |

The following code generates a two-hump shaper based on the above analysis and compares it to the ZVD shaper. When , the insensitivity to modeling errors (i.e. the width of the sensitivity curve) is increased by over 100%. Again, the modeling frequency is set at rad/s.

**Figure 9.** Sensitivity curves.

Robustness is not restricted to errors in the frequency. Figure 10 shows a three-dimensional sensitivity curve for a shaper that was designed to suppress vibration over the range of damping ratios between 0 and 0.1.

**Figure 10.** Three-dimensional curve including variation with damping ratio.

A damped oscillatory dynamic system model has the transfer function

(24) |

where again and are the natural frequency and damping ratio, respectively. Figure 11 gives various responses, depending on the damping factor.

**Figure 11.** Responses to input for different damping factors.

The equation for the responses shown in Figure 11 is

(25) |

where and are the amplitude and time of the impulse, respectively. Further, the response to a sequence of impulses can be obtained using the superposition principle. Thus for impulses, the impulse response can be expressed as , where

(26) |

(27) |

where and are again the magnitude and times at which the impulses occur.
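The superposition of equations (26) and (27) can be demonstrated with a Python sketch (the helper name `impulse_response` is illustrative): summing the decaying sinusoid contributed by each impulse shows that a ZV sequence tuned to the exact model leaves essentially no residual vibration after the last impulse, whereas a single impulse does not.

```python
import math

def impulse_response(t, times, amps, wn, zeta):
    """Superposed response of a second-order system to an impulse sequence:
    each impulse of amplitude A_i at time t_i contributes a decaying
    sinusoid for t >= t_i."""
    wd = wn * math.sqrt(1.0 - zeta**2)
    y = 0.0
    for ti, Ai in zip(times, amps):
        if t >= ti:
            y += (Ai * (wn / math.sqrt(1.0 - zeta**2))
                  * math.exp(-zeta * wn * (t - ti)) * math.sin(wd * (t - ti)))
    return y

wn, zeta = 2.0, 0.05
K = math.exp(-zeta * math.pi / math.sqrt(1.0 - zeta**2))
Td = 2.0 * math.pi / (wn * math.sqrt(1.0 - zeta**2))
zv_times, zv_amps = [0.0, 0.5 * Td], [1.0 / (1 + K), K / (1 + K)]

# Residual vibration after the last impulse: essentially zero for the ZV
# sequence at the exact modeling frequency, nonzero for a single impulse.
ts = [zv_times[-1] + 0.01 * i for i in range(1, 400)]
resid_shaped = max(abs(impulse_response(t, zv_times, zv_amps, wn, zeta)) for t in ts)
resid_unshaped = max(abs(impulse_response(t, [0.0], [1.0], wn, zeta)) for t in ts)
```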

To demonstrate the effect on the response when the model is not perfect, the following code uses a robust four-impulse ZVDD shaper. Figure 12 shows the response when no shaping is imposed, when the model is perfect, and when there is a 20% error in the frequency estimate. With shaping applied, the initial peak response is cut to 57% of its value when no input shaper is applied.

**Figure 12.** Responses when model is not perfect.

Wind turbine blade vibration is a serious problem because it reduces the life of the blade, and the vibrations can also be transferred to the supporting tower, causing the complete structure to vibrate. One source of increased vibration amplitudes is a change of the pitch angle input. A ZV input shaper is used here to demonstrate a decrease in amplitude when the pitch angle changes from large to small. Although many ways to suppress wind turbine blade vibration have been developed, little work has been done on the effect on the vibrations of changing the pitch angle rapidly. A rapidly changing pitch angle input can be considered a step input, causing additional vibration (larger amplitudes) in the blade. In this example, the effect of using an input shaper to reduce the blade angle deflection is investigated. We consider the wind turbine blade as a cantilever beam with the hub end clamped and the other end free to move. The effect of the rotation is taken into account by the inclusion of centrifugal stiffening, and the modal shapes were calculated using the Adomian modified decomposition method [22]. To incorporate the effect of changing the pitch angle, the well-known blade element theory [23] was used to form a generalized normal force consisting of components of lift and drag forces as functions of pitch angle. The expressions for kinetic energy, potential energy and aerodynamic forces were then used to form a Lagrangian of the blade that governs the motion of blade flapwise deflection.

A ZV input shaper is used in a scheme summarized in Figure 13, which is a block diagram of an input shaping control scheme dealing with unexpected wind disturbances.

**Figure 13.** Schematic of input shaper controller.

The input shaping control is a feed-forward control method, and only the shaped input is used to control the system. The idea is to see how the blade flapwise deflection reacts to a pitch angle change. The pitch angle is initially set at a 4° angle of attack. Figure 14 shows the flapwise deflection, once it has reached its steady state, when the pitch angle is 4°, and Figure 15 shows the deflection at 14°. It can be seen that the deflection of the blade is worse at the smaller angle; this is due to the wind turbine blade being a pitch-to-feather type.

**Figure 14.** Flapwise deflection (pitch angle 4°).

**Figure 15.** Flapwise deflection (pitch angle 14°).

To see how the pitch angle affects the flapwise deflection, the pitch angle is changed from 14° to 4° at 30 seconds. Figure 16 shows that some residual vibration is caused (solid blue curve), since the deflection after 30 seconds differs from that when the pitch angle was held at 14°; this is because the model initially contains no damping. Next, the input shaper is added, and clearly, from the dashed orange curve in Figure 16, the residual vibration is reduced.

**Figure 16.** Pitch angle change effect.

Some of the tools available for input shaping have been investigated here, where the input to a given system has been shaped so as to minimize the residual vibration. Important to future use of these techniques is that they have been shown to be robust and able to tolerate errors within the system parameters; that is, although a residual vibration may not become zero due to the shaper, there is generally a large reduction in vibration.

[1] | J.-H. Park and S. Rhim, “Experiments of Optimal Delay Extraction Algorithm Using Adaptive Time-Delay Filter for Improved Vibration Suppression,” Journal of Mechanical Science and Technology, 23(4), 2009 pp. 997–1000. doi:10.1007/s12206-009-0328-1. |

[2] | Q. H. Ngo, K.-S. Hong and I. H. Jung, “Adaptive Control of an Axially Moving System,” Journal of Mechanical Science and Technology, 23(11), 2009 pp. 3071–3078. doi:10.1007/s12206-009-0912-4. |

[3] | O. J. M. Smith, Feedback Control Systems, New York: McGraw-Hill Book Company, 1958. |

[4] | O. J. M. Smith, “Posicast Control of Damped Oscillatory Systems,” Proceedings of the IRE, 45(9), 1957 pp. 1249–1255. doi:10.1109/JRPROC.1957.278530. |

[5] | D. J. Gimpel and J. F. Calvert, “Signal Component Control,” Transactions of the American Institute of Electrical Engineers, 71(5), 1952 pp. 339–343. doi:10.1109/TAI.1952.6371288. |

[6] | C. J. Swigert, “Shaped Torque Techniques,” Journal of Guidance, Control, and Dynamics, 3(5), 1980 pp. 460–467. doi:10.2514/3.56021. |

[7] | N. C. Singer and W. P. Seering, “Preshaping Command Inputs to Reduce System Vibration,” Journal of Dynamic Systems, Measurement, and Control, 112(1), 1990 pp. 76–82. doi:10.1115/1.2894142. |

[8] | K. L. Sorensen, W. E. Singhose and S. Dickerson, “A Controller Enabling Precise Positioning and Sway Reduction in Bridge and Gantry Cranes,” Control Engineering Practice, 15(7), 2007 pp. 825–837. doi:10.1016/j.conengprac.2006.03.005. |

[9] | M. A. Ahmad, R. M. T. R. Ismail, M. S. Ramli, R. E. Samin and M. A. Zawawi, “Robust Input Shaping for Anti-Sway Control of Rotary Crane,” Proceedings of TENCON 2009—IEEE Region 10 Conference, Singapore, Jan. 23–26, 2009 pp. 1039–1043. doi:10.1109/TENCON.2009.5395891. |

[10] | W. E. Singhose, W. Seering and N. C. Singer, “Time-Optimal Negative Input Shapers,” Journal of Dynamic Systems, Measurement, and Control, 119(2), 1997 pp. 198–205. doi:10.1115/1.2801233. |

[11] | D. Gorinevsky and G. Vukovich, “Nonlinear Input Shaping Control of Flexible Spacecraft Reorientation Maneuver,” Journal of Guidance, Control, and Dynamics, 21(2), 1998 pp. 264–270. doi:10.2514/2.4252. |

[12] | L. Y. Pao and W. E. Singhose, “Verifying Robust Time-Optimal Commands for Multimode Flexible Spacecraft,” Journal of Guidance, Control, and Dynamics, 20(4), 1997 pp. 831–833. doi:10.2514/2.4123. |

[13] | J. Park, P. H. Chang, H. S. Park and E. Lee, “Design of Learning Input Shaping Technique for Residual Vibration Suppression in an Industrial Robot,” IEEE/ASME Transactions on Mechatronics, 11(1), 2006 pp. 55–65. doi:10.1109/TMECH.2005.863365. |

[14] | C.-G. Kang, K. S. Woo, J. W. Kim, D. J. Lee, K. H. Park and H. C. Kim, “Suppression of Residual Vibrations with Input Shaping for a Two-Mode Mechanical System,” Proceedings of International Conference on Service and Interactive Robotics, Taipei, Taiwan, 2009 pp. 1–6. |

[15] | S. D. Jones and A. G. Ulsoy, “An Approach to Control Input Shaping with Application to Coordinate Measuring Machines,” Journal of Dynamic Systems, Measurement, and Control, 121(2), 1999 pp. 242–247. doi:10.1115/1.2802461. |

[16] | S. Kapucu, G. Alici and S. Bayseç, “Residual Swing/Vibration Reduction Using a Hybrid Input Shaping Method,” Mechanism and Machine Theory, 36(3), 2001 pp. 311–326. doi:10.1016/S0094-114X(00)00048-3. |

[17] | S. S. Güreyük and S. Cinal, “Robust Three-Impulse Sequence Input Shaper Design,” Journal of Vibration and Control, 13(12), 2007 pp. 1807–1818. doi:10.1177/1077546307080012. |

[18] | T. Singh and W. Singhose, “Tutorial on Input Shaping/Time Delay Control of Maneuvering Flexible Structures,” Proceedings of the 2002 American Control Conference, Vol. 3, Anchorage, AK, May 8–10, 2002 pp. 1717–1731. doi:10.1109/ACC.2002.1023813. |

[19] | I. Arolovich and G. Agranovich, “Control Improvement of Under-Damped Systems and Structures by Input Shaping,” Proceedings of the 8th International Conference on Material Technologies and Modeling (MMT-2014), Ariel, Israel, Jul. 28–Aug. 1, 2014 pp. 3.1–3.10. (May 23, 2017) www.semanticscholar.org/paper/Control-Improvement-of-Under-damped-Systems-and-St-Arolovich-Agranovich/5cd5f119710edc81be912aea09a66c64e92d48a2. |

[20] | C.-G. Kang, “On the Derivative Constraints of Input Shaping Control,” Journal of Mechanical Science and Technology, 25(2), 2011 pp. 549–554. doi:10.1007/s12206-010-1205-7. |

[21] | T. Singh and S. R. Vadali, “Robust Time-Optimal Control: Frequency Domain Approach,” Journal of Guidance, Control, and Dynamics, 17(2), 1994 pp. 346–353. doi:10.2514/3.21204. |

[22] | D. Adair and M. Jaeger, “Simulation of Tapered Rotating Beams with Centrifugal Stiffening Using the Adomian Decomposition Method,” Applied Mathematical Modelling, 40(4), 2016 pp. 3230–3241. doi:10.1016/j.apm.2015.09.097. |

[23] | D. Adair and M. Alimaganbetov, “Propeller Wing Aerodynamic Interference for Small UAVs during VSTOL,” 56th Israel Annual Conference on Aerospace Sciences, Tel Aviv/Haifa, 9–10 Mar., 2016. (May 23, 2017) www.researchgate.net/publication/285356494_Propeller_Wing_Aerodynamic_Interference_for_Small_UAVs_during_VSTOL. |

D. Adair and M. Jaeger, “Aspects of Input Shaping Control of Flexible Mechanical Systems,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-3. |

Desmond Adair is a professor of Mechanical Engineering in the School of Engineering, Nazarbayev University, Astana, Republic of Kazakhstan. His recent research interests include developing analytical methods for solving vibration problems, and computational fluid dynamics.

Martin Jaeger is an associate professor of Civil Engineering and manager of the Project Based Learning Centre in the School of Engineering, Australian College of Kuwait, Mishref, Kuwait. His recent research interests include construction management and total quality management, as well as developing strategies for engineering education.

**Desmond Adair**

*School of Engineering
Nazarbayev University
53 Kabanbay Batyr Ave.
Astana, 010000, Republic of Kazakhstan*

**Martin Jaeger**

*School of Engineering and ICT*

*University of Tasmania
Churchill Ave.
Hobart, TAS 7001, Australia*

This didactic synthesis compares three solution methods for polynomial approximation and systematically presents their common characteristics and their close interrelations:

1. Classical Gram–Schmidt orthonormalization and Fourier approximation in

2. Linear least-squares solution via QR factorization on an equally spaced grid in

3. Linear least-squares solution via the normal equations method in and on an equally spaced grid in

The first two methods are linear least-squares systems with Vandermonde matrices ; the normal equations contain matrices of Hilbert type . The solutions on equally spaced grids in converge to the solutions in . All solution characteristics and their relations are illustrated by symbolic or numeric examples and graphs.

Let . Consider the Hilbert space of real-valued square integrable functions (or , for short), equipped with Lebesgue measure and scalar product

(1) |

and the corresponding -norm

Scalar products can be approximated by scalar products on discrete grids in , based on Riemann sums, and similarly for norms.

Partition the finite interval into subintervals by the points

and set

Suppose that is a bounded function on . Let be any point in the subinterval and define the grid

The *Riemann sums* on the partition and grid are defined by

(2) |

(3) |

(4) |

(5) |

Equation (3) is called the left-hand Riemann sum, (4) the right-hand Riemann sum and (5) the (composite) midpoint rule.
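The three Riemann sums can be sketched in plain Python (the helper name `riemann_sums` is illustrative); for the smooth test integrand x² on [0, 1], with exact integral 1/3, the midpoint rule is markedly more accurate than the one-sided sums:

```python
def riemann_sums(f, a, b, n):
    """Left-hand, right-hand and composite-midpoint Riemann sums on an
    equally spaced partition of [a, b] with n subintervals."""
    h = (b - a) / n
    left = h * sum(f(a + i * h) for i in range(n))
    right = h * sum(f(a + (i + 1) * h) for i in range(n))
    mid = h * sum(f(a + (i + 0.5) * h) for i in range(n))
    return left, right, mid

# f(x) = x^2 on [0, 1]; the exact integral is 1/3.
left, right, mid = riemann_sums(lambda x: x * x, 0.0, 1.0, 100)
```

With n = 100 the one-sided sums have error of order h, the midpoint rule of order h².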

For an equally spaced partition, the step size is

The equally spaced partition points are

and the equally spaced grid of length (excluding the endpoint ) is

It is also possible to use the grid points or grid shifted by an amount , where so that , as

Let

(6) |

For equally spaced grids, the Riemann sums simplify to

(7) |

Setting , and gives the left-hand Riemann sum, the composite midpoint rule and the right-hand Riemann sum, respectively. The error of the Riemann sums is defined as

The set of continuous real-valued functions forms a dense subspace of ([1], Theorem (13.21)). For , the restrictions and to this grid are well-defined. Define the -dimensional scalar product on this grid:

(8) |

The -dimensional 2-norm is

The factor ensures that the norms of constant functions agree:

Denote the linear space of polynomials with real coefficients of degree at most by and define the polynomial by

The polynomial can be written as a scalar product (or dot product) of two -tuples, the monomials up to degree and the coefficients:

Introducing the *Vandermonde matrix*

(9) |

every polynomial of degree can be written as the product of a matrix and a vector as

The product of a matrix and an -vector is regarded as a 1-vector, not a scalar, as in Mathematica.

Restricting the Vandermonde matrix to the interval gives an operator mapping into :

(10) |

Whereas is an unbounded operator, is a bounded operator with respect to the 2-norms on and .

The polynomial approximates in the -norm, as measured by

(11) |

In matrix-vector notation, this constitutes a linear least-squares problem for the coefficients :

(12) |

where

Now take a discrete grid

and sample on this discrete grid:

The polynomial of degree approximates in the 2-norm on this grid as measured by

(13) |

In matrix-vector notation, this constitutes a linear least-squares problem for the coefficients :

(14) |

where

(15) |

A rectangular or square matrix of this form is called a *Vandermonde matrix*.
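The matrix of equation (15) can be illustrated with NumPy, whose `vander` routine builds exactly this matrix; evaluating a polynomial at all grid points is then a single matrix-vector product (a sketch; the sample degree and grid are arbitrary):

```python
import numpy as np

# Vandermonde matrix on a grid of m points, with columns 1, x, x^2, ..., x^n;
# `increasing=True` orders the columns by ascending degree.
m, n = 8, 3
x = np.linspace(0.0, 1.0, m)
V = np.vander(x, n + 1, increasing=True)

# A polynomial p(x) = a0 + a1 x + a2 x^2 + a3 x^3 is the product V @ a,
# evaluated simultaneously at all grid points.
a = np.array([1.0, -2.0, 0.5, 3.0])
p = V @ a

# Pairwise-distinct points give full column rank (Vandermonde determinant).
rank = np.linalg.matrix_rank(V)
```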

Let , be Hilbert spaces with scalar products , and let be a bounded linear operator defined everywhere on . Then the *adjoint operator* of is defined by the identity ([2], p. 196, Definition 1)

(16) |

For Hilbert spaces , over the reals, one writes instead of .

All Riemann sums integrate constant functions exactly on any grid, since

If is bounded on and continuous except for finitely many points, it has a Riemann integral. Consult [3], Chapter 2, for proofs and further references on quadrature formulas.

**Theorem 1**

If exists and is bounded and integrable over , then the errors of the right-hand Riemann sums satisfy

(17) |

A similar result holds for the left-hand Riemann sums (with ).

If , the error term of the elementary midpoint formula is given in [3], (2.6.5):

(18) |

Therefore the error of the composite midpoint formula can be bounded by

(19) |

By Theorem 1, for functions and , the discrete scalar product converges at least as fast as to the scalar product:

By equation (19), for functions , the discrete scalar product converges at least as fast as to the scalar product:

See [4], sections 2.4 and 2.6.

**Theorem 2**

(20) |

Define to be the index of the last positive singular value

Then

(21) |

The *condition number* of a rectangular matrix with full column rank with respect to the 2-norm (in short, the 2-condition number) is defined as the quotient of its largest to its smallest singular value ([4], equation (2.6.5)):

(22) |

By equation (20), the 2-condition number has the properties

(23) |

(24) |

If is an invertible matrix, . The SVD of is obtained from the SVD of as

(25) |

(26) |

If is a real matrix with orthonormal columns , it can be completed to an orthogonal matrix . Therefore the SVD of is

(27) |

The 2-condition number of is

(28) |

See [2], sections VII.1 and VII.2.
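Definition (22) and property (28) can be checked numerically; a NumPy sketch, using the Q factor of a QR factorization as an example of a matrix with orthonormal columns:

```python
import numpy as np

A = np.vander(np.linspace(0.0, 1.0, 10), 4, increasing=True)

# 2-condition number: ratio of the largest to the smallest singular value
# (equation (22)).
s = np.linalg.svd(A, compute_uv=False)
cond2 = s[0] / s[-1]

# A matrix with orthonormal columns has all singular values equal to 1,
# hence 2-condition number 1 (equation (28)).
Q, _ = np.linalg.qr(A)
sq = np.linalg.svd(Q, compute_uv=False)
```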

**Theorem 3**

If is a bounded linear operator, defined everywhere on , then is a bounded linear operator defined everywhere on and

.

For and a subset , linear combinations from the subset define a finite-dimensional linear operator

(29) |

Obviously, and if and only if is linearly independent.

is bounded and has the matrix representation

Apply the definition of the adjoint operator and notice that the first scalar product is that of , ; then

Therefore the adjoint operator is

(30) |

Substituting into the preceding equation gives the representation of as an matrix (note the scalar products are taken in ):

(31) |

Here is Hermitian positive semidefinite if is over the complex numbers, and symmetric positive semidefinite if is over the reals, and

A polynomial of degree has at most distinct zeros, therefore the set of monomials is a linearly independent subset of and

(32) |

By [5], the determinant of the Vandermonde matrix is the product

Therefore, the rectangular Vandermonde matrix (15) has full rank if and only if the points are pairwise distinct.

Names are chosen according to previous notation and terminology.

This defines the 2-condition number.

This defines symbolic integration with time constrained to 10 seconds.

This defines numerical integration.

The function (which is ) first attempts symbolic integration; if that is unsuccessful or takes too long, it performs numerical integration.

This defines the scalar product in .

Since is listable in and , it also implements the adjoint operator for a set of functions according to equation (32).

This defines the norm in .

This defines functions for discrete grids.

If , or is a machine number, the functions , , and return machine numbers.

This defines the functions , , and .

To avoid potential clashes with predefined values for the variables , and , the script letters , and are used for symbolic results.

These sample functions are used in the commands.

Let . Given a rectangular data matrix and an observation vector ,

(33) |

the linear least-squares (LS) problem is to find:

(34) |

A comprehensive description of the QR factorization of an matrix via Householder, Givens, fast Givens, classical Gram–Schmidt and modified Gram–Schmidt methods is given in [4], section 5.2. Here only the essential steps are presented.

Let be an orthogonal matrix. Such a matrix preserves lengths of vectors:

Given the real matrix , , the goal is to construct an orthogonal matrix such that

where is an upper-triangular matrix of the form

Obviously

The Mathematica function `QRDecomposition` deviates from the full QR factorization as follows:

is output as an matrix. The rows of are orthonormal. Only the upper-triangular submatrix is output.

Then the unique solution of the upper-triangular system is straightforward:

It gives the unique solution of the full-rank linear least-squares problem (34):

Since multiplication with an orthogonal matrix does not change the singular values, the condition numbers do not change either:

(35) |

This holds in particular for the Vandermonde matrices, both in the discrete and continuous case.
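The QR solution of the full-rank least-squares problem can be sketched with NumPy (note that, unlike Mathematica's `QRDecomposition`, `numpy.linalg.qr` returns a Q factor with orthonormal columns rather than orthonormal rows; the sample data are arbitrary):

```python
import numpy as np

# Least-squares fit of a cubic to sampled data via QR factorization.
x = np.linspace(0.0, 1.0, 50)
A = np.vander(x, 4, increasing=True)     # Vandermonde system matrix
b = np.cos(3.0 * x)

Q, R = np.linalg.qr(A)                   # reduced QR: A = Q @ R
c = np.linalg.solve(R, Q.T @ b)          # solve the triangular system R c = Q^T b

# Orthogonal transformations preserve singular values (equation (35)),
# so R inherits the 2-condition number of A.
cond_A = np.linalg.cond(A, 2)
cond_R = np.linalg.cond(R, 2)
```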

This defines and .

Applying the classical Gram–Schmidt orthonormalization process in a pre-Hilbert space, described in [2], p. 88 ff., to the monomials in gives an orthonormal system of polynomials

(36) |

that satisfy

(37) |

The Fourier coefficients of any function with respect to this orthonormal system are defined according to [2], p. 86, equation (1) (the dot here denotes the placeholder for the integration argument in the scalar product):

The best approximation to is given as the Fourier sum of terms:

The orthonormal system of polynomials is given by the function . The functions and are also defined.

This defines the polynomials .

These polynomials differ from the classical Legendre polynomials `LegendreP` built into Mathematica only by normalization factors.
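The Gram–Schmidt construction can be carried out in exact rational arithmetic. The following Python sketch (helper names are illustrative) orthogonalizes the monomials with respect to the scalar product, here assumed on the interval [-1, 1], and skips the final normalization (which would introduce square roots); the output is therefore the monic orthogonal polynomials, which are scalar multiples of the Legendre polynomials:

```python
from fractions import Fraction

def mono_ip(i, j):
    """Exact scalar product <x^i, x^j> on [-1, 1]."""
    k = i + j
    return Fraction(0) if k % 2 else Fraction(2, k + 1)

def poly_ip(p, q):
    """Scalar product of polynomials given as ascending coefficient lists."""
    return sum(ci * dj * mono_ip(i, j)
               for i, ci in enumerate(p) for j, dj in enumerate(q))

def gram_schmidt(nmax):
    """Monic orthogonal polynomials from the monomials 1, x, ..., x^nmax."""
    basis = []
    for k in range(nmax + 1):
        p = [Fraction(0)] * k + [Fraction(1)]      # the monomial x^k
        for q in basis:
            c = poly_ip(p, q) / poly_ip(q, q)
            qpad = q + [Fraction(0)] * (len(p) - len(q))
            p = [pi - c * qi for pi, qi in zip(p, qpad)]
        basis.append(p)
    return basis

polys = gram_schmidt(3)
```

For instance, the degree-2 and degree-3 results are x² - 1/3 and x³ - (3/5)x, the monic Legendre polynomials.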

This shows the Fourier approximation of a sample set of functions.

**Proposition 1**

Define as the matrix of the monomial coefficients of the polynomials of equation (36). Then is an invertible lower-triangular matrix and

(38) |

(39) |

**Proof**

The Gram–Schmidt orthonormalization is an invertible linear mapping of from the basis of monomials to the orthonormal basis , representable by a real matrix . By (37), the matrix is lower triangular. Transposing (38) gives (39).

**Theorem 4**

The classical Gram–Schmidt orthonormalization applied to the monomials is equivalent to the QR factorization of the Vandermonde operator ;

**Proof**

For this QR factorization, the inverse of the upper-triangular matrix is already calculated by the Gram–Schmidt process:

Equations (26), (28) and (35) give the relations on the 2-condition numbers of the operators or matrices:

**Corollary 1**

The numerical instability of the classical Gram–Schmidt process in machine arithmetic is discussed in [4], section 5.2.8. However, since the Gram–Schmidt orthonormalization of the monomials with respect to the scalar product can be performed by symbolic calculations, numerical algorithm stability is not an issue here, contrary to the -dimensional scalar product .

This defines the Gram–Schmidt coefficient matrix .

This defines the lower- and upper-triangular matrices of QR decomposition.

This defines the orthogonal matrix .

Here is a set of orthonormal polynomials (, ).

This gives the Gram–Schmidt coefficient matrix , with .

Multiply the matrix of monomials from the right or left.

This reproduces orthogonal matrix .

By construction, the polynomials contained in the matrix columns are orthonormal with respect to the scalar product in .

This verifies the QR decomposition , as in Theorem 4.

Select one of the sample functions and compare the results from the Gram–Schmidt orthonormalization and QR factorization interpretation.

For the continuous case, because , the singular values of equal the singular values of .

Here is the case of a discrete grid.

Obviously, is the lower-triangular Cholesky factor of the .

Consequently, the 2-condition number of the Vandermonde matrix of size on is the square root of the 2-condition number of .

This gives summary results.

- The system matrix is a Vandermonde matrix of dimension in the continuous setting and in the discrete setting.
- has full rank in the continuous and discrete settings.
- The discrete case is solved by the QR factorization of into an orthogonal matrix and an upper-triangular matrix .
- In the continuous case, the Gram–Schmidt orthonormalization applied to the monomials in reads in matrix vector notation as .
- This is reformulated as QR factorization of the Vandermonde matrix: .
- Consequently, the 2-condition number of equals the 2-condition number of the upper-triangular matrix :

a. For intervals symmetric around zero, , the columns of are alternating orthogonal. Therefore the 2-condition number is much smaller than for .

The approach for deriving the normal equations for the least-squares problem (34) is described in [4], section 5.3, for example. Define

A necessary condition for a minimum is , or equivalently,

(40) |

These are called the *normal equations*. The minimum is unique if the *Hessian matrix* has full rank . Then there is a unique solution of the linear least-squares problem (34) or (40):

(41) |

For an equally spaced , defined as in equation (5), the Vandermonde matrix has full rank, so the Hessian matrix for polynomial approximation has full rank as well.

If is rank deficient (), then there are infinitely many solutions to the least-squares problem. There is still a unique solution with minimum 2-norm, which can be written with the pseudo-inverse matrix ([4], section 5.5):

For full rank, .

Suppose with (see [4] section 5.3.1). Then this algorithm computes the unique solution to the linear least-squares problem (34):

- Compute the lower-triangular portion of .
- Compute .
- Compute the Cholesky factorization (or via `CholeskyDecomposition`).
- Solve the triangular systems and (or and ).
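The four steps above can be sketched with NumPy (for brevity the two triangular systems are handed to the generic solver; a production code would use dedicated forward and back substitution; the sample data are arbitrary):

```python
import numpy as np

# Normal-equations solution of the least-squares problem min ||A c - b||_2.
x = np.linspace(0.0, 1.0, 30)
A = np.vander(x, 4, increasing=True)
b = np.exp(x)

C = A.T @ A                     # Hessian matrix of the normal equations
d = A.T @ b
G = np.linalg.cholesky(C)       # C = G @ G.T with G lower triangular

y = np.linalg.solve(G, d)       # forward solve  G y = d
c = np.linalg.solve(G.T, y)     # back solve     G^T c = y
```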

The approach for deriving the normal equations for the case applies with one modification to the continuous least-squares approximation equation (12) (see [6]):

The matrix transpose has to be replaced by the adjoint operator :

A necessary condition for a minimum is , or equivalently,

(42) |

These are called the *normal equations*.

(43) |

is called the *Hessian matrix*. The minimum is unique if has full rank . The elements can be calculated via equation (31):

Obviously, the Hessian matrix is symmetric and positive semidefinite. Since has full rank for any nonzero interval , then has full rank (as well by equation (32)) and is therefore positive definite. Then there exists a unique solution of equations (42) and (12).

Finally, calculate the elements on the right-hand side of via equation (30):

This subsection investigates under which conditions and how fast the polynomial approximations on discrete grids converge to the continuous polynomial approximation in .

For an equally spaced , the normal equations, multiplied by the step size , read

(44) |

Define

(45) |

then by equation (7), the matrix elements of are just the Riemann sums for the matrix elements (integrals) of . Therefore, the Hessian matrices on the discrete grid converge to the continuous Hessian matrix in any matrix norm according to:

Define

(46) |

then by equation (7), the elements of are just the Riemann sums for the elements of , the moments of . Therefore, the right-hand side of the normal equations on the discrete grid converge to the right-hand side of the continuous normal equations in any vector norm according to:

(47) |

**Proposition 2**

The polynomial approximations on the discrete grids converge to the continuous polynomial approximation with the same order as the Riemann sums.

**Proof**

From equation (42), the solution of the polynomial approximation in is

From equation (44), the solution for the discrete grid is

(48) |

For the matrix inverses

Expanding the difference of the solution coefficient vectors completes the proof:

on

**Theorem 5**

The Hessian matrix of the normal equations on is related to the lower-triangular Gram–Schmidt coefficient matrix from Proposition 1 and the upper-triangular matrix from Theorem 4 by

Equivalently, is the lower-triangular Cholesky factor and the upper-triangular Cholesky factor of the symmetric positive-definite Hessian matrix of the normal equations method.

**Proof**

From Theorem 4 and because ,

By [4], Theorem 4.2.7, the Cholesky decomposition of a symmetric positive-definite square matrix is unique.
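Theorem 5 can be verified numerically in the discrete setting: after fixing the sign ambiguity of QR, the upper-triangular QR factor of the Vandermonde matrix coincides with the upper Cholesky factor of the Hessian matrix, and the 2-condition number of the Hessian is the square of that of the Vandermonde matrix. A NumPy sketch:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 20)
A = np.vander(x, 4, increasing=True)     # discrete Vandermonde matrix

# Upper-triangular QR factor, normalized to a positive diagonal
# (the QR factorization is unique only up to row signs of R).
Q, R = np.linalg.qr(A)
S = np.diag(np.sign(np.diag(R)))
R_pos = S @ R

# Lower Cholesky factor of the Hessian matrix A^T A = G @ G.T; by the
# uniqueness of the Cholesky decomposition, G.T must equal R_pos.
H = A.T @ A
G = np.linalg.cholesky(H)

# The 2-condition number of H is the square of that of A.
cond_A = np.linalg.cond(A, 2)
cond_H = np.linalg.cond(H, 2)
```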

Equations (24) and (26) give the relation for the 2-condition numbers of the matrices.

**Corollary 2**

This defines the continuous and discrete Hessian matrices.

These are the right‐hand sides of the continuous and discrete normal equations.

This gives the solution of the normal equations system.

This gives the approximation polynomials for the continuous and discrete cases.

This gives the Gram–Schmidt coefficient matrix, its inverse and inverse transpose.

The matrix times its transpose equals the Hessian matrix of the normal equations.

Equivalently, is the lower-triangular and is the upper-triangular Cholesky factor of the Hessian matrix .

For , the Hessian matrix is identical to the Hilbert matrix of dimension :

Hilbert matrices are ill-conditioned already for dimensions (i.e. have condition numbers greater than about 10,000) and soon reach the limits of 64-bit IEEE arithmetic.
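The growth of the condition numbers can be reproduced directly; a NumPy sketch that builds the Hilbert matrices elementwise:

```python
import numpy as np

def hilbert(n):
    """The n x n Hilbert matrix H[i, j] = 1 / (i + j + 1) (0-based indices)."""
    i = np.arange(n)
    return 1.0 / (i[:, None] + i[None, :] + 1.0)

# The 2-condition number grows roughly exponentially with the dimension;
# beyond dimension ~13 the matrices are numerically singular in 64-bit
# IEEE arithmetic.
conds = {n: np.linalg.cond(hilbert(n), 2) for n in range(2, 9)}
```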

Here are the summary results. It takes some time for .

- Continuous case: The Hessian matrix originates from the full-rank Vandermonde matrix of the orthogonalization solutions by taking .
- Discrete case: The Hessian matrix comes from the full-rank Vandermonde matrix of the orthogonalization solutions by taking .
- Consequently,
- gives the exact Hilbert matrix of dimension .
- For other intervals , the Hessian matrix is close to a Hilbert matrix (“Hilbert-type matrix”).
- Hilbert matrices and Hilbert-type matrices are very ill-conditioned already for dimensions greater than four and return extremely inaccurate numerical solutions.
- For intervals symmetric around zero, , has a chessboard pattern of zeros, since the columns of are alternating orthogonal.
- The normal equations are solved by Cholesky decomposition in the continuous and discrete settings.

c. The 2-condition number of is the square of the 2-condition number of the Vandermonde matrix . This is the root cause of the inherent ill-conditioning of the normal equations.

This performs Gram–Schmidt orthonormalization for a special case. For other cases, change the 6 in to an integer between 1 and 10.

Here are the results from the normal equations.

The two solutions agree both symbolically and in exact arithmetic.

But there are numeric differences in IEEE 754 machine arithmetic.

These differences come from the lower error amplification expressed in a lower 2-condition number,

The numerical solution via the Gram–Schmidt orthonormalization solution is usually more accurate than the normal equations solution.

Here is the QR factorization. Again, for other cases, change the 3 in to an integer between 1 and 10.

Normal equations.

The difference between the numerical solutions is due to the difference in how their rounding errors propagate.

Because of the lower error amplification expressed in a lower 2-condition number, the numerical solution via QR factorization is more accurate than the normal equations solution.

This calculates the convergence order for grids of powers of 2.

Choose problem parameters. Yet again, for other cases, change the 7 in to an integer between 1 and 10.

Here are the approximation errors for grids of powers of 2.

Here is the convergence order. To see the result for another function, select another sample function in . Zero errors are suppressed.

For case 3, all but the first two elements of are zero. For case 5, all but the first element of are zero; therefore the logarithmic plots look incomplete.

Sample functions 3, 4, 5, 6 have discontinuities in the zeroth, first or second derivative; 7 and 8 have singularities in the first or second derivative. These sample functions and do not satisfy all the assumptions of Theorem 1. Therefore the convergence order of the right-hand side can be lower than 1 (respectively 2) as predicted by equation (47). Sample functions 1, 2, 9, 10 are infinitely often continuously differentiable; therefore they have maximum convergence order 1 (respectively 2) according to equation (47).

These are the approximation errors for grids of powers of 2.

This takes a few minutes.

This plots the convergence order.

Analyzing polynomial approximation, this article has systematically worked out the close relations between the solutions obtained by:

- Gram–Schmidt orthonormalization and Fourier approximation
- QR factorization
- Normal equations

The interrelations are:

- The Gram–Schmidt orthonormalization applied to the monomials on is reformulated as QR factorization of the Vandermonde matrix with the following one-to-one correspondences.
- For , Gram–Schmidt orthonormalization returns scalar multiples of the classical Legendre polynomials.
- The Hessian matrix in the normal equations originates from the full-rank Vandermonde matrix of the orthogonalization solutions by taking .
- The 2-condition number of is the square of the 2-condition number of the Vandermonde matrix .
- The upper Cholesky factor of is identical to the upper-triangular matrix of the QR factorization of in both the continuous and discrete cases.
- is identical to of the QR factorization in both the continuous and discrete cases.
- The QR and normal equations solutions on an equally spaced grid of points in the interval converge to the solution in with the same order as the error of the quadrature formulas.

b. The upper-triangular matrix is the transpose inverse of the coefficient matrix of the orthonormal polynomials.

c. The numerical condition number of equals the numerical condition number of with respect to the 2-norm.

This article has analyzed the polynomial approximation of a real-valued function with respect to the least-squares norm in both the continuous and discrete settings.

Three different solution methods for this least-squares problem have been compared:

- The Gram–Schmidt orthonormalization (Fourier) solution
- The QR factorization solution
- The normal equations solution, in both the continuous and discrete settings
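For the discrete setting, the normal-equations route can be sketched in a few lines of Python (a hedged illustration; the grid, degree and sample function are invented): the least-squares coefficient vector c solves (VᵀV) c = Vᵀf.

```python
def solve(A, b):
    # Gaussian elimination with partial pivoting on a copy of [A | b]
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            M[r] = [M[r][k] - f * M[col][k] for k in range(n + 1)]
    c = [0.0] * n
    for r in range(n - 1, -1, -1):
        c[r] = (M[r][n] - sum(M[r][k] * c[k] for k in range(r + 1, n))) / M[r][r]
    return c

xs = [-1.0, -0.5, 0.0, 0.5, 1.0]              # sample grid (invented)
fs = [x * x for x in xs]                      # sample function f(x) = x^2
V = [[x ** j for j in range(3)] for x in xs]  # Vandermonde matrix, degree <= 2
A = [[sum(V[i][r] * V[i][c] for i in range(5)) for c in range(3)]
     for r in range(3)]                       # A = V^T V
b = [sum(V[i][r] * fs[i] for i in range(5)) for r in range(3)]  # V^T f
coeffs = solve(A, b)   # expect approximately [0, 0, 1], i.e. f itself
```

Since f is itself a polynomial of degree 2, the least-squares fit reproduces it, which makes the example easy to verify by hand.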

All definitions and solution methods were implemented in Mathematica 11.1.

All solution characteristics and their relations were illustrated by symbolic or numeric examples and graphs.

The author thanks Alexandra Herzog and Jonas Gienger for critical proofreading and the anonymous referees for valuable suggestions that improved the paper. The paper evolved from a lecture series in numerical analysis given by the author for ESOC staff and contractors from 2012 to 2016, using Mathematica:

- Lecture #4: Linear Least-Squares Systems
- Lecture #6: Numerical Quadrature
- Lecture #9: Interpolation and Approximation of Functions, Curve and Surface Fitting


G. Gienger, "Polynomial Approximation," The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-2.

2011–2016 ESA/ESOC research and technology management office

2010–2011 ESOC navigation support office

1987–2009 Mathematical analyst in ESOC Flight Dynamics division; supported more than 20 space missions

1988 Received Dr. rer. nat. from Heidelberg University, Faculty of Science and Mathematics

1984–1987 Teaching assistant at University of Giessen

1982–1984 Research scientist at Heidelberg University

1975–1981 Studied mathematics and physics at Heidelberg University

**Gottlob Gienger**

*European Space Operations Centre ESOC
Research and Technology Management Office OPS-GX
Robert-Bosch-Strasse 5, 64293 Darmstadt, Germany
Retired staff*