This article continues the presentation of a variety of applications around the theme of inversion: quandles, inversion of one circle into another and inverting a pair of circles into congruent or concentric circles.
Recent decades have seen a rebirth of geometry as an important subject both in the curriculum and in mathematical and computational research. Dynamic geometry programs have met the demand for visual, specialized computational tools that help bridge the gap between purely visual and algebraic methods. This development has also extended the understanding of the theoretical and computational foundations of geometry, which in turn has stimulated the proliferation of several new branches of geometry, producing a more mature and modern discipline.
In this spirit, these articles [1, 2] have been written to be useful as additional material in a teaching environment on computational geometry, following the practice of the author in teaching the subject at the beginning university level. This third article includes a section on quandles (algebraic generalizations of inversion) that describes their properties and generates all finite quandles up to order five. Also, we include the construction of a circle inverting one circle into another, followed by a section on the construction of a circle inverting two circles, or a circle and a line, into a pair of concentric circles.
Let mean the inverse of the object in the circle with center and radius , drawn as a red dashed circle.
We repeat the definitions of the functions , , and from the previous article [2].
The function computes the square of the Euclidean distance between two given points. (It is more convenient to use the following definition than the built-in Mathematica function .)
The function tests whether three given points are collinear. When exactly two of them are equal, it gives , and when all three are equal, it gives , because there is no unique line through them.
The function computes the unique circle passing through three given points; if they are collinear, then the function is applied first.
The function computes the circle passing through the points , and . If the points are collinear, it gives the line through them; if all three points are the same, it returns an error message, as there is no meaningful definition of inversion in a circle of zero radius.
The function computes the inverse of in a circle or line . The object can be a point (including the special point that inverts to the center of ), a circle or a line (specified by two points).
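As a rough illustration of what such an inversion routine does, the following Python sketch inverts a point, and a circle that does not pass through the center of inversion, in a circle given by its center and radius. The names `invert_point` and `invert_circle` are hypothetical and are not the article's Mathematica functions; lines and the point at infinity are deliberately left out.

```python
import math


def invert_point(p, center, radius):
    """Invert the point p in the circle with the given center and radius."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    d2 = dx * dx + dy * dy
    if d2 == 0:
        raise ValueError("the center of inversion maps to the point at infinity")
    s = radius * radius / d2
    return (center[0] + s * dx, center[1] + s * dy)


def invert_circle(c, r, center, radius):
    """Invert the circle (c, r), assumed not to pass through the inversion center.

    Returns the center and radius of the image circle.
    """
    dx, dy = c[0] - center[0], c[1] - center[1]
    # Power of the inversion center with respect to the circle being inverted.
    power = dx * dx + dy * dy - r * r
    if math.isclose(power, 0.0, abs_tol=1e-12):
        raise ValueError("a circle through the inversion center inverts to a line")
    s = radius * radius / power
    return (center[0] + s * dx, center[1] + s * dy), abs(s) * r
```

Applying `invert_point` twice with the same circle returns the original point, which is the involution property the quandle axioms later formalize.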
The geometric definition of inversion of circles can be formalized algebraically and thus be generalized. Let denote the result of inverting in . Quandles arise mostly in knot theory and group theory and are characterized by the following axioms [3]:
The first two axioms correspond to well-known properties of inversion.
The following figure illustrates the third axiom. Red arrows go from the center of the circle to be inverted to the center of its inversion.
A set equipped with a binary operation whose elements satisfy these three axioms is called an involutory quandle, or quandle for short. The operation is neither commutative nor associative. Inversive geometry applied to generalized circles is an example of an infinite quandle. There are other sets that also satisfy the axioms; for example, if , the operation is a quandle.
Finite quandles are somewhat curious; for instance, the following is the operation matrix corresponding to a six-element quandle (due to Takasaki).
This verifies that under any modulus, this structure generalizes to a quandle.
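The same check can be sketched in Python. The sketch assumes the usual form of the Takasaki operation, a∘b = 2b − a (mod n), and the usual statement of the involutory quandle axioms (a∘a = a, (a∘b)∘b = a, (a∘b)∘c = (a∘c)∘(b∘c)); the function names and the convention that row a, column b of the table holds a∘b are choices made here for illustration.

```python
from itertools import product


def takasaki_table(n):
    """Operation table with entry [a][b] = (2*b - a) mod n (assumed Takasaki form)."""
    return [[(2 * b - a) % n for b in range(n)] for a in range(n)]


def is_involutory_quandle(m):
    """Check the three involutory quandle axioms on an operation table m."""
    n = len(m)
    idempotent = all(m[a][a] == a for a in range(n))
    involutory = all(m[m[a][b]][b] == a
                     for a, b in product(range(n), repeat=2))
    distributive = all(m[m[a][b]][c] == m[m[a][c]][m[b][c]]
                       for a, b, c in product(range(n), repeat=3))
    return idempotent and involutory and distributive
```

For example, `is_involutory_quandle(takasaki_table(6))` tests the six-element case, while an addition table modulo n fails already at the first axiom.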
A matrix corresponding to a finite quandle has different elements appearing in its main diagonal and also different elements in each of its columns (i.e. for all elements , , there exists a unique such that ). Also, for every triple of indices, we must have
Is there an arrangement of generalized circles that forms a finite quandle under mutual inversion, that is, is closed under inversion? Any two orthogonal circles form a two-element quandle. Also, consider a set of lines equally spaced, passing through the origin. This set is closed under reflection. Taking and labeling the lines from to gives the Takasaki matrix. A circle centered at the origin and lines equally spaced produce a set closed under inversion; if we label the circle as , the matrix in this case has all elements in the last row equal to . Let us now generate all finite quandles of size .
Computing the number of finite quandles by this method is computationally expensive, as the variable is of length . With , using the previous code, it took about an hour and reported 404 instances (time measured on a Mac Pro 3.1 GHz, 16 GB).
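To give a feel for why the search is expensive, here is a brute-force Python sketch; it is not the article's code. It assumes the matrix conditions described above: the diagonal entry in column b is b, every column is a permutation, and the distributive identity holds for every triple. The name `count_quandles` is hypothetical. The search space has ((n−1)!)^n candidate tables, each checked against n³ triples, so the cost grows very quickly with n.

```python
from itertools import permutations, product


def count_quandles(n):
    """Count labeled operation tables of size n satisfying the matrix conditions."""
    elems = range(n)
    # Column b must be a permutation of 0..n-1 that fixes b (idempotency).
    columns = [[p for p in permutations(elems) if p[b] == b] for b in elems]
    count = 0
    for cols in product(*columns):
        # cols[b][a] holds a∘b; test (a∘b)∘c == (a∘c)∘(b∘c) for all triples.
        if all(cols[c][cols[b][a]] == cols[cols[c][b]][cols[c][a]]
               for a, b, c in product(elems, repeat=3)):
            count += 1
    return count
```

Small sizes finish instantly; for n = 5 the loop visits 24⁵ (about eight million) candidate tables, which is slow in pure Python.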
The following is an example of a set of four circles closed under inversion (that is, any circle in the set inverted in any other circle results in a circle in the set), also called the inversive group of three points [3]. The function computes a disk or a line passing through point such that points and form an inverse pair under inversion in . In the following , you can drag a locator to modify one of the four circles; the others are computed accordingly.
The function slightly varies three points that are coincident or collinear.
For more on quandles, see [4].
Throughout this section, let and be circles with , and let the inversion circle be , such that . Call such an the midcircle of and . There are three cases, depending on the relative positions of and .
If , a reflection takes into , so assume from now on.
Theorem 1
Assume without loss of generality that and . Draw two parallel radii from and defining variable points and . Extend the lines and to intersect at point . From similar triangles, , and we easily conclude that .
Extend the lines and to intersect at the point . Construct the circle , which is tangent to and .
The first part of the following output checks that the circle inverts into , with .
The second part checks that , so that and invert to each other in .
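A numerical version of this kind of check can be sketched in Python. The sketch assumes two circles of different radii lying outside each other and takes the center of the inverting circle at their external center of similitude, where the lines through the ends of parallel radii meet. The function names and the sample circles are hypothetical.

```python
import math


def midcircle_external(a, ra, b, rb):
    """Circle inverting circle (a, ra) into circle (b, rb).

    Assumes ra != rb and that the two circles lie outside each other.
    """
    if math.isclose(ra, rb):
        raise ValueError("equal radii: use a reflection instead")
    # External center of similitude: divides the segment ab externally as ra : rb.
    sx = (rb * a[0] - ra * b[0]) / (rb - ra)
    sy = (rb * a[1] - ra * b[1]) / (rb - ra)
    # Power of that center with respect to the first circle.
    power_a = (a[0] - sx) ** 2 + (a[1] - sy) ** 2 - ra * ra
    return (sx, sy), math.sqrt(rb / ra * power_a)


def invert_point(p, center, radius):
    """Invert p in the circle (center, radius); p must differ from center."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    s = radius * radius / (dx * dx + dy * dy)
    return (center[0] + s * dx, center[1] + s * dy)


def inverts_into(a, ra, b, rb, samples=12):
    """Check that sample points of circle (a, ra) invert onto circle (b, rb)."""
    center, radius = midcircle_external(a, ra, b, rb)
    for i in range(samples):
        t = 2 * math.pi * i / samples
        p = (a[0] + ra * math.cos(t), a[1] + ra * math.sin(t))
        q = invert_point(p, center, radius)
        if not math.isclose(math.hypot(q[0] - b[0], q[1] - b[1]), rb, rel_tol=1e-9):
            return False
    return True
```

Because inversion is an involution, the same circle also inverts the second circle back into the first.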
The circle separates and , while connects them, so intersects in two points and , each of which is fixed under inversion in . Since , and on invert into , and , also on , inverts to itself (though not pointwise), and and are orthogonal.
Let be orthogonal to and , so that both and are invariant under inversion in . If inverts into in , then and invert into each other in , so , and and are orthogonal.
This notation is followed in the next . (A circle orthogonal to and is not drawn.)
Theorem 2
To verify inverts the circle into the circle , proceed as in the previous section.
The next follows the previous construction of the midcircle of and .
Theorem 3
To verify this, is inverted in the two circles from theorems 1 and 2.
The following shows the construction of both midcircles following the previous notation.
Theorem 4
Proof
Let be a circle centered on the midcircle . Inverting in , we obtain a line (drawn in black in the following ). As separates and , must separate and ; in fact, must invert into . As is a line, inversion in is reflection, hence and are congruent. □
The following shows additionally that the brown line joining the centers of and inverts into a brown circle orthogonal to the blue line and to the congruent circles and , as was expected.
For more on the geometry of circles and inversion, see [5, 6, 7].
The radical axis of two circles is the locus of a point from which tangents to the two circles have the same length. It is always a straight line perpendicular to the line joining the centers of the two circles. If the circles intersect, the radical axis is the line through the points of intersection. If the circles are tangent, it is their common tangent. We will need the following property of the radical axis of two circles [8, 9, 10].
Theorem 5
This checks that any point on has tangents to and of equal length.
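The equal-tangent property can also be checked numerically. The Python sketch below uses the power of a point, which equals the squared tangent length for a point outside the circle; the function names are illustrative only.

```python
def power_of_point(p, c, r):
    """Power of p with respect to circle (c, r); squared tangent length if p is outside."""
    return (p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 - r * r


def radical_axis(c1, r1, c2, r2):
    """Return a point on the radical axis and a direction vector along it."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d2 = dx * dx + dy * dy
    if d2 == 0:
        raise ValueError("concentric circles have no radical axis")
    # Fraction of the way from c1 to c2 at which the axis crosses the line of centers.
    t = (d2 + r1 * r1 - r2 * r2) / (2 * d2)
    foot = (c1[0] + t * dx, c1[1] + t * dy)
    # The axis is perpendicular to the line of centers.
    return foot, (-dy, dx)
```

Any point `foot + s * direction` then has the same power with respect to both circles.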
Theorem 6
A few words on the assumptions in the following Simplify to verify theorem 6. The first three, , , , hold in general. The fourth, , is for the Solve that computes to find a solution. The next two, and , are for the inversion of to be feasible, and the last two are for the inversion of to be feasible.
Theorem 7
Proof
Let and be the given circles. Choose a circle centered at a point on the radical axis with radius equal to the length of the tangent from to . This circle intersects the line in two points , . Any circle with or as center inverts and into concentric circles. □
The next shows the circles in blue and in yellow, their radical axis, a point on the radical axis, the circle now moving freely on the radical axis, the circle of inversion centered at one of the intersections of and , and two concentric circles obtained by inverting and in . Only one of the inversive circles is shown; the other is a mirror image in the radical axis.
The center of the inversive circle does not depend on the position of . This checks that the inverses of the circles and are concentric.
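The construction in the proof of theorem 7 can be sketched numerically in Python. The sketch picks the point of the radical axis that lies on the line of centers, which is only one admissible choice, and takes an inversion circle of radius 1; all function names are hypothetical.

```python
import math


def limiting_points(c1, r1, c2, r2):
    """Candidate inversion centers on the line of centers of two disjoint circles."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(dx, dy)
    ux, uy = dx / d, dy / d
    # Signed distance from c1 to the radical axis along the line of centers.
    d1 = (d * d + r1 * r1 - r2 * r2) / (2 * d)
    # Squared tangent length from that point to either circle.
    t2 = d1 * d1 - r1 * r1
    if t2 <= 0:
        raise ValueError("the circles intersect or touch")
    t = math.sqrt(t2)
    return [(c1[0] + (d1 + s * t) * ux, c1[1] + (d1 + s * t) * uy) for s in (-1, 1)]


def invert_circle(c, r, center, radius):
    """Image (center, radius) of circle (c, r) not passing through the inversion center."""
    dx, dy = c[0] - center[0], c[1] - center[1]
    s = radius * radius / (dx * dx + dy * dy - r * r)
    return (center[0] + s * dx, center[1] + s * dy), abs(s) * r


def concentric_images(c1, r1, c2, r2, which=0, radius=1.0):
    """Invert both circles about one of the two candidate centers."""
    o = limiting_points(c1, r1, c2, r2)[which]
    return invert_circle(c1, r1, o, radius), invert_circle(c2, r2, o, radius)
```

Both returned circles should share a center, for either value of `which`.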
Let be the center of the other inversive circle.
Theorem 8
A line and a circle that do not intersect or a pair of nonintersecting circles can be inverted into two concentric circles. The key is to obtain a common orthogonal circle and then choose the inversion center at its intersection with a particular line [11].
Theorem 9
I am grateful to the Center of Investigation and Advanced Studies in Mexico City for the use of its extensive library and to Wolfram Research for providing an ideal environment to develop this series of articles.
[1]  J. Rangel-Mondragón, “Inversive Geometry: Part 1,” The Mathematica Journal, 15, 2013. doi:10.3888/tmj.15-7. 
[2]  J. Rangel-Mondragón, “Inversive Geometry: Part 2,” The Mathematica Journal, 18, 2014. doi:10.3888/tmj.18-5. 
[3]  F. Morley and F. V. Morley, Inversive Geometry, New York: Ginn and Company, 1933. 
[4]  B. Ho and S. Nelson, “Matrices and Finite Quandles,” Homology, Homotopy and Applications, 7(1), 2005 pp. 197–208. doi:10.4310/HHA.2005.v7.n1.a11. 
[5]  H. S. M. Coxeter, “Inversive Geometry,” Educational Studies in Mathematics, 3(3), 1971 pp. 310–321. doi:10.1007/BF00302300. 
[6]  D. Pedoe, Geometry, A Comprehensive Course, New York: Dover Publications Inc., 1988. 
[7]  H. S. M. Coxeter and S. L. Greitzer, Geometry Revisited, New York: Random House, 1967. 
[8]  E. W. Weisstein. “Radical Line” from MathWorld–A Wolfram Web Resource. mathworld.wolfram.com/RadicalLine.html. 
[9]  S. R. Murthy. “Radical Axis and Radical Center” from the Wolfram Demonstrations Project–A Wolfram Web Resource. demonstrations.wolfram.com/RadicalAxisAndRadicalCenter. 
[10]  J. Rangel-Mondragón, “The Arbelos,” The Mathematica Journal, 16, 2014. doi:10.3888/tmj.16-5. 
[11]  D. E. Blair, Inversion Theory and Conformal Mapping, Providence, RI: American Mathematical Society, 2000. 
J. Rangel-Mondragón, “Inversive Geometry: Part 3,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-4. 
Jaime Rangel-Mondragón received M.Sc. and Ph.D. degrees in Applied Mathematics and Computation from the School of Mathematics and Computer Science at the University College of North Wales in Bangor, UK. He was a visiting scholar at Wolfram Research, Inc. and held positions in the Faculty of Informatics at UCNW, the Center of Literary and Linguistic Studies at the College of Mexico, the Department of Electrical Engineering at the Center of Research and Advanced Studies, the Center of Computational Engineering (of which he was director) at the Monterrey Institute of Technology, the Department of Mechatronics at the Queretaro Institute of Technology and the Autonomous University of Queretaro in Mexico, where he was a member of the Department of Informatics and in charge of the Academic Body of Algorithms, Computation and Networks. His research included combinatorics, the theory of computing, computational geometry and recreational mathematics. Jaime Rangel-Mondragón died in 2015.
Input shaping is an established technique to generate prefilters so that flexible mechanical systems move with minimal residual vibration. Many examples of such systems are found in engineering—for example, space structures, robots, cranes and so on. The problem of vibration control is serious when precise motion is required in the presence of structural flexibility. In a wind turbine blade, untreated flapwise vibrations may reduce the life of the blade and unexpected vibrations can spread to the supporting structure. This article investigates one of the tools available to control vibrations within flexible mechanical systems using the input shaping technique.
Among other choices [1, 2] for reducing vibrations in flexible systems, input shaping control is an open-loop control technique that is implemented by convolving a sequence of impulses with a desired command. The amplitudes and time locations of the impulses are determined from the system’s natural frequency and damping ratio by solving a set of constraint equations. Historically, input shaping dates from the late 1950s. Originally named “Posicast Control,” the initial development of input shaping is largely credited to Smith [3, 4], with one notable precursor due to Calvert and Gimpel [5]. All three works proposed a simple technique to generate a nonoscillatory response from a lightly damped system subjected to a step input, which was motivated by a simple wave cancellation concept for the elimination of the oscillatory motion of the underdamped system. The early forms of command generators suffered from poor robustness properties, as they were sensitive to modeling errors of natural frequencies and damping ratios. Since this initial work, there have been many developments in the area of input shaping control, with one of the pacing elements being the progress in microprocessor technology to implement the concept. More recent robust command generators have proven beneficial for real systems with, for example, Swigert [6] proposing techniques for the determination of torque profiles that considered the sensitivity of the terminal states to variations in the model parameters. Other examples of input shapers have been developed that are robust to natural frequency modeling errors, the first of which was called the Zero Vibration and Derivative (ZVD) shaper [7], which improved robustness to modeling errors by adding additional constraints on the derivative of the residual vibration magnitudes. 
With this robustness present, input shaping has been implemented in a variety of systems, including movement of cranes [8, 9], precise movement of disk drives [10], flexible spacecraft [11, 12], industrial robots [13, 14] and coordinate measuring machines [15]. There have also been developments using hybrid input shaping [16] and three-step input shaping techniques [17].
Many types of solutions are possible for the problem of flexible dynamics—for example, feedback control, command shaping or redesigning the physical geometry [18]. A simple example of this challenging area is that of an overhead traveling crane, as shown in Figure 1, which consists of a point mass for the moveable structure (crab or crane), a point mass for the payload and a nonextensible load-carrying rope (cable) of length .
Figure 1. Schematic of overhead gantry crane.
The equations of motion for such a system can be set up either directly from Newtonian mechanics or indirectly using Lagrangian methods. Using either results in the nonlinear system of equations for the motion as
(1) 
When the gantry crane is accelerating or retarding, then the hanging cable starts to vibrate. The code for equation (1) is shown in Figure 2, with the force set as positive, and the movement of the cable in particular is shown when the crane is accelerating. A result for retardation can easily be found by simply assigning a negative value for in the code for .
Figure 2. Overhead gantry crane accelerating.
A special case is when the gantry crane is moving at a constant velocity; that is, is set to zero. Then the hanging cable for lifting is not swinging, and the crane and pendulum move as in Figure 3.
Figure 3. Overhead gantry crane moving with constant speed.
The upper end of the cable is attached to a trolley that travels along a rail to position the payload. Cranes are usually controlled by a human operator who moves levers or presses buttons to cause the trolley to move. If the operator presses the control button for a finite time period, then the trolley will move a finite distance and come to rest. However, the payload usually oscillates about some support on the trolley due to the trolley motion, as shown in Figure 4 by the uncontrolled oscillation. The crane driver can smooth this situation by suitably pressing the button multiple times. The payload motion for this scenario could be as shown in Figure 4 labeled as “Operator controlled.”
Figure 4. Payload response for uncontrolled and operator controlled.
Here we start with the simplest commands to move systems without vibration. An impulse applied to a system usually causes it to vibrate, but that can be canceled by a second impulse. This concept is shown in Figure 5, where each input is piecewise constant and the system being considered is purely oscillatory with no damping. As can be seen, the response functions add together to give zero.
Figure 5. Simple cancellation of a vibration.
Next, Figure 6 shows the response of a typical forced damped system to a two-impulse command.
Figure 6. Typical spring-damped system.
For the preceding system, the equations of motion are
(2) 
where and are coefficients due to drag and spring stiffness, respectively.
Again, each input is piecewise constant, but the equation of motion has an additional damping term dependent on the speed of motion.
It is instructive to derive the amplitudes and time locations of the two-impulse command shown in Figure 7.
Figure 7. Two-impulse response with damping.
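If a reasonable estimate of the system’s natural frequency $\omega$ and damping ratio $\zeta$ is known, then the residual vibration that results from a sequence of impulses can be described [10] using the expression
(3) $V(\omega, \zeta) = e^{-\zeta \omega t_n} \sqrt{C(\omega, \zeta)^2 + S(\omega, \zeta)^2},$
where
(4) $C(\omega, \zeta) = \sum_{i=1}^{n} A_i e^{\zeta \omega t_i} \cos(\omega_d t_i), \quad S(\omega, \zeta) = \sum_{i=1}^{n} A_i e^{\zeta \omega t_i} \sin(\omega_d t_i),$
$A_i$ and $t_i$ are the amplitudes and time locations of the impulses, $n$ is the number of impulses in the impulse sequence, and $\omega_d = \omega \sqrt{1 - \zeta^2}$.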
Equation (3) is actually the percentage of residual vibration, which is a measure of the amount of vibration a sequence of impulses will cause relative to the vibration caused by a single impulse with unit magnitude. On setting equation (3) equal to zero and avoiding a trivial solution, values for the impulse amplitudes and time locations that would lead to a zero residual vibration can be found. To avoid the zero-valued trivial solution and to obtain a normalized result, the impulse amplitudes $A_i$ are required to sum to one; that is,
(5) $\sum_{i=1}^{n} A_i = 1.$
However, the amplitudes can still satisfy equation (5) by taking very large values, both positive and negative. To alleviate this, a bounded solution is imposed that limits the amplitudes $A_i$ to positive values:
(6) $A_i > 0, \quad i = 1, \ldots, n.$
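As an aid to reading equations (3) to (6), here is a small Python sketch that evaluates the residual vibration of an impulse sequence. It assumes the standard form of the expression from the input shaping literature, with amplitudes `amps`, time locations `times`, natural frequency `omega` and damping ratio `zeta`; these names are chosen here and do not come from the article's code.

```python
import math


def residual_vibration(amps, times, omega, zeta):
    """Residual vibration of an impulse sequence, relative to a unit impulse.

    Assumed form: exp(-zeta*omega*t_n) * sqrt(C^2 + S^2), where C and S are
    the cosine and sine sums over the impulses.
    """
    wd = omega * math.sqrt(1.0 - zeta ** 2)  # damped natural frequency
    c = sum(a * math.exp(zeta * omega * t) * math.cos(wd * t)
            for a, t in zip(amps, times))
    s = sum(a * math.exp(zeta * omega * t) * math.sin(wd * t)
            for a, t in zip(amps, times))
    return math.exp(-zeta * omega * times[-1]) * math.hypot(c, s)


def satisfies_amplitude_constraints(amps, tol=1e-12):
    """Check that the amplitudes are positive and sum to one."""
    return all(a > 0 for a in amps) and abs(sum(amps) - 1.0) <= tol
```

A single unit impulse at time zero gives a value of one, and two equal impulses half a period apart on an undamped system give a value of zero, which is the cancellation shown in Figure 5.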
For a twoimpulse sequence, there are four unknowns, , , , . Without loss of generality, we can set the time location of the first impulse equal to zero. For equation (3) to be satisfied, the expressions in equation (4) must both be equal to zero. Therefore, we get
(7) 
The second of these two expressions can be satisfied nontrivially by setting the sine term equal to zero. This occurs when
(8) 
where is the damped period of vibration. This of course means that there are an infinite number of possible values for the location of the second impulse, but to cancel the vibration in the shortest amount of time, the smallest value of is chosen:
(9) 
For this case, the amplitude constraint given in equation (5) reduces to
(10) 
Using the expression for the damped natural frequency and substituting equations (9) and (10) into the first expression of equation (5) gives
(11) 
The sequence of two impulses that leads to zero residual vibration can be summarized as
(12) 
where .
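The two-impulse result can be tried out with a short Python sketch. It assumes the standard ZV amplitudes 1/(1+K) and K/(1+K), with K = exp(−ζπ/√(1−ζ²)), placed half a damped period apart; the function names and the numerical values in the usage note are illustrative.

```python
import math


def zv_shaper(omega, zeta):
    """Amplitudes and times of a two-impulse zero-vibration sequence (assumed standard form)."""
    root = math.sqrt(1.0 - zeta ** 2)
    k = math.exp(-zeta * math.pi / root)
    half_damped_period = math.pi / (omega * root)
    return [1.0 / (1.0 + k), k / (1.0 + k)], [0.0, half_damped_period]


def residual_vibration(amps, times, omega, zeta):
    """Residual vibration of an impulse sequence, relative to a unit impulse."""
    wd = omega * math.sqrt(1.0 - zeta ** 2)
    c = sum(a * math.exp(zeta * omega * t) * math.cos(wd * t)
            for a, t in zip(amps, times))
    s = sum(a * math.exp(zeta * omega * t) * math.sin(wd * t)
            for a, t in zip(amps, times))
    return math.exp(-zeta * omega * times[-1]) * math.hypot(c, s)
```

Calling `zv_shaper(2 * math.pi, 0.1)` and passing the result to `residual_vibration` at the same frequency and damping should give a value at rounding-error level; evaluating at a different frequency shows how quickly the cancellation degrades.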
A zero-vibration (ZV) input shaper, as just described, is useful in situations where the parameters of the system are known with a high degree of accuracy. Also, if little faith is held in the input shaping approach, the application will never increase vibration beyond the level before shaping [19]. It has been pointed out [20] that previous articles on input shaping have confused the issue of natural frequency, even if the conceptual explanation when using the method is generally acceptable. Kang [20] differentiates between (a variable), which is the actual value of the undamped natural frequency of the system, and (a constant), which is the “modeled” value of the undamped natural frequency . Kang [20] also proves that vibration approaches zero as . The article shows clearly that for a vibratory system, a modeling frequency is chosen such that at the modeling frequency .
The following code generates the sensitivity curve (Figure 8) for a ZV shaper by plotting the amplitude of residual vibration as a function of the system parameters. In this case, the modeling frequency was set as rad/s and the damping ratio as 0.0.
Figure 8. Sensitivity curve for ZV input shaper.
The amplitudes and time locations of the impulses depend on the system parameters and . If there are errors in these values (and there always are [18]), then the impulse sequence will not result in zero vibration. A Zero Vibration and Derivative (ZVD) shaper is a command generation scheme designed to make the input shaping process more robust to these modeling errors. To increase robustness to modeling error, the ZVD input shaper adds two constraints [20], the derivatives
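(13) $\sum_{i=1}^{n} A_i t_i e^{\zeta \omega t_i} \cos(\omega_d t_i) = 0, \quad \sum_{i=1}^{n} A_i t_i e^{\zeta \omega t_i} \sin(\omega_d t_i) = 0.$
The sequence for the ZVD shaper can be summarized as
(14) $\begin{bmatrix} A_i \\ t_i \end{bmatrix} = \begin{bmatrix} \frac{1}{(1 + K)^2} & \frac{2K}{(1 + K)^2} & \frac{K^2}{(1 + K)^2} \\ 0 & \frac{T_d}{2} & T_d \end{bmatrix},$
where $K = e^{-\zeta \pi / \sqrt{1 - \zeta^2}}$ and $T_d$ is the damped period of vibration.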
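A Python sketch of the three-impulse sequence follows. It assumes the standard ZVD amplitudes 1/(1+K)², 2K/(1+K)² and K²/(1+K)², spaced half a damped period apart; `omega_m` stands for the modeling frequency, and all names are illustrative.

```python
import math


def zvd_shaper(omega_m, zeta):
    """Amplitudes and times of a three-impulse ZVD sequence (assumed standard form)."""
    root = math.sqrt(1.0 - zeta ** 2)
    k = math.exp(-zeta * math.pi / root)
    half_damped_period = math.pi / (omega_m * root)
    norm = (1.0 + k) ** 2
    amps = [1.0 / norm, 2.0 * k / norm, k * k / norm]
    return amps, [0.0, half_damped_period, 2.0 * half_damped_period]


def residual_vibration(amps, times, omega, zeta):
    """Residual vibration of an impulse sequence, relative to a unit impulse."""
    wd = omega * math.sqrt(1.0 - zeta ** 2)
    c = sum(a * math.exp(zeta * omega * t) * math.cos(wd * t)
            for a, t in zip(amps, times))
    s = sum(a * math.exp(zeta * omega * t) * math.sin(wd * t)
            for a, t in zip(amps, times))
    return math.exp(-zeta * omega * times[-1]) * math.hypot(c, s)
```

Evaluating `residual_vibration` slightly away from `omega_m` shows the flat bottom of the sensitivity curve: for an undamped system, a 5% frequency error leaves well under 1% residual vibration.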
An alternative to the ZV shaper is the ZVD shaper, which is much more robust than the ZV shaper, as shown in Figure 9. However, the ZVD shaper has a time duration equal to one period of the vibration frequency, as opposed to the one-half period length of the ZV shaper. This tradeoff is typical of the input shaper design process; that is, increasing insensitivity usually requires increasing the length of the input shaper. An input shaper with even more insensitivity than the ZVD can be obtained by setting the second derivative of equation (3) with respect to equal to zero. This is called the ZVDD shaper. The algorithm can be extended indefinitely with repeated differentiation of equation (3). Closed-form solutions of the ZV, ZVD and ZVDD shapers for damped systems exist [7].
An alternative procedure for increasing insensitivity using extra-insensitive constraints has been derived [21]. Instead of forcing the residual vibration to zero at the modeling frequency, the residual vibration is limited to a low level of . The width of the notch in the sensitivity curve is then maximized by forcing the vibration to zero at two frequencies, one lower than the modeling frequency and the other higher. Figure 9 indicates that there are two inner maxima, say at frequencies and , where the vibration must equal as defined in equation (15) and the derivative must equal zero. These two constraints translate to
(15) 
and
(16) 
where and is the difference between and . Note that represents the frequency shift from the modeling frequency to the frequency that corresponds to the first hump in the sensitivity curve; depends on and does not appear in the final formula for the shaper. Other conditions are that the impulse amplitudes must sum to one, and following the hypothesis that the shaper contains four evenly spaced impulses with a duration of one and a half periods to form the sensitivity curve [21], then
(17) 
Using these conditions, it can be shown that
(18) 
Expanding equations (15) and (16), combining terms and using equation (18) gives
(19) 
and
(20) 
Equation (19) can be solved for :
(21) 
Substituting equation (21) into equation (20) yields
(22) 
where
The two-hump shaper for an undamped system can now be summarized as
(23) 
The following code generates a two-hump shaper based on the above analysis and compares it to the ZVD shaper. When , the insensitivity to modeling errors (i.e. the width of the sensitivity curve) is increased by over 100%. Again, the modeling frequency is set at rad/s.
Figure 9. Sensitivity curves.
Robustness is not restricted to errors in the frequency. Figure 10 shows a three-dimensional sensitivity curve for a shaper that was designed to suppress vibration over the range of damping ratios between 0 and 0.1.
Figure 10. Three-dimensional curve including variation with damping ratio.
A damped oscillatory dynamic system model has the transfer function
(24) 
where again and are the natural frequency and damping ratio, respectively. Figure 11 gives various responses, depending on the damping factor.
Figure 11. Responses to input for different damping factors.
The equation for the responses shown in Figure 11 is
(25) 
where and are the amplitude and time of the impulse, respectively. Further, the response to a sequence of impulses can be obtained using the superposition principle. Thus for impulses, the impulse response can be expressed as , where
(26) 
(27) 
where and are again the magnitude and times at which the impulses occur.
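The superposition argument can be sketched in Python. The sketch assumes the usual second-order model with transfer function ω²/(s² + 2ζωs + ω²), whose response to an impulse of amplitude A at time t₀ is A ω/√(1−ζ²) e^{−ζω(t−t₀)} sin(ω_d (t−t₀)) for t ≥ t₀; the function names are illustrative.

```python
import math


def impulse_response(t, amp, t0, omega, zeta):
    """Response at time t to one impulse of amplitude amp applied at time t0."""
    if t < t0:
        return 0.0
    root = math.sqrt(1.0 - zeta ** 2)
    wd = omega * root  # damped natural frequency
    return (amp * omega / root
            * math.exp(-zeta * omega * (t - t0))
            * math.sin(wd * (t - t0)))


def sequence_response(t, amps, times, omega, zeta):
    """Response to a sequence of impulses, obtained by superposition."""
    return sum(impulse_response(t, a, t0, omega, zeta)
               for a, t0 in zip(amps, times))
```

With a two-impulse zero-vibration sequence, `sequence_response` should vanish at every time after the second impulse, whereas a single unit impulse keeps ringing.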
To demonstrate the effect on the response when the model is not perfect, the following code was written using a robust four-impulse ZVDD shaper. Figure 12 shows the response when no shaping is imposed, when the model is perfect, and when there is a 20% error in the frequency estimate. With shaping, the initial peak response is cut to 57% of its value when no input shaper is applied.
Figure 12. Responses when model is not perfect.
Wind turbine blade vibration is a serious problem because it will reduce the life of the blade and vibrations can also be transferred to the supporting tower, causing the complete structure to vibrate. One source of an increase in vibration amplitudes is the change of pitch angle input. Use is made here of a ZV input shaper to demonstrate a decrease in amplitude when the pitch angle changes from large to small. Although many ways to suppress wind turbine blade vibration have been developed, there has not been much work done on the effect on the vibrations when changing the pitch angle rapidly. A rapidly changing pitch angle input could be considered as a step input, causing some additional vibration (larger amplitudes) to the blade. In this example, the effect of using an input shaper to reduce the blade angle deflection is investigated. We consider the wind turbine blade as a cantilever beam with the hub end clamped and the other end free to move. The effect of the rotation is taken into account by the inclusion of centrifugal stiffening, and the modal shapes were calculated using the Adomian modified decomposition method [22]. To incorporate the effect of changing the pitch angle, the well-known blade element theory [23] was used to form a generalized normal force consisting of components of lift and drag forces as functions of pitch angle. The expressions for kinetic energy, potential energy and aerodynamic forces were then used to form a Lagrangian of the blade that governs the motion of blade flapwise deflection.
A ZV input shaper is used in a scheme summarized in Figure 13, which is a block diagram of an input shaping control scheme dealing with unexpected wind disturbances.
Figure 13. Schematic of input shaper controller.
The input shaping control is a feedforward control method, and only the shaped input is used to control the system. The idea is to see how the blade flapwise deflection reacts to a pitch angle change. The pitch angle is initially set at a 4° angle of attack. Figure 14 shows the flapwise deflection with the pitch angle at 4° once it has reached its steady state, and Figure 15 shows the deflection at 14°. It can be seen that the deflection of the blade is worse at the smaller angle, and this is due to the wind turbine blade being a pitch-to-feather type.
Figure 14. Flapwise deflection (pitch angle 4°).
Figure 15. Flapwise deflection (pitch angle 14°).
To see how the pitch angle affects the flapwise deflection, the pitch angle is changed from 14° to 4° at 30 seconds. Figure 16 shows that some residual vibration is caused (solid blue curve) since the deflection after 30 seconds is different from when the pitch angle was set at 14°. This is because in the model there is no damping at first. Next, the input shaper is added, and clearly, from the dashed orange curve in Figure 16, the residual vibration is reduced.
Figure 16. Pitch angle change effect.
Some of the tools available for input shaping have been investigated here, where the input to a given system has been shaped so as to minimize the residual vibration. Important to future use of these techniques is that they have been shown to be robust and able to tolerate errors within the system parameters; that is, although a residual vibration may not become zero due to the shaper, there is generally a large reduction in vibration.
[1]  J.-H. Park and S. Rhim, “Experiments of Optimal Delay Extraction Algorithm Using Adaptive Time-Delay Filter for Improved Vibration Suppression,” Journal of Mechanical Science and Technology, 23(4), 2009 pp. 997–1000. doi:10.1007/s12206-009-0328-1. 
[2]  Q. H. Ngo, K.-S. Hong and I. H. Jung, “Adaptive Control of an Axially Moving System,” Journal of Mechanical Science and Technology, 23(11), 2009 pp. 3071–3078. doi:10.1007/s12206-009-0912-4. 
[3]  O. J. M. Smith, Feedback Control Systems, New York: McGraw-Hill Book Company, 1958. 
[4]  O. J. M. Smith, “Posicast Control of Damped Oscillatory Systems,” Proceedings of the IRE, 45(9), 1957 pp. 1249–1255. doi:10.1109/JRPROC.1957.278530. 
[5]  D. J. Gimpel and J. F. Calvert, “Signal Component Control,” Transactions of the American Institute of Electrical Engineers, 71(5), 1952 pp. 339–343. doi:10.1109/TAI.1952.6371288. 
[6]  C. J. Swigert, “Shaped Torque Techniques,” Journal of Guidance, Control, and Dynamics, 3(5), 1980 pp. 460–467. doi:10.2514/3.56021. 
[7]  N. C. Singer and W. P. Seering, “Preshaping Command Inputs to Reduce System Vibration,” Journal of Dynamic Systems, Measurement, and Control, 112(1), 1990 pp. 76–82. doi:10.1115/1.2894142. 
[8]  K. L. Sorensen, W. E. Singhose and S. Dickerson, “A Controller Enabling Precise Positioning and Sway Reduction in Bridge and Gantry Cranes,” Control Engineering Practice, 15(7), 2007 pp. 825–837. doi:10.1016/j.conengprac.2006.03.005. 
[9]  M. A. Ahmad, R. M. T. R. Ismail, M. S. Ramli, R. E. Samin and M. A. Zawawi, “Robust Input Shaping for Anti-Sway Control of Rotary Crane,” Proceedings of TENCON 2009—IEEE Region 10 Conference, Singapore, Jan. 23–26, 2009 pp. 1039–1043. doi:10.1109/TENCON.2009.5395891. 
[10]  W. E. Singhose, W. Seering and N. C. Singer, “Time-Optimal Negative Input Shapers,” Journal of Dynamic Systems, Measurement, and Control, 119(2), 1997 pp. 198–205. doi:10.1115/1.2801233. 
[11]  D. Gorinevsky and G. Vukovich, “Nonlinear Input Shaping Control of Flexible Spacecraft Reorientation Maneuver,” Journal of Guidance, Control, and Dynamics, 21(2), 1998 pp. 264–270. doi:10.2514/2.4252. 
[12]  L. Y. Pao and W. E. Singhose, “Verifying Robust Time-Optimal Commands for Multimode Flexible Spacecraft,” Journal of Guidance, Control, and Dynamics, 20(4), 1997 pp. 831–833. doi:10.2514/2.4123. 
[13]  J. Park, P. H. Chang, H. S. Park and E. Lee, “Design of Learning Input Shaping Technique for Residual Vibration Suppression in an Industrial Robot,” IEEE/ASME Transactions on Mechatronics, 11(1), 2006 pp. 55–65. doi:10.1109/TMECH.2005.863365. 
[14]  C.-G. Kang, K. S. Woo, J. W. Kim, D. J. Lee, K. H. Park and H. C. Kim, “Suppression of Residual Vibrations with Input Shaping for a Two-Mode Mechanical System,” Proceedings of International Conference on Service and Interactive Robotics, Taipei, Taiwan, 2009 pp. 1–6. 
[15]  S. D. Jones and A. G. Ulsoy, “An Approach to Control Input Shaping with Application to Coordinate Measuring Machines,” Journal of Dynamic Systems, Measurement, and Control, 121(2), 1999 pp. 242–247. doi:10.1115/1.2802461. 
[16]  S. Kapucu, G. Alici and S. Bayseç, “Residual Swing/Vibration Reduction Using a Hybrid Input Shaping Method,” Mechanism and Machine Theory, 36(3), 2001 pp. 311–326. doi:10.1016/S0094-114X(00)00048-3. 
[17]  S. S. Güreyük and S. Cinal, “Robust Three-Impulse Sequence Input Shaper Design,” Journal of Vibration and Control, 13(12), 2007 pp. 1807–1818. doi:10.1177/1077546307080012. 
[18]  T. Singh and W. Singhose, “Tutorial on Input Shaping/Time Delay Control of Maneuvering Flexible Structures,” Proceedings of the 2002 American Control Conference, Vol. 3, Anchorage, AK, May 8–10, 2002 pp. 1717–1731. doi:10.1109/ACC.2002.1023813. 
[19]  I. Arolovich and G. Agranovich, “Control Improvement of Under-Damped Systems and Structures by Input Shaping,” Proceedings of the 8th International Conference on Material Technologies and Modeling (MMT2014), Ariel, Israel, Jul. 28–Aug. 1, 2014 pp. 3.1–3.10. (May 23, 2017) www.semanticscholar.org/paper/ControlImprovementofUnderdampedSystemsandStArolovichAgranovich/5cd5f119710edc81be912aea09a66c64e92d48a2. 
[20]  C.-G. Kang, “On the Derivative Constraints of Input Shaping Control,” Journal of Mechanical Science and Technology, 25(2), 2011 pp. 549–554. doi:10.1007/s12206-010-1205-7. 
[21]  T. Singh and S. R. Vadali, “Robust Time-Optimal Control: Frequency Domain Approach,” Journal of Guidance, Control, and Dynamics, 17(2), 1994 pp. 346–353. doi:10.2514/3.21204. 
[22]  D. Adair and M. Jaeger, “Simulation of Tapered Rotating Beams with Centrifugal Stiffening Using the Adomian Decomposition Method,” Applied Mathematical Modelling, 40(4), 2016 pp. 3230–3241. doi:10.1016/j.apm.2015.09.097. 
[23]  D. Adair and M. Alimaganbetov, “Propeller Wing Aerodynamic Interference for Small UAVs during VSTOL,” 56th Israel Annual Conference on Aerospace Sciences, Tel Aviv/Haifa, 9–10 Mar., 2016. (May 23, 2017) www.researchgate.net/publication/285356494_Propeller_Wing_Aerodynamic_Interference_ for_Small_UAVs_during_VSTOL. 
D. Adair and M. Jaeger, “Aspects of Input Shaping Control of Flexible Mechanical Systems,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-3. 
Desmond Adair is a professor of Mechanical Engineering in the School of Engineering, Nazarbayev University, Astana, Republic of Kazakhstan. His recent research interests include developing analytical methods for solving vibration problems, and computational fluid dynamics.
Martin Jaeger is an associate professor of Civil Engineering and manager of the Project Based Learning Centre in the School of Engineering, Australian College of Kuwait, Mishref, Kuwait. His recent research interests include construction management and total quality management, as well as developing strategies for engineering education.
Desmond Adair
School of Engineering
Nazarbayev University
53 Kabanbay Batyr Ave.
Astana, 010000, Republic of Kazakhstan
dadair@nu.edu.kz
Martin Jaeger
School of Engineering and ICT
University of Tasmania
Churchill Ave.
Hobart, TAS 7001, Australia
mjaeger@utas.edu.au
This didactic synthesis compares three solution methods for polynomial approximation and systematically presents their common characteristics and their close interrelations:
1. Classical Gram–Schmidt orthonormalization and Fourier approximation in
2. Linear least-squares solution via QR factorization on an equally spaced grid in
3. Linear least-squares solution via the normal equations method in and on an equally spaced grid in
The first two methods are linear least-squares systems with Vandermonde matrices ; the normal equations contain matrices of Hilbert type . The solutions on equally spaced grids in converge to the solutions in . All solution characteristics and their relations are illustrated by symbolic or numeric examples and graphs.
Let . Consider the Hilbert space of real-valued square integrable functions (or , for short), equipped with Lebesgue measure and scalar product
(1) 
and the corresponding norm
scalar products can be approximated by scalar products on discrete grids in based on Riemann sums and similarly for norms.
Partition the finite interval into subintervals by the points
and set
Suppose that is a bounded function on . Let be any point in the subinterval and define the grid
The Riemann sums on the partition and grid are defined by
(2) 
(3) 
(4) 
(5) 
Equation (3) is called the left-hand Riemann sum, (4) the right-hand Riemann sum and (5) the (composite) midpoint rule.
For an equally spaced partition, the step size is
The equally spaced partition points are
and the equally spaced grid of length (excluding the endpoint ) is
It is also possible to use the grid points or grid shifted by an amount , where so that , as
Let
(6) 
For equally spaced grids, the Riemann sums simplify to
(7) 
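As a minimal sketch of the Riemann sums on an equally spaced grid, the following Python function evaluates the sum for a given shift of the grid points within each subinterval. The function name `riemann_sum` and the parameter name `shift` are assumptions made for this illustration and are not taken from the article's Mathematica code; a shift of 0 is taken to mean the left endpoints, 1/2 the midpoints and 1 the right endpoints.

```python
def riemann_sum(f, a, b, n, shift=0.5):
    """Riemann sum of f on [a, b] using n equal subintervals.

    shift (hypothetical name) selects the sample point in each subinterval:
    0 -> left-hand sum, 0.5 -> composite midpoint rule, 1 -> right-hand sum.
    """
    h = (b - a) / n  # step size of the equally spaced partition
    return h * sum(f(a + (i + shift) * h) for i in range(n))


def square(x):
    return x * x


exact = 1.0 / 3.0  # integral of x^2 over [0, 1]
left = riemann_sum(square, 0.0, 1.0, 100, shift=0.0)
mid = riemann_sum(square, 0.0, 1.0, 100, shift=0.5)
right = riemann_sum(square, 0.0, 1.0, 100, shift=1.0)
print(left - exact, mid - exact, right - exact)
```

For this increasing integrand the left-hand and right-hand sums bracket the integral, while the midpoint rule is markedly closer, which is consistent with the higher convergence order of the midpoint rule stated later in the text.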
Setting , and gives the left-hand Riemann sum, the composite midpoint rule and the right-hand Riemann sum, respectively. The error of the Riemann sums is defined as
The set of continuous real-valued functions forms a dense subspace of , [1], Theorem (13.21). For , the restrictions and to this grid are well-defined. Define the dimensional scalar product on this grid:
(8) 
The dimensional 2norm is
The factor ensures that the norms of constant functions agree:
Denote the linear space of polynomials with real coefficients of degree at most by and define the polynomial by
The polynomial can be written as a scalar product (or dot product) of two tuples, the monomials up to degree and the coefficients:
Introducing the Vandermonde matrix
(9) 
every polynomial of degree can be written as the product of a matrix and a vector as
The product of a matrix and an vector is regarded as a 1-vector, not a scalar, as in Mathematica.
Restricting the Vandermonde matrix to the interval gives an operator mapping into :
(10) 
Whereas is an unbounded operator, is a bounded operator with respect to the 2-norms on and .
The polynomial approximates in the norm, as measured by
(11) 
In matrix-vector notation, this constitutes a linear least-squares problem for the coefficients :
(12) 
where
Now take a discrete grid
and sample on this discrete grid:
The polynomial of degree approximates in the 2-norm on this grid as measured by
(13) 
In matrix-vector notation, this constitutes a linear least-squares problem for the coefficients :
(14) 
where
(15) 
A rectangular or square matrix of this form is called a Vandermonde matrix.
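To make the matrix-vector form concrete, here is a small Python sketch that builds a rectangular Vandermonde matrix on a grid and evaluates a polynomial as the product of that matrix with its coefficient vector. The helper names `vandermonde` and `mat_vec` and the sample grid are assumptions for illustration only; coefficients are ordered from the constant term upward.

```python
def vandermonde(grid, degree):
    """Rows [1, x, x^2, ..., x^degree], one row per grid point x."""
    return [[x ** k for k in range(degree + 1)] for x in grid]


def mat_vec(matrix, vector):
    """Plain matrix-vector product for lists of rows."""
    return [sum(a * c for a, c in zip(row, vector)) for row in matrix]


grid = [0.0, 0.25, 0.5, 0.75]
coeffs = [1.0, -2.0, 3.0]          # p(x) = 1 - 2 x + 3 x^2
V = vandermonde(grid, 2)           # 4 x 3 Vandermonde matrix
values = mat_vec(V, coeffs)        # p sampled on the grid
print(values)
```

Sampling a function on the same grid and minimizing the 2-norm of the difference to `values` over the coefficients is then the discrete least-squares problem described above.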
Let , be Hilbert spaces with scalar products , and let be a bounded linear operator defined everywhere on . Then the adjoint operator of is defined by the identity ([2], p. 196, Definition 1)
(16) 
For Hilbert spaces , over the reals, one writes instead of .
All Riemann sums integrate constant functions exactly on any grid, since
If is bounded on and continuous except for finitely many points, it has a Riemann integral. Consult [3], Chapter 2, for proofs and further references on quadrature formulas.
Theorem 1
(17) 
A similar result holds for the left-hand Riemann sums (with ).
If , the error term of the elementary midpoint formula is given in [3], (2.6.5):
(18) 
Therefore the error of the composite midpoint formula can be bounded by
(19) 
By Theorem 1, for functions and , the discrete scalar product converges at least as fast as to the scalar product:
By equation (19), for functions , the discrete scalar product converges at least as fast as to the scalar product:
See [4], sections 2.4 and 2.6.
Theorem 2
(20) 
Define to be the index of the last positive singular value
Then
(21) 
The condition number of a rectangular matrix with full column rank with respect to the 2-norm (in short, the 2-condition number) is defined as the quotient of its largest to its smallest singular value ([4], equation (2.6.5)):
(22) 
By equation (20), the 2-condition number has the properties
(23) 
(24) 
If is an invertible matrix, . The SVD of is obtained from the SVD of as
(25) 
(26) 
If is a real matrix with orthonormal columns , it can be completed to an orthogonal matrix . Therefore the SVD of is
(27) 
The 2-condition number of is
(28) 
See [2], sections VII.1 and VII.2.
Theorem 3
.
For and a subset , linear combinations from the subset define a finite-dimensional linear operator
(29) 
Obviously, and if and only if is linearly independent.
is bounded and has the matrix representation
Apply the definition of the adjoint operator and notice that the first scalar product is that of , ; then
Therefore the adjoint operator is
(30) 
Substituting into the preceding equation gives the representation of as an matrix (note the scalar products are taken in ):
(31) 
Here is Hermitian positive semidefinite if is over the complex numbers, and symmetric positive semidefinite if is over the reals, and
A polynomial of degree has at most distinct zeros, therefore the set of monomials is a linearly independent subset of and
(32) 
By [5], the determinant of the Vandermonde matrix is the product
Therefore, the rectangular Vandermonde matrix (15) has full rank if and only if the points are pairwise distinct.
Names are chosen according to previous notation and terminology.
This defines the 2-condition number.
This defines symbolic integration with time constrained to 10 seconds.
This defines numerical integration.
The function (which is ) first attempts symbolic integration; if that is unsuccessful or takes too long, it performs numerical integration.
This defines the scalar product in .
Since is listable in and , it also implements the adjoint operator for a set of functions according to equation (32).
This defines the norm in .
This defines functions for discrete grids.
If , or is a machine number, the functions , , and return machine numbers.
This defines the functions , , and .
To avoid potential clashes with predefined values for the variables , and , the script letters , and are used for symbolic results.
These sample functions are used in the commands.
Let . Given a rectangular data matrix and an observation vector ,
(33) 
the linear least-squares (LS) problem is to find:
(34) 
A comprehensive description of the QR factorization of an matrix via Householder, Givens, fast Givens, classical Gram–Schmidt and modified Gram–Schmidt methods is given in [4], section 5.2. Here only the essential steps are presented.
Let be an orthogonal matrix. Such a matrix preserves lengths of vectors:
Given the real matrix , , the goal is to construct an orthogonal matrix such that
where is an upper-triangular matrix of the form
Obviously
The Mathematica function QRDecomposition deviates from the full QR factorization as follows:
is output as an matrix. The rows of are orthonormal. Only the upper-triangular submatrix is output.
Then the unique solution of the upper-triangular system is straightforward:
It gives the unique solution of the full-rank linear least-squares problem (34):
Since multiplication with an orthogonal matrix does not change the singular values, the condition numbers do not change either:
(35) 
This holds in particular for the Vandermonde matrices, both in the discrete and continuous case.
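The QR route to the least-squares solution can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses a thin QR factorization by modified Gram–Schmidt rather than Householder reflections or the Mathematica function QRDecomposition, and the names `thin_qr`, `back_substitute` and `lstsq_qr` are hypothetical. The factor Q is stored as a list of orthonormal columns, that is, as the rows of its transpose.

```python
import math


def thin_qr(A):
    """Thin QR by modified Gram-Schmidt for an m x n matrix of full column rank.

    Returns Q as a list of n orthonormal columns and R as an n x n
    upper-triangular matrix with A = Q R.
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]
    Q = []
    R = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(Q[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q[k])]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        Q.append([x / R[j][j] for x in v])
    return Q, R


def back_substitute(R, y):
    """Solve the upper-triangular system R c = y."""
    n = len(y)
    c = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = y[i] - sum(R[i][k] * c[k] for k in range(i + 1, n))
        c[i] = s / R[i][i]
    return c


def lstsq_qr(A, b):
    """Least-squares solution of A c ~ b via R c = Q^T b."""
    Q, R = thin_qr(A)
    y = [sum(q * bi for q, bi in zip(col, b)) for col in Q]
    return back_substitute(R, y)


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
A = [[1.0, x, x * x] for x in grid]                  # Vandermonde matrix
b = [1.0 - 2.0 * x + 3.0 * x * x for x in grid]      # samples of a quadratic
coeffs = lstsq_qr(A, b)
print(coeffs)
```

Because the sampled data come from a quadratic, the fit with a degree-2 polynomial should recover the coefficients 1, -2, 3 up to rounding.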
This defines and .
Applying the classical Gram–Schmidt orthonormalization process in a pre-Hilbert space, described in [2], p. 88 ff., to the monomials in gives an orthonormal system of polynomials
(36) 
that satisfy
(37) 
The Fourier coefficients of any function with respect to this orthonormal system are defined according to [2], p. 86, equation (1) (the dot here denotes the placeholder for the integration argument in the scalar product):
The best approximation to is given as the Fourier sum of terms:
The orthonormal system of polynomials is given by the function . The functions and are also defined.
This defines the polynomials .
These polynomials differ from the classical Legendre polynomials LegendreP built into Mathematica only by normalization factors.
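The relation to the Legendre polynomials can be checked with a short exact computation. The sketch below is an illustration, not the article's code: it assumes the interval [-1, 1], represents polynomials as coefficient lists (constant term first), and orthogonalizes the monomials without normalizing them, so that all arithmetic stays rational. The function names `inner` and `gram_schmidt_monomials` are hypothetical.

```python
from fractions import Fraction


def inner(p, q, a=-1, b=1):
    """Exact integral of p(x) q(x) over [a, b] for coefficient lists p, q."""
    total = Fraction(0)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            k = i + j + 1
            total += p_i * q_j * (Fraction(b) ** k - Fraction(a) ** k) / k
    return total


def gram_schmidt_monomials(n, a=-1, b=1):
    """Classical Gram-Schmidt on 1, x, ..., x^n (orthogonal, not normalized)."""
    basis = []
    for k in range(n + 1):
        monomial = [Fraction(0)] * k + [Fraction(1)]
        p = monomial[:]
        for q in basis:
            # projection coefficient computed from the original monomial
            factor = inner(monomial, q, a, b) / inner(q, q, a, b)
            for idx, q_c in enumerate(q):
                p[idx] -= factor * q_c
        basis.append(p)
    return basis


polys = gram_schmidt_monomials(3)
for p in polys:
    print(p)
```

The degree-2 and degree-3 results are x^2 - 1/3 and x^3 - (3/5) x, which are the classical Legendre polynomials (3x^2 - 1)/2 and (5x^3 - 3x)/2 divided by their leading coefficients. Dividing each polynomial by the square root of its own scalar product would give an orthonormal system.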
This shows the Fourier approximation of a sample set of functions.
Proposition 1
(38) 
(39) 
Proof
Theorem 4
Proof
For this QR factorization, the inverse of the upper-triangular matrix is already calculated by the Gram–Schmidt process:
Equations (26), (28) and (35) give the relations on the 2-condition numbers of the operators or matrices:
Corollary 1
The numerical instability of the classical Gram–Schmidt process in machine arithmetic is discussed in [4], section 5.2.8. However, since the Gram–Schmidt orthonormalization of the monomials with respect to the scalar product can be performed by symbolic calculations, numerical algorithm stability is not an issue here, contrary to the dimensional scalar product .
This defines the Gram–Schmidt coefficient matrix .
This defines the lower- and upper-triangular matrices of the QR decomposition.
This defines the orthogonal matrix .
Here is a set of orthonormal polynomials (, ).
This gives the Gram–Schmidt coefficient matrix , with .
Multiply the matrix of monomials from the right or left.
This reproduces the orthogonal matrix .
By construction, the polynomials contained in the matrix columns are orthonormal with respect to the scalar product in .
This verifies the QR decomposition , as in Theorem 4.
Select one of the sample functions and compare the results from the Gram–Schmidt orthonormalization and QR factorization interpretation.
For the continuous case, because , the singular values of equal the singular values of .
Here is the case of a discrete grid.
Obviously, is the lower-triangular Cholesky factor of the .
Consequently, the 2-condition number of the Vandermonde matrix of size on is the square root of the 2-condition number of .
This gives summary results.
The approach for deriving the normal equations for the least-squares problem (34) is described in [4], section 5.3, for example. Define
A necessary condition for a minimum is , or equivalently,
(40) 
These are called the normal equations. The minimum is unique if the Hessian matrix has full rank . Then there is a unique solution of the linear least-squares problem (34) or (40):
(41) 
For an equally spaced , defined as in equation (5), the Vandermonde matrix has full rank, so the Hessian matrix for polynomial approximation has full rank as well.
If is rank deficient (), then there are an infinite number of solutions to the least-squares problem. There is still a unique solution with minimum 2-norm, which can be written with the pseudoinverse matrix ([4], section 5.5):
For full rank, .
Suppose with (see [4], section 5.3.1). Then this algorithm computes the unique solution to the linear least-squares problem (34):
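A minimal Python sketch of the normal equations method is given here, under the assumption that the data matrix has full column rank: form the Hessian and the right-hand side, compute the Cholesky factorization of the Hessian, and solve two triangular systems. The function name `normal_equations_solve` and the variable names are chosen for this illustration and are not taken from [4] or from the article's code.

```python
import math


def normal_equations_solve(A, b):
    """Solve min ||A x - b||_2 through A^T A x = A^T b and Cholesky."""
    m, n = len(A), len(A[0])
    # Hessian C = A^T A and right-hand side d = A^T b
    C = [[sum(A[i][r] * A[i][s] for i in range(m)) for s in range(n)]
         for r in range(n)]
    d = [sum(A[i][r] * b[i] for i in range(m)) for r in range(n)]
    # Cholesky factorization C = G G^T with G lower triangular
    G = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for s in range(r + 1):
            acc = C[r][s] - sum(G[r][k] * G[s][k] for k in range(s))
            G[r][s] = math.sqrt(acc) if r == s else acc / G[s][s]
    # forward substitution: G y = d
    y = [0.0] * n
    for r in range(n):
        y[r] = (d[r] - sum(G[r][k] * y[k] for k in range(r))) / G[r][r]
    # back substitution: G^T x = y
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (y[r] - sum(G[k][r] * x[k] for k in range(r + 1, n))) / G[r][r]
    return x


grid = [0.0, 0.25, 0.5, 0.75, 1.0]
A = [[1.0, t, t * t] for t in grid]                  # Vandermonde matrix
b = [1.0 - 2.0 * t + 3.0 * t * t for t in grid]      # samples of a quadratic
solution = normal_equations_solve(A, b)
print(solution)
```

Forming the Hessian squares the 2-condition number of the data matrix, so this route is expected to lose more digits than a QR-based solution as the polynomial degree grows.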
The approach for deriving the normal equations for the case applies with one modification to the continuous least-squares approximation equation (12) (see [6]):
The matrix transpose has to be replaced by the adjoint operator :
A necessary condition for a minimum is , or equivalently,
(42) 
These are called the normal equations.
(43) 
is called the Hessian matrix. The minimum is unique if has full rank . The elements can be calculated via equation (31):
Obviously, the Hessian matrix is symmetric and positive semidefinite. Since has full rank for any nonzero interval , has full rank as well (by equation (32)) and is therefore positive definite. Then there exists a unique solution of equations (42) and (12).
Finally, calculate the elements on the right-hand side of via equation (30):
This subsection investigates under which conditions and how fast the polynomial approximations on discrete grids converge to the continuous polynomial approximation in .
For an equally spaced , the normal equations, multiplied by the step size , read
(44) 
Define
(45) 
then by equation (7), the matrix elements of are just the Riemann sums for the matrix elements (integrals) of . Therefore, the Hessian matrices on the discrete grid converge to the continuous Hessian matrix in any matrix norm according to:
Define
(46) 
then by equation (7), the elements of are just the Riemann sums for the elements of , the moments of . Therefore, the right-hand side of the normal equations on the discrete grid converges to the right-hand side of the continuous normal equations in any vector norm according to:
(47) 
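The convergence of the discrete Hessian matrices can be observed in a small exact computation. The sketch below assumes the interval [0, 1], where the continuous Hessian has the entries 1/(i + j + 1), and uses the midpoint grid; the function names and the choice of grid sizes are assumptions for this illustration. Halving the step size should reduce the largest entrywise error by a factor close to 4, that is, an observed convergence order close to 2.

```python
from fractions import Fraction


def continuous_hessian(n):
    """Entries: integral of x^(i+j) over [0, 1] = 1 / (i + j + 1)."""
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]


def discrete_hessian(n, N, shift=Fraction(1, 2)):
    """Riemann sums of x^(i+j) on an equally spaced grid with N points."""
    h = Fraction(1, N)
    grid = [(i + shift) * h for i in range(N)]
    return [[h * sum(x ** (i + j) for x in grid) for j in range(n)]
            for i in range(n)]


def max_error(n, N):
    """Largest entrywise difference between the two Hessians."""
    H, HN = continuous_hessian(n), discrete_hessian(n, N)
    return max(abs(H[i][j] - HN[i][j]) for i in range(n) for j in range(n))


errors = [max_error(3, N) for N in (8, 16, 32)]
ratios = [float(errors[k] / errors[k + 1]) for k in range(2)]
print([float(e) for e in errors], ratios)
```

Passing `shift=Fraction(0)` to `discrete_hessian` gives the left-hand sums instead, for which the same experiment would be expected to show ratios close to 2.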
Proposition 2
Proof
From equation (42), the solution of the polynomial approximation in is
From equation (44), the solution for the discrete grid is
(48) 
For the matrix inverses
Expanding the difference of the solution coefficient vectors completes the proof:
Theorem 5
Proof
From Theorem 4 and because ,
By [4], Theorem 4.2.7, the Cholesky decomposition of a symmetric positive-definite square matrix is unique.
Equations (24) and (26) give the relation for the 2-condition numbers of the matrices.
Corollary 2
This defines the continuous and discrete Hessian matrices.
These are the right‐hand sides of the continuous and discrete normal equations.
This gives the solution of the normal equations system.
This gives the approximation polynomials for the continuous and discrete cases.
This gives the Gram–Schmidt coefficient matrix, its inverse and inverse transpose.
The matrix times its transpose equals the Hessian matrix of the normal equations.
Equivalently, is the lower-triangular and is the upper-triangular Cholesky factor of the Hessian matrix .
For , the Hessian matrix is identical to the Hilbert matrix of dimension :
Hilbert matrices are already ill-conditioned for dimensions (i.e. have condition numbers greater than about 10,000) and soon reach the limits of 64-bit IEEE arithmetic.
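One simple way to see how close Hilbert matrices are to being singular is to compute their determinants exactly. The following sketch does this with rational arithmetic and Gaussian elimination; it is an illustration only, the function names are hypothetical, and the determinant is used here as a rough indicator of near-singularity, not as a substitute for the 2-condition number.

```python
from fractions import Fraction


def hilbert(n):
    """Hilbert matrix with entries 1 / (i + j + 1), indices from 0."""
    return [[Fraction(1, i + j + 1) for j in range(n)] for i in range(n)]


def determinant(matrix):
    """Exact determinant by Gaussian elimination on Fractions."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for c in range(n):
        pivot = next((r for r in range(c, n) if a[r][c] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != c:
            a[c], a[pivot] = a[pivot], a[c]
            det = -det  # a row swap flips the sign
        det *= a[c][c]
        for r in range(c + 1, n):
            factor = a[r][c] / a[c][c]
            a[r] = [x - factor * y for x, y in zip(a[r], a[c])]
    return det


for n in range(1, 7):
    print(n, determinant(hilbert(n)))
```

The determinants shrink very quickly with the dimension, which fits the rapid growth of the condition numbers described above.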
Here are the summary results. It takes some time for .
This performs Gram–Schmidt orthonormalization for a special case. For other cases, change the 6 in to an integer between 1 and 10.
Here are the results from the normal equations.
The two solutions agree both symbolically and in exact arithmetic.
But there are numeric differences in IEEE 754 machine arithmetic.
These differences come from the lower error amplification expressed in a lower 2-condition number,
The numerical solution via the Gram–Schmidt orthonormalization solution is usually more accurate than the normal equations solution.
Here is the QR factorization. Again, for other cases, change the 3 in to an integer between 1 and 10.
Normal equations.
The difference between the numerical solutions is due to the difference in how their rounding errors propagate.
Because of the lower error amplification expressed in a lower 2-condition number, the numerical solution via QR factorization is more accurate than the normal equations solution.
This calculates the convergence order for grids of powers of 2.
Choose problem parameters. Yet again, for other cases, change the 7 in to an integer between 1 and 10.
Here are the approximation errors for grids of powers of 2.
Here is the convergence order. To see the result for another function, select another sample function in . Zero errors are suppressed.
For case 3, all but the first two elements of are zero. For case 5, all but the first element of are zero; therefore the logarithmic plots look incomplete.
Sample functions 3, 4, 5, 6 have discontinuities in the zeroth, first or second derivative; 7 and 8 have singularities in the first or second derivative. These sample functions and do not satisfy all the assumptions of Theorem 1. Therefore the convergence order of the right-hand side can be lower than 1 (respectively 2) as predicted by equation (47). Sample functions 1, 2, 9, 10 are infinitely often continuously differentiable; therefore they have maximum convergence order 1 (respectively 2) according to equation (47).
These are the approximation errors for grids of powers of 2.
This takes a few minutes.
This plots the convergence order.
Analyzing polynomial approximation, this article has systematically worked out the close relations between the solutions obtained by:
The interrelations are:
This article has analyzed the polynomial approximation of a real-valued function with respect to the least-squares norm in different settings:
Three different solution methods for this least-squares problem have been compared:
All definitions and solution methods were implemented in Mathematica 11.1.
All solution characteristics and their relations were illustrated by symbolic or numeric examples and graphs.
The author thanks Alexandra Herzog and Jonas Gienger for critical proofreading and the anonymous referees for valuable improvements of the paper. The paper evolved from a lecture series in numerical analysis, given by the author for ESOC staff and contractors from 2012 to 2016, using Mathematica:
[1]  E. Hewitt and K. Stromberg, Real and Abstract Analysis, New York: Springer-Verlag, 1975. 
[2]  K. Yosida, Functional Analysis, 5th ed., New York: Springer-Verlag, 1978. 
[3]  P. J. Davis and P. Rabinowitz, Methods of Numerical Integration, 2nd ed., London: Academic Press, 1984. 
[4]  G. H. Golub and C. F. Van Loan, Matrix Computations, 4th ed., Baltimore: The Johns Hopkins University Press, 2013. 
[5]  E. W. Weisstein. “Vandermonde Determinant” from Wolfram MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/VandermondeDeterminant.html. 
[6]  J. D. Faires and R. Burden, Numerical Methods, 3rd ed., Pacific Grove, CA: Thomson/Brooks/Cole, 2003. 
G. Gienger, “Polynomial Approximation,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-2. 
2011–2016 ESA/ESOC research and technology management office
2010–2011 ESOC navigation support office
1987–2009 Mathematical analyst in ESOC Flight Dynamics division; supported more than 20 space missions
1988 Received Dr. rer. nat. from Heidelberg University, Faculty of Science and Mathematics
1984–1987 Teaching assistant at University of Giessen
1982–1984 Research scientist at Heidelberg University
1975–1981 Studied mathematics and physics at Heidelberg University
Gottlob Gienger
European Space Operations Centre ESOC
Research and Technology Management Office OPSGX
Robert-Bosch-Strasse 5, 64293 Darmstadt, Germany
Retired staff
Gottlob.Gienger@gmail.com
The derivation of the scattering force and the gradient force on a spherical particle due to an electromagnetic wave often invokes the Clausius–Mossotti factor, based on an ad hoc physical model. In this article, we derive the expressions including the Clausius–Mossotti factor directly from the fundamental equations of classical electromagnetism. Starting from an analytic expression for the force on a spherical particle in a vacuum using the Maxwell stress tensor, as well as the Mie solution for the response of dielectric particles to an electromagnetic plane wave, we derive the scattering and gradient forces. In both cases, the Clausius–Mossotti factor arises rigorously from the derivation without any physical argumentation. The limits agree with expressions in the literature.
Recently, we made a theoretical study of a system to sort submicrometer dielectric spheres in the interference field of a laser in slowly flowing air [1, 2]. In the course of that project, we derived the scattering and gradient forces rigorously from Maxwell’s equations. The derivation was too detailed for that article, so we are presenting it here. The results agree with expressions from Harada and Asakura [3], as we show in the Appendix. In addition, we present the code we used to derive the force on a spherical particle used in [1, 2].
In this article, we develop code to generate the Mie scattering coefficients and the stress tensor formulas, and we combine them to form first the scattering force and then the gradient force. The scattering force comes first because it requires an incident plane wave, whereas the gradient force requires an incident standing wave that is a little more difficult to set up.
The solution for the response of a spherical dielectric particle in a vacuum was given more than 100 years ago [4]. The problem has been studied extensively by Bohren and Huffman [5]. Our formulation follows the textbook of Zangwill [6]. Since the problem is so well studied, we go directly to the solution. We omit an implicit time dependence in the plane wave traveling in the positive direction with the electric field linearly polarized in the direction. The electric field and magnetic field are thereby given by
where and are scalar functions related to the transverse electric and transverse magnetic parts of the solution, respectively, is the speed of light, and is a point in space. The International System of Units (SI) is used; in these units has the same dimensions as . Compared to the equations in Zangwill, the overall sign differs here, and we have inserted a factor of to simplify the implementation. We can obtain from by the substitution ; we will use this relationship later to simplify the calculation. The scalar function is given by
Here, and are the usual spherical coordinates, and the are associated Legendre polynomials. If , where is a spherical Bessel function, the expressions describe the incident fields for an incident plane wave with spatial dependence . The sum of the incident and scattered fields is given by
for , where is a spherical Hankel function. For , we substitute and in the previous two expressions.
Explicit forms for the Mie coefficients and are given here for a particle with index of refraction and radius . We restrict attention to the case of a nonmagnetic dielectric sphere:
where and are Riccati–Bessel functions; a prime denotes differentiation with respect to the argument. The total scattering cross section is given by
written here in dimensionless form by incorporating the geometric cross section of the spherical particle; this form is commonly denoted by .
The function calculates the Mie coefficients and using equations (5) and (6); the table of those coefficients is . The functions and are and , respectively. The variable (with maximum value ) is an index used for Legendre coefficients in physics. The variable is , the product of the wavevector and the particle radius ; ranges from to with step size . The function returns a list pairing with the cross section normalized to the geometry. A simpler threeargument function calls a fiveargument function. The parameter helps to tell how many terms to calculate to achieve convergence; 1.5 seems to work well, but the reader may wish to test this.
Next, we plot the Mie cross section, similar to the one found in Zangwill [6] and Bohren and Huffman [5]. This is slow: it took 442 seconds on a 3.7 GHz personal computer. The code could be written to run significantly faster, but it would become more cryptic. The main point here is to verify the correctness of the code and to clarify the exposition. The red line is the asymptotic value for large , a dimensionless parameter comparing the particle size to the wavelength. The fact that this value is exactly twice the geometric cross section is discussed in [3, 4].
Figure 1. Mie cross section for a sphere of radius and wavevector compared to the geometric cross section and its large limit, in red. For the physical significance of the sharp peaks, see [7].
Here we develop the Taylor expansion of the Mie coefficients for small . This subsection confirms equations (8–11). The results are used in Section 5.
We start with a simplification.
The following definitions are motivated by equations (5) and (6).
Later, we will find we need these for . The index of refraction is .
The Taylor series is taken next. A high order is necessary even for the small limit. In many cases, the spherical Neumann function enters the calculation, which leads to divergence in this limit. The real and imaginary parts are represented by the suffixes 1 or 2 in the variable names. The series expansions are chosen so as to include terms to the lowest nonvanishing order.
The results are given here as mathematical formulas, which should agree with results in the cell above. Superscripts 1 and 2 refer to real and imaginary parts.
For a harmonic electromagnetic field, the time-averaged electromagnetic force on a dielectric particle in a vacuum is given by [1, 2, 4]
where the angle brackets mean the time average, is the differential of the surface normal, and the integral is taken over any surface enclosing the particle. Thus the Maxwell stress tensor is given by
where
and where is the electric constant, is a dyadic, and is the identity matrix. The companion matrix is given by with . (The magnetic constant used in the references is related by .)
Since the particle is spherical, it is natural to do the integral on the surface of a sphere at radius . Since we are in a vacuum, the electromagnetic field exerts no force there. Looking ahead, we will check to see that the result is independent of . The direction of is , so it is sufficient to calculate a mixed-basis tensor component , where refers to the component along and is a Cartesian direction. Rewriting equation (12) in spherical coordinates leads to
We accept the complexity of the mixed-basis tensor because the dot product leads to a single component in spherical coordinates, but the integral requires coordinates that are independent of the point of integration. In practice, these are Cartesian. Looking ahead, we will see that only the component will be nonzero, which further motivates the choice. The calculation will proceed by forming the electric field in spherical coordinates and making a row of , where the index implies spherical coordinates. The vector will be transformed to Cartesian coordinates by right-multiplying by a rotation matrix. (The same process is used for the magnetic field .)
The effect of the time average is the following: given , . The stress tensor will be computed from the second term. The first term is included later by adding the complex conjugate.
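The time-average rule for products of harmonic fields can be checked numerically. The sketch below assumes the time dependence exp(-i w t) and two arbitrary complex amplitudes A and B chosen only for illustration; it compares the average of the product of the real fields over one period with the closed form (A B* + A* B)/4, which equals Re(A B*)/2. The function name `time_average_product` is hypothetical.

```python
import cmath
import math


def time_average_product(A, B, samples=1000):
    """Average of Re(A e^{-i w t}) * Re(B e^{-i w t}) over one period."""
    total = 0.0
    for k in range(samples):
        phase = cmath.exp(-2j * math.pi * k / samples)
        total += (A * phase).real * (B * phase).real
    return total / samples


A = 1.0 + 2.0j    # illustrative complex amplitudes
B = -0.5 + 0.3j
numeric = time_average_product(A, B)
closed_form = 0.25 * (A * B.conjugate() + A.conjugate() * B).real
print(numeric, closed_form)
```

The two terms in the closed form are complex conjugates of each other, so it suffices to compute one of them and add its conjugate afterward.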
For the scattering force, we take the incident electric field to be a plane wave traveling in the positive direction and linearly polarized in the direction. Explicitly,
As discussed, we need to transform a vector from spherical to Cartesian coordinates. We give that function first.
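For orientation, here is a Python sketch of such a transformation, written as a row vector of spherical components multiplied from the right by a rotation matrix whose rows are the spherical unit vectors expressed in Cartesian components. The function name `spherical_to_cartesian` is an assumption for this illustration, and the usual physics convention (polar angle theta, azimuthal angle phi) is assumed.

```python
import math


def spherical_to_cartesian(v_sph, theta, phi):
    """Convert components (v_r, v_theta, v_phi) at angles (theta, phi)
    to Cartesian components (v_x, v_y, v_z)."""
    st, ct = math.sin(theta), math.cos(theta)
    sp, cp = math.sin(phi), math.cos(phi)
    rotation = [
        [st * cp, st * sp, ct],    # r-hat in Cartesian components
        [ct * cp, ct * sp, -st],   # theta-hat
        [-sp, cp, 0.0],            # phi-hat
    ]
    # row vector times matrix
    return [sum(v_sph[i] * rotation[i][j] for i in range(3)) for j in range(3)]


half_pi = math.pi / 2
r_hat = spherical_to_cartesian([1.0, 0.0, 0.0], half_pi, 0.0)
theta_hat = spherical_to_cartesian([0.0, 1.0, 0.0], half_pi, 0.0)
phi_hat = spherical_to_cartesian([0.0, 0.0, 1.0], half_pi, 0.0)
print(r_hat, theta_hat, phi_hat)
```

On the positive x axis the radial unit vector points along x, the polar unit vector along negative z and the azimuthal unit vector along y, which the example reproduces up to rounding.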
The electric or magnetic stress tensor is computed from the following code. is a factor required for the time average.
gets the component, since the field components have the usual order (, , ); is the complex conjugate, but we need to help the kernel find simplifications. (The magnetic part will be found by symbolic substitution.) The conjugation is performed as follows.
is used for real expansion.
Following are some additional simplifications that we will use. We operate in the dimensionless radial variable called , which is (i.e. times ), where and do not enter the calculation independently. The assumptions are made because of the range of integration.
We introduced extra factors of into equations (1) and (2) because we implement rather than to enable the use of . The operation is named , the curl for a dimensionless radial variable.
Next we define a function to create the electric and magnetic fields (“em”), omitting the prefactor . Before the final answer for the force is obtained, the factor will be reintroduced as the variable .
This gives a simplified interface for a positive plane wave.
It is sufficient to set for both the scattering and gradient forces, since both occur in the limit of small particles. Setting does not capture all contributing terms to lowest order, and setting or higher does not cause the limit to change. For the scattering, we will have , which is for a plane wave going in the positive direction. For the gradient force where we have a standing wave, we decompose that into the sum of a plane wave going toward and one going toward . The backward wave is still polarized in the direction, so we produce it by mapping . Flipping is necessary because the direction of travel is ; if we wish to preserve , it is necessary to change the sign of . We will always set and ; these may be set to zero to suppress the external field or the induced field, respectively.
We next determine the electric and magnetic fields. Although we could increase , the run time increases dramatically. For example, the case of from [1, 2] was run overnight.
For , use 1 for a quick test, 2 to get the main results, and 3 or more to confirm that a higher-order expansion does not change the limit. The run time increases rapidly as is increased. Run times can vary, but typically it is best to start the code and come back to it after a few minutes to a couple of hours. The parameter is used both here and for the gradient force in Section 4.
Next, we make rows from the electric part and magnetic part (multiplied by ) of the stress tensor in Cartesian coordinates.
We do the integrals over next, yielding the three Cartesian components of the electric part of the stress tensor, after azimuthal integration.
The azimuthal integrals of the and components of the electric part of the stress tensor are zero, as they should be. The component is nonzero in general and will be used later.
The magnetic terms are similar.
The integral over is done next. Although these variables contain the word , they lack some constant factors to be forces, hence the prefix . These factors will be included after a few manipulations. The notation ending with is for the component. The successive variables lack the constant, but the final result is the force.
Combining the electric and magnetic terms into a total force at this point leads to some cancellations, so it is convenient to do it next.
Up to this point, we have done the manipulations without telling the kernel that and are, in fact, spherical Bessel functions and . Moreover, we use (a complex function of real arguments) because we will be obtaining real and imaginary parts shortly.
As discussed previously, we have calculated only one of two terms in the force. Next, we add the complex conjugate.
We introduce real and imaginary parts for the Mie coefficients.
The expression is still quite complicated. However, knows (for ) that the Wronskian of the spherical Bessel functions [8]. We will see that the force does not depend on the radius of the integration sphere, as long as . We expect this on physical grounds: there should be no force on the vacuum region surrounding the physical sphere. Mathematically, the Wronskian plays a key role in achieving this condition.
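The Wronskian identity cited from [8] can be checked numerically. The following is a minimal Python sketch, not the article's Mathematica code; it assumes the standard form of the identity, j_n(x) y_n'(x) - j_n'(x) y_n(x) = 1/x^2, and uses closed forms for n = 0 and n = 1 only. The function names are illustrative.

```python
import math

def j(n, x):
    """Spherical Bessel function of the first kind, closed forms for n = 0, 1."""
    if n == 0:
        return math.sin(x) / x
    if n == 1:
        return math.sin(x) / x**2 - math.cos(x) / x
    raise ValueError("this sketch only implements n = 0 and n = 1")

def y(n, x):
    """Spherical Bessel function of the second kind, closed forms for n = 0, 1."""
    if n == 0:
        return -math.cos(x) / x
    if n == 1:
        return -math.cos(x) / x**2 - math.sin(x) / x
    raise ValueError("this sketch only implements n = 0 and n = 1")

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def wronskian(n, x):
    """j_n(x) y_n'(x) - j_n'(x) y_n(x), evaluated numerically."""
    dj = derivative(lambda t: j(n, t), x)
    dy = derivative(lambda t: y(n, t), x)
    return j(n, x) * dy - dj * y(n, x)

for n in (0, 1):
    for x in (0.7, 2.0, 5.3):
        print(n, x, wronskian(n, x), 1 / x**2)
```

The two printed columns agree to within the finite-difference error, independent of the order n.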
We next include some constant prefactors, namely , and . The factors arise from the definition of the stress tensor. The factor comes from the fact that when we performed the two angular integrals over the sphere, we did not yet include the dimensional constant associated with the area element. This extra factor of is the only exception to the radius appearing in the dimensionless variable . The substitution allows powers of to be canceled.
The real and imaginary parts of the Mie coefficient are and , respectively, and similarly, for they are and .
In the previous section, we used the general form for the electric and magnetic fields for the Mie expansion as input to the Maxwell stress tensor to derive the force on a particle. We do the same in this section, borrowing from the previous section to the extent possible. However, the spherical particle is in a standing wave field instead of a plane wave traveling in the direction. The standing wave is a linear superposition of two plane waves going toward and toward , so we need a solution from the latter field. We obtain this by symmetry from the existing solution. Sending does what is needed: the polarization of is unchanged, but the direction of propagation changes sign. This is implemented with the wrapper to the function described in Section 3.
Although the standing wave is the sum of two plane waves, more precisely, a phased sum is required. Although we do not show it here, the answer is proportional to . This is because the dielectric sphere is located at the coordinate origin, and we need to have a maximum of the electric field there. We do this as follows. (In the variable names, ▽ stands for gradient.)
From here, the manipulation is the same as for the scattering force, so the code is written without further explanation. Certain terms integrate to zero.
First, the electric terms in the stress tensor are calculated. As previously, the and components are 0.
Second, the magnetic terms in the stress tensor are calculated. Again, the and components are 0 and the component is nonzero in general.
Finally, the electric and magnetic terms are combined and the force is found.
Our next task is to determine the scattering and gradient forces on spheres up to the leading order in . Physically, these are spheres that are small compared to a wavelength. We combine the results of Section 3 for the scattering force and Section 4 for the gradient force with those of Section 2 for the Mie coefficients. In Section 1.2, we showed that , , , , , , and using superscripts and for the real and imaginary parts, respectively. Therefore, the scattering force, to lowest order in , retains only the terms in , and for the gradient force, the lowest-order coefficient is . The terms were given in equation (9). The key point is that the Clausius–Mossotti factor appears with the second and first powers in the two terms. The result falls out of a Taylor expansion without any appeal to physical arguments about the response of dipoles. All of the terms are proportional to , leading to the physically required result that if the sphere in fact contains a vacuum , there is no electromagnetic response and thus no force. However, the denominator is characteristic only of the terms giving the lowest-order response. Higher terms have different dependencies on the index of refraction , such as , as seen previously. The gradient and scattering forces properly exist only in the limit of small .
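To make the role of the Clausius–Mossotti factor concrete, here is a short Python sketch. It assumes the textbook Rayleigh expressions for a sphere of radius a and index m in a vacuum: the factor (m^2 - 1)/(m^2 + 2) and the scattering cross section (8 pi/3) k^4 a^6 times the factor squared. These are standard forms rather than a transcription of the article's equations, and the function names are illustrative.

```python
import math

def clausius_mossotti(m):
    """Clausius-Mossotti factor (m^2 - 1)/(m^2 + 2) for refractive index m."""
    return (m**2 - 1) / (m**2 + 2)

def rayleigh_scattering_cross_section(k, a, m):
    """Textbook Rayleigh cross section (8 pi / 3) k^4 a^6 CM(m)^2 (assumed form)."""
    return (8 * math.pi / 3) * k**4 * a**6 * clausius_mossotti(m)**2

k = 2 * math.pi / 532e-9   # wavenumber for an illustrative 532 nm wavelength
a = 20e-9                  # illustrative sphere radius, small compared to the wavelength
print(clausius_mossotti(1.0))                     # a vacuum "sphere" gives no response
print(rayleigh_scattering_cross_section(k, a, 1.5))
```

The factor enters the cross section squared, matching the second power quoted for the scattering force, and it vanishes identically when m = 1.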
Having selected the lowestorder terms for the two force expansions, these are the forces to lowest order in .
These formulas agree with those given by Harada and Asakura [3], as shown in the Appendix.
Our goal was to derive the scattering force and the gradient force rigorously from the Mie solution and the Maxwell stress tensor. We began by presenting the Mie solution for a plane wave incident on a dielectric sphere in a vacuum and showing that our implementation matches a figure from a textbook. We then presented the formula for the force in terms of a surface integral of the Maxwell stress tensor, which we take on a sphere of arbitrary radius centered on and including the whole dielectric sphere. We analyzed the scattering force first, giving a formula for the force in terms of the Mie coefficients and then taking the limit as the radius of the sphere tends to zero. This yields agreement with the usual formula for the scattering force in a vacuum. The forces were reformulated for a standing wave, and a similar program was carried out, leading to agreement for the gradient force.
The Clausius–Mossotti term that appears in expressions for the scattering force and the gradient force is seen to be implicit in the rigorous Mie solutions to Maxwell’s equations. By finding the Maxwell stress tensor for a plane wave or a standing wave acting on a dielectric sphere, we are able to show the response in lowest order is in agreement with a widely used formula for the scattering force and the gradient force, respectively. Since part of the derivation includes a tenth-order Taylor expansion of special functions, it is difficult to see how the result could be obtained without computer-assisted algebra.
First, we wish to match equation (17) to the scattering force as given in equation (12) of [3], hereafter called equation (HA12). Variables with superscript (HA) are from the reference. We consider only the case of the external medium being a vacuum, so . This implies , the index in the sphere, so that the Clausius–Mossotti factor is present in both our equation (17) and equation (HA12). The same holds for equation (18) and equation (HA16). Next, [3] considers a Gaussian beam profile, so we simply pick the point in the middle, setting . The factor is the intensity at the center of the beam . The beam intensity is related to the electrical field by . Given these expressions, all the factors can be matched by inspection.
Next, we wish to match equation (18) to the gradient force given in equation (HA16). The one additional equation to note is that . The factor of 2 occurs in the numerator because, by our definition, represents the field of one of the two interfering beams. The 2 in the denominator appears in the intensity-field conversion equation. The twos in are due to the physical fact that a standing plane wave interference pattern is periodic with half the wavelength of the electric fields of the plane waves of which the interference pattern is composed. Again, a match can be made by inspection.
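The half-wavelength periodicity of the interference pattern is easy to confirm numerically. This Python sketch assumes two unit-amplitude scalar plane waves exp(ikz) and exp(-ikz) with equal phase at the origin; the function name is illustrative.

```python
import cmath
import math

def standing_wave_intensity(z, wavelength):
    """|exp(ikz) + exp(-ikz)|^2 for two counterpropagating unit-amplitude waves."""
    k = 2 * math.pi / wavelength
    field = cmath.exp(1j * k * z) + cmath.exp(-1j * k * z)
    return abs(field) ** 2

wavelength = 1.0
for z in (0.0, 0.13, 0.25):
    # The intensity repeats after half a wavelength and equals 4 cos^2(kz).
    print(z, standing_wave_intensity(z, wavelength),
          standing_wave_intensity(z + wavelength / 2, wavelength))
```

Each field component changes sign after half a wavelength, so the squared magnitude is unchanged.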
Eric L. Shirley provided a key step in the derivation.
[1]  J. J. Curry and Z. H. Levine, “Continuous-Feed Optical Sorting of Aerosol Particles,” Optics Express, 24(13), 2016 pp. 14100–14123. doi:10.1364/OE.24.014100.
[2]  Z. H. Levine and J. J. Curry, “OptSortSph: Optical Sorting in a Standing Wave Field Calculated with Effective Velocities and Diffusion Constants,” Journal of Research of the National Institute of Standards and Technology, 121, 2016 pp. 420–421. doi:10.6028/jres.121.020.
[3]  Y. Harada and T. Asakura, “Radiation Forces on a Dielectric Sphere in the Rayleigh Scattering Regime,” Optics Communications, 124(5–6), 1996 pp. 529–541. doi:10.1016/0030-4018(95)00753-9.
[4]  G. Mie, “Beiträge zur Optik trüber Medien, speziell kolloidaler Metallösungen,” Annalen der Physik, 330(3), 1908 pp. 377–445. onlinelibrary.wiley.com/doi/10.1002/andp.19083300302/pdf.
[5]  C. F. Bohren and D. R. Huffman, Absorption and Scattering of Light by Small Particles, New York: Wiley, 1998.
[6]  A. Zangwill, Modern Electrodynamics, Cambridge: Cambridge University Press, 2013.
[7]  P. Chylek, J. T. Kiehl and M. K. W. Ko, “Optical Levitation and Partial-Wave Resonances,” Physical Review A, 18(5), 1978 pp. 2229–2233. doi:10.1103/PhysRevA.18.2229.
[8]  M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions, 10th printing, New York: Dover, 1972, Equation (10.1.6).

Z. H. Levine and J. J. Curry, “Scattering and Gradient Forces from the Electromagnetic Stress Tensor Acting on a Dielectric Sphere,” The Mathematica Journal, 2017. dx.doi.org/doi:10.3888/tmj.19-1.
Zachary H. Levine is a physicist at NIST and a Fellow of the American Physical Society. He has concentrated on various aspects of the interaction of light and matter, including the calculation of dielectric constants, second harmonic coefficients and higher optical tensors.
J. J. Curry is an electrical engineer at the National Institute of Standards and Technology. His current interests include manipulation and measurement of particles using optical forces. In recent years, he has worked on problems of interest to the lighting industry.
Zachary H. Levine
The National Institute of Standards and Technology
Gaithersburg, MD 20899-8410
zlevine@nist.gov
J. J. Curry
The National Institute of Standards and Technology
Gaithersburg, MD 20899-8441
john.curry@nist.gov
This article illustrates how Mathematica can be employed to model stochastic processes via stochastic differential equations to compute trajectories and their statistical features. In addition, we discuss parameter estimation of the model via the maximum likelihood method with global optimization. We consider handling modeling error, system noise and measurement error, compare the stochastic and deterministic models and use the likelihoodratio test to verify the possible improvement provided by stochastic modeling. The Mathematica code is simple and can be easily adapted to similar problems.
Stochastic differential equations (SDEs) have received great attention in a large number of fields, including finance, physics, system biology, biochemical processes and pharmacokinetics. SDEs serve as a natural way of introducing uncertainty into a deterministic model represented by ordinary differential equations (ODEs). In contrast to the classical approach where uncertainty only exists in the measurements, SDEs can provide a more flexible framework to account for deviation in states and parameters that describe the underlying system. For an introduction to the theory and numerical solution of SDEs, see [1].
In this article, we illustrate how Mathematica deals with the problem of simulation and parameter estimation of such systems, represented here by the FitzHugh–Nagumo model describing excitable media. The first section sets up the model equations. The second section deals with the simulation of the stochastic system and computes the trajectories, their distribution at a specified time point and the change of the statistical data of these distributions in time, namely the trajectory of the mean value and standard deviation. The third section describes the parameter estimation of the model via the maximum likelihood method (ML). The last section analyzes the goodness of fit between deterministic and stochastic models using the likelihood ratio test (LRT).
The FitzHugh–Nagumo model for excitable media is a nonlinear model describing the reciprocal dependencies of the voltage across an axon membrane and a recovery variable summarizing outward currents. The model was developed in [2]. The model is general and can also describe other excitable media, for example, heart tissue. The deterministic ODE model (white-box model) is described by the following system of ordinary differential equations:
(1) 
(2) 
with parameters , and and initial condition and .
White-box models are mainly constructed on the basis of physical knowledge of the system. Solutions to ODEs are deterministic functions of time, and hence these models are built on the assumption that the future value of the state variables can be predicted exactly. An essential part of model validation is the analysis of the residual errors (the deviation between the true observations and the one-step predictions provided by the model). This validation method is based on the fact that a correct model leads to uncorrelated residuals. This is rarely obtainable for white-box models. Hence, in these situations, it is not possible to validate ODE models using standard statistical tools. However, this problem can be solved by replacing ODEs with a slightly more advanced type of equation, SDEs, which incorporate the stochastic behavior of the system: modeling error, unknown disturbances, system noise and so on.
The stochastic (SDE) gray-box model can be considered as an extension of the ODE model by introducing system noise:
(3) 
(4) 
where is a Wiener process (also known as Brownian motion), a continuous-time random walk. The next section carries out the numerical simulation of the SDE model using the parameter settings , and .
Equations (3) and (4) represent an Itô stochastic process that can be simulated in Mathematica employing a stochastic Runge–Kutta method.
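For readers without Mathematica, the idea of simulating one realization can be sketched in Python with the simpler Euler–Maruyama scheme (not the stochastic Runge–Kutta method used in the article). The drift terms below are one common parametrization of the FitzHugh–Nagumo model, with noise added to the voltage equation only; this form, the parameter names alpha, beta, gamma, sigma, and the numerical values are assumptions for illustration and are not taken from equations (3) and (4).

```python
import math
import random

def simulate_fhn(alpha, beta, gamma, sigma, v0, r0, t_end, n_steps, seed=0):
    """Euler-Maruyama path of an assumed stochastic FitzHugh-Nagumo system.

    dV = gamma * (V - V^3/3 + R) dt + sigma dW
    dR = -(V - alpha + beta * R) / gamma dt
    Returns a list of (t, V, R) tuples.
    """
    rng = random.Random(seed)
    dt = t_end / n_steps
    v, r = v0, r0
    path = [(0.0, v, r)]
    for i in range(1, n_steps + 1):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment with variance dt
        dv = gamma * (v - v**3 / 3 + r) * dt + sigma * dw
        dr = -(v - alpha + beta * r) / gamma * dt
        v, r = v + dv, r + dr
        path.append((i * dt, v, r))
    return path

# Illustrative parameter values (hypothetical, chosen only to run the sketch).
path = simulate_fhn(alpha=0.2, beta=0.2, gamma=3.0, sigma=0.1,
                    v0=-1.0, r0=1.0, t_end=20.0, n_steps=2000, seed=1)
print(path[-1])
```

Setting sigma to zero recovers an explicit Euler integration of the deterministic model, which gives a quick consistency check on the scheme.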
First, a single realization is simulated in the time interval .
Figure 1. The trajectories of the state variables (blue) and (brown) in the case of a single realization of the Itô process.
The values of and can be found at any time point. Here is an example.
This computes the trajectories for 100 realizations.
Figure 2. The trajectories of the state variables (blue) and (brown) in the case of 100 realizations of the Itô process.
Slice distributions of the state variables can be computed at any time point. First, let us simulate the trajectories in a slightly different and faster way, now using 1000 realizations.
At the time , we can compute the mean and standard deviation for both state variables and , as well as their histograms.
Figure 3. The distribution of (left) and (right) in the case of 100 realizations of the Itô process, at time .
The mean value and the standard deviation of the trajectories along the simulation time can also be computed and visualized. (This may take a few minutes.)
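The computation of a mean trajectory and its standard deviation amounts to taking statistics across realizations at each time step. The Python sketch below illustrates this with plain Brownian motion as a stand-in process (an assumption made to keep the example short); the function names are illustrative.

```python
import math
import random
import statistics

def brownian_paths(n_paths, n_steps, t_end, seed=0):
    """Generate n_paths discretized Brownian paths on [0, t_end]."""
    rng = random.Random(seed)
    dt = t_end / n_steps
    paths = []
    for _ in range(n_paths):
        w, path = 0.0, [0.0]
        for _ in range(n_steps):
            w += rng.gauss(0.0, math.sqrt(dt))
            path.append(w)
        paths.append(path)
    return paths

def mean_and_std_over_time(paths):
    """Return [(mean, std), ...] across realizations for every time index."""
    return [(statistics.mean(col), statistics.stdev(col)) for col in zip(*paths)]

stats = mean_and_std_over_time(brownian_paths(1000, 20, 1.0, seed=1))
print(stats[-1])  # mean near 0 and standard deviation near sqrt(t_end) = 1
```

For Brownian motion the standard deviation grows like the square root of time, so the last entry gives a simple check of the slicing logic.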
Figure 4. , the mean value of (red) with its standard deviation, (blue).
Parameter estimation is critical since it determines how well the model compares to the measurement data. The measurement process itself may also have serially uncorrelated errors due to the imperfect accuracy and precision of the measurement equipment.
Write the measurement equation as
(5) 
where . The voltage is assumed to be sampled between and at discrete time points , where , , with an additive measurement noise . To get , we consider a single realization and sample it at time points . Then we add white noise with .
Figure 5. A single realization of (brown) and the simulated measured values (black points).
As we have seen, the solutions to SDEs are stochastic processes that are described by probability distributions. This property allows for maximum likelihood estimation. Let the deviation of the measurements from the model be
(6) 
where . Assuming that the density function of can be approximated reasonably well by Gaussian density, the likelihood function to be maximized is
(7) 
For computation, we use its logarithm. Now we accept the model parameter and estimate and from the SDE model, employing the maximum likelihood method.
Here are the values of the .
This defines the logarithmic likelihood function.
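As a language-neutral illustration, a Gaussian log-likelihood for a list of residuals can be written as follows. This Python sketch assumes independent, zero-mean residuals with a common standard deviation sigma, which is the usual Gaussian form; it is not a transcription of equation (7), and the function name is illustrative.

```python
import math

def gaussian_log_likelihood(residuals, sigma):
    """Log-likelihood of independent N(0, sigma^2) residuals."""
    n = len(residuals)
    sum_sq = sum(e * e for e in residuals)
    return -0.5 * n * math.log(2 * math.pi * sigma**2) - sum_sq / (2 * sigma**2)

print(gaussian_log_likelihood([0.1, -0.2, 0.05], sigma=0.1))
```

Working with the logarithm turns the product of densities into a sum, which avoids numerical underflow when there are many measurement points.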
The optimization of the likelihood function is not an easy task, since the objective function is often flat, with nondifferentiable terms and many local extrema. In addition, the model takes a long time to evaluate. Instead of using direct global optimization, first we compute the values of the objective function on a grid to use parallel computation to speed up the evaluation. We are looking for the parameters in the range . Let us create the grid points.
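The grid-evaluation strategy can be sketched in a few lines of Python. The version below is serial and uses a toy objective with a known maximum; the helper names and the objective are assumptions for illustration, and in practice each grid evaluation could be dispatched to a separate worker.

```python
import itertools

def linspace(lo, hi, n):
    """n equally spaced values from lo to hi inclusive."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def grid_search_max(objective, axes):
    """Evaluate objective on the Cartesian product of axes; return the best point."""
    best_point, best_value = None, float("-inf")
    for point in itertools.product(*axes):
        value = objective(*point)
        if value > best_value:
            best_point, best_value = point, value
    return best_point, best_value

def toy_objective(a, b):
    """Hypothetical smooth objective with its maximum at (0.3, 0.7)."""
    return -(a - 0.3) ** 2 - (b - 0.7) ** 2

axes = [linspace(0.0, 1.0, 11), linspace(0.0, 1.0, 11)]
print(grid_search_max(toy_objective, axes))
```

The best grid point is only a starting estimate; its accuracy is limited by the grid spacing, which is why a refinement step on an interpolated surface is a natural follow-up.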
This computes the function values at the grid points in parallel.
Now apply interpolation for the grid point data and visualize the likelihood function.
Figure 6. The likelihood function of the parameter estimation problem.
Employing different global optimization methods, we compute the parameters. The third one generates a warning message that we suppress with Quiet.
There are many local maxima. Let us choose the global one.
Here is a new parameter list.
This computes 500 realizations.
This visualizes the result and the measurement data.
Figure 7. The simulated process with the estimated parameters as in Figure 4.
In the previous section, Parameter Estimation, we illustrated the technique of stochastic modeling. The measurement data was simulated with the correct model parameters , no assumed modeling error, existing system noise and measurement error . The model to be fitted had two free parameters, and , since the system noise was chosen for the model.
Now consider a different situation. Suppose that we have modeling error, since the measured values are simulated with ; however, in the model to be fitted, we use . In addition, let us increase the measurement error to . To compare the efficiencies of the deterministic and stochastic models, we should handle as a free parameter in the stochastic model to be fitted. So then we have three free parameters to be estimated: , and . Let us carry out the parameter estimation for different values of .
The results can be seen in Table 1; corresponds to the deterministic model. In order to demonstrate that the stochastic model can provide significant improvement compared with the deterministic model, the likelihood-ratio test can be applied [3]. The likelihood-ratio test is used to compare two nested models. In our case, one of the models is the ODE model, with fewer parameters than the SDE model. The null hypothesis is that the two models are basically the same. The test statistic is
(8) 
where the subscripts and stand for the deterministic and stochastic models, respectively.
The distribution of is , where is the difference in the number of parameters between the two models; in our case, . Here is the critical value for at a confidence level of 95%.
If the value of is less than the critical value, then the null hypothesis cannot be rejected. In our case, . Therefore, we reject the null hypothesis, which means the SDE model can be considered as a different model providing a significant improvement compared with the ODE model.
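The mechanics of the test can be reproduced in Python with the standard library alone. The sketch assumes the usual statistic, minus twice the difference of the two log-likelihoods, and one extra parameter in the larger model, so that the reference distribution is chi-squared with one degree of freedom. The log-likelihood values in the usage lines are hypothetical.

```python
import math

def lr_statistic(loglik_restricted, loglik_full):
    """Likelihood-ratio statistic -2 (log L_restricted - log L_full)."""
    return -2.0 * (loglik_restricted - loglik_full)

def chi2_cdf_1dof(x):
    """CDF of the chi-squared distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0)) if x > 0 else 0.0

def chi2_critical_1dof(level=0.95):
    """Critical value at the given level, found by bisection on the CDF."""
    lo, hi = 0.0, 100.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if chi2_cdf_1dof(mid) < level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

critical = chi2_critical_1dof(0.95)
stat = lr_statistic(loglik_restricted=-120.0, loglik_full=-110.0)  # hypothetical values
print(critical, stat, stat > critical)
```

A statistic above the critical value leads to rejecting the null hypothesis that the smaller model is adequate.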
For changing parameters to simulate the measured values, you should modify the content of the list in the Simulation section and the value of in the Measurement Values subsection. To change the parameter of the model to be fitted, change the value of in the list in the Likelihood Function subsection.
In reality, the three parameters should be handled simultaneously during the optimization process.
Assuming there is no modeling error, using the SDE model, we can separate the measurement error from the system noise represented by the estimated of the fitted model; see [4].
The advantage of using stochastic modeling is that the actual states in the model are predicted from data. This allows the prediction to stay close to the data even when the parameters in the model are imprecise. It has been demonstrated that Mathematica can significantly support a user carrying out stochastic modeling. There are many built-in functions that help in the computation of the trajectories of stochastic differential equations and their statistics, global optimization for parameter estimation and the likelihood-ratio test for model verification.
The author is grateful for the EU FP7 International Research Staff Exchange Scheme, which provided the financial support that partly facilitated his stay at the Department of Mechanical Engineering of the University of Canterbury, NZ, where this article was completed.
[1]  E. Allen, Modeling with Itô Stochastic Differential Equations, Dordrecht, Netherlands: Springer, 2007. 
[2]  R. FitzHugh, “Impulses and Physiological States in Theoretical Models of Nerve Membrane,” Biophysical Journal, 1(6), 1961 pp. 445–466. doi:10.1016/S0006-3495(61)86902-6.
[3]  H. Madsen and P. Thyregod, Introduction to General and Generalized Linear Models, Boca Raton: CRC Press, 2010. 
[4]  C. W. Tornøe, J. L. Jacobsen, O. Pedersen, T. Hansen and H. Madsen, “Grey-Box Modelling of Pharmacokinetic/Pharmacodynamic Systems,” Journal of Pharmacokinetics and Pharmacodynamics, 31(5), 2004 pp. 401–417. doi:10.1007/s10928-004-8323-8.
B. Paláncz, “Stochastic Simulation and Parameter Estimation of the FitzHugh–Nagumo Model,” The Mathematica Journal, 2016. dx.doi.org/doi:10.3888/tmj.18-6.
Dr. Béla Paláncz is professor emeritus at the Department of Photogrammetry and Geoinformatics at the Faculty of Civil Engineering of the Budapest University of Economics and Technology, in Budapest, Hungary. His main interests are mathematical modeling and numeric-symbolic computations. He is the coauthor of several books and more than a hundred papers.
Dr. Béla Paláncz:
Budapest University of Economics and Technology
1111 Műegytem rkp. 3
Budapest, Hungary
palancz@epito.bme.hu
This series of articles showcases a variety of applications around the theme of inversion, which constitutes a strategic way of manipulating configurations of circles and lines in the plane. This article includes the Riemann sphere, rings of four tangent circles, and inverting the Sierpinski sieve.
The story of inversion, propelled by the study of conic sections and stereographic projection, is quite complex [1]. Although Viète (1540–1603) already spoke of mutually inverted points, the proper development started with a plethora of notable geometers including L’Huilier (1750–1840), Dandelin (1794–1847), Quetelet (1796–1874), Steiner (1796–1863), Magnus (1790–1861), Plücker (1801–1868), Bellavitis (1803–1880), and Simpson (1849–1924). Steiner, in a manuscript published in 1913, is considered to have been the first to formulate inversion as a method to systematically simplify the study of complex geometric figures where circles play a prominent role. This work culminated in 1855 with the studies of Möbius (1790–1868) (hence the choice of the letter for the inversive circle in this article) [2]. Peaucellier (1832–1913) also applied inversion to his famous linkage [3], and Lord Kelvin (1824–1907) applied inversion to elasticity.
Let denote either the line or the segment from to , depending on the context. The length of the segment is denoted by . The circle with center and radius is denoted by . Denote by the inverse of in the current circle of inversion , where is either a symbolic point or often ; in Manipulate results, is often shown as a red dashed circle.
The following functions encapsulate the basic properties of inversive geometry developed in the first part of this series [4].
The function squareDistance computes the square of the Euclidean distance between two given points. (It is more convenient to use the following definition than the built-in Mathematica function SquaredEuclideanDistance.)
The function collinearQ tests whether three given points are collinear. When exactly two of them are equal, it gives True, and when all three are equal, it gives False, because there is no unique line through them.
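The behavior described for these two helpers can be mirrored in Python. This is a sketch of the stated semantics only (the article's own definitions are in Mathematica); the names square_distance and collinear_q and the tolerance argument are illustrative.

```python
def square_distance(p, q):
    """Squared Euclidean distance between two points in the plane."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def collinear_q(a, b, c, tol=1e-12):
    """True if a, b, c lie on a unique line; False when all three coincide."""
    if a == b == c:
        return False  # no unique line through a single point
    # Twice the signed area of the triangle abc; zero means collinear.
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return abs(cross) <= tol

print(square_distance((0, 0), (3, 4)))          # 25
print(collinear_q((0, 0), (1, 1), (2, 2)))      # True
print(collinear_q((0, 0), (0, 0), (1, 5)))      # True: exactly two points equal
print(collinear_q((1, 1), (1, 1), (1, 1)))      # False: all three equal
```

Using the squared distance avoids square roots, which keeps symbolic or exact arithmetic simpler.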
The function computes the circle passing through the points , , and . If the points are collinear, it gives the line through them; if all three points are the same, it returns an error message, as there is no meaningful definition of inversion in a circle of zero radius.
The function computes the inverse of in the circle . The object can be a point (including the special point that inverts to the center of ), a circle, or a line (specified by two points).
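For a single point, inversion reduces to a one-line formula: the inverse of P in the circle with center C and radius r is C + r^2 (P - C)/|P - C|^2. The following Python sketch covers only this point case (not circles, lines, or the point at infinity); the function name is illustrative.

```python
def invert_point(p, center, radius):
    """Inverse of point p in the circle with the given center and radius."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    d2 = dx * dx + dy * dy
    if d2 == 0:
        raise ValueError("the center inverts to the point at infinity")
    k = radius ** 2 / d2
    return (center[0] + k * dx, center[1] + k * dy)

q = invert_point((2.0, 0.0), (0.0, 0.0), 1.0)
print(q)                                        # (0.5, 0.0)
print(invert_point(q, (0.0, 0.0), 1.0))         # back to (2.0, 0.0)
```

Applying the map twice returns the original point, and the distances from the center multiply to the squared radius.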
Here is an example.
Since a circle can be inverted into a line, define a generalized circle to be either an ordinary circle or a line, as in the first article in this series. A consequence of theorem 2 in the next section is that the set of generalized circles is closed under inversion in a circle. The following Manipulate shows that as , the circle tends to the line , and the inverse of the circle tends to its reflection in the line , the circle .
Therefore, it makes sense to define inversion in a line to be reflection.
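The limiting behavior can also be checked numerically. In this Python sketch, a point is inverted in a large circle tangent to the x-axis at the origin and compared with its reflection in the x-axis; the function names and the specific configuration are assumptions chosen for illustration.

```python
def reflect_point(p, a, b):
    """Reflection of p in the line through a and b."""
    ux, uy = b[0] - a[0], b[1] - a[1]
    t = ((p[0] - a[0]) * ux + (p[1] - a[1]) * uy) / (ux * ux + uy * uy)
    fx, fy = a[0] + t * ux, a[1] + t * uy  # foot of the perpendicular from p
    return (2 * fx - p[0], 2 * fy - p[1])

def invert_point(p, center, radius):
    """Inverse of p in a circle (point case only)."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    k = radius ** 2 / (dx * dx + dy * dy)
    return (center[0] + k * dx, center[1] + k * dy)

p = (1.0, 2.0)
print(reflect_point(p, (0.0, 0.0), (1.0, 0.0)))   # (1.0, -2.0)
for big_r in (1e2, 1e4, 1e6):
    # Circle of radius big_r centered at (0, -big_r) touches the x-axis at the origin.
    print(big_r, invert_point(p, (0.0, -big_r), big_r))
```

As the radius grows, the inverse approaches the mirror image, which supports treating reflection as inversion in a line.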
Here is an example.
The function squareDistanceToLine computes the square of the distance of a given point to a given line.
The function exCircles computes the four circles tangent to the lines forming the sides of a triangle having given vertices.
The function inCircle computes the incircle of a triangle having given vertices as the excircle of smallest radius.
The function redPoint is used to mark a given point in red.
The function draws an arc of radius r and center b from the line ab to the line bc.
The function intersections computes the possible points of intersection of a given line and circle. (There may be zero, one, or two points.)
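One way to compute such intersections is to parametrize the line and solve a quadratic. This Python sketch assumes the line is given by two distinct points; it illustrates the zero, one, and two point cases and is not the article's implementation.

```python
import math

def intersections(a, b, center, radius):
    """Intersection points of the line through a, b with a circle."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    fx, fy = a[0] - center[0], a[1] - center[1]
    # Substitute a + t (b - a) into the circle equation: qa t^2 + qb t + qc = 0.
    qa = dx * dx + dy * dy
    qb = 2 * (fx * dx + fy * dy)
    qc = fx * fx + fy * fy - radius ** 2
    disc = qb * qb - 4 * qa * qc
    if disc < 0:
        return []
    root = math.sqrt(disc)
    if disc == 0:
        ts = [-qb / (2 * qa)]
    else:
        ts = [(-qb - root) / (2 * qa), (-qb + root) / (2 * qa)]
    return [(a[0] + t * dx, a[1] + t * dy) for t in ts]

print(intersections((-2.0, 0.0), (2.0, 0.0), (0.0, 0.0), 1.0))  # two points
print(intersections((-1.0, 1.0), (1.0, 1.0), (0.0, 0.0), 1.0))  # tangent: one point
print(intersections((-1.0, 2.0), (1.0, 2.0), (0.0, 0.0), 1.0))  # no intersection
```

With floating-point input, the tangent case is fragile because the discriminant is rarely exactly zero; a tolerance may be preferable in practice.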
The following Manipulate shows several of the functions introduced so far.
The concept of orthogonality between circles plays an important role in inversive geometry. Two intersecting circles are orthogonal if they have perpendicular tangents at either point of intersection. It is rather amusing to consider orthogonality of circles without mentioning any notion of perpendicularity [3, 4], as is done in the following definition.
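A standard algebraic test for orthogonality is that the squared distance between the centers equals the sum of the squared radii, since the radii drawn to an intersection point are then perpendicular. The Python sketch below uses that criterion; it is a well-known equivalent characterization, not the definition given in the article, and the function name is illustrative.

```python
def orthogonal_q(c1, r1, c2, r2, tol=1e-9):
    """True if two circles meet at right angles: d^2 = r1^2 + r2^2."""
    d2 = (c1[0] - c2[0]) ** 2 + (c1[1] - c2[1]) ** 2
    return abs(d2 - (r1 ** 2 + r2 ** 2)) <= tol

print(orthogonal_q((0, 0), 3, (5, 0), 4))   # True: 25 = 9 + 16
print(orthogonal_q((0, 0), 1, (1, 0), 1))   # False
```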
Definition
The following Manipulate shows the central circle is orthogonal to the black one by showing the other two circles (blue). Some values of the positions of the black circle or the radii of the blue circles make it impossible to close the three circles, so adjustments to close them are done on the fly.
Theorems 1–11 review the basic properties of inversion and introduce some of its remarkable properties. Consider all inversions to be with respect to the circle .
Theorem 1
The following Manipulate illustrates this property and shows that triangles and are similar. The gray circle is orthogonal to .
The following interesting variation of theorem 1 was presented at the Swiss Mathematical Contest in 1999 [5, 6]:
Theorem 2
For instance, invert the circle in .
The set of circles and lines, called generalized circles, is therefore closed under inversion.
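For a circle that does not pass through the center of inversion, the inverse circle has a closed form. With k = r^2/(|A - C|^2 - s^2), the circle with center A and radius s inverts to the circle with center C + k (A - C) and radius |k| s. The Python sketch below implements only this case; the function name is illustrative.

```python
def invert_circle(a, s, center, radius):
    """Inverse of the circle (center a, radius s) in the circle (center, radius)."""
    dx, dy = a[0] - center[0], a[1] - center[1]
    power = dx * dx + dy * dy - s * s  # power of the inversion center w.r.t. the circle
    if power == 0:
        raise ValueError("circle passes through the center; its inverse is a line")
    k = radius ** 2 / power
    return (center[0] + k * dx, center[1] + k * dy), abs(k) * s

print(invert_circle((3.0, 0.0), 1.0, (0.0, 0.0), 1.0))  # ((0.375, 0.0), 0.125)
print(invert_circle((5.0, 0.0), 4.0, (0.0, 0.0), 3.0))  # an orthogonal circle maps to itself
```

In the first example the diameter endpoints 2 and 4 on the x-axis invert to 1/2 and 1/4, which gives the same center and radius. Note that the center of the inverse circle is not the inverse of the original center.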
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Proof
By the definition of inversion, . Then . Since , the result follows. □
Theorem 7
To verify this result, notice that in the following expression (the condition implies does not pass through and thus indeed inverts into a circle).
Theorem 8
.
Subtracting the first points of the following lines obtained by inversion gives the result.
Theorem 9
The following Manipulate shows the terms mentioned in theorem 9.
Proof
Let and be the ends of the diameter of shown in the figure above. The point is outside the segment , as otherwise would not be defined. Then
and the result follows. □
For instance, consider inverting in .
Then , which we verify as follows.
In the case of theorem 7, this is the corresponding comparison.
In order to verify this result, assume that and , with and (otherwise would not be a circle).
From this theorem, it follows that the product of the lengths of the tangents from to and must be .
Theorem 10
Proof
Invert , , and in the circle . Then and (by theorem 6). Substituting this and similar results for , , , and in the inequality gives
which reduces to ; this always applies in the triangle . Equality holds if and only if the points , , and are collinear, that is, if and only if points , , , and are concyclic or collinear. □
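The inequality and its equality case can be explored numerically. This Python sketch assumes theorem 10 is the usual Ptolemy inequality, AB·CD + BC·DA ≥ AC·BD, with equality for points taken in order around a circle; the function names and sample points are illustrative.

```python
import math

def dist(p, q):
    """Euclidean distance between two points in the plane."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def ptolemy_gap(a, b, c, d):
    """AB*CD + BC*DA - AC*BD, which is nonnegative by Ptolemy's inequality."""
    return dist(a, b) * dist(c, d) + dist(b, c) * dist(d, a) - dist(a, c) * dist(b, d)

# Four points in cyclic order on the unit circle: the gap vanishes.
on_circle = [(math.cos(t), math.sin(t)) for t in (0.1, 1.0, 2.5, 4.0)]
print(ptolemy_gap(*on_circle))

# A quadrilateral that is not cyclic: the gap is strictly positive.
print(ptolemy_gap((0, 0), (1, 0), (1, 1), (0, 2)))
```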
Ptolemy (circa 127 AD) compiled much of what today is the pseudoscience of astrology. His Earth-centered universe held sway for 1500 years. In the words of Carl Sagan, in the first episode of his TV series Cosmos, “… showing that intellectual brilliance is no guarantee against being dead wrong.” You can find fascinating applications of Ptolemy’s theorem in [9, 10].
Theorem 11
This result is useful when inverting to concentric circles elsewhere. In order to verify it, assume without loss of generality that the two circles are and , with , and that the orthogonal circle is . Then the following quantity is independent of , and it is equal to the position of one of the two fixed points mentioned.
The following provides a framework with which to interpret inversion. Consider a sphere of unit radius centered at the origin. Draw a line from the north pole to a point on the sphere. The point at which intersects the  plane defines a one-to-one correspondence between points on the sphere and points in the plane; is called stereographic projection. The image of the south pole is the origin. The image of the north pole is not defined, but is introduced as a new point to serve as the image of ; this makes the mapping continuous and one-to-one. In the context of stereographic projection, the sphere is referred to as the Riemann sphere after the prominent mathematician Bernhard Riemann (1826–1866), who studied under Steiner and earned his PhD degree under Gauss [11, 12].
Similar triangles give
By using these formulas, it is possible to prove that the stereographic projection of a circle on is a generalized circle in the plane. (Circles through go to lines and other circles go to circles.) So inversion in the unit circle in the plane induces a map from circles to circles on .
This defines stereographic projection.
An inversive pair of points maps to points reflected in the  plane.
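The reflection property can be verified with explicit formulas. The Python sketch below assumes the unit sphere centered at the origin, projection from the north pole (0, 0, 1) onto the equatorial plane, and inversion in the unit circle; the function names are illustrative.

```python
def to_plane(p):
    """Stereographic projection of a sphere point (x, y, z), z != 1, to the plane."""
    x, y, z = p
    if z == 1:
        raise ValueError("the north pole has no image in the plane")
    return (x / (1 - z), y / (1 - z))

def to_sphere(q):
    """Inverse stereographic projection of a plane point (X, Y)."""
    X, Y = q
    s = X * X + Y * Y
    return (2 * X / (s + 1), 2 * Y / (s + 1), (s - 1) / (s + 1))

def invert_in_unit_circle(q):
    """Inverse of a nonzero plane point in the unit circle."""
    X, Y = q
    s = X * X + Y * Y
    return (X / s, Y / s)

q = (0.6, 1.3)
print(to_sphere(q))
print(to_sphere(invert_in_unit_circle(q)))  # same x and y, opposite z
```

The two sphere points differ only in the sign of the third coordinate, so they are mirror images in the equatorial plane.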
Moreover, as the following Manipulate shows, an inversive pair of generalized circles maps to circles on that are reflections across the  plane. You can vary the radius and center of the control circle in the  plane. The circle inverts to in the unit circle, which is the equator of . You can show the stereographic cone joining to through its stereographic projection and the stereographic cone joining to through its stereographic projection , which is the reflection of in the  plane [13].
Amusing applications related to stereographic projection are found in [14, 15], and interesting Demonstrations in [16–18].
Consider a ring of four cyclically externally tangent circles as in the following Manipulate. The four points of tangency are concyclic (pink circle), even when some of the circles are internally tangent. You can drag the four points around this circle to alter the shape of the arrangement. To see many other similar patterns, see [19]. In contrast to the case of three tangent circles, the circle through the points of tangency is not necessarily orthogonal to the other four circles.
The function perturb slightly varies three points that are coincident or collinear.
Let us apply inversion to deduce the concyclic property. Invert in a circle centered at one of the tangent points, for instance, the red dashed circle in the following Manipulate. That inversion transforms the circles centered at and into two parallel lines and two circles in between these lines. These four inversions are sequentially tangent in three points. The problem then reduces to show that the three points of tangency lie on the same line (shown in blue).
Let us take a closer look at the arrangement formed by the inversion of the chain of circles, starting from a different point of view. The following Manipulate shows an arrangement of two parallel lines, one slanted line, and two circles with the tangencies indicated. The two orange disks indicate adjustable tangency points and of the circles and the parallel lines. It is easy to show that the slanted line always passes through the tangency point of the circles regardless of the positions of and and the value of one of the radii, say, of the lower circle. In fact, the sum of the radii remains constant for fixed positions of and . For some values of the radii there exists a third circle also tangent to the parallel lines and the circles. Selecting the appropriate option in the Manipulate, the radii are adjusted according to the position of the slanted line for this third (blue) circle to exist. When the slanted line is vertical and the circles are congruent, three of those third circles exist.
The next Manipulate inverts the previous arrangement. It shows that the tangent points of the set of (blue) pairwise tangent circles are concyclic. It also shows that there exist up to three circles tangent to the circles of , and sets of up to four circles orthogonal to the circles of . You can vary the center of the inversive circle. Similar examples can be found at [20, 21].
The quadrilateral joining the centers of the circles forming the ring has an inscribed circle. This is easily seen using the theorem of Henri Pitot (1695–1771): a convex quadrilateral with consecutive side lengths $a$, $b$, $c$, $d$ is tangential, that is, it has an inscribed circle, if and only if $a + c = b + d$. Oddly enough, the incircle of this quadrilateral does not necessarily coincide with the circle passing through the four points of tangency, as the following result shows.
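The tangential property can be checked numerically. For a ring of externally tangent circles with radii $r_1, r_2, r_3, r_4$, consecutive centers are $r_i + r_{i+1}$ apart, so both sums of opposite sides equal $r_1 + r_2 + r_3 + r_4$. The following is a minimal Python sketch of that check (the article's own code is in Mathematica; the function names here are illustrative, and only the externally tangent case is covered).

```python
def ring_side_lengths(radii):
    """Side lengths of the quadrilateral of centers for a cyclic ring of
    externally tangent circles: consecutive centers are r_i + r_(i+1) apart."""
    n = len(radii)
    return [radii[i] + radii[(i + 1) % n] for i in range(n)]


def satisfies_pitot(a, b, c, d, tol=1e-12):
    """Pitot's condition for consecutive side lengths a, b, c, d."""
    return abs((a + c) - (b + d)) <= tol


# Example radii (arbitrary values chosen for illustration).
sides = ring_side_lengths([1.0, 2.5, 0.75, 3.0])
print(sides, satisfies_pitot(*sides))
```

The condition holds for any choice of radii, which is why the quadrilateral of centers is always tangential when it is convex.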
The Sierpinski sieve (or Sierpinski gasket, or Sierpinski triangle) [5, 6], named after the prolific Polish mathematician Wacław Sierpiński (1882–1969), is a self-similar subdivision of a triangle [22, 23]. The function Sierpinski constructs the iteration corresponding to the recursive definition of this subdivision.
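The recursive definition can be sketched in a few lines. This Python version is an illustration only, not the article's Mathematica function Sierpinski; the names `midpoint` and `sierpinski` are assumptions made for the example. Each triangle is replaced by the three corner triangles obtained by joining the midpoints of its sides.

```python
def midpoint(p, q):
    """Midpoint of two points given as (x, y) pairs."""
    return ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)


def sierpinski(triangle, depth):
    """Return the list of triangles after `depth` subdivision steps."""
    if depth == 0:
        return [triangle]
    a, b, c = triangle
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    result = []
    # Keep the three corner triangles and drop the central one.
    for corner in ((a, ab, ca), (ab, b, bc), (ca, bc, c)):
        result.extend(sierpinski(corner, depth - 1))
    return result


start = ((0.0, 0.0), (1.0, 0.0), (0.5, 3 ** 0.5 / 2))
print(len(sierpinski(start, 4)))
```

The number of triangles triples at every step, so iteration $n$ contains $3^n$ triangles.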
Inverting the vertices of the triangles from the first four iterations in the unit circle gives rise to stunning patterns.
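Inverting a vertex in a circle with center $c$ and radius $r$ sends a point $p \neq c$ to $c + r^2 (p - c)/|p - c|^2$. A minimal Python sketch of this map, with the hypothetical name `invert_point` and the unit circle as the default:

```python
def invert_point(p, center=(0.0, 0.0), radius=1.0):
    """Inverse of the point p in the circle with the given center and radius."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    d2 = dx * dx + dy * dy
    if d2 == 0:
        raise ValueError("the center of inversion has no finite inverse")
    k = radius ** 2 / d2
    return (center[0] + k * dx, center[1] + k * dy)


# Inverting the vertices of one triangle in the unit circle.
triangle = ((2.0, 0.0), (0.0, 4.0), (1.0, 1.0))
print([invert_point(v) for v in triangle])
```

Inverting only the vertices, as described above, keeps the figure polygonal; inverting the sides as well would turn them into circular arcs.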
The following Manipulate lets you vary the radius and center of the inverting circle, the number of iterations, and the type of transformation applied to the vertices of the triangles that make up the Sierpinski sieve.
I would like to thank the anonymous referee whose thoughtful and detailed comments on this article greatly improved its presentation, and also Queretaro’s Institute of Technology, which provided essential support for the completion of this second part.
[1]  H. W. Eves, A Survey of Geometry, Boston: Allyn and Bacon, 1972. 
[2]  B. C. Patterson, “The Origins of the Geometric Principle of Inversion,” Isis, 19(1), 1933 pp. 154–180. www.jstor.org/stable/225190. 
[3]  J. Rangel-Mondragón. “Inversive Geometry II: The Peaucellier Inversor Mechanism” from the Wolfram Demonstrations Project—A Wolfram Web Resource. demonstrations.wolfram.com/InversiveGeometryIIThePeaucellierInversorMechanism.
[4]  J. Rangel-Mondragón, “Inversive Geometry: Part 1. Inverting Generalized Circles, Ellipses, Polygons, and Tilings,” The Mathematica Journal, 15, 2013. doi:10.3888/tmj.15-7.
[5]  R. Todev, Geometry Problems and Solutions from Mathematical Olympiads, MathOlymps, 2010. 
[6]  D. Djukić, V. Janković, I. Matić, and N. Petrović, The IMO Compendium, 2nd ed., New York: Springer, 2011. 
[7]  D. E. Blair, Inversion Theory and Conformal Mapping, Providence, RI: The American Mathematical Society, 2000. 
[8]  H. Fukagawa and D. Pedoe, Japanese Temple Geometry Problems: San Gaku, Winnipeg, Canada: The Charles Babbage Research Centre, 1989. 
[9]  A. S. Posamentier, Advanced Euclidean Geometry, Emeryville, CA: Key College Publishing, 2002. 
[10]  E. W. Weisstein. “Ptolemy’s Theorem” from Wolfram MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/PtolemysTheorem.html. 
[11]  D. Pedoe, Geometry, A Comprehensive Course, New York: Dover Publications Inc., 1988. 
[12]  E. W. Weisstein. “Riemann Sphere” from Wolfram MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/RiemannSphere.html. 
[13]  T. Needham, Visual Complex Analysis, Oxford: Clarendon Press, 1997.
[14]  D. Gehrig, “The Orloj,” The Mathematica Journal, 7(4), 2000. www.mathematica-journal.com/issue/v7i4/features/gehrig.
[15]  P. W. Kuchel, “Spatial Inversion: Reflective Anamorphograms,” The Mathematica Journal, 9(2), 2004. www.mathematica-journal.com/issue/v9i2/SpatialInversion.html.
[16]  J. Rangel-Mondragón. “A General Cone” from the Wolfram Demonstrations Project—A Wolfram Web Resource. demonstrations.wolfram.com/AGeneralCone.
[17]  P. W. Kuchel, “Anamorphoscopes: A Visual Aid for Circle Inversion,” Mathematical Gazette, 63(424), 1979 pp. 82–89. 
[18]  E. Mahieu. “Inverse Stereographic Projection of Simple Geometric Shapes” from the Wolfram Demonstrations Project—A Wolfram Web Resource. demonstrations.wolfram.com/InverseStereographicProjectionOfSimpleGeometricShapes. 
[19]  A. Akopyan, Geometry in Figures, CreateSpace Independent Publishing Platform, 2011. 
[20]  J. Rangel-Mondragón. “Problems on Circles IV: Circles Tangent to Two Others with a Given Radius” from the Wolfram Demonstrations Project—A Wolfram Web Resource. demonstrations.wolfram.com/ProblemsOnCirclesIVCirclesTangentToTwoOthersWithAGivenRadius.
[21]  J. Rangel-Mondragón. “Problems on Circles X: Tangent Circles Generate Ellipses” from the Wolfram Demonstrations Project—A Wolfram Web Resource. demonstrations.wolfram.com/ProblemsOnCirclesXTangentCirclesGenerateEllipses.
[22]  E. W. Weisstein. “Sierpiński Sieve” from Wolfram MathWorld—A Wolfram Web Resource. mathworld.wolfram.com/SierpinskiSieve.html.
[23]  A. Marquez-Raygoza. “Oftenpaper.net: The Sierpinski Triangle Page to End Most Sierpinski Triangle Pages.” (Dec 5, 2016) www.oftenpaper.net/sierpinski.htm.
J. Rangel-Mondragón, “Inversive Geometry,” The Mathematica Journal, 2016. dx.doi.org/doi:10.3888/tmj.18-5.
Jaime Rangel-Mondragón received M.Sc. and Ph.D. degrees in Applied Mathematics and Computation from the School of Mathematics and Computer Science at the University College of North Wales in Bangor, UK. He was a visiting scholar at Wolfram Research, Inc. and held positions in the Faculty of Informatics at UCNW, the Center of Literary and Linguistic Studies at the College of Mexico, the Department of Electrical Engineering at the Center of Research and Advanced Studies, the Center of Computational Engineering (of which he was director) at the Monterrey Institute of Technology, the Department of Mechatronics at the Queretaro Institute of Technology and the Autonomous University of Queretaro in Mexico, where he was a member of the Department of Informatics and in charge of the Academic Body of Algorithms, Computation, and Networks. His research included combinatorics, the theory of computing, computational geometry, and recreational mathematics. Jaime Rangel-Mondragón died in 2015.
We describe efficient algorithms for working with subgroups of the modular group. Operations discussed include join and meet, congruence testing, congruence closure, subgroup testing, cusp enumeration, supergroup lattice, generators and coset enumeration, and constructing a group from a list of generators.
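The set of linear fractional transformations of the form
$$z \mapsto \frac{a z + b}{c z + d}, \tag{1}$$
known as Möbius transformations, has several interesting properties. First, the composition of functions of this form is still of the same form, as
$$\frac{a\,\dfrac{a' z + b'}{c' z + d'} + b}{c\,\dfrac{a' z + b'}{c' z + d'} + d} = \frac{(a a' + b c')\,z + (a b' + b d')}{(c a' + d c')\,z + (c b' + d d')}.$$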
Since the coefficients appearing in the composition are exactly those of the product of the two matrices $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$ and $\begin{pmatrix} a' & b' \\ c' & d' \end{pmatrix}$, it is most convenient to represent transformations of the form (1) by the matrix $\begin{pmatrix} a & b \\ c & d \end{pmatrix}$. A matrix and any nonzero scalar multiple of itself represent the same Möbius transformation, so we can consider only matrices with determinant 1 without loss of generality. Since the product of two matrices with determinant 1 also has determinant 1, such a set of matrices (or Möbius transformations) forms a group, where the group operation is matrix multiplication (or composition). If we further restrict the coefficients $a$, $b$, $c$, and $d$ in (1) to be integers, the resulting group is known as $\mathrm{SL}_2(\mathbb{Z})$. Now, if the matrix $M$ is in $\mathrm{SL}_2(\mathbb{Z})$, then $-M$ is also in $\mathrm{SL}_2(\mathbb{Z})$ and represents the same Möbius transformation. For this reason, we consider $\mathrm{PSL}_2(\mathbb{Z})$, known as the modular group, which is $\mathrm{SL}_2(\mathbb{Z})$ with each matrix identified with its negative. That is,
$$\mathrm{PSL}_2(\mathbb{Z}) = \mathrm{SL}_2(\mathbb{Z}) / \{\pm I\}.$$
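It is possible to show that every transformation in the modular group can be obtained as a combination of the two fundamental transformations
$$S\colon z \mapsto -\frac{1}{z}, \qquad T\colon z \mapsto z + 1,$$
with corresponding matrices $S = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ and $T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$. Another way of stating this fact is that the modular group is generated by $S$ and $T$.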
For example, the matrix may be obtained as the product .
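The correspondence between composing Möbius transformations and multiplying their matrices can be checked with exact rational arithmetic. The following Python sketch is an illustration only (the article's code is in Mathematica, and the helper names `mobius` and `matmul` are assumptions); matrices are stored as pairs of rows.

```python
from fractions import Fraction


def mobius(m, z):
    """Apply the Moebius transformation with matrix m = ((a, b), (c, d)) to z."""
    (a, b), (c, d) = m
    return (a * z + b) / (c * z + d)


def matmul(m, n):
    """Product of two 2x2 matrices given as pairs of rows."""
    (a, b), (c, d) = m
    (e, f), (g, h) = n
    return ((a * e + b * g, a * f + b * h), (c * e + d * g, c * f + d * h))


A = ((2, 1), (1, 1))          # determinant 1
T = ((1, 1), (0, 1))          # z -> z + 1
z = Fraction(3, 7)

# Composition of the maps agrees with the map of the matrix product.
print(mobius(A, mobius(T, z)), mobius(matmul(A, T), z))
```

Negating every entry of a matrix leaves the transformation unchanged, which is the reason for passing from matrices to matrices up to sign.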
The modular group is important because of the existence of modular functions, which are functions that have simple transformation laws under the action of the modular group. A prototypical modular function is the modular discriminant function, which may be defined for by
Since this product has zeros at every rational number, the real axis becomes a natural boundary of the domain of . Using the methods of analysis, it is possible to show that
(2) 
for every matrix in the modular group. Two observations are in order concerning the transformation formula (2). First, as
we see that the transformed value of still has a positive imaginary part, so it still lies in the domain of . Second, the values of where ranges over the whole upper half-plane are related to the values of where is restricted to the region shaded blue here.
This region is known as the fundamental domain for . This is because the transformations and can be used to bring any point in into this region, and no two points inside it differ by a Möbius transformation in . The transformation pairs the left edge with the right edge, while the transformation pairs the arc from to with the arc from to .
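The reduction of a point to the fundamental domain can be sketched as follows. This Python example assumes the standard region $|\operatorname{Re} \tau| \le 1/2$, $|\tau| \ge 1$ and the standard generators $\tau \mapsto \tau + 1$ and $\tau \mapsto -1/\tau$; the function name and the returned word format are illustrative, not part of the package.

```python
def reduce_to_fundamental_domain(tau, max_steps=10_000):
    """Move tau (with positive imaginary part) into the region
    |Re tau| <= 1/2, |tau| >= 1 using translations and tau -> -1/tau.
    Returns the reduced point and the list of moves that were applied."""
    tau = complex(tau)
    if tau.imag <= 0:
        raise ValueError("tau must lie in the upper half-plane")
    moves = []
    for _ in range(max_steps):
        n = round(tau.real)
        if n != 0:
            tau -= n                      # apply T^(-n)
            moves.append(("T", -n))
        if abs(tau) < 1 - 1e-12:
            tau = -1 / tau                # apply S; this raises Im tau
            moves.append(("S", 1))
        else:
            return tau, moves
    raise RuntimeError("did not converge; tau may be too close to the real axis")


point, moves = reduce_to_fundamental_domain(0.3 + 0.01j)
print(point, moves)
```

Each inversion step strictly increases the imaginary part, which is why the loop terminates for any starting point in the upper half-plane.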
In the theory of modular functions one often wants to know what transformations leave a given function unchanged. For example,
will not be unchanged by all the transformations in , since the numerator does not have a transformation formula under all elements of . However, if is in and is divisible by 5, then
Thus, we are naturally led to the subgroup of given by
The package ModularSubgroups.m addresses the computational problem of working with such subgroups of the modular group. However, only certain subgroups of can be identified by congruence conditions on their entries, as is the case with . Such subgroups are called congruence subgroups and are discussed more later. For this reason, we need a better way to represent subgroups. The key to this lies in the matrices and . Since and generate
, and are also generators. However, it can be shown that is the free product of and ; that is, every matrix in can be written uniquely as a word in and as long as no two consecutive ’s appear and no three consecutive ’s appear in the word. This last condition is necessary because of the relations , where (the identity matrix) should be thought of as the empty word.
A subgroup of the modular group is said to have finite index (in ) if can be written as a disjoint union
of left cosets , where the left coset is defined as . In this way, the group is partitioned into several “copies” of , and the number of copies of that fit inside is called the index . If is a finite index subgroup of with index and left cosets (with a distinguished coset ), the matrices and permute the left cosets when acting by multiplication on the left; that is, we have equations
where and should be viewed as some permutations of the set , that is, elements of the symmetric group .
This identification of the matrices and as permutations gives rise to the permutation representation of , which we use to represent any subgroup of with finite index. Specifically, a subgroup is identified by: (1) its index ; (2) the permutation ; and (3) the permutation . The permutations and are not arbitrary. The following two conditions are necessary and sufficient for a given to appear as the representation of some group .
(1) in , where is the identity in . This condition arises from the fact that as matrices, we have .
(2) and generate a transitive subgroup of , or equivalently, the Schreier cosets graph discussed later is connected. This condition arises because the matrix sends the coset to the coset , hence the action of on the left cosets “connects” all of the cosets together.
If these two conditions are satisfied, the group may be identified as , where the condition needs to be evaluated after thinking of as a permutation by converting and into their corresponding permutations.
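The two conditions are easy to test mechanically. The Python sketch below makes several assumptions that are not taken from the package: cosets are numbered from 0 with coset 0 the subgroup itself, each permutation is a list whose entry `i` is the image of coset `i`, and the relations checked are those of an order-2 generator `s` and an order-3 generator `r`.

```python
def is_valid_representation(s, r):
    """Check that s and r are permutations of 0..n-1 with s*s = r*r*r = identity
    and that together they act transitively on the n cosets."""
    n = len(s)
    if len(r) != n or sorted(s) != list(range(n)) or sorted(r) != list(range(n)):
        return False
    if any(s[s[i]] != i for i in range(n)):
        return False                      # s is not an involution
    if any(r[r[r[i]]] != i for i in range(n)):
        return False                      # r does not have order dividing 3
    # Transitivity: every coset must be reachable from coset 0.
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in (s[i], r[i]):
            if j not in seen:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


print(is_valid_representation([1, 0, 2], [1, 2, 0]))
```

Because permutations have finite order, following `s` and `r` forward from coset 0 reaches the same cosets as following them in both directions, so a simple forward search suffices for the transitivity test.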
Since our representation of by two permutations and involves an arbitrary ordering of the nontrivial cosets , two different representations and represent the same group precisely when there is a relabeling of the indices in the permutations of that simultaneously converts into and into . For example, the two representations
represent the same group, as the relabeling converts to .
Another important combinatorial object attached to a subgroup of finite index is the Farey symbol for , as described in [1]. This symbol directly encodes a fundamental domain for as well as the edge-pairing matrices for this fundamental domain. Equivalently, it encodes independent generators for . However, since the equivalence of two representations of by two different Farey symbols is not as straightforward to detect, the permutation representation as described was chosen for the underlying representation for .
A subgroup of is called a congruence subgroup if it contains the principal congruence subgroup of level ,
for some natural number . If this is the case, we can describe as those matrices whose entries satisfy certain congruences modulo . For example, two families of congruence subgroups are and , which are defined as
Recently in [2], Hsu gave a simple test for determining if a given subgroup of is a congruence subgroup, based on a presentation for . This algorithm is implemented here and generalized to compute the congruence closure of a subgroup , which is the smallest congruence subgroup that contains .
The Schreier cosets graph of is of fundamental importance to several of the algorithms in the package. Given a subgroup with index and permutations and , the Schreier coset graph is the connected graph with vertices and labeled edges
Such a graph has the property of being folded. A graph is said to be folded if every vertex has at most one edge of a given orientation and label incident with it. If there is a vertex with two or more edges of the same label and orientation, then the graph is said to be unfolded. One property of the Schreier cosets graph is that the subgroup of the modular group consists of all words in and such that, when starting at vertex 1, the path that follows word must terminate at vertex 1. For example, take the subgroup with the following Schreier cosets graph.
The word , which corresponds to the matrix
is in as the path traced out by is given by ( must be read right to left since we are dealing with left cosets). Since the graph is folded, the process of tracing a path given by a word in and is deterministic. The group corresponding to this Schreier cosets graph turns out to be a congruence subgroup, and its defining congruences are given in Example 1.
All of these examples were tested in Mathematica 10.
Set the directory to be able to load the package and then evaluate the Needs.
Subgroups of the modular group are maintained in the container , and the names of the functions that operate on these groups start with a lowercase “m” in order to avoid possible conflicts with built-in symbols. The matrices mS, mO, mT, mR of the package are set as follows.
Here is the group from the section on Schreier cosets graphs. The permutations are listed so that and , where and are the last two arguments of the mGroup container.
This group turns out to be a congruence subgroup of level 3, and it consists of those matrices that are congruent modulo 3 to one of the following matrices.
So, for example, the group has the description
The group generated by and is of finite index only for . Similarly, the group generated by and is of finite index only for .
There are two conjugacy classes of congruence subgroups of index 7, which we define here by their generators.
In the printed form of the group :
These numbers are related by .
They are indeed not conjugate.
The intersection of these two groups turns out to have index 28, while the group generated by these two groups turns out to be the full modular group.
A fundamental domain may be obtained with mDomain.
The edge pairings for this fundamental domain are given in the Farey symbol. Edges between two rational numbers with the same integer label are paired together, while edges with the label ● or ○ are paired with themselves.
The matrices returned by mGenerators are the matrices responsible for pairing the edges in this way.
The congruence closure of a subgroup is the smallest congruence subgroup that contains . We first start with the congruence subgroup of index 7 from the previous example and a noncongruence subgroup of index 9.
Its congruence closure is the theta subgroup.
One can also compute the congruence closure of a group by joining it with the principal congruence subgroup of the same level, but the package uses a much more efficient method. Membership in the groups , , and can be tested with , , and , respectively. The function mFromMember constructs the internal representation of the group given the group’s membership function.
We also have the following property, since g itself is congruence.
Next, we compute the congruence closure for the group given in Example 1.1 of [2]. It turns out to be the full modular group.
First get the principal congruence subgroup of level 5, which has index 60.
Generators may be computed quickly from this permutation representation, and we can also efficiently reconstruct the group from a list of generators.
We will graph the supergroup lattice for the principal congruence subgroup of level 4. The mSupergroups function is used to make the supergroup lattice. The index of each group (in ) is displayed in the lattice, and the actual group is displayed as a tooltip. Every subgroup of whose matrices can be described by congruence conditions modulo 4 appears somewhere on this lattice.
If is a subgroup of the modular group, then every matrix in acts on (the set of rational numbers with included) and partitions into equivalence classes. We say that two rational numbers and are equivalent under if there is an element of that sends to . The set of equivalence classes of under the action of is known as the cusps for , and there are finitely many cusps if has finite index in the modular group. The width of a cusp with respect to is defined to be the index of the stabilizer of inside the stabilizer of .
Let g and h be the subgroups
Here is a list of inequivalent cusps of h and their widths.
Here we reduce a list of random cusps to one of these four. The frequency of each cusp in the list should be proportional to its width.
The intersection of g and h may be computed.
Of course, the implementation in the package is more efficient.
If denotes the total number of subgroups of the modular group of index , then with so that it is possible to show that
and that
Since the radius of convergence of the power series is zero, this differential equation must be treated formally as a recurrence relation for the coefficients of . See Section 1 of [3] for details.
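The subgroup counts can also be computed by a different route from the differential equation above, namely M. Hall's recursion, which expresses the number $a_n$ of index-$n$ subgroups of a finitely generated group through the numbers $h_n$ of its homomorphisms into the symmetric group $S_n$: $a_n = h_n/(n-1)! - \sum_{k=1}^{n-1} h_{n-k}\, a_k/(n-k)!$. Since the modular group is the free product of cyclic groups of orders 2 and 3, $h_n$ is the number of $x \in S_n$ with $x^2 = 1$ times the number of $y \in S_n$ with $y^3 = 1$. The Python sketch below is an illustration of this swapped-in technique, with hypothetical function names.

```python
from fractions import Fraction
from math import factorial


def count_roots_of_identity(n_max, p):
    """t[n] = number of x in S_n with x**p = identity, for a prime p.
    The point n is either fixed or lies in a p-cycle with p - 1 other points."""
    t = [1]
    for n in range(1, n_max + 1):
        value = t[n - 1]
        if n >= p:
            ways = 1
            for k in range(1, p):
                ways *= n - k             # ordered choice of the other p - 1 points
            value += ways * t[n - p]
        t.append(value)
    return t


def subgroup_counts(n_max):
    """Number of subgroups of index 1..n_max in the modular group."""
    t2 = count_roots_of_identity(n_max, 2)
    t3 = count_roots_of_identity(n_max, 3)
    h = [t2[n] * t3[n] for n in range(n_max + 1)]
    a = [0] * (n_max + 1)
    for n in range(1, n_max + 1):
        total = Fraction(h[n], factorial(n - 1))
        for k in range(1, n):
            total -= Fraction(h[n - k], factorial(n - k)) * a[k]
        a[n] = int(total)                 # the recursion always yields an integer
    return a[1:]


print(subgroup_counts(8))
```

Exact fractions are used because the intermediate terms are not integers even though every final count is.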
Caranica [4, Table 3.1] has computed the conjugacy classes of noncongruence subgroups of index 9. However, this table incorrectly claims that there are 11 conjugacy classes. In fact, there are 108 noncongruence subgroups of index 9 and 12 conjugacy classes. Vidal [5] has given a formula for the generating function of the total number of conjugacy classes of subgroups (congruence or not) of a given index.
Fortunately, the Farey symbols for the claimed groups are provided by Caranica, so we can recover the source of the error. First we verify that there are indeed 11 conjugacy classes of noncongruence subgroups in the table.
This is the group that is missing from the table.
The Mathematica built-in function KleinInvariantJ is invariant under and takes each complex value exactly once inside a fundamental domain for . A plot of this function and an outline of its fundamental domain are shown.
The upper half-plane, which is parameterized by , has been mapped into the unit disk, which is parameterized by , by the relation .
The hue of the color plotted at a point in the disk is the argument of the complex number , where is the point in the upper half-plane corresponding to .
Similarly, the built-in function DedekindEta can be used to construct such a function for , which is a congruence subgroup of of index 3. A plot of this function and a fundamental domain for are shown next.
Here is a similar construction for . This plots the fundamental domain for in the half-plane.
In the disk, such a region becomes a diamond shape. A univalent function on must be invariant under the generators of ; that is, , and such a function is provided by the built-in function ModularLambda.
In this case, the zero of occurs on this diamond where the colors coalesce. In the previous example, the zeros of are not visible in this way because they occur on the boundary of the disk.
Many operations on subgroups of the modular group depend on operations on graphs. Several of the algorithms used here encounter unfolded graphs, and we use the efficient folding algorithm described in [6] to implement Stallings’s folding process, which converts any graph to a folded graph.
As described in the introduction, it is straightforward to convert a group described by the permutations and to the Schreier cosets graph. In order to reverse this procedure, it is necessary that each edge labeled either have the same initial and terminal vertex or be part of a three-cycle. Similarly, each edge labeled needs to occur in either a loop or a two-cycle. All of the graphs used here have this property. However, when building a group from a list of generators, we may encounter a folded graph in which some vertex does not have valence four. Such graphs do not correspond to subgroups of of finite index. Let us illustrate the folding procedure on the following graph.
Such a graph represents the subgroup that is generated by the two words and , since, except for the trivial cycles induced by the relations , these are the only cycles in the graph. Whenever there are two edges and incident at the same vertex with the same orientation and label, causing the graph to be unfolded, the edge may be deleted and the vertex may be merged with without changing the subgroup represented by the graph. The progression of the graph folding procedure shown is left to right, top to bottom, with the edges to be deleted shown between graphs.
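The folding step itself can be sketched with a union-find structure. The Python code below is a naive version that restarts its scan after every merge; it is not the efficient algorithm of [6], and the edge format `(source, label, target)` and the function name `fold` are assumptions made for the illustration.

```python
def fold(edges):
    """Fold a labeled directed graph given as (source, label, target) triples.
    Two edges with the same label leaving (or entering) the same vertex force
    their other endpoints to be merged. Returns the folded set of edges."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    edges = set(edges)
    changed = True
    while changed:
        changed = False
        # Rewrite every edge in terms of the current vertex representatives;
        # duplicate edges disappear because `edges` is a set.
        edges = {(find(u), a, find(v)) for u, a, v in edges}
        outgoing, incoming = {}, {}
        for u, a, v in sorted(edges):
            for table, key, other in ((outgoing, (u, a), v), (incoming, (v, a), u)):
                if key in table and table[key] != other:
                    parent[find(other)] = find(table[key])   # merge the two vertices
                    changed = True
                    break
                table[key] = other
            if changed:
                break
    return edges


print(fold([(1, "a", 2), (1, "a", 3), (2, "b", 4), (3, "b", 5)]))
```

In the example, the two `a` edges leaving vertex 1 force vertices 2 and 3 to merge, which in turn makes the two `b` edges collide and merges vertices 4 and 5.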
The subgroup of represented by the final folded graph has index three and is determined by the permutations and . This is also known as the theta subgroup. The reader is urged to work through the folding process for the group generated by and to see that it does not have finite index in . The starting graph is shown here.
It is useful to have a notion of a standard representation (in terms of the permutations and ) of a group whereby two groups are the same if and only if their permutations and are identical. This can be accomplished by visiting the coset first (denoted by the index 1 in the permutations). Once we have visited a coset , we then recursively visit the coset (assuming this has not been visited yet), and once this trip has returned to the coset , we visit the coset (also assuming this coset has not been visited yet). The standard labels for the indices for the nontrivial cosets may then be determined by the order in which that coset was visited.
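A standard labeling of this kind can be sketched as a depth-first walk that renumbers cosets in the order they are first reached. In the Python sketch below, the storage convention (lists indexed from 0, coset 0 the subgroup itself, permutations `s` and `r`) and the choice to follow `r` before `s` are assumptions; the package's actual visiting order may differ.

```python
def standardize(s, r):
    """Relabel the cosets in the order a depth-first walk from coset 0 first
    reaches them, following r before s. Assumes (s, r) acts transitively."""
    order = {}                            # old label -> new label

    def visit(i):
        if i in order:
            return
        order[i] = len(order)
        visit(r[i])
        visit(s[i])

    visit(0)
    n = len(s)
    new_s, new_r = [0] * n, [0] * n
    for old, new in order.items():
        new_s[new] = order[s[old]]
        new_r[new] = order[r[old]]
    return new_s, new_r


# Two labelings of the same index-3 representation (labels 1 and 2 swapped).
print(standardize([1, 0, 2], [2, 0, 1]))
print(standardize([2, 1, 0], [1, 2, 0]))
```

Both calls print the same pair of lists, so comparing standardized representations decides whether two representations describe the same group. For very large indices the recursion would need to be replaced by an explicit stack.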
In the case of testing a matrix for membership in a group , write as a word in and , then set and check if . Specifically, for a given matrix whose entries in the left column are nonnegative, multiply on the left by the matrices
until the left column contains a zero. The variable holds the current coset, so every time is multiplied by , for example, needs to be updated to .
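The membership test can be sketched end to end in Python. The sketch assumes the standard generators $S = \begin{pmatrix} 0 & -1 \\ 1 & 0 \end{pmatrix}$ and $T = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$, stores the action of $S$ and $T$ on the left cosets as lists `s` and `t` with coset 0 the subgroup itself, and uses the matrices with even lower-left entry as a worked example. The function names and the example permutations are illustrative and are not taken from the package.

```python
def word_in_s_and_t(m):
    """Write m = ((a, b), (c, d)) with determinant 1 as a word in S and T,
    up to sign. Returns (generator, exponent) pairs, read left to right."""
    (a, b), (c, d) = m
    if a * d - b * c != 1:
        raise ValueError("determinant must be 1")
    word = []
    while c != 0:
        q = a // c
        a, b = a - q * c, b - q * d       # left-multiply by T^(-q)
        if q:
            word.append(("T", q))
        a, b, c, d = -c, -d, a, b         # left-multiply by S
        word.append(("S", 1))
    if a * b:
        word.append(("T", a * b))         # what remains is +-T^(a*b)
    return word


def step(perm, i, k):
    """Apply the k-th power of a permutation to i (k may be negative)."""
    if k < 0:
        inverse = [0] * len(perm)
        for x, y in enumerate(perm):
            inverse[y] = x
        perm, k = inverse, -k
    for _ in range(k):
        i = perm[i]
    return i


def is_member(m, s, t):
    """Trace the word for m through the cosets, reading right to left because
    the cosets are left cosets; m is in the group if the walk ends at coset 0."""
    coset = 0
    for generator, k in reversed(word_in_s_and_t(m)):
        coset = step(s if generator == "S" else t, coset, k)
    return coset == 0


# Action of S and T on the three left cosets of the matrices with c even.
S_PERM, T_PERM = [1, 0, 2], [0, 2, 1]
print(is_member(((1, 1), (2, 3)), S_PERM, T_PERM))
```

For this example group the answer can be compared directly with the parity of the lower-left entry, which gives a convenient check of the coset-tracing logic.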
Given the membership function on matrices for a group, we construct the group coset by coset. Assume that has index at least three in . If , start with the four cosets ; otherwise, start with the three cosets . Proceed by adding either one or three cosets to at a time. If is such that:
When there is no coset satisfying either of these conditions, we have found all of the cosets of . A naive implementation of this procedure would have worst-case running time , where is the index of the resulting group. The worst-case running time may be reduced to by keeping track of which cosets actually need to be checked.
We may compute coset representatives, generators, and a Farey symbol in operations for a subgroup of index . This works as follows. Let be the graph corresponding to a subgroup of index . First, the graph is cut into a tree so that the coset labeled is given by the resulting unique path from the vertex 1 to the vertex . Any time a cut is made or a fixed point is encountered, the corresponding matrix is added to the list of generators. Finally, after the cosets and generators are computed and the cuts have been recorded, the Farey symbol is computed by a clockwise traversal of the tree.
The cusps of a given subgroup are also important. The action of on the upper half-plane is given by
and this action extends to . The equivalence classes of under the action of , namely , are finite, and we may choose a representative for each one as follows. The stabilizer of in is generated by (or ). Therefore, any two cusps (say and for ) are equivalent under whenever there is an integer such that ; that is, and belong to the same cycle of the permutation . The width of a cusp can then be defined as the length of the cycle (of ) that contains the coset .
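Reading cusps off as cycles of a permutation is straightforward to sketch. The Python code below assumes that the action of the translation generator on the cosets is given as a list `t` (entry `i` is the image of coset `i`); the function names are illustrative. It also computes the order of `t` as the least common multiple of the cycle lengths.

```python
from math import gcd


def cycles(perm):
    """Cycle decomposition of a permutation given as a list of images."""
    seen, result = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = perm[i]
        result.append(cycle)
    return result


def cusp_widths(t):
    """One cusp per cycle of t; the width is the length of the cycle."""
    return [len(cycle) for cycle in cycles(t)]


def permutation_order(t):
    """Order of t, the least common multiple of its cycle lengths."""
    order = 1
    for width in cusp_widths(t):
        order = order * width // gcd(order, width)
    return order


t = [0, 2, 1]                  # example action on three cosets
print(cycles(t), cusp_widths(t), permutation_order(t))
```

The widths always add up to the index of the subgroup, since every coset lies in exactly one cycle.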
Joining and intersecting two groups is surprisingly simple. To compute the group that is generated by and , we can form the graph for . Then, for each generator of , merge the vertices in corresponding to the cosets and . This will in general result in an unfolded graph, which we can then fold and convert back to a permutation representation. In order to compute the permutation representations for the intersection of and , first find the orbit of under the action of and in terms of cosets of the form . A permutation representation of may then be obtained by the action of and on the cosets in this orbit of .
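The intersection procedure amounts to computing the orbit of the pair of base cosets under the simultaneous action of the two generators. In the Python sketch below, each group is stored as a pair of lists `(s, r)` giving the action of an order-2 and an order-3 generator on its cosets, with coset 0 the group itself; this storage convention and the function name are assumptions for the illustration.

```python
def intersect(g, h):
    """Permutation representation of the intersection of two subgroups,
    obtained from the orbit of the pair (0, 0) under the simultaneous action."""
    (s1, r1), (s2, r2) = g, h
    index_of = {(0, 0): 0}
    orbit = [(0, 0)]
    for i, j in orbit:                    # the list grows while we iterate over it
        for p1, p2 in ((s1, s2), (r1, r2)):
            pair = (p1[i], p2[j])
            if pair not in index_of:
                index_of[pair] = len(index_of)
                orbit.append(pair)
    s = [index_of[(s1[i], s2[j])] for i, j in orbit]
    r = [index_of[(r1[i], r2[j])] for i, j in orbit]
    return s, r


index2 = ([1, 0], [0, 1])                 # s swaps the two cosets, r fixes them
index3 = ([0, 1, 2], [1, 2, 0])           # s fixes the cosets, r cycles them
s, r = intersect(index2, index3)
print(len(s), s, r)
```

In the example, the subgroups of index 2 and 3 intersect in a subgroup of index 6, and intersecting a group with itself returns a representation of the same index.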
It is also straightforward to check if two groups are the same or conjugate, or if one group is contained in another. To test if two groups and are the same, we employ a strategy similar to the process for standardizing the representation. The cosets of and are visited simultaneously, starting with the pair . If we are currently visiting the pair , then we visit the pairs and as described in the standardization process. If the two paths ever become out of sync, that is, if cosets are visited in a different order, then we know the groups are not the same; otherwise the two paths will return back to and we know that and are the same. Checking if and are conjugate can be accomplished by the same procedure. We need to check if the path stays in sync when starting at some pair for .
The congruence functions use the list of relations of Hsu [2]. Recall that the congruence closure of a group is the smallest congruence subgroup that contains . We compute the congruence closure of as follows. Hsu gives a list of relations that are satisfied if and only if is a congruence subgroup. Let be the list of the permutations where is a relation in Hsu’s list. If contains a nonidentity permutation , this represents an obstacle to being a congruence subgroup. Let denote the level of , which is defined as the order of the permutation . As it is known that contains , the set of relations for is also satisfied by . Let be any permutation in and an index of any coset in . Since must act trivially on the cosets of and is a subgroup of , the group obtained from by merging cosets and must also be contained in . Therefore, merging the cosets and of for all and must give .
We have described an efficient package for manipulating and constructing subgroups of the modular group. It is hoped that this will further interest in these groups and facilitate research dealing with these subgroups.
The author would like to thank Junxian Li of the University of Illinois at Urbana-Champaign for many helpful discussions on improvements and corrections to initial implementations of the algorithms.
[1]  C. A. Kurth and L. Long, “Computations with Finite Index Subgroups of $\mathrm{PSL}_2(\mathbb{Z})$ Using Farey Symbols.” arxiv.org/abs/0710.1835.
[2]  T. Hsu, “Identifying Congruence Subgroups of the Modular Group,” Proceedings of the American Mathematical Society, 124(5), 1996 pp. 1351–1359. www.ams.org/journals/proc/1996-124-05/S0002-9939-96-03496-X/S0002-9939-96-03496-X.pdf.
[3]  A. Lubotzky, “Counting Finite Index Subgroups,” London Mathematical Society Lecture Note Series 212: Groups ’93 Galway/St Andrews, Vol. 2 (C. M. Campbell, E. F. Robertson, T. C. Hurley, S. J. Tobin, and J. J. Ward, eds.), Cambridge: Cambridge University Press, 1995 pp. 368–404. 
[4]  C. C. Caranica, “Algorithms Related to Subgroups of the Modular Group,” Ph.D. thesis, Louisiana State University, 2009. etd.lsu.edu/docs/available/etd-07092009-200839/unrestricted/Caranica_diss.pdf.
[5]  S. A. Vidal, “Sur la classification et le dénombrement des sousgroupes du groupe modulaire et de leurs classes de conjugaison,” Publications IRMA, 66(II), 2006 pp. 1–35. Preprint: arxiv.org/abs/math.CO/0702223. 
[6]  N. W. M. Touikan, “A Fast Algorithm for Stallings’ Folding Process,” International Journal of Algebra and Computation, 16(6), 2006 pp. 1031–1045. doi:10.1142/S0218196706003396.
D. Schultz, “Manipulating Subgroups of the Modular Group,” The Mathematica Journal, 2016. dx.doi.org/doi:10.3888/tmj.18-4.
Daniel Schultz is a postdoctoral researcher at Pennsylvania State University who is interested in modular functions and automorphic forms in general.
Daniel Schultz
Pennsylvania State University Department of Mathematics
University Park
State College, PA 16802
dps23@psu.edu
We present an implementation of the PoissonInfluenced Means Algorithm (PIKA), first developed to characterize the output of a superconducting transition edge sensor (TES) in the fewphotoncounting regime. The algorithm seeks to group data into several clusters that minimize their distances from their means, as in classical means clustering, but with the added knowledge that the cluster sizes should follow a Poisson distribution.
The algorithm proper is run when it is submitted using the button near the lowerright corner of the form. You also have the option to use a separate input file to manually override the form, which may be more useful for automated runs on multiple datasets. This function launches the program; evaluating it generates the form and then executes PIKA. The function is defined and documented in Section 7.9. After taking input through the form, pika calls runPIKA, the de facto main function for the program, defined in Section 3. You can also specify a separate options file in the form that overrides the variable assignments that the form makes. The first command ensures that the paclet for forms is current.
If you get a popup that asks, Do you want to automatically evaluate all the initialization cells…?, answer yes. When you get the form, it is necessary to replace [directory] with the pathname of the location of the data; you might also have to change the backslash to a forward slash.
The Poisson-Influenced K-Means Algorithm (PIKA) was first described in [1] as a way of calibrating a transition edge sensor (TES), a superconducting few-photon detector. A TES can discern the number of photons in a very weak pulse of light, but it must be calibrated in order to do so. Our implementation deals with photon counting, but many of its features are applicable to more general probability-assisted K-means clustering situations.
A TES is a superconductor kept in its transition from its superconducting phase to the normal regime, where it loses its superconducting properties. Photons incident on the sensor heat it, causing its resistance to rise sharply and then slowly fall to superconducting levels as the heat dissipates. A current is run through the TES, and the change in resistance is captured by the voltage signal of a superconducting quantum interference device (SQUID) inductively coupled to the TES circuit.
Several groups of TES signal waveforms are shown in Figure 1 (each graph shows the set of signals elicited by an ensemble of laser pulses with a particular average number of photons per pulse). For low mean photon numbers, one can clearly distinguish the different photon numbers and their relative frequencies; for higher numbers this is harder. Higher photon numbers create higher signal amplitudes, but at a certain point the TES saturates in the normal regime and additional photons change the signal maximum very little.
Figure 1. Several collections of TES waveforms resulting from pulses with particular mean photon numbers, from [2].
The goal of PIKA is to characterize individual TES waveforms by the integer photon numbers of the pulses that cause them. The photon numbers of individual pulses cannot be determined directly; we can only estimate the average photon number of all of the pulses, based on the nominal laser and attenuator parameters of the light source.
K-means clustering, upon which PIKA is based, is a fundamental part of unsupervised machine learning. PIKA extends the K-means algorithm to scenarios in which the ideal distribution that the clusters should follow is known, and though some of the implementation is specific to the context of TES calibration (e.g. the use of the Poisson distribution, the idea of ordering observations by photon number), much of it can be generalized without much difficulty to other situations with known probability distributions.
Traditional K-means clustering consists of taking some amount of data and organizing it into clusters that minimize their members’ distance from the cluster mean. Essentially this is a minimization of an objective function, the sum over each piece of data of its deviation from its cluster mean (where deviation is measured by some relevant definition of distance). We can use a similar approach by considering each signal as a high-dimensional vector and its deviation from some mean as squared Euclidean distance. Then the K-means component of the objective function becomes
(1) 
where each signal vector belongs to one observation and has one element per time point, each cluster has a mean vector, and the sums run over the clusters, over the observations in each cluster, and over the time points. The first cluster’s photon number and the number of clusters are determined by which photon numbers we expect to be associated with at least one pulse based on the Poisson distribution. More physically, a signal vector is an individual waveform, and a cluster mean is the average of the waveforms with a given assigned photon number.
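As an illustration of the K-means term described above, the following Python sketch sums the squared Euclidean deviation of every trace from the mean of its cluster. It is not the article's Mathematica code; the names kmeans_term, traces, and cluster_of_obs are hypothetical, and each trace is assumed to be a plain list of voltage values.

```python
def kmeans_term(traces, cluster_of_obs):
    """Sum of squared deviations of each trace from its own cluster mean."""
    # Group the traces by their cluster label.
    clusters = {}
    for trace, label in zip(traces, cluster_of_obs):
        clusters.setdefault(label, []).append(trace)

    total = 0.0
    for members in clusters.values():
        n_time = len(members[0])
        # Mean waveform of this cluster, one value per time point.
        mean = [sum(tr[t] for tr in members) / len(members) for t in range(n_time)]
        # Add the squared distance of every member from that mean.
        for tr in members:
            total += sum((tr[t] - mean[t]) ** 2 for t in range(n_time))
    return total


# Two clusters of two short "waveforms" each.
example_traces = [[0.0, 1.0], [0.0, 3.0], [4.0, 4.0], [6.0, 4.0]]
example_labels = [1, 1, 2, 2]
print(kmeans_term(example_traces, example_labels))
```

Recomputing the term from scratch like this costs time proportional to the dataset size, which is why the article later updates it incrementally.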
To account for the Poisson-distributed cluster sizes, we introduce another term, the negative logarithm of a likelihood: the likelihood, according to the Poisson distribution, of a group of clusters associated with a group of photon numbers having the particular sizes that a given clustering asserts that they do, given the mean photon number of the ensemble of pulses. The likelihood of a particular sequence of photon numbers occurring in an ensemble with a given mean is
(2) 
where each cluster size is the number of waveforms in that cluster. We also need a combinatorial component, since different photon-number sequences can yield the same eventual cluster sizes:
(3) 
where . Then , and the PIKA objective function is
(4) 
The constant relating the two terms can be estimated from the data, since the objective function is itself the negative log-likelihood of a normal distribution. (This also means that minimizing the objective function is equivalent to maximizing the product of two likelihoods.) PIKA minimizes the objective function by moving waveforms to neighboring clusters.
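The probabilistic part of the objective function can be sketched in Python as follows. The sketch assumes the standard forms implied by the text: a Poisson probability for each photon number, raised to the power of the cluster size, multiplied by the multinomial count of sequences that give the same cluster sizes. The function name log_like_prob and its arguments are hypothetical, and this is not the article's Mathematica code.

```python
import math


def log_like_prob(sizes, first_photon_number, mu):
    """Natural log of (Poisson likelihood of a sequence) x (multinomial count).

    sizes[k] is the number of waveforms assigned photon number
    first_photon_number + k; mu is the ensemble mean photon number (> 0).
    """
    total = sum(sizes)
    log_poisson = 0.0
    log_denominator = 0.0
    for k, size in enumerate(sizes):
        n = first_photon_number + k
        # ln of the Poisson probability of n photons: -mu + n ln(mu) - ln(n!)
        log_p_n = -mu + n * math.log(mu) - math.lgamma(n + 1)
        log_poisson += size * log_p_n
        log_denominator += math.lgamma(size + 1)  # ln(size!)
    # Multinomial coefficient N! / prod(N_n!), in log form.
    log_combinatorial = math.lgamma(total + 1) - log_denominator
    return log_poisson + log_combinatorial


print(log_like_prob([5, 7], 3, 4.4))
```

Working with math.lgamma rather than factorials keeps the computation stable for the large cluster sizes typical of real datasets. The negative of this value is what would enter the objective function.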
Once the clusters are optimized, each waveform is assigned an effective photon number by a linear interpolation between the two closest cluster means. First, we find the value that minimizes the root mean square deviation of from , where and are the closest and second-closest mean waveforms to . In practice, and . One can easily show that
(5) 
for each . The effective photon number is then given by
(6) 
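The interpolation step can be sketched in Python under one assumption: the interpolation parameter is the least-squares weight of the second-closest mean in a two-mean blend, so that 0 means perfect agreement with the closest mean and 1 with the second closest (consistent with the description of getAlpha later in the article). The function names here are hypothetical, and this is not the article's Mathematica code.

```python
def alpha_for(trace, closest_mean, second_mean):
    """Weight of second_mean in the blend (1 - a) * closest + a * second
    that best approximates trace in the least-squares sense."""
    diff = [b - a for a, b in zip(closest_mean, second_mean)]
    numerator = sum((v - a) * d for v, a, d in zip(trace, closest_mean, diff))
    denominator = sum(d * d for d in diff)
    return numerator / denominator


def effective_photon_number(trace, closest_mean, second_mean, n_closest, n_second):
    """Interpolate linearly between the photon numbers of the two nearest means."""
    alpha = alpha_for(trace, closest_mean, second_mean)
    return (1.0 - alpha) * n_closest + alpha * n_second


# A trace halfway between the mean waveforms for 3 and 4 photons.
print(effective_photon_number([1.5, 3.0], [1.0, 2.0], [2.0, 4.0], 3, 4))
```

Geometrically, this projects the trace onto the line through the two cluster means and reads off its position along that line.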
PIKA needs an initial clustering upon which to improve. Random cluster assignment is an option, but a better alternative is to give the observations a rough order by photon number, so that our initial guess is actually a meaningful estimate. This is done via the dot product method: we assign each observation an initial effective photon number
(7) 
where the mean waveform is that of the entire ensemble, not of a single cluster. The initial clusters are sized to fit each observation and conform to the Poisson distribution, and the observations are placed in the clusters by order of effective photon number.
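A minimal Python sketch of the dot product method follows. It assumes that the initial effective photon number is the ensemble mean photon number multiplied by the projection of each trace onto the ensemble mean waveform, normalized by that waveform's squared magnitude; the scaling by mu is an assumption about equation (7), and the names are hypothetical.

```python
def dot_product_estimate(traces, mu):
    """Rough initial effective photon number for each trace (assumed form)."""
    n_time = len(traces[0])
    # Mean waveform of the whole ensemble, not of any one cluster.
    mean = [sum(tr[t] for tr in traces) / len(traces) for t in range(n_time)]
    norm = sum(m * m for m in mean)
    # Projection onto the mean waveform, scaled so the average equals mu.
    return [mu * sum(v * m for v, m in zip(tr, mean)) / norm for tr in traces]


print(dot_product_estimate([[1.0, 0.0], [3.0, 0.0]], 2.0))
```

By construction the estimates average to mu, so they serve only to order the observations before the clusters are sized.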
The geometric interpretation of PIKA and the dot product method is a curve and a line, respectively, evolving through hyperspace (shown in Figure 2). The dot product method projects each observation onto the mean waveform vector (a line) and then assumes that photon number scales linearly with distance along the mean vector (which is not actually true, but suffices for a first guess); that converts distance relative to the mean to photon number relative to the mean. PIKA, in contrast, finds a piecewise linear approximation of a curve that passes through the cluster means and projects each observation onto that. Both essentially measure photon number by progress along a one-dimensional path through high-dimensional space.
Figure 2. An illustration of the geometric differences between the dot product method (green arrow) and PIKA (blue curve). The 3D space here stands in for a high-dimensional space.
PIKA requires knowledge of the ensemble’s mean photon number in addition to an initial clustering. There are two ways of supplying that knowledge. The first is simply to give the exact mean photon number of the incoming pulses, if it is known; then PIKA clusters the data accordingly.
The second is to test several mean photon numbers on the data if the true mean is not known exactly. The test means should be close together in some range around a rough estimate of the true mean; PIKA clusters the data once for each value, returning a new (usually better) estimate of the mean based on each optimized clustering, as well as the value of the objective function associated with each new estimate. (In addition, since the test means are close together so that adjacent distributions should be very similar, the effective photon numbers for the waveforms from one round are used as the initial ordering for the next, instead of the dot product method.)
The optimized means are usually closer to the true mean than their initial seeds, but, depending on the structure of the probability distribution underlying the objective function, some optimized means may be moved farther away due to attraction to local minima. In the Poisson case, we have observed the objective function to have secondary minima at integer differences in mean photon number from the primary minimum and convex regions of width 1 centered around each minimum. Thus, test means that land in the same region as the true mean should be optimized toward the true mean, but those outside are diverted by secondary minima. Reference [3] estimates the range of test means to within half of a photon number from the nominal laser power and attenuation.
The following results and images (in Section 2.5) except Figure 6 come from [1].
Figure 3(a) shows the optimized cluster means from two PIKA runs with different mean photon numbers (one in solid red, the other in dotted blue). The shapes of the mean waveforms appear independent of the mean photon number, which is reassuring, since the average photon number of an ensemble should not affect the shape of individual waveforms. This independence indicates that PIKA is properly identifying actual photon numbers in the data, and that the average photon number with which it is supplied is not unduly affecting the results.
Figure 3(b) is a histogram of the optimized effective photon numbers (bin widths of 0.05), which follow a Poisson distribution but with some Gaussian spread around the integers (the red curves are Gaussians centered on the integers and fitted to the data), resulting in a comb-like shape.
Figure 3. (a) Optimized cluster mean waveforms for two mean photon numbers (solid red and dotted blue). (b) Optimized effective photon numbers (black) and integer-centered Gaussians fitted to the data (red).
Figure 4 shows a similar comb structure for both the dot product method and PIKA. As the photon number increases, the teeth of the comb grow less defined; that is, the peak visibility falls, and with it the photon-resolving capability. Figure 5 shows this drop in visibility. The power of PIKA is that it retains nonzero visibility (i.e. the uncertainty does not include 0) through higher photon numbers than the dot product method alone. PIKA has been used to explore the regime just beyond the loss of peak visibility [4].
Figure 4. Effective photon numbers from the dot product method and from PIKA for .
Figure 5. Peak visibility for the dot product method (blue) and PIKA (red) through (both lose visibility altogether after that point). Uncertainties are given at two standard deviations and are purely statistical.
Figure 6 illustrates the loss of visibility with high photon numbers. As the photon number increases, cluster means become closer to each other (saturation in the normal regime) and individual waveforms overlap with each other more and more, making it more difficult to differentiate between adjacent photon numbers.
Figure 6. Illustration of the loss of visibility as the photon number increases. Cluster means become closer together, so individual waveforms overlap more and more.
The code and documentation are written in the context of few-photon counting with a TES, so variables and functions are often named for physical quantities and concepts specific to the original purpose of the algorithm instead of more abstract ones (e.g. readTES instead of readData). In addition, the terms “trace,” “observation,” and “waveform” are used interchangeably and synonymously to mean a single detector response in the set of signals; that is, a regular sampling of voltage over time that records the response to a single optical pulse fired at the TES. In the program a trace is a list of voltage values.
Variable and function names also follow certain conventions:
This section describes the overarching form of the algorithm. Here, the process is broken into a series of more isolated procedures, whose implementation details are described in the next section. Note that the functions referenced in this section and defined in the next take no arguments and return no output—they merely break the whole algorithm into smaller pieces of code that operate on global variables.
Below is the skeleton of the process. We begin by defining constants and options (which can be overridden as necessary by the user), and then reading in the data with a noise filter. One can optionally filter out observations contaminated by background radiation; by default the algorithm does not do this. We then proceed to iterate (Loop 1) over the elements of nPhotonAvgList, a list of hypothetical mean photon numbers for the set of pulses incident on the sensor, running PIKA once on the dataset for each element. This lets us compare the optimized objective functions for each hypothesized mean and estimate the actual mean from the runs with the smallest objective functions. The dot-product method supplies the initial effective photon numbers (which create the initial clustering) for the first mean photon number; after that we use the optimized effective photon numbers from the previous mean on the list as our initial guess for the next mean (adjacent elements of nPhotonAvgList should be close together, so one round’s optimized ordering of the observations should be a reasonable seed for the next).
In Loop 1 (over nPhotonAvgList), we organize the traces by their clusters and then find the initial value of the objective function to be minimized. We then enter the actual optimization loop (Loop 2), which repeats some preset number of times, trying to lower the objective function by moving observations between clusters. For each iteration of Loop 2, the nested Loop 3 examines each observation and considers moving it to a neighboring cluster (one with a photon number one greater or less). The greedy algorithm [5] (the default option) accepts any move that reduces the objective function and rejects any that does not, while the simulated annealing algorithm [6] may accept disadvantageous moves so that it can find global minima instead of just local ones. If a move is accepted, the relevant pieces of the objective function are updated. When Loop 2 finishes, the observations have been organized into optimized clusters, and we create various graphical and textual outputs.
Here we define the functions referenced in the previous section. The structure of the following subsections mirrors that of the algorithm skeleton above: Subsection 1 defines the procedures called before Loop 1 begins, Subsections 2 through 4 define those used in Loop 1 (with Subsection 3 containing the functions for Loops 2 and 3), and Subsection 5 contains the (very minimal) end of the algorithm after exiting Loop 1. The functions are defined in the order in which they are called, and each is called exactly once. The functions called inside the bodies of these skeleton functions (i.e. those that take arguments, unlike the functions defined in this section) are defined in Sections 5 and 7.
The main execution function, pika, assigns values to the following variables based on the user’s response to the input form:
runPIKA (defined in Section 3 with implementation in this section) begins with some initial setup, processing some of the input given in pika and setting up tables to store graphical output.
Now we read in the data and apply a noise filter. readTESandFilter, defined in Section 7.7, takes the data from the set of files specified in the options above and applies a Hanning filter to it. We can then take a random sample and reduce the dataset to a smaller size, if desired, but by default we keep all of the observations. dat is a list of lists—each sublist corresponds to a waveform in the data and lists its values at regular time points.
In addition to the noise filter, we can remove observations from the dataset that have been contaminated by background radiation. getIObsKeep returns the indices of the observations in dat that should not be thrown out, based on the parameters peakPosCut, peakValCut, and peakNumCut that characterize waveforms unduly influenced by background radiation. The graphical output is a random sample of rejected observations.
meanTrace is the average of all of the waveforms in dat. The output here is a graph of meanTrace with a sample of accepted traces underlaid.
For the first mean photon number in nPhotonAvgList, the effective photon number for observation i is found via the dot-product method. For subsequent means, the optimized effective photon number from the previous item on nPhotonAvgList serves as the initial estimate for the next mean, since the elements of nPhotonAvgList are expected to be in order and close together. In the beginning of Loop 1, we create the initial clustering based on these effective photon numbers.
Now we enter Loop 1, whose index, iNPhotonAvgList, is the position in nPhotonAvgList of the current mean photon number. Each iteration of Loop 1 runs PIKA on the given dataset with the assumption that the waveforms in the data come from pulses incident on the sensor with an average of nPhotonAvg photons per pulse, where nPhotonAvg is the current element of nPhotonAvgList. The runs are independent of each other apart from the fact that one round’s optimized effective photon numbers are used as initial effective photon numbers for the next.
The initial clusters are formed from the effective photon numbers of the individual observations. The effective photon numbers order the observations, but the relative sizes of the clusters and the photon number to which each one corresponds are determined by the mean photon number, nPhotonAvg, and the Poisson distribution. groupProbability, defined and described in more detail in Section 7.4, returns:
nClust is then the number of clusters, which groupProbability decides based on which photon numbers we expect to see in at least one observation.
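The initial clustering can be sketched in Python as below. The article's groupProbability is defined elsewhere (Section 7.4), so this sketch rests on assumptions: a photon number is kept when its expected count is at least one, the kept Poisson probabilities are renormalized, and cluster sizes come from rounding the cumulative expected counts so that they total the number of observations. The names initial_clusters and max_photon are hypothetical.

```python
import math


def poisson_pmf(n, mu):
    """Poisson probability of n photons for mean mu (> 0)."""
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))


def initial_clusters(n_eff, mu, max_photon=60):
    """Return (photon_numbers, clusters); clusters[k] lists observation indices."""
    n_obs = len(n_eff)
    # Keep photon numbers expected in at least one observation (assumed rule).
    photon_numbers = [n for n in range(max_photon + 1)
                      if n_obs * poisson_pmf(n, mu) >= 1.0]
    weights = [poisson_pmf(n, mu) for n in photon_numbers]
    total = sum(weights)
    # Round cumulative expected counts so the cluster sizes sum to n_obs.
    edges = [0]
    running = 0.0
    for w in weights:
        running += w
        edges.append(math.floor(n_obs * running / total + 0.5))
    edges[-1] = n_obs  # guard against floating-point drift
    # Fill the clusters in order of increasing effective photon number.
    order = sorted(range(n_obs), key=lambda i: n_eff[i])
    clusters = [order[edges[k]:edges[k + 1]] for k in range(len(photon_numbers))]
    return photon_numbers, clusters


numbers, groups = initial_clusters([9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0, 0.0], 1.0)
print(numbers, [len(g) for g in groups])
```

Observation indices here are 0-based, unlike the 1-based indices of the Mathematica implementation.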
The following code graphs samples of waveforms from the middle three clusters.
Here we graph the average waveform of each cluster.
The initial, unoptimized value of the objective function is computed from equation (4) above. Here, sqDevOfClust is a table containing the square deviation of each cluster from its own mean, so the table’s sum is the K-means term of the objective function. nOfClust is a list of cluster sizes, and logLikeProb is the logarithm of the Poisson-combinatorial term. sigmaObjFtn is the regularization parameter for the objective function, based on the deviations of the initial clusters; it does not change as the objective function is minimized.
This graphs the root mean square (RMS) deviations of each cluster to its mean.
The optimization loop (Loop 2) is now ready to begin, apart from a few organizational tasks. First, since the optimization moves waveforms between clusters many times, we need a more convenient way to keep track of the clusters. getIClustOfObs converts iObsOfClust, a list of lists, into iClustOfObs, a one-dimensional list, each element of which corresponds to an observation in dat and gives the index of the cluster to which that observation belongs.
Next, for each cluster, neighborOfClust lists the acceptable clusters to transfer to during the optimization loop. By default, the neighbors of a cluster are the two adjacent clusters (those with photon numbers one lower and one higher), since the initial clustering (from the dot-product method or the previous round of PIKA) is expected to be at least a somewhat accurate arrangement, and so long-distance transfers should not be necessary.
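The two bookkeeping structures just described can be sketched in Python as follows. The function names are hypothetical stand-ins for getIClustOfObs and neighborOfClust, indices are 0-based rather than 1-based, and end clusters are assumed to have a single neighbor.

```python
def cluster_of_each_obs(obs_of_clust, n_obs):
    """Invert a list of lists (members per cluster) into a flat list
    giving the cluster index of each observation."""
    clust_of_obs = [None] * n_obs
    for k, members in enumerate(obs_of_clust):
        for i in members:
            clust_of_obs[i] = k
    return clust_of_obs


def neighbors_of_clusters(n_clust):
    """Adjacent clusters only: k - 1 and k + 1, where they exist."""
    return [[j for j in (k - 1, k + 1) if 0 <= j < n_clust]
            for k in range(n_clust)]


print(cluster_of_each_obs([[2, 0], [1, 3]], 4))
print(neighbors_of_clusters(4))
```

The flat list makes a single transfer a constant-time assignment, whereas the list-of-lists form would require a search and a deletion.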
Finally, a system for keeping track of transfer history lets us reduce unnecessary computation. kMove counts the number of transfers that have been made in the loop. birthOfClust is a table that records the move number when each cluster was last changed, and birthOfObsClust is a two-dimensional table that records, for each observation-cluster pair, the last time that the deviation from the observation to the cluster mean was calculated. sqDevOfObsClust is a table with the same dimensions that records the actual deviations so that they can be used later in the loop if they are still up to date.
Loop 2, nested inside Loop 1, begins here. Its index, iCool, specifies how many rounds of optimization have already passed. The optimization loop runs for a preset number of iterations (given by nCool in the initial setup), the first nGreedy of which employ the greedy algorithm. The remaining iterations are run with the simulated annealing algorithm. By default, nGreedy is equal to nCool, so all of the iterations use the greedy algorithm (i.e. simulated annealing is never used).
The bulk of Loop 2 is contained in the nested Loop 3, the waveform transfer loop that passes over all of the observations in dat and decides whether to move them to neighboring clusters. Before starting Loop 3, we randomly order the indices in dat of the waveforms upon which Loop 3 is to operate, so that the original ordering of the data does not bias the algorithm in any systematic way.
The index of Loop 3, jObs, is the current position of the loop in iObsAll, the random permutation of the observation indices. iObs is then set to the index of the observation itself, rather than the index of the index. iClust is the index of the associated cluster. If the cluster has only one observation then we skip to the next iteration of Loop 3, since no cluster is allowed to become empty. We then randomly pick a cluster jClust from the list of neighbors, and we consider moving the waveform from its current cluster to the new one.
To reduce the computation time, birthOfObsClust and birthOfClust record the “ages” of observation-to-mean deviation calculations and changes to cluster contents, respectively, so this step checks whether iClust and jClust have been modified since we last calculated the deviation of iObs from their means. If they are unchanged, then sqDevOfObsClust is up to date; this condition becomes increasingly common as the calculation approaches convergence. If they are changed, we need to recalculate one or both deviations and store them in sqDevOfObsClust.
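The age-based caching described here can be sketched in Python as a small class. The attribute names mirror kMove, birthOfClust, birthOfObsClust, and sqDevOfObsClust, but the class itself, the use of -1 to mark entries that were never computed, and the recomputed counter are assumptions made for illustration.

```python
class DeviationCache:
    """Caches squared deviations and recomputes only stale entries."""

    def __init__(self, n_obs, n_clust):
        self.k_move = 0                      # accepted transfers so far
        self.birth_of_clust = [0] * n_clust  # move number of each cluster's last change
        # Move number at which each (observation, cluster) deviation was
        # computed; -1 marks "never computed".
        self.birth_of_obs_clust = [[-1] * n_clust for _ in range(n_obs)]
        self.sq_dev = [[0.0] * n_clust for _ in range(n_obs)]
        self.recomputed = 0                  # counter, for illustration only

    def deviation(self, i_obs, k, trace, mean):
        """Squared distance from trace i_obs to the mean of cluster k."""
        if self.birth_of_obs_clust[i_obs][k] < self.birth_of_clust[k]:
            # The cluster changed after the stored value was computed.
            self.sq_dev[i_obs][k] = sum((v - m) ** 2 for v, m in zip(trace, mean))
            self.birth_of_obs_clust[i_obs][k] = self.k_move
            self.recomputed += 1
        return self.sq_dev[i_obs][k]

    def record_move(self, k_from, k_to):
        """Mark both clusters as changed by a newly accepted transfer."""
        self.k_move += 1
        self.birth_of_clust[k_from] = self.k_move
        self.birth_of_clust[k_to] = self.k_move


cache = DeviationCache(n_obs=2, n_clust=2)
cache.deviation(0, 0, [1.0, 2.0], [0.0, 0.0])   # computed
cache.deviation(0, 0, [1.0, 2.0], [0.0, 0.0])   # served from the cache
print(cache.recomputed)
```

Near convergence few transfers are accepted, so most lookups are served from the cache without touching the waveform data.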
We need to know how a proposed move would change the objective function before deciding whether to accept it. Section 5 gives more detail on the functions called here that efficiently compute the change in the objective function.
If the transfer would decrease the objective function, we automatically accept it, and if we use the greedy algorithm, we reject any transfer that does not decrease the objective function. If we use simulated annealing, however, we may accept a transfer that increases the objective function for the purpose of exploring a greater number of assignments and possibly avoiding local minima that would trap the greedy algorithm. In simulated annealing, it grows harder for a disadvantageous transfer to occur as time goes on and the “temperature” drops.
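The acceptance rule can be sketched in Python as follows. The greedy branch follows the description above directly; the annealing branch uses the Metropolis criterion (accept a worsening move with probability exp(-delta/T)) and a geometric cooling schedule, both of which are assumptions, since the text does not state the exact forms used. The function names and the factor 0.9 are hypothetical.

```python
import math
import random


def accept_move(delta, temperature=0.0, rng=random):
    """Decide whether to accept a transfer that changes the objective by delta.

    temperature <= 0 gives the greedy rule; temperature > 0 gives a
    Metropolis-style simulated annealing rule (an assumed form).
    """
    if delta < 0:
        return True    # improvements are always accepted
    if temperature <= 0:
        return False   # greedy: reject anything that does not improve
    # Annealing: accept a worsening move with probability exp(-delta / T).
    return rng.random() < math.exp(-delta / temperature)


def cool(temperature, factor=0.9):
    """Geometric cooling schedule (the factor is a hypothetical choice)."""
    return temperature * factor


print(accept_move(-0.5), accept_move(0.5))
```

As the temperature falls, exp(-delta/T) shrinks for any fixed positive delta, which is how disadvantageous transfers become rarer over time.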
If the move is accepted, then we need to move the observation from one cluster to the other and update the clusters’ deviations and mean traces. Both clusters have been changed in this move, so we update birthOfClust accordingly.
After making a transfer or deciding not to, Loop 3 moves to the next waveform in the randomly ordered list or finishes if there is none.
Loop 3 is now over, after examining all of the observations and moving some to neighboring clusters. If we use the simulated annealing algorithm, the annealing temperature decreases after Loop 3, making it more difficult for a move that increases the objective function to be accepted.
Loop 2 repeats a preset number of times, randomly ordering the waveforms, running Loop 3 on each one, and then decreasing the annealing temperature in each iteration.
When Loop 2 is finished, PIKA has run for many rounds on the data and the observations should be arranged into optimal clusters. This subsection concerns itself primarily with creating graphical and numerical outputs to understand and visualize the clustering. Now that the optimization is finished, iObsOfClust (the list of lists) is a more useful format than iClustOfObs (the simple list). In order to compare the new clustering to the old, we generate iObsOfClustNew from iClustOfObs, which reflects the optimized clusters, and simply copy iObsOfClustOld from iObsOfClust, which has not been changed since before Loop 2. freqNew and freqOld are lists of the new and old cluster sizes, respectively, and nPhotonAvgNew is an estimate of the actual mean photon number based on the results of PIKA.
As before, we graph some sample waveforms from middle clusters.
Here we numerically output and then graph the optimized cluster mean waveforms, with a graph of both the optimized and initial means as well.
This graph compares the initial and final cluster counts to a Poisson distribution.
is the average mean square deviation of the observations in cluster i to the mean waveform in cluster j. This graph is of the RMS deviations of the traces in clusters to adjacent means.
is the mean square deviation of the mean of cluster i to that of cluster j. This graph is of the RMS deviations of cluster means to means one, two, and three clusters away.
nPhotonEff lists the effective photon number of each observation (getNPhotonEff is implemented in Section 7.2). The rest of this code creates a histogram of nPhotonEff with bin widths of binFract: binEdge specifies where the edges of the bins should be, binCnt counts the number of waveforms in each bin, and binCtr lists the centers of the bins.
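The histogram construction can be sketched in Python as below. The returned lists play the roles of binEdge, binCnt, and binCtr, and bin_width plays the role of binFract; aligning the bin edges to integer multiples of the bin width is an assumption, as is the function name.

```python
import math


def histogram(values, bin_width):
    """Return (edges, counts, centers) for equal-width bins covering values."""
    lo = math.floor(min(values) / bin_width)       # index of the first bin
    hi = math.floor(max(values) / bin_width) + 1   # one past the last bin
    edges = [k * bin_width for k in range(lo, hi + 1)]
    counts = [0] * (hi - lo)
    for v in values:
        counts[math.floor(v / bin_width) - lo] += 1
    centers = [(k + 0.5) * bin_width for k in range(lo, hi)]
    return edges, counts, centers


edges, counts, centers = histogram([0.1, 0.4, 0.6, 1.2], 0.5)
print(edges, counts, centers)
```

Values that fall exactly on a bin edge are sensitive to floating-point rounding, so a production version might prefer integer bin indices computed once and stored.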
Loop 1 runs PIKA on the data once with each mean photon number in nPhotonAvgList and then exits.
After Loop 1 finishes, the algorithm outputs the computation time used and displays the output from PIKA.
A naive implementation of the transfer of waveforms between clusters would recalculate the various components of the objective function from scratch with each change, resulting in a computationally expensive process whose running time would depend on the size of the clusters in question. We can avoid this excessive calculation by noticing that, given some basic summary information about the clusters and waveform to which a given transfer pertains, the changes to the objective function and the cluster means are very easy to compute without any knowledge of the other waveforms in the clusters. The running times of the updates in this section are independent of cluster size.
From equation (1), we can decompose the K-means term of the objective function as
(8) 
where
(9) 
If we transfer a waveform into a cluster, the new cluster members form the union of the original members and the transferred waveform. It can be shown that
(10) 
If we transfer a waveform out of a cluster, the new cluster members form the original set with the transferred waveform removed, with
(11) 
In equations (10) and (11), the cluster quantities refer to the original clusters, before the transfer of the waveform. See below for a proof that these equations indeed give the proper changes in cluster deviation. Equation (11) corrects equation (A4) in [1]. After the transfer, the means of the two clusters become
(12) 
with plus signs for the receiving cluster and minus signs for the donating cluster. No cluster is ever allowed to become empty (in such a case we would have not K but K − 1 means), so the denominator never becomes zero. Therefore, we can determine the appropriate change to the means and square deviations of the clusters between which the transfer takes place.
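The constant-time updates can be sketched in Python using the standard identities for adding a point to, or removing a point from, a set with known size, mean, and sum of squared deviations. These identities are assumed to correspond to equations (10) through (12); the function names are hypothetical, and this is not the article's Mathematica code.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def add_to_cluster(size, mean, sq_dev, trace):
    """Update (size, mean, sum of squared deviations) after adding trace."""
    d2 = sq_dist(trace, mean)  # distance to the mean before the transfer
    new_mean = [(size * m + v) / (size + 1) for m, v in zip(mean, trace)]
    return size + 1, new_mean, sq_dev + d2 * size / (size + 1)


def remove_from_cluster(size, mean, sq_dev, trace):
    """Update (size, mean, sum of squared deviations) after removing trace."""
    if size < 2:
        raise ValueError("a cluster may not become empty")
    d2 = sq_dist(trace, mean)  # distance to the mean before the transfer
    new_mean = [(size * m - v) / (size - 1) for m, v in zip(mean, trace)]
    return size - 1, new_mean, sq_dev - d2 * size / (size - 1)


# Cluster {[0, 0], [2, 0]} has mean [1, 0] and squared deviation 2.
state = add_to_cluster(2, [1.0, 0.0], 2.0, [4.0, 0.0])
print(state)
print(remove_from_cluster(*state, [4.0, 0.0]))
```

Neither function looks at the other members of the cluster, so the cost of a transfer depends only on the number of time points.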
Suppose that one cluster has nine waveforms and another has 14 (and each has some mean waveform in addition), and we want to move a waveform from the larger cluster to the smaller one. When we make this transfer, the K-means term changes by the following amount.
The mean waveforms of the two clusters become the following.
Of course, the cluster sizes change as well: the receiving cluster now has 10 waveforms, while the donating cluster has 13.
The derivation for equations (10) and (11) is slightly involved and requires some preliminaries. Voltage signals are considered as vectors for the derivation, denoted , where the discrete time index is not given explicitly in this section.
Lemma:
Proof:
which is equivalent to the first expression. (The final step is valid because .) □
Lemma:
Proof:
From the proof of the previous lemma, we have
and thus
Finally,
and the lemma follows. □
Now we can prove that equations (10) and (11) hold true.
Theorem:
Proof:
(Note the summation over , not .) Since is the observation to add to or remove from , we can write
By the second lemma, we have
and the theorem follows. □
From equation (2), the Poisson log-likelihood is
(13) 
where . If some waveform is transferred from the to the cluster, then the change in the objective function consists of a term from the Poisson factor and a term from the combinatorial factor from equation (3). The term due to the Poisson factor is
(14) 
(The notation suggests , but the same formula applies for the case.)
The combinatorial factor’s treatment is similar. The logarithm of the factor is
(15) 
does not change with a transfer, so only the second term contributes to the change:
(16) 
The algorithm only considers moves to adjacent clusters, so we only need to calculate the change due to transfers to immediately preceding and succeeding clusters. These two functions give the change in the probabilistic/combinatorial component of the objective function for moves to higher and lower photon numbers.
Bear in mind that the term that contributes to the overall objective function is the negation of its logarithm, so a positive change in the log-likelihood decreases the objective function.
Suppose we have a set of clusters, among them a cluster with five waveforms and an adjacent cluster with seven waveforms, with a mean of 4.4 photons. We are interested in moving a waveform from one of these clusters to the other, toward the higher photon number. Then the change in the Poisson log-likelihood is given by deltaPoissonUp.
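The changes for moves to adjacent clusters can be sketched in Python in closed form. The sketch assumes the log-probability is the Poisson term plus the multinomial term described earlier, so that moving one waveform from photon number n to n + 1 changes it by ln(mu N_n / ((n + 1)(N_{n+1} + 1))). The function names are hypothetical, and the photon number 3 in the example is an illustrative choice, not a value taken from the article.

```python
import math


def delta_log_prob_up(n, size_n, size_next, mu):
    """Change in ln P when one waveform moves from photon number n to n + 1.

    size_n and size_next are the cluster sizes before the move.
    """
    return math.log(mu * size_n / ((n + 1) * (size_next + 1)))


def delta_log_prob_down(n, size_n, size_prev, mu):
    """Change in ln P when one waveform moves from photon number n to n - 1."""
    return math.log(n * size_n / (mu * (size_prev + 1)))


# Five waveforms at (hypothetically) 3 photons, seven at 4, mean 4.4.
print(delta_log_prob_up(3, 5, 7, 4.4))
```

A move up followed by the reverse move down restores the original sizes, so the two changes cancel, which is a convenient sanity check on any implementation.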
We would like to thank Boris L. Glebov and Alan L. Migdall for providing the TES data. Figure 1 courtesy Thomas Gerrits and the Optical Society of America.
The functions in this section either are used often, and so are given their own names, or perform very particular procedures (such as data reading) whose implementation details are tangential to the core operation of the algorithm, and so have been separated from the main body of the code.
boundList confines the elements of a list between two inclusive bounds.
getDotToIdeal takes a single trace or a set of traces and finds its or their dot product with an ideal trace, normalized by the square of the ideal trace’s magnitude. If the function is given a single trace as its first argument, the operation in the numerator is a simple dot product; if it is given a list of traces, the operation is matrix-vector multiplication, which returns a list of dot products.
getMeanOfEachTime averages all of the traces in dat at each time point, returning the overall mean waveform.
getSquareDiff returns a table that lists the squared magnitude of the difference between each observation and each cluster mean.
groupCommon is a small function that rearranges the result of GatherBy. It is best understood from its definition.
meanSquare takes a vector and returns its mean square (magnitude squared divided by length).
pair takes two lists (of equal length) and pairs elements at the same position, returning a list of pairs.
randomSample is a wrapper for RandomSample that returns a random sample of some size from a list, reducing the sample size if it is greater than the length of the list.
twoMin takes a list and returns the positions of the smallest and second-smallest elements, in that order.
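Two of the small utilities above can be sketched in Python as follows. The names bound_list and two_min are hypothetical counterparts of boundList and twoMin, and positions are 0-based rather than 1-based.

```python
def bound_list(values, low, high):
    """Confine each element to the inclusive range [low, high]."""
    return [min(max(v, low), high) for v in values]


def two_min(values):
    """Positions of the smallest and second-smallest elements, in that order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return order[0], order[1]


print(bound_list([-1, 3, 9], 0, 5))
print(two_min([4.0, 1.0, 3.0, 2.0]))
```

Sorting all positions is simple but does more work than necessary; a single pass that tracks the two smallest values would suffice for long lists.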
The three-argument version of getNPhotonEff takes a list of observations, a list of photon numbers associated with clusters, and a list of cluster means. It returns a list specifying the effective photon number of each observation, found via a linear interpolation between the two cluster means closest to a given observation.
alphaByClustPair is a list that gives the value of the interpolation parameter for each observation, organized by the two cluster means to which each observation is closest. getNPhotonEff (the two-argument version) takes that information and finds the effective photon number for each observation.
getAlphaByClustPair returns a list of ordered triples that give the cluster means closest to a group of observations and the alpha values for those observations. Element (i, j) of msDev is the mean square deviation of observation i to cluster mean j, and element i of twoMinIndex is a pair of indices giving the two cluster means with the lowest mean square deviations to observation i (i.e. a pair of indices giving the two columns with the smallest elements in row i of msDev). iObsTwoMinIndex pairs the relevant observation index with each pair in twoMinIndex. In the next line, the GatherBy groups the elements of iObsTwoMinIndex by their lowest-deviation index pairs, so that each element of iObsTwoMinIndexGroup is a list of elements of iObsTwoMinIndex that all have the same pair of nearest cluster means. Mapping groupCommon onto each of these lists in iObsTwoMinIndexGroup converts them into a more useful form; each element of twoMinIndexIObsList is now a list of length two, with the first entry being a pair of nearest-mean indices and the second being a list of observation indices whose nearest two means correspond to the pair in the first entry. The output is a list with an entry for each distinct ordered pair of nearest means. Each entry contains a pair of nearest-mean indices, a list of observations whose nearest means are those two, and a list giving the alpha value for each of those observations.
getAlpha takes a list of observations and a pair of ideal waveforms and finds the value of alpha for each observation. Alpha for a waveform is the interpolation weight that minimizes the RMS deviation of the observed waveform from the weighted combination of two mean waveforms: the closest mean waveform (the first entry in the second argument of getAlpha) and the second closest. getAlpha finds the minimizing value from equation (5) for each observation in traces. A value of 0 indicates perfect agreement with the first ideal trace, and a value of 1 indicates perfect agreement with the second.
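A minimal Python sketch of this minimization follows. It is not the authors' code, and it assumes that the interpolated waveform is (1 - alpha) times the first mean plus alpha times the second, so that the least-squares minimizer has the closed form used below; whether this matches equation (5) term by term is an assumption. The name get_alpha and its argument names are hypothetical.

```python
def get_alpha(trace, mean_first, mean_second):
    """Least-squares interpolation weight for one waveform.

    Minimizes the squared deviation of trace from
    (1 - alpha) * mean_first + alpha * mean_second over alpha.
    """
    diff = [b - a for a, b in zip(mean_first, mean_second)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("the two mean waveforms are identical")
    # Project (trace - mean_first) onto (mean_second - mean_first).
    num = sum((w - a) * d for w, a, d in zip(trace, mean_first, diff))
    return num / denom
```

With this convention a trace equal to the first mean gives 0 and a trace equal to the second gives 1, in line with the description above; values outside the interval from 0 to 1 are possible for traces that lie beyond either mean.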
The two-argument getNPhotonEff takes the formatted list of triples alphaByClustPair and a list of cluster photon numbers and determines the effective photon number of each observation. alpha is a list of alpha values for the observations, iObs is a list of observation indices corresponding to those alpha values, and clustPair is a list of nearest-mean-index pairs for those observations. alpha and clustPair are in the same order as iObs, but we want them in the same order as the waveforms from the original dataset, so we reorder them with iiObs. nPhotonPair takes the cluster indices in clustPair and converts them to the actual photon numbers associated with those clusters. The output is a list specifying the effective photon number for each observation, derived from a linear interpolation between the closest and second-closest photon numbers.
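The final interpolation step can be sketched in Python as follows. This is an illustration under the assumption that alpha = 0 corresponds to the closest photon number and alpha = 1 to the second closest; the function name and the flat list-of-pairs input are hypothetical simplifications of the triples described above.

```python
def effective_photon_numbers(alphas, photon_pairs):
    """Linearly interpolate between the closest and second-closest
    photon numbers, one (n_closest, n_second) pair per observation."""
    return [(1 - a) * n_closest + a * n_second
            for a, (n_closest, n_second) in zip(alphas, photon_pairs)]
```

For example, an observation with alpha 0.25 whose two nearest clusters have 3 and 4 photons is assigned an effective photon number of 3.25.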
If background rejection is enabled, getIObsKeep returns the indices of the observations that should stay in the dataset, based on the preset parameters for rejection. peakPosList and peakValList are lists specifying the position and value of the maximum in each waveform in dat. getIObsDrop returns the indices of the waveforms that should be removed because their peak positions and values exceed their respective cutoff values, and getIObsDropA lists those whose endpoints are greater than the value cutoff, indicating that the sensor was registering a background photon at the beginning or end of the pulse. These two lists to remove are combined into iObsDrop. Each element of datPeakLoc lists the positions of the local maxima of a waveform in dat, datPeakLocVal pairs each of those positions with the waveform’s value there, and datPeakLocValBig reduces datPeakLocVal to the maxima that exceed the peak value cutoff. datPeakNum gives the number of maxima in each waveform, and getIObsKeepA returns the indices that should stay in the dataset (i.e. those that are not in iObsDrop and that have fewer maxima than peakNumCut).
posMax returns the position of the first occurrence of a list’s maximum.
getMaxAndPosList returns two lists, the first giving the position of each waveform’s maximum and the second giving the maxima themselves.
getIObsDrop returns the indices of the waveforms whose peaks have values and positions greater than the cutoffs.
getIObsDropA returns the indices of the waveforms whose endpoints have values greater than the cutoff, indicating that a background photon may have registered with the sensor at the beginning or end of the pulse.
peakList finds the local maxima of a list by comparing each element to the ones before and after it.
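A Python sketch of this neighbor comparison is given below (not the authors' code). It assumes strict inequalities and that the two endpoints are never counted as maxima; positions are 0-based. Both conventions are assumptions of the sketch.

```python
def peak_list(values):
    """Return the 0-based positions of strict interior local maxima."""
    return [i for i in range(1, len(values) - 1)
            if values[i - 1] < values[i] > values[i + 1]]
```

Under these conventions a flat-topped peak such as 1, 3, 3, 1 is not reported, which may or may not match the original treatment of plateaus.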
getDatPeakLoc takes a list of waveforms and returns the locations of the maxima for each.
getDatPeakLocVal pairs each of the locations returned by getDatPeakLoc with the value of the corresponding waveform at that point.
getDatPeakLocValBig finds the maxima that exceed the cutoff among those returned by getDatPeakLocVal.
getIObsKeepA takes the list of waveforms rejected because of misplaced peaks and too-large endpoints and combines that with the information on each waveform’s number of maxima to return a list of the indices of the waveforms that should be kept, that is, those not rejected due to background radiation.
These two functions convert between iObsOfClust and iClustOfObs, two different ways of organizing clusters of waveforms. iObsOfClust is a list of lists. Each sublist is associated with a cluster and has the indices (in dat) of the traces in that cluster. iClustOfObs is a simple list, each of whose entries corresponds to a trace and states what cluster that trace is in (0 if the trace is not in any cluster). getIClustOfObs takes iObsOfClust and returns iClustOfObs, and getIObsOfClust is its inverse function.
getIClustOfObs creates a table with an entry for each observation and then iterates through the clusters. In the table, it assigns the contents of each cluster to the appropriate cluster number.
getIObsOfClust creates a list of observation number/cluster number pairs and then applies a GatherBy to group waveforms in the same clusters together. Mapping groupCommon onto the list creates a list of lists, each of whose first element is a cluster number and each of whose second element is a list of the indices of the waveforms in that cluster. We then sort the overall list by cluster number and extract the second entries in the sublists to create a list of lists of observation indices.
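The two representations and the conversions between them can be sketched in Python as follows. This is an illustration rather than the authors' code: it keeps the 1-based observation and cluster numbers of the text so that 0 can mean "not in any cluster", uses a dictionary in place of GatherBy, and takes the number of observations as an explicit argument (an assumption of this sketch).

```python
def get_i_clust_of_obs(i_obs_of_clust, n_obs):
    """Cluster number of every observation (0 means no cluster)."""
    i_clust_of_obs = [0] * n_obs
    for clust, members in enumerate(i_obs_of_clust, start=1):
        for obs in members:
            i_clust_of_obs[obs - 1] = clust
    return i_clust_of_obs


def get_i_obs_of_clust(i_clust_of_obs):
    """Group observation numbers by cluster, ordered by cluster number."""
    groups = {}
    for obs, clust in enumerate(i_clust_of_obs, start=1):
        if clust != 0:
            groups.setdefault(clust, []).append(obs)
    return [groups[clust] for clust in sorted(groups)]
```

The round trip recovers the original list of lists only when no cluster is empty, because an empty cluster leaves no trace in the flat representation.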
groupProbability takes the data, the hypothesized mean photon number for the current round, and the effective-photon-number ordering of the data, and returns a list of four things: the Poisson probability distribution (ordered pairs of photon numbers and probabilities), a list of photon numbers for the clustering it creates, a list of lists of constituent waveforms’ indices for the clusters (iObsOfClust), and a list containing the average waveform in each cluster. groupProbability finds the photon number/value pairs for the Poisson distribution with the given mean photon number and then calls groupProb to organize the clusters.
groupProb processes the data into clusters based on the given effective photon number list and the Poisson distribution. First we find the proper ordering of the effective photon numbers by size (nPhotonEffSortIndex); this will be needed once we have determined the sizes of the clusters to be created. The probability distribution prob is broken up into its component parts: nPhoton lists photon numbers, and probList lists their corresponding probabilities. probCum is the cumulative distribution function of probList, which we normalize to ensure that the final element is one. probIndexStart[[i]] gives the index (in the sorted list of effective photon numbers, nPhotonEffSortIndex) where cluster i should start; probIndexStop[[i]] gives the index where it should stop. Both indices are inclusive, which is why probIndexStart is offset by one from probIndexStop. (boundList ensures that the starting and stopping indices stay within the confines of the number of observations available to group.) keep weeds out the zero-length bins, and we use it to reduce nPhoton, probIndexStart, and probIndexStop down to only the nontrivial clusters. We then form iObsOfClust by finding the range of sorted effective photon numbers that each cluster should encompass and then using nPhotonEffSortIndex to find the observation numbers to which that range corresponds. clustMeanTrace is formed by taking the waveforms in dat of each cluster in iObsOfClust and averaging them. We then return nPhotonUse (the photon numbers of the nonempty clusters), iObsOfClust, and clustMeanTrace.
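The grouping step can be sketched in Python as follows. This is not the authors' implementation: the rounding of the cumulative distribution to observation counts uses Python's built-in round, half-open 0-based ranges replace the inclusive start and stop indices of the text, and the averaging of waveforms into clustMeanTrace is omitted. All names are hypothetical.

```python
def group_prob(n_photon_eff, prob):
    """Split observations, sorted by effective photon number, into
    consecutive clusters whose sizes follow the distribution prob.

    prob is a list of (photon_number, probability) pairs.
    Returns (photon numbers of nonempty clusters, 0-based observation
    indices of each cluster).
    """
    n_obs = len(n_photon_eff)
    sort_index = sorted(range(n_obs), key=lambda i: n_photon_eff[i])
    total = sum(p for _, p in prob)
    stops, running = [], 0.0
    for _, p in prob:
        running += p
        # Normalized cumulative probability, scaled and clamped to 0..n_obs.
        stops.append(min(n_obs, max(0, round(n_obs * running / total))))
    stops[-1] = n_obs          # the last cluster always ends at the last observation
    starts = [0] + stops[:-1]  # each cluster starts where the previous one stopped
    n_photon_use, i_obs_of_clust = [], []
    for (n, _), start, stop in zip(prob, starts, stops):
        if stop > start:       # drop zero-length bins
            n_photon_use.append(n)
            i_obs_of_clust.append([sort_index[k] for k in range(start, stop)])
    return n_photon_use, i_obs_of_clust
```

With eight observations and probabilities 0.25, 0.5, and 0.25 for 0, 1, and 2 photons, the two observations with the smallest effective photon numbers form the first cluster, the next four the second, and the last two the third.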
This is the combinatorial term of the log-likelihood objective function. From equation (15) above, , where the are the cluster sizes and .
The Poisson log-likelihood term is .
probComboLogLikelihood finds the Poisson/combinatorial log-likelihood term of the objective function.
poisson generates about nSigma standard deviations of a Poisson distribution on either side of the given mean. The output is a list of ordered pairs of photon numbers and associated values of the Poisson probability mass function. The output is normalized so that it sums to one.
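A Python sketch of such a truncated, renormalized Poisson table is shown below. It is not the authors' code; the rounding of the truncation limits (floor below, ceiling above, never below zero) and the default value of n_sigma are assumptions. The probabilities are computed through logarithms with math.lgamma to avoid overflow in the factorial.

```python
import math


def poisson(mean, n_sigma=5.0):
    """(photon number, probability) pairs covering about n_sigma standard
    deviations on either side of mean, renormalized to sum to one."""
    if mean <= 0:
        raise ValueError("mean must be positive")
    sigma = math.sqrt(mean)  # the standard deviation of a Poisson distribution
    lo = max(0, math.floor(mean - n_sigma * sigma))
    hi = math.ceil(mean + n_sigma * sigma)
    numbers = range(lo, hi + 1)
    pmf = [math.exp(-mean + n * math.log(mean) - math.lgamma(n + 1))
           for n in numbers]
    total = sum(pmf)
    return [(n, p / total) for n, p in zip(numbers, pmf)]
```

The renormalization matters because truncating the tails would otherwise leave the probabilities summing to slightly less than one.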
getSqDevClustObs takes a list of observation traces and a list of cluster mean traces and returns a table whose element in position (i, j) is the mean square deviation of observation j from the mean trace of cluster i.
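The table described for getSqDevClustObs can be sketched in Python as follows (an illustration, not the authors' code). The helper mean_sq_dev and both function names are hypothetical; the mean is taken over the time points of a trace.

```python
def mean_sq_dev(trace_a, trace_b):
    """Mean over time points of the squared difference of two traces."""
    return sum((a - b) ** 2 for a, b in zip(trace_a, trace_b)) / len(trace_a)


def get_sq_dev_clust_obs(traces, clust_means):
    """Row i, column j: deviation of observation j from cluster mean i."""
    return [[mean_sq_dev(obs, mean) for obs in traces]
            for mean in clust_means]
```

Note the orientation: rows are clusters and columns are observations, the transpose of a table indexed first by observation.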
getSqDevClustClust returns a table whose element in position (i, j) is the average of the mean square deviations of the traces in cluster i from the mean trace of cluster j.
getObjFtn takes a list of each cluster’s square deviation from its mean, the constant relating the means and probabilistic components of the objective function (which can be a scalar or a list that takes on a different value for each cluster), and the log-likelihood of the Poisson distribution (with the combinatorial term included). It returns the value of the objective function.
getSqDevInClust takes a list of observations and a mean waveform and totals up the mean square deviations of the observations to the mean. (It returns $\sum_{x_i \in X} \frac{1}{T} \sum_{t=1}^{T} (x_{i,t} - \mu_t)^2$, where $X$ is the list of observations, $T$ is the number of time points, the $x_i$ are the observations, and $\mu$ is the mean.)
getSqDevInClust returns a list that gives the total mean square deviation of each cluster to its mean. The list returned becomes the first argument of getObjFtn.
readTES reads in the data from a set of files specified in the options, returning dat, a table whose rows give the values of particular waveforms at regular time intervals. readTES assumes that several different datasets may share a directory and that each dataset is split over some number of files, each of which consists of an equal number of unsigned 16-bit integers concatenated together in string form. iNPhoton gives the numeric label of the dataset to read (it is not equal to the mean photon number of the dataset). iDataSet lists the indices of the particular files in the dataset that should be read in. fileInfo is a list of three things: fileNamePart, a list of two strings that combine with the dataset label and the file index to create the whole filename; nSamplePerTrace, the number of time points in each waveform; and nTracePerFile, the number of waveforms in each file of the dataset.
For example, if iNPhoton were 7, iDataSet were , and fileInfo were , then we would be looking for the files , , and in , with 200 time points in each waveform and 512 waveforms in each of the three files.
For each file, we import the data as a list of samples, organize the samples into sublists (waveforms) of length nSamplePerTrace, and assign the list of waveforms to the appropriate section of dat. When all of the files have been read, dat is full and properly formatted.
We can also filter the data as we read it in. readTES applies a “short” filter to a list, one that omits half of the frequency domain.
This generates a Hanning filter of a given length.
readTESandFilter reads in the dataset, applying a filter to each file’s data before it is stored. This halves the memory consumption of the reading process, since filtration reduces the amount of information to store by half.
outputCreateTabs creates several list structures (global variables) of the appropriate dimensions to accommodate output from the specified number of runs of PIKA. The output is organized into a TabView, with each run receiving a tab that contains a nested view with graphical and textual output. outputNameList stores the names of the labels in the subview (which are the same for all runs), and outputContentTable stores the output content for each run and sublabel. (Content is stored as a list that later becomes a Column.) Each run’s tab also has a space above its subview for other information, which is stored in outputTopList. Finally, there is another tab on the level of the runs with general textual (outputGeneralLog) and graphical (outputGeneralGraphics) information about the data read in.
outputAdd adds some expression to the content list for a particular name and run, creating the name’s content list if it has no content already. A run number of 0 indicates output to the general information tab, and a blank name indicates output to the top space of a run’s tab.
outputShowTabView is called at the end of PIKA, taking the name and content lists and formatting them into a nested TabView.
print takes any number of arguments (args), turns them into strings, concatenates them, and prints the result to the log. printSp prints to an arbitrary section of output, not just the log. iRun and name indicate the run number and tab name, respectively, to which to print.
pika creates a form with a field for each constant and option that needs to be set for PIKA to run. It then uses runFromAssociation to assign the proper values to the proper variables and run PIKA.
makeField creates a rule that, when used in a FormObject, generates a field to set the variable with name name and type type to the entered value, with descriptive text labelPart and default value defaultContents. args takes optional extra arguments that specify additional options for the field.
runFromAssociation takes an association between variable names (strings) and the values that should be assigned to them and assigns the proper value to the symbolic variable associated with each name. It then runs PIKA (using the main function from Section 3) with those global variables set.
[1]  Z. H. Levine, T. Gerrits, A. L. Migdall, D. V. Samarov, B. Calkins, A. E. Lita, and S. W. Nam, “Algorithm for Finding Clusters with a Known Distribution and Its Application to Photon-Number Resolution Using a Superconducting Transition-Edge Sensor,” Journal of the Optical Society of America B, 29(8), 2012 pp. 2066–2073. doi:10.1364/JOSAB.29.002066.
[2]  T. Gerrits, B. Calkins, N. Tomlin, A. E. Lita, A. Migdall, R. Mirin, and S. W. Nam, “Extending Single-Photon Optimized Superconducting Transition Edge Sensors beyond the Single-Photon Counting Regime,” Optics Express, 20(21), 2012 pp. 23798–23810. doi:10.1364/OE.20.023798.
[3]  Z. H. Levine, B. L. Glebov, A. L. Pintar, and A. L. Migdall, “Absolute Calibration of a Variable Attenuator Using Few-Photon Pulses,” Optics Express, 23(12), 2015 pp. 16372–16382. doi:10.1364/OE.23.016372.
[4]  Z. H. Levine, B. L. Glebov, A. L. Migdall, T. Gerrits, B. Calkins, A. E. Lita, and S. W. Nam, “Photon-Number Uncertainty in a Superconducting Transition Edge Sensor beyond Resolved-Photon-Number Determination,” Journal of the Optical Society of America B, 31(10), 2014 pp. B20–B24. doi:10.1364/JOSAB.31.000B20.
[5]  P. E. Black, “Greedy Algorithm,” in Dictionary of Algorithms and Data Structures (V. Pieterse and P. E. Black, eds.), Feb 2, 2005. www.nist.gov/dads/HTML/greedyalgo.html.
[6]  S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi, “Optimization by Simulated Annealing,” Science, 220, 1983 pp. 671–680. doi:10.1126/science.220.4598.671.
B. P. M. Morris and Z. H. Levine, “The Poisson-Influenced K-Means Algorithm,” The Mathematica Journal, 2016. dx.doi.org/doi:10.3888/tmj.18-3.
www.mathematicajournal.com/data/uploads/2016/03/TES_E001916.daq00.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001916.daq01.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001916.daq02.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001916.daq03.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001919.daq00.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001919.daq01.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001919.daq02.zip
www.mathematicajournal.com/data/uploads/2016/03/TES_E001919.daq03.zip
Brian P. M. Morris is a senior in the Science, Mathematics, and Computer Science Magnet Program at Montgomery Blair High School in Silver Spring, Maryland, was in the Summer High School Intern Program at the National Institute of Standards and Technology, and is a 2016 Intel Science Talent Search Semifinalist.
Zachary H. Levine is a physicist at the National Institute of Standards and Technology in the Infrared Technology Group and is a Fellow of the American Physical Society. He is a graduate of MIT and the University of Pennsylvania.
Brian P. M. Morris
Montgomery Blair High School
Silver Spring, Maryland 20901
Zachary H. Levine
National Institute of Standards and Technology
Gaithersburg, Maryland 20899-8441
zlevine@nist.gov
Picard’s iteration is used to find the analytical solutions of some Abel–Volterra equations. For many equations, the integrals involved in Picard’s iteration cannot be evaluated. The author approximates the solutions of those equations employing a semi-implicit product midpoint rule. The Aitken extrapolation is used to accelerate the convergence of both methods. A blowup phenomena model, a wave propagation, and a superfluidity equation are solved to show the practicality of the methods. Programs offered solve the general equations. A user needs only to enter the particular forcing and kernel functions, constants, and a step size for a particular problem.
Abel–Volterra equations are normally represented by
(1) 
Equation (1) is called regular if α = 0 and weakly singular (or of Abel type) if 0 < α < 1. Equation (1) is linear if the kernel is linear in the unknown function; otherwise, it is nonlinear. In most practical applications, α is either 0 or 1/2. See [1–4] for conditions on existence, uniqueness, and continuity of a solution for equation (1). To solve equation (1) analytically, one normally employs the Picard method, a method of successive iterations, given by
(2) 
(3) 
From a theoretical point of view, the successive iterations given by equations (2) and (3) always converge for linear equations; see theorem 10.15, page 152 of [5] and pages 92–95 of [4]; see also solutions of some integral equations by the Picard method in [2]. In practice the convergence of successive iterations depends on the computability of the corresponding integrals in equation (3). One might be able to evaluate the integrals in some cases. Successive iterations may also be effective for some nonlinear equations. In what follows, we first introduce a simple program that implements the successive iterations and solve two examples using this program. For many integral equations whose exact solutions cannot be found by the Picard method, we approximate their analytical solution using a semi-implicit product midpoint rule. Two practical examples are solved to test the validity of this numerical approach.
The following program implements equations (2) and (3).
One needs only to introduce the kernel Ker, the forcing function g, and the real value α to solve the corresponding equation.
Example 1
To solve the equation
(4) 
note that
(5) 
where is the well-known beta function defined by
(6) 
and is the gamma function.
Then it is easy to verify that the outputs produced by the Picard method for this example can be written as
(7) 
where , , , , , and .
Now substitute the corresponding kernel, forcing function, and the value of . The evaluation takes a minute or two.
Aitken’s method accelerates the convergence [6], but accelerators like this one are highly unstable numerically. One has to pay careful attention (in particular for division by zero) when working with such accelerators; it is important to use high precision. The following is a program to accelerate Picard’s iteration using the already developed PicardIteration program.
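For reference, Aitken's delta-squared formula applied to a scalar sequence can be sketched in Python as follows. This is an illustration, not the article's Mathematica program, which accelerates the Picard iterates themselves; here the input is simply a list of numbers, such as the values of successive iterates at one fixed point. When the second difference vanishes, the sketch falls back to the latest term instead of dividing by zero, which is one possible guard among several.

```python
def aitken(seq):
    """Aitken delta-squared extrapolation of a scalar sequence.

    Each consecutive triple (x0, x1, x2) gives
    x0 - (x1 - x0)**2 / (x2 - 2*x1 + x0).
    """
    accelerated = []
    for x0, x1, x2 in zip(seq, seq[1:], seq[2:]):
        denom = (x2 - x1) - (x1 - x0)
        if denom == 0:
            # Guard against division by zero: keep the latest term.
            accelerated.append(x2)
        else:
            accelerated.append(x0 - (x1 - x0) ** 2 / denom)
    return accelerated
```

For a sequence that converges geometrically, such as the partial sums 1, 1.5, 1.75, 1.875 of a geometric series with limit 2, the extrapolated values equal the limit.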
The analytic solution of is (see [3]). Use that to compare with the result of the Aitken accelerator.
Define res1 and plot the error.
Equation (7) can be used to find , for very large values of . This equation was obtained as a result of analyzing the corresponding Picard iteration. The same type of analysis will be used for our next nonlinear example.
Example 2
We solve a nonlinear blowup phenomena model,
(8) 
Equation (8) is mentioned as a blowup phenomena model on p. 417 of [7]. The analytical solution exhibits blowup at finite time. A blowup means there is a finite time such that
(9) 
Our goal is to find the value of as accurately as possible.
Implementing a few steps of Picard’s iteration to equation (8) shows that the solution is of the form
(10) 
which implies that
(11) 
Make a change of variable from to in equation (8) and replace and using equations (10) and (11), respectively, on both sides of equation (8). We obtain
(12) 
where
(13) 
Equating coefficients corresponding to equal powers of from both sides, we get
(14) 
Note that
(15) 
and that is a power series in . Therefore, the radius of convergence for given by equation (10) is , which is the blowup number in equation (9).
All terms in equation (10) are positive, and the series is wildly divergent beyond 0.897677. For a series with all positive terms, the nonlinear accelerators are not useful in evaluating beyond the radius of convergence. For an alternating series, the situation is different. Nonlinear extrapolations are normally quite effective in evaluating the sums of alternating divergent series for variable values far beyond the radius of convergence.
We evaluate
(16) 
for large values of , where are given by equation (14).
For the following tables, the execution time is provided for cases and .
The following program is for the accelerated Picard method for example 2. All terms in equation (16) are positive, and the acceleration is not as helpful as in example 1, where the series was alternating.
For most integral equations, Picard iteration is not practical, and we approximate the solution of equation (1) using a semi-implicit product midpoint rule. To understand the method, we start by subdividing the interval of integration into equal subintervals using a step size . Equation (1) becomes
(17) 
note that . For , we only have the subinterval . Let (the middle variable of the kernel function ) be the midpoint of the interval ; that is, . Let (the third variable of ) be the midpoint of and ; that is, , and recall that . The denominator of the integral that contains the singularity stays as is, so
(18) 
We solve equation (17) for using Mathematica’s built-in function FindRoot with an initial guess of . The integral that contains the singularity is evaluated exactly and therefore does not introduce any inaccuracy in the method, which is why the term “product rule” is added to the name of this technique; see [8] for more details on product integration. Also, at each step only the very last variable (in this case ) needs to be found, which is why the name “semi-implicit” is used. We also use the midpoints of the intervals; hence the term “a semi-implicit product midpoint rule.” Now let to get
(19) 
Or
(20) 
Then equation (19) is solved to find an approximation for using FindRoot with an initial guess of from the previous step. Continuing the same procedure, we arrive at the following program.
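The procedure can be sketched in Python as follows. This is not the author's Mathematica program. It assumes that equation (1) has the standard form y(t) = g(t) + ∫ K(t, s, y(s)) / (t − s)^α ds over 0 ≤ s ≤ t with 0 ≤ α < 1, that the singular factor is integrated exactly over each subinterval, that the kernel is evaluated at the midpoint in s and at the average of the two neighboring y values, and that y at t = 0 equals g(0). A secant iteration (the hypothetical helper solve_scalar) stands in for FindRoot, started from the value at the previous step.

```python
import math


def solve_scalar(f, x0, tol=1e-13, max_iter=60):
    """Secant iteration for f(x) = 0 (a stand-in for FindRoot)."""
    x1 = x0 + 1e-3
    f0, f1 = f(x0), f(x1)
    for _ in range(max_iter):
        if abs(f1) <= tol or f1 == f0:
            break
        x0, x1, f0 = x1, x1 - f1 * (x1 - x0) / (f1 - f0), f1
        f1 = f(x1)
    return x1


def product_midpoint(kernel, g, alpha, h, n_steps):
    """Semi-implicit product midpoint rule on the grid t_i = i*h.

    kernel(t, s, y) and g(t) define the assumed equation; returns (t, y).
    """
    t = [i * h for i in range(n_steps + 1)]
    y = [g(0.0)]
    for n in range(1, n_steps + 1):
        def weight(j):
            # Exact integral of (t_n - s)**(-alpha) over [t_j, t_{j+1}].
            return ((t[n] - t[j]) ** (1 - alpha)
                    - (t[n] - t[j + 1]) ** (1 - alpha)) / (1 - alpha)

        # Contribution of all subintervals whose y values are already known.
        known = g(t[n]) + sum(
            weight(j) * kernel(t[n], t[j] + h / 2, (y[j] + y[j + 1]) / 2)
            for j in range(n - 1))
        w_last = weight(n - 1)
        s_last = t[n - 1] + h / 2

        def residual(v):
            # Only the newest value v = y_n is unknown (semi-implicit).
            return v - known - w_last * kernel(t[n], s_last, (y[n - 1] + v) / 2)

        y.append(solve_scalar(residual, y[n - 1]))
    return t, y
```

As a check on the sketch, the regular equation with g = 1, K(t, s, y) = y, and α = 0 has the solution exp(t), and the weakly singular equation with g = 1, K(t, s, y) = −y, and α = 1/2 has the known solution exp(πt) erfc(√(πt)). The cost grows quadratically with the number of steps because every step revisits all earlier subintervals.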
Example 3
The nonlinear integral equation
(21) 
arises in the theory of superfluidity [9].
We use the semi-implicit product midpoint rule program to approximate the solution of equation (1).
Graph of superfluidity equation with no acceleration and a step size of 0.01 and .
Graph of superfluidity equation with acceleration and a step size of 0.01.
This graph shows the difference between the accelerated and nonaccelerated methods for the superfluidity equation. The difference is of order , which shows that iterating the extrapolation once more is not sensible. The execution time goes up exponentially, and not much is gained in accuracy.
For our next example, we were able to find the corresponding analytical solution. Therefore, we can demonstrate the efficiency of our method by comparing the numerical solution with the exact solution.
Before solving example 4, we prove the following theorem for a class of inhomogeneous equations to find its analytic solution.
Theorem 1
(22) 
(23) 
Proof
(24) 
(25) 
(26) 
(27) 
We now use Theorem 1 to find the exact solution of wave propagation for example 4 and test the efficiency of our midpoint rule and its extrapolated version.
Example 4
The equation
(28) 
represents wave propagation over a flat surface (see p. 229 and p. 235, exercise 3 of [2]). Using Theorem 1, the exact solution of equation (28) can be obtained as
(29) 
Example 4 is the three-dimensional case. Therefore, the program for the general midpoint rule given for equation (1) must be adjusted a little.
This plots the error in applying the midpoint rule with a step size ; for and , with , the maximum absolute error is of order 0.001.
This plots the error in applying the Aitken extrapolation with a step size ; for and , , the maximum absolute error is of order .
We studied two methods: Picard’s iteration and a semi-implicit product midpoint rule. The Aitken extrapolation was used to accelerate the convergence of these methods. In some cases, the extrapolation demonstrated a significant improvement. A blowup phenomena model, a wave propagation, and a superfluidity equation were solved, and the efficiency and the practicality of the methods were established. The user-friendly programs created here solve the general equations. One only needs to enter a forcing function, a kernel function, the value of α, and a step size for a particular problem. We used Mathematica 10.2 on the Mac OS X operating system with 16 GB RAM and a 2.8 GHz processor. The execution time was always under five minutes, and the vast majority of problems were executed in less than five seconds.
I am grateful for constructive suggestions by a reviewer, resulting in more transparent coding.
The author dedicates his work to Mahshid, Arman, and Ida; wife, son, and daughter.
[1]  W. Hackbusch, Integral Equations: Theory and Numerical Treatment, Boston: Birkhäuser Verlag, 1995. 
[2]  R. P. Kanwal, Linear Integral Equations, 2nd ed., Boston: Birkhäuser, 1997. 
[3]  R. K. Miller, Nonlinear Volterra Integral Equations, Menlo Park, CA: W. A. Benjamin, Inc., 1971. 
[4]  A. Pipkin, A Course on Integral Equations, New York: Springer-Verlag, 1991.
[5]  R. Kress, Linear Integral Equations, Berlin: Springer-Verlag, 1989.
[6]  J. Stoer and R. Bulirsch, Introduction to Numerical Analysis, 2nd ed., New York: Springer, 1996. 
[7]  H. Brunner, Collocation Methods for Volterra and Related Functional Equations, Cambridge: Cambridge University Press, 2004. 
[8]  P. Linz, Analytical and Numerical Methods for Volterra Equations, Philadelphia: SIAM, 1985. 
[9]  N. Levinson, “A Nonlinear Volterra Equation Arising in the Theory of Superfluidity,” Journal of Mathematical Analysis and Applications, 1(1), 1960 pp. 1–11. doi:10.1016/0022-247X(60)90028-7.
J. Abdalkhani, “Exact and Approximate Solutions of the Abel–Volterra Equations,” The Mathematica Journal, 2016. dx.doi.org/doi:10.3888/tmj.18-2.
Javad Abdalkhani is an associate professor of mathematics at the Ohio State University, Lima campus and a Distinguished Alumni teacher at the Ohio State University. His area of research is numerical analysis. His hobbies are reading and cycling.
Javad Abdalkhani
Department of Mathematics
The Ohio State University, Lima
4240 Campus Drive
Lima, Ohio 45804
abdalkhani.1@osu.edu
Biological systems possess traits that are a product of coordinated gene expression, heavily influenced by environmental pressures. This is true of both desirable and undesirable traits (i.e. disease). In cases of unwanted traits, it would be tremendously useful to have a means of systematically lowering the expression of genes implicated in producing the undesirable traits. This article describes the implementation of a computational biology algorithm designed to compute small interfering RNA sequences, capable of systematically silencing the expression of specific genes.
Small interfering RNAs (siRNAs) are short, single-stranded fragments of RNA, typically 20–30 nucleotides in length, that have created tremendous excitement in the biological sciences because of their ability to downregulate the expression of genes in a targeted, systematic manner. siRNAs mediate their knockdown effects by activating a gene expression surveillance mechanism widely conserved in eukaryotic cells, known as RNA interference (RNAi). Once activated by siRNA, the RNAi machinery of the cell suppresses the expression of a specific gene by blocking translation (the synthesis of protein by ribosomes) or by promoting the degradation of mRNA needed to synthesize a protein. Figure 1 highlights the basic interplay between siRNA and RNAi. Further excellent reviews of siRNA and RNAi can be found in [1–3]. Having the ability to purposefully engineer siRNAs to “turn off” (or “turn down”) the expression of genes promoting disease phenotypes like cancer, multiple sclerosis, diabetes, inflammation, etc. is of immediate and widespread interest to scientists, clinicians, and patients living with the burden of disease. Indeed, several clinical trials are currently evaluating the effectiveness of siRNA therapy for conditions such as transthyretin amyloidosis [4], elevated intraocular pressure [5], and non-small cell lung cancer [6].
Figure 1. (A) A genetically modified retroviral vector is used to insert DNA instructions into the nucleus of a eukaryotic cell. (B) The cell uses the inserted DNA instructions to manufacture short hairpin RNA (shRNA). (C) Following enzymatic trimming in the nucleus, the shRNA is exported from the nucleus. (D) An enzyme named Dicer removes the loop from the shRNA hairpin, releasing a double-stranded version of siRNA. (E) A complex of proteins named RISC (short for RNA-induced silencing complex) enzymatically removes the passenger strand of the siRNA duplex. (F) The mature single-stranded siRNA binds to its target, a specific mRNA from a gene being expressed by the cell. (G) The presence of siRNA attached to the mRNA interferes with translation (protein production), effectively short-circuiting the cell’s intention to fully express the information contained in its DNA.
This article describes a Mathematica implementation (named siRNA Silencer) of the seminal algorithm created by Naito et al. [7] to design siRNA sequences for the express purpose of silencing a specific gene. Like Naito’s algorithm, siRNA Silencer (SRS for short) uses “rules” of effective siRNA design that have been elucidated through meticulous experimentation [8]. To generate viable siRNA candidates, SRS progressively works through the four steps described in Table 1.
These 23-mer sequences correspond to the complementary sequence of 21-nucleotide guide strand and 2-nucleotide 3′ overhang of the passenger strand within the target sequence (Figures 1 and 2).
Step 2 (select functional siRNAs)
Rule C: Choose 23-mers that have at least 4 A or U residues in the 5′ terminal 7 base pairs of the guide strand.
Step 3 (reduce seed-dependent off-target effects)
Calculate the melting temperature (Tm) of the seed-target duplex (i.e. nucleotide pairs 2–8 of the siRNA guide and passenger strands as bound to their target sequence) for the 23-mers that survive Step 2. Accept only those sequences with a Tm less than 21.5 °C.
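As an illustration of the candidate enumeration and of Rule C, a Python sketch follows. It is not the SRS code. It assumes that every 23-nucleotide window of the transcript is a candidate and, reading the description above, that the guide strand is the reverse complement of positions 1–21 of the 23-mer target, so that its 5′ terminal 7 nucleotides pair with target positions 15–21. The function names are hypothetical, and the sketch accepts only the letters A, C, G, and U (T is converted to U).

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def candidate_23mers(transcript):
    """All 23-nucleotide windows with their 1-based start positions."""
    seq = transcript.upper().replace("T", "U")
    return [(i + 1, seq[i:i + 23]) for i in range(len(seq) - 22)]


def guide_strand(target_23mer):
    """Assumed guide: reverse complement of target positions 1-21."""
    return "".join(COMPLEMENT[base] for base in reversed(target_23mer[:21]))


def satisfies_rule_c(target_23mer):
    """At least 4 A or U residues in the 5' terminal 7 nt of the guide."""
    return sum(base in "AU" for base in guide_strand(target_23mer)[:7]) >= 4
```

A transcript of length L yields L − 22 windows, so an initial list of 818 candidates corresponds to a sequence of 840 nucleotides.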
As an example of how the steps in Table 1 are applied, Figure 2 presents a detailed illustration of some of the output generated by SRS when it is tasked with silencing the human gene IFNβ1. (Interferon beta 1 is a powerful antiviral protein, manufactured from instructions in the IFNβ1 gene [9].)
Figure 2. One of 23 siRNA candidates generated by siRNA Silencer to silence IFNβ1 expression is presented in detail here. The middle strip of letters contains a portion of the mRNA sequence manufactured from instructions in the human IFNβ1 gene. The black box surrounds nucleotides 591 to 613 of the mRNA target, which is precisely 23 nucleotides in length, corresponding to Step 1 above. The sequence outlined in red is the guide strand of RNA that will eventually bind to the target mRNA and block protein synthesis (Figure 1). Notice how the guide sequence satisfies rules A, C, and D in Step 2 above. The passenger sequence (blue outline) satisfies rules B and D in Step 2. The predicted melting temperatures (details described below) of the guide-target seed sequence (green bar) and the passenger-target seed sequence (purple bar) are 8.98 °C and 13.3 °C, respectively, satisfying Step 3 above. Step 4 output will be discussed below.
As the end user interprets the output of siRNA Silencer and ultimately makes a decision about which presented siRNA candidate to use, it becomes a (relatively) simple cloning project to engineer the guide and passenger sequences into an expression vector (Figure 1) to silence the gene in a living cell.
SRS is template driven, meaning the algorithm expects several pieces of user-defined information to be provided in a notebook cell that is used as a template for entering information. The features of SRS will be illustrated using the pre-mRNA sequence of the human gene IFNβ1 (interferon beta 1) to design siRNA candidates capable of silencing IFNβ1.
The example above contains several variables that must be completed by the user to let SRS do its job. The reader may modify them to run the code.
The variables requiring user input are:
1. refseqlocation: This variable holds the directory location for finding pre-parsed RNA data files from specific organisms of interest. The pre-parsed RNA data files were generated by downloading transcript source files from NCBI’s Map Viewer FTP (ftp://ftp.ncbi.nih.gov/genomes/MapView). The data contained in these files is part of NCBI’s RefSeq database, which maintains a nonredundant catalog of all known biological molecules in a species-specific manner [10]. Original source files include:
Once downloaded, the source files were parsed by custom Mathematica code (not shown here) to generate expressions containing transcript accession values, annotation information, and sequence information for the genes contained in the source files. These expressions were saved and serve as the primary data source for the SRS program.
2. refdatabaseversion: This variable contains the name of the specific organism from which a gene is to be silenced. Options available include: "humantranscripts", "mousetranscripts", "rattranscripts", "dogtranscripts", and "cattranscripts". The option chosen must correspond to the name of a file present in the directory location defined above.
3. savelocationroot: This variable holds the location where the user would like the final results of the analysis to be saved.
4. studyname: This variable lets the user name the output generated by SRS. The output of SRS is saved using this name to the location provided in "savelocationroot" above.
5. query: This variable contains the transcript sequence, in string format, of the gene to be silenced.
SRS starts by loading the pre-parsed RNA data file from which the query gene sequence is to be silenced and then proceeds to create a preliminary list of siRNA candidates that are filtered through the four steps referenced in Table 1.
For the specific gene highlighted here (IFNβ1), successive pruning of the initial list of 818 23-nucleotide-long sequences is highlighted by a decline in the length of variables containing the pruning results. The variables highlighted below contain the results of completing Step 1, Step 2 (Rules A and B), and Step 2 (Rules C and D).
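The initial enumeration can be pictured as a sliding window over the query. The following is a minimal Python sketch, not the Mathematica code used by SRS; the function name `enumerate_windows` is hypothetical, and it assumes that every 23-nucleotide window is kept at this stage.

```python
def enumerate_windows(transcript, width=23):
    """Return (start, end, subsequence) for every window of `width` nucleotides.

    Positions are 1-based and inclusive, matching labels such as 591-613.
    """
    seq = transcript.upper().replace("T", "U")  # work in the RNA alphabet
    return [
        (i + 1, i + width, seq[i:i + width])
        for i in range(len(seq) - width + 1)
    ]


# A query of n nucleotides has n - 22 windows, so 840 nucleotides give 818.
toy_query = "ACGU" * 210  # toy sequence, not a real transcript
windows = enumerate_windows(toy_query)
print(len(windows))  # 818
print(windows[0])    # (1, 23, 'ACGUACGUACGUACGUACGUACG')
```

Under that assumption, an initial list of 818 windows corresponds to a query of 840 nucleotides.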
A small portion of initial siRNA candidates is shown here, where each row contains a guide and passenger sequence (Figure 2) generated from IFNβ1 that may, if they pass additional requirements, prove useful in silencing the expression of IFNβ1.
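How a guide and passenger pair might be read off a 23-nucleotide target can be sketched in Python as follows. This is an illustration under stated assumptions, not the SRS implementation: it assumes the usual siRNA geometry of a 19-base-pair duplex with 2-nucleotide 3′ overhangs, a guide fully complementary to the target, and a passenger identical to the target. The function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Reverse complement of an RNA string written 5' to 3'."""
    return rna.translate(COMPLEMENT)[::-1]


def strands_from_target(target23):
    """Derive 21-nt guide and passenger strands from a 23-nt target.

    Assumed geometry: the guide pairs with target positions 1-21 and the
    passenger copies target positions 3-23, so the two strands share a
    19-bp duplex and each carries a 2-nt 3' overhang.
    """
    if len(target23) != 23:
        raise ValueError("expected a 23-nucleotide target")
    guide = reverse_complement(target23[:21])
    passenger = target23[2:]
    return guide, passenger


def seed(strand):
    """Nucleotides 2-8 of a strand, counted from its 5' end."""
    return strand[1:8]


target = "AAACCCGGGUUUACGUACGUACG"  # toy 23-mer, not taken from IFNB1
guide, passenger = strands_from_target(target)
print(guide)      # UACGUACGUAAACCCGGGUUU
print(passenger)  # ACCCGGGUUUACGUACGUACG
print(seed(guide), seed(passenger))
```

With these offsets, the first 19 nucleotides of the guide are the reverse complement of the first 19 nucleotides of the passenger, which is what lets the two strands form a duplex.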
The algorithm moves next to complete Step 3, in which the predicted melting temperatures of the guide and passenger seed sequences (Figure 2) are calculated and screened to accept only those sequences with predicted melting temperatures below 21.5 °C. Melting temperature is a reflection of thermodynamic stability, and for the purposes of gene silencing, experiments have suggested melting temperatures below 21.5 °C are optimal [11]. If no candidate siRNAs can meet that demand, the requirement is waived, but the algorithm will print a warning message that the requirement could not be met.
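The screening logic of Step 3, including the fallback when nothing passes, can be sketched in Python. This is not the SRS code: `screen_by_seed_tm` and `seed_tm` are hypothetical names, the melting-temperature predictor is supplied by the caller because the thermodynamic model is not reproduced here, and requiring both strands to fall below the cutoff is an assumption.

```python
import warnings


def screen_by_seed_tm(candidates, seed_tm, cutoff=21.5):
    """Keep (guide, passenger) pairs whose seed Tm values are both below cutoff.

    `seed_tm` is any callable mapping a strand to a predicted seed-target
    melting temperature in degrees Celsius. If no pair passes, the
    requirement is waived: a warning is issued and all pairs are returned.
    """
    kept = [
        (guide, passenger)
        for guide, passenger in candidates
        if seed_tm(guide) < cutoff and seed_tm(passenger) < cutoff
    ]
    if not kept:
        warnings.warn("No candidate meets the seed Tm cutoff; requirement waived.")
        return list(candidates)
    return kept


# Toy lookup standing in for a real Tm predictor (values are illustrative).
toy_tm = {"guide1": 8.98, "pass1": 13.3, "guide2": 30.0, "pass2": 10.0}.get
pairs = [("guide1", "pass1"), ("guide2", "pass2")]
print(screen_by_seed_tm(pairs, toy_tm))  # [('guide1', 'pass1')]
```

Passing the predictor in as an argument keeps the filter independent of whichever thermodynamic parameters are used to estimate Tm.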
Inspection of the first five candidates that survived Step 3 screening reveals that none of them match the first five candidates that survived Step 2 screening. This means that none of the first five candidates from Step 2 screening had melting temperatures below 21.5 °C, causing SRS to remove them from further consideration.
Indeed, there is considerable reduction in the size of the candidate list from Step 2 to Step 3.
After the candidate list of potential siRNAs is pruned through Step 3, the algorithm presents a visual representation of the current list of siRNA candidates as they are mapped to their respective positions within the query sequence.
For publication purposes, the next output is reduced in size to allow the data to fit within printable margins.
Figure 3. A map of the siRNA candidates (shown in red) that have passed the first three steps of Table 1, aligned to their positions within the target gene sequence (shown in black) to be silenced. The example here shows siRNA candidates suggested for silencing IFNβ1.
From here, SRS begins an exhaustive search for genes within <refdatabaseversion> that contain subsequences that match the siRNA candidates graphically presented above. In our current example, SRS searches for alternative human genes containing similar subsequences to which the siRNA candidates generated from IFNβ1 may also bind. The results of this search are presented for inspection so the user can decide which of the potential siRNA candidates is least likely to create undesirable effects if used to silence the gene of interest. Making this decision is largely based on human experience, and is an area that highlights the “art of doing science.”
Caution: Due to the sheer volume of computation that needs to be completed using the data described in this article, the next segment of code will likely take 20–40 minutes to complete (depending on the speed of your computer) and consume roughly 24 GB of RAM. Computations on machines with less RAM will finish, but will require significant use of the hard drive, slowing computation considerably.
For publication purposes, the next output is reduced in size to allow the data to fit within printable margins.
SRS then finishes by saving its results to the location specified by the user in the template described above. SRS will output a JPEG version of Figure 3, as well as Mathematica and Microsoft Excel versions of the data in Table 2. The Mathematica version of Table 2 is complete, including all the potential off-targets that were found. The Excel version of Table 2 lacks the data of the potential off-targets because there is no reasonable way to organize such a large volume of information within Excel.
Table 2 provides summary information about the siRNA candidates the algorithm is proposing as effective mediators of gene silencing. In the example of this article, the IFNβ1 gene generates 23 siRNA candidates that should be capable of effectively reducing the expression of the IFNβ1 gene; all 23 candidates have passed the four steps of Table 1. It is left to the end user to make the final decision about which siRNA sequence to use. While much is known about RNA interference and the use of siRNAs to downregulate the expression of a gene, there are still aspects that benefit from human experience and intuition, which is why Table 2 presents information that may be contextually important to the researcher.
The first column of Table 2 conveys the positions of the siRNA candidates in relation to the target gene sequence. This corresponds to the graphical information presented in Figure 3. These 23-mer candidates represent subregions of the gene’s transcript that are likely to be suitable targets for the guide RNA (in combination with RISC, Figure 1) to actually silence a gene. While the position of the siRNA candidate may be important to effect gene silencing, there are no universally accepted rules to develop algorithmic prediction based on position alone, which is one reason the final decision about which siRNA candidate to use is best left to the end user. Maximal suppression of gene expression by the siRNA candidate list will likely require empirical comparison of several siRNA candidates.
The second column of Table 2 provides 23-mer sequences (written 5′ to 3′) specific to the gene of interest to be silenced. The black box of Figure 2 highlights one example of a 23-mer from within IFNβ1. Each 23-mer represents a region of the mRNA transcript to which a guide sequence of RNA (i.e., the guide strand, Figure 1) can bind and mediate repression of protein synthesis.
The third column of Table 2 reports the 21-nucleotide guide and passenger sequences (Figures 1 and 2) that are found originally in the small hairpin RNA (shRNA), itself manufactured from the vector referenced in Figure 1. DNA versions of these RNA sequences can be subsequently used to construct vectors for permanent or transient repression of a specific gene. As this sounds more complicated than it actually is, Figure 4 provides a graphical description of how to take the information from Table 2 and create a vector capable of using the information provided by SRS to silence a gene.
Figure 4. Basic workflow involved in using the output of SRS to silence expression of IFNβ1. Here, the siRNA candidate at position 591–613 of the target gene (referenced in Figure 2 and Table 2) is used to compute the passenger and guide strand RNA sequences that a cell would need to manufacture from instructions contained in a suitable expression vector (Figure 1). The sequences outlined in red refer to the passenger and guide strand RNAs, respectively, while the sequence outlined in green represents one of many possible loop structures that could be designed to ensure the expression vector contains the necessary information for a cell to manufacture short hairpin RNA (shRNA, Figure 1). Where indicated, “Column 2” and “Column 3” refer to the columns of Table 2 that contain information displayed here.
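The assembly step can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not a cloning protocol and not part of SRS: the function name `shrna_template` is hypothetical, the passenger, loop, guide order is assumed, the loop shown is only a placeholder, and vector-specific elements such as promoter, terminator, and restriction-site overhangs are left out.

```python
def to_dna(rna):
    """DNA version of an RNA sequence (U replaced by T)."""
    return rna.upper().replace("U", "T")


def shrna_template(passenger_rna, guide_rna, loop_dna):
    """Concatenate passenger, loop and guide as one DNA template, 5' to 3'.

    The order passenger-loop-guide is an assumption; the loop sequence must
    be chosen to suit the expression vector actually used.
    """
    return to_dna(passenger_rna) + loop_dna.upper() + to_dna(guide_rna)


# Toy strands and a placeholder loop; substitute real values from Table 2.
passenger = "ACCCGGGUUUACGUACGUACG"
guide = "UACGUACGUAAACCCGGGUUU"
print(shrna_template(passenger, guide, "TTCAAGAGA"))
```

Because the guide is the reverse complement of the passenger over the paired region, a transcript made from such a template can fold back on itself at the loop to form the hairpin.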
The fourth and fifth columns of Table 2 report the melting temperatures (Tm) for the seed-target duplexes (Figure 2) of both the guide and passenger RNA sequences. The lower the Tm value, the less thermodynamically stable the seed-target duplex is, and weak seed pairing is associated with reduced seed-dependent off-target silencing [11]. A good rule of thumb for selecting siRNA candidates, which is computationally enforced by SRS, is to use candidates that have seed-target melting temperatures below 21.5 °C. All things considered, the lower the Tm, the better.
The final two columns of Table 2 (columns 6 and 7) contain checkboxes that will, when selected, reveal the results of SRS’s attempt to find similar genes that may be inadvertently silenced by the siRNA candidates under consideration. Checkboxes are used because the amount of information to be displayed can be considerable. When a checkbox is selected, a pop-up window reveals the data that was discovered. Figure 5 shows a typical result displayed when selecting a checkbox in Table 2. SRS will discard any similar gene sequences with five or more differences, as the possibility of those sequences being repressed by the siRNA sequence is virtually zero.
Figure 5. Example output produced when the first checkbox contained in the third row of Table 2 is selected. The first column of output displays sequence alignments between the siRNA candidate chosen (in this case the 23-mer sequence positioned between nucleotides 68–90 of IFNβ1) and other gene sequences contained within the RefSeq database used by SRS. Mismatch positions are highlighted in green. The second column contains annotation information about the genes that are aligned to the siRNA candidate. As it is possible that siRNA candidates with a limited number of mismatches to other gene sequences could inappropriately silence these other genes, SRS provides critical information to allow the end user to choose the siRNA candidate that best fits their needs.
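The mismatch cutoff can be illustrated with a brute-force scan in Python. This is only a sketch of the idea, not the search SRS performs: the function names are hypothetical, the transcript dictionary is toy data, and gaps are not considered, so a difference here means a substitution at an aligned position.

```python
def mismatches(a, b):
    """Number of aligned positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_near_matches(query, transcripts, max_mismatches=4):
    """Scan every transcript for windows that differ from `query` in at most
    `max_mismatches` positions (that is, fewer than five by default).

    Returns (transcript_id, 1-based start, window, mismatch_count) tuples.
    """
    hits = []
    k = len(query)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            window = seq[i:i + k]
            d = mismatches(query, window)
            if d <= max_mismatches:
                hits.append((name, i + 1, window, d))
    return hits


# Toy database; a real run would use the parsed RefSeq transcripts.
toy_db = {"geneA": "GGGACGUACGUAGGG", "geneB": "CCCCCCCCCCCC"}
print(find_near_matches("ACGUACGUA", toy_db, max_mismatches=0))
# [('geneA', 4, 'ACGUACGUA', 0)]
```

A scan of this kind touches every window of every transcript for every candidate, which gives some intuition for why the search over a whole transcript database is the most expensive part of the run.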
To gauge the performance of SRS, two genes were chosen randomly from each of the five species the program can accommodate, and each of the genes was used as an input query to generate a list of siRNA candidates (Table 3). Timings came from running Mathematica 10 under Windows 7 (64-bit) using an Intel 2500K processor overclocked to 4.48 GHz. Total system memory is 32 GB. All reported timings use a fresh kernel.
As biologists continue to develop better methods to identify causal genes driving changes in phenotype, facile methods of altering the expression of those genes are necessary to interrupt undesirable phenotypes, most notably those of disease. SRS brings to the Mathematica community a contemporary algorithm enabling the end user to quickly design small interfering RNAs capable of suppressing a gene’s output, while simultaneously minimizing off-targeting effects that have challenged earlier attempts at using siRNA as a gene suppression technology.
[1]  E. M. Youngman and J. M. Claycomb, “From Early Lessons to New Frontiers: The Worm as a Treasure Trove of Small RNA Biology,” Frontiers in Genetics, Nov 27, 2014. doi:10.3389/fgene.2014.00416. 
[2]  D. Barbosa Dogini, V. D’Avila Bittencourt Pascoal, S. H. Avansini, A. Schwambach Vieira, T. Campos Pereira, and I. Lopes-Cendes, “The New World of RNAs,” Genetics and Molecular Biology, 37(1 Suppl), 2014 pp. 285–293. www.ncbi.nlm.nih.gov/pmc/articles/PMC3983583.
[3]  K. Gavrilov and W. M. Saltzman, “Therapeutic siRNA: Principles, Challenges, and Strategies,” Yale Journal of Biology and Medicine, 85(2), 2012 pp. 187–200. www.ncbi.nlm.nih.gov/pmc/articles/PMC3375670.
[4]  T. Coelho, D. Adams, A. Silva, P. Lozeron, P. N. Hawkins, T. Mant, J. Perez, J. Chiesa, S. Warrington, E. Tranter, M. Munisamy, R. Falzone, J. Harrop, J. Cehelsky, B. R. Bettencourt, M. Geissler, J. S. Butler, A. Sehgal, R. E. Meyers, Q. Chen, T. Borland, R. M. Hutabarat, V. A. Clausen, R. Alvarez, K. Fitzgerald, C. Gamba-Vitalo, S. V. Nochur, A. K. Vaishnaw, D. W. Y. Sah, J. A. Gollob, and O. B. Suhr, “Safety and Efficacy of RNAi Therapy for Transthyretin Amyloidosis,” New England Journal of Medicine, 369, 2013 pp. 819–829. doi:10.1056/NEJMoa1208760.
[5]  J. Moreno-Montañés, B. Sádaba, V. Ruz, A. Gómez-Guiu, J. Zarranz, M. V. González, C. Pañeda, and A. I. Jimenez, “Phase I Clinical Trial of SYL040012, a Small Interfering RNA Targeting Adrenergic Receptor 2, for Lowering Intraocular Pressure,” Molecular Therapy, 22(1), 2014 pp. 226–232. www.nature.com/mt/journal/v22/n1/full/mt2013217a.html.
[6]  J. A. McCarroll, T. Dwarte, H. Baigude, J. Dang, L. Yang, R. B. Erlich, K. Kimpton, J. Teo, S. M. Sagnella, M. C. Akerfeldt, J. Liu, P. A. Phillips, T. M. Rana, and M. Kavallaris, “Therapeutic Targeting of Polo-Like Kinase 1 Using RNA-Interfering Nanoparticles (iNOPs) for the Treatment of Non-Small Cell Lung Cancer,” Oncotarget, 6(14), 2014 pp. 12020–12034. www.ncbi.nlm.nih.gov/pmc/articles/PMC4494920.
[7]  Y. Naito, J. Yoshimura, S. Morishita, and K. Ui-Tei, “siDirect 2.0: Updated Software for Designing Functional siRNA with Reduced Seed-Dependent Off-Target Effect,” BMC Bioinformatics, 10(392), 2009. doi:10.1186/1471-2105-10-392.
[8]  K. Ui-Tei, Y. Naito, F. Takahashi, T. Haraguchi, H. Ohki-Hamazaki, A. Juni, R. Ueda, and K. Saigo, “Guidelines for the Selection of Highly Effective siRNA Sequences for Mammalian and Chick RNA Interference,” Nucleic Acids Research, 32(3), 2004 pp. 936–948. doi:10.1093/nar/gkh247.
[9]  J. Bekisz, H. Schmeisser, J. Hernandez, N. D. Goldman, and K. C. Zoon, “Human Interferons Alpha, Beta and Omega,” Growth Factors, 22(4), 2004 pp. 243–251. doi:10.1080/08977190400000833.
[10]  K. D. Pruitt, T. Tatusova, G. R. Brown, and D. R. Maglott, “NCBI Reference Sequences (RefSeq): Current Status, New Features and Genome Annotation Policy,” Nucleic Acids Research, 40(D1), 2012 pp. D130–D135. doi:10.1093/nar/gkr1079.
[11]  K. Ui-Tei, Y. Naito, K. Nishi, A. Juni, and K. Saigo, “Thermodynamic Stability and Watson–Crick Base Pairing in the Seed Duplex Are Major Determinants of the Efficiency of the siRNA-Based Off-Target Effect,” Nucleic Acids Research, 36(22), 2008 pp. 7100–7109. doi:10.1093/nar/gkn902.
T. D. Allen, “A Computational Strategy for Effective Gene Silencing through siRNAs,” The Mathematica Journal, 2016. dx.doi.org/doi:10.3888/tmj.18-1.
Todd Allen is an associate professor of biology at HACC, Lancaster. His interest in computational biology using Mathematica took shape during his postdoctoral research years at the University of Maryland, where he developed a custom cDNA microarray chip to study gene expression changes in the chestnut blight pathogen, Cryphonectria parasitica.
Todd D. Allen, Ph.D.
Harrisburg Area Community College (Lancaster Campus)
East 206R
1641 Old Philadelphia Pike
Lancaster, PA 17602
tdallen@hacc.edu