How Do I Write Code that Takes Advantage of Packed Arrays?

To take advantage of packed arrays, the most important thing is to operate on lists of elements all at once, or at least in large groups, whenever possible. The good news is that this programming principle has not really changed since Mathematica 1.0: efficient Mathematica code has always used functional and list-based constructs. A few guidelines of particular importance to packed arrays are listed below.

Try to avoid using individual parts of lists.

Comparing timings of commands with and without packed arrays can be difficult because the times often differ by orders of magnitude, and the timing of the faster command is often imprecise because of system timing granularity. To work around this problem, here is a command that automatically repeats a given command until enough repetitions have accumulated to give a reliable timing value.

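The input cells of this article survive only as images in this conversion, so throughout what follows, sketches of what they plausibly contained are given instead. Here, timeIt stands in for whatever name the article used, and Timing is assumed to return plain seconds (in version 4 the result carried a factor of Second):

    (* Repeat the held expression, doubling the repetition count until
       the total time is measurable, then report the time per evaluation. *)
    SetAttributes[timeIt, HoldAll];
    timeIt[expr_] :=
      Module[{reps = 1, t},
        While[(t = First[Timing[Do[expr, {reps}]]]) < 0.1, reps *= 2];
        t/reps]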

The following evaluations show that accessing parts is at least ten times slower than just using the listability of Sin.

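A plausible reconstruction of the comparison, with a large packed array of machine reals standing in for the original data:

    data = Table[N[i], {i, 100000}];   (* a packed array of reals *)
    timeIt[Table[Sin[data[[i]]], {i, Length[data]}]]
    timeIt[Sin[data]]   (* listability: at least ten times faster *)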

But that is not the whole story. The Table command is automatically compiled, so it is much faster than it would be otherwise. If you put in an extra layer of definition, which is enough to prevent the automatic compilation from being used (because of possible side effects on f that the compiler cannot detect), the timing is much slower.

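For instance, with a definition like the following, the Table above can no longer be auto-compiled:

    f[x_] := Sin[x]
    timeIt[Table[f[data[[i]]], {i, Length[data]}]]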

Something important to keep in mind is that f will work on the packed array directly.

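Presumably something like:

    timeIt[f[data]]   (* f[data] evaluates to Sin[data], which is listable *)
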
Use functional operations that operate on large chunks of your data.

This is really the same as the point above, but it is so important that it is worth listing twice! The functional commands Apply, Array, Fold, Map, and Nest can be automatically compiled for packed array input under certain circumstances. Inserting Map into the previous example is faster than the corresponding command with Table and Part, but it is still much slower than it could be.

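A sketch of the Map version:

    timeIt[Map[f, data]]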

The reason, again, is that the auto-compile mechanism does not work when the function is not given explicitly. When the function is made explicit, as in the next example, the computation is much faster.

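With an explicit (pure) function, Map can be auto-compiled:

    timeIt[Map[Sin[#] &, data]]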

This is still not nearly as fast as simply using the listability property of Sin, which is always the best bet when it can be used.

Try to avoid copying.

This can be quite subtle, but can be very important. The example of LU factorization shown at the end of this article demonstrates the importance of this quite well. Here is a simple example to illustrate the point.

Suppose that we have a 100×100 matrix.

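For illustration, a random packed matrix (the original entries are not preserved):

    mat = Table[Random[], {100}, {100}];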

And suppose there are some parts (positions) that we want to set to zero.

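The original positions are not preserved; a 9×9 grid of positions gives the 81 parts mentioned below:

    positions = Flatten[Table[{i, j}, {i, 10, 90, 10}, {j, 10, 90, 10}], 1];
    Length[positions]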

Here is one way to do it.

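Using the version 4 form ReplacePart[expr, new, positions]:

    timeIt[ReplacePart[mat, 0., positions]]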

ReplacePart makes a new copy of the data with the chosen parts modified. That means to change 81 parts, you have to copy 10000 pieces of data.

Using Part on the left-hand side of an assignment allows you to avoid the copying, and consequently is much faster.

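A sketch of the in-place version:

    timeIt[Scan[(mat[[First[#], Last[#]]] = 0.) &, positions]]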

Be warned, however, that if you need the original data for some reason, it has been destroyed by this method unless you have kept a copy attached to another symbol.

Think carefully about how many things will have to be evaluated.

With packed arrays speeding up many numerical calculations, it becomes more important to think about how things are evaluated. Computing the sum of the first n numbers is a simple example which illustrates this.

Let n be a number large enough so that you can see the timing differences easily.

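For instance:

    n = 1000000;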

Using For, each time through the loop requires at least four separate evaluations.

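A sketch of the For version:

    Timing[sum = 0; For[i = 1, i <= n, i++, sum += i]; sum]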

With Do, only one evaluation is required each time through the loop, since the incrementing and checking of the loop variable are handled internally.

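And the Do version:

    Timing[sum = 0; Do[sum += i, {i, n}]; sum]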

The difference in timings shows the effect of the extra evaluation.

Of course, a better way to do this is to use functional constructs.

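For instance, applying Plus to the whole range at once:

    Timing[Plus @@ Range[n]]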

Or faster yet.

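The discussion of unpacking messages at the end of this article indicates that Tr was used here; Tr operates on the packed array directly:

    Timing[Tr[Range[n]]]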

Or even faster yet.

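Presumably the closed-form sum:

    Timing[n (n + 1)/2]
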
When you cannot possibly follow the first two guidelines, try to use the compiler.

Sometimes it is not possible to avoid accessing individual parts of an expression, for instance when the result being computed depends on previously computed values. Here is an example, which at the same time shows how asymptotically fast multiplication works.

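For illustration, two random ten-digit integers (the sizes used originally are not preserved):

    a = Random[Integer, {10^9, 10^10 - 1}];
    b = Random[Integer, {10^9, 10^10 - 1}];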

To simplify the discussion, I will use radix 10, although the code written below will work for any integer radix greater than 1.

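A sketch of the setup:

    radix = 10;
    da = IntegerDigits[a, radix]
    db = IntegerDigits[b, radix]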

Here is a way to multiply the two numbers by using the discrete convolution theorem on the list of digits.

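A sketch using Fourier directly. With Mathematica's default normalization each transform carries a factor of 1/Sqrt[len], so the inverse transform of the product must be rescaled by Sqrt[len]; the digit lists are zero-padded to the length of the full convolution:

    len = Length[da] + Length[db] - 1;
    conv = Round[Sqrt[len] Re[InverseFourier[
        Fourier[PadRight[da, len]] Fourier[PadRight[db, len]]]]]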

But there is a problem: you still need to deal with the carries.

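A sketch of part-by-part carry propagation, walking from the least significant digit upward (PropagateCarries is our name for it; an extra leading digit holds the final carry):

    PropagateCarries[digits_, radix_] :=
      Module[{d = Prepend[digits, 0], carry},
        Do[
          carry = Quotient[d[[i]], radix];
          d[[i]] -= carry radix;
          d[[i - 1]] += carry,
          {i, Length[d], 2, -1}];
        d]
    prod = PropagateCarries[conv, radix]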

Now check the result.

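For instance:

    FromDigits[prod, radix] == a b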

Mathematica 4 provides a more convenient (and generally faster) way to do this via the discrete convolution command ListConvolve.

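The overhang specification {1, -1} with zero padding gives the full linear convolution:

    conv2 = ListConvolve[da, db, {1, -1}, 0];
    conv2 == conv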

Using this, you can make a function to do the job.

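A sketch of such a function (the name ConvolveTimes is ours):

    ConvolveTimes[x_, y_, radix_:10] :=
      FromDigits[
        PropagateCarries[
          ListConvolve[IntegerDigits[x, radix], IntegerDigits[y, radix],
            {1, -1}, 0], radix],
        radix]
    ConvolveTimes[a, b] == a b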

Or with another radix.

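For instance (the radices tried originally are not preserved):

    ConvolveTimes[a, b, 16] == a b
    ConvolveTimes[a, b, 7] == a b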

Now let us try this with larger numbers.

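For illustration, 50000-digit numbers (again, the sizes used originally are not preserved):

    a = Random[Integer, {10^49999, 10^50000 - 1}];
    b = Random[Integer, {10^49999, 10^50000 - 1}];
    Timing[p1 = ConvolveTimes[a, b];]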

IntegerDigits takes a noticeable amount of time for numbers this large because numbers in Mathematica are internally stored in binary and conversion to base 10 requires many operations.

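For instance:

    Timing[da = IntegerDigits[a, 10];]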

This seems disappointing, especially since this is supposed to be an asymptotically fast algorithm.

To see where most of the time is being spent, consider the convolution phase separately.

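A sketch:

    db = IntegerDigits[b, 10];
    Timing[conv = ListConvolve[da, db, {1, -1}, 0];]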

So this means that almost all of the time is being spent propagating the carries, which (from a computational complexity point of view) is the simplest part of the computation.

Here is an alternative way to do the carries that avoids accessing individual parts by using functional programming.

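A sketch of the idea (the name is ours): each step keeps Mod[d, radix] and shifts the quotients one place toward the more significant end, and FixedPoint iterates until no carries remain. Mod and Quotient are listable, so every step works on the whole packed array at once:

    FixedPointPropagateCarries[digits_, radix_] :=
      FixedPoint[
        (Mod[#, radix] + Append[Rest[Quotient[#, radix]], 0]) &,
        Prepend[digits, 0]]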

Not surprisingly, it is quite a bit faster.

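For instance:

    Timing[p2 = FixedPointPropagateCarries[conv, 10];]
    FromDigits[p2, 10] == a b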

However, because FixedPoint may need to do multiple iterations, this is doing more operations than necessary. For a sufficiently small radix, you can use the Mathematica compiler.

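A sketch of the compiled version, under the name the article itself uses below, CompiledPropagateCarries; it assumes all intermediate values fit in machine integers:

    CompiledPropagateCarries =
      Compile[{{digits, _Integer, 1}, {radix, _Integer}},
        Module[{d = Join[{0}, digits], carry = 0},
          Do[
            carry = Quotient[d[[i]], radix];
            d[[i]] -= carry radix;
            d[[i - 1]] += carry,
            {i, Length[d], 2, -1}];
          d]];
    Timing[p3 = CompiledPropagateCarries[conv, 10];]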

Since this is so much faster, you can now redefine the command using the compiled version.

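The redefinition replaces the carry step in ConvolveTimes:

    ConvolveTimes[x_, y_, radix_:10] :=
      FromDigits[
        CompiledPropagateCarries[
          ListConvolve[IntegerDigits[x, radix], IntegerDigits[y, radix],
            {1, -1}, 0], radix],
        radix]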

Even with this work, it is still slower than the internal Mathematica Times, which is using basically the same algorithm.

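A sketch of the comparison:

    Timing[p4 = ConvolveTimes[a, b];]
    Timing[p5 = a b;]
    p4 == p5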

In fact, even if you change Mathematica's settings so that multiplication does not use this advanced algorithm, internal multiplication is still faster.


The reason is not that the algorithm shown here is ineffective, but that its speed is sensitive to the choice of radix. While radix 10 is convenient for seeing input and output in a natural form, it is not very efficient for working with numbers on a computer, which represents them in binary internally.

If we use a larger radix, the length of the fast Fourier transform required goes down, and since this is the key contributor to the asymptotic complexity, the overall speed increases. For example, with radix 256:

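For instance:

    Timing[ConvolveTimes[a, b, 256] == a b]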

Note that IntegerDigits is also much faster here, because extracting digits in a power-of-2 base requires no real conversion from the internal binary representation.

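For instance:

    Timing[IntegerDigits[a, 256];]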

In principle, it is possible to use a larger radix, but with this code there are two problems you may run into in practice. First, Mathematica's ListConvolve algorithm for integers does use a fast Fourier transform, but it makes very conservative estimates of how much precision is needed to be absolutely sure of avoiding any roundoff error. If your radix is too large, ListConvolve will use higher precision, which takes much longer. Second, because of the size of the radix and the lengths of the numbers, the carries will be larger than machine integers, so CompiledPropagateCarries will run into machine-integer overflow.

There is a way to get around these problems, but it involves using ListConvolve with machine real input, which means you need to be concerned about roundoff error that may occur in the fast Fourier transform. Fortunately, Mathematica's internal multiplication algorithm does all of this for you.

Some helpful messages and tools.

The fact that packed arrays are integrated seamlessly into the Mathematica system sometimes makes it difficult to know when inefficiencies arise from unnecessary unpacking. We have included some print forms and messages to help with this kind of program debugging.

Developer`PackedPrintForm does not affect results, but changes the way that packed array expressions are printed, allowing you to see when unpacking has occurred.


To see when unpacking occurs without having to look at the output, there are messages which are by default off, but which can be turned on using a Mathematica system option.

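Assuming the option name used by version 4 and later kernels:

    SetSystemOptions["PackedArrayOptions" -> {"UnpackMessage" -> True}]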

This message helps to explain why using Tr was so fast in the addition example above.

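A sketch of the comparison: Tr operates on the packed array in place, while Apply unpacks it and triggers the message:

    Tr[Range[n]]
    Plus @@ Range[n]   (* the unpacking message is generated here *)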

