2.4.4 Sequence-related functions

Sequence and string

esProc offers s.split() function and A.concat() function to switch between the sequence and the string. s.split(d) function splits string s to form a sequence with the dilimiter d, and identifies data types automatically if working with @p option. If d is omitted, split the string by characters. A.concat(d) function generates a string from sequence A with the dilimiter d, and it data types automatically. If d is omitted, use no dilimiter. Both funcitons can work with @c option to use the comma as the dilimiter. For example:

 

A

1

a,1,c,2011-8-11,false

2

=A1.split@c()

3

=A2.concat@c()

Results of A2 and A3 are as follows:

 

The resulting sequence of A2’s string splitting has string members. To automatically parse the data types, use @p option in the function.

The s.regex(rs) function matches string s with the regular expression rs and returns a sequence-type matching result, or null, if none of the characters in string s can be matched. The use of @c option in the function means case-insensitive, and an @u option makes the function perform matching according to Unicode.

The simplest use of the regex() function is to match the given regular expression with a certain string and get the result. The following example shows the way of matching up the numeric string:

 

A

B

C

1

="a12b".regex("(a[0-9])")

="a12b".regex("(a[0-9]*)")

="a12b".regex("^[0-9]b")

2

="a12b".regex("\\S*([0-9][a-z])")

'\S*([0-9][a-z])

="a12b".regex(B2)

Below are results of A1, B1 and C1:

  

For the regular expression "(a[0-9])" A1 uses, a matches a letter a, [0-9] matches any single-bit number from 0 to 9, and the parentheses () indicates finding a string “beginning with letter a that is followed by a single-bit number” and returning it. As is shown above, the result is a sequence with a single member a1. In B1’s regular expression, the asterisk * means matching the type of character preceding it, which is the number defined by [0-9] here, consecutively. So B1 returns a result of a12. In C1, symbol ^ heads a string and the whole regular expression needs to return a string beginning with a number followed by letter b, but the string a12b isn’t one “headed by a number”; so, the matching fails and C1 returns null.

In A2’s regular expression "\\S*([0-9][a-z])", [a-z] represents a character, i.e. any lower-case letter, from a to z; and \S* stands for any number of non-whitespace characters appearing in a row. Since the slash \ also means the escape character in a string, it needs to be escaped to be a literal – that is the \\S*. The parentheses specify that the returned substring should be the match of [0-9][a-z] while the match of \S* will be discarded. In B2, the regular expression is written as a string constant which doesn’t need to be escaped. C2 thus gets the same result as A2 does. Here’re results of A2 and C2:

 

esProc uses the parentheses () to define the substring to be returned when matching a string with the regular expression. The returned result is a sequence consisting of members defined by the parentheses. Without the parentheses, the substring itself will be returned if the matching is successful.

With the @c option, regex() function becomes case-insensitive when matching a string with the regular expression. For example:

 

A

B

C

1

="a12b".regex@c("(A[0-9])")

="a12b".regex@c("([A-Z][0-9])")

="a12b".regex("([A-Z][0-9])")

Here’re results of A1, B1 and C1:

  

A1’s regular expression includes the uppercase letter A, and B1’s matches any uppercase letter. Both return a result of a1 because of the use of @c option. C1 returns null because the function works alone without the @c option and can’t find a match for the regular expression.

In short, the regex() function returns a sequence consisting of the matching result when the regular expression rs finds its match in string s; and null when it can’t find a match.

esProc allows using the universal unicode symbols to display non-English characters in the regular expression so that the character set setting can’t intervene. To do that, remember adding @u option to the regex() function. For example:

 

A

B

1

="Gerente de Fábrica".regex(".* (.*á.*)")

="Gerente de Fábrica".regex@u(".* (.*\\u00e1.*)")

The regular expression .* (.*á.*) in A1 includes a dot (.) and .*. The dot represents any single character except carriage return and new line; .* represents any number of random characters. A1’s regular expression finds the last word substring containing the character á. So does B1’s. But the regex() function in B1 adds @u option to display the character á as \u00e1 according to Unicode. Usually the syntax of regex@u() is used to parse a string on the outside, keeping the regular expression from being affected by the character set setting. Both A1 and B1 get the same result:

 

Aggregate functions

Aggregate functions used to manipulate sequences include A.sum(),A.avg(), A.max(), A.min() and A.variance() etc., respectively for computing sum, average, maximum value, minimum value and variance. They have similar uses. Here’s an example:

 

A

1

[2,4,6]

2

=A1.sum()

3

=A1.sum(~*~)

A2 computes the sum of the sequence’s members. A3 computes the sum of the members’ squares. Results of A2 and A3 are as follows:

 

If members of a sequence are boolean conditions, we can use A.cand() and A.cor() functions to judge whether all conditions are true or at least one of them is true. For example:

 

A

B

C

D

1

=[1,2,3,4].cand(24%~==0)

=[12,-3,0].cor(24%~==0)

[]

 

2

for 500

=A2%3==2

=A2%5==3

=A2%7==2

3

 

if [B2:D2].cand()

>C1=C1|A2

 

A1 judges whether all members in sequence [1,2,3,4] are divisors of 24. B1 checks whether sequence [12,-3,0] contains any divisor of 24. Line 2 and line 3 calculates a famous mathematical problem in ancient China and store the results within 500 in C1. The problem is known as Chinese Remainder Theorem, which asks: There are certain things whose number is unknown; if we count them by threes, we have two left over; by fives, we have three left over; and by sevens, two are left over; how many things are there? Here are results of A1, B1 and C1.

  

For a sequence having duplicate members, two aggregate functions A.count() and A.icount() can be used to perform two types of count. For example:

 

A

1

[2,3,3,2,5,7,1]

2

=A1.count()

3

=A1.icount()

A2 counts the numbers of members in the given sequence, while A3 counts the number of its distinct members. Here are the results of A2 and A3:

 

We can use options in the function to increase computing efficiency. When all members of a sequence are integers, we add @n and the corresponding function is A.icount@n(); and when all members of a sequence are integers or long integers, we add @b and the corresponding functions is A.icount@b().

Some aggregate functions are order-related, like A.rank(y) function for data sorting and A.median(k:n) function for getting the median value.

 

A

B

C

1

[6,8,1,3,7,2,4,9,5]

 

 

2

=A1.rank(8)

=A1.rank@z(8)

 

3

=A1.median()

=A1.median(1:4)

=A1.median(:3)

A2 finds the rank of 8 in A1 sequence in an ascending order. B2 find the member’s rank in a reversed order. Here are results of A2 and B2:

 

A3 finds the median in A1’s sequence. A median is the member with a value lying at the midpoint of a sequence. If the number of members in a sequence is even, the median is the average of the two members at the midpoint. Using certain parameters, the median() function can get the member at the segmenting point specified by the parameters in an ascending order. If the segmenting point falls between two members, the function will return the average of the two members. B3 gets the member at the 1/4 length of the sequence. In A.median(k:n) function, the number of segments n must be an integer not less than 2 and k must be an integer not greater than n. When parameter k is absent, the function returns values at the points where the sequence is divided evenly into n segments. Here are results of A3, B3 and C3:

  

There are some functions for performing aggregation over multiple sequences. For example:

 

A

B

1

[[1,2,3],[3],[3,4],[6,5,3]]

 

2

=A1.conj()

=A1.union()

3

=A1.diff()

=A1.isect()

A given sequence over which A.conj(), A.union(), A.diff(), A.isect() functions are performing operations needs to be a sequences whose members are also sequences. A2, B2, A3 and B3 compute the concatenation, union, difference and intersection of A1’s member sequences. Their results are as follows:

 

 

Loop functions

Loop functions perform same computation on each member of a sequence. The complex loop statements can be replaced by the simple loop functions which cover loop operation, filter, locate, query, rank and sort, etc. For example:

 

A

B

C

1

[2,4,-6]

=A1.(~+1)

 

2

=A1.select(~>1)

=A1.pselect@a(~>1)

=A1.pos([-6,2])

3

=A1.ranks@z()

=A1.sort()

=A1.-sort(~)

B1 adds 1 to each member of the sequence. The resulting sequence is as follows:

A2 selects members that are greater than 1. B2 finds ordinal numbers of members that are greater than 1. C2 searches for ordinal numbers of member -6 and member 2 in sequence A1. The results of A2, B2 and C2 are as follows:

   

A3 uses @z option to get rankings of members in the sequence in a descending order. B3 sorts members of the sequence in ascending order, and C3 does it in descending order. The results of A3, B3 and C3 are as follows:

   

By the way, we can specify the starting and ending positions for the search in certain location functions. For example:

 

A

B

1

[2,4,-6,null,4,3,]

 

2

=A1.pselect@a(~>1, 5)

=A1.pos(4,3)

A2 searches for positions of members greater than 1 starting from the 5th member. B2 finds position of member 4 starting from the 3rd member. Here are A2 and B2’s results:

 

Besides @1 option and @z option, @0 is another option that often works with location functions. For example:

 

A

B

C

1

[2,4,-6,null,4,3,]

 

 

2

=A1.pselect(~>10)

=A1.pselect@0(~>10)

=A1.pos(5)

3

=A1.pos@0(5)

=A1.pmin()

=A1.pmin@0()

After execution, A2, B2, C2, A3, B3 and C3 get their result as follows:

   

   

By working with @0 option, both A.pselect() function and A.pos() function returns 0 instead of null when they cannot find a corresponding member, and A.pmin@0() returns position of the first null value when there is one in the sequence. The @0 option can work with the some other location or select functions, including A.pfind(), A.ptop () and A.top().

In addition to @0 option, A.pselect() function and A.pos() function can also work with @n option to return “the length of the sequence plus 1” when a corresponding member cannot be found. For example:

 

A

B

1

[2,4,-6,null,4,3,]

 

2

=A1.pselect@n(~>10)

=A1.pos@n(5)

Both A2 and B2 get same result:

 

Note that @0 option and @n option are mutually-exclusive. They cannot work together.