Method of External Memory Sorting

Read(2803) Label: external memory sorting,

It is common to sort records in a table during data analysis and computing. In esProc, sort function is used to sort data in a sequence or a table sequence. But external memory sorting is required when data being sorted is huge, cannot be loaded into memory all together and thus the ordinary sorting method becomes incapable.

7.6.1 External memory sorting on big data

Cursors are usually used in statistical data analysis to fetch massive data. This applies in esProc, which also processes big data with the cursor. Similar to a cursor inside a database stored procedure, an esProc cursor reads one or more records each time according to its position(s) instead of returning data all at once.

A cursor can only fetch part of the whole data every time, thus you can’t directly perform operations like sorting and grouping on all data. esProc uses external memory to solve this big data problem. Data will be retrieved and processed in batches, and the result of handling each of the batches of data will be temporarily stored in the hard disk. Then these intermediate results will be merged and a cursor will be returned to generate the final result.

Here’s an example. Let’s create a big data table in which the dates and 8-digit phone numbers (the first and the last digit should not be zero) are generated arbitrarily and store it as a bin file for convenience.

 

A

B

C

1

100000

1000

=A1/B1

2

=file("PhoneRecord.btx")

=create(ID,Date,PhoneNum)

 

3

=date(now())

0

 

4

for C1

for B1

>B3=B3+1

5

 

 

=elapse(A3,-rand(1000))

6

 

 

=rand(90000000)+10000000

7

 

 

>B2.insert(0,B3,C5,C6)

8

 

>A2.export@ab(B2)

 

9

 

>B2.reset()

 

10

=A2.cursor@b()

>A10.skip(50000)

=A10.fetch(1000)

11

>A10.close()

 

 

Altogether 100,000 rows of data are generated. C10 fetches data from the 50,001th row to the 51,000th row through the cursor:

Let’s use the generated data file, PhoneRecord, to explore how to perform external memory sorting in esProc.The function for doing this is cs.sortx(x…), which sorts data in cursor cs in ascending order according to the result of evaluating expression x…For example:

 

A

1

=file("PhoneRecord.btx")

2

=A1.cursor@b()

3

=A2.sortx(PhoneNum;20000)

4

=A3.fetch(1000)

5

>A3.close()

To learn in detail how esProc sorts data using the external memory, click  in the debug area on the toolbar to execute the code step by step until A5. A2 creates a cursor based on the bin file PhoneRecord.btx. A3 sorts cursor data using sortx function. The parameter 20000 sets the number of buffer rows as 20000. It isn’t necessary in acutual coding because esProc will automatically perform buffering according to the available memory capability. The final sorting result will be a grand cursor generated by merging multiple temporary cursor files in a certain order. Here’s A3’s result:

A4 fetches the first 1,000 records from this resulting cursor:

While A3 is executed, external files are generated in the directory of temporary files:

As more buffer rows are set for the testing execution of sortx function, five temporary files are generated for the 100,000 records in the cursor. Now import data from one of them:

 

A

1

=file("/temp/tmpdata984048609330992507")

2

=A1.import@b()

3

=A2.count()

A2 imports data as follows:

A3 counts the rows in this temporary file:

By comparing the data in A2 with the final sorting result previously obtained, you can see that the former is, actually, the result of sorting a part of the data. This indicates that each temporary file is the result of sorting a part of the data.

Then go on to execute the previous cellset file. You can find that the temporary files will be deleted automatically when the cursor is closed. The sortx function can also sort data by multiple fields. For example:

 

A

1

=file("PhoneRecord.btx")

2

=A1.cursor@b()

3

=A2.sortx(Date,-PhoneNum)

4

=A3.fetch(1000)

5

>A3.close()

A3’s sorting expressions Date and -PhoneNum mean that the function first sorts data by date in ascending order, and then those of the same date by phone number in descending order. It is common to precede an expression for sorting data in descending order by a negative sign. A4 fetches the first 1,000 records from the sorting result:

7.6.2 Applications of external memory sorting

In fact, the process of external memory sorting suggests one of the uses of the cursor-based sorting, which is merging the sorted data in a certain order. The operation is performed based on multiple cursors and computes the sorting expression. Data will be fetched from the cursor that currently has the smallest (or biggest) result. Apparently, this method can only be used when the data of every cursor is ordered in the same way. In addition, cursor-based alignment joins with joinx() function also require that data in each of multiple cursors be sorted. About how to use joinx() function, refer to Merge and Join Operations on Cursors.

When the data of a cursor is properly ordered, it can be specified that data is fetched continuously from the cursor until the expression is changed. For example:

 

A

1

=file("PhoneRecord.btx")

2

=A1.cursor@b()

3

=A2.sortx(Date,-PhoneNum;20000)

4

=A3.fetch(;Date)

5

>3.(A3.skip(;Date))

6

=A3.fetch(;Date)

7

>A3.close()

A4 fetches data from the cursor in A3 until the Date is changed, meaning the data of the first day will be fetched. A5 skips data of consecutive three days. A6 fetches the data of the fifth day. Here’re the results of A4 and A6: