Aggregator Stage
The Aggregator stage is a processing stage. It classifies data rows from a single input link into groups and computes totals or other aggregate functions for each group. The resulting totals for each group are output from the stage via an output link.
The stage editor has three pages:
Stage page. This is always present and is used to specify general information about the stage.
Inputs page. This is where you specify details about the data being grouped and/or aggregated.
Outputs page. This is where you specify details about the groups being output from the stage.
The Aggregator stage gives you access to grouping and summary operations. One of the easiest ways to expose patterns in a collection of records is to group records with similar characteristics, then compute statistics on all records in the group. You can then use these statistics to compare properties of the different groups. For example, records containing cash register transactions might be grouped by the day of the week to see which day had the largest number of transactions or the largest amount of revenue.
Records can be grouped by one or more characteristics, where record characteristics correspond to column values. In other words, a group is a set of records with the same value for one or more columns. For example, transaction records might be grouped by both day of the week and by month. These groupings might show that the busiest day of the week varies by season.
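The grouping described above can be sketched in plain Python. This is only an illustration of the idea, not how the Aggregator stage is implemented; the sample transactions and the helper name aggregate are hypothetical.

```python
from collections import defaultdict

# Hypothetical cash register transactions: (day_of_week, month, amount)
transactions = [
    ("Mon", "Jan", 12.50),
    ("Mon", "Jan", 7.25),
    ("Sat", "Jan", 30.00),
    ("Sat", "Jul", 18.00),
    ("Mon", "Jul", 5.00),
]

def aggregate(rows, key_indexes, value_index):
    """Group rows by the given key columns and return count and total per group."""
    groups = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        # A group is identified by the values of its key columns.
        key = tuple(row[i] for i in key_indexes)
        groups[key]["count"] += 1
        groups[key]["total"] += row[value_index]
    return dict(groups)

# Group on one column (day of the week).
by_day = aggregate(transactions, [0], 2)

# Group on two columns (day of the week and month).
by_day_month = aggregate(transactions, [0, 1], 2)
```

Grouping on one column yields two output rows here, and grouping on two columns yields four, in both cases fewer than the five input rows.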
In addition to revealing patterns in your data, grouping can also reduce the volume of data by summarizing the records in each group, making it easier to manage. If you group a large volume of data on the basis of one or more characteristics of the data, the resulting data set is generally much smaller than the original and is therefore easier to analyze using standard workstation or PC-based tools.
At a practical level, you should be aware that, in a parallel environment, the way that you partition data before grouping and summarizing it can affect the results. For example, if you partitioned using the round robin method, records with identical values in the column you are grouping on would end up in different partitions. If you then performed a sum operation within these partitions, you would not be operating on all the relevant rows. In such circumstances you may want to hash partition the data on one or more of the grouping keys to ensure that your groups are entire.
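The effect of the partitioning method can be simulated outside the product. The following is a minimal sketch, assuming two partitions and a made-up set of rows; zlib.crc32 stands in for whatever hash function the parallel engine actually uses, and the function names are hypothetical.

```python
from collections import defaultdict
import zlib

# Hypothetical input rows: (grouping key, amount)
rows = [("Mon", 10), ("Mon", 20), ("Tue", 5), ("Mon", 30), ("Tue", 15)]

def round_robin(rows, n):
    """Deal rows to n partitions in turn, ignoring their content."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def hash_partition(rows, n):
    """Send every row with the same key to the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is an assumed stand-in hash; it is stable across runs.
        parts[zlib.crc32(row[0].encode()) % n].append(row)
    return parts

def sum_within(partition):
    """Sum amounts per key using only the rows in one partition."""
    totals = defaultdict(int)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)

rr_results = [sum_within(p) for p in round_robin(rows, 2)]
hash_results = [sum_within(p) for p in hash_partition(rows, 2)]
```

With round robin partitioning, the "Mon" group is split across both partitions, so each partition reports only a partial sum. With hash partitioning on the grouping key, each key appears in exactly one partition and its sum is complete.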
It is important that you bear these facts in mind and take any steps you need to prepare your data set before presenting it to the Aggregator stage. In practice this could mean you use Sort stages or additional Aggregator stages in the job.
The Properties tab allows you to specify properties which determine what the stage actually does.