IBM InfoSphere DataStage



IBM InfoSphere DataStage is an ETL tool and part of the IBM Information Platforms Solutions suite and IBM InfoSphere. It uses a graphical notation to construct data integration solutions and is available in various versions such as the Server Edition and the Enterprise Edition.

A data extraction and transformation program for Windows NT/2000 servers that is used to pull data from legacy databases, flat files and relational databases and convert them into data marts and data warehouses. Formerly a product from Ascential Software Corporation, which IBM acquired in 2005, DataStage became a core component of the IBM WebSphere Data Integration suite.

DataStage originated at VMark[1], a spin off from Prime Computers that developed two notable products: UniVerse database and the DataStage ETL tool.


The first VMark ETL prototype was built by Lee Scheffler in the first half of 1996[1].

Peter Weyman was VMark VP of Strategy and identified the ETL market as an opportunity. He appointed Lee Scheffler as the architect and conceived the product brand name "Stage" to signify modularity and component-orientation[2].

This tag was used to name DataStage and subsequently used in related products QualityStage, ProfileStage, MetaStage and AuditStage.

Lee Scheffler presented the DataStage product overview to the board of VMark in June 1996 and it was approved for development.

The product was in alpha testing in October, beta testing in November and was generally available in January 1997.

VMark acquired UniData in October 1997 and renamed itself to Ardent Software[3]. In 1999 Ardent Software was acquired by Informix[4] the database software vendor.

In April 2001 IBM acquired Informix and took just the database business leaving the data integration tools to be spun off as an independent software company called Ascential Software[5].

In November 2001, Ascential Software Corp. of Westboro, Mass. acquired privately held Torrent Systems Inc. of Cambridge, Mass. for $46 million in cash.

Ascential announced a commitment to integrate Orchestrate's parallel processing capabilities directly into the DataStageXE platform. [6].

In March 2005 IBM acquired Ascential Software[7] and made DataStage part of the WebSphere family as WebSphere DataStage.

In 2006 the product was released as part of the IBM Information Server under the Information Management family but was still known as WebSphere DataStage.

In 2008 the suite was renamed to InfoSphere Information Server and the product was renamed to InfoSphere DataStage[8].

•Enterprise Edition: a name give to the version of DataStage that had a parallel processing architecture and parallel ETL jobs.

•Server Edition: the name of the original version of DataStage representing Server Jobs. Early DataStage versions only contained Server Jobs. DataStage 5 added Sequence Jobs and DataStage 6 added Parallel Jobs via Enterprise Edition.

•MVS Edition: mainframe jobs, developed on a Windows or Unix/Linux platform and transferred to the mainframe as compiled mainframe jobs.

•DataStage for PeopleSoft: a server edition with prebuilt PeopleSoft EPM jobs under an OEM arragement with PeopleSoft and Oracle Corporation.

•DataStage TX: for processing complex transactions and messages, formerly known as Mercator.

•DataStage SOA: Real Time Integration pack can turn server or parallel jobs into SOA services.




Friday, June 13, 2008

Processing Stages

Change Capture Stage

The Change Capture Stage is a processing stage. The stage compares two data sets and makes a record of the differences. An example before and after data set are given in Parallel Job Developer's Guide. Follow this link for a list of steps you must take when deploying a Change Capture stage in your job.

The Change Capture stage takes two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data set. The stage produces a change data set, whose table definition is transferred from the after data set’s table definition with the addition of one column: a change code with values encoding the four actions: insert, delete, copy, and edit. The preserve-partitioning flag is set on the change data set.

The compare is based on a set of key columns, rows from the two data sets are assumed to be copies of one another if they have the same values in these key columns. You can also optionally specify change values. If two rows have identical key columns, you can compare the value columns to see if one is an edited copy of the other.

The stage assumes that the incoming data is hash-partitioned and sorted in ascending order (this is done automatically if (auto) is selected on the partitioning tab). The columns the data is hashed on should be the key columns used for the data compare. You can achieve the sorting and partitioning using the Sort stage or by using the built in sorting and partitioning abilities of the Change Capture stage.

You can use the companion Change Apply stage to combine the changes from the Change Capture stage with the original before data set to reproduce the after data set.
The Change Capture stage is very similar to the Difference stage.

The stage editor has three pages:


Stage page. This is always present and is used to specify general information about the stage.

Inputs page. This is where you specify details about the data set having its duplicates removed.

Outputs page. This is where you specify details about the processed data being output from the stage.

The General tab allows you to specify an optional description of the stage.

The Properties tab lets you specify what the stage does. The Advanced tab allows you to specify how the stage executes. The Link Ordering tab allows you to specify which input link carries the before data set and which the after data set.


0 comments: