Guidelines
Design jobs for restartability; if a job is not designed to be restartable, document the reason.
Do not follow a Sequential File stage with "Same" partitioning.
Check that the APT_CONFIG_FILE parameter is added to the job. This is required to change the number of nodes at runtime.
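As a reference point, a minimal two-node configuration file might look like the sketch below (the host name and resource paths are placeholders). Pointing the APT_CONFIG_FILE parameter at a file with more or fewer node entries changes the degree of parallelism per run without editing the job.

```
{
  node "node1"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node1" {pools ""}
    resource scratchdisk "/scratch/ds/node1" {pools ""}
  }
  node "node2"
  {
    fastname "etlhost"
    pools ""
    resource disk "/data/ds/node2" {pools ""}
    resource scratchdisk "/scratch/ds/node2" {pools ""}
  }
}
```

Keeping one file per node count (for example 2node.apt, 4node.apt) and passing the path as the APT_CONFIG_FILE job parameter lets the same job scale per environment.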
Do not hard-code parameters.
Do not hard-code directory paths.
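One way to honor both rules is to supply paths and other values at run time through job parameters. A minimal sketch using the dsjob command-line interface follows; the project, job, and parameter names are invented, and the command is echoed rather than executed so the script runs anywhere:

```shell
# Hypothetical project/job/parameter names -- adjust to your environment.
PROJECT="DEV_PROJ"
JOB="jb_load_customers"
SRC_DIR="/data/landing/customers"   # passed in at run time, never hard-coded in the job
TGT_DB="CUSTDB"

# Build the dsjob invocation; echo it here (dry run) instead of executing,
# so the sketch runs without a DataStage installation.
CMD="dsjob -run -param SRC_DIR=${SRC_DIR} -param TGT_DB=${TGT_DB} ${PROJECT} ${JOB}"
echo "$CMD"
```

Every value that differs between environments (paths, database names, node configuration) should arrive this way rather than being baked into the design.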
Do not use fork-joins to generate lookup data sets.
Use "Hash" aggregation when there is a limited number of distinct key values; it produces output only after all rows are read.
Use "Sort" aggregation for a large number of distinct key values; the data must be pre-sorted, and output is produced after each aggregation group.
When aggregating all rows, use multiple aggregators to reduce collection time: define a constant key column with a Row Generator, let the first aggregator sum in parallel, and let a second aggregator sum the partial results sequentially.
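The two-aggregator pattern can be illustrated outside DataStage. The shell sketch below is an analogy, not DataStage code: each "partition" computes a partial sum (the parallel first aggregator), and a single sequential pass then sums the partials (the second aggregator).

```shell
# Eight sample rows, one value per line.
printf '%s\n' 3 1 4 1 5 9 2 6 > /tmp/amounts.txt

# First stage: each partition sums its own rows (these could run in parallel).
head -4 /tmp/amounts.txt | awk '{s+=$1} END{print s}' >  /tmp/partials.txt
tail -4 /tmp/amounts.txt | awk '{s+=$1} END{print s}' >> /tmp/partials.txt

# Second stage: one sequential pass over the (tiny) set of partial sums.
TOTAL=$(awk '{s+=$1} END{print s}' /tmp/partials.txt)
echo "$TOTAL"
```

The sequential stage only ever sees one row per partition, which is why the pattern cuts collection time on large inputs.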
Make sure sequences are not too long; break them up into logical units of work.
Is error handling done properly? It is preferred to propagate errors from lower-level jobs to the highest level (e.g., a sequence).
What is the volume of extracted data (is there a WHERE clause in the SQL)?
Are the correct scripts invoked to clean up datasets after the job completes?
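A cleanup script for parallel datasets should go through the orchadmin utility rather than a plain rm, because deleting only the .ds control file leaves the data segments behind. A minimal sketch, assuming a retention window in days; the directory layout is hypothetical, and the command is printed rather than executed so the sketch runs outside an engine tier:

```shell
# cleanup_datasets DIR DAYS : report *.ds files under DIR older than DAYS days.
cleanup_datasets() {
  dir="$1"; days="$2"
  find "$dir" -name '*.ds' -mtime +"$days" 2>/dev/null |
  while IFS= read -r ds; do
    echo orchadmin rm "$ds"     # dry run; drop "echo" on a real engine tier
  done
}
```

Such a script belongs in the job's after-job routine or the controlling sequence, so datasets cannot silently accumulate.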
Is there a reject process in place?
Can we combine or split jobs to reduce the number of jobs or their complexity, respectively?
Do not increase the number of nodes when a job contains too many stages, since this multiplies the number of processes spun off.
Is volume and growth information available for the Lookup/Join tables?
Check whether any query uses SELECT *. This is not advised; list the required columns explicitly in the statement instead.
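Such a check can be automated against exported job SQL. A sketch, assuming the SQL statements have been saved to files under a directory (the function and directory names are invented):

```shell
# check_select_star DIR : fail if any SQL file under DIR uses SELECT *.
# The pattern deliberately does not match COUNT(*) and similar aggregates.
check_select_star() {
  if grep -rin 'select[[:space:]]*\*' "$1" 2>/dev/null; then
    echo "FAIL: replace SELECT * with an explicit column list"
    return 1
  fi
  echo "OK: no SELECT * found"
}
```

Running this in a pre-deployment review catches the problem before a table change silently alters job metadata.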
Check the partitioning and sorting at each stage.
When a sequence is used, make sure none of the parameters passed are left blank.
Check that there are separate jobs for at least extract, transform, and load.
Check that each stage and the job are annotated, and that the job properties have the author, date, etc. filled out.
Check the naming conventions of jobs, stages, and links.
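A simple lint can enforce the convention automatically. The sketch below assumes an example convention of a jb_ prefix with lowercase, underscore-separated words; it is not a DataStage requirement, so substitute whatever standard your team uses:

```shell
# check_job_name NAME : accept names matching an example jb_<area>_<purpose>
# convention (jb_ prefix, lowercase words); reject everything else.
check_job_name() {
  case "$1" in
    jb_[a-z]*) echo "OK: $1" ;;
    *)         echo "FAIL: $1 does not match jb_<area>_<purpose>"; return 1 ;;
  esac
}
```

The same shape of check extends to stage and link names once the team's prefixes are agreed.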
Avoid Peek stages in production jobs; peeks are generally used for debugging during development.
Make sure the developer has not suppressed warnings that are valid.
Verify that the jobs conform to the Flat File and Dataset naming specification. This is especially important for cleaning up files and logging errors appropriately.
Verify that all fields are written to the reject flat files. This is necessary for debugging and reconciliation.