Tuesday, April 8, 2014

How to separate Header and Footer from a Flat File using Informatica

Scenario:
The source is a flat file that contains Header, Detail and Footer records. The requirement is to load the Header records into a Header table, the Detail records into a Detail table and the Footer records into a Footer table.

Source File (Source.txt)
H Company Name : Header
#####################

   Acc_num      Amount
D 123                 200
D 456                 300
D 345                 200

#####################
F No. of records : 3


Solution:

Follow the steps:

1. Import the file.

2. Create a mapping with the flat file as the source and a Source Qualifier for the file.

3. Connect the Source Qualifier to an Expression Transformation.

4. In the Expression Transformation, create one input port named File. The port type is string and its length is the maximum length of one row in the source file (each record is read as a single string).

5. In the Expression Transformation, create three variable ports and three output ports:
    v_Header=IIF(SUBSTR(File,1,1)='H',1,0)
    o_Header=v_Header
    v_Footer=IIF(SUBSTR(File,1,1)='F',1,0)
    o_Footer=v_Footer
    v_Detail=IIF(SUBSTR(File,1,1)='D',1,0)
    o_Detail=v_Detail

6. Connect the Expression Transformation to a Router Transformation. The Router Transformation will have three groups, with the following group filter conditions:
    a. o_Header = 1 -- route this group to the Header target table
    b. o_Detail = 1 -- route this group to the Detail target table
    c. o_Footer = 1 -- route this group to the Footer target table
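A slightly more compact variant is to derive the record type once and route on it directly. A minimal sketch, assuming the same single string input port File (port names are illustrative):

    -- Expression Transformation
    v_RecType = SUBSTR(File, 1, 1)
    o_RecType = v_RecType

    -- Router group filter conditions
    o_RecType = 'H'    -- Header group
    o_RecType = 'D'    -- Detail group
    o_RecType = 'F'    -- Footer group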
   
 

Monday, April 7, 2014

Informatica Administrator Training


Day 1 :

00 :  Introduction
 1 : Domain Architecture
 2 : Best Practices
 3 : Installing Informatica 9
 4 : Configuring PowerCenter 9.1 Services

Day 2 :

 5 : Security: Users and Groups
 6 : Security: Privileges and Roles
 7 : Security: Permissions
 8 : PowerCenter Repository and Client Applications Management

Day 3 :

  9 : Web Services Hub Management
 10 : PowerCenter Repository Metadata Deployment
 11 : Command Line Programs
 12 : Subject Area Implementation

Day 4 :

 13 : Metadata Reporting
 14 : Reference Table Management
 15 : Domain Management
 16 : License Administration

Best Practices when building Maps


When you build maps, work through the following questions before you begin; they can help clarify how to architect the data flow and the database objects involved:

1. How much data is there passing through this map?
2. What are the source, target, aggregator, and lookup sizes (in terms of Rows and Megabytes of space)?
3. What is the time frame this map is expected to run in?
4. Are we loading data over the network?
5. Are we waiting on a flat file?  More than one?
6. Is there a way to balance the size of the source / target and maybe eliminate the lookups by utilizing Database Routines?
7. Is the speed such a problem that we must eliminate aggregators?
8. Can we tune the SQL on the source qualifier?
9. Do we need to generate new sequence ID numbers for each row?

How to handle apostrophes in Informatica?


In Informatica, CHR(39) returns the apostrophe character (').
Say we want the output to be "Tarun's Blogs"; the variable expression would then be defined as below:

v_out= 'Tarun'||CHR(39)||'s Blogs' 
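The same approach works whenever a port value has to be wrapped in single quotes, for example when building a dynamic SQL fragment. A minimal sketch, assuming a hypothetical string port CUST_NAME:

v_quoted_name = CHR(39) || CUST_NAME || CHR(39)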

Problems with a zero-length field in a Source Definition



If a field in the source view is defined as a NULL value, the Informatica source definition considers the field width to be zero.

How does this create a problem?

This becomes a problem when the code is imported into another environment.

Solution:
To avoid this problem, import the code through a label, or copy the whole folder.
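Another way to avoid the zero-length field in the first place is to give the NULL column an explicit data type in the view definition, so the imported source definition picks up a real precision. A minimal Oracle-style sketch with hypothetical names:

CREATE OR REPLACE VIEW V_CUSTOMER_EXTRACT AS
SELECT CUST_ID,
       CUST_NAME,
       CAST(NULL AS VARCHAR2(20)) AS FILLER_COL  -- explicit type and length instead of a bare NULL
FROM   CUSTOMER;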

How to search Informatica mappings from the metadata tables using a SQL query


SELECT DISTINCT C.SUBJ_NAME, B.MAPPING_NAME
FROM OPB_SUBJECT C, OPB_MAPPING B, OPB_WIDGET_INST A
WHERE C.SUBJ_ID = B.SUBJECT_ID
  AND B.MAPPING_ID = A.MAPPING_ID
  AND UPPER(A.INSTANCE_NAME) LIKE UPPER('%MAP%')
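The query above returns mappings that contain a transformation instance whose name matches the pattern. To search by the mapping name itself, a minimal variant against the same repository tables (assuming the standard OPB_* repository schema):

SELECT DISTINCT C.SUBJ_NAME, B.MAPPING_NAME
FROM OPB_SUBJECT C, OPB_MAPPING B
WHERE C.SUBJ_ID = B.SUBJECT_ID
  AND UPPER(B.MAPPING_NAME) LIKE UPPER('%MAP%')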

Sunday, April 6, 2014

How to run a single session multiple times in a flow?

 


There is a workflow with three sessions:
S1 ---> S2 ---> S3
How do we get 5 successful runs of S2 (session 2) without connecting the same
session 5 times in the flow (i.e. S1, then S2 five times, then S3)?

session s1 --> session s2 --> assignment task --> decision task --> cmd task (touch file)
                                                        |
                                                        +--> session s3
event wait (file watcher on the touch file) --> session s2

1. The first time, session s2 runs after session s1 normally.
2. The assignment task increments a workflow-level variable that counts the runs of session s2 (see the sketch after these steps).
3. The decision task is used to skip the cmd task after 5 successful runs of session s2 and to run session s3 instead.
4. The cmd task creates a touch file (blank file) in some location.
5. The event wait (file watcher) waits for that file and starts session s2 again.
6. Since we want session s2 to run 5 times before s3, after the 5th run of session s2 the cmd task is skipped and session s3 runs.
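A minimal sketch of the expressions involved, assuming a hypothetical workflow variable $$RUN_COUNT (integer, default 0) and a hypothetical decision task named dec_CheckRuns:

-- Assignment task (after session s2)
$$RUN_COUNT = $$RUN_COUNT + 1

-- Decision task dec_CheckRuns, condition
$$RUN_COUNT < 5

-- Link condition from dec_CheckRuns to the cmd task (loop again)
$dec_CheckRuns.Condition = TRUE

-- Link condition from dec_CheckRuns to session s3 (done looping)
$dec_CheckRuns.Condition = FALSE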
 

How to get the ratio of two numbers in Informatica?

 


What will be the logic?
e.g.:
id  Number1  Number2  Ratio
1   3000     488      375:61
2   200      100      2:1
3   40       15       8:3

Use this code in a Java transformation to compute the GCD:
// Forward the input values to the output ports (a no-op if the ports are defined as input/output).
Number1 = Number1;
Number2 = Number2;

// Compute GCD(Number1, Number2) by repeated subtraction (Euclid's method).
int i, j, GCD_temp = 1;
i = Number1;
j = Number2;
while (i != 0 && j != 0)
{
    if (i > j)
        i = i - j;
    else
        j = j - i;
}
if (i == 0)
    GCD_temp = j;
else
    GCD_temp = i;
GCD = GCD_temp;


Your expression ports will be:
1 Product_ID as "Product_ID"
2 Year1 as "Year1"
3 Sales1 as "Number1"
4 Year2 as "Year2"
5 Sales2 as "Number2"
6 GCD
7 RatioName as "TO_CHAR(Year1, 'YYYY') || '_' || TO_CHAR(Year2, 'YY')"
8 Ratio as "TO_CHAR(Number1 / GCD) || ':' || TO_CHAR(Number2 / GCD)"
 

Wednesday, March 12, 2014

Informatica PowerCenter Development Best Practices




1. Lookup - Performance considerations
What is a Lookup transformation? It is not just another transformation that
fetches data to look up against the source data. A Lookup is an important and
useful transformation when used effectively; if used improperly, it can severely
impair the performance of your mapping.

Let us see the different scenarios where you can face problems with Lookup
and also how to tackle them.

1.1. Unwanted columns
By default, when you create a lookup on a table, PowerCenter gives you all the
columns in the table. If not all of the columns are required for the lookup
condition or the return value, delete the unwanted columns from the transformation;
leaving them in only increases the cache size.
 
1.2. Size of the source versus size of lookup
Let us say, you have 10 rows in the source and one of the columns has to be
checked against a big table (1 million rows). Then PowerCenter builds the
cache for the lookup table and then checks the 10 source rows against the
cache. It takes more time to build a cache of 1 million rows than to go to the
database 10 times and look up against the table directly.
Use an uncached lookup instead of building a static cache when the number of
source rows is far smaller than the number of lookup rows.

1.3. JOIN instead of Lookup
In the same context as above, if the Lookup transformation comes right after the
Source Qualifier and there is no active transformation in between, you can instead
use a SQL override in the Source Qualifier and join to the lookup table with a
traditional database join, provided both tables are in the same database and
schema.
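A minimal sketch of such a Source Qualifier SQL override, assuming hypothetical ORDERS (source) and CUSTOMER_DIM (former lookup) tables:

SELECT O.ORDER_ID,
       O.CUST_ID,
       O.ORDER_AMT,
       C.CUST_NAME          -- the value previously fetched by the Lookup
FROM   ORDERS O
LEFT OUTER JOIN CUSTOMER_DIM C
       ON C.CUST_ID = O.CUST_ID

Keep the columns in the same order as the Source Qualifier ports, since PowerCenter matches the override columns to the ports positionally.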

1.4. Conditional call of lookup
Instead of going for a connected Lookup with filters to achieve a conditional lookup
call, go for an unconnected Lookup. Is the single return port a limitation
here? Go ahead and change the SQL override to concatenate the required
columns into one big column, and break it back into individual
columns on the calling side.
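A minimal sketch of the calling side, assuming a hypothetical unconnected lookup lkp_CUSTOMER whose SQL override returns CUST_NAME and CUST_CITY concatenated with a '|' delimiter (all port names illustrative):

-- Call the lookup only when the source value is missing
v_lkp_result = IIF(ISNULL(CUST_NAME_SRC), :LKP.lkp_CUSTOMER(CUST_ID), NULL)

-- Split the concatenated return value back into individual columns
o_CUST_NAME = SUBSTR(v_lkp_result, 1, INSTR(v_lkp_result, '|') - 1)
o_CUST_CITY = SUBSTR(v_lkp_result, INSTR(v_lkp_result, '|') + 1)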

1.5. SQL query
Find the execution plan of the Lookup SQL and see if you can add some
indexes or hints to the query to make it fetch data faster. You may need the help
of a database developer to accomplish this if you are not comfortable with SQL
yourself.

1.6. Increase cache
If none of the above options provide performance enhancements, then the
problem may lie with the cache. The cache that you assigned for the lookup is
not sufficient to hold the data or index of the lookup. Whatever data does not
fit into the cache is spilled into the cache files designated in
$PMCacheDir. When PowerCenter does not find the data you are looking
for in the cache, it swaps data from the file into the cache and keeps doing
this until the data is found. This is quite expensive, because this type of
operation is very I/O intensive. To stop this from occurring, increase the
size of the cache so the entire data set resides in memory. When increasing the
cache you also have to be aware of the system constraints. If your cache size is
greater than the resources available, the session will fail due to the lack of
resources.

1.7. Cachefile file-system
In many cases, if the cache directory is on a file system different from the one
local to the hosting server, building up the cache files can take time and add
latency. With the help of your system administrator, look into this aspect as
well.

1.8. Useful cache utilities
If the same lookup SQL is being used by another lookup, then shared cache or
a reusable Lookup should be used. Also, if you have a table whose data does
not change often, you can use the persistent cache option to build the cache
once and reuse it across consecutive runs.

2. Workflow performance – basic considerations
Though performance tuning is often the most feared part of development, it
is one of the easiest parts once the intricacies are known. With newer versions
of PowerCenter, there is added flexibility for the developer to build
better-performing workflows. The major levers for performance are the design of
the mapping and SQL tuning where databases are involved.
Regarding the design of the mapping, I have a few basic considerations. Please
note that these are not hard rules of thumb, but they will help you act
sensibly in different scenarios.

1. I would always suggest that you think twice before using an Update
Strategy, even though it adds a certain level of flexibility to the mapping. If
you have a straight-through mapping that takes data from the source and
directly inserts all the records into the target, you do not need an
update strategy.

2. Use a pre-SQL delete statement if you wish to delete specific rows from
the target before loading. Use the truncate option in the session
properties if you wish to clean out the table before loading. I would avoid a
separate pipeline in the mapping, running before the load, with an
Update Strategy transformation just for this purpose.

3. Say you have 3 sources and 3 targets with one-to-one mappings. If the loads
are independent according to the business requirement, I would create 3
different mappings and 3 different session instances and run them all in
parallel in the workflow after the "Start" task. I have observed that the
workflow runtime comes down to between 30% and 60% of the serial processing time.

4. PowerCenter is built to work on high volumes of data, so let the server
be kept completely busy. Introduce parallelism as far as possible into the
mapping/workflow.

5. If using a transformation like a Joiner or Aggregator transformation,
sort the data on the join keys or group by columns prior to these
transformations to decrease the processing time.

6. Filtering should be done at the database level instead of within the
mapping. The database engine is much more efficient at filtering than
PowerCenter.

The above examples are just some things to consider when tuning a mapping.

2.1. SQL tuning
SQL queries/actions occur in PowerCenter in one of the below ways.
• Relational Source Qualifier
• Lookup SQL Override
• Stored Procedures
• Relational Target
Using the execution plan to tune a query is the best way to gain an
understanding of how the database will process the data. Some things to keep
in mind when reading the execution plan include: "Full Table Scans are
not evil", "Indexes are not always fast", and “Indexes can be slow
too".
Analyse the table data to see whether picking 20 records out of 20 million is
better served by an index or by a table scan, and whether fetching 10 records
out of 15 is really faster with an index than with a full table scan.
Many times the relational target indexes create performance problems when
loading records into the relational target. If the indexes are needed for other
purposes, it is suggested to drop the indexes at the time of loading and then
rebuild them in post-SQL. When dropping indexes on a target you should
consider integrity constraints and the time it takes to rebuild the index on post
load vs. actual load time.
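A minimal sketch of the drop-and-rebuild approach using the session's target pre-SQL and post-SQL properties, assuming a hypothetical Oracle target SALES_FACT with an index IDX_SALES_FACT_CUST:

-- Target pre-SQL (run before the load)
DROP INDEX IDX_SALES_FACT_CUST

-- Target post-SQL (run after the load)
CREATE INDEX IDX_SALES_FACT_CUST ON SALES_FACT (CUST_ID)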

3. Pre/Post-Session command - Uses
• It is a very good practice to email the success or failure status of a task
once it is done. In the same way, when a business requirement calls for it,
make use of the Post-Session Success and Failure email options for proper
communication.
• The built-in feature offers more flexibility, with session logs as
attachments, and also provides other run-time data such as the workflow
run-instance ID.
• Any archiving activities around the source and target flat files can be
easily managed within the session using the session properties for flat
file command support that is new in PowerCenter v8.6. For example,
after writing the flat file target, you can setup a command to zip the file
to save space.
• If the target flat files need any editing that your
mapping could not accommodate, write a shell/batch command or script
and call it from the Post-Session command task. I prefer weighing
PowerCenter capabilities against OS capabilities in these
scenarios.

4. Sequence generator – design considerations
In most cases, I would advise you to avoid using a Sequence Generator
transformation to populate an ID column in a relational target table. I
suggest you instead create a sequence on the target database and enable a
trigger on that table to fetch the value from the database sequence.
There are many advantages to using a database sequence generator:
• Fewer PowerCenter objects will be present in a mapping which reduces
development time and also maintenance effort.
• ID generation is PowerCenter independent if a different application is
used in future to populate the target.
• Migration between environments is simplified because there is no
additional overhead of considering the persistent values of the
sequence generator from the repository database.
In all of the above cases, a sequence created in the target database makes
life a lot easier for table data maintenance and also for PowerCenter
development. In fact, databases have mechanisms specifically designed to
deal with sequences, so you are effectively implementing a manual push-down
optimization in your PowerCenter mapping design.
DBAs will always complain about triggers on their databases, but I would still
insist on the sequence-trigger combination, even for huge volumes of data.
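A minimal Oracle-style sketch of the sequence-plus-trigger approach, with hypothetical object names:

CREATE SEQUENCE SEQ_CUSTOMER_ID START WITH 1 INCREMENT BY 1;

CREATE OR REPLACE TRIGGER TRG_CUSTOMER_ID
BEFORE INSERT ON CUSTOMER_DIM
FOR EACH ROW
BEGIN
  -- Assign the surrogate key only when the loading process did not supply one
  IF :NEW.CUSTOMER_ID IS NULL THEN
    SELECT SEQ_CUSTOMER_ID.NEXTVAL INTO :NEW.CUSTOMER_ID FROM DUAL;
  END IF;
END;
/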

5. FTP Connection object – platform independence
If you have files to be read as a source from a Windows server while your
PowerCenter server is hosted on UNIX/Linux, make use of an FTP user
on the Windows server and use the file reader with an FTP connection object.
This connection object can be assigned like any other connection. It gives
you platform independence and further reduces the overhead
of maintaining Samba mounts on the Informatica boxes.

Dynamic Partitioning In Informatica


Dynamic Partitioning

 
Supported Versions of Informatica:

PowerCenter 9.1 and 9.5

 

Dynamic Partitioning:

If the volume of data grows or you add more CPUs, you might need to adjust partitioning so that session run time does not increase. With dynamic partitioning, the Integration Service determines the number of partitions to create at run time based on factors such as source database partitions or the number of nodes in a grid.

 

Types of Dynamic Partitioning:

1.    Based on Source Partitioning - Determines the number of partitions using database partition information. The number of partitions is the maximum of the number of partitions at the source.

2.    Based on number of CPUs - Sets the number of partitions equal to the number of CPUs on the node that prepares the session. If the session is configured to run on a grid, dynamic partitioning sets the number of partitions equal to the number of CPUs on the node that prepares the session multiplied by the number of nodes in the grid.

3.    Based on number of Nodes in Grid - Sets the partitions to the number of nodes in the grid running the session. If you configure this option for sessions that do not run on a grid, the session runs in one partition and logs a message in the session log.

4.    Based on No. of Partitions - Sets the partitions to a number that you define in the Number of Partitions attribute. Use the $DynamicPartitionCount session parameter, or enter a number greater than 1.
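For option 4, a minimal parameter file sketch with hypothetical folder, workflow and session names; set the session's partitioning option to "Based on number of partitions" and reference the parameter in the Number of Partitions attribute:

[MyFolder.WF:wf_load_sales.ST:s_m_load_sales]
$DynamicPartitionCount=4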

 

Description:

Informatica PowerCenter provides enhanced dynamic pipeline partitioning capabilities that, when appropriately configured, can increase pipeline parallelism, resulting in greater throughput and a significant performance improvement.

 

Benefits:

Enhanced load performance, with greater throughput and a significant reduction in session run time.

 

Note: Dynamic partitioning can be used only if the Source Qualifier does not include a SQL query override or a source filter.

 

 

Steps to implement the solution:

1. Open the Informatica Administrator home page (Admin Console) and check whether partitioning is enabled. The partitioning feature must be enabled in the license, as shown below.

 

2. Create 4 partitions of the data at the database level for dynamic source partitioning (a minimal DDL sketch follows step 5).

3. Develop the workflow in Informatica.

4. Assign valid connection objects for the source.

5. Open the workflow and observe the partition settings of the session via Edit -> Mapping tab -> Partitions.
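For step 2, a minimal Oracle sketch of the database-level partitioning, with hypothetical table and column names:

CREATE TABLE SALES_SRC
(
  SALE_ID   NUMBER,
  CUST_ID   NUMBER,
  SALE_AMT  NUMBER
)
PARTITION BY HASH (CUST_ID) PARTITIONS 4;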
 








As shown above, the Source Qualifier transformation's partition type is set to Database Partitioning. The same can be verified by clicking "Edit Partition Point". Partitioned and non-partitioned transformations are listed accordingly in the navigation panel.
6. In the "Config Object" tab of the session, observe the "Partitioning Options" section. From here, select the type of dynamic partitioning.
 
 
 
7. Run the workflow and view the target table for the result. Observe the source/target statistics as shown below; 2 partitions have been created, based on the 2 partitions in the source database.

Throughput using a simple task:
Note: here the data at the database level is not partitioned, so no data is fetched for partition 2.
8. When the execution completes, open the session log by right-clicking the session in the Workflow Monitor and selecting "Session Log" from the context menu.
9. Observe in the session log how the partitions are created. You should see 2 partitions created (2 reader threads, 2 writer threads), since the source database table is partitioned into 2.
 
Drawbacks:
Dynamic partitioning can’t be used if the source qualifier includes a SQL query or source filter.