Interview Questions
1. Explain rollup?
Ans. ROLLUP processes groups of input records that have the same key, generating one output record for each group. Typically, the output record summarizes or aggregates the data in some way; for example, a simple ROLLUP might calculate a sum or average of one or more input fields. ROLLUP can select certain information from each group; for example, it might output the largest value in a field, or accumulate a vector of values that conform to specific criteria.
You can use a ROLLUP component in two modes, depending on how you define the transform parameter:
Template mode — You define a simple rollup function that may include aggregation functions. Template mode is the most common way to use ROLLUP.
Expanded mode — You create a transform using an expanded rollup package. This mode allows for rollups that do not necessarily use regular aggregation functions.
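Since DML is not runnable outside Ab Initio, here is a minimal Python sketch of what ROLLUP does conceptually: one aggregated output record per key group. The field names and the sum aggregation are only illustrative, not taken from any real graph:

```python
from itertools import groupby
from operator import itemgetter

def rollup(records, key, aggregate):
    """Group records by key and emit one aggregated record per group."""
    records = sorted(records, key=itemgetter(key))   # rollup expects key-grouped input
    return [aggregate(k, list(grp))
            for k, grp in groupby(records, key=itemgetter(key))]

transactions = [
    {"account": "A", "amount": 100},
    {"account": "B", "amount": 50},
    {"account": "A", "amount": 25},
]

# One output record per key group, summarizing the group -- here, a sum.
totals = rollup(transactions, "account",
                lambda k, grp: {"account": k,
                                "total": sum(r["amount"] for r in grp)})
print(totals)  # [{'account': 'A', 'total': 125}, {'account': 'B', 'total': 50}]
```

Three input records collapse to two output records, one per distinct key, which is the defining behavior of ROLLUP.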
2. How to achieve parallelism ?
Ans. Ab Initio software uses three kinds of parallelism:
1. Component Parallelism: Component parallelism allows program components to be executed simultaneously on different branches of a graph. The more branches a graph has, the greater the possibilities for component parallelism.
The following graph takes the Customers and Transactions datasets, sorts them, and then merges them into a dataset named Merged Information:
Because the Sort Customers and Sort Transactions components are on different branches of the graph, they execute at the same time, creating component parallelism.
2. Data Parallelism: Data parallelism occurs when a graph separates data into multiple divisions, allowing multiple copies of program components to operate on the data in all the divisions simultaneously.
3. Pipeline Parallelism: Pipeline parallelism occurs when several connected program components on the same branch of a graph execute simultaneously.
For example, the following graph divides a list of customers into two groups, Good Customers and Other Customers. The Score component assigns a score to each customer in the Customers dataset; then the Select component directs each customer to the proper group based on that score.
Both Score and Select read records as they become available and write each record immediately after processing it. After Score finishes scoring the first Customer record and sends it to Select, Select determines the destination of the record and sends it to the appropriate output file. At the same time, Score reads the second Customer record. The two processing stages of the graph run concurrently — this is pipeline parallelism.
If a source component must read all its records before writing any records — a SORT component, for example — pipeline parallelism does not occur. Because a SORT component blocks pipeline parallelism, many Ab Initio components are built with the capability of processing unsorted data, thus eliminating the need to insert a SORT component.
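The record-at-a-time flow described above can be illustrated with Python generators, which also pass each record downstream as soon as it is produced. This is only a single-process sketch of the streaming behavior (real pipeline parallelism runs the components as separate processes), and the scoring rule is made up:

```python
def score(customers):
    # Emits each scored record as soon as it is read -- no buffering.
    for c in customers:
        yield {**c, "score": len(c["name"]) * 10}  # toy scoring rule

def select(scored, threshold=40):
    # Routes each record the moment it arrives from the upstream stage.
    for c in scored:
        dest = "good" if c["score"] >= threshold else "other"
        yield dest, c

customers = [{"name": "Ann"}, {"name": "Roberto"}]
routed = list(select(score(customers)))
print(routed)
# [('other', {'name': 'Ann', 'score': 30}), ('good', {'name': 'Roberto', 'score': 70})]
```

A sort stage, by contrast, would have to consume the whole input before yielding anything, which is exactly why a SORT component blocks pipeline parallelism.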
3. How to read multifile?
Ans. The READ MULTIPLE FILES component reads multiple input files, reformats them to use a different record type, and writes the file contents to a single serial file.
1. Add an INPUT FILE component whose URL specifies a file containing a list of the files to read and whose record format specifies how the filenames are separated.
In read_multiple_files_simple.mp, the URL of Input File List is file:$AI_SERIAL/list_of_files.dat, and the referenced file has the following content:
trans-2006-01.dat
trans-2006-02.dat
trans-2006-03.dat
The record format for list_of_files.dat (in $AI_DML/Input_File_List.read.dml) is:
record string('\n') filename;
end;
2. Connect a READ MULTIPLE FILES component to the INPUT FILE and configure it as follows:
Hold down the Shift key and double-click the component to generate a default transform. (If the Transform Editor is displayed in Package View, switch to Text View, and you will be prompted to generate the transform.)
Delete all the transform code except the get_filename function.
Code the get_filename function such that it will generate the full path to each input file and then retrieve its content. For example:
filename :: get_filename(in) =
begin
  filename :: string_concat($AI_SERIAL, '/', in.filename);
end;
On the Ports tab of the Properties dialog for the READ MULTIPLE FILES component, specify the record format for the output data (which is the record format for each file in the list). To do this, select Use file and then browse to the location of a file containing the record format.
This graph uses transaction.dml, which includes transaction_type.dml. The definition of transaction in transaction_type.dml is as follows:
type transaction =
record
date("YYYY.MM.DD") trans_date;
decimal(9,0) account_id;
decimal(10,2) amount;
string('\n') description;
end;
4. Types of files you worked on?
Ans. Worked on below files:
1. .dat - Stores data, in either serial or parallel (partitioned) form.
2. .dbc - Enables a database connection.
3. .dml - Stores DML record format definitions.
4. .mp - Stores graphs or graph components.
5. .plan - Defines a set of tasks for Conduct>IT.
6. .pset - Defines a set of graph, plan, project, or sandbox parameters.
7. .xfr - Stores DML transform functions or packages.
8. .xml - In an installation context, controls configuration for a process such as the bridge (bridge-7070.xml) or the reporter (reporter-config.xml). In a job context, an external data file.
9. .rset - Stores a set of expressions and rules for Express>It rulesets.
10. .aic - Host connection profile — Enables client-to-host/Application Hub connection from the client side. Stored on client computer. The default host connection settings file is named Normal.aic.
5. What is lookup?
Ans.
6. Difference between lookup and lookup file?
Ans.
7. Different sources and targets you worked on?
Ans.
8. How to read Mainframe file ?
Ans.
9. How to read data from table ?
Ans.
10. What is needed to load data into table ?
Ans.
11. One extra field has come in the source and it is not in the target, and the job is failing. How to resolve this issue?
Ans.
12. How to improve graph performance?
Ans.
- Go parallel as soon as possible using Ab Initio partitioning techniques.
- Once data is partitioned, do not bring it back to serial and then to parallel again; repartition instead.
- For small processing jobs, serial may be better than parallel.
- Do not access large files across NFS; use the FTP component instead.
- Use Ad Hoc MFS to read many serial files in parallel, and use the CONCATENATE component to combine them.
- Using phase breaks lets you allocate more memory to individual components and can make your graph run faster.
- Use a checkpoint after the SORT rather than landing the data to disk.
- Use the in-memory feature of JOIN and ROLLUP.
- Best performance is gained when components can work entirely in memory, within MAX-CORE.
- MAX-CORE for SORT is calculated from the size of the input data file.
- For an in-memory JOIN, the memory needed is roughly the non-driving data size plus overhead.
- If an in-memory JOIN cannot fit its non-driving inputs in the provided MAX-CORE, it drops all the inputs to disk, and running in memory no longer makes sense.
- Use ROLLUP and FILTER BY EXPRESSION as early as possible to reduce the number of records.
- When joining a very small dataset to a very large one, it is more efficient to broadcast the small dataset to the MFS using the BROADCAST component, or to use the small file as a lookup.
13. Explain diff partitioning components?
Ans.
14. What is plan ?
Ans.
15. Diff between plan and wrapper script ?
Ans.
16. How to design generic graph ?
Ans.
17. How to configure output table ?
Ans.
18. Most critical issue faced during Production Support? (In terms of technical issue)
Ans.
19. Air command to import, to create tag?
Ans.
20. Command to test db connection?
Ans.
21. Query to get 2nd highest salary from each department?
Ans. SELECT MAX(SALARY) FROM EMPLOYEE WHERE SALARY < (SELECT MAX(SALARY) FROM EMPLOYEE);
(Note: this returns the 2nd highest salary overall, not per department.)
or
SELECT * FROM (
  SELECT EMPLOYEE_NAME, DEPARTMENT_ID, SALARY,
         DENSE_RANK() OVER (PARTITION BY DEPARTMENT_ID ORDER BY SALARY DESC) r
  FROM EMPLOYEE)
WHERE r = &n;
(PARTITION BY DEPARTMENT_ID, assuming a department column, gives the rank within each department, as the question asks.)
To find the 2nd highest salary, set n = 2; to find the 3rd highest, set n = 3, and so on.
OR
SELECT SALARY FROM EMPLOYEE_INFO E1 WHERE N-1 = (SELECT COUNT(DISTINCT SALARY) FROM EMPLOYEE_INFO E2 WHERE E2.SALARY>E1.SALARY);
Where N = 1,2,3...
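A quick Python sketch (with made-up data) shows the per-department logic the ranking query is after; like DENSE_RANK, it ranks distinct salaries, so ties do not skip a rank:

```python
def nth_highest_by_dept(employees, n=2):
    """Return the nth-highest distinct salary within each department."""
    by_dept = {}
    for e in employees:
        by_dept.setdefault(e["dept"], set()).add(e["salary"])  # distinct salaries
    return {dept: sorted(sals, reverse=True)[n - 1]
            for dept, sals in by_dept.items()
            if len(sals) >= n}

emps = [
    {"dept": "HR", "salary": 4000}, {"dept": "HR", "salary": 5000},
    {"dept": "IT", "salary": 9000}, {"dept": "IT", "salary": 7000},
    {"dept": "IT", "salary": 9000},  # duplicate top salary -- counted once
]
print(nth_highest_by_dept(emps))  # {'HR': 4000, 'IT': 7000}
```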
22. Unix command to print count of words from each line which are starting from "a" ?
Ans.
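An awk or grep one-liner is the usual shell answer; as a sketch of the logic only, the same per-line count in Python, matching "a" case-insensitively (whether "a" should match "A" is an assumption):

```python
def count_a_words(lines):
    # For each line, count words beginning with the letter "a".
    return [sum(1 for w in line.split() if w.lower().startswith("a"))
            for line in lines]

text = ["an apple a day", "keeps the doctor away", "no match"]
print(count_a_words(text))  # [3, 1, 0]
```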
1. Rollup vs Scan?
Ans. Scan - It generates a cumulative summary of the input records, i.e., Scan gives one output record for each input record.
Two modes to use Scan :
1. Template Mode
2. Expanded Mode
Rollup: It generates an aggregate summary of the input records. Rollup processes groups of input records that have the same key, generating one output record for each group.
For example: the simplest rollup might compute a sum or average of one or more input fields. Rollup can also select certain information from each group; for example, it might output the largest value in a field or accumulate a vector of values that conform to specific criteria.
Two modes to use Rollup :
1. Template Mode
2. Expanded Mode
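The difference is easy to see in a small Python sketch: scan yields a running total per input record, while rollup yields one total per group (here the whole input is treated as a single group, and sum is just an example aggregation):

```python
from itertools import accumulate

amounts = [10, 20, 30]

# SCAN: one output per input record -- a running (cumulative) summary.
scan_out = list(accumulate(amounts))   # [10, 30, 60]

# ROLLUP: one output per group -- the final aggregate only.
rollup_out = sum(amounts)              # 60

print(scan_out, rollup_out)
```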
2. Sort with Null key?
Ans. If the sort key field contains NULL values, the NULL records are listed first in ascending order and last in descending order.
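This NULL ordering can be mimicked in Python, where None is not directly comparable to numbers, by using a sort key that places None before every real value:

```python
data = [3, None, 1, None, 2]

# Ascending: NULLs (None) listed first.
# The key tuple sorts (False, _) -- i.e., None -- before any real value.
asc = sorted(data, key=lambda x: (x is not None, x if x is not None else 0))
print(asc)   # [None, None, 1, 2, 3]

# Descending: NULLs listed last (the reverse of the ascending order).
desc = asc[::-1]
print(desc)  # [3, 2, 1, None, None]
```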
3. Dedup sort 3 cases?
Ans. Dedup sort offers three choices for the keep parameter:
1. Keep : First
2. Keep : Last
3. Keep : Unique Only
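A Python sketch of the three keep choices, deduplicating by a made-up id key (first keeps the first record per key, last the last, and unique-only keeps only keys that occur exactly once):

```python
from collections import Counter

def dedup(records, key, keep="first"):
    """Keep one record per key: 'first', 'last', or 'unique_only'."""
    if keep == "unique_only":
        counts = Counter(r[key] for r in records)
        return [r for r in records if counts[r[key]] == 1]
    seen = {}
    for r in records:
        if keep == "first":
            seen.setdefault(r[key], r)   # first occurrence wins
        else:
            seen[r[key]] = r             # "last": later records overwrite
    return list(seen.values())

recs = [{"id": 1, "v": "x"}, {"id": 1, "v": "y"}, {"id": 2, "v": "z"}]
print(dedup(recs, "id", "first"))        # [{'id': 1, 'v': 'x'}, {'id': 2, 'v': 'z'}]
print(dedup(recs, "id", "last"))         # [{'id': 1, 'v': 'y'}, {'id': 2, 'v': 'z'}]
print(dedup(recs, "id", "unique_only"))  # [{'id': 2, 'v': 'z'}]
```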
4. Null as key to dedup sort?
Ans.
5. If malformed data is given to Reformat but you don't want the process to fail?
6. CdC,SCD?
Ans.
Change Data Capture :
Slowly Changing Dimension :
7. 3 types of partitioning?
Ans. Partition by Key: partitions the input records on the basis of the provided key.
Partition by Round-robin: distributes the input records evenly, in blocks of the specified block size.
Partition by Percentage: partitions the input records on the basis of the percentages specified.
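A Python sketch of the three schemes, using integer ids so the key partitioning is deterministic (real Partition by Key hashes arbitrary key values; the data is made up):

```python
def partition_by_key(records, n):
    # The same key always lands in the same partition (here the key is an int id).
    parts = [[] for _ in range(n)]
    for r in records:
        parts[r % n].append(r)
    return parts

def partition_round_robin(records, n):
    # Deal records out evenly, one at a time, regardless of content.
    parts = [[] for _ in range(n)]
    for i, r in enumerate(records):
        parts[i % n].append(r)
    return parts

def partition_by_percentage(records, percentages):
    # Split records according to the given percentage shares.
    parts, start, total = [], 0, len(records)
    for p in percentages:
        end = start + round(total * p / 100)
        parts.append(records[start:end])
        start = end
    return parts

ids = [10, 11, 12, 13, 14, 15]
print(partition_by_key(ids, 2))               # [[10, 12, 14], [11, 13, 15]]
print(partition_round_robin(ids, 2))          # [[10, 12, 14], [11, 13, 15]]
print(partition_by_percentage(ids, [50, 50])) # [[10, 11, 12], [13, 14, 15]]
```

Note that key partitioning preserves key locality (needed before a keyed rollup or join), while round-robin only balances load.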
8. What is Component parallelism?
Ans.
9. Attributes of Reformat?
Ans. Reformat: Reformat is one of the simplest components in Ab Initio. It is basically used to perform business transformations on the data; it can also add and drop fields.
Reformat has multiple parameters that provide good control over the computation:
count : specifies the number of input and output ports
select : can be used to apply a filter criterion to the records before they reach the actual transformation logic
Transform :
log_group :
error_group :
reject-threshold :
----------------
Unix :
1. Command to remove blank lines of file and print other data?
Ans. sed '/^$/d' file.txt
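The equivalent logic sketched in Python, for comparison (the sed pattern /^$/ matches only truly empty lines, and so does the check here):

```python
def strip_blank_lines(text):
    # Drop lines that are completely empty, keeping everything else.
    return "\n".join(line for line in text.splitlines() if line != "")

result = strip_blank_lines("one\n\ntwo\n\n\nthree")
print(result)  # prints: one / two / three, one per line
```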
2. In ls -lrta what each options functionality?
Ans. ls - list directory contents.
-l : use a long listing format
-a, --all : do not ignore entries starting with .
-r, --reverse : reverse order while sorting
-t : sort by modification time
--------------
SQL:
1. How to check duplicates in table ?
Ans. SELECT ID, COUNT(*) FROM TABLE GROUP BY ID HAVING COUNT(1)>1;
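The same duplicate check sketched in Python with made-up rows, mirroring GROUP BY ID HAVING COUNT(*) > 1:

```python
from collections import Counter

def duplicate_ids(rows):
    # Count occurrences per id and keep only those appearing more than once.
    counts = Counter(r["id"] for r in rows)
    return {i: c for i, c in counts.items() if c > 1}

rows = [{"id": 1}, {"id": 2}, {"id": 1}, {"id": 3}, {"id": 1}]
print(duplicate_ids(rows))  # {1: 3}
```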
2. Could you design complex query to join 5 6 tables to fetch desired data ?
--------------
Prod Support :
1. If a job usually takes 15 minutes but today it's taking more than an hour, what would you check and what actions would you take?
2. If the input has no duplicates but duplicates got loaded into the table, what could be the reason?
3. What is the project flow ?
4. What size of data you process daily ?
5. What if job failed due to data error, how to find and how to resolve ?
6. What if job failed for 5 clients, how you prioritize for which client to run first ?