- Pig Latin is a data flow language.
- Each processing step results in a new data set, or relation.
- Pig Latin is used to perform complex data transformations, aggregations, and analysis.
- Queries or scripts are translated into MapReduce or Apache Spark jobs, making it easier for more users to process and analyze very large amounts of data.
- Pig is often used as an ETL tool: we extract the data, transform it, and then load it.
Basics:
- Pig Latin uses relation names, also called aliases.
Input = load 'data';
Here Input is the relation name, which we can also call an alias.
Case Sensitivity:
- Keywords in Pig Latin are not case-sensitive.
ex. LOAD is equivalent to load.
- Relation names and field names are case sensitive.
A = load 'data'; is not equivalent to
a = load 'data';
Comments:
- Single line comment
A = load 'data'; -- this is a single-line comment
- Multi line comment
/*
* This is a multiline comment.
*/
- Comment in middle of the line.
B = load /* a comment in the middle */ 'data';
Input and Output
Pig is a data flow language.
The first step in any data flow is to specify your input.
Load
We can load a file with a schema or without a schema.
(Schema: the table structure, i.e. the columns and their types.)
Suppose there is a file “Student.txt” (delimiter is TAB):
1 Amit 20
2 Amar 30
3 Amol 40
Now I am going to load this file.
Without Schema:
The default delimiter is TAB.
A = load '/path/Student';
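Without a schema the fields have no names, so they can be referred to by position instead: $0 is the first field, $1 the second, and so on. A minimal sketch, assuming the same '/path/Student' placeholder path and the tab-delimited rows shown above (the alias B is only for illustration, and foreach is covered later in this article):

```pig
-- Load without a schema: fields are unnamed and default to type bytearray.
A = load '/path/Student';
-- Refer to fields by position: $0 is the id column, $1 is the name column.
B = foreach A generate $0, $1;
dump B;
```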
Suppose there is another file “Employee.txt” (delimiter is comma):
11,Amol,pune
22,Amit,Mumbai
33,Amar,Latur
With Schema:
1) a = load 'path/Employee' using PigStorage(',')
as (eid:int, ename:chararray, city:chararray);
Suppose there is a third file “Emp.txt” (delimiter is |):
11|Amol|pune
22|Amit|Mumbai
33|Amar|Latur
2) a = load 'path/Emp' using PigStorage('|')
as (eid:int, ename:chararray, city:chararray);
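To check that a schema was applied as intended, Pig's describe operator prints the schema of a relation. A small sketch, assuming the Emp.txt load statement above; the schema line in the comment is what describe should print for this declaration:

```pig
a = load 'path/Emp' using PigStorage('|')
    as (eid:int, ename:chararray, city:chararray);
-- Print the schema of relation a. With the schema declared above this should show:
-- a: {eid: int,ename: chararray,city: chararray}
describe a;
```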
Dump:
If you want to see the output on your screen, you can dump it.
dump is the keyword that prints the contents of a relation to the screen.
It is also useful for quick ad hoc jobs.
When printing, dump shows each record as a tuple in parentheses, with the fields separated by commas.
ex.
1) dump A;
2) dump a;
Input:
A = load 'path/Emp' using PigStorage('|')
as (eid:int, ename:chararray, city:chararray);
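For the Emp.txt data shown earlier, dumping this relation should print one tuple per record. A sketch, assuming the file contains exactly the three rows listed (the expected tuples are given as comments):

```pig
A = load 'path/Emp' using PigStorage('|')
    as (eid:int, ename:chararray, city:chararray);
dump A;
-- (11,Amol,pune)
-- (22,Amit,Mumbai)
-- (33,Amar,Latur)
```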
foreach:
foreach takes a set of expressions and applies them to every record in the data pipeline.
Input:
A = load 'path/Emp' using PigStorage('|')
as (eid:int, ename:chararray, city:chararray);
Output:
B = foreach A generate eid, ename;
Dump B;
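With the three Emp.txt rows from earlier, the projection above keeps only the first two fields of each record. The sketch below repeats it with the expected tuples as comments, and adds one illustrative variation: an expression inside generate, using Pig's built-in UPPER function. The alias C and the field name city_upper are made up for this example.

```pig
A = load 'path/Emp' using PigStorage('|')
    as (eid:int, ename:chararray, city:chararray);

-- Projection: keep only eid and ename from every record.
B = foreach A generate eid, ename;
dump B;
-- (11,Amol)
-- (22,Amit)
-- (33,Amar)

-- generate also accepts expressions, not just field names.
C = foreach A generate eid, UPPER(city) as city_upper;
dump C;
```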
Filter:
The filter statement allows you to select which records will be retained in your data pipeline.
Input:
A = load 'path/Emp' using PigStorage('|')
as (eid:int, ename:chararray, city:chararray);
Output:
B = filter A by eid == 33;
Dump B;
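With the same three Emp.txt rows, only the record whose eid is 33 passes the filter above. The sketch below repeats that filter and adds two illustrative variations, a string comparison and a combined condition; the aliases C and D are made up for this example, and the expected tuples are shown as comments.

```pig
A = load 'path/Emp' using PigStorage('|')
    as (eid:int, ename:chararray, city:chararray);

-- Keep only the record with eid 33.
B = filter A by eid == 33;
dump B;
-- (33,Amar,Latur)

-- String comparison: chararray constants use single quotes.
C = filter A by city == 'pune';
dump C;
-- (11,Amol,pune)

-- Conditions can be combined with and / or / not.
D = filter A by eid > 11 and city != 'Latur';
dump D;
-- (22,Amit,Mumbai)
```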
In this article I have explained the basics of Pig Latin:
how to load data,
how to see output using dump,
and how to fetch data using filter and foreach.
Thank you!