Pig Latin Script
Before writing Pig Latin scripts, some important notes should be taken into consideration. First, Pig is case-sensitive in certain cases: keywords in Pig Latin are not case-sensitive, but function names and relation names are case-sensitive. If we define a relation using a particular mix of upper and lower case, that name is case-sensitive, whereas keywords such as LOAD and STORE can be written in any case. Relation names and table names, however, are case-sensitive. There are two commenting styles: we can use either SQL-style single-line comments or Java-style multiline comments.
- Pig Latin is the language used to analyse data in Hadoop using Apache Pig.
- Case Sensitivity
- Keywords in Pig Latin are not case-sensitive, but function names and relation names are case-sensitive
- Two types of comments
- SQL-style single-line comments (--)
- Java-style multiline comments (/* */).
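A minimal sketch of both comment styles and the case-sensitivity rule (the path and relation name here are illustrative, not from the dataset used later):

```pig
-- SQL-style single-line comment
/* Java-style
   multiline comment */
orders = load '/illustrative/path' using PigStorage('\t');  -- keywords LOAD/load are interchangeable
DUMP orders;  -- but 'orders' and 'ORDERS' would be two different relations
```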
Loading the Data into Pig
Let's start working with Pig. As we discussed earlier, Pig is built upon Hadoop, or rather Pig sits above Hadoop, so we need to start Hadoop before we start Pig. To start Hadoop, run the command `start-all.sh`, which starts the Hadoop services. It is necessary to start Hadoop first because a Pig Latin script is internally converted into MapReduce code, and MapReduce runs on top of Hadoop. Once Hadoop is up and running, we can start Pig with the command `pig`, which opens the **grunt** shell.
- Starting pig
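The startup sequence described above boils down to two commands, run one after the other from a terminal on the Hadoop node:

```
start-all.sh   # start the Hadoop services first
pig            # then launch Pig; this opens the grunt shell
```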
Let's work on the dataset Online_Retail_Invoice.txt. First we have to push the data into HDFS, then from HDFS we have to move the data inside Pig, because Pig is part of the Hadoop ecosystem and it works with HDFS.
- Now open a new terminal and switch to the `hduser` account.
- Push the data onto HDFS using copyFromLocal statement
hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Invoice.txt /Retail_invoice_hdfs
- Now come back to the pig terminal.
- Push the HDFS data onto pig by using LOAD statement.
- We need to push the data into a relation. Remember relation is a bag.
- We need to use PigStorage('\t') here. '\t' is for tab-delimited data
Retail_invoice_pig = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');
Let's understand the code in detail. `Retail_invoice_pig` is the relation name; it is like the dataset name, data file name, or table name. LOAD is a keyword used to bring the HDFS data inside Pig, and '/Retail_invoice_hdfs' is the location of the HDFS file that we want to load. USING PigStorage('\t') tells Pig that the data is tab-delimited; the tuples are created based on the delimiter mentioned in this storage function. There are several load functions, such as BinStorage, JsonLoader, PigStorage, and TextLoader; in this particular example we are using PigStorage with a tab delimiter. Once we run the above command it shows a warning message saying the command is deprecated, which is okay.
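As a brief sketch of the other load functions mentioned above (the paths here are illustrative):

```pig
-- TextLoader reads each line as a one-field tuple of type chararray
raw_lines = LOAD '/Retail_invoice_hdfs' USING TextLoader();

-- PigStorage with a comma delimiter would suit comma-separated data instead
csv_rows = LOAD '/illustrative/file.csv' USING PigStorage(',');

-- JsonLoader and BinStorage similarly handle JSON data and Pig's binary format
```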
Accessing the Data on Pig - DUMP
Now the dataset is inside Pig and we can use a dump statement, `DUMP Retail_invoice_pig;`, which is a kind of print statement. Inside Pig, `Retail_invoice_pig` is the relation name; we don't call it a dataset or a data table, it is called a relation. So let's run the dump command. Once we run it, Pig starts calling some Java libraries and finally prints the data that was inside the retail invoice file. Being a huge dataset, it takes time to print all the rows. By now we can understand that DUMP is not a good option for printing a dataset that is too large, because the dump command consumes a considerable amount of time printing the whole dataset. So the DUMP command is not recommended when the dataset is large.
- Dump operator simply prints the relation/output/result in the console
- Dump is NOT a good idea if your target dataset size is large
Let's have a look at the data on Pig
Validating the Data on Pig - DESCRIBE
So the DUMP command is not recommended when the dataset is large. Instead of DUMP we can use DESCRIBE, another keyword inside Pig. Let's try to run it: `Describe Retail_invoice_pig;`. As soon as we run the describe command we get an error message saying "Schema for Retail_invoice_pig is unknown". We have to load the data along with a schema for the describe command to work.
- Displays the schema of a relation (table)
- While loading the data into Pig you need to define a schema. If you don't define a schema for a relation, DESCRIBE cannot show its structure
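Putting the two points above together, a short sketch of what happens when DESCRIBE is run on a relation that was loaded without a schema:

```pig
Retail_invoice_pig = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t');
DESCRIBE Retail_invoice_pig;
-- Pig reports: Schema for Retail_invoice_pig is unknown.
```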
Loading the Data with Schema
- Let's load this table using a schema.
Retail_invoice_pig1 = LOAD '/Retail_invoice_hdfs' USING PigStorage('\t') AS (uniq_idi:chararray, InvoiceNo:chararray, StockCode:chararray, Description:chararray, Quantity:int);
Let's try to understand the command. `Retail_invoice_pig1` is the new relation; it is like a new table inside Pig. The LOAD statement gives the location the data should be loaded from. USING is a keyword that takes care of the delimiter used in the dataset. The next part of the command defines the schema: the first field is **uniq_idi**, of type chararray; the second is **InvoiceNo**, of type chararray; the third is **StockCode**, of type chararray; the fourth is **Description**, of type chararray; and the last is **Quantity**, of type int. So we are loading the data again, into `Retail_invoice_pig1`, but this time along with a schema, after which `Describe Retail_invoice_pig1;` will work.
By running the `Describe Retail_invoice_pig1;` command, Pig gives us a short description of the relation, consisting of the column names and the type of each column.
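For the schema defined above, the DESCRIBE output looks roughly like this (the exact formatting may vary between Pig versions):

```pig
DESCRIBE Retail_invoice_pig1;
-- Retail_invoice_pig1: {uniq_idi: chararray, InvoiceNo: chararray, StockCode: chararray, Description: chararray, Quantity: int}
```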
By running `DUMP Retail_invoice_pig1;` it starts to print the data; again, DUMP is not a recommended command if the dataset consists of a large number of rows.
Validating the Data on Pig - LIMIT and DUMP
- Dump is a bad idea if the data size is large
- We can use the LIMIT statement to dump fewer tuples (records)
head_Retail_invoice_pig1 = LIMIT Retail_invoice_pig1 10;
DUMP head_Retail_invoice_pig1;