Group by in Pig
- In this section we will see about the grouping in the pig, grouping in the pig is very important because most of the pig function takes the bag as the input parameter.
Grouping before using functions
- Most of the inbuilt functions in pig take the bag as input.
- We need to group the data and relevant bags before applying functions.
LAB – Group by in Pig
- Grouping helps to gets the summary statistic.
- Grouping is similar to SQL group by.
- Lets us see how to group inside the pig.
- Now we will work with an example of online customer retail data.
- First we need to move the online customer retail data into the HDFS.
- So now go back to the hduser terminal.
- Get the Online Customer data on to HDFS
- Get the above dataset into PIG
- Dump first 10 tuples
- Group by country
- Show the group by schema
- Dump the first two bags
Get the Online Customer data on to HDFS
- This below command will move the data Online retail sales from local file system to the hadoop file system and it will be renamed as Online_Retail_Customer_hdfs.
- Just take a note that while using the hadoop command just switch to the hduser shell and during pig command switch to the pig shell.
hadoop fs -copyFromLocal /home/hduser/datasets/Online_Retail_Sales_Data/Online_Retail_Customer.txt /Retail_Customer_hdfs
Get the above dataset into PIG
- Now we can go back to the pig shell and import the data from hdfs.
- Command for importing the data from hdfs to pig is
Retail_Customer_pig = LOAD '/Retail_Customer_hdfs' USING PigStorage('\t') as (uniq_idc:chararray, InvoiceDate:chararray, UnitPrice:INT, CustomerID:chararray,Country:chararray);
- Retail_Customer_Pig is the relation name inside the pig.
- Load command is used for defining source location of the dataset, for this example; load command will load the Online_Retail_Customer_hdfs in to Retail_Customer_Pig, and further command is about the table schema.
Dump first 10 tuples
- We can use dump command to print the data but once again the dataset being very huge the dump command is not the efficient way, there is an better way for looking at the data and is by using the LIMIT command.
- The command for LIMIT is head_Retail_Customer_pig = LIMIT Retail_Customer_pig 10.
head_Retail_Customer_pig = LIMIT Retail_Customer_pig 10; Dump head_Retail_Customer_pig;
- As being the data flowing language the data set created from LIMIT command needs to be stored somewhere, for the same reason we have created a new relation called head_Reatil_Customer_pig and this can be printed by using dump command.
- Lets take a detail look inside the command of LIMIT.
- Retail_Customer_pig is the name of the data table or the relation which is inside the pig and LIMIT command with number 10 means it will choose the first 10 observation from the same relation.
- And those 10 observation will be stored inside the new relation called head_retail_customer_pig.
- The new relation is first created then the data is stored in it thats why it is called as data flow language.
- Now we can use the dump operator on the head_retail_customer_pig relation and it will display the first fews rows.
Group by Country – Lets now use the group operator and group by the country such as united states, united kingdom and etc. – Group by will simply create a bag.
group_country = GROUP Retail_Customer_pig by Country;
- In the above command the group_country is the new relation name then GROUP is the keyword which tells that group by the country and this operation is carried out on the relation Retail_Customer_pig.
Show the group by schema
Dump the first two bags
- Now lets print first two bags of the group by using LIMIT operator.
- Command is head_group_country = LIMIT group_country 2.
- So here we will be creating new relation which will be head_group_country having first two observations.
head_group_country = LIMIT group_country 2; Dump head_group_country;