
Wednesday, June 29, 2016

MapReduce Two Values for One key example

MapReduce Multiple values for a single key

MapReduce Joint example

In this example, we write MapReduce code for an exercise from Hortonworks.

It needs to map one key to two values:

Year as the key, and PlayerID and Runs as the value.
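The one-key, two-values mapping can be sketched in plain Java, without the Hadoop API. This is only a minimal illustration and not the code from the files of this post. It assumes simplified input lines of the form playerID,year,runs (the real Batting.csv has more columns), and it assumes the goal is to report the player with the most runs for each year. The names YearRunsSketch, PlayerRuns, map and reduce are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: one key (year) mapped to a two-field value (playerID, runs).
final class YearRunsSketch {

    // The composite value: two fields carried under a single key.
    static final class PlayerRuns {
        final String playerId;
        final int runs;

        PlayerRuns(String playerId, int runs) {
            this.playerId = playerId;
            this.runs = runs;
        }
    }

    // "Map" step: parse each line and group the values by year.
    // Assumed line format: playerID,year,runs. Malformed lines are skipped.
    static Map<Integer, List<PlayerRuns>> map(List<String> lines) {
        Map<Integer, List<PlayerRuns>> byYear = new TreeMap<>();
        for (String line : lines) {
            String[] cols = line.split(",");
            if (cols.length < 3) {
                continue;
            }
            try {
                int year = Integer.parseInt(cols[1].trim());
                int runs = Integer.parseInt(cols[2].trim());
                byYear.computeIfAbsent(year, y -> new ArrayList<>())
                      .add(new PlayerRuns(cols[0].trim(), runs));
            } catch (NumberFormatException e) {
                // Header row or non-numeric field: ignore the line.
            }
        }
        return byYear;
    }

    // "Reduce" step: for one year, keep the value with the highest runs.
    static PlayerRuns reduce(List<PlayerRuns> values) {
        PlayerRuns best = values.get(0);
        for (PlayerRuns v : values) {
            if (v.runs > best.runs) {
                best = v;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Made-up sample rows, for illustration only.
        List<String> lines = Arrays.asList(
                "playerID,year,runs",
                "p001,2001,40",
                "p002,2001,75",
                "p001,2002,12");
        for (Map.Entry<Integer, List<PlayerRuns>> e : map(lines).entrySet()) {
            PlayerRuns top = reduce(e.getValue());
            System.out.println(e.getKey() + "\t" + top.playerId + "\t" + top.runs);
        }
    }
}
```

In a real MapReduce job the grouping by key is done by the framework between the mapper and the reducer; here a TreeMap plays that role.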

The post also includes a CSV creator that generates input files of any size, and standalone Java code that performs the same task.

Details of the files :

MapReduce Driver Class

MapReduce Mapper Class

MapReduce Reducer Class

Batting.jar : MapReduce Jar

hadoop jar ./Batting.jar BattingExample <InputCSVfile> <OutputFolder>

StandAlone : runs the same map and reduce logic in standalone mode

java StandAlone <inputCSV> <outputFile>

Batting.csv : contains the player data

CsvCreator : creates a CSV file of any size in the same format

java CsvCreator <NumberOfPlayers> <outputCSVfile>

2000 players produce a file of about 1 MB
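A generator of this kind could look roughly like the sketch below. It is not the CsvCreator from this post: the class name CsvCreatorSketch, the simplified row format playerID,year,runs, the season range 2000 to 2009 and the runs range are all assumptions made for illustration, so the file size per player will differ from the figure above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of a CSV generator; not the CsvCreator from this post.
final class CsvCreatorSketch {

    // Builds rows in the assumed format playerID,year,runs.
    // Each player gets one row per season from firstYear to lastYear inclusive.
    static List<String> createRows(int players, int firstYear, int lastYear, long seed) {
        Random random = new Random(seed); // fixed seed keeps the output repeatable
        List<String> rows = new ArrayList<>();
        rows.add("playerID,year,runs"); // header row
        for (int p = 1; p <= players; p++) {
            String playerId = String.format("player%05d", p);
            for (int year = firstYear; year <= lastYear; year++) {
                int runs = random.nextInt(150); // 0..149, an arbitrary range
                rows.add(playerId + "," + year + "," + runs);
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java CsvCreatorSketch <NumberOfPlayers> <outputCSVfile>");
            return;
        }
        int players = Integer.parseInt(args[0]);
        // Writes one row per line, UTF-8 encoded.
        Files.write(Paths.get(args[1]), createRows(players, 2000, 2009, 42L));
    }
}
```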

Sunday, June 19, 2016

How to protect webUI port of namenode ?

How to protect webUI port 50070  of namenode ?

By default, the NameNode web UI on port 50070 is not protected: anyone who can reach http://<namenodeServer>:50070 can view the HDFS details and browse the file system in read-only mode.

All Hadoop daemons use an embedded Jetty web container to host the JSP pages of the web UI.

Version used in the example : Apache Hadoop 2.7.2

1. Go to <hadoop_home>/share/hadoop/hdfs/webapps/hdfs/WEB-INF

2. Edit web.xml

<web-app version="2.4" xmlns="">

3. Create a new file : jetty-web.xml

<Configure class="org.mortbay.jetty.webapp.WebAppContext">
<Get name="securityHandler">
<Set name="userRealm">
<New class="">
<Set name="name">explorerRelam</Set>
<Set name="config">
<SystemProperty name="hadoop.home.dir"/>/jetty/etc/

4. Create a new file <hadoop_home>/jetty/etc/
(the folder jetty/etc must be created first)

format :

Username: password,group

tushar: welcome1,admin

5. Access http://IP:50070

6. If only the explorer needs to be protected, use the following in step 2:


Thursday, June 16, 2016

How to overwrite or update a file in hadoop HDFS ?

By default, put and copyFromLocal cannot update a file that already exists in HDFS. Without the -f option they fail with the error below.

[hduser@localhost SampleData]$ hadoop fs -put books.csv  /yesB
put: Target /yesB/books.csv already exists

[hduser@localhost SampleData]$ hadoop fs -copyFromLocal   books.csv  /yesB
copyFromLocal: Target /yesB/books.csv already exists

To overcome this issue, distcp can be used:

 hadoop distcp -update  file://<source>  hdfs://<IP:PORT>/<targetlocation>

Example :

 hadoop distcp -update   file:///home/hduser/pigSample/labfiles/SampleData/books.csv hdfs://

-overwrite can also be used, but -update is better because it copies the file (and does the MapReduce work) only when the source and target differ.
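The difference between the two modes can be illustrated with a small plain-Java sketch on local files. It does not use the Hadoop API and it is not how DistCp is implemented: DistCp's own comparison rules (file size and, depending on version and options, checksums) are more involved. The class UpdateCopySketch and its methods are hypothetical names.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

// Hypothetical sketch of "update" versus "overwrite" copy semantics on local files.
final class UpdateCopySketch {

    // True when the target is missing or its content differs from the source.
    static boolean needsCopy(Path source, Path target) throws IOException {
        if (!Files.exists(target)) {
            return true;
        }
        if (Files.size(source) != Files.size(target)) {
            return true;
        }
        // Same size: compare the bytes (fine for small files only).
        return !Arrays.equals(Files.readAllBytes(source), Files.readAllBytes(target));
    }

    // "-update" style: copy only when needed. Returns true if a copy happened.
    static boolean copyIfChanged(Path source, Path target) throws IOException {
        if (!needsCopy(source, target)) {
            return false;
        }
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
        return true;
    }

    // "-overwrite" style: always copy, even when nothing changed.
    static void copyAlways(Path source, Path target) throws IOException {
        Files.copy(source, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java UpdateCopySketch <source> <target>");
            return;
        }
        boolean copied = copyIfChanged(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println(copied ? "copied" : "up to date, nothing copied");
    }
}
```

Running it twice on the same unchanged file copies the first time and skips the second, which is the saving that -update gives over -overwrite.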