data science tutorials and snippets prepared by greysweater42
hadoop is the first big data tool to become widely popular;
it lets you process huge amounts of data quickly by splitting the computation across many machines, i.e. a cluster of machines (quickly compared to a standard, one-machine approach);
you can store and easily access huge amounts of data thanks to hadoop’s distributed file system (hdfs);
in general, hadoop is a cornerstone of big data.
In a production environment Hadoop should be installed on several interconnected machines, called a cluster. I’m not going to explain how to do this, as it is usually a task for sysadmins/DevOps engineers, and not for data scientists. However, it is very useful to have your own one-node hadoop installation on your laptop for practising basic big data solutions. This is what we’re going to install in this section.
Here’s a little script that I wrote which installs and runs hadoop. You can execute it with
sudo bash hadoop.sh
but I recommend running it line by line, so that you understand what you are doing and what is needed to get hadoop running.
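Once the script has finished, it is worth checking that the installation actually responds. Below is a minimal sketch of such a check, not part of the install script itself. It assumes that the hadoop binaries ended up on your PATH and that hdfs is already running; the function name check_hadoop is made up for this example.

```shell
#!/usr/bin/env bash
# Sanity check for a one-node installation.
# Assumption: the hadoop and hdfs binaries are on PATH.

check_hadoop() {
    # `hadoop version` prints the installed release;
    # `hdfs dfs -ls /` fails if the file system cannot be reached.
    hadoop version && hdfs dfs -ls /
}

if command -v hadoop >/dev/null 2>&1; then
    if check_hadoop; then
        echo "hadoop responds"
    else
        echo "hadoop is installed, but one of the checks failed"
    fi
else
    echo "hadoop not found on PATH"
fi
```

If the second check fails while the first one passes, the most likely explanation is that the hdfs daemons are not running yet.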
You probably expect that knowing Hadoop will let you do Big Data. It would, but in reality hardly anybody uses Hadoop’s own MapReduce engine to run data processing anymore. Spark and Hive have much nicer APIs, and usually do the same tasks faster.
But there is still one piece of functionality that has not been replaced: the file system, called the hadoop distributed file system or hdfs, where the limited disk capacity of a single machine is no longer a problem.
There are a couple of useful commands to remember. Their resemblance to bash commands is quite obvious.
ls - list files in a given directory
hdfs dfs -ls <dir>
mkdir - create a directory
hdfs dfs -mkdir <path>
put - upload file to hdfs
hdfs dfs -put <localSrc> <dest>
copyFromLocal - copy a file from local to hdfs, obviously
hdfs dfs -copyFromLocal <localSrc> <dest>
get - download file from hdfs
hdfs dfs -get <src> <localDest>
copyToLocal - copy file from hdfs to local, obviously
hdfs dfs -copyToLocal <src> <localDest>
mv - move file from one directory to another or rename a file
hdfs dfs -mv <src> <dest>
cp - copy file from one directory to another
hdfs dfs -cp <src> <dest>
rm - remove a file from hdfs (-rm -r - remove recursively, which is needed for directories)
hdfs dfs -rm <path>
More commands are available here.
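To see how these commands fit together, here is a small round trip: create a local file, upload it to hdfs, rename it there, download it again and clean up. Treat it as a sketch rather than a recipe. The hdfs directory /tmp/hdfs_demo and the function name hdfs_roundtrip are made up for this example, and the script assumes hdfs is on your PATH and running.

```shell
#!/usr/bin/env bash
# Round trip: local file -> hdfs -> local copy.
# Assumption: /tmp/hdfs_demo is a throwaway hdfs directory (it is deleted at the end).

hdfs_roundtrip() {
    local workdir="$1"      # local scratch directory
    local remote_dir="$2"   # directory on hdfs

    echo "hello hdfs" > "$workdir/hello.txt"

    # Each step runs only if the previous one succeeded.
    hdfs dfs -mkdir -p "$remote_dir" &&                                  # create the directory
    hdfs dfs -put -f "$workdir/hello.txt" "$remote_dir/" &&              # upload (overwrite if present)
    hdfs dfs -ls "$remote_dir" &&                                        # list the uploaded file
    hdfs dfs -mv "$remote_dir/hello.txt" "$remote_dir/renamed.txt" &&    # rename on hdfs
    hdfs dfs -get "$remote_dir/renamed.txt" "$workdir/copy.txt" &&       # download a copy
    hdfs dfs -rm -r "$remote_dir"                                        # clean up recursively
}

if command -v hdfs >/dev/null 2>&1; then
    hdfs_roundtrip "$(mktemp -d)" "/tmp/hdfs_demo"
else
    echo "hdfs not found on PATH, skipping"
fi
```

The -p flag of mkdir and the -f flag of put make the script safe to run twice: the first one does not complain if the directory already exists, the second one overwrites the file if it is already there.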
Hadoop GUI (TODO)
example of MapReduce: wordcount (TODO)
RHadoop (TODO)