Pseudodistributed hadoop
Hadoop can also run in pseudo-distributed mode. It still runs on one machine, but each Hadoop daemon runs in its own Java process and several map and reduce tasks can run at the same time. This lets you take advantage of, say, one large multi-core machine to do extraction faster than in standalone mode. These instructions are adapted from the official Apache Hadoop quickstart guide.
You should set some extra configuration parameters. In $HADOOP/conf/core-site.xml:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
In $HADOOP/conf/hdfs-site.xml:
<configuration>
<property>
<name>dfs.replication</name>
<value>1</value>
</property>
</configuration>
In $HADOOP/conf/mapred-site.xml (mapred.tasktracker.{map,reduce}.tasks.maximum set the maximum number of concurrent map and reduce tasks):
<configuration>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>4</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
Set up passphraseless ssh to localhost, which the start scripts use to launch the daemons:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >>~/.ssh/authorized_keys
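The mapred-site.xml settings above hard-code 4 concurrent map and 4 concurrent reduce tasks. On a machine with a different number of cores you may want other values. The sketch below prints the same file with both maxima set to a slot count you pass in. The helper name render_mapred_site is made up for this example, and "one map slot and one reduce slot per core" is only a starting assumption, not a recommendation from the Hadoop documentation.

```shell
#!/bin/sh
# Sketch: print a mapred-site.xml whose task maxima match a given slot count.
# render_mapred_site is a hypothetical helper, not part of Hadoop.
render_mapred_site() {
  slots=$1
  cat <<EOF
<configuration>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>$slots</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>$slots</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>localhost:9001</value>
</property>
</configuration>
EOF
}

# Assumption: one map slot and one reduce slot per online core.
# Fall back to 4 (the value used above) if the core count is unavailable.
cores=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 4)
render_mapred_site "$cores"
```

Once the output looks right, redirect it to $HADOOP/conf/mapred-site.xml before starting the daemons.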
Format the new filesystem:
$HADOOP/bin/hadoop namenode -format
Start the daemons:
$HADOOP/bin/start-all.sh
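Before submitting jobs it can help to confirm that every daemon actually came up. A rough way to do that is to look at the output of jps, the JDK tool that lists running Java processes. The sketch below assumes the Hadoop 1.x daemon set (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker); missing_daemons is a hypothetical helper written for this example.

```shell
#!/bin/sh
# Sketch: report which Hadoop 1.x daemons are absent from jps output.
# missing_daemons is a hypothetical helper; it reads jps output on stdin
# and prints one line per daemon that it did not find.
missing_daemons() {
  jps_output=$(cat)
  for daemon in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
    # jps prints "<pid> <main class>", so compare the second field exactly.
    printf '%s\n' "$jps_output" |
      awk -v d="$daemon" '$2 == d { found = 1 } END { exit !found }' ||
      echo "$daemon"
  done
}

# Usage, once start-all.sh has finished:
#   jps | missing_daemons
```

If the helper prints nothing, all five daemons are running. Otherwise, the logs of the daemons it names are a good place to look next.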
And now you're ready to run some hadoop jobs!
1. Copy your data to the "distributed" filesystem:
hadoop fs -put corpus.unified input
2. In your thrax.conf:
   - The hadoop-work-dir key should be set relative to a "home" directory you have on the distributed filesystem. Do not use the default file:// prefix! Leaving this key blank will let thrax use a sensible default: ./thrax_run_YYYY_MM_DD_hhmmss.
   - The work-dir key still refers to the local filesystem; somewhere in /tmp is fine to use.
   - The input-file key works similarly to hadoop-work-dir; it is relative to the distributed filesystem. Since you just copied it with hadoop fs -put, you know exactly where it is sitting. In the example from number 1, you would set this key to input.
3. Run! It's exactly the same command:
$THRAX/thrax <config>
You can stop the daemons with $HADOOP/bin/stop-all.sh.
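To make the thrax.conf keys described above concrete, the sketch below prints the two path settings for the example run. It assumes that thrax.conf holds one whitespace-separated key-value pair per line and that omitting hadoop-work-dir has the same effect as leaving it blank. The helper name thrax_path_keys and the directory /tmp/thrax-work are placeholders chosen for this example.

```shell
#!/bin/sh
# Sketch: print the path-related thrax.conf keys for a pseudo-distributed run.
# thrax_path_keys is a hypothetical helper; the key names come from the text above.
# Assumption: thrax.conf uses one whitespace-separated "key value" pair per line.
thrax_path_keys() {
  hdfs_input=$1      # relative to your home directory on the distributed filesystem
  local_work_dir=$2  # a directory on the local filesystem
  printf 'input-file %s\n' "$hdfs_input"
  printf 'work-dir %s\n' "$local_work_dir"
  # hadoop-work-dir is deliberately omitted so that thrax falls back to its default.
}

# "input" is the name given to hadoop fs -put earlier; /tmp/thrax-work is a placeholder.
thrax_path_keys input /tmp/thrax-work
```

The two printed lines can be pasted into an existing thrax.conf in place of its current input-file and work-dir entries.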