A distributed computing system can be defined as a collection of processors interconnected by a communication network, in which each processor has its own local memory. Communication between any two or more processors takes place by passing information over the network. Distributed computing underlies frameworks such as Hadoop and MapReduce, which are discussed in detail later.
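The definition above can be illustrated with a minimal sketch. It is not a real network: two threads stand in for processors, and a queue stands in for the communication link. The names `Node`, `worker` and `run_exchange` are hypothetical and chosen only for this illustration. The point it shows is that each node keeps its own local memory and that data moves between nodes only as messages.

```python
import queue
import threading


class Node:
    """A hypothetical processor with private local memory and an inbox."""

    def __init__(self, name):
        self.name = name
        self.local_memory = {}      # read and written only by this node
        self.inbox = queue.Queue()  # stands in for the network link

    def send(self, other, message):
        # The only way to share data is to place a message in the
        # other node's inbox; memory is never accessed directly.
        other.inbox.put((self.name, message))


def worker(node, peer):
    # Wait for one message, store it in local memory, then acknowledge it.
    sender, payload = node.inbox.get()
    node.local_memory["last_payload"] = payload
    node.send(peer, f"ack:{payload}")


def run_exchange():
    a, b = Node("A"), Node("B")
    t = threading.Thread(target=worker, args=(b, a))
    t.start()
    a.send(b, "hello")              # A passes information to B
    reply = a.inbox.get(timeout=5)  # A waits for B's acknowledgement
    t.join()
    return reply, b.local_memory


reply, b_memory = run_exchange()
print(reply, b_memory)
```

In a real distributed system the queue would be replaced by sockets or a messaging layer, and the nodes would be separate machines, but the division between local memory and message passing stays the same.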
Hadoop is becoming the technology of choice for enterprises that need to effectively collect, store and process large amounts of structured and complex data.
The purpose of this thesis is to investigate the use of the MapReduce framework as implemented in Hadoop.
All of this is made possible by the file system that Hadoop uses, the Hadoop Distributed File System (HDFS).
HDFS is a distributed file system designed to run on commodity hardware. It has much in common with existing distributed file systems; its main advantages over them are that it is designed to be deployed on low-cost hardware and that it is highly fault-tolerant. HDFS provides high-throughput access for applications that have large data sets.
HDFS was originally built as infrastructure for the Apache Nutch web search engine. Applications that run on HDFS have very large data sets, ranging from a few gigabytes to terabytes in size, so HDFS is designed to support very large files. It provides high aggregate data bandwidth, scales to hundreds of nodes in a single cluster, and supports tens of millions of files in a single instance.
The following sections take up each of these topics in detail. We will also discuss areas where Hadoop is used, such as the storage infrastructure of Facebook and Twitter, and tools built on it such as Hive and Pig.
In the early decades of computing, programs were serial (sequential): a program consisted of a series of instructions, each executed one after the other. The program ran from start to finish on a single processor.
Parallel programming (grid computing) developed as a means of improving performance and efficiency. In a parallel program, the process is broken up into several parts, each of which will be executed concurrently. The instructions from each part run simultaneously on different CPUs. These CPUs can exist on a single machine, or they can be CPUs in a set of computers connected via a network.
Not only are parallel programs faster, they can also be used to solve problems on large datasets using non-local resources. When you have a set of computers connected on a network, you have a vast pool of CPUs, and you often have the ability to read and write very large files (assuming a distributed file system is also in place).
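As a minimal sketch of this idea, the example below breaks one computation (a sum of squares) into several parts, runs the parts concurrently, and combines the partial results. The function names and the choice of four workers are assumptions made for illustration. Threads are used here only to keep the sketch short and portable; for CPU-bound work in Python, a pool of processes or a set of networked machines would take the place of the thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Cut data into `parts` contiguous chunks of nearly equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done on one part of the problem."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Decompose the task, run the parts concurrently, combine the results."""
    chunks = split(data, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


data = list(range(1, 101))
serial = sum_of_squares(data)             # one step after another
parallel = parallel_sum_of_squares(data)  # four parts, then combined
print(serial, parallel)
```

Both approaches give the same answer; the parallel version differs only in that the four partial sums do not depend on each other and can therefore be computed at the same time.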
Parallelism is a strategy for performing large, complex tasks faster than the traditional serial approach. A large task can either be performed serially, one step following another, or be decomposed into smaller tasks that are performed simultaneously on a parallel system.
Parallelism is done by:
Parallel problem solving can also be seen in real-life applications.
Examples: an automobile manufacturing plant; the operation of a large organization; building construction.