Well, it is true. Amazon, Google, Facebook, Hotmail, Ebay and Yahoo have been using Apache's Hadoop to manage large sets of data and provide long term retention for large data sets and historical information. Just look at all the data in your Google Dashboard - (Log in to your Google account and under settings click on settings, then dashboard and prepare to be surprised!) Now, imagine all this data multiplied by millions of users worldwide and you get the picture... BIG DATA is here to stay.
"The Apache™ Hadoop™ project develops open-source software for reliable, scalable, distributed computing. The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-avaiability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-availabile service on top of a cluster of computers, each of which may be prone to failures. "
The project includes these subprojects:
- Hadoop Common: The common utilities that support the other Hadoop subprojects.
- Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.
- Hadoop MapReduce: A software framework for distributed processing of large data sets on compute clusters.
Other Hadoop-related projects at Apache include:
- Avro™: A data serialization system.
- Cassandra™: A scalable multi-master database with no single points of failure.
- Chukwa™: A data collection system for managing large distributed systems.
- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
- Mahout™: A Scalable machine learning and data mining library.
- Pig™: A high-level data-flow language and execution framework for parallel computation.
- ZooKeeper™: A high-performance coordination service for distributed applications.
Now, how come Microsoft is working with Apache? Simple, you cannot buy or control such an Open Source system and all the subprojects ( specially when some of yoru largest competitors have so much invested into it) . Hotmail ran on Apache for years before it was converted to SQL.
So, in the words of an old mentor, " the best way to jump into a parade is to jump in front of one that is already going!" By embracing Hadoop Microsoft buys a front ticket to the BIG DATA parade and guarantees its sustainability and viability in the market, combine that with its Azure strategy and you can now see the elements of the big picture! Now, let's see how it all works together the connector will be available shortly and I will definetely be testing it and showcasing it.