In [5], the design and implementation of an efficient strategy for the collection and storage of large volumes of data are presented.
It shows how important it is to select the correct model for efficient performance and technology migration. It is clear from the study that keeping the main logic in a centralised location simplifies technological and architectural migration. The performance test results show that eliminating any transformation at the data ingestion level and moving it to the analytics job is beneficial, as the overall processing time is reduced. Untampered raw data are kept at the storage level for fault tolerance, and the required transformation can be done when needed using a framework such as MapReduce.
In [6], the authors proposed an effective tool for the deep web, named Smart Crawler. Smart Crawler is a focused crawler consisting of two stages: efficient site locating and balanced in-site exploring. Smart Crawler performs site-based locating by reversely searching the known deep web sites for center pages, which can effectively find many data sources for sparse domains.
By ranking collected sites and by focusing the crawling on a topic, Smart Crawler achieves more accurate results.
In [7], a survey of different crawling methods is given. Previous systems face many challenges, such as efficiency, end-to-end delay, link quality, and failure to find deep websites, which are not registered with any crawler and are scattered and dynamic. Thus, an effective and adaptive framework, namely a smart web crawler, is proposed for retrieving deep web pages.
This approach achieves widespread coverage of the deep web and implements an efficient crawling technique. A focused crawler comprising two layers is used: site discovery and in-depth crawling. It performs site discovery using a reverse-searching technique to fetch center pages from known deep web pages, and hence finds many relevant data sources across several domains.
The smart web crawler achieves higher accuracy by correctly prioritizing the gathered sites and concentrating on the given domain. The in-depth crawling layer uses smart learning to search within a site and builds a link hierarchy that avoids bias towards certain directories of a website, giving wider coverage of the web.
Grid computing has emerged as a framework for aggregating geographically distributed heterogeneous resources, enabling secure and unified access to computing, storage and networking resources for Big Data [8]. Grid applications have vast datasets and/or carry out complex computations that require secure resource sharing among geographically distributed systems.
Grids offer coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [9]. A virtual organisation (VO) comprises a set of individuals and/or institutions having access to computers, software, data, and other resources for collaborative problem-solving or other purposes [10]. A grid can also be defined as a system that coordinates resources that are not subject to centralised control, using standard, open, general-purpose protocols and interfaces in order to deliver nontrivial qualities of service [11].
A number of new technologies have emerged for handling large-scale distributed data processing (e.g. Hadoop), where the belief is that moving computation to where the data reside is less time consuming than moving data to a different location for computation when dealing with Big Data. This is certainly true when the volume of data is very large, because this approach reduces network congestion and improves the overall performance of the system. However, a key grid principle contradicts this: in the grid approach, computing elements (CE) and storage elements (SE) should be isolated, although this is changing in modern grid systems.
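The data-locality argument can be illustrated with a back-of-envelope calculation. The sketch below is only illustrative: the dataset size, job size and link speed are hypothetical values chosen for the example, not figures from any of the cited studies, and the model ignores protocol overhead and contention.

```python
def transfer_seconds(size_bytes: int, link_bits_per_second: int) -> float:
    """Idealised time to push size_bytes over a link (no overhead, no contention)."""
    return size_bytes * 8 / link_bits_per_second

# Hypothetical values, chosen only for illustration.
DATASET_BYTES = 10 * 10**12   # 10 TB of input data
JOB_BYTES = 50 * 10**6        # 50 MB of job code and configuration
LINK_BPS = 10**9              # 1 Gbit/s network link

move_data = transfer_seconds(DATASET_BYTES, LINK_BPS)  # ship the data to the computation
move_code = transfer_seconds(JOB_BYTES, LINK_BPS)      # ship the computation to the data

print(f"moving the data: {move_data / 3600:.1f} hours")
print(f"moving the job:  {move_code:.1f} seconds")
```

Under these assumed numbers, shipping the dataset takes on the order of a day, while shipping the job takes well under a second, which is the intuition behind co-locating computation and storage.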
Currently, many scientific experiments are beginning to adopt the “new” Big Data technologies, in particular for metadata analytics at the Large Hadron Collider (LHC), which is the motivation for the presented study.
Parallelisation is used to speed up computations on Big Data. The well-known MapReduce framework [12], which is used in many companies, is well established in the area of Big Data science and provides parallelisation. Its other key features are its inherent data management and fault-tolerance capabilities.
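As a rough illustration of the MapReduce programming model, the following sketch counts words in plain Python. It is a single-process toy, not the framework of [12]: the function names and the input splits are made up for the example, and the map calls that a real framework would distribute across nodes are simply run one after another here.

```python
from collections import defaultdict
from itertools import chain

def map_phase(split: str):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)

def word_count(splits):
    # Each map_phase call is independent of the others, which is what
    # allows a real framework to run them in parallel on different nodes.
    mapped = chain.from_iterable(map_phase(s) for s in splits)
    grouped = shuffle(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

# Hypothetical input splits.
splits = ["big data needs big storage", "data moves slowly"]
counts = word_count(splits)
print(counts)
```

Because the map and reduce functions have no shared state, a failed task can be re-run on another node without affecting the rest of the job, which is one reason the model lends itself to fault tolerance.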
The Hadoop framework has also been employed in many places. It is an open-source MapReduce software framework. It relies on the Hadoop Distributed File System (HDFS) [13], which is derived from the Google File System (GFS) [14]. In its role as a fault-tolerance and data management system, HDFS splits the input data that the user provides to the framework and replicates them across a number of cluster nodes.
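The split-and-replicate idea can be sketched as follows. This is a simplified model under stated assumptions, not the HDFS implementation: the round-robin placement, the node names and the tiny block size are all hypothetical, whereas real HDFS uses much larger blocks and a rack-aware placement policy.

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut the input into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int):
    """Assign each block to `replication` distinct nodes.

    Round-robin placement is an assumption made for this sketch only.
    """
    if replication > len(nodes):
        raise ValueError("replication factor exceeds the number of nodes")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

# Toy values: 10 bytes of input, 4-byte blocks, 4 nodes, 3 replicas per block.
blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"], replication=3)

for block_id, holders in placement.items():
    print(block_id, len(blocks[block_id]), holders)
```

With three replicas per block, the loss of any single node in this toy cluster still leaves at least two copies of every block available.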
The approaches for collecting and storing Big Data for analytic description were implemented on a community-driven software solution (e.g., Apache Flume) in order to understand how the approaches integrate seamlessly into the data pipelines.
Apache Flume is used for efficiently gathering, aggregating, and transporting large amounts of data. It has a flexible and simple architecture, and it is fault tolerant and robust, with tunable reliability and data recovery mechanisms. It uses a simple, extensible data model that allows online analytic applications [15].
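To make the reliability mechanism more concrete, the sketch below models a Flume-style flow in which events wait in a channel until a sink has stored them. It is a toy model in plain Python and does not use the Flume API: the `Event`, `MemoryChannel` and `drain` names, the capacity and the failing writer are all hypothetical. It only illustrates the idea that a batch is returned to the channel when delivery fails.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Event:
    """A unit of data: an opaque body plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)

class MemoryChannel:
    """Bounded in-memory buffer between a source and a sink (toy model)."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._queue = deque()

    def put(self, event: Event) -> None:
        if len(self._queue) >= self.capacity:
            raise OverflowError("channel is full")
        self._queue.append(event)

    def take_batch(self, n: int):
        batch = []
        while self._queue and len(batch) < n:
            batch.append(self._queue.popleft())
        return batch

    def rollback(self, batch) -> None:
        # Put the events back at the front, preserving their original order.
        self._queue.extendleft(reversed(batch))

    def __len__(self) -> int:
        return len(self._queue)

def drain(channel: MemoryChannel, write, batch_size: int) -> int:
    """Sink step: remove a batch only if `write` succeeds, else roll it back."""
    batch = channel.take_batch(batch_size)
    try:
        write(batch)
    except IOError:
        channel.rollback(batch)
        return 0
    return len(batch)

channel = MemoryChannel(capacity=100)
for line in [b"a", b"b", b"c"]:
    channel.put(Event(body=line, headers={"host": "node-1"}))

def failing_write(batch):
    raise IOError("storage unavailable")

stored = []
delivered_on_failure = drain(channel, failing_write, batch_size=2)  # nothing is lost
delivered = drain(channel, stored.extend, batch_size=2)             # first two events stored
```

In this model, reliability is tuned through the channel: an in-memory buffer is fast but loses events if the process dies, whereas a durable channel would trade throughput for recovery after a crash.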