
The Best Practices of Enterprise-level Data Center Construction

2017-07-28 16:27:09 | Source: 中培企业IT培训网

At present, most data centers used by engineering enterprises are built with traditional technology, which has several disadvantages: high construction cost, weak scalability, and limited capacity for calculation and analysis. To meet the needs of big data storage, processing, analysis and application, enterprise-level data centers combine technologies such as parallel computing, large-scale data analysis, linear scaling and support for all data types. They can therefore integrate and analyze, in one place, the data resources of every business, level and type.

The data centers built by most enterprises in the engineering industry have accumulated large amounts of structured data, unstructured data, geographic information data and massive real-time data. Most of them use centralized server architectures (such as Oracle RAC), which scale poorly and cannot keep up with growing storage needs. Data processing is mainly based on single-node models that lack real-time parallel computing, so massive data cannot be processed in real time. Storage and processing can only cope with structured data: unstructured data cannot be stored, processed or analyzed effectively, storage and processing services cannot cover all data types in a big data environment, and deep analysis of the data is not supported.

The overall structure of a big data-based enterprise-level data center in the engineering industry is shown in Figure 1. It is divided into seven layers: the data source layer, the data integration layer, the data storage layer, the analysis/service layer, the business application layer, the front-end access layer, and the overall data management platform.

Figure 1. The Overall Structure of Enterprise-level Data Centers in Engineering Industry Based on Big Data

Using interface tables, interface files, data reception services and data information reception, the data center acquires structured data, unstructured data and real-time data, so that different timeliness requirements can be met. In the data storage layer, the data center contains data repository platforms, distributed data platforms and streaming data platforms, which store data with different characteristics and provide the related data services. The data center delivers integrated result data through batch push and real-time data services, and meets data sharing and data application requirements through asynchronous data push. It also provides comprehensive information display and analysis and decision-making functions, and presents all kinds of analysis results on various front ends (such as PC terminals, large-screen terminals and mobile terminals). In addition, the data center provides data resource management, that is, the management of metadata, data quality, data standards, data models and data resources.

Data Integration Layer:

[including data acquisition and job scheduling]

Data acquisition

Data acquisition refers to collecting structured data, unstructured data and real-time data from the source systems and delivering it to the data center. It covers interface table processing, message reception processing, data reception processing, real-time data acquisition processing and unstructured file processing.

Job scheduling

Job scheduling provides unified, centralized scheduling of three kinds of jobs: the acquisition of structured, unstructured and real-time data; the processing jobs that run inside the data center (including ETL, MapReduce, Sqoop, etc.); and the jobs that push data to each target system. It implements a scheduling engine, supports both automatic and manual triggering of jobs, and controls the execution order of jobs according to the configured job dependencies. It also controls job concurrency and records the results and logs of each run.
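The dependency-ordered, concurrency-limited scheduling described above can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not the scheduling engine of any particular product; the function name run_jobs, the job names and the log format are all assumptions made for the example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_jobs(jobs, dependencies, max_concurrency=2):
    """Run jobs in dependency order, at most max_concurrency at a time.

    jobs: dict mapping a job name to a zero-argument callable.
    dependencies: dict mapping a job name to the set of jobs it waits for.
    Returns a log of (job name, status, detail) tuples in completion order.
    """
    graph = {name: dependencies.get(name, set()) for name in jobs}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises CycleError if the dependencies are circular
    log = []
    running = {}  # future -> job name
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        while sorter.is_active():
            # Submit every job whose prerequisites have all finished.
            for name in sorter.get_ready():
                running[pool.submit(jobs[name])] = name
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                error = future.exception()
                if error is not None:
                    # Record the failure and stop scheduling further jobs.
                    log.append((name, "failed", repr(error)))
                    return log
                log.append((name, "ok", future.result()))
                sorter.done(name)  # unblocks jobs that depend on this one
    return log


# Hypothetical pipeline: two extractions, one transformation, one push.
example_jobs = {
    "extract_contracts": lambda: "contracts loaded",
    "extract_materials": lambda: "materials loaded",
    "transform": lambda: "integrated",
    "push_to_target": lambda: "pushed",
}
example_dependencies = {
    "transform": {"extract_contracts", "extract_materials"},
    "push_to_target": {"transform"},
}
for entry in run_jobs(example_jobs, example_dependencies):
    print(entry)
```

The two extraction jobs may finish in either order, but the transformation only starts once both are done, and the push only starts after the transformation.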

Data Storage Layer:

[It contains traditional data repository platforms based on relational databases, as well as distributed data platforms and streaming data platforms based on the Hadoop ecosystem, which store different kinds of data and provide different data services.]

The data repository platform

The data repository platform uses a hierarchical design and is divided into a buffer layer, an integration layer, a summary layer and a data mart layer.

The buffer layer stores the data that the data center collects from the source systems. It takes over from the source systems the load of distributing data in bulk and in real time, and avoids the problems caused by fetching the same data repeatedly: performance pressure, timing differences between versions, repeated development and redundant storage. Because it also acts as a data source for the layers above it, it shields the integration layer and the summary layer from changes in the original systems (such as changes to data structures or time windows).

The integration layer holds the business data after cleaning, conversion and integration. It is the core data layer of the data center.

The summary layer holds statistical and aggregate enterprise data organized by subject dimension. Aggregates can be built to match the processing requirements of subject reports. Aggregate data is stored by subject and is calculated from business data along the dimensions of data, subject and processing type.

The data mart layer is the analytical data set for specific business units (such as business departments). Its data is mainly derived from the integration layer and the summary layer, and it also contains the analytical data that supports specific targets.
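The flow of data through the four layers can be illustrated with a small in-memory sketch. It is only a toy model under stated assumptions: the field names (project, dept, amount), the cleaning rules and the function names are hypothetical, and a real repository platform would implement these steps as ETL jobs against a relational database.

```python
from collections import defaultdict

# Buffer layer: rows kept exactly as received from a (hypothetical) source system.
buffer_layer = [
    {"project": " P001 ", "dept": "civil", "amount": "1200.50"},
    {"project": "P001", "dept": "civil", "amount": "799.50"},
    {"project": "P002", "dept": "electrical", "amount": "300"},
    {"project": "P003", "dept": "civil", "amount": None},  # incomplete row
]


def integrate(rows):
    """Integration layer: clean and convert the buffered rows."""
    cleaned = []
    for row in rows:
        if row["amount"] is None:
            continue  # rows that fail cleaning are left out
        cleaned.append({
            "project": row["project"].strip(),
            "dept": row["dept"],
            "amount": float(row["amount"]),
        })
    return cleaned


def summarize(rows, dimension):
    """Summary layer: total the amounts along one subject dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)


def build_mart(rows, dept):
    """Data mart layer: per-project totals for a single business unit."""
    return summarize([r for r in rows if r["dept"] == dept], "project")


integration_layer = integrate(buffer_layer)
summary_by_dept = summarize(integration_layer, "dept")
civil_mart = build_mart(integration_layer, "civil")
print(summary_by_dept)  # {'civil': 2000.0, 'electrical': 300.0}
print(civil_mart)       # {'P001': 2000.0}
```

Note that the buffer rows are never modified, which mirrors the idea that the upper layers are rebuilt from the buffer rather than from the source systems.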

The distributed data platform

The distributed data platform mainly stores data that is difficult to keep in a traditional relational database: massive structured data, unstructured file data, and dumps of streaming data and of relational database data. According to the storage requirements and the characteristics of the distributed platform components, the platform is divided into an HBase-based data storage area and a Hive-based data storage area.

The unstructured data layer stores the unstructured data from all source systems, including office documents, design drawings, text files, image files, etc.

The massive structured data layer stores the massive structured data from all structured systems.

The streaming data dump layer stores the data that is periodically dumped from the streaming data platform, which gives the streaming data platform persistent storage for its real-time data.

The streaming data platform

The streaming data platform includes a real-time data integration layer, a real-time data summary layer and a business data buffer layer.

In the real-time data integration layer, all source systems connect to the platform in the same way, through Socket communication, which avoids inconsistencies between data sources. The data center listens on the Sockets of the source systems. When a source system sends data, the listener reads it and writes it, together with its source information, to the corresponding message queue.
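The listen-and-enqueue pattern described above can be sketched as follows. This is a simplified, single-process illustration: socket.socketpair stands in for the network connection to a source system, an in-memory queue.Queue stands in for the message queue, and the source name, the newline-delimited record format and the function name listen are assumptions made for the example.

```python
import queue
import socket
import threading


def listen(conn, source_name, message_queue):
    """Read newline-delimited records and enqueue each one with its source."""
    with conn, conn.makefile("r", encoding="utf-8") as stream:
        for line in stream:
            record = line.rstrip("\n")
            if record:
                message_queue.put({"source": source_name, "payload": record})


# socketpair() gives two connected sockets: one plays the source system,
# the other is the end monitored by the data center.
source_end, center_end = socket.socketpair()
message_queue = queue.Queue()
listener = threading.Thread(
    target=listen, args=(center_end, "sensor_gateway", message_queue)
)
listener.start()

source_end.sendall(b"temp=21.5\ntemp=21.7\n")
source_end.close()  # closing the sending end lets the listener loop finish
listener.join()

messages = [message_queue.get_nowait() for _ in range(message_queue.qsize())]
for message in messages:
    print(message)
```

Tagging every record with its source at the point of entry is what allows the downstream layers to treat all source systems uniformly.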

The real-time data summary layer uses Storm to process, as a stream, the source data in the integration layer's message queues. It aggregates, calculates and stores the data according to business requirements.

The business data buffer layer holds the result data that Storm produces, according to the specific business logic, once its calculation over the streaming data is finished. (The corresponding architecture is shown in Figure 2.)
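To make the relationship between the summary layer and the business data buffer layer concrete, here is a minimal sketch of per-message aggregation. It deliberately does not use Storm: a plain Python class stands in for a stream-processing component, and the class name, the message fields (device, value) and the running-average logic are all hypothetical.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps a running average per device and publishes it to a result buffer."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)
        self.result_buffer = {}  # stands in for the business data buffer layer

    def process(self, message):
        """Handle one message as it arrives; no batch is ever collected."""
        key = message["device"]
        self._sum[key] += message["value"]
        self._count[key] += 1
        self.result_buffer[key] = self._sum[key] / self._count[key]


aggregator = RunningAverage()
stream = [
    {"device": "pump-1", "value": 10.0},
    {"device": "pump-1", "value": 20.0},
    {"device": "pump-2", "value": 5.0},
]
for message in stream:
    aggregator.process(message)

print(aggregator.result_buffer)  # {'pump-1': 15.0, 'pump-2': 5.0}
```

The result buffer always reflects the messages seen so far, so applications that read it do not have to wait for a batch job to finish.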

Analysis/Service Layer:

It includes a comprehensive information display platform, an intelligent analysis and decision-making platform, and data services (shown in Figure 3).

Comprehensive information display platform

The comprehensive information display platform is an application built on the data storage layer that provides report query and comprehensive analysis. It allows the page content, layout, components, CCTV, linkage relations, etc. of the analysis pages to be configured dynamically.

Intelligent analysis and decision-making platform

The intelligent analysis and decision-making platform includes several modules, such as data loading, data preprocessing, data mining algorithms, analysis model management and model run scheduling. It provides technical support for data understanding, data preprocessing, algorithm modeling, model evaluation, model application, etc. To meet the requirements of big data analysis, it also provides a mining algorithm library designed for big data. The library includes three types of mining algorithms: descriptive mining algorithms, such as clustering analysis and correlation analysis; predictive mining algorithms, such as classification analysis, evolution analysis and heterogeneous analysis; and mining algorithms for specialized data, such as text analysis, speech analysis, image analysis and video analysis.

Data services

Data services mainly provide real-time data services, subscription and publication, batch data services, etc. They also provide a cache to improve the overall performance of the system.
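The role of the cache in a data service can be sketched as follows. This is an assumed design, since the article does not say how the cache works: a simple time-to-live cache sits in front of a slower loader. The class name CachedDataService, the loader callable and the TTL value are all hypothetical.

```python
import time


class CachedDataService:
    """Serves values from a cache and falls back to a loader when they expire."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader      # e.g. a query against the storage layer
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._cache = {}           # key -> (expires_at, value)
        self.loader_calls = 0

    def get(self, key):
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]        # still fresh: no load needed
        self.loader_calls += 1
        value = self._loader(key)
        self._cache[key] = (now + self._ttl, value)
        return value


# Hypothetical loader standing in for a query to the data storage layer.
service = CachedDataService(lambda key: f"report for {key}", ttl_seconds=60.0)
service.get("project-P001")
service.get("project-P001")  # second call is answered from the cache
print(service.loader_calls)  # 1
```

A time-based expiry is only one option; a service that receives change notifications from the storage layer could invalidate entries explicitly instead.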

Data Management Layer:

It includes metadata management, data quality management, master data management, data standard management, and centralized job scheduling and monitoring (shown in Figure 4).

Metadata management

It enables metadata in the data center to be searched, retrieved, used and shared quickly. It also provides metadata support for data sharing and exchange, multidimensional analysis, decision support, data mining, etc.

Data quality management

It provides standardized quality audits of the data in the data center and ensures that the data received from the business systems is timely, complete and compliant.
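A quality audit along these three lines (timeliness, completeness and compliance) might look like the sketch below. The specific rules are assumptions chosen for illustration: the required field names, the five-minute delay limit and the rule that amounts must be non-negative numbers are all hypothetical.

```python
from datetime import datetime, timedelta

REQUIRED_FIELDS = ("contract_id", "amount", "sent_at")  # hypothetical schema


def audit(record, received_at, max_delay=timedelta(minutes=5)):
    """Return a list of quality problems found in one received record."""
    problems = []

    # Completeness: every required field must be present and non-empty.
    missing = [f for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if missing:
        problems.append("incomplete: missing " + ", ".join(missing))

    # Timeliness: the record must arrive within the allowed delay.
    sent_at = record.get("sent_at")
    if sent_at is not None and received_at - sent_at > max_delay:
        problems.append("late: exceeded allowed delay")

    # Compliance: the amount must be a non-negative number.
    amount = record.get("amount")
    if amount not in (None, "") and (
        not isinstance(amount, (int, float)) or amount < 0
    ):
        problems.append("non-compliant: amount must be a non-negative number")

    return problems


sent = datetime(2017, 7, 28, 16, 0, 0)
good = {"contract_id": "C-001", "amount": 1500.0, "sent_at": sent}
bad = {"contract_id": "", "amount": -5, "sent_at": sent}

print(audit(good, received_at=sent + timedelta(minutes=1)))  # []
print(audit(bad, received_at=sent + timedelta(minutes=10)))
```

Returning a list of problems rather than a single pass/fail flag makes it possible to report every rule a record violates in one audit run.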

Master data management

It provides unified management, application and maintenance of master data such as materials, projects and contracts, so that changes to master data remain consistent and stable.

Data standard management

It provides unified management of the standards documents in the data center.

Centralized job scheduling and monitoring

It provides unified scheduling, management and monitoring of ETL interface jobs and big data jobs.

As informatization in the engineering industry has advanced, information systems have been fully integrated into every aspect of enterprise production and management, and they have accumulated large amounts of structured data, unstructured data, geographic information data and massive real-time data. Big data-based enterprise-level data centers can make up for the disadvantages of traditional technology: they address weak scalability, high construction costs and limited capacity for calculation, analysis and mining, and they meet the requirements for storing, processing, analyzing and applying all types of data in a big data environment.
