Laboratory of Data and Knowledge Engineering
Object Deputy
Improvements in computer performance and the emergence of networks have made information resources increasingly complex. Information processing is no longer limited to simple data types such as numbers and strings; it must also describe and store massive volumes of semi-structured or unstructured complex data such as multimedia, geographic information, and biological information.
The aim of complex data management is to enable general sharing of complex information. Different applications require different data formats and management approaches, so data management should be flexible enough for the same data to meet the requirements of various applications. In relational databases, data is represented by relations, and relational tables can be divided and recombined by relational algebra. This property allows data to be transformed into different representations to satisfy the requirements of different applications, which makes relational databases flexible. The object-oriented approach, however, encapsulates data and methods into an object, which makes it impossible to divide or recombine objects. As a result, object-oriented databases are inflexible and have difficulty reaching the goal of information sharing.
Observing the real world, we find that a human being cannot be decomposed. When a person needs to fulfill different tasks on different occasions at the same time, he or she can ask a deputy to help. If we define a human being as an object in a database, then the deputy needs to be described as a deputy object. It is therefore necessary to introduce deputy objects into databases so that, although an object itself is inflexible, it can be adapted indirectly through its deputy objects. Based on this idea, we put forward the object deputy model.
The object deputy model represents every real-world entity as an object and expresses its application-specific characteristics, role multiplicity, and dynamic classification through its deputy objects. Objects that share the same attributes and methods are defined as a class, and deputy objects that share the same attributes and methods are defined as a deputy class. The model provides an object deputy algebra, including Selection, Extension, Projection, Union, Join, and Group operations, which derives various kinds of deputy classes to represent complex semantic relationships such as Specialization, Generalization, Aggregation, and Grouping. Various dependency relationships and semantic constraints exist between an object and its deputy objects, and their consistency can be maintained through object propagation. The model combines the rich semantic expressiveness and efficient data processing of the object-oriented data model with the flexibility of the relational data model. It can therefore model unstructured and semi-structured complex information well and support object views, role multiplicity, and object propagation in a unified way.
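The idea of deriving a deputy class and keeping it consistent through propagation can be illustrated with a minimal Python sketch. It is not the TOTEM implementation: the names `SourceClass`, `DeputyClass`, `refresh`, and `view` are hypothetical, and only Selection, Projection, and Extension are modeled, under the assumption that deputy objects read their inherited attributes from the source object instead of copying them.

```python
from typing import Any, Callable, Dict, List, Optional, Sequence

SourceObject = Dict[str, Any]


class SourceClass:
    """A class of source objects; each object is a plain attribute dict."""

    def __init__(self) -> None:
        self.objects: List[SourceObject] = []
        self.deputy_classes: List["DeputyClass"] = []

    def add(self, **attrs: Any) -> SourceObject:
        obj = dict(attrs)
        self.objects.append(obj)
        self._propagate(obj)
        return obj

    def update(self, obj: SourceObject, **changes: Any) -> None:
        obj.update(changes)
        self._propagate(obj)

    def _propagate(self, obj: SourceObject) -> None:
        # Object propagation: re-check this object against every deputy class.
        for deputy_class in self.deputy_classes:
            deputy_class.refresh(obj)


class DeputyClass:
    """Deputy class derived by Selection, Projection, and Extension (assumed design)."""

    def __init__(
        self,
        source: SourceClass,
        predicate: Callable[[SourceObject], bool],                       # Selection
        inherited: Sequence[str],                                        # Projection
        extra: Optional[Dict[str, Callable[[SourceObject], Any]]] = None,  # Extension
    ) -> None:
        self.predicate = predicate
        self.inherited = list(inherited)
        self.extra = dict(extra or {})
        self.members: List[SourceObject] = []  # source objects that have a deputy here
        source.deputy_classes.append(self)
        for obj in source.objects:
            self.refresh(obj)

    def refresh(self, obj: SourceObject) -> None:
        # Identity comparison: two distinct objects may hold equal attribute values.
        is_member = any(member is obj for member in self.members)
        wanted = self.predicate(obj)
        if wanted and not is_member:
            self.members.append(obj)
        elif not wanted and is_member:
            self.members = [member for member in self.members if member is not obj]

    def view(self) -> List[Dict[str, Any]]:
        """Return the deputy objects; values are read through from the source."""
        rows = []
        for obj in self.members:
            row = {name: obj[name] for name in self.inherited}
            for name, compute in self.extra.items():
                row[name] = compute(obj)
            rows.append(row)
        return rows
```

Because deputy objects in this sketch hold no copies of the source data, an update to a source object is visible through every deputy class at once, and the only work done on propagation is re-evaluating the selection predicate.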
To apply the object deputy model to practical complex data management, we need to develop a database management system based on the object deputy model. With support from the national 863 program's major database project and other funding, we drew on mature implementation techniques from relational and object-oriented databases, designed an SQL-style object deputy database language, and developed new implementation techniques suited to storing, indexing, querying, and propagating complex data in the object deputy model, as well as concurrency control. We have successfully developed TOTEM, an object deputy database management system with independent intellectual property rights. To address the characteristics of complex data, we established a database testing plan similar to the international database benchmark TPC-C, developed the corresponding testing tools, and comprehensively tested the performance and ACID properties of the TOTEM system. The results validated its efficiency and stability. Finally, by developing typical demonstration applications, we showed the effectiveness of the TOTEM system in managing complex data such as multimedia, geographic information, and biological information.
Database Watermarking Technology
With the popularization of the Internet, all kinds of electronic products are published online, which has made copyright protection an urgent problem. Digital watermarking is an effective method and an active topic in multimedia information security research. As an electronic product, a database is costly to create but easy to pirate. Without a proper means of copyright protection, database owners are likely to lose their enthusiasm for investing in databases. Introducing digital watermarking into databases and studying digital copyright protection in combination with cryptography, coding, and protocols can form a coherent copyright management system that effectively protects database copyright.
Based on PostgreSQL, we have established a digital watermarking database management system that can embed three types of watermark information in a database (characters, numbers, and images), and we have achieved interim results on copyright protection for geographic databases. Under a selection attack, we can detect nearly 100% of the watermark information from approximately 50% of the data. Under an alteration attack, in which we randomly change 50% of the tuples by resetting 1 bit in a value, we can detect approximately 79% of the watermark. Under an addition attack in which 50% more tuples are added, about 75% of the watermark still survives. These attacks therefore cannot severely affect the system.
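The text above does not describe the embedding algorithm, so the following Python sketch only illustrates why a database watermark can survive a selection attack. It assumes a generic keyed least-significant-bit scheme: each tuple's primary key is hashed with a secret key to pick one watermark bit, that bit is written into the lowest bit of an integer attribute, and detection takes a majority vote per bit. The names `embed`, `detect`, and `_position` are hypothetical.

```python
import hashlib
import hmac
from typing import List, Sequence, Tuple

Row = Tuple[int, int]  # (primary key, integer attribute value)


def _position(secret: bytes, primary_key: int, length: int) -> int:
    """Map a primary key to one watermark bit position using a keyed hash."""
    digest = hmac.new(secret, str(primary_key).encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % length


def embed(rows: Sequence[Row], secret: bytes, watermark: Sequence[int]) -> List[Row]:
    """Write one watermark bit into the lowest bit of each attribute value."""
    marked = []
    for primary_key, value in rows:
        bit = watermark[_position(secret, primary_key, len(watermark))]
        marked.append((primary_key, (value & ~1) | bit))
    return marked


def detect(rows: Sequence[Row], secret: bytes, length: int) -> List[int]:
    """Recover the watermark by majority vote over the tuples that are present."""
    votes = [[0, 0] for _ in range(length)]  # votes[i] = [count of 0s, count of 1s]
    for primary_key, value in rows:
        votes[_position(secret, primary_key, length)][value & 1] += 1
    return [1 if ones > zeros else 0 for zeros, ones in votes]
```

Since every watermark bit is repeated across many tuples, dropping or altering a fraction of the tuples removes only some of the votes for each bit, which is the redundancy that detection rates like those quoted above rely on.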
Database Encryption
Database technology has become so important that many enterprises now store large amounts of data in databases, which has led to rapid growth in database information and higher costs for database construction and maintenance. With the development and spread of the Internet, a new way of managing databases, the DAS (Database As a Service) model, has emerged. Under this model, a Database Service Provider (DSP) offers enterprises a seamless mechanism for managing their data over the Internet. The provider is also in charge of maintaining the database system, which leads to considerable cost savings.
Building on encryption methods for the DAS model, we divide the traditional encrypted table into an encrypted data table and an index table, and introduce a tuple identifier attribute, so that statistical attacks can be withstood effectively when queries are processed without decrypting the data. Meanwhile, we are investigating a more compatible query optimization method with the aim of improving query efficiency. For multi-user extension, we proposed an access control method based on cryptographic keys and defined a hierarchy over the key space, which notably reduces the number of keys.
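One common way a key hierarchy reduces the number of stored keys is to derive each child key from its parent with a one-way function, so a user who holds the key of one node can compute the keys of everything beneath it. The sketch below assumes this construction using HMAC-SHA256 from the Python standard library; the source does not state which derivation is actually used, and `derive_child_key` and `derive_path` are hypothetical names.

```python
import hashlib
import hmac
from typing import Sequence


def derive_child_key(parent_key: bytes, child_label: str) -> bytes:
    """Derive a child key from its parent; the parent cannot be recovered from it."""
    return hmac.new(parent_key, child_label.encode("utf-8"), hashlib.sha256).digest()


def derive_path(root_key: bytes, path: Sequence[str]) -> bytes:
    """Walk down the hierarchy from the root, for example ["sales", "orders"]."""
    key = root_key
    for label in path:
        key = derive_child_key(key, label)
    return key
```

Under this assumption a department administrator who is given only the `["sales"]` key can compute the key for `["sales", "orders"]` locally, while a user who holds only the `orders` key cannot work back up to the department or root key.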
Cross-media
With the increasing popularity of digital media, the channels for acquiring digital media, the storage media, and the styles and formats of media are all developing rapidly. The amount of digital media information in our daily life is growing sharply, and we spend a large amount of time scanning and dealing with newly added media information. However, most existing search engines can handle only a limited range of media, and a search mechanism suited to one kind of media may not apply to others. Furthermore, each single search mode is limited: a search based on image patterns can only find sets of images similar to the sample, and a search based on keywords can only find textual information such as websites and theses. As a result, the gap between the total amount of digital media and the amount of media that can be searched will gradually widen.
Cross-media is a combination of plane media, three-dimensional media and network media.
Plane media: discrete media such as numbers, text, images, and figures.
Three-dimensional media: complex media (3D models and time-based media such as video, audio, and animation).
Network media: media that use the network as their carrier, for example web pages, instant messages, email, and fax.
Building on multimedia and its technology, cross-media seeks to integrate multimedia resources and fuse information on a three-dimensional platform so that they coexist. It aims to maximize the combination and synergy, the complementarity, and the multi-dimensional interaction between different media. The goal is to support recognition of user demand, retrieval, publication, discovery, restructuring, and reuse, so that each kind of media can be used efficiently.
Cross-media research involves multimedia data mining, artificial intelligence, pattern recognition, probability and statistics, neural networks, natural language processing, and so on. It is a multidisciplinary research area. Our current emphasis is on how to mine semantic information from various kinds of media documents, obtain the semantic connections, and combine the associated information extracted from media structure, features, human-computer interaction, and so on to construct a cross-media semantic network and to realize cross-media retrieval based on semi-natural language or pattern sets.
Database Theory and Programming Theory
From the perspective of programming languages, a database language usually behaves as a functional language with some characteristics of an imperative language: the data definition language defines a set variable with a specified element type; the data manipulation language assigns and materializes the set variable; and the data query language declares and assigns a new (temporary) set variable. The core of a database language is its type system, which should have the following characteristics. First, the language provides an integrated type construction method that is flexible enough to meet users' needs to create types dynamically (schema evolution). Second, the semantic model of the type system is established on a group of carrier sets whose elements can be determined by an intensional equivalence relation, while the constraint (dependency) relations among elements in the semantic domain influence (the normalization of) type definitions. Third, the semantic model of the type system can be represented physically by the data sets in the database, but the logical organization and physical realization of the type system and the data sets are independent. Relational database languages conform to these characteristics well. In traditional object-oriented databases, however, the definition, manipulation, and query languages are relatively independent and lack a unified type system to support their main operations, and there is no appropriate theoretical framework to merge these independent parts into an organic whole in the semantic model. An important characteristic of the object deputy model is that it unifies data definition, manipulation, and query through deputy classes and deputy objects, so that they can be formally represented as the definition and operation of object classes.
Based on this model, we can design a programming language for object databases, together with its semantic model, that has the above characteristics. On this foundation we can establish a complete formal system to support schema definition and evolution in the object deputy database, which in turn can guide the design and implementation of the database system.
Scientific Workflow
Scientific workflow is used to describe and control the execution of scientific experiments and processes, for instance DNA sequencing and geographic processes. Scientific workflows are more complicated and uncertain than commercial workflows. Therefore, a standard workflow mechanism cannot sufficiently describe this kind of work.
In a commercial application, the aim of introducing workflow management is to increase efficiency by restructuring processes. In a scientific application, the purpose is to control the process of experiments in order to obtain more information on how the experiments are conducted.
Compared with commercial workflow, scientific workflow differs in two respects. First, all activities are driven by experiments rather than by business. Second, the interaction between the workflow model and the activities should support errors in the experiment and the special-case decisions that arise in a scientific experiment.
Full Text Retrieval
Data records in a database mainly consist of structured data such as characters, dates, numerical values, and currency, all of which have limited length or fixed format. Unstructured data such as abstracts and theses, also called full text data, are character data without fixed length or format.
Existing database systems generally take structured data as their retrieval target because it is easier to implement. For example, with a well-ordered index table and binary search, a desired numerical value can be found quickly. For unstructured data such as full text data, however, retrieval is much more difficult. The fundamental purpose of our full text retrieval work is to enable fast search over large volumes of unstructured data.
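The structured case mentioned above can be sketched with Python's standard `bisect` module. The sketch assumes the index table is held as two parallel lists, the sorted key values and the matching row identifiers; `find_rows` and this layout are illustrative and do not reflect how any particular database stores its indexes.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence


def find_rows(sorted_keys: Sequence[int], row_ids: Sequence[int], value: int) -> List[int]:
    """Return the row ids whose key equals value, using binary search on the keys."""
    first = bisect_left(sorted_keys, value)   # first position with key >= value
    last = bisect_right(sorted_keys, value)   # first position with key > value
    return list(row_ids[first:last])
```

Each lookup inspects on the order of log2(n) keys instead of scanning all n, which is why ordered structured values are comparatively easy to retrieve; free text has no such natural ordering and needs a different index.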
Full text retrieval comprises word segmentation, full text indexing, and related technologies. At present, our research is mainly based on inverted-index full text retrieval, and we aim to integrate it into the PostgreSQL source code so that it sits in the same layer as the existing B-tree and R-tree index mechanisms and supports efficient retrieval over tables on the order of ten million records.
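The two components named above, word segmentation and an inverted index, can be illustrated with a small in-memory Python sketch. It is not the PostgreSQL integration: the segmentation step here is an assumed simple split on word characters (a real segmenter for languages without word delimiters would need a dictionary or statistical model), and `segment`, `build_inverted_index`, and `search` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set


def segment(text: str) -> List[str]:
    """Naive word segmentation: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(documents: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in documents.items():
        for term in segment(text):
            index[term].add(doc_id)
    return index


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return the ids of documents that contain every term of the query."""
    terms = segment(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

A query touches only the posting sets of its own terms, so the cost depends on how many documents contain those terms and not on the total volume of text, which is what makes the inverted structure attractive for large tables.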