
An efficient data perturbation scheme for preserving privacy on numerical database

1. Introduction
Technology brings convenience, and cloud computing has risen rapidly in recent years. The data held inside an organization may grow quickly. Although an organization may build its own storage, it may also publish these data to a data analysis company for marketing purposes. Data mining techniques therefore play an important role in Knowledge Discovery in Databases (KDD). However, a malicious data analysis company may record personal data when the organization publishes its statistical database to the company. If the company is not trusted, there is a risk of leakage. For these reasons, privacy research has become increasingly popular in recent years. Statistical databases (SDBs) are used to produce statistical aggregates such as sum, average, maximum, and minimum. The results of these aggregates do not reveal the content of any single individual tuple. However, a user may issue many legal queries and infer confidential information from the combination of database responses.

In recent years, enhancing the security of statistical databases has received a lot of attention. The security problem in a classical statistical database involves three different roles [17]: the statistician, whose interest is to obtain aggregate data; the data owner, who wishes individual records to be kept secure; and the database administrator, who needs to satisfy both of the above roles. The privacy challenges in statistical databases can be classified into two aspects [15]: for the data owner, data theft by hackers and data abuse by the service provider should be prevented, and user access rights should be restricted; for the user, the query content should be hidden and the database should not reveal query details. Many approaches have been proposed. Navarro-Arribas and Torra organize them into four categories [16]: 1) perturbative methods, which modify the original data, usually by adding noise, to reach a degree of privacy; 2) non-perturbative methods, which mask the data without introducing error, so that, in contrast to perturbative methods, the data are not distorted; 3) cryptographic methods, which use classical cryptographic systems; and 4) synthetic data generation, which generates random data while retaining relationships with the original data. To protect confidential information in databases, Statistical Disclosure Control (SDC) is the most widely used privacy-preserving solution for statistical databases. Micro-Aggregation Techniques (MATs) belong to the SDC family and are perturbative methods. Microaggregation has many attractive features, including robust performance, consistent responses, and ease of implementation [6]. The user is still able to obtain useful information, since the method does not substantially reduce the information content; in other words, the information loss introduced by this method is minimal.


Furthermore, we review several approaches for preserving privacy [1-5, 8, 12-14, 17]. In particular, the microaggregation scheme has attracted attention for statistical databases in recent years because it replaces the original values with values of low distortion in order to prevent identity and prediction disclosure, and the replaced data do not cause problems for data analysis or data mining applications. Every record in the database can be represented as a data point in a coordinate system.

This paper considers that a combination of two or more nonconfidential attributes, such as age and weight, can be used to link to an individual. Such a set of attributes is collectively called a quasi-identifier. A popular approach for replacing original data is to use a clustering-based technique to prevent identity disclosure: the adversary is confused because the original data are replaced by group values. However, although the data in the dataset become homogeneous through the clustering-based technique, a problem of prediction disclosure remains.

 

2. Proposed scheme

This paper concerns the problem of prediction disclosure that arises when a quasi-identifier is generalized by the homogeneous values of the microaggregation method. A quasi-identifier contains one or more attributes that may link to an individual. For brevity, we only consider a quasi-identifier with two attributes. First, all values of the quasi-identifier are converted to data points in a coordinate system. To address prediction disclosure, the homogeneous values produced by the original microaggregation method are clustered first. We then generate noise based on the centroids of these groups. In order to speed up noise injection, all noise values are collected into a set, which we call the noise matrix in this paper; each original value corresponds to one noise value. In this section, we introduce the concept of microaggregation and then illustrate the clustering technique based on Prim's MST. The main ideas of the paper are noise generation and the noise injection procedure, which are described in the rest of this section.

 

2.1 Preliminary

The microaggregation technique belongs to the family of statistical disclosure control and is applied to numerical data, categorical data, sequences, and heterogeneous data [16]. It computes a value to represent each group and replaces the original values with it to confuse the adversary. Each record forms a group with its k nearest records, where k is a constant threshold preset by the data protector. If k is higher, the degree of privacy is higher but the data quality is lower; conversely, if k is lower, the degree of privacy is lower but the data quality is higher. There is thus a trade-off between disclosure risk and information loss. Although this method damages the original data and may lead to data distortion, it ensures that the level of distortion stays low and does not affect database operation. Therefore, minimizing information loss is a major challenge of this method. Microaggregation consists of two main operations, partition and aggregation, which we describe as follows (a short code sketch is given after the list):

• Partition: the records are partitioned into several disjoint groups, and each group contains at least k records.

• Aggregation: each record in a group is replaced by the centroid of its group, which is a value computed to represent the group.
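The two operations can be illustrated with a short sketch in Python. The partition used here (sorting records by their first attribute and cutting them into consecutive groups of at least k) is only a simplifying assumption for illustration; in our scheme the partition comes from the MST clustering of Section 2.2.

import numpy as np

def microaggregate(records: np.ndarray, k: int) -> np.ndarray:
    """Replace each record by the centroid of its group of at least k records.

    Illustrative partition: sort by the first attribute, cut into consecutive
    groups of k, and fold a short remainder into the previous group so that
    every group keeps at least k records.
    """
    order = np.argsort(records[:, 0])                 # partition step (simplified)
    groups = [order[i:i + k] for i in range(0, len(order), k)]
    if len(groups) > 1 and len(groups[-1]) < k:       # keep every group >= k
        groups[-2] = np.concatenate([groups[-2], groups[-1]])
        groups.pop()

    out = records.astype(float)
    for g in groups:                                  # aggregation step
        out[g] = records[g].mean(axis=0)              # centroid replaces each record
    return out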

 

2.2 MST Clustering

We adopt the minimum-cost spanning tree clustering technique based on Prim's algorithm, proposed by Laszlo and Mukherjee in 2005 [11].

In the first step, the clustering technique relies on Prim's minimum-cost spanning tree, which is constructed over all records in the dataset. Prim's algorithm is a greedy algorithm that finds a minimum-cost spanning tree for a connected undirected graph: it finds a subset of edges forming a spanning tree that connects all nodes such that the total weight of the edges is minimized. Some notation is defined to facilitate the discussion. Each record in the dataset D, with its attributes, is converted to a data point in the coordinate system and is treated as a node u of the minimum-cost spanning tree. A node u can be connected to another node v in the dataset D to form an edge e(u, v), with u, v ∈ D, and the distance computed between the two nodes is used as the weight w of that edge. According to Prim's algorithm, we first select a single node u ∈ D and initialize the minimum-cost spanning tree F = {u} with no edges. The next step selects the node v ∈ D − F that is closest to the set F; a new edge e(u, v) is formed between the nodes u, v ∈ D, node v points to its parent node u, and v is added to F, so that F = {u, v}. Each node points to its parent node in the tree, while the initial node points to null; in this case, node u points to null. This process iterates until F = D. In other words, Prim's algorithm selects a single node, regarded as the root of the tree, and grows it into a minimum-cost spanning tree whose total edge weight is minimized. The result of Prim's MST algorithm is shown in Fig. 1, where the tree edges are drawn as red lines and the weight is written next to each edge.

Fig 1. Minimum-cost spanning tree
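A compact sketch of this first step is given below, assuming Euclidean distance between data points as the edge weight; the parent array records the tree structure, with the root's parent set to None as described above.

import numpy as np

def prim_mst(points: np.ndarray):
    """Build a minimum-cost spanning tree over all data points.

    Returns parent pointers (parent[root] is None) and the weight of the edge
    that connects each node to its parent; weights are Euclidean distances.
    """
    n = len(points)
    parent = [None] * n
    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)           # cheapest known edge into the tree
    best_from = np.full(n, -1)

    u = 0                               # arbitrary start node, the tree root
    in_tree[u] = True
    for _ in range(n - 1):
        dist = np.linalg.norm(points - points[u], axis=1)
        update = (~in_tree) & (dist < best)
        best[update] = dist[update]
        best_from[update] = u
        v = int(np.argmin(np.where(in_tree, np.inf, best)))
        parent[v] = int(best_from[v])   # v joins the tree through its parent
        in_tree[v] = True
        u = v
    weights = [0.0 if parent[v] is None else
               float(np.linalg.norm(points[v] - points[parent[v]]))
               for v in range(n)]
    return parent, weights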

 

In the second step, in order to partition all nodes of the MST into clusters, we must decide which edges of the MST are removable. The idea is to visit all edges in the MST from longest to shortest and to decide whether each edge should be cut while retaining the remaining edges. After edge cutting, the MST is partitioned into several subtrees, each of which forms a cluster. All edges are placed into a priority queue Q in descending order of weight. We then obtain an edge e(v, u) in sequence from the priority queue Q, where v is the visiting node and u is the parent node of v, and consider whether that edge is removable. We consider the sizes of the two subtrees on the sides of the visiting node v and the parent node u, respectively, and check whether each size is at least k, the threshold preset by the data protector. The edge e(v, u) is removable only when both subtree sizes are at least k; otherwise it is not removable. First, we obtain the size of the subtree below the visiting node v as size(v), where size(·) returns the number of nodes in the subtree rooted at the given node. Second, we follow the parent pointers from the visiting node v through its parent node u up to the root of the current tree, and obtain the size of the other subtree as the size of that tree minus size(v). For brevity of illustration, suppose both subtree sizes are at least k, so the edge e(v, u) is removable. We remove the edge e(v, u) from the priority queue Q, i.e., Q = Q − {e(v, u)}, and set the parent pointer of v to null to indicate that v is now the root of a subtree.
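The edge-cutting test can be sketched as follows, reusing the parent pointers and edge weights returned by the MST step above; size(·) is written out explicitly as subtree_size, and k is the group size preset by the data protector.

def subtree_size(v, children):
    """Number of nodes in the subtree rooted at v (v plus all its descendants)."""
    size, stack = 0, [v]
    while stack:
        node = stack.pop()
        size += 1
        stack.extend(children[node])
    return size

def cut_edges(parent, weights, k):
    """Visit MST edges from longest to shortest; cut edge e(v, parent[v]) only
    when both resulting subtrees contain at least k nodes."""
    n = len(parent)
    children = [[] for _ in range(n)]
    for v, p in enumerate(parent):
        if p is not None:
            children[p].append(v)

    # priority queue Q: non-root nodes ordered by the weight of their parent edge
    for v in sorted((x for x in range(n) if parent[x] is not None),
                    key=lambda x: weights[x], reverse=True):
        root = v
        while parent[root] is not None:          # climb to the root of v's current tree
            root = parent[root]
        lower = subtree_size(v, children)        # side below the visited edge
        upper = subtree_size(root, children) - lower
        if lower >= k and upper >= k:            # both sides keep at least k nodes
            children[parent[v]].remove(v)
            parent[v] = None                     # v now roots its own subtree
    return parent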

The final step is a simple procedure that partitions all nodes into disjoint clusters. Each root of a subtree forms a cluster by traversing its descendant nodes. We find all nodes whose parent pointer is null and assign them to a root set R; a null parent pointer marks the root of a subtree, and each subtree forms a cluster c_i ∈ C, where C is the set of clusters, C = {c_1, c_2, ..., c_m}, 1 ≤ i ≤ m. We obtain a root node r from the front of the root set R and traverse all of its descendant nodes by following r down its subtree. After traversing the subtree, the root node r and all of its descendants form a new cluster c_i, and the root node r is removed from R, i.e., R = R − {r}. The next cluster is found by repeating the above procedure, which iterates until R = ∅. Finally, all nodes are partitioned into disjoint clusters.
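Continuing with the same parent-pointer representation, the final partition can be read off directly: every node whose parent pointer is null roots one cluster, and the cluster is that root together with its descendants. A minimal sketch:

def collect_clusters(parent):
    """Form one cluster per subtree root (parent pointer is None) by
    traversing its descendants."""
    n = len(parent)
    children = [[] for _ in range(n)]
    roots = []
    for v, p in enumerate(parent):
        if p is None:
            roots.append(v)
        else:
            children[p].append(v)

    clusters = []
    for r in roots:                       # each root yields one cluster
        cluster, stack = [], [r]
        while stack:
            node = stack.pop()
            cluster.append(node)
            stack.extend(children[node])
        clusters.append(cluster)
    return clusters

For example, collect_clusters(cut_edges(parent, weights, k)) yields the disjoint clusters that are passed on to the noise generation step of Section 2.3.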

 

2.3 Noise Generation

After clustering all data points, the next step is to generate noise based on the centroids of the groups. Our scheme is based on Huffman coding, proposed by Huffman in 1952 [9]. The Huffman coding algorithm is popular in data compression [7, 10]. We can identify distinct data points by building a Huffman coding tree, because Huffman coding has the following features: 1) each character has a corresponding Huffman code; and 2) a character with higher probability has a shorter Huffman code, while a character with lower probability has a longer Huffman code. These features can be used to generate noise that preserves privacy in the database: longer noise is injected into original data with lower probability, which are more likely to reveal privacy, in order to confuse the adversary. In other words, data with high probability are less likely to reveal the privacy of an individual.
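A minimal sketch of this idea is given below: a generic heap-based Huffman construction over the empirical frequencies of the (clustered) values, producing the value-to-code mapping that is later stored in the noise matrix of Section 2.4. The construction itself is the standard algorithm; only its use as a noise source follows our scheme.

import heapq
from collections import Counter

def huffman_codes(values):
    """Build Huffman codes from the empirical frequencies of the values.
    Rare values (easier to re-identify) receive longer codes, i.e. longer noise."""
    freq = Counter(values)
    if len(freq) == 1:                      # degenerate case: a single symbol
        return {next(iter(freq)): "0"}

    heap = [(f, i, {v: ""}) for i, (v, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # merge the two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {v: "0" + code for v, code in c1.items()}
        merged.update({v: "1" + code for v, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]                       # mapping: value -> Huffman code string

Values that occur rarely therefore receive the longest bit strings, which become the longest noise strings injected in the next subsection.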

 

2.4 Noise Injection Procedure

As mentioned above, the noise is built from the Huffman coding tree according to the probability of each original value and is then converted into a set, which we call the noise matrix in this paper. Each data point, i.e., each original value v, corresponds to a noise value r in the noise matrix. This simplifies the perturbation of the original values and makes the noise injection process easier and more intuitive. After building the noise matrix, we describe the process of noise injection. We put the string of noise into a queue and add '1' sequentially to the original data through the least significant bit (LSB) function until the queue is empty. Because the LSB function is used to disturb the original value, the data distortion remains small.
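One plausible reading of this procedure is sketched below: the Huffman code of a value is consumed as a queue of bits, and each bit drawn from the queue is added at the least significant digit of the value. The scaling factor used to expose that digit (two decimal places here) is an assumption made purely for illustration.

from collections import deque

def inject_noise(value: float, code: str, decimals: int = 2) -> float:
    """Perturb a numeric value with its Huffman-code noise string.

    Assumed interpretation: the code is treated as a queue of bits, and each
    bit popped from the queue is added to the least significant digit of the
    value, so longer codes give a slightly larger (but still small) change.
    """
    scale = 10 ** decimals
    scaled = int(round(value * scale))      # expose the least significant digit
    queue = deque(code)
    while queue:                            # inject until the queue is empty
        scaled += int(queue.popleft())      # LSB-level perturbation, one bit at a time
    return scaled / scale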

 

3. Results

We measure the running time of noise generation in milliseconds. To obtain precise timings, we average the running time of noise generation over 61 runs. Our experiments explore the difference in running time between the unclustered and clustered schemes, where the unclustered scheme does not include the MST clustering technique and the clustered scheme uses various values of k, the group size preset by the data protector. We also examine how the running time changes as the number of instances grows from 10 to 1,000.

The experimental results show that the running time of noise generation grows as the number of records increases, and that noise generation is faster in the clustered scheme than in the unclustered scheme. In the experiments we also observe some fluctuation in the running time of the clustered scheme, but overall the growth in running time is very smooth. In addition to running time, we also examine the data quality after the noise injection procedure. The data quality measure follows the one proposed by Domingo-Ferrer and Mateo-Sanz in 2002 [4]. The experimental results show less information loss in the unclustered scheme, but none of the information loss results exceed 50 percent. There is, however, a trade-off between minimizing information loss and the risk of value disclosure. In summary, our proposed scheme generates noise efficiently to preserve privacy in the database.
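For reference, a widely used information-loss measure in the microaggregation literature (following Domingo-Ferrer and Mateo-Sanz [4]) is the within-group sum of squared errors divided by the total sum of squares; the sketch below assumes this is the intended measure and reports it as a percentage.

import numpy as np

def information_loss(original: np.ndarray, clusters) -> float:
    """Information loss as 100 * SSE / SST: within-cluster squared error around
    each cluster centroid, over the total squared deviation from the global mean."""
    sse = sum(float(((original[c] - original[c].mean(axis=0)) ** 2).sum())
              for c in clusters)
    sst = float(((original - original.mean(axis=0)) ** 2).sum())
    return 100.0 * sse / sst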

 

4. Conclusion

Cloud computing has become very popular in recent years, and technology brings convenience; as a result, the data held inside organizations grow rapidly. The concept of database as a service was proposed in 2002. However, personal privacy may be revealed when all the data are published to a third-party service provider that cannot be trusted. Another scenario occurs when a dealer collects personal transaction data and publishes them to a data analysis company for research or marketing purposes, but the company is malicious; this also creates a risk of leakage. For these reasons, preserving privacy in databases has become increasingly important in recent years. Although security issues in databases are a broad problem, this paper only addresses the prediction disclosure issue, in which an adversary is able to predict the confidential value of an individual. We present an efficient noise generation scheme that relies on the Huffman coding algorithm. We also build a noise matrix so that noise can be added to the original values intuitively. Moreover, records in the dataset are partitioned into disjoint clusters before noise is generated. Our scheme can only be used on numerical or statistical databases. In the future, we will consider non-numeric values and propose a conversion mechanism that is adapted to our scheme and can convert between non-numeric and numeric values. Once all non-numeric values can be converted to numeric values, they can be handled by our scheme, extending this study.

 

References

[1] A. Asuncion and D. Newman, “UCI Machine Learning Repository,” University
of California, Irvine, 2007. [Online]. Available:
http://archive.ics.uci.edu/ml/.

[2] C. C. Chang, Y. C. Li, and W. H. Huang, “TFRP: An efficient
microaggregation algorithm for statistical disclosure control,” Journal of
Systems and Software, vol. 80, no. 11, pp. 1866–1878, Nov. 2007.

[3] S. K. Chettri and B. Borah, “An Efficient Microaggregation
Method for Protecting Mixed Data,” Journal of Computer Networks &
Communications, vol. 131, pp. 551–561, 2013.

[4] J. Domingo-Ferrer and J. M. Mateo-Sanz, “Practical
data-oriented microaggregation for statistical disclosure control,” IEEE
Transactions on Knowledge and Data Engineering, vol. 14, no. 1, pp.
189–201, 2002.

[5] J. Domingo-Ferrer, A. Martínez-Ballesté, J. M. Mateo-Sanz, and
F. Sebé, “Efficient multivariate data-oriented microaggregation,” The
International Journal on Very Large Data Bases, vol. 15, no. 4, pp.
355–369, Aug. 2006.

[6] E. Fayyoumi and B. Oommen, “A survey on statistical disclosure
control and micro-aggregation techniques for secure statistical databases,” Journal
of Software: Practice and Experience, vol. 40, no. 12, pp. 1161–1188, 2010.

[7] P. Gonciari, “Variable-length input Huffman coding for
system-on-a-chip test,” IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems, vol. 22, no. 6, pp. 783–796, 2003.

[8] S. Hajian and M. A. Azgomi, “A privacy preserving clustering
technique using Haar wavelet transform and scaling data perturbation,” in IEEE
International Conference on Innovations in Information Technology, 2008,
pp. 218–222.

[9] D. Huffman, “A method for the construction of minimum
redundancy codes,” Proceedings of the IRE, vol. 40, no. 9, pp. 1098–1101, 1952.

[10] X. Kavousianos, “Optimal selective Huffman coding for
test-data compression,” IEEE Transactions on Computers, vol. 56, no. 8,
pp. 1146–1152, 2007.

[11] M. Laszlo and S. Mukherjee, “Minimum spanning tree
partitioning algorithm for microaggregation,” IEEE Transactions on Knowledge
and Data Engineering, vol. 17, no. 7, pp. 902–911, 2005.

[12] X. B. Li and S. Sarkar, “Data clustering and
micro-perturbation for privacy-preserving data sharing and analysis,” in International
Conference on Information Systems, 2010.

[13] X. B. Li and S. Sarkar, “Class-restricted clustering and
microperturbation for data privacy,” Journal of Management Science, vol.
59, no. 4, pp. 796–812, Apr. 2013.

[14] J. L. Lin, T. H. Wen, J. C. Hsieh, and P. C. Chang,
“Density-based microaggregation for statistical disclosure control,” Journal
of Expert Systems with Applications, vol. 37, no. 4, pp. 3256–3263, Apr.
2010.

[15] Y. Lu and G. Tsudik, “Enhancing data privacy in the cloud,” Journal
of IFIP Advances in Information and Communication Technology, vol. 358, pp.
117–132, 2011.

[16] G. Navarro-Arribas and V. Torra, “Information fusion in data
privacy: A survey,” Journal of Information Fusion, vol. 13, no. 4, pp.
235–244, Oct. 2012.

[17] J. Traub, Y. Yemini, and H. Woźniakowski, “The statistical
security of a statistical database,” ACM Transactions on Database Systems,
vol. 9, no. 4, pp. 672–679, Nov. 1984.