Written by an RIG Intern Researcher
With the development of computer technology and the rapid rise of artificial intelligence, machine learning has become one of the most widely used technologies in fields such as banking, e-commerce, spam detection, and speech and facial recognition. Machine learning plays an indispensable role in promoting social progress and making people’s lives easier. Its performance usually improves with the amount of user data available, so many companies need to collect data in large quantities. In the context of big data, both users and service providers face the risk of privacy leaks. Users worry that personal information such as their identity and geographic location will be leaked, while service providers are concerned about the theft of their own related information. In addition, attackers can exploit model leakage to obtain data for profit. How to protect privacy in machine learning is therefore an important research area.
In recent years, many scholars have studied how to achieve privacy protection in machine learning applications and have made steady progress. This blog post introduces and summarizes homomorphic encryption techniques that address privacy issues in the data and models used in machine learning.
Homomorphic encryption lets users perform operations directly on ciphertexts, and the result obtained after decryption matches the result of the same computation on the plaintexts, which makes it a direct and effective technology for protecting data privacy. For the decision tree algorithm, a private scalar product protocol was first proposed to enable privacy-preserving collaborative decision tree classification; subsequently, researchers used the Paillier encryption algorithm to design a decision tree classification model under homomorphic encryption. Zhan used additively homomorphic encryption and oblivious transfer protocols to construct private decision tree and random forest evaluation protocols. Zhan also evaluated the scalability of the protocols on a decision tree with a depth of up to 20 and more than 10,000 decision nodes. The results show that the protocols protect data privacy while keeping the algorithm efficient.
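To make the idea of computing on ciphertexts concrete, here is a minimal sketch of the Paillier scheme mentioned above, written in plain Python. The tiny fixed primes, the choice g = n + 1, and all function names are assumptions made for illustration; the key size is far too small to be secure, and this is not the construction used in any of the cited papers. It requires Python 3.8 or later for the modular inverse via `pow`.

```python
import math
import random

def keygen(p=1789, q=1861):
    """Toy key generation with tiny fixed primes (insecure, demo only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse; this shortcut assumes g = n + 1
    return n, (lam, mu)

def encrypt(n, m):
    """Encrypt integer m (taken mod n) under public key n with g = n + 1."""
    n2 = n * n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:  # the random mask must be invertible mod n
            break
    return pow(n + 1, m % n, n2) * pow(r, n, n2) % n2

def decrypt(n, sk, c):
    """Recover m via L(c^lam mod n^2) * mu mod n, where L(x) = (x - 1) // n."""
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n

def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return c1 * c2 % (n * n)

def mul_by_constant(n, c, k):
    """Raising a ciphertext to a public constant k multiplies the plaintext by k."""
    return pow(c, k % n, n * n)

n, sk = keygen()
c_sum = add_encrypted(n, encrypt(n, 17), encrypt(n, 25))
c_scaled = mul_by_constant(n, encrypt(n, 7), 6)
print(decrypt(n, sk, c_sum))     # 42
print(decrypt(n, sk, c_scaled))  # 42
```

The party that calls `add_encrypted` or `mul_by_constant` never sees 17, 25, or 7; only the holder of `sk` can read the results. Paillier supports addition and multiplication by a known constant, but not multiplication of two ciphertexts, which is why protocols built on it need extra interaction for nonlinear steps.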
Yu proposed a privacy-preserving support vector machine that uses a nonlinear kernel function on horizontally partitioned data. The work also considered vertically partitioned data, but only for data represented by binary feature vectors. Researchers designed a kernelized support vector machine (kernelized SVM) that can output encrypted kernel values and classifiers.
Zhu established an interactive protocol for outsourcing logistic regression to the cloud, in which users must communicate with the cloud servers over multiple rounds. The communication cost of this protocol is determined by the size and dimensionality of the data set, and it also depends on the user’s computational cost. Logistic regression can also be handled with the least squares method. Kim designed a new homomorphic encryption scheme optimized for real-number computation, together with a least squares approximation algorithm for logistic regression, and achieved improved accuracy and efficiency.
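Homomorphic schemes generally evaluate only additions and multiplications, so the sigmoid function in logistic regression has to be replaced by a polynomial. The sketch below fits a low-degree polynomial to the sigmoid by least squares in plain Python. The interval [-8, 8], the degree 3, the sampling grid, and all function names are assumptions chosen for illustration, not values taken from the cited papers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, size):
            factor = M[r][col] / M[col][col]
            for k in range(col, size + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * size
    for i in reversed(range(size)):  # back substitution
        s = M[i][size] - sum(M[i][k] * x[k] for k in range(i + 1, size))
        x[i] = s / M[i][i]
    return x

def fit_polynomial(xs, ys, degree):
    """Least squares fit via the normal equations; returns [c0, c1, ..., cd]."""
    size = degree + 1
    A = [[sum(x ** (j + k) for x in xs) for k in range(size)] for j in range(size)]
    b = [sum((x ** j) * y for x, y in zip(xs, ys)) for j in range(size)]
    return solve_linear(A, b)

def evaluate(coeffs, x):
    """Horner evaluation: only additions and multiplications are used."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * x + c
    return result

# Sample the sigmoid on a symmetric grid over the assumed interval [-8, 8].
xs = [-8 + 16 * i / 400 for i in range(401)]
coeffs = fit_polynomial(xs, [sigmoid(x) for x in xs], degree=3)
max_error = max(abs(evaluate(coeffs, x) - sigmoid(x)) for x in xs)
print([round(c, 5) for c in coeffs], round(max_error, 3))
```

Because `evaluate` uses nothing but additions and multiplications, the same polynomial could in principle be applied to encrypted inputs. Raising the degree lowers the approximation error but increases the multiplicative depth the encryption scheme has to support.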
Samanthula implemented k-nearest neighbor classification on semantically secure encrypted relational data. They used the Paillier cryptosystem, an additively homomorphic encryption scheme, to encrypt the data and outsourced it to one cloud server, while the data owner sent the key to a second cloud server. The two cloud servers are assumed to be semi-honest and non-colluding. The classification is completed by calling sub-protocols from the main protocol, including the Secure Multiplication (SM) protocol, the Secure Squared Euclidean Distance (SSED) protocol, the Secure Minimum (SMIN) protocol, and other sub-protocols. It is worth mentioning that they also encrypted the class labels, which had not been done before; data owners and queriers can stay offline after uploading their private data, because the two clouds perform the computation. However, if the two cloud servers collude, the classification is no longer secure.
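The two-server idea behind the SM and SSED sub-protocols can be sketched as follows. One server holds only ciphertexts and the other holds only the key; the first server blinds its values with random numbers before asking the second to multiply them. This is an illustrative sketch using standard additive blinding over a toy Paillier instance, not necessarily the exact construction from the paper. The tiny primes and all function names are assumptions, both servers are simulated in one process, and Python 3.8 or later is required.

```python
import math
import random

# Toy Paillier with g = N + 1 (tiny primes, insecure; illustration only).
P, Q = 1789, 1861
N = P * Q
N2 = N * N
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)
MU = pow(LAM, -1, N)

def enc(m):
    while True:
        r = random.randrange(1, N)
        if math.gcd(r, N) == 1:
            break
    return pow(N + 1, m % N, N2) * pow(r, N, N2) % N2

def dec(c):
    """Only the key-holding server (C2) is allowed to call this."""
    return (pow(c, LAM, N2) - 1) // N * MU % N

def c2_multiply_blinded(cx, cy):
    """C2 decrypts two blinded values, multiplies them, and re-encrypts."""
    return enc(dec(cx) * dec(cy) % N)

def secure_multiply(ca, cb):
    """C1 obtains E(a * b) from E(a) and E(b) without learning a or b."""
    ra, rb = random.randrange(N), random.randrange(N)
    blinded_a = ca * enc(ra) % N2  # E(a + ra)
    blinded_b = cb * enc(rb) % N2  # E(b + rb)
    ch = c2_multiply_blinded(blinded_a, blinded_b)  # E((a + ra)(b + rb))
    # Strip the cross terms: a*b = h - a*rb - b*ra - ra*rb (mod N).
    out = ch * pow(ca, N - rb, N2) % N2
    out = out * pow(cb, N - ra, N2) % N2
    return out * enc(-(ra * rb)) % N2

def secure_squared_distance(cx, cy):
    """C1 obtains E(sum_i (x_i - y_i)^2) from two encrypted vectors."""
    total = enc(0)
    for cxi, cyi in zip(cx, cy):
        diff = cxi * pow(cyi, N - 1, N2) % N2  # E(x_i - y_i)
        total = total * secure_multiply(diff, diff) % N2
    return total

x, y = [3, 7, 1], [6, 2, 5]
c_dist = secure_squared_distance([enc(v) for v in x], [enc(v) for v in y])
print(dec(c_dist))  # 50
```

C2 only ever decrypts values masked by fresh random numbers, and C1 only ever handles ciphertexts, which is why the scheme relies on the two servers not colluding: together they would hold both the masks and the key.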
Researchers proposed a privacy-preserving neural network model named CryptoNets. They used a fully homomorphic encryption algorithm to encrypt the prediction data and replaced the activation function in the trained convolutional neural network with a square function of multiplicative depth 1, finally achieving 99% accuracy on the MNIST data set. However, the approach has shortcomings: the model is only suitable for small neural networks, and for networks with more than 2 nonlinear layers the test accuracy becomes very low. Later work suggested that the neural network should still use alternative activation functions in the training phase, and that a batch normalization layer should be added before each nonlinear polynomial activation layer in the testing phase, so that the input to the activation function follows a stable normal distribution, which improves classification accuracy. Researchers also proposed Faster CryptoNets. By deriving the best approximation of commonly used activation functions, they achieved a highly sparse encoding, minimized the approximation error, and improved neural network performance on encrypted data through a sparse representation of the entire network.
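The reason a square activation suits encrypted inference, and why deep networks become a problem, can be illustrated with a small plaintext simulation. The `Tracked` class below is a hypothetical stand-in for a ciphertext that simply records how many ciphertext-by-ciphertext multiplications lie on the longest path to a value. The weights, the layer sizes, and the convention that multiplying by a plaintext weight does not add depth are assumptions for illustration; no encryption is performed.

```python
class Tracked:
    """Plaintext stand-in for a ciphertext that records multiplicative depth."""

    def __init__(self, value, depth=0):
        self.value = value
        self.depth = depth

    def __add__(self, other):
        if isinstance(other, Tracked):
            return Tracked(self.value + other.value, max(self.depth, other.depth))
        return Tracked(self.value + other, self.depth)  # adding a plaintext constant

    __radd__ = __add__

    def __mul__(self, other):
        if isinstance(other, Tracked):
            # ciphertext * ciphertext consumes one multiplicative level
            return Tracked(self.value * other.value, max(self.depth, other.depth) + 1)
        return Tracked(self.value * other, self.depth)  # scaling by a plaintext weight

    __rmul__ = __mul__

def dense(inputs, weights, biases):
    """Linear layer with plaintext weights: additions and scalings only."""
    outputs = []
    for row, bias in zip(weights, biases):
        acc = bias
        for w, x in zip(row, inputs):
            acc = x * w + acc
        outputs.append(acc)
    return outputs

def square_activation(values):
    """x -> x * x: one ciphertext multiplication, so depth grows by exactly 1."""
    return [v * v for v in values]

x = [Tracked(1), Tracked(2)]
hidden = square_activation(dense(x, [[1, -1], [2, 1]], [0, 1]))
out1 = dense(hidden, [[1, 1]], [-3])[0]
out2 = square_activation([out1])[0]
print(out1.value, out1.depth)  # 23 1
print(out2.value, out2.depth)  # 529 2
```

Each additional square layer raises the required depth by one, and the values themselves grow quickly (23 becomes 529 after a single squaring), which hints at why accuracy and efficiency degrade as more nonlinear layers are stacked.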
Homomorphic encryption can help guarantee the security of machine learning systems and help break down data silos. Homomorphic encryption schemes are currently the mainstream approach to implementing encrypted machine learning. However, because current schemes restrict the types of operands and operations they support, people seek compound strategies, such as combining homomorphic encryption with multi-party computation and algorithmic approximations. How to improve the efficiency of homomorphic encryption and combine it with other cryptographic techniques is therefore a research direction that researchers can pursue in the future.
- Zhan, Justin. “Using homomorphic encryption for privacy-preserving collaborative decision tree classification.” 2007 IEEE Symposium on Computational Intelligence and Data Mining. IEEE, 2007.
- Yu, Hwanjo, Xiaoqian Jiang, and Jaideep Vaidya. “Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data.” Proceedings of the 2006 ACM Symposium on Applied Computing. 2006.
- Zhu, Xu Dong, Hui Li, and Feng Hua Li. “Privacy-preserving logistic regression outsourcing in cloud computing.” International Journal of Grid and Utility Computing 4.2-3 (2013): 144-150.
- Kim, Miran, et al. “Secure logistic regression based on homomorphic encryption: Design and evaluation.” JMIR Medical Informatics 6.2 (2018): e19.
- Samanthula, Bharath K., Yousef Elmehdwi, and Wei Jiang. “K-nearest neighbor classification over semantically secure encrypted relational data.” IEEE Transactions on Knowledge and Data Engineering 27.5 (2014): 1261-1273.