I’m harukasan, the CTO of pixiv Inc.
As a platform, we consider it an important responsibility to protect the works users post on pixiv and pixiv-related services from being gathered or aggregated for malicious purposes. pixiv Inc. already employs various technologies to identify and block crawlers that aggregate large amounts of content for malicious purposes. In this article, I introduce the strategies we currently use and outline some of the measures we are working to implement.
Our current measures against malicious large-scale content aggregation
pixiv Inc. uses multiple solutions to prevent the large-scale aggregation of content for malicious purposes, including some very basic measures. Below, I’ll introduce some specifics about the technologies we use.
CAPTCHA technology is a well-known anti-bot solution. At pixiv Inc., we use reCAPTCHA Enterprise, Google’s premium version of reCAPTCHA. reCAPTCHA Enterprise includes all the features of the standard reCAPTCHA, and it also allows model tuning through annotation and detects suspicious activity through Account Defender, which can identify accounts that exhibit similar suspicious behavior.
By using these features of reCAPTCHA Enterprise, we are able to block requests based on IP address or authorized sessions and also identify and investigate malicious activity by the same individual using different IP addresses and authorized sessions. However, Account Defender and other new features provided by reCAPTCHA Enterprise are still difficult to fine-tune, meaning that reCAPTCHA Enterprise alone is not enough to detect all malicious activity.
pixiv Inc. uses Cloudflare security solutions such as DDoS Protection and Web Application Firewall. These services help mitigate DDoS attacks at the network layer, detect attacks at the application layer, and control requests from specific IP addresses or, in certain cases, from specific networks.
We also use Cloudflare Bot Management to help protect against bots. However, we are not currently able to enable Cloudflare Bot Management across all of pixiv, because we see a high number of false detections on some networks, particularly mobile carrier networks. Through minimum viable testing, we have learned that if we enable bot management across the board, many legitimate users receive frequent erroneous verification challenges. False detections remain an open issue for us, but Cloudflare is an effective way to handle suspicious activity and also protects against high-volume attack attempts. It increases the cost incurred by the bad actors behind malicious automated crawling and lets us respond rapidly once a problem is detected.
We have implemented basic countermeasures such as rate limits on the HTTP API for searching and for retrieving work details on pixiv, which increase the time cost of acquiring large volumes of data. We have also prohibited access to some endpoints for unauthorized sessions in order to restrict requests from unverified users.
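To make the idea of a rate limit concrete, the following is a minimal sketch of a token bucket limiter keyed by session or IP address. It is an illustration only and not pixiv's actual implementation: the class names, the `capacity` and `rate` parameters, and the choice of the token bucket algorithm are all assumptions made for this example.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` tokens per second."""

    def __init__(self, capacity: float, rate: float, now: Optional[float] = None):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to the time elapsed since the last check.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class RateLimiter:
    """Keep one bucket per key (hypothetically a session ID or an IP address)."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.rate, now)
            self.buckets[key] = bucket
        return bucket.allow(now)
```

A caller would check `limiter.allow(session_id)` before serving a search or work-detail request and reject the request when it returns `False`. A limiter like this does not stop a determined crawler, but it raises the time needed to collect a large number of works.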
All search and work-detail requests are logged, then stored and analyzed in Google BigQuery, which enables us to detect and take measures against malicious accounts.
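The kind of analysis this enables can be sketched in a few lines. The example below is plain Python over an in-memory list rather than a BigQuery query, and everything in it is an assumption for illustration: the `AccessLog` fields, the `flag_heavy_accounts` function, and the simple "more than `threshold` requests in a time window" rule. Real detection would likely combine many more signals.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class AccessLog(NamedTuple):
    """One logged request (hypothetical schema)."""
    account_id: str
    endpoint: str
    timestamp: float  # seconds since some epoch


def flag_heavy_accounts(
    logs: Iterable[AccessLog],
    window_start: float,
    window_end: float,
    threshold: int,
) -> List[str]:
    """Return accounts with more than `threshold` requests in [window_start, window_end)."""
    counts = Counter(
        log.account_id
        for log in logs
        if window_start <= log.timestamp < window_end
    )
    return sorted(account for account, n in counts.items() if n > threshold)
```

Accounts returned by such a rule would be candidates for closer investigation, not automatic proof of abuse, since legitimate heavy users can also exceed a simple threshold.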
robots.txt and control using HTTP referers
In order to prevent images from being leaked to outside sites against the wishes of the creator, we have instituted access controls using robots.txt and referers. This method is effective against conventional crawler programs, but it cannot prevent all malicious attacks.
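A referer-based check of the kind described above can be sketched as follows. This is a simplified illustration, not pixiv's actual configuration: the host names in `ALLOWED_REFERER_HOSTS` are placeholders and the function name is hypothetical. Because a client can send any `Referer` header it likes, a check like this only deters conventional crawlers and hotlinking, which matches the limitation noted above.

```python
from typing import Optional, Set
from urllib.parse import urlsplit

# Placeholder hosts for illustration; not real pixiv settings.
ALLOWED_REFERER_HOSTS: Set[str] = {"www.example.com", "example.com"}


def is_allowed_referer(
    referer: Optional[str],
    allowed_hosts: Set[str] = ALLOWED_REFERER_HOSTS,
) -> bool:
    """Return True only if the Referer header points at an allowed host."""
    if not referer:
        return False  # a missing or empty Referer is rejected in this sketch
    try:
        host = urlsplit(referer).hostname
    except ValueError:
        return False  # malformed URL
    # Compare the whole host name so that look-alike hosts such as
    # "www.example.com.evil.test" do not pass a substring match.
    return host is not None and host.lower() in allowed_hosts
```

An image server using this sketch would answer with an error status instead of the image when `is_allowed_referer(request_referer)` returns `False`.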
pixiv Inc. takes many other security measures in addition to those detailed above, but identifying every malicious data acquisition request is essentially impossible. We therefore cannot currently guarantee that all malicious requests will be blocked.
New measures being introduced
In addition to the countermeasures introduced above, pixiv Inc. is currently working on new measures for preventing large-scale data acquisition for malicious purposes and other forms of malicious activity on our platforms.
Detection of fraud using data infrastructure strategy
pixiv Inc. is working to build a cross-platform data infrastructure strategy for all of its services using Google BigQuery and Looker; using this, we will be able to track accounts that repeatedly engage in suspicious activity.
Going forward, we will continue to improve the precision, scope, and real-time efficacy of this data infrastructure strategy in order to further boost our ability to detect malicious activity.
Similar image detection
We are currently conducting technical research on similar image detection for the purpose of automatically detecting similar images posted to our platforms. pixiv already uses this kind of technology to block spam and other malicious activity by blocking the posting of images that have been posted previously and identified as malicious.
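This article does not describe which algorithm pixiv uses for similar image detection. As a general illustration of the idea, the sketch below uses average hashing, a well-known perceptual hashing technique chosen here purely as an example. It assumes the image has already been converted to grayscale and downscaled to a small grid (for example 8x8) by an image library, and the function names and the `max_distance` threshold are hypothetical.

```python
from typing import List, Sequence


def average_hash(pixels: Sequence[Sequence[float]]) -> int:
    """Hash a small grayscale grid: one bit per pixel, set if the pixel is above the mean."""
    flat: List[float] = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bit positions where the two hashes differ."""
    return bin(hash_a ^ hash_b).count("1")


def is_similar(hash_a: int, hash_b: int, max_distance: int = 5) -> bool:
    """Treat two images as similar when their hashes differ in few bits."""
    return hamming_distance(hash_a, hash_b) <= max_distance
```

With a scheme like this, the hash of a newly posted image could be compared against the hashes of images already identified as malicious. Small changes such as a uniform brightness shift leave the hash unchanged, while heavier edits can defeat it, which is one reason a simple hash is suited only to clear-cut cases.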
The current precision level of this technology only allows us to block images that have already been clearly identified as spam, but we’re looking into whether certain improvements to the technology may allow us to use it for other purposes as well.
Monitoring using machine learning
While images posted to our platforms are monitored by our monitoring team, we are working on developing a machine-learning moderation system that will improve the speed and accuracy of image moderation. We already use a product supplied by Hive for image moderation, but using a purely automated approach for image moderation is still not feasible due to the existence of a number of potential methods for avoiding detection and other limits of the technology. However, we are continuing to conduct technical research aimed at improving this technology for possible future use.
We are also looking into other methods for improving the security of our platform, in addition to the strategies discussed above. However, as previously stated, it is very difficult to block all malicious requests and there is currently no available technology capable of fully addressing the root of this specific problem.
The efficacy of these technological measures is limited by definition and in many cases detection of malicious activity resembles a game of cat-and-mouse. Furthermore, not all of the measures discussed above have been fully implemented at the time of publication of this article, and we fully acknowledge that there are many ways in which our countermeasures need improvement. However, despite this reality, we consider it our responsibility to continue to do what we can to verify and incorporate the most effective technology we can find to boost the security of our platform.
While no single perfect solution exists, pixiv Inc. will continue to actively conduct research and develop its services in order to keep improving safety and security for its users.