Inside Russia’s internet monitoring. How the censorship agency tracks online activity with the help of tech companies
8 March 2023, 12:02

Illustration: Sonya Vladimirova / Mediazona

Cyberpartisans, a hacker collective from Belarus, gained access to the internal network of the General Radio Frequency Centre (GRFC), a subsidiary organization of Roskomnadzor, the Russian censorship agency.

The hackers claim to have encrypted employees’ computers, disrupted the internal network, and downloaded roughly 1.2 terabytes of data, including the internal mail server archive, internal file storage, data from various internal systems, and data from the FalconGaze employee monitoring system.

The archives of the correspondence, numbering approximately 1.5 million emails and 200,000 text documents, spreadsheets, and presentations, date primarily from 2020 to 2022.

Mediazona was granted access to these archives, which expose how Roskomnadzor has been using neural networks to monitor the Russian internet, and how non-state actors have provided their help and infrastructure for this endeavor.

- The correspondence reveals Roskomnadzor’s large-scale plans to surveil the Russian internet using neural networks. Most of these technologies are already in use; they search not only for images about suicide but also, for example, for posts about the war in Ukraine.

- The largest project is called Clean Internet. As its developers envision it, the system should monitor 100% of the Russian segment of the internet.

- To collect data, Clean Internet uses Yandex’s search API. At Roskomnadzor’s request, Yandex raised the agency’s daily request limit.

- In addition, GRFC used Yandex’s Toloka platform to train neural networks. The extent of Yandex’s involvement in cooperation with Roskomnadzor is unclear; the company denies giving the agency any preferential treatment.

- Among those that actively cooperated with GRFC are the Moscow Institute of Physics and Technology (MIPT) and the company Brand Analytics. The latter’s technology helped GRFC compile hundreds of reports running to millions of pages.

- Two more systems using artificial intelligence were being built to analyze video content automatically (at present agency staff watch all broadcasts themselves), also to search for “prohibited information.”

Search and censor. Yandex and the Clean Internet

The archive obtained by Mediazona reveals a connection between Yandex and the efforts to implement a Clean Internet in Russia. It reveals that more than 680 emails mentioning Yandex’s corporate email addresses were sent between 2014 and 2022. Some of the emails are standard communications between the IT giant and Roskomnadzor officials justifying why certain pages (Yandex search result pages, URL shortening service click.ru or Yandex Turbo, a service similar to Google’s AMP) should not be blocked.

There were also a few in-person meetings between Yandex representatives and Roskomnadzor officials, as mentioned by Anastasia Volkova, Acting Head of the Automation Department of the Media Division of GRFC, in her correspondence with colleagues. At one of these meetings, according to Volkova, the Yandex representatives were “advising us [GRFC] on neural networks.” However, Mediazona was unable to find any traces of this collaboration; in all likelihood she was referring to their joint participation in an industry conference.

The Clean Internet flowchart revealing how Roskomnadzor was using Yandex’s technology. Screenshot: Mediazona

Volkova also mentioned that during these meetings, Yandex employees described their internet search API, Yandex.XML, and allegedly promised to remove the request limit for Roskomnadzor.

This promise came in handy. In 2020, Roskomnadzor began developing Clean Internet, a system designed to replace an older automatic crawler that searched for “prohibited” content. The new system was to rely on neural networks, not keyword dictionaries.

The head of the department that maintains registers of prohibited information, Ivan Zuev, wrote in May 2020 that “GRFC’s efficiency with regard to social networks is low,” with only child pornography and “suicidal content” searches being fully automated.

Clean Internet was intended to crawl a list of predetermined sources and social networks and use neural networks to pinpoint violations such as extremism, terrorism, calls for participation in rallies, “propaganda of non-traditional relationships,” insults to state symbols, and others.

The slides promise that Clean Internet (AS CHI) would cover 100% of the Russian-language internet—excluding streaming services which were to be handled by another system, AS MAVR—once at full capacity.

The main obstacle was crawling itself: the agency needed help from search engines. In May 2020, Anastasia Volkova decided to reach out to Alexander Kraynov, Yandex’s Artificial Intelligence Development Director, to ask about the API’s request limit, which allowed only 1,000 requests per day.

Later, Volkova clarified that Roskomnadzor planned to use this API to “monitor the internet for violations of federal law.” At this stage, Yandex declined. In her message to colleagues, Volkova mentioned that the company cited its inability to provide extended access for free, and that commercial access would require not only payment but also traffic exchange, while Roskomnadzor had no traffic to offer on its own websites.

The correspondence suggests that GRFC considered other search engines, such as Rambler, Google, or Sputnik, as options, but ultimately dismissed them. One report explains: Google is paid, Rambler is the same as Yandex, and Sputnik has not been indexed for several years.

There is no further correspondence with Yandex regarding the API in the archive. It is likely that Roskomnadzor took over direct communication. In December 2020, Volkova wrote to Yandex again, referred by Evgeny Zaitsev, head of Roskomnadzor’s Department of Control and Supervision of Electronic Communications. Previously, the discussion had been about increasing the limit from 1,000 to 100,000 requests per day, but now officials were requesting 300,000 requests for two accounts.

In 2021 (we were unable to determine the exact date), Yandex finally succumbed to Roskomnadzor’s pressure. The company raised the request limit for the accounts to 300,000 per day, as noted in GRFC reports.

Yandex’s search became a key component for Clean Internet. Another tool is a social media crawler developed by Vector Iks LLC. It combs through VKontakte (VK), Odnoklassniki (OK), Moi Mir, Otvety Mail.ru, LiveJournal, and—partially—Telegram and YouTube. In 2023, GRFC plans to add Facebook, Instagram, Twitter, TikTok, Yandex.Dzen, and Rutube to the list.

Yandex’s API is mentioned in reports on Clean Internet’s deployment from January 2022; it is likely still in use. The addition of Mail.ru search is planned for 2023, Google for 2024.

On February 25, 2022, just one day after the start of the war, Clean Internet was tasked with searching for posts and comments containing “calls to illegal rallies related to the situation in Ukraine.”

Another service of interest to Roskomnadzor was Yandex Toloka, a crowdsourcing platform that helps label data for machine learning purposes.

Customers make contracts with Yandex and upload tasks such as classifying images for model training. Tasks are distributed to human users registered with the service and rewarded by the client.

Roskomnadzor employed Toloka from autumn 2021 to February 2022 to annotate suicide-related content for the “Unified Analysis Module,” Clean Internet’s AI component. It’s unclear whether Yandex had any agreement with Roskomnadzor to allow only the agency’s employees to do certain tasks; that option is available in Toloka In-House, launched in 2022.

Yandex told Mediazona that they never enabled In-House mode specifically for Roskomnadzor.

Another facet of the Clean Internet project is the bot farm developed by GRFC. According to the plans mentioned in the emails, the final version is set to be presented in May 2023.

This would not be a usual bot farm: fake accounts are used not to post online but rather to comb through messages on social media, including those posted to closed groups and communities.

‘Information Tension Points’: Vepr, Oculus and MIPT

It would be an overstatement to call Yandex a company that aided in the establishment of a system of control over the Russian internet. The IT giant granted access to two services to the General Radio Frequency Centre (GRFC), and according to the correspondence, it did not comply with the request on the first attempt. However, there are also those who collaborated closely with Roskomnadzor and developed standalone services for the agency.

In September 2021, journalists discovered two GRFC contracts published on the government procurement website. One contract was for the development of a proposal for the Oculus image and video analysis system, while the other was for a proposal for the more comprehensive Vepr system. Both contracts were awarded to the Moscow Institute of Physics and Technology (MIPT), with the Vepr contract valued at 10 million rubles and the Oculus contract valued at 14 million rubles.

This Vepr presentation slide discusses “psychopathological over-valued obsessions,” which are different from typical interests because they involve an unusual object or activity that most people would not find interesting. Examples include collecting “boogers” or “counting the number of windows in houses.” These obsessions pose a danger to society because they challenge traditional norms and can lead to negative attitudes towards power, the slide insists. The people expressing themselves this way are characterized as “driven to philosophical intoxication.”

Screenshot from GRFC and Roskomnadzor presentations / Mediazona

In dozens of reports and development plans, GRFC refers to Vepr as a critical priority. The system is required to monitor and even predict so-called “information tension points.”

The Vepr system is similar in concept to Clean Internet, which involves the collection of online content and analysis using artificial intelligence. However, the emphasis of Vepr is not just on the identification of content, but on its in-depth analysis. This includes developing specific scenarios that GRFC operators can input into the system. A similar project was carried out by RTI, a joint stock company, for the Ministry of Defense, valued at 1.5 billion rubles. It was described as “largely similar to the Vepr information system” in the context of countering information attacks.

The scientific rationale for Vepr was prepared by the Department of Machine Learning and Digital Humanities at MIPT. Dozens of employees worked on the document, which includes references to philosophers such as Niccolo Machiavelli and José Ortega y Gasset, memes featuring Vladimir Putin and Joseph Goebbels, as well as mathematical principles related to language models.

MIPT also paid great attention to the classification of “information tension points.” In a poorly structured 500-page document prepared by the institute, all possible threats are listed in a haphazard manner: terrorism and extremism, criticism of authorities and non-systemic opposition, “LGBT propaganda,” “child-free,” drug addiction, draft evasion, “death groups,” “offensive art stunts,” Gene Sharp methods [of nonviolent action], and even “collecting one’s own boogers or trimmed nails.”

The development of Vepr was not entrusted to MIPT; instead, the contract was awarded to a company called NeoBIT based in St. Petersburg.

Oculus. Screenshot from GRFC and Roskomnadzor presentations / Mediazona

Another proposal developed at MIPT relates to Oculus, an artificial intelligence system for detecting prohibited content in videos and images. The GRFC rationale laments that agency employees are currently required to manually review content, which is impossible due to the enormous amount of information.

In a presentation to officials, MIPT described its capabilities for facial recognition on images (including masked faces), recognizing text in images, and classifying images and videos into categories such as protests, suicidal content, “roofing and zatseping,” and banned logos and symbols. According to an example in the presentation, the neural network recognized the NATO emblem as a symbol of the criminal underworld.

One MIPT document lists similar systems that could be purchased for “insurance.” For example, a “banned content” search system was developed by OKAS LLC for the Center for the Study and Network Monitoring of the Youth, and for facial recognition, MIPT recommended analogues from the same OKAS LLC, NtechLab, VisionLabs, the State Research Institute of Aviation Systems, and the Department of Information Technology of Moscow.

In August 2022, Execution RD LLC was awarded the right to develop Oculus at a cost of 57.7 million rubles. The deadline for completion was set for December 2022. As reported by Kommersant daily, the company had not previously worked as a contractor in government procurement.

Brand Analytics and thousands of pages of reports

Another major company whose services are actively used by GRFC is Brand Analytics (Palitrumlab LLC).

On its website, Brand Analytics claims to be a leader in monitoring and analyzing social media and news media. Its areas of work include brand analysis, finding mentions, working with the audience, and responding to user feedback. It serves large Russian companies, banks, and “government bodies, ministries and agencies.”

Internal access to Brand Analytics / Mediazona

The requests made by GRFC to Brand Analytics bear a striking resemblance to what Roskomnadzor planned for the Clean Internet. Clients of BA can search for publications using keywords and receive reports with detailed statistics, citation indices, audience analysis, and a tone assessment of the publication. In addition to social media, BA also analyzes media outlets, including newspaper scans, broadcast transcripts, and paid news agency feeds.

The use of Brand Analytics was first mentioned in GRFC correspondence in December 2021, and a month later, the agency issued its first detailed report on the use of the system.

The Brand Analytics administrator panel that GRFC uses to search for content on the internet. Screenshot from presentations / Mediazona

The report states that GRFC paid the maximum tariff, which allows uploading up to 5 million pieces of content per month. Among the topics of interest to GRFC were analyzing daily protest sentiment at the federal and regional levels, searching for negativity towards Vladimir Putin, the Shanghai Cooperation Organization, the Eurasian Economic Union, and BRICS, reports on “Cossacks” and the “Echo of Moscow” radio station, “distortion of the history of the Great Patriotic War,” and “LGBT propaganda.”

Separately in the report, “urgent” topics are mentioned, “requests for which arose in Telegram chats,” but their content is not disclosed.

Since the start of the Russian invasion of Ukraine, GRFC has been using Brand Analytics to search for calls to anti-war protests and “fake news” about the military. The government deems fake all reports about killings of civilians or destruction of social infrastructure.

Separate topics are “Fake Putin’s arrest” and “Fake Patriarch Kirill called for an end to the war.”

In October 2022, the list of topics was expanded to include prisoners of war, mobilization, “conspiracy theories related to superstitions and predictions,” nuclear war, “the critical health condition of Russian President Vladimir Putin,” and “the overall crisis of the Russian economy.”

In the correspondence, one can find several thousand summary reports on topics, including daily reports. They are Excel spreadsheets in which all the publications found on the topics and their statistics are collected. The texts of posts, their analysis, such as tone and aggression, the number of reposts and likes, as well as information about the author of the post (name, city or region, and age indicated in the profile) are given in full.

Dualism and MAVR

GRFC has two more projects related to artificial intelligence, which are more modest in terms of tasks and scope. The first is the “Automated System for Monitoring Audiovisual Resources” (AS MAVR).

AS MAVR is supposed to be responsible for searching for prohibited information in films and TV series on streaming services. The system was developed in 2021 by E.Soft, a long-time contractor of the GRFC. For more about the company, see this report by Meduza.

In the proposal for the development of MAVR, it is stated that currently, the GRFC employees watch TV shows and broadcasts themselves, hoping to find any violations. The AS MAVR is supposed to relieve them of this work, but it is still unclear whether it is functioning.

In 2021, AS MAVR was only able to collect metadata for films through the public APIs of IMDB and Kinopoisk. In 2022, the team started to refine the system, and one of the main tasks was full automation and transferring content into the “Unified Analysis Module,” where AI will search for prohibited information. Mediazona was unable to find any traces of the new version of this system.

Another video content-related project of the GRFC is Dualism. The agency wants to search for deepfakes using neural networks; the work is financed by the Foundation for Advanced Research. In the documents, the employees emphasize the danger of deepfakes and the prospects of countering them. The system itself has not yet been developed.

Editor: Dmitry Treschanin
