A search engine could help researchers scour the Internet for privacy documents


UNIVERSITY PARK, Pa. — A search engine that uses artificial intelligence (AI) to “read” millions of online documents could help privacy researchers find those related to online privacy. The researchers who designed the search engine suggest it could be an important tool for researchers trying to find ways to design a safer Internet.

In a study, the researchers said the search engine, which they dubbed PrivaSeer, uses a type of AI called natural language processing (NLP) to identify online privacy documents, such as privacy policies, terms of use, cookie policies, privacy bills and laws, regulatory guidelines and other related texts on the web.

Rather than trying to search for privacy documents themselves, researchers could enter their queries into the search engine to efficiently identify and collect relevant documentation.

Ultimately, the search engine could help researchers better understand online privacy in general and examine trends in online privacy over time, which could one day lead to an Internet where users can browse more safely and securely, according to Shomir Wilson, assistant professor of information sciences and technology at Penn State and an Institute for Computational and Data Sciences affiliate.

“This can be a resource for natural language processing and privacy researchers interested in this area of text,” Wilson said. “Given large volumes of text like this, we can find ways to automatically identify and label certain data practices that people might be interested in, which then allows us to build tools to help users understand online privacy.”

NLP combines linguistics, computer science and AI to program computers to process and analyze large amounts of text. In this case, the researchers used NLP to gather privacy policy documents from around the web, according to Mukund Srinath, a doctoral student in information science and technology and first author of the study.

“The NLP approach can tell the difference between privacy policy documents and non-privacy policy documents based on certain words that appear in the text,” Srinath said. “Intuitively, you might think that privacy policies might contain some words that non-privacy policies don’t, like data protection and privacy, which are just some of the common words. With the NLP approach, you could say that the algorithm learns to recognize the difference between these two different types of documents.”
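To make the idea of learning from words concrete, here is a minimal Python sketch of a word-based document classifier. The article does not say which model PrivaSeer uses, so this swaps in a simple Naive Bayes classifier as a stand-in. The function names, labels and the four toy training documents are all hypothetical and are not taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_docs):
    """Count word frequencies per label from (text, label) pairs."""
    word_counts = {}
    doc_counts = Counter()
    for text, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, doc_counts, vocab


def classify(text, model):
    """Return the label with the highest Naive Bayes log score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        counts = word_counts[label]
        total_words = sum(counts.values())
        # Start from the log prior: how common this label is overall.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids log(0).
                score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy training data, for illustration only.
TOY_DOCS = [
    ("We collect personal data and protect your privacy", "privacy"),
    ("This policy explains data protection and cookies", "privacy"),
    ("Our store sells shoes and offers free shipping", "other"),
    ("Read the latest sports news and match scores", "other"),
]

model = train(TOY_DOCS)
print(classify("How we handle data protection and privacy", model))
```

With these toy documents, words such as "data", "protection" and "privacy" appear only in the privacy examples, so text containing them scores higher for that label. A real system would need far more training data and a stronger model than this sketch.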

He added that finding and classifying privacy literature without machine learning would be time-consuming and difficult, if not impossible.

According to Wilson, a better understanding of privacy information is needed because this type of documentation is largely ignored by regular users.

“Most websites present you with information about their data practices, and then you’re expected to give your consent by browsing and reading all that information,” Wilson said. “But nobody really does it because it’s not practical and it doesn’t fit the way people use the internet. People usually don’t have the legal knowledge either.”

The PrivaSeer search engine collected the privacy policies during two separate web crawls. A web crawl is the systematic, large-scale browsing of the Internet by software. The first crawl took place in July 2019 and the second in February 2020.

The PrivaSeer database now includes around 1.4 million English language website privacy policies.

“One thing that’s distinct about our database is that we have the biggest snapshot in time of online privacy,” Wilson said.

Soundarya Nurani Sundareswara, a former graduate student in information sciences and technology who is currently a software engineer at Apple, and C. Lee Giles, David Reese Professor at the College of Information Sciences and Technology, both of Penn State, worked with Wilson and Srinath on the project.

The team presented their findings at the International Conference on Web Engineering.
