UNIVERSITY PARK, Pa. — A search engine that uses artificial intelligence (AI) to “read” millions of online documents could help privacy researchers find the ones that deal with online privacy. The team that designed the search engine suggests it could become an important tool for researchers looking for ways to design a safer Internet.
Rather than trying to search for privacy documents themselves, researchers could enter their queries into the search engine to efficiently identify and collect relevant documentation.
Ultimately, however, the search engine could help researchers better understand online privacy in general and examine trends in online privacy over time, which could one day lead to an Internet that users can browse more safely and securely, according to Shomir Wilson, assistant professor of information science and technology at Penn State and an Institute of Computer and Data Sciences affiliate.
“This can be a resource for natural language processing and privacy researchers interested in this area of text,” Wilson said. “Given large volumes of text like this, we can find ways to automatically identify and label certain data practices that people might be interested in, which then allows us to build tools to help users understand online privacy.”
He added that finding and classifying privacy literature without machine learning would be time-consuming and difficult, if not impossible.
According to Wilson, a better understanding of privacy information is needed because this type of documentation is largely ignored by regular users.
“Most websites present you with information about their data practices, and then you’re expected to give your consent by browsing and reading all that information,” Wilson said. “But nobody really does it because it’s not practical and it doesn’t fit the way people use the internet. People usually don’t have the legal knowledge either.”
The privacy policies were collected by the search engine, PrivaSeer, during two separate web crawls. A web crawl is the systematic, large-scale browsing of the Internet by software. The first crawl took place in July 2019 and the second in February 2020.
The PrivaSeer database now includes around 1.4 million English-language website privacy policies.
“One thing that’s distinct about our database is that we have the biggest snapshot in time of online privacy,” Wilson said.
Soundarya Nurani Sundareswara, former graduate student in information science and technology, currently a software engineer at Apple, and C. Lee Giles, David Reese Professor at the College of Information Sciences and Technology, both of Penn State, worked with Wilson and Srinath on the project.
The team published their findings in the proceedings of the International Web Engineering Conference.