A Federal judge’s recent injunction barring LinkedIn from stopping bots that access its public data could turn out to be a landmark moment for the commercial use of data on the net. It was a nuanced interpretation of the law, one that keeps up with the reality of the rapid webification of the real world. It is still early days for that specific lawsuit, but the direction and the arguments bode well for clarifying the legal gray area around scraping websites for commercial use.

Before we go into the questions of what is practical, legal and ethical web scraping of public data on the net, let us quickly clarify a few basic questions so that we are all on the same page.

What is public data on the internet?

The easy definition covers all sites that explicitly declare themselves open for public use, or any site that operates under open licenses like the Open Data Commons. But in the real world, any website or source that does not require registration and lets itself be indexed by search engines is also deemed public data. This is notwithstanding the terms & conditions on the site (which are usually templates and copy-paste jobs) that impose some limitations in the fine print. This is the big gray area that the LinkedIn injunction touched upon.

Can public data be used for commercial purposes?

The key question here is whether the public information is protected by copyright or otherwise falls within the ambit of intellectual property. If yes, the data clearly cannot be used for commercial purposes by others. But if the information is just factual data, it does not get any IP protection.

The second question relates to individual privacy. More and more governments are enacting laws to protect the private details of individuals, and rightly so. However, if the information is about businesses and is publicly accessible, it is deemed to be open. It can be argued that this is true for the public activities of individuals too. There are more nuances here, but for the moment this will do.

Can public data be accessed through electronic means (a.k.a. web scraping)?

The only contention against scraping is that it can put web servers under stress and affect the primary user experience. With technology improvements, most high-end sites have an elastic server design that can handle these additional traffic spikes, even when they come from (non-malicious) bots. However, there is certainly a cost element to this, and for the webmaster there is also the nuisance of website analytics being skewed by this unwelcome traffic.
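One way a well-behaved bot can limit that load is to honor the site's robots.txt rules and pause between requests. The sketch below illustrates the idea with Python's standard `urllib.robotparser`. The robots.txt content, the `example-bot` user agent, the example.com URLs and the helper names (`build_parser`, `polite_plan`, `crawl`) are all made up for illustration, and the one-second fallback delay is an assumption rather than any standard.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live download.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_plan(parser, user_agent, urls, default_delay=1.0):
    """Return the URLs this bot may fetch and the pause between requests."""
    # Use the site's Crawl-delay when it declares one, else the assumed fallback.
    delay = parser.crawl_delay(user_agent) or default_delay
    allowed = [url for url in urls if parser.can_fetch(user_agent, url)]
    return allowed, delay


def crawl(urls, delay, fetch, sleep=time.sleep):
    """Fetch URLs one at a time, pausing between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # throttle so the bot does not hammer the server
        results.append(fetch(url))
    return results
```

A caller would pass its own `fetch` function (for example, a thin wrapper around an HTTP client) to `crawl`. Keeping `fetch` and `sleep` as parameters also makes the throttling logic easy to exercise without touching the network.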

Coming back to the case itself, it was certainly counterintuitive that the start-up HiQ Labs sued LinkedIn. Here is a company that openly declares it accesses data from LinkedIn through anonymous, automated scraping, and arguably its use of the data could be detrimental to LinkedIn members’ interests. In this scenario, LinkedIn blocking its activity seemed natural and fair. However, HiQ questioned the very basis of the claim that the data is not open, and asserted the public’s right to access that data through bots under a legal framework.

On the publicness of the data, the argument was that it can’t be a crime to visit a public website, and anyone who types a URL into a browser should have access to the public information on the site. Prof. Laurence Tribe of Harvard University, arguing against LinkedIn, raised an interesting point: “If you exclude someone from sites like LinkedIn, Facebook and Twitter, you are excluding them from the modern version of the town square”. This is an important and valid argument because these social media sites are the conquerors in the winner-takes-all, ‘network effect’-fuelled monopolists’ game. In the wise words of Uncle Ben, ‘With great power comes great responsibility’, and so these social media sites cannot claim ownership of the town square.

The usage of data for commercial purposes by a third party becomes a moot point once the data is deemed public. However, in this case, the Judge also raised the question of whether LinkedIn is actually indulging in anti-competitive practices by restricting other businesses’ access to its public data. The other point noted was that LinkedIn could only showcase three actual member complaints about third parties accessing data, out of its hundreds of millions of members, including the fifty million who had signed up for no broadcasting of information. This calls into question the basic premise that members who published their data expect confidentiality, especially when LinkedIn and the other social media sites themselves reserve the right to use member data for commercial purposes and actively do so.

The more interesting aspect of this case is the injunction against the blocking of bots. So far, the general narrative has been that sites are within their rights to restrict access by automated bots, even if the bots follow ethical web scraping practices. But the narrative needs to be broader and include the right to public access of data, in keeping with the direction of an inclusive and open internet for every netizen of the world. So when some sites actively block traffic based on some criteria, or even go as far as feeding wrong information to bots as an offensive defense play, they could be the bad guys in the eyes of the law.

There is a lot more to be done on this subject, particularly around individuals’ privacy and security. It can be a tricky subject, and it will continue to evolve. However, enabling easier access to the enormous wealth of data available on the web can certainly help businesses leapfrog their digital transformation and contribute to the greater public good. A legal framework to share the data in an equitable way, either by making APIs available without restriction or by charging bots a reasonable access fee with built-in guidelines for access, could be a good start.

Both the current laws and the habit of seeing the new web world through the prism of earlier social and business models are flawed. It is a fact that law and ethics will always find it tough to keep pace with the speed of technological advancement. In this case, the arguments were based on a 1980s act, the Computer Fraud & Abuse Act (CFAA), while the reality of the internet has transformed enormously over the last 30 years. It will be important to have more conversations and to modify the legal frameworks to suit current realities and technological advancements. From that angle, the HiQ vs LinkedIn case is certainly a positive development.

  Karthik Karunakaran

  CEO & Co-founder