Companies are waging an invisible data war online. And your phone might be an unwitting soldier.
Retailers from Amazon and Walmart to tiny startups want to know what their competitors charge. Brick and mortar retailers can send people, sometimes called "mystery shoppers," to their competitors' stores to make notes on prices.
Online, there's no need to send people anywhere. But big retailers can sell millions of products, so it's not feasible to have workers browse each item and manually adjust prices. Instead, the companies employ software to scan rival websites and collect prices, a process called “scraping.” From there, the companies can adjust their own prices.
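To make the idea concrete, here is a minimal sketch of the extraction step in Python, using only the standard library. It assumes a hypothetical product page that marks its price with an element whose class is "price"; real sites vary widely, and the class name and the `extract_prices` helper are illustrative assumptions, not any retailer's actual markup or any company's actual scraper.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collects numeric prices from elements with class "price" (an assumed convention)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        self._in_price = "price" in classes

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if not self._in_price:
            return
        text = data.strip().lstrip("$").replace(",", "")
        try:
            self.prices.append(float(text))
        except ValueError:
            pass  # not a number; ignore it


def extract_prices(html):
    """Return every price found in an HTML string."""
    parser = PriceParser()
    parser.feed(html)
    return parser.prices
```

A production scraper would also have to fetch pages, handle currencies, and cope with markup that changes without notice; this sketch covers only the parsing.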
Companies like Amazon and Walmart have internal teams dedicated to scraping, says Alexandr Galkin, CEO of the retail price optimization company Competera. Others turn to companies like his. Competera scrapes pricing data from across the web, for companies ranging from footwear retailer Nine West to industrial outfitter Deelat, and uses machine-learning algorithms to help its customers decide how much to charge for different products.
Walmart didn’t respond to a request for comment. Amazon didn’t answer questions about whether it scrapes other sites. But the founders of Diapers.com, which Amazon acquired in 2010, accused Amazon of using such bots to automatically adjust its prices, according to Brad Stone's book The Everything Store.
Scraping might sound sinister, but it’s part of how the web works. Google and Bing scrape web pages to index them for their search engines. Academics and journalists use scraping software to gather data. Some of Competera’s customers, including Acer Europe and Panasonic, use the company’s “brand intelligence” service to see what retailers are charging for their products, to ensure that they are complying with pricing agreements.
For retailers, scraping can be a two-way street, and that’s where things get interesting. Retailers want to see what their rivals are doing, but they want to prevent rivals from snooping on them; retailers also want to protect intellectual property like product photos and descriptions, which can be scraped and reused without permission by others. So many deploy defenses to subvert scraping, says Josh Shaul, vice president of web security at Akamai Technologies. One technique: showing different prices to real people than to bots. A site may show the price as astronomically high or zero to throw off bots collecting data.
Such defenses create opportunities for new offenses. A company called Luminati helps customers, including Competera, mask bots to avoid detection. One service makes the bots appear to be coming from smartphones.
Luminati’s service can resemble a botnet, a network of computers running malware that hackers use to launch attacks. Rather than covertly take over a device, however, Luminati entices device owners to accept its software alongside another app. Users who download MP3 Cutter from Beka for Android, for example, are given a choice: View ads or allow the app to use “some of your device’s resources (WiFi and very limited cellular data).” If you agree to let the app use your resources, Luminati will use your phone for a few seconds a day when it’s idle to route requests from its customers’ bots, and pay the app maker a fee. Beka didn’t respond to a request for comment.
The ongoing battle of bot and mouse raises a question: How do you detect a bot? That’s tricky. Sometimes bots actually tell the sites they’re visiting that they’re bots. When a piece of software accesses a web server, it sends a little information along with its request for the page. Conventional browsers announce themselves as Google Chrome, Microsoft Edge, or another browser. Bots can use this process to tell the server that they’re bots. But they can also lie. One technique for detecting bots is to measure how often a visitor hits a site. If a visitor makes hundreds of requests per minute, there’s a good chance it’s a bot. Another common practice is to look at a visitor’s internet protocol address. If the address belongs to a cloud computing service, for example, that’s a hint that the visitor might be a bot and not a regular internet user.
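The three signals described here (a self-identifying browser string, request frequency, and the origin of the IP address) can be combined into a simple checker. The Python sketch below is illustrative only: the token list, the 100-requests-per-minute threshold, and the "cloud" address range (a reserved documentation block used as a placeholder) are all assumptions, not values used by Akamai or any other vendor.

```python
import ipaddress

# Placeholder "cloud provider" range: an RFC 5737 documentation block, not a real provider.
CLOUD_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]
BOT_UA_TOKENS = ("bot", "crawler", "spider")  # assumed markers in a User-Agent string
MAX_REQUESTS_PER_MINUTE = 100  # assumed threshold


def bot_signals(user_agent, ip, request_times, now):
    """Return the names of the heuristics that fired for one visitor.

    request_times and now are timestamps in seconds.
    """
    signals = []

    # 1. The client says it is a bot (it could also be lying in either direction).
    if any(token in user_agent.lower() for token in BOT_UA_TOKENS):
        signals.append("self-identified")

    # 2. Too many requests in the last 60 seconds.
    recent = [t for t in request_times if 0 <= now - t <= 60]
    if len(recent) > MAX_REQUESTS_PER_MINUTE:
        signals.append("high-frequency")

    # 3. The address belongs to a known datacenter range.
    address = ipaddress.ip_address(ip)
    if any(address in network for network in CLOUD_NETWORKS):
        signals.append("cloud-ip")

    return signals
```

Each signal is only a hint. A bot routed through a phone on a home network, for instance, would pass the address check while sending an ordinary browser string.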
Shaul says that techniques like disguising bot traffic have made it “almost useless” to rely on an internet address. Captchas can help, but they create an inconvenience for legitimate users. So Akamai is trying something different. Instead of simply looking for the common behaviors of bots, it looks for the common behaviors of humans and lets those users through.
When you tap a button on your phone, you move the phone ever so slightly. That movement can be detected by the phone's accelerometer and gyroscope, and sent to Akamai's servers. The presence of minute movement data is a clue that the user is human, and its absence is a clue that the user might be a bot.
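One way to picture the motion check is as a test for any variation in a short run of accelerometer readings. The Python sketch below is an assumption-laden illustration, not Akamai's method: the `looks_handheld` helper, the sample format of (x, y, z) tuples, and the variance threshold are all hypothetical.

```python
from statistics import pvariance

MIN_VARIANCE = 1e-6  # assumed threshold; a real system would tune this


def looks_handheld(samples):
    """Return True if any axis of the (x, y, z) readings shows small movement.

    A device lying perfectly still, or a script that sends no sensor data,
    produces no variation and returns False.
    """
    if len(samples) < 2:
        return False
    # zip(*samples) regroups the readings into one sequence per axis.
    return any(pvariance(axis) > MIN_VARIANCE for axis in zip(*samples))
```

The absence of movement is only a clue, as the article notes: a phone resting on a table would also fail this check.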
Luminati CEO Ofer Vilenski says the company doesn't offer a way around this yet, because it's a relatively uncommon practice. But Shaul thinks it's only a matter of time before bot makers catch on. Then it will be time for another round of innovations. So goes the internet bot arms race.
Good Bots and Bad Bots
One big challenge for Akamai and others trying to manage bot-related traffic is the need to allow some, but not all, bots to scrape a site. If websites blocked bots entirely, those sites wouldn't show up in search results. Retailers also generally want their pricing and items to appear on shopping comparison sites like Google Shopping and PriceGrabber.
"There's really a lot of different scenarios where scraping is used on the internet for good, bad, or somewhere in the middle," Shaul says. "We have a ton of customers at Akamai who have come to us to help us manage the overall problem of robots, rather than humans, visiting their site."
Some companies scrape their own sites. Andrew Fogg is the co-founder of a company called Import.io, which offers web-based tools to scrape data. Fogg says one of Import.io's customers is a large retailer that has two inventory systems, one for its warehouse operations and one for its e-commerce site. But the two systems are frequently out of sync. So the company scrapes its own website to look for discrepancies. The company could integrate its databases more closely, but scraping the data is more cost effective, at least in the short term.
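The reconciliation step can be sketched as a comparison of two sets of stock counts keyed by product ID. In the Python example below, the dictionaries stand in for data pulled from the warehouse system and scraped from the website; the `find_discrepancies` helper and the SKU format are hypothetical, since the article does not describe how the retailer actually does this.

```python
def find_discrepancies(warehouse, website):
    """Compare stock counts keyed by SKU.

    Returns {sku: (warehouse_count, website_count)} for every SKU where the
    two systems disagree. A count of None means the SKU is missing from
    that system.
    """
    issues = {}
    for sku in sorted(warehouse.keys() | website.keys()):
        warehouse_count = warehouse.get(sku)
        website_count = website.get(sku)
        if warehouse_count != website_count:
            issues[sku] = (warehouse_count, website_count)
    return issues
```

For example, a warehouse record of `{"A1": 5, "B2": 0}` checked against a scraped `{"A1": 5, "B2": 3}` would flag only `B2`, an item the site still lists as available after the warehouse has run out.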
Other scrapers live in a gray area. Shaul points to the airline industry as an example. Travel price-comparison sites can send business to airlines, and airlines want their flights to show up in the search results for those sites. But many airlines rely on outside companies like Amadeus IT and Sabre to manage their booking systems. When someone looks up flight information on an airline’s site, the airline must sometimes pay a fee to the booking system. Those fees can add up if a large number of bots are constantly checking an airline’s seat and pricing information.
Shaul says Akamai helps solve this problem for some airline customers by showing bots cached pricing information, so that the airlines aren’t querying outside companies every time a bot checks prices and availability. The bots won’t get the most up-to-date information, but they’ll get reasonably fresh data without costing the airlines much.
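The caching idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Akamai's implementation: the `FareCache` class, the five-minute lifetime, and the `fetch_live` callback (standing in for a paid lookup against the booking system) are all hypothetical.

```python
import time


class FareCache:
    """Serves bots a recent cached price; humans always get a live lookup."""

    def __init__(self, fetch_live, ttl_seconds=300, clock=time.monotonic):
        self._fetch_live = fetch_live  # hypothetical paid call to the booking system
        self._ttl = ttl_seconds        # assumed freshness window
        self._clock = clock
        self._entries = {}             # route -> (timestamp, price)

    def get(self, route, is_bot):
        now = self._clock()
        entry = self._entries.get(route)
        # Bots get the cached price as long as it is still fresh.
        if is_bot and entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        # Otherwise pay for a live lookup and refresh the cache.
        price = self._fetch_live(route)
        self._entries[route] = (now, price)
        return price
```

With this design, a thousand bot requests for the same route within five minutes trigger at most one paid lookup, and every lookup made on behalf of a human also refreshes what the bots see.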
Other traffic, however, is clearly problematic, such as distributed denial-of-service, or DDoS, attacks, which aim to overwhelm a site by flooding it with traffic. Amazon, for example, doesn’t block bots outright, including price scrapers, a spokesperson says. But the company does “prioritize humans over bots when needed to ensure we are providing the shopping experience our customers expect from Amazon.”
Fogg says Import.io doesn't get blocked much. The company tries to be a "good citizen" by keeping its software from hitting servers too often or otherwise using a lot of resources.
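Being a "good citizen" usually comes down to spacing out requests. The Python sketch below shows one common approach, a minimum delay between requests to the same host. It is not Import.io's code; the `PoliteThrottle` class and the interval are assumptions chosen for illustration.

```python
import time


class PoliteThrottle:
    """Enforces a minimum gap between successive requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self._min_interval = min_interval  # seconds between requests (assumed value)
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time the last request was allowed to go out

    def wait(self, host):
        """Block until it is polite to contact host; return the seconds waited."""
        now = self._clock()
        last = self._last.get(host)
        delay = 0.0
        if last is not None:
            delay = max(0.0, self._min_interval - (now - last))
        if delay:
            self._sleep(delay)
        self._last[host] = now + delay
        return delay
```

A scraper would call `throttle.wait(host)` before each request. Hosts are tracked separately, so slowing down for one site does not delay requests to another.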
Vilenski says Luminati's clients have good reasons to pretend not to be bots. Some publishers, for example, want to make sure advertisers are showing a site’s viewers the same ads that they show to the publishers.
Still, the company's business model raised eyebrows in 2015 when a similar service from its sister company, Hola VPN, was used to launch a DDoS attack on the website 8chan. Earlier this month, Hola VPN’s Chrome extension was accused of being used to steal passwords of users of the cryptocurrency service MyEtherWallet. In a blog post, Hola VPN said its Google Chrome Store account was compromised, allowing attackers to add malware to its extension. Vilenski says the company carefully vets its customers, with a process that includes a video call and steps to verify the potential customer’s identity. He declined to comment on alleged malicious uses of Luminati’s service. Controversial or not, Vilenski says the company's business has tripled in the past year.