Turn the web into a database: An alternative to web crawling/scraping - Mixnode News Blog


After months of development, we are incredibly excited to announce that Mixnode enters private beta today, and we will start sending invitations to the awesome, patient people on the waiting list. Once you receive your invitation, you can create an account and take Mixnode for a spin. If you haven't done so yet, you can request an invite here.

What is Mixnode?

Mixnode turns the web into a giant database!

In other words, Mixnode lets you think of all the web pages, images, videos, PDF files, and other resources on the web as rows in a single giant database table with trillions of rows, which you can query using standard Structured Query Language (SQL). So, rather than running web crawlers or scrapers, you can write simple queries in a familiar language to retrieve information from this table of live data.

url | content_type | content_language | content | headers | url_protocol | url_host | url_domain | url_etld | url_abs_path
https://news.ycombinator.com/ | text/html; charset=utf-8 | en | <html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=de... | HTTP/1.1 200 OK Server: nginx Date: Mon, 24 Sep 2018 19:36:30 GMT Content-Type: text/html; charse... | https | news.ycombinator.com | ycombinator.com | com | /
https://fr.wikipedia.org/wiki/Base_de_donn%C3%A9es | text/html; charset=UTF-8 | fr | <!DOCTYPE html> <html class="client-nojs" lang="fr" dir="ltr"> <head> <meta charset="UTF-8"/> <title... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:39:49 GMT Content-Type: text/html; charset=UTF-8 Connec... | https | fr.wikipedia.org | wikipedia.org | org | /wiki/Base_de_donn%C3%A9es
https://www.reddit.com/sitemaps/subreddit-sitemaps.xml | text/xml | NULL | <?xml version='1.0' encoding='UTF-8'?> <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/... | HTTP/1.1 200 OK Last-Modified: Mon, 24 Sep 2018 06:13:14 GMT ETag: "aeae350d08f76f005e2fe8098a4713... | https | www.reddit.com | reddit.com | com | /sitemaps/subreddit-sitemaps.xml
http://www.diarioelpuerto.com.mx/ | text/html | es | <!DOCTYPE HTML> <html> <head> <meta name="google-site-verification" content="SzDRrSxL_mhLV_bCAnR_s8e... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:13:26 GMT Server: Apache X-Powered-By: PHP/5.2.17 Keep... | http | www.diarioelpuerto.com.mx | diarioelpuerto.com.mx | com.mx | /
http://www.wfnmc.org/mc20101.pdf | application/pdf | en | %PDF-1.6 206 0 obj <</Linearized 1/L 213940/O 208/E 89344/N 12/T 209772/H [ 1196 788]... | HTTP/1.1 200 OK ETag: "343b4-53e2b129-5cf784d6aa98c961" Last-Modified: Wed, 06 Aug 2014 22:50:17 G... | http | www.wfnmc.org | wfnmc.org | org | /mc20101.pdf
https://code.jquery.com/jquery-1.11.3.js | application/javascript; charset=utf-8 | NULL | /*! * jQuery JavaScript Library v1.11.3 * http://jquery.com/ * * Includes Sizzle.js * http://si... | HTTP/1.1 200 OK Date: Mon, 24 Sep 2018 19:55:14 GMT Connection: Keep-Alive Accept-Ranges: bytes ... | https | code.jquery.com | jquery.com | com | /jquery-1.11.3.js
...

Mixnode turns the web into a giant database table with multiple columns.

Just like a regular database table, you are provided with several columns (a.k.a. fields) that represent different attributes of web resources, such as URL, content, content type, content language, and domain name. Additionally, Mixnode comes with hundreds of functions that you can use to analyze the data further. From parsing HTML/XML and JSON to handling date/time and processing text, there are numerous built-in functions to use directly in your queries.
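To make the URL-related columns more concrete, here is a rough Python sketch of how values like url_protocol, url_host, url_domain, url_etld, and url_abs_path could be derived from a URL. This is only an illustration and not how Mixnode computes them; the `url_columns` helper and the tiny `KNOWN_SUFFIXES` set are hypothetical stand-ins (a real implementation would need a complete public suffix list).

```python
from urllib.parse import urlsplit

# Hypothetical, tiny stand-in for a public suffix list (assumption for
# illustration only; real effective-TLD detection needs the full list).
KNOWN_SUFFIXES = {"com", "org", "net", "com.mx"}


def url_columns(url):
    """Split a URL into fields named after the table columns shown above."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    labels = host.split(".")

    etld, domain = "", host
    # Walk from the longest candidate suffix to the shortest and stop at
    # the first one found in the suffix set.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        if candidate in KNOWN_SUFFIXES:
            etld = candidate
            # Registrable domain = the suffix plus one label to its left.
            domain = ".".join(labels[max(i - 1, 0):])
            break

    return {
        "url_protocol": parts.scheme,
        "url_host": host,
        "url_domain": domain,
        "url_etld": etld,
        "url_abs_path": parts.path or "/",
    }


row = url_columns("http://www.diarioelpuerto.com.mx/")
print(row)
```

The multi-label suffix (com.mx) is the reason a plain "last label" split is not enough to produce url_etld.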

As a simple example, using Mixnode, getting the URL and title of every web page from the web boils down to a simple SQL query:

select url, string_between(content, '<title>', '</title>') as title
from resources
where content_type like 'text/html%'

The results will look similar to:

url title
https://stackoverflow.com/questions/8318911/why-does-html-think-chucknorris-is-a-color [Why does HTML think “chucknorris” is a color? - Stack Overflow]
https://en.wikipedia.org/wiki/List_of_animals_with_fraudulent_diplomas [List of animals with fraudulent diplomas - Wikipedia]
https://www.amazon.co.jp/dp/B06XXQD54H/ [Amazon | アクータメンツ フィンガーリス 指人形 フィンガーパペット 指人形 | おもちゃ雑貨 | おもちゃ]
https://www.reddit.com/r/funny/comments/5yhipb/its_a_bit_breezy_out_there_today/ [It's a bit breezy out there today : funny]
https://imgur.com/gallery/cJO834B [Just cause you pelican doesn't mean you pelishould - Album on Imgur]
...
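If you want to get a feel for this kind of query before you have access, the sketch below runs the same SQL against a tiny in-memory SQLite table using Python's standard library. Everything here is an assumption made for illustration: the three sample rows are made up, and `string_between` is a guess at the function's behavior (text between the first start marker and the next end marker), not Mixnode's actual implementation.

```python
import sqlite3


def string_between(text, start, end):
    """Assumed semantics: text between the first `start` and the next `end`."""
    if text is None:
        return None
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


conn = sqlite3.connect(":memory:")
# Register the Python function so the SQL below can call it by name.
conn.create_function("string_between", 3, string_between)
conn.execute(
    "create table resources "
    "(url text, content_type text, content_language text, content text)"
)
# Made-up sample rows standing in for the real table.
conn.executemany(
    "insert into resources values (?, ?, ?, ?)",
    [
        ("https://example.com/", "text/html; charset=utf-8", "en",
         "<html><head><title>Example Domain</title></head></html>"),
        ("https://example.org/fr", "text/html", "fr",
         "<html><head><title>Exemple</title></head></html>"),
        ("https://example.com/a.pdf", "application/pdf", "en", "%PDF-1.6 ..."),
    ],
)

rows = conn.execute("""
    select url, string_between(content, '<title>', '</title>') as title
    from resources
    where content_type like 'text/html%'
""").fetchall()

for url, title in rows:
    print(url, title)
```

The PDF row is filtered out by the content_type condition, which is the same role that condition plays in the query above.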

You can expand this query in any number of ways by utilizing the built-in columns and functions of Mixnode. For example, if you wanted to get the title of every English web page you could simply use a condition on the content_language column:

select url, string_between(content, '<title>', '</title>') as title
from resources
where content_type like 'text/html%' and content_language = 'en'

Did you want the title and first paragraph of every English web page? The css_text_first function has you covered:

select url, string_between(content, '<title>', '</title>') as title, css_text_first(content, 'p') as first_paragraph
from resources
where content_type like 'text/html%' and content_language = 'en'
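For intuition, here is a rough Python approximation of what a function like css_text_first might do for a simple tag selector such as 'p'. It is a sketch under assumptions, not Mixnode's implementation: the `FirstTagText` class and `first_text` helper are hypothetical, they support only bare tag names (no real CSS selectors), and they expect reasonably well-formed markup.

```python
from html.parser import HTMLParser


class FirstTagText(HTMLParser):
    """Collect the text inside the first element with the given tag name."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0       # > 0 while inside the first matching element
        self.done = False    # True once that element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if not self.done and tag == self.tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if not self.done and tag == self.tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.chunks.append(data)


def first_text(html, tag):
    """Return the stripped text of the first `tag` element, or None."""
    parser = FirstTagText(tag)
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks).strip()
    return text or None


page = "<html><body><h1>T</h1><p>First <b>para</b>.</p><p>Second.</p></body></html>"
print(first_text(page, "p"))
```

Text inside nested inline tags (the bold word here) is kept, while later paragraphs are ignored.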

Same query, but only on .net domains? You only need to use the url_etld column:

select url, string_between(content, '<title>', '</title>') as title, css_text_first(content, 'p') as first_paragraph
from resources
where content_type like 'text/html%' and content_language = 'en' and url_etld = 'net'

Consider the question "Sort the English Wikipedia articles by length". All you need to answer this question is to use the order by clause:

select url, cardinality(words(content)) as article_length
from resources
where url_host = 'en.wikipedia.org' and url_abs_path like '/wiki/%'
order by article_length desc
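The same filter-count-sort logic can be sketched in plain Python on a few made-up rows. The sample data and the `word_count` helper are assumptions for illustration; in particular, how Mixnode's words function actually splits text is not specified here, so a simple regular expression is used instead.

```python
import re


def word_count(text):
    """Count runs of word characters; a stand-in for cardinality(words(...))."""
    return len(re.findall(r"\w+", text))


# Made-up rows: (url_host, url_abs_path, content)
resources = [
    ("en.wikipedia.org", "/wiki/A", "one two three"),
    ("en.wikipedia.org", "/wiki/B", "one two three four five"),
    ("en.wikipedia.org", "/w/index.php", "not an article path"),
    ("fr.wikipedia.org", "/wiki/C", "un deux trois quatre cinq six"),
    ("en.wikipedia.org", "/wiki/D", "one"),
]

# where url_host = 'en.wikipedia.org' and url_abs_path like '/wiki/%'
articles = [
    (path, word_count(content))
    for host, path, content in resources
    if host == "en.wikipedia.org" and path.startswith("/wiki/")
]

# order by article_length desc
ranked = sorted(articles, key=lambda row: row[1], reverse=True)

for path, length in ranked:
    print(path, length)
```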

By combining table columns and built-in functions, you can analyze the web in a practically unlimited number of ways. Additionally, you can integrate Mixnode with external data sources (e.g. sending data to and receiving data from Amazon S3) to create even more flexible queries.

Give it a try!

Mixnode allows you to focus only on what you need to get from the web and not how to get it. It is an end-to-end solution that takes you from question to answer with a simple query; you don't need to deploy web crawlers or run scrapers, you don't need to process raw data, and there are no "intermediate results".

We are super excited to finally share this amazing technology with the world. Once you receive your invitation you can create an account and start using Mixnode. You can request an invite here.

Last but not least, we would love to hear from you! Please contact us at hi@mixnode.com if you have any questions or comments.