Public blockchains produce enormous amounts of data. In theory, anyone can access the raw contents of transactions and blocks. In practice, however, querying blockchains can be a daunting task.
The difficulty lies in the fact that blockchains are a particular type of distributed database and thus carry several limitations. Most, if not all, blockchains lack the basic SQL querying capabilities supported by nearly every off-the-shelf database system.
Take Bitcoin as an example. Its API lacks even the most basic calls that would allow a user to query an address and retrieve its balance. To achieve this, block explorers and the like have developed sophisticated middleware infrastructure that parses the blockchain, normalizes the data, and stores it in a database, where it can be queried. In the best of cases, companies offer API calls for only a limited set of operations. Google hopes to change this by freeing blockchain datasets.
We’re joined by Allen Day, Science Advocate at Google’s Singapore office. Earlier this year, he and his team released the Bitcoin blockchain as a public dataset in BigQuery, Google’s big data analytics offering. In August, they added Ethereum to their list of freely available public datasets, which includes US census data, cannabis genomes, and the entirety of Reddit and GitHub. Anyone wishing to query the data can do so in SQL on the BigQuery website or via an API. For instance, a relatively simple query returns the daily mean transaction fees since the Genesis Block in just a few seconds.
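By way of illustration, here is a minimal sketch of that kind of daily-mean aggregation, run against a toy in-memory SQLite table rather than the real BigQuery dataset; the table name, columns, and fee values are all invented for the example.

```python
import sqlite3

# Toy stand-in for the public transactions table; schema and data are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (block_timestamp TEXT, fee INTEGER)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [
        ("2018-08-01 09:00:00", 1000),
        ("2018-08-01 17:30:00", 3000),
        ("2018-08-02 12:00:00", 500),
    ],
)

# The kind of historical aggregation the dataset makes easy: mean fee per day.
rows = conn.execute(
    """
    SELECT DATE(block_timestamp) AS day, AVG(fee) AS mean_fee
    FROM transactions
    GROUP BY day
    ORDER BY day
    """
).fetchall()
print(rows)  # [('2018-08-01', 2000.0), ('2018-08-02', 500.0)]
```

On BigQuery the same `GROUP BY` shape applies, just against the full chain history instead of three toy rows.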
Coupled with Google’s AI and Machine Learning infrastructure and other open data sets, one can only imagine the potentially groundbreaking insights we could gain from this data.
Topics we discussed in this episode
- Allen’s background as a geneticist
- The similarities between blockchains and the evolutionary process in lifeforms
- Google’s cloud platform and its various components
- BigQuery and its publicly available datasets
- The Bitcoin and Ethereum datasets in BigQuery
- Why this data is useful to the public and for what it may be used
- The particular challenges in implementing Ethereum as opposed to Bitcoin
- Insights we may gain by crossing blockchain datasets with other data
- How machine learning and AI could help us better understand specific transaction patterns
Sebastien Couture: My name is Sebastien Couture and today I’m very pleased to have with me Allen Day who is a science advocate at Google at the Singapore office.
We met in Singapore a few months ago when I was traveling through Asia, and at the time he told me about a really interesting initiative which I hadn’t heard about, though it had been live for a few months: Google had added the entire Bitcoin transactional dataset to their cloud infrastructure and was on the cusp of also releasing an Ethereum dataset. I vowed to have him on at some point so that we could discuss this. In August, Google did in fact release their Ethereum dataset on BigQuery, and so I’m here with Allen today to talk about all this and other things. So, Allen, thank you so much for coming on.
Allen Day: Yeah, sure, it’s really my pleasure. It’s good to be here.
Sebastien: Yeah, before we get started let’s talk a bit about your background. Your PhD is in human genetics. Talk a bit about your journey: where did you come from, and how did you end up at Google working as a Science Advocate, putting blockchains in big datasets?
Allen: Yeah. I’ve been working with computers since I was a little kid, and as I was moving through school and eventually ended up in a doctorate program, I was combining computing and biology all the way through, and that led me into an interdisciplinary field called bioinformatics. That involves working with distributed systems for scientific computing, as well as large datasets, computer science, and statistics, so I was becoming something that’s now called a data scientist before the title really existed. A lot of these people come from the physical sciences, and once I had acquired that skillset it was quite easy to apply it to other disciplines.
And so I could see that there was something interesting happening with these blockchain datasets, so I decided to start looking at them and applying some of the same techniques and methods that I had learned for analyzing biological networks to analyze these new types of financial networks.
Sebastien: One thing that I think is sort of interesting, and my co-host Meher has often talked about this, is this idea that blockchains and biological systems are quite similar, or have common characteristics, in that a blockchain can mutate and fork, and you’ll have a sort of new evolution within its life form. Maybe you can give us your take on this. Do you find that there are similarities between the way blockchains evolve and the way biology has evolved?
Allen: Yeah, certainly. The most direct parallel is the forking that happens between projects, where one project may decide to change their operational rules, or how the consensus works, for example, or the block time or block size or something. And that’s very similar to a mutation causing two populations of individuals from the same species to become different species. So a speciation event is the equivalent of a fork. Also, if you look at the smart contract platforms, with smart contracts stored on chain whose functions are made available to any blocks added after the smart contract, there are additional effects that become possible as the blockchain evolves. That’s also related to adding new functions into a genome, for example. Yeah, there are certainly some parallels.
Sebastien: Do you know of anyone who’s doing any research on this and that is sort of exploring this at a much deeper level?
Allen: No, but there’s something interesting that I encountered from a friend of mine. His name is Daniel Suarez. He’s a sci-fi author, and he recently published a book called Change Agent, which is about a bioinformatician based in Singapore. So I thought that was kind of interesting, since it has parallels with my life there. In his book he talks about blockchains at some point, and one of these chains is called the BioCoin, whose Proof-of-Work, we’ll just call it that, is basically that blocks are added as a result of some mutation happening in a bioreactor. And so there’s an interesting concept or idea here: if you could define a fitness function that you wanted a population of organisms to move toward through directed evolution, that’s some form of work, because you’re exploring the combinatorial space of the genome, or the proteome, or whatever aspect of these living systems that are evolving in parallel, right?
You’re trying to move toward some target and that’s the work that you’re establishing. If you have a way to measure that, you could actually link evolution to adding records onto a chain. So this is maybe a way that we could do some interesting work as part of securing the chain, but it requires a much lower cost of genome sequencing and genome editing than we have today. But certainly in the future if you look at the rate at which these things are dropping in cost, it’s conceivable some kind of technology like this could exist.
Sebastien: That’s really fascinating. I think we could probably spend the whole episode just exploring this topic, but specifically I wanted to have you on to discuss this initiative at Google, bringing the Bitcoin and Ethereum blockchains onto Google Cloud. But first, tell us about your role at Google. What’s a typical day like as a Science Advocate at Google?
Allen: Sure. We can start by unpacking my title a little bit. So this is one that I just gave to myself because my official title is a developer advocate and I’m specifically interacting with communities who are involved in mostly physical sciences, and part of that is doing communication and so this is more a title that resonates with them so I usually just use that. My day-to-day is, as I mentioned, communication, so doing interviews like this or blogging or public speaking. This is maybe, I don’t know, 30%, 40% of my time. About half of my time I spend doing software development, so I’m actually an engineer at Google but I happen to be externally facing.
Showing people outside what cloud can be used for to develop interesting applications, and then collecting information from outside: seeing what the market is doing, what kind of cool stuff people are building, and in particular where they’re encountering friction or where cloud doesn’t have some specific offering yet, and bringing that back into Google to help product teams make better stuff for the people we’re trying to serve. And the remainder of my time is, like everyone else, administrative kinds of things, quite a lot of travel, email, et cetera.
Sebastien: And so this advocacy work that you do is mostly centered around cloud platform, or do you also touch other Google products?
Allen: It’s all cloud. Yeah. And I’m specifically building things that are more like end-to-end, realistic use cases, and I’m working quite a lot with these public datasets, as we’ll talk about later, I’m sure. Some of my colleagues are doing more feature advocacy around incremental updates to products, but I tend to build large, integrated projects that touch many possible cloud components.
Sebastien: Interesting. And as you said, you build these projects that are more realistic and that are sort of like experiments that could potentially turn into products. Have any of the things you’ve worked on turned into, or morphed into, Google products or anything that has been commercialized?
Allen: As a geneticist, I work quite a lot with the genomics and healthcare product team, and, yeah, definitely some of the stuff that I encounter, frictions I run into as an individual developing with the tools. I’m basically customer zero. I give them that feedback, and then stuff that’s really bothering customers, I give that to them too, and that results in updates to products. Yeah, for sure.
Sebastien: It sounds like a really fascinating role where you can live your passion for technology and science while experimenting and having, sort of, a lot of flexibility to propose new types of experiments internally.
Allen: Absolutely. People who love to play with new technology find themselves quite at home in this kind of role, and I basically just get to play with Lego bricks all day. It’s fantastic.
Sebastien: That’s cool. Let’s talk about Google Cloud Platform at a high level. I think most of our listeners will probably be familiar with Google Cloud Platform but give us a high-level overview of that product and the types of components that exist within it.
Allen: Okay, sure. Yeah, it’s a public cloud, so we have a bunch of data centers and a network connecting the data centers. It’s comparable to the other public clouds in that regard. Google has been operating its own data centers for 20 years now. We just passed our 20th birthday, and so we know quite a bit about how to operate these things. The first cloud product was something called App Engine, which is still around. It was a bit ahead of its time. Today you could break our products down roughly into three areas. One of them is related to virtualization and infrastructure, so this would be Kubernetes or other types of virtual machine services and microservices infrastructure, networking, firewalls, et cetera.
Another product area is related to application development. App Engine fits in there, for example, along with other components for building, let’s say, web services and integrating all of your stuff together to make something usable. And then the final area is data analytics, and this is the area I’m advocating, which is primarily big data technologies. BigQuery is one of these, along with BigTable, Spanner, and a bunch of our other databases. Quite a lot of it is related to databases and storage. And then on the compute side, we have a whole bunch of A.I. technologies. You can’t really compute if you don’t have data, and data is not really useful if you don’t have compute, so we have these two things that can move data back and forth between them to build new data from old data. And the more interesting types of services we have on the compute side are A.I. related.
Sebastien: Give us a sense of how big this cloud is. I don’t know what kind of metric we want to use, whether it’s the number of data centers, number of computers, or terabytes of information processed. Can you give us a sense of how massive Google Cloud is?
Allen: Yeah. I can’t give you specific stats on the number of data centers, but I know that we’re represented in all the major geographies around the world and have our own dedicated connections between the data centers, so the connectivity is quite good, running through our dark fiber. It doesn’t ever pass over the public internet. And then a lot of our services, because, again, they come from this heritage of Google before Google Cloud, include things like Spanner, for example. This is a globally consistent distributed database that relies on atomic clocks to keep transactions across these different data centers synchronized, and that’s now available via the public cloud as well. So there’s a whole bunch of goodies from Google inside this public cloud. There are a lot of big customers on here. Snapchat runs on Google Cloud, for example.
If you remember Pokemon GO, people still play this, that’s also in Google Cloud. Dropbox runs on Google Cloud. Yeah, there’s a lot of customers. Yeah, it’s quite big and we’ve got some major customers.
Sebastien: Now, you mentioned that Google Cloud is sort of this suite of services: we’ve got the virtualization aspect, we’ve got the storage aspect, and then also the data processing and machine learning and A.I. As a user of Google Cloud, I presume that all these products integrate together, correct?
Allen: Yeah, yeah. There are some places where some components don’t interact as seamlessly as you would like, or where moving data between them takes some work, but in general there’s some way for them to interoperate.
Sebastien: And so what are the most interesting, most cutting-edge things that you’ve seen, that you can talk about, that people are doing in areas like data processing or research, or with your A.I. modules?
Allen: Oh, I would recommend looking at a YouTube channel called Two Minute Papers. They’re usually a little bit more than two minutes, but they cover the latest advances in deep learning, and quite a lot of that is happening with an application, sorry, an S.D.K. called TensorFlow, and TensorFlow was developed by Google. It was open-sourced. This is the largest, most popular library for doing deep learning, which is the current most popular area of machine learning, and that’s all compatible with Google Cloud. It all runs in Google Cloud. We’ve got specific services that make it run really, really well, Cloud Machine Learning Engine for example. Yeah. Computer vision is one of the most interesting areas. You can see how computers are now able to drive cars, for example, right?
So they’re doing real-time analysis of images, looking at all the sensor data coming in and feeding that into the model of how the car is operating to make sure it can operate safely.
Sebastien: Moving on more towards the BigQuery component, can you spend a bit of time describing BigQuery and the different components there as it relates to what you’re now doing with Bitcoin and Ethereum?
Allen: Sure. BigQuery is also a distributed system, similar to Spanner as I mentioned earlier, but it’s not distributed across multiple data centers like I was describing. It lives more locally than that, but it still has a whole bunch of nodes that store parts of a dataset, and so when you do a query, you’re actually running a job in parallel across a large number of machines to produce a result. We take the approach of basically treating the data center as the computer and don’t try to implement anything very fancy like indexes on the tables; we just do a linear scan across everything, because we have enough hard drives that it’s economical enough to do that, given we can distribute the workload well within the data center, and we’re using A.I. to do that.
And because we’re not making many assumptions about the structure of the dataset, it’s quite workable for many different datasets of many different shapes and sizes, and scales extremely well.
Sebastien: So people are using BigQuery with their own datasets, presumably all types of companies processing anything from consumer data to user behavior data. Whatever type of data anyone can think of wanting to process, you could presumably use BigQuery to hold and query it to get some sort of analysis. But there are also public datasets on BigQuery, and this is, specifically in the context of what we’re talking about today, quite interesting, because there are, I think, quite a few public datasets on there. Can you describe some of the other datasets that people are using on BigQuery?
Allen: Yeah, there are. As you mentioned, there are quite a lot of private datasets, and those can be joined against public datasets for what we could call augmentation, for example, where you might have some private information that you want to enhance or enrich by joining against public data. The majority of the public datasets, though, are not dynamic, so they’re not regularly updated. It’s typically some kind of toy dataset. There’s one about New York taxicabs: I forget how many days or months of data it is, but it’s a snapshot or a sampling of taxi rides, what time the pick-up happened, what time the drop-off happened, and what was point A and point B. There are others that look at types of trees and shade coverage for doing solar radiation analysis on city streets.
There are various image datasets. There’s one from, I forget which museum, but a bunch of pieces of art are cataloged. Another dataset that I produced was a genomics dataset: a thousand different cannabis genomes, in order to accelerate innovation in agriculture. There’s a whole bunch of stuff happening right now with this plant, which is undergoing regulatory changes, and if we begin to look at the genetic structure of these plants, we might be able to improve the varieties more quickly. So there’s a whole hodgepodge of all kinds of different stuff: weather data, satellite imagery data. All of the Reddit comments are also in BigQuery, so if you want to query any of the subreddits and threaded forums, you can look at all of that.
All of GitHub is in BigQuery, not just the source code but also the comments and the merge requests and everything. So if you wanted to do some code analysis, that’s a pretty popular one because developers are interested in development, so it gets quite a lot of use.
Sebastien: I like this idea of combining private datasets and public datasets, and some of the ways one might use that. Tell me if this makes sense: if you’re a company like a ride-sharing app and you want to gain some insights into the ways people are using your app, specifically with regard to your competition, which is taxis, you could take that New York taxi ride dataset and cross it with your own dataset of how your users are using your app, how many times a day or a week they’re booking rides, and then maybe extract some sort of insight from that so you could put more cars in a certain area to better compete with the New York taxis, for example. What types of examples can you point to as to how people are using public datasets with their own private datasets?
Allen: Sure. Yeah, conceivably that’s possible, although bear in mind that this taxi dataset, we can keep working with this example, is quite small and limited in what it has. The Yellow Cab Company is not putting all of their data into the public dataset. It’s just a little bit, as a toy. But that raises an interesting possibility. What if all of the data were available? How much would you have to pay to incentivize a company to put all of their data out there? Or at what level of resolution: would you be willing to pay for lower-resolution data, and would they be willing to sell that as opposed to the highest-resolution data? There’s actually an interesting case study, which we can provide as a supplement, possibly in the comments: Thomson Reuters did something like this, where they actually host their headline data along with some other attributes.
I don’t know if it’s the full article or what. I’ve not looked at it. It’s a private dataset and what they’re doing is they’re using Google Data Exchange to make this available using Google’s access control. So Google basically allows them to manage the access control and is managing BigQuery tables to store the data such that Thomson Reuters only takes the responsibility to put the data in and then they are selling subscription access to get access to these tables. So you could do this. You could also put data into queues for real-time streaming analysis. So that’s an example of where, we can now generalize out to not just two datasets but actually having the notion of a marketplace.
And maybe there’s some opportunity for transportation or logistics companies to bring it back to Yellow Cab where they could be willing to operate by exchanging some or all of their data and pricing it accordingly depending on how much you need access to, how much latency you’re willing to accept, et cetera, so turning all those knobs that all is possible in a marketplace design. You could think about AdTech that’s doing a very similar thing, right? Like advertising and ad exchanges, it’s quite a similar idea.
Sebastien: Interesting. Well, maybe we can go back to some other examples a bit later in the show. Let’s talk about this Bitcoin blockchain dataset on Google Cloud. This came out in February of this year. What exactly does it include? What is the Bitcoin dataset on Google Cloud, on BigQuery, and what was the goal in making this dataset publicly available?
Allen: I wanted to be able to explore the data, sort of in my own selfish interest, to be able to make some blogposts or just look at the data, because I know other developers want to do this too. You certainly see a lot of interest in cryptocurrency and Bitcoin, and any of these keywords, they’re all growing over time, right? It’s like, “Okay, there are developers here. I can go and become one of these developers and draw some attention to Google Cloud. And I know we have good A.I. tools, so blockchain plus A.I. should be super exciting, right?” So I tried to do some of these queries, and it turned out to be really difficult to talk to a Bitcoin node directly. Usually the kind of query I’d want to do would be some kind of historical analysis, and that’s not possible going block by block very easily.
You have to query block by block by block, whereas normally in a SQL type of scenario you’d do a GROUP BY to aggregate. It’s a particular type of operation for this kind of programming. And so I realized that I could extract these data out of the Bitcoin blockchain, put them into BigQuery, and do the analyses I wanted to do. And that’s what’s in there. It’s nothing more than the Bitcoin blockchain data itself: we download all of the blocks, about 200 gigabytes worth of data, then parse each of the blocks and load it into BigQuery. Every time a new block comes out, we update the table and put it in there. It’s just the transaction data. I don’t know how familiar your listeners are with what’s in Bitcoin, but it’s really just addresses sending some number of satoshis from address A to address B.
Sebastien: Right. I’m looking at this right now, so I will link in the show notes to the Bitcoin dataset. It actually has two tables, so it has a blocks table and a transactions table, correct?
Allen: Yup, that’s right. And those tables contain the same underlying data. The reason I did that is I denormalized it, because of the way the BigQuery pricing model works: you’re paying per unit of I/O, and if I unnest the blocks, I allow access to the transactions at a lower cost. So it saves users money to do it that way.
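A toy sketch of what that denormalization looks like, in plain Python rather than in the actual loader; the field names (`height`, `transactions`, `txid`, `value`) are invented for illustration.

```python
# Nested block records, as a block-oriented table might store them.
blocks = [
    {"height": 1, "transactions": [{"txid": "a", "value": 10}]},
    {"height": 2, "transactions": [{"txid": "b", "value": 5},
                                   {"txid": "c", "value": 7}]},
]

# "Unnesting": flatten into one row per transaction, carrying the block height,
# so a query that only needs transactions never scans whole block records.
transactions = [
    {"block_height": b["height"], **tx}
    for b in blocks
    for tx in b["transactions"]
]
print(len(transactions))  # 3
```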
Sebastien: Okay, that makes sense. Right, so rather than having to query the block and then find the transactions within the block, then you can simply query transactions?
Allen: Yeah, that’s correct. That’s right.
Sebastien: Interesting. And so you update this every time a block is confirmed?
Allen: Every block. We’re intentionally staying six blocks behind the chain height, because that allows us to avoid having to deal with chain reorgs. We don’t want to have to delete data. There’s some complexity, right? If you add a block and it ends up not being on the real chain, just some kind of dead branch, you don’t want to have to delete it and manage that. If you’re going to build the simplest possible system, you don’t want to take that into consideration. By staying several blocks behind the tip of the chain, you avoid that problem, but the trade-off is that the data are slightly stale.
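The trade-off Allen describes can be sketched in a few lines. This is an illustration of the idea only, not Google’s actual loader; the six-confirmation constant comes from his description, and the function name is invented.

```python
CONFIRMATIONS = 6  # stay six blocks behind the tip to sidestep chain reorgs

def safe_load_height(tip_height: int, confirmations: int = CONFIRMATIONS) -> int:
    """Height of the newest block considered safe to load into the warehouse."""
    return max(0, tip_height - confirmations)

# Blocks above this height may still be reorganized away, so the loader
# ignores them until the tip advances; the cost is slightly stale data.
print(safe_load_height(540_000))  # 539994
```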
Sebastien: Okay. So you’re not storing orphan chains, or orphan blocks?
Allen: Correct. Yeah, the branches of the blockchain that don’t end up becoming part of the main trunk are not stored in the table.
Sebastien: Okay. Interesting. Is there a particular reason why you chose not to also store that data? It seems like there could potentially be some interesting analysis that could be made on orphan blocks.
Allen: Well, any data that’s on an orphan block is not part of the consensus, right? There was some minority who thought there might be a block there that everyone had agreed to, but due to race conditions or randomness or whatever, it just ended up not being the case. Looking at what transactions end up on these dead branches? Yeah, it could be interesting. Maybe there’s some censorship happening on the blockchain, where entity A is blocking entity B’s transactions from being placed on chain by denial-of-service attacking them. I suppose so. Yeah, it could be interesting. I have not gone down that direction. That’s actually a really cool idea, though.
Sebastien: Yeah, I was thinking sort of along those lines. If at some point that were to be the case, perhaps we could detect those types of anomalies in this dataset.
Allen: In particular, I think it would be interesting if there were some geospatial data or IP address data, which is not stored on the chain, right? But you can see that from the mempool if you’re operating a node. And if there is some relationship, can you basically see an adversarial relationship between peers where they try to block one another? It could be. And there are probably some interesting patterns in there if that does happen.
Sebastien: You’re also not storing mempool transactions. At no point can I query this dataset for, say, the transactions waiting to be confirmed. It’s only confirmed . . .
Allen: That’s correct.
Sebastien: . . . blocks, six blocks behind the height.
Allen: Correct. No Mempool in BigQuery dataset. That’s correct.
Sebastien: Okay, cool. Can you talk about the technical infrastructure that you’ve built in order to query the blockchain and pull this data into your dataset?
Allen: Sure. You want to talk about just Bitcoin or you want to get into Ethereum? How do you want to . . .
Sebastien: Let’s talk about Bitcoin for the moment. We can talk about Ethereum a bit later.
Allen: Okay. Yeah, sure. For the Bitcoin infrastructure, what we did is we built a custom Bitcoin client with a library called bitcoinj. This is the Java version that implements the Bitcoin peer-to-peer protocol, and it’s a peer on the network like any other peer: it accepts new blocks coming in, and if a peer asks for a block, it will send blocks out, but we’re not doing any mining. We’re just acting as sort of a file-sharing node, like a BitTorrent node basically, storing the blockchain. And we know when new blocks are coming in because we’re accepting these new files, right? Block files. We’re looking at the height of the chain, and every time a block comes in we increment this follower position that’s X blocks behind, and then kick off a job using something called Cloud Functions that will grab that block from Cloud Storage.
So there’s the node running on a Compute Engine instance, a virtual machine, so it’s just like a computer. No special mining hardware, because we’re not mining. It writes the block file to storage, and then there’s a function that watches that storage area for a new file to come in, processes that file, and sticks it into BigQuery. That’s it. That’s all it does.
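A minimal sketch of that flow, with a plain Python dict standing in for the Cloud Storage bucket and a direct function call standing in for the Cloud Functions trigger; all names and fields are invented for the example.

```python
import json

storage = {}          # stands in for a Cloud Storage bucket
warehouse_rows = []   # stands in for a BigQuery table

def on_new_block_file(name: str) -> None:
    """Stand-in for a function triggered by a new object in the bucket:
    parse the block file and load one row per transaction."""
    block = json.loads(storage[name])
    for tx in block["transactions"]:
        warehouse_rows.append({"block_height": block["height"], **tx})

# The node "writes" a block file, which "triggers" the function.
storage["block_1.json"] = json.dumps(
    {"height": 1, "transactions": [{"txid": "a", "value": 10}]}
)
on_new_block_file("block_1.json")
print(warehouse_rows)  # [{'block_height': 1, 'txid': 'a', 'value': 10}]
```

In the real pipeline the trigger is event-driven rather than a direct call, but the shape — write file, watch, parse, load — is the same.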
Sebastien: Essentially you have a node that’s listening to the network, pulling in transactions, and storing them in BigQuery, where they can then be queried. And as a user, how do I query it? How do I query the blockchain? What language am I using? Are there A.P.I.s or S.D.K.s that I can plug into my software to query the Bitcoin blockchain?
Allen: Yeah. We’re using a language called S.Q.L., or SQL. This is an industry-standard language for interacting with databases: Oracle’s database runs SQL, and MySQL, Postgres, Microsoft Access, Teradata, all of them support some core SQL functions, you could say, like operators, and then typically there are some vendor-specific extensions. BigQuery has some vendor-specific extensions too, related to A.I., geospatial, and various other things. But when you’re working with the blockchain data, you don’t really need any vendor-specific extensions. You could conceivably take the loading system and push it into MySQL, and it would all work the same way.
Sebastien: Okay. So you could construct a query as simple as: select all transactions from this day to this day where the amount transacted was one Bitcoin, for example, and it would just return all of those transactions as a result?
Allen: Yup. You could select the mean price of a transaction per day across all days, or you could look at the quartiles, or max, or variance, or whatever attribute you’re looking for, if we continue with this per-day example. You could partition by day. You could partition by block. There are many different ways you could slice it.
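For illustration, here is that per-day slicing in plain Python over invented satoshi values; on BigQuery the same partitioning would be a GROUP BY in SQL, as above.

```python
import statistics
from collections import defaultdict

# Toy per-day partitioning; values are invented, denominated in satoshis.
txs = [
    ("2018-08-01", 100_000_000),  # 1 BTC
    ("2018-08-01", 50_000_000),
    ("2018-08-02", 200_000_000),
]

by_day = defaultdict(list)
for day, satoshis in txs:
    by_day[day].append(satoshis)

# Mean and max per day; quartiles or variance would follow the same pattern.
summary = {day: {"mean": statistics.mean(vals), "max": max(vals)}
           for day, vals in sorted(by_day.items())}
```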
Sebastien: Maybe you could correlate it with some other public dataset, like weather, and try to see if there are any patterns affecting people’s transaction volumes or something like that.
Allen: For sure.
Sebastien: While we were preparing for this episode, you mentioned this platform called Kaggle, which I had sort of heard of before but wasn’t super familiar with. The way you described it to me is as a sort of GitHub for data analysis: a platform where data scientists can share their data analyses, I presume queries and code, and fork them, in this open community of data scientists. Talk about some of the things people are doing on Kaggle, things you may have noticed with regard to these datasets. What types of analyses are people doing on the Bitcoin blockchain?
Allen: Yeah. Kaggle is the largest community of data scientists online, so there’s quite a lot of machine learning happening there. They analyze in these notebook environments, so there’s a computer sitting behind an interactive front end in a web browser, and they can run code against data sitting on a remote machine connected to the web. It can also connect to BigQuery, so they can pull data into this analysis environment and process it with code inside the notebook. They’re typically programming in Python. Specifically with regard to the Bitcoin dataset, users have mostly been interested in looking at the features we were just talking about, like prices of transactions denominated in satoshis, or what were the largest transactions per day, and then correlating these to other datasets. So they like to link against other private data.
So for example, as part of the Bitcoin dataset, as you mentioned, we just have the blocks and the transactions. It’s really just the chain data, right? But since this is all financial stuff, frequently people want to link against financial data and so they’ll bring in some other tables. They may host it in their own BigQuery tables or they may be uploading a CSV as part of their analysis and then they can start doing pricing type analyses over time.
Sebastien: Interesting. So what people are analyzing is the Bitcoin transaction data, and then crossing that data with, say, financial data. For example, if one wanted to see if there’s any correlation between the NASDAQ or Dow Jones indexes and the price of Bitcoin . . . well, I suppose the price of Bitcoin wouldn’t work because it’s not in your dataset, but one could make that type of analysis using the BigQuery language, perhaps combined with the machine learning stuff.
Allen: Yeah, auxiliary tables that are available elsewhere. I think the more data that you can link together, that’s structured and documented and linkable, the more value that comes out. And it’s not just one plus one equals two. It’s more like, how would I say? It’s not three plus three equals six, it’s three times three equals nine, right? The utility of the data is more of a product of the pieces as opposed to the sum.
Sebastien: Recently there was a blogpost that was published sometime in August that announced that Google was also releasing an Ethereum dataset . . .
Sebastien: . . . that’s now available on BigQuery. What has the reception been like?
Allen: It’s been initially very positive. This is only a few weeks ago now, not even a month yet but the numbers are looking good in terms of utilization and number of inbound inquiries. I get a lot of, basically, direct pings from developers because my name is out there. I’m on Twitter, et cetera. Some fraction of them want to talk to me about things they’re interested in doing. And relative to the Bitcoin dataset, the amount of developer activity has been very high, so I expect that the utilization of the Ethereum public dataset will be even larger than the Bitcoin public dataset. Yeah. The Bitcoin one, it’s been regularly, heavily scanned ever since it was released. So it’s a very popular public dataset and presumably people are acquiring this public dataset to link against their private data, right? I don’t know what they’re trying to do, something. Analyze it for some purpose.
The Ethereum one, because it’s such a large developer community though, I think there’s going to be many more different, there’s going to be more variety of applications and maybe even more volume of applications on this one, just because it’s a lot more complex.
Sebastien: What are the unique challenges that you face in implementing the Ethereum dataset as opposed to Bitcoin? Because I mean the Bitcoin dataset it seems, I don’t want to say it’s a simple feat, but it seems quite simple. You’ve got a node, it pulls transactions, there’s some data processing in these transactions to normalize the data and then it’s put into essentially a SQL database. And with Ethereum there’s a bit more to it. There’s the transactions but there’s also smart contract transactions and token transactions, and there’s a whole bunch of other things that go into that that add some complexities. Can you talk about those and how you’ve overcome those challenges?
Allen: Yeah. And so as you said, the Bitcoin dataset is really just transfers, and then the cost of the transfer that the requester was willing to pay for that transaction, right? And there’s some strange obfuscation where there’s change addresses and intentional obfuscation of who’s really paying whom, as part of a pseudo-private design. Ethereum doesn’t have that, so this analysis of where money is flowing is not a problem in Ethereum. But there’s this other difficulty where there can be data that go along with the transaction, and that could be a smart contract or other types of data. That’s really the core complexity: you’ve got this Ethereum Virtual Machine that takes the input bytes going along with an incoming transaction and feeds them into some compiled code that lives at an address on the blockchain, which can do things with those bytes.
And what those smart contracts do with those input bytes is arbitrarily complex. It’s a Turing machine. Representing that complexity is very difficult, which is why it took a lot longer to get the Ethereum dataset released than the Bitcoin dataset. So token transfers, that’s a great example. The infrastructure for the Ethereum dataset is pretty similar. So we’re operating an Ethereum node and it’s writing out some files into Cloud Storage, but it deviates from the Bitcoin design there: the loading is happening not just as a direct insertion into BigQuery, but through another cloud component called Cloud Composer, which is based on an open-source project called Apache Airflow, with which you can define an ETL pipeline. ETL is a term used in data warehousing. Basically everything we’re talking about today is data warehousing and then analysis on the stored data.
So ETL is for extract, transform, and load, and basically you’re extracting data from the Ethereum node. You’re transforming it to some form that will be useful for users, so it could be reading the transactions and parsing them so that you can see if it’s an ERC20 transfer or not, or an ERC721 transfer, or whatever other kind of smart contract function call. And then the load part is putting it into the tables. And so there’s a whole bunch of additional ETL processing that we’re doing as part of loading the Ethereum data into BigQuery, because we don’t just want to load the transactions where we give only the input bytes that go to some smart contract without telling you what they mean. At face value, there’s really no interesting analyses that are enabled by only giving input bytes.
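The extract, transform, load shape can be sketched as a toy plain-Python pipeline. This is not the actual Airflow DAG; the transactions are made up, and the transform step uses the well-known 4-byte selector for ERC20 `transfer(address,uint256)`:

```python
# Toy ETL sketch (hypothetical data, not the production pipeline):
# extract raw transactions, transform by tagging ERC20 transfer calls
# via their 4-byte function selector, load into an in-memory "table".

ERC20_TRANSFER_SELECTOR = "a9059cbb"  # keccak("transfer(address,uint256)")[:4]

def extract():
    # Stand-in for reading blocks from a node / Cloud Storage.
    return [
        {"hash": "0x01", "input": "a9059cbb" + "00" * 64},  # token transfer
        {"hash": "0x02", "input": ""},                       # plain ETH transfer
    ]

def transform(txs):
    # Tag each transaction with a convenience column, as the dataset does.
    for tx in txs:
        tx["is_erc20_transfer"] = tx["input"].startswith(ERC20_TRANSFER_SELECTOR)
    return txs

def load(txs, table):
    table.extend(txs)  # stand-in for a BigQuery insert

transactions_table = []
load(transform(extract()), transactions_table)
print(sum(t["is_erc20_transfer"] for t in transactions_table))  # 1
```

In the real pipeline each step would be an Airflow task; here the three functions are just composed directly.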
You actually have to look at what the smart contracts are doing and so we’re looking at these things called traces and logs that are emitted by the smart contracts as part of their operation. So they have some events that are coming out. They describe what the smart contract is doing. And we’re putting all of that into some tables too so that you can look at those and aggregate on the effects of the smart contract on the network.
Sebastien: Can you go into a bit more detail about how you perform these analyses on these traces and logs?
Allen: Yeah, sure. This is getting pretty far down into the weeds of how Ethereum works, but smart contracts have these functions that are defined, typically in Solidity, and a given function will have some defined inputs. An input could be an address, or it could be an amount, or it could be other random binary data. And then they do something with this data, and the result of that is typically emitted as events, and so what we’re doing is taking those events and putting them into a table so you can observe them. The input is actually difficult to understand because it’s a binary string. It’s a bunch of bytes, and you need to segment the bytes according to the function specification.
So there’s like all this array manipulation and very low level stuff that’s not really relevant to the business purpose of making a query or analysis that you have to do in order to be able to do the query or the analysis. And so really what we’re doing is we’re factoring out all of this menial labor that a developer would have to do, doing it once, and doing it correctly, and then nobody has to work on that problem again if they’re willing to run their operations on BigQuery.
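The byte segmentation Allen describes follows the standard Solidity ABI layout: a 4-byte function selector, then 32-byte words per argument, with addresses left-padded. A minimal decoder sketch, with hypothetical calldata:

```python
def decode_erc20_transfer(input_hex: str):
    """Segment the raw input bytes of an ERC20 transfer(address,uint256)
    call: 4-byte selector, then two 32-byte ABI-encoded words."""
    if input_hex.startswith("0x"):
        input_hex = input_hex[2:]
    data = bytes.fromhex(input_hex)
    selector = data[:4].hex()
    to_addr = "0x" + data[4:36][-20:].hex()      # address left-padded to 32 bytes
    amount = int.from_bytes(data[36:68], "big")  # uint256, big-endian
    return selector, to_addr, amount

# Hypothetical calldata: transfer 1000 token units to address 0x11...11.
calldata = (
    "a9059cbb"                              # selector for transfer(address,uint256)
    + ("11" * 20).rjust(64, "0")            # to-address, padded to a 32-byte word
    + format(1000, "x").rjust(64, "0")      # amount = 1000, as a uint256 word
)
sel, to, amt = decode_erc20_transfer(calldata)
print(sel, to, amt)  # a9059cbb 0x1111111111111111111111111111111111111111 1000
```

This is the "menial labor" the dataset factors out: done once per function specification, so nobody has to re-derive the byte offsets.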
Sebastien: Interesting. So the Ethereum blockchain data is all contained there; within the input bytes of the smart contract call we have the information, for example some address data or the function call hash. And what you’re doing is extracting that data from what is essentially a blob of bytes and making it available in this query format, so that now you might be able to say, “Okay, what are all the contracts that are calling this specific function . . .
Sebastien: . . . or what are the transactions that are calling to the specific function in this contract,” and making it easily available, whereas if you want to do that by yourself you would have to build it from scratch.
Allen: Yeah. We’re just reformatting the data, right? It’s in this form that is just difficult to access for doing certain types of analyses. It’s designed for usability by the Ethereum blockchain software, right? The Ethereum peer-to-peer network is concerned with consensus, and concerned with the efficiency of transactions, and concerned with the operations of the Ethereum Virtual Machine in the blockchain, right? But it’s not concerned with, “Hey, what if somebody wants to do some historical analysis on all the data in here?” That’s irrelevant from the point of view of such a database. If somebody was to design a database that is for transactions, they would design it in a particular way. If you wanted a database for analysis, you would design it the opposite way.
And if you get into data warehousing theory, these are the two extremes of different types of database design. One is called an OLTP, an online transaction processing database. The canonical example is usually a hotel or an airline reservation system. It’s very concerned with transactional integrity and very concerned with throughput, the number of transactions per unit time. But it doesn’t really concern itself with analyzing price trends of the hotel rooms or the flights, right? But you can take that transactional data and reformat it into an online analytics processing system, an OLAP database, which denormalizes it. It doesn’t care about transactions but tries to structure the data in such a way that it’s easy to send any arbitrary query against it and get reasonable performance when you want to get the data back to a business application or an analyst or something like that.
So at a higher level that’s what we’re doing with this blockchain data. We’re taking OLTP data, which is optimized by the OLTP system, i.e. the blockchain, for transaction throughput, and we reformat it and make it available as OLAP.
Sebastien: Okay. It occurs to me that an OLTP system is like Bitcoin but with mutability?
Allen: Bitcoin does not have mutability, yeah. So if you add an immutability constraint to an OLTP, you get a blockchain. Sure, you could say it like that.
Sebastien: Okay. What type of analyses have you been doing on Ethereum datasets? In the blogpost there was a couple of examples there. Can you talk about those?
Allen: Yeah. Some of them are time series examples where we’re looking at the number of transactions per day, and that’s a very obvious kind of thing to do. What interests me more though is the characteristics of actors on the network, or I guess interactors you could say, because wallets aren’t doing anything on their own on Bitcoin or Ethereum or any of these other blockchains. There’s always some interacting partner, right? And that interaction between the two partners is a measurable observation, and if you look at many of these observations in aggregate, for one address or groups of addresses over time, you can start to quantify what they’re doing and assign attributes to the addresses.
So you could begin to identify, as a concrete example, exchanges, because they are typically going to have large volumes flowing in and out of them, and they’ll typically have many, many interacting partners of other wallets sending money or tokens in or out. Now there could be other addresses in the network that aren’t exchanges, or aren’t known exchanges, that have similar behavior, but we could use a duck typing approach to characterize those. If it looks like a duck and it quacks like a duck and it smells like a duck, it’s probably a duck. So you could start to label exchanges that are unknown based solely on their attributes, by looking at their behavior over time. Mining pools would be another example.
You can start to pick those out, and you could imagine what kind of characteristics the pool would have, or a miner who is time sharing inside of a pool: they’ll have a particular type of characteristic, right? They’re going to receive deposits periodically, and only after the mining pool mines a block, right? That sort of thing is very interesting to me because it relates very, very much to the type of work I was doing in my dissertation as a graduate student. I was looking at genetic networks, and particularly the human genome network, which is composed of genes that are interacting with one another to operate a cell, which is basically a highly parallel distributed system of molecules interacting with one another to process information.
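The duck-typing idea can be sketched as a toy rule-based labeler. The thresholds and behavioral features here are entirely made up for illustration; a real classifier would learn them from labeled addresses:

```python
def label_address(stats: dict) -> str:
    """Toy duck-typing heuristic with made-up thresholds: an address that
    moves large volume with many distinct counterparties walks and quacks
    like an exchange; one receiving small periodic deposits from a single
    partner looks like a pool miner."""
    if stats["counterparties"] > 1000 and stats["volume"] > 1e6:
        return "exchange-like"
    if stats["periodic_deposits"] and stats["counterparties"] < 5:
        return "pool-miner-like"
    return "unknown"

# Hypothetical per-address aggregates, as if computed from the dataset.
observed = {
    "0xaaa": {"counterparties": 50_000, "volume": 9e8, "periodic_deposits": False},
    "0xbbb": {"counterparties": 1, "volume": 40.0, "periodic_deposits": True},
}
labels = {addr: label_address(s) for addr, s in observed.items()}
print(labels)  # {'0xaaa': 'exchange-like', '0xbbb': 'pool-miner-like'}
```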
And there’s a whole bunch of analytical techniques that come from biostatistics for analyzing these biological networks, and you can analyze financial networks with the same techniques. And it turns out some of these guys doing anti-money-laundering applications or other types of fraud analytics or forensic accounting, they’re also very interested in this kind of stuff, but they don’t necessarily have the level of sophistication that biologists do, because the National Science Foundation and the National Institutes of Health have been throwing a lot of dollars at curing cancer for a long time, which is why a lot of these methods were developed. Cancer is a disease of the genetic program.
I feel like I’m kind of getting off on some tangents here and rambling, but there’s some very direct connection between the math and the methods of what I was doing as a graduate student and this stuff that’s happening with blockchain right now. And not the blocks or the consensus themselves, but the interaction between the entities on the network, that’s what I’m interested in.
Sebastien: Well, feel free to elaborate. I think it’s really a fascinating topic, so . . .
Allen: Yeah. I just gave you two examples. One of them being identifying exchanges. The other one being identifying mining pools or timeshare miners. I don’t know. What other sorts of interesting patterns would you want to look for?
Sebastien: Well, there was one example here in the blogpost that was kind of interesting and that’s analyzing the functionality of smart contracts. Can you dive into that one?
Allen: Yeah, sure. That’s pretty cool because, as you mentioned earlier, there’s these hashes of the functions, right? And each function has its own signature. If you consider a smart contract as having some set of functions available to it, anywhere from zero functions to all possible functions in the hash space, about four billion, then every smart contract has some subset of that range, and you can define a distance metric or a similarity metric between any two contracts that tells you how similar they are in the functions they implement. So it’s reasonable to say that if two smart contracts have the same functions available, what they do is probably, if not the same, quite similar. All ERC20s implement four methods, and so you could find all ERC20 contracts by checking to see if a contract implements those four methods.
That’s actually how we do this in the database. There’s a table that documents specifically ERC20 transfers, and there’s a table that lists all smart contracts, and there’s a Boolean column on that smart contract table which says, “Is ERC20,” or, “Is ERC721,” because these are the two dominant smart contract types, so we just add it as a convenience, pre-analyzed, to save developers some time so they can restrict analysis to only those smart contracts. But you could do it for any arbitrary set of functions you’re interested in. All the data are there.
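One natural similarity metric over function-selector sets is the Jaccard index: intersection over union. A small sketch, using some real ERC20 selectors (the exact subset the dataset’s "Is ERC20" heuristic checks is an assumption here):

```python
def jaccard(a: set, b: set) -> float:
    """Similarity of two contracts measured on their function-selector sets."""
    return len(a & b) / len(a | b) if a | b else 1.0

# Real 4-byte selectors for some core ERC20 functions:
# totalSupply, balanceOf, transfer, transferFrom.
ERC20_CORE = {"18160ddd", "70a08231", "a9059cbb", "23b872dd"}

token_a = ERC20_CORE | {"d0e30db0"}   # ERC20 plus a deposit() method
token_b = ERC20_CORE                  # a plain token
unrelated = {"deadbeef", "cafebabe"}  # made-up selectors, some other contract

print(round(jaccard(token_a, token_b), 2))  # 0.8
print(ERC20_CORE <= token_a)                # True: implements the core methods
print(jaccard(token_a, unrelated))          # 0.0
```

The subset check on the last lines is the Boolean-column test Allen describes; the Jaccard score is the graded version used to find near-clones.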
Sebastien: Right. So for any standardized smart contract, such as the token contract or an ERC721 contract, you can essentially just look at the transaction, look at the function hashes in the contract, and determine whether or not it is in fact this type of transaction.
Allen: That’s right.
Sebastien: And then you’re making that data available so that you don’t have to extract the data yourself, it’s already made available for you in the BigQuery dataset.
Allen: Yup. And I defined a function for the similarity metric we were just describing. There’s an analysis like this in the blogpost, where I show the original CryptoKitties contract and you can see all of the subsequent iterations, where they basically upgrade the contract, and they’re all similar to one another because they’re adding functions over time. And then you can also see clones. Because the CryptoKitties contract is open source, you can see somebody made CryptoPuppies and Crypto Clowns and other variants of this thing that are basically clones of the game using the same code, and they show up as very similar contracts. So if you have some game that you like to play, let’s say any of these match-three jewel games, Candy Crush or something, you could find all other Candy Crush-like games on the blockchain because they would have similar functionality.
Sebastien: I think I saw this in one of your talks that you gave in Singapore recently. I’ll try to find a link and add it to the show notes, but there was this analysis of the frequency with which specific smart contracts were called, and the token contract, I guess, overshadowed everything else. But then there was one contract which more recently had gained quite a bit of volume, and that was the CryptoKitties contract.
Allen: Yeah. It had a very brief spike and I think probably most of your listeners will remember that the Ethereum blockchain . . . you couldn’t add transactions to it. It was clogged up. And a bunch of I.C.O.s had to delay because there was a CryptoKitty craze, right? Yeah, but the ERC20 transfer function is the most common one or has been. I haven’t looked at the data recently, but I would imagine still.
Sebastien: Of course, I’m sure it still is. One area that I wanted to dive into is . . . we mentioned earlier that the Google Cloud services are integrated, so we can essentially connect different services. So what are the types of analyses that one could make? I figure there’s actually quite powerful analyses you could make using the A.I. component of Google Cloud with the Ethereum transaction data. Have you thought about the types of things that one could infer from this data by plugging in a machine learning algorithm and doing some deep learning on this transaction data?
Allen: Yeah. I am actively thinking about that, and for the data that are in BigQuery, there’s a couple of first-order things you can do just with the data as they are today. You could look at the input bytes going into transactions and begin to reason about what the functionality of a contract might be. Say there’s a function signature where you don’t know what it does, but you see what its inputs are: you could identify round numbers or you could identify addresses, and this would give you some hint as to what that function is likely doing, without getting into analysis of the stack trace of the virtual machine. You can also analyze the smart contracts themselves, right? Because these are all just bytecode, and you can treat each of those bytes as features and train some kind of analyzer to classify contracts.
It would probably be something quite similar to what we were just talking about, finding similar contracts, but would have some additional ability to detect other things that are difficult to do with such a simple method. But the more interesting methods, you can’t really do them directly, so let’s come back to what I was saying about networks and network analysis. You can’t really do network analysis on the BigQuery dataset directly, because network analyses require traversals through the network. This basically would mean looking at a table that is set up for scanning. We talked about BigQuery being a scanning system earlier. You would have to basically have random access and do these recursive queries, which it’s not really well suited to. It’s well-suited for linear scans. And in order to do these analyses that involve traversing the network, you need to move the data into another type of database called a graph database.
And so what’s beginning to happen now, this is me and some other data scientists in the open-source space, we’re exploring this. And there’s another link we can put in the supplement to some work by one of my collaborators. We’re loading data into graph databases, analyzing it, reducing the graph down to something like a single measurement per address or per transaction, allowing us to assign a value between zero and one for the probability of this thing being an exchange, for example, and then taking those attribute data from the graph database and putting them back into BigQuery, so that it becomes what we would call a vector of features. This is getting very specific into machine learning now, but these machine learning models typically want to operate on something called a vector space model.
So they want every observation to be a row, but what I just told you is that the network data is not row-like in nature; it’s graph-like in nature. But you can reduce the graph down to rows by going out to a graph database. So iterating between the data warehouse and a graph database to create more elements in the data warehouse, basically doing enrichment through analysis of the graph, enables these A.I. algorithms to begin analyzing the graph. That’s the direction we are moving right now, and fortunately Google Cloud has good technologies for building graph databases as well, so we’re secure there, but it requires a lot of work.
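The graph-to-rows reduction can be sketched in plain Python, with toy transfer edges standing in for both the graph database and the scanned table:

```python
from collections import defaultdict

# Toy transfer edges (from_address, to_address, value), made up for
# illustration; in practice these come from the transactions table.
edges = [
    ("A", "B", 10.0), ("C", "B", 5.0), ("D", "B", 7.5),
    ("B", "A", 2.0),
]

# "Graph analysis" step, reduced to plain dicts: aggregate each
# address's degree and in/out volume from its edges.
feats = defaultdict(lambda: {"in_deg": 0, "out_deg": 0, "in_vol": 0.0, "out_vol": 0.0})
for src, dst, val in edges:
    feats[src]["out_deg"] += 1
    feats[src]["out_vol"] += val
    feats[dst]["in_deg"] += 1
    feats[dst]["in_vol"] += val

# Flatten to one row per address: the feature-vector shape that a
# vector-space ML model, or a BigQuery table, expects.
rows = [
    (addr, f["in_deg"], f["out_deg"], f["in_vol"], f["out_vol"])
    for addr, f in sorted(feats.items())
]
print(rows[1])  # ('B', 3, 1, 22.5, 2.0)
```

Address B, with high in-degree and in-volume, is exactly the kind of row an exchange-probability model would then score.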
Sebastien: You mentioned earlier that the entirety of Reddit was also available as a public dataset in Google Cloud.
Sebastien: It occurs to me that we might be able to do some sort of analysis as to the success of an I.C.O., for example. So take all of the I.C.O.s from the last two or three years, look at the amount of money raised, and then perhaps correlate that with some natural language processing data extracted from Reddit communities, and maybe some other data in there, and come up with some kind of predictive model to determine what kind of characteristics of communities you might look for in order to determine whether an I.C.O. will be successful or not, or this type of thing.
Allen: Yeah, totally. Reddit actually was presenting at the Google Cloud Next conference over the summer, talking about what they’re doing to analyze some of the activity; they’re trying to basically help people find more content on Reddit so they can continue living out their happy Reddit life. But, yeah, you can use this type of technology called natural language processing. It’s a type of A.I. that tries to understand natural language, like the Google Assistant. You might see this on your phone, or Siri, or Google Duplex, or Google Home. You’ve probably seen these conversational agents, right? Bots. You can take some text, or speech converted to text, and send it into an A.I.
And extract things like, “Hey, what was the key object being talked about here,” or, “What was the sentiment being expressed about that object,” and quantify it, basically reducing the human-readable text down to some numbers, and then you could cross-reference those numbers. So back to your example: “Okay, was it more important to the success of a project that there was a lot of positive sentiment, or is it more important that there was just a lot of buzz in general, even though a whole bunch of it was negative?” I don’t know what the answer is, but you could begin to explore this line of reasoning by linking the Reddit dataset to the Bitcoin or Ethereum dataset.
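Reducing text to numbers can be sketched with a deliberately naive sentiment score. This is a toy stand-in for a real NLP service: hand-picked word lists, hypothetical posts, and a simple count in place of an actual language model:

```python
# Toy sentiment quantification: score each post by counting hand-picked
# positive and negative words, then aggregate per project. A real
# pipeline would use an NLP model instead of word lists.
POS = {"great", "moon", "love"}
NEG = {"scam", "dead", "dump"}

def score(text: str) -> int:
    words = text.lower().split()
    return sum(w in POS for w in words) - sum(w in NEG for w in words)

# Hypothetical Reddit-style posts, keyed by project.
posts = {
    "projectA": ["great team, to the moon", "love this roadmap"],
    "projectB": ["total scam", "dead project, dump it"],
}
sentiment = {p: sum(score(t) for t in texts) for p, texts in posts.items()}
print(sentiment)  # {'projectA': 3, 'projectB': -3}
```

These per-project numbers are then just another column to join against money raised in the blockchain dataset.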
Sebastien: It could make for a real interesting PhD thesis.
Allen: I would love to support any data scientist who wants to work on this. Please, if anybody listening who wants to work on that, shoot me an email or join up on Kaggle … let’s get the data analysis going.
Sebastien: I can see something like analyzing memes, like Dogecoin memes. How many times are people sharing a specific meme? Does that have an impact on an I.C.O. . . .
Allen: Yeah, and then there’s . . .
Sebastien: . . . or something in that nature.
Allen: We even have this Vision A.P.I., which is a sibling of the Language A.P.I., that can actually look at images and analyze them, and tell you what’s in the image and what the general sentiment about it is. Yeah, for sure.
Sebastien: Looking into the future, are there any plans to release other tokens, blockchains, or cryptocurrencies on the open platform?
Allen: I’m looking at a bunch of other stuff right now, yeah, more of these public blockchain data. It’s all just sitting there in public datasets, and these current ones are getting a bunch of good traction, so more analysis would be interesting to do. I think the Ethereum dataset has quite a lot of mileage to be gotten out of it though. It’s just so deep and interesting. I’ve not done any plot of developer activity versus, I don’t know, market cap or anything like that, but I would imagine Ethereum would be a real outlier. That community has done a good job of making ecosystem components available and being very supportive of their developers, and so it’s just very vibrant compared to a lot of the other projects.
Even though Bitcoin is bigger by market cap, it didn’t really attract as much…I didn’t get as many inbound inquiries about that dataset as I did for the Ethereum dataset.
Sebastien: Well, Allen, this is all very fascinating. I want to thank you for coming on the show and talking about this. I think there’s a lot of interesting things to be built on top of these datasets. And we kind of mentioned this before the show, but in some ways this data is meant to be public, right? And a lot of companies out there have, sort of, made their business model upon these datasets, like block explorers and companies that do blockchain analysis. Their entire business models are built on what is essentially a public dataset, but where the blockchain itself doesn’t really have the underlying infrastructure that allows just anyone to make these really, actually quite simple queries on top of it; we’ve had to build all this other infrastructure on top. So the fact that Google is making this available to the public is really great, and I’m looking forward to seeing what kind of applications or what kind of research comes out of this in the future.
Allen: Yeah, that’s what we’re here to do. We’re here to help developers make cool applications.