Crissy Field is a company offering software development and tech consulting services, including sparring sessions with tech leads, engineers, CEOs, and CTOs. The company started out in information security and still operates in that space. While running penetration tests, Dr. Thomas Jansen, founder and CEO of Crissy Field (formerly a Senior Engineering Manager at Apple in Cupertino), kept running into a specific issue: internal Git folders being inadvertently exposed. To warn people, he developed an internet crawler, a non-commercial project intended mostly to make the world a better place. In this interview he tells us about his do-good project, which we proudly host on our Leafcloud servers.
Hi Thomas, can you tell us how Repo Lookout came to life?
While we were doing penetration tests, I discovered that when people deploy their websites using the Git source code management system, they often forget to properly protect the internal Git folder. Consequently, it is exposed, and if a hacker can find the Git repository, they can download all the files, including the full history. There might even be passwords or other sensitive data in there. For example, if you have a WordPress instance running, there is the typical wp-config.php, which might contain all the information you need to access the database.
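To illustrate the kind of exposure described here: a deployed site often serves the `.git` folder as static files, so fetching a hypothetical `https://example.com/.git/HEAD` returns Git's HEAD file. This is not Repo Lookout's actual code, just a minimal sketch of a heuristic that recognizes such a response:

```python
import re

def looks_like_git_head(body: str) -> bool:
    """Heuristic: does an HTTP response body look like a Git HEAD file?

    An exposed repository typically serves .git/HEAD either as a symbolic
    ref ("ref: refs/heads/main") or, when HEAD is detached, as a bare
    40-character hexadecimal commit SHA.
    """
    body = body.strip()
    if body.startswith("ref: refs/"):
        return True
    # Detached HEAD: exactly one 40-hex-digit SHA-1 object name.
    return re.fullmatch(r"[0-9a-f]{40}", body) is not None
```

A web server returning an HTML 404 page for that path would fail both checks, so the heuristic distinguishes a real exposure from a protected or absent folder.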
At first, I thought this might not be a widespread problem. But it turned out that it was. So, during COVID, when I had some spare time on my hands, I decided to develop an internet crawler that looks specifically for this issue. Repo Lookout, as it is called, turns out to be quite successful. So far more than 8 billion URLs have been checked, and we are closing in on 1 million inadvertently exposed repositories found.
So how does it work?
When we see a repository is exposed, we send an email to whoever is responsible for the website to inform them. The feedback we get is super positive. As it is a non-commercial project and there are some costs involved, people can tip us. It's not something to get rich with, but I get way more out of it than money. It’s a very fulfilling project. People tell me this is a crucial service for the internet and that our report “saved their ass”. The fact that people are finding Repo Lookout so useful gives me a lot of joy.
But these crawlers must take up a lot of resources?
The crawlers use typical crawler infrastructure. They are small programs that fetch URLs and check whether each one is a Git repository. If it is, a workflow is kicked off, with other small programs running on different nodes in the cloud to find the owner of the URL, read the first few commits of the repository, and more.
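The fetch-check-dispatch loop described above can be sketched roughly as follows. This is an assumption-laden illustration, not Repo Lookout's actual code: the fetcher is injected so the logic runs without network access, and `on_exposed` stands in for the follow-up workflow (owner lookup, reading the first commits, notification):

```python
import queue
from typing import Callable, Optional

def crawl(urls: "queue.Queue[str]",
          fetch: Callable[[str], Optional[str]],
          on_exposed: Callable[[str], None]) -> None:
    """Drain a queue of site URLs, probing each for an exposed Git folder.

    `fetch` returns the response body for a URL, or None on failure.
    When /.git/HEAD looks like a real Git HEAD file, the URL is handed
    to `on_exposed`, which represents kicking off the downstream workflow.
    """
    while True:
        try:
            url = urls.get_nowait()
        except queue.Empty:
            return  # queue drained, this worker is done
        body = fetch(url.rstrip("/") + "/.git/HEAD")
        if body and body.strip().startswith("ref: "):
            on_exposed(url)
```

In a real deployment each crawler node would run such a loop against a shared work queue, which matches the description of many small programs spread across cloud nodes.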
Before, this was all running on Amazon Web Services, which was fine until I realized that it is a real energy waster. Even though I feel it's a very important project, I didn't feel good about running multiple crawler nodes on AWS, especially after a customer inquired with Amazon about the sustainability of their hosting. I was really blown away by their reply.
So, I decided if I do something like this, I want to do it the right way. Therefore, I started looking around for sustainable solutions.
What made you decide to switch to Leafcloud?
Some of my friends think it's crazy that I spend so much time on this project. But I find it very motivating. And I don't want to destroy that motivation by knowing that there is energy waste or CO2 pollution because of it. Then a friend told me about Leafcloud, and a few days later I checked Ifconfig.co and saw it was sponsored by Leafcloud. Then I thought: okay, this is the second time I've heard about it, it must be a good thing. I read through the website and was immediately convinced that this is a very, very good approach. So, I reached out to the CTO, Jegor van Opdorp.
How about the setup on Leafcloud's servers? Was it complicated?
It was super easy to set up. Interaction with you guys was great. In the beginning, I asked Jegor whether I could get a system that is small in terms of CPU and memory but has huge bandwidth, because that's what I'm doing: I don't need much processing power, I just want to crawl a lot of data. Of course, it is a balance for you guys, and he replied that the network speed is tied to the number of CPU cores. And while I would love to see a different approach to this, I understand that it would make the whole UI way more complex. In the end I think you made the right call, as I would always choose the easier interface and setup over having every small knob to customize everything.
So, no, I felt at home immediately, and so far, everything has been working perfectly. It's very reliable and super fast. Just a few days ago, I increased the number of crawler nodes I'm using, and I'm planning to double that number again. So, I'm very happy. The only feedback is that I don't know how to see my bills, although you do send them by e-mail. <Leafcloud: this is being worked on and should be a one-button thing soon>.
Overall, it just feels better than merely compensating for the CO2, which seems like a fake solution. So yeah, I can do this without feeling guilty about polluting.
Would you recommend Leafcloud to other developers?
Actually, I tell a lot of people about Leafcloud. Just the other week, I gave a talk at a conference and advised people who are looking for a sustainable solution to look into Leafcloud. It would be great if Leafcloud also opened data centers in Germany, though <Leafcloud: also coming soon>.
There are moments when I wonder if this project is the right approach. But then I consider the Git repositories and even GitHub credentials that are exposed: of the roughly one million repositories I can access, about one hundred and fifty thousand contain GitHub credentials. When you look at the amount of data that is readable by everyone, acting in good or bad faith, I feel it's my obligation to let people know that there's something wrong.
So, yeah, I think we can support each other and make the internet a better place. You with reducing CO2 emissions and me with warning people about their security.
More about Repo Lookout: www.repo-lookout.org
More about Crissy Field: www.crissyfield.de
More about Thomas: linkedin.com/in/tcjansen/