
Mastering Realtime Data: How I Topped the Leaderboard at PyData

Two weeks ago I went to PyData Amsterdam, which kicked off with a tutorial day. The tutorials included a workshop on creating multi-modal LLM agents, one on advanced web-scraping techniques, and even one where I learned how to extend Polars with custom Rust functions.

Why did I go to PyData? More on that later…

The next day I joined the first conference day and learned about the PyData 2024 Challenge, a strategy game running during the two conference days. To participate, you distribute 100 armies across 10 castles. The castles are worth increasing points from 1 to 10, making strategic distribution essential to outscore other participants.

This post is about the strategy that won me this challenge and, more importantly, the lessons I learned along the way.

Challenge Accepted

So how do you participate in this challenge? There are 10 castles with input boxes. The first box (castle) is worth 1 point, with each successive box worth one point more, up to 10 points for the last box. When you allocate more armies to a castle than another player, you win that castle (and its points).

The goal of the game is to beat the highest percentage of players: your strategy is evaluated against every other player's strategy, making it virtually impossible to beat them all. Beating the highest percentage of players nets you the top spot on the leaderboard, and whoever holds that spot at the end of the second conference day wins. The win ratios are recalculated with each new submission, so a strategy that scores well at one moment is demoted every time a new submission beats it.

The challenge is apparently based on an old math game from The Riddler called "Can You Rule Riddler Nation?".1

Screenshot of the landing page of the challenge, where you can practice against bots and compete for real.

To start the challenge, I tried a few different strategies on my phone and was able to break into the top 10 of the leaderboard within a few tries, but the gap to the top 2 was big, and doing this on my phone felt inefficient and cumbersome. Other attendees at the conference were also joining the challenge, shifting and even lowering my win ratio and place on the leaderboard with each new submission.

The challenge had started, but this is only the beginning…

We need Data, lots of Data…

Realizing that most winning strategies for this challenge probably start with understanding the data (it is a data science conference, after all), it was time to start web-scraping with Selenium. I began by writing a simple function that inputs values in each box, submits them, and fetches the resulting win percentage without user interaction. For bookkeeping and future analysis, it writes each result together with the input values to a file, and I included the option to create multiple workers, each in its own thread, to increase the number of submissions per minute. This way I could test more strategies in a short amount of time.
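A minimal sketch of what this setup could look like; the URL, element IDs, and result selector below are hypothetical stand-ins, since the real page markup isn't reproduced here:

```python
import csv
import threading

from selenium import webdriver
from selenium.webdriver.common.by import By

CHALLENGE_URL = "https://pydata.probabl.ai"  # hypothetical entry point

def submit_strategy(driver, allocation):
    """Fill the 10 castle boxes, submit, and return the reported win %."""
    driver.get(CHALLENGE_URL)
    for i, armies in enumerate(allocation):
        box = driver.find_element(By.ID, f"castle-{i + 1}")  # hypothetical ID
        box.clear()
        box.send_keys(str(armies))
    driver.find_element(By.ID, "submit").click()           # hypothetical ID
    # (a real version would wait for the result element to update)
    result = driver.find_element(By.ID, "win-ratio").text  # hypothetical ID
    return float(result.strip("%"))

def worker(allocations, outfile):
    """One worker thread: its own browser, its own stream of submissions."""
    driver = webdriver.Chrome()
    with open(outfile, "a", newline="") as f:
        writer = csv.writer(f)
        for allocation in allocations:
            writer.writerow([*allocation, submit_strategy(driver, allocation)])
    driver.quit()

# Spawn a few workers, each with its own batch of strategies to try.
batches = [[[10] * 10], [[0, 0, 0, 0, 0, 20, 20, 20, 20, 20]]]
threads = [threading.Thread(target=worker, args=(batch, f"results_{i}.csv"))
           for i, batch in enumerate(batches)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```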

Disclaimer: it was actually conda install conda-forge::selenium, but you get the picture

After the automated submitting was working properly, I set up a linear solver using the simplex method (a simple but efficient technique for solving linear programming problems), with the objective of maximizing the result percentage by changing the values in each box (as long as they total 100). This worked pretty well, and with 3 workers running, the win ratio was improving and I was slowly moving up the leaderboard.
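Since the win percentage is a black box you can only observe through submissions, a derivative-free optimizer fits this loop. A minimal sketch using SciPy's Nelder-Mead (the "downhill simplex" method), reusing the hypothetical submit_strategy from the sketch above; this illustrates the approach rather than reproducing the original script:

```python
import numpy as np
from scipy.optimize import minimize
from selenium import webdriver

driver = webdriver.Chrome()  # submit_strategy as defined in the earlier sketch

def to_allocation(x):
    """Map an unconstrained vector to 10 non-negative integers summing to 100."""
    weights = np.maximum(x, 0) + 1e-9
    alloc = np.floor(100 * weights / weights.sum()).astype(int)
    alloc[np.argmax(alloc)] += 100 - alloc.sum()  # give the rounding remainder to the largest box
    return alloc

def objective(x):
    win_pct = submit_strategy(driver, to_allocation(x))  # one live submission per evaluation
    return -win_pct  # minimize the negative to maximize the win %

result = minimize(objective, x0=np.ones(10), method="Nelder-Mead",
                  options={"maxiter": 200, "xatol": 0.5})
print(to_allocation(result.x))
```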

Richard and his peasants

After a fun first conference day and a long ride home, I noticed a few new top dogs on the leaderboard: one using the custom username "richard", followed by numerous "richard-peasant-#" entries (more on this later). Meanwhile, my scores had dropped from the mid-90s to the high 80s. Clearly my solver was not solving the right problem. The nature of this game is that you need to beat other players, and I was optimizing against a static target even though the winning strategy was ever-changing.

Time to Evolve

My solution was to build a genetic algorithm, a method inspired by natural selection, to replace the linear solver. I got it running quickly with some standard NumPy functions, but dialing in the parameters so it actually outperformed the solver was neither easy nor fast. I had not played with evolutionary models in years, but after a lot of back and forth, trial and error, it was outputting something that looked promising.

Let's run 20 threads, set the objective to 100%, and call it a day.

The mutate function I settled on, with a population of 8 and 3 parents for quicker adjustments to changes in other players' strategies.
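A minimal NumPy sketch of the idea, assuming the population of 8 and 3 parents mentioned above; the details are illustrative, not the original code:

```python
import numpy as np

rng = np.random.default_rng()
POP, PARENTS, CASTLES = 8, 3, 10

def mutate(parent, strength=3):
    """Move a few armies from random non-empty castles to other random castles."""
    child = parent.copy()
    for _ in range(strength):
        src = rng.choice(np.flatnonzero(child > 0))
        dst = rng.integers(CASTLES)
        child[src] -= 1
        child[dst] += 1
    return child  # total stays at 100 by construction

def evolve(population, scores):
    """Keep the best PARENTS strategies, refill the rest of the population with mutants."""
    order = np.argsort(scores)[::-1]  # best score first
    parents = [population[i] for i in order[:PARENTS]]
    children = [mutate(parents[i % PARENTS]) for i in range(POP - PARENTS)]
    return parents + children

# One generation: score every candidate via the website, then evolve.
population = [rng.multinomial(100, [0.1] * CASTLES) for _ in range(POP)]
scores = [submit_strategy(driver, p) for p in population]  # Selenium call from earlier
population = evolve(population, scores)
```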

Getting outside of the box

The results the next morning were good: I was back in the top 10, but not consistently beating the first spot on the leaderboard. After a hint from the creator of the challenge to 'not overthink it', I came up with a new strategy (which was probably the definition of overthinking it): manipulate the player pool by creating good strategies that beat most players, but that I could consistently beat myself. Effectively, I would play against my own strategies with a better strategy to get the highest win ratio and reach that number one spot on the leaderboard.

The way I did this was by setting up 4 workers, of which 3 had different constraints on their upper and lower bounds, limiting the freedom of the mutations in a predetermined way. The last worker had none of these constraints and could therefore easily evolve to beat the other ones, boosting its win ratio all the way to the top of the leaderboard.
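A sketch of how such per-worker bounds could be wired into the mutation step; the specific bound values are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng()

# Per-worker clipping bounds (hypothetical values): three deliberately
# handicapped "decoy" workers and one unconstrained worker that preys on them.
WORKER_BOUNDS = [
    (0, 15),       # decoy 1: no castle may hold more than 15 armies
    (0, 25),       # decoy 2
    (5, 30),       # decoy 3: forced to spread at least 5 armies everywhere
    (None, None),  # the free worker: no constraints at all
]

def constrained_mutate(parent, lower, upper):
    """Mutate, then repair the child so it respects this worker's bounds."""
    child = mutate(parent)  # the GA mutate from the sketch above
    if lower is None:
        return child
    child = np.clip(child, lower, upper)
    diff = 100 - child.sum()  # clipping may break the 100-army total
    while diff != 0:
        i = rng.integers(len(child))
        step = 1 if diff > 0 else -1
        if lower <= child[i] + step <= upper:
            child[i] += step
            diff -= step
    return child
```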

Redefine the rules of the game

The next step was to refine this strategy by creating a system with moving ranges of freedom. To generate these, I wrote a small script using recursive partitioning, a method that breaks the problem down into smaller, manageable parts. The script created a matrix of all potential army distributions within the given constraints, allowing for sophisticated strategic planning.

To make it perform well in a web-scraper setup, I set it to start at 0 and probe the range from -100 to 100, moving inward from the outer edges in steps of 10 and switching to steps of 1 only around the valid ranges. This limited the number of steps (submissions to the website) while still getting results accurate to 1 army.
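A sketch of that coarse-to-fine probing for a single box, with a stand-in validity check in place of an actual website submission:

```python
def find_valid_range(is_valid, lo=-100, hi=100, coarse=10):
    """Find the lowest and highest accepted values for one box: probe inward
    from the outer edges in coarse steps, then refine the boundary in steps of 1."""
    lowest = next(v for v in range(lo, hi + 1, coarse) if is_valid(v))
    highest = next(v for v in range(hi, lo - 1, -coarse) if is_valid(v))
    while lowest - 1 >= lo and is_valid(lowest - 1):  # fine pass, stepping back out
        lowest -= 1
    while highest + 1 <= hi and is_valid(highest + 1):
        highest += 1
    return lowest, highest

# Stand-in validity check: suppose the site silently accepts -10 through 100.
print(find_valid_range(lambda v: -10 <= v <= 100))  # -> (-10, 100)
```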

The results were surprising: the matrix showed that negative values down to -10 were allowed for all the boxes. After testing this manually, by setting the low-tier boxes to -10 and putting those armies in the higher-tier boxes, my submission went above 99% instantly.

THIS CHANGES EVERYTHING

First part of the final script: find the possible ranges with the minimal number of steps

This new insight allowed me to run my old strategy in a much more optimized way. It propelled me to the top with all 4 workers, effectively running 3 impossible-to-beat strategies and one worker beating those 3 to claim the top spot.

Soon other players started using negative values too. Could I keep the advantage?

Second part of the final script: the main function that dynamically runs multiple threads with multiple strategies.
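A sketch of how the pieces above could be tied together: one thread per worker, each with its own browser and its own bounds. It reuses the hypothetical helpers from the earlier sketches (submit_strategy, constrained_mutate, and the GA constants), so it is a reconstruction of the shape of the script rather than the script itself:

```python
import threading
import numpy as np
from selenium import webdriver

def run_worker(bounds, generations=50):
    """One GA worker: evolve a population under its own bounds, submitting live."""
    lower, upper = bounds
    driver = webdriver.Chrome()
    population = [rng.multinomial(100, [0.1] * CASTLES) for _ in range(POP)]
    for _ in range(generations):  # a fixed count as a stand-in stop condition
        scores = [submit_strategy(driver, p) for p in population]
        order = np.argsort(scores)[::-1]
        parents = [population[i] for i in order[:PARENTS]]
        population = parents + [constrained_mutate(parents[i % PARENTS], lower, upper)
                                for i in range(POP - PARENTS)]
    driver.quit()

threads = [threading.Thread(target=run_worker, args=(bounds,))
           for bounds in WORKER_BOUNDS]
for t in threads:
    t.start()
```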

On a side note, a funny trick I discovered thanks to "richard" and his peasants was that you could set a custom player name by changing the user field value in the cookies. I chose to stay anonymous so as not to give away my strategy, but it was surprisingly simple to implement with the Selenium webdriver: driver.add_cookie({'name': 'user', 'value': username, 'path': '/', 'domain': 'pydata.probabl.ai'})
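One practical detail: Selenium only lets you add a cookie for the domain that is currently loaded, so the call has to come after navigating to the site, followed by a refresh. A minimal sketch:

```python
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://pydata.probabl.ai")  # must be on the domain before setting its cookie
driver.add_cookie({'name': 'user', 'value': 'my_custom_name',
                   'path': '/', 'domain': 'pydata.probabl.ai'})
driver.refresh()  # reload so the site picks up the new username
```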

Bricks and Clicks

After two more hours of frantically checking the latest results of my script while attending some great sessions, the deadline passed and I met up with some of the other players. Interestingly, more people had discovered that negative values were allowed, and I wasn't the only one with a genetic algorithm. Combining an adaptable algorithm using negative values with a pool-manipulating strategy was the deciding factor in the end.

With the prize of the challenge being a SUPER MARIO LEGO set, I came home with both a nice t-shirt and something fun to build at home. The real prize, however, was the thrill of the competition and the overall experience. Even with all the tutorials and great talks on both conference days, participating in this challenge was the biggest highlight of PyData for me.


Lessons learned

This was my first time participating in a competition like this, and it absolutely left me wanting more. The moral of this story is:

If you want to win, fix the game and use insider information to manipulate the market for your own gains.


All jokes aside, what I learned from this is that jumping in and building something straight away is a perfectly fine way to start. Re-evaluating your strategy, taking the time to study the data, and running experiments to gain even more insight after jumping in is even more important.

Another thing is that thinking about what should be good and what should be good enough is crucial. I sometimes tend to over-engineer things, but setting up a proper web-scraper that scaled and was reusable saved me a lot of time. Doing things systematically, like finding the possible ranges for the boxes, also takes a bit more time to set up, but saves time in the end. Without investing in these things, the script would not have been as complete as it was within a single day.

Having a streamlined IDE with Copilot, Ruff, IntelliCode, and Ollama/Continue configured beforehand was also a dream to work with: no more type errors in NumPy, and autocomplete on steroids with Copilot and Continue. I was pleasantly surprised to get Selenium working faster than a few years ago, when I used it without all these tools.

About PyData

A few months ago I decided to join PyData Amsterdam this year to brush up on my knowledge of the latest and greatest machine learning techniques and Python libraries. I usually don't write any code in my job, but I used to have a lot of fun writing Python for passion projects, analysis work, and automating boring stuff. Last month I started working on an internal project that actually requires me to do some data engineering, time series forecasting, and mathematical optimization in Python. All the more reason to look forward to 3 days of PyData.

PyData Amsterdam2 2024 is a 3-day event for the international community of users and developers of data analysis tools to share ideas and learn from each other. The global PyData network promotes discussion of best practices, new approaches, and emerging technologies for data management, processing, analytics, and visualization. PyData communities approach data science using many languages, including (but not limited to) Python, Julia, and R.

Over the span of 3 days, attendees will have the opportunity to participate in workshops, attend live keynote sessions and talks, as well as get to know fellow members of the PyData Community.

Can't wait to join PyData again next year, and thanks again to Vincent Warmerdam at :probabl.3 for setting up this challenge. In the meantime some Kaggle competitions will have to do, and if you are reading this and have some tips for me, do share!

  1. https://fivethirtyeight.com/features/can-you-rule-riddler-nation/ ↩︎
  2. https://amsterdam.pydata.org/ ↩︎
  3. https://probabl.ai/ ↩︎

If you want me to share the notebook, please send me a message.
