Python: A Google Translate service using Playwright

Maarten Smeets

There are many use cases for automating web-browser actions, for example automating tedious repetitive tasks or running automated tests of front-end applications. Several tools are available for this, such as Selenium, Cypress and Puppeteer. Several blog posts (such as here) and a workshop by Lucas Jellema piqued my interest in Playwright, so I decided to give it a try. I’m not a great fan of JavaScript, so I went with Python for this one. I also took a quick look at performance using wrk (an HTTP benchmarking tool) and was not disappointed.

Introduction

Playwright

See here for a nice comparison of Selenium, Cypress, Puppeteer and Playwright. Microsoft Playwright was created by the same people who created Google Puppeteer and is relatively new. Playwright communicates bidirectionally with the browser, which allows events in the browser to trigger events in your scripts (see here). This can also be done with something like Selenium, but it is more difficult (see here; I suspect polling is usually involved). Playwright provides extensive options for waiting for things to happen and, since the interaction is bidirectional, I suspect it does not need polling, which should make scripts both faster and more robust.


I used Playwright for Python. The Node.js implementation is more mature. On Python, before the first official non-alpha release, there might still be some breaking changes in the API, so the sample provided here might stop working in the future. Since the JavaScript and Python APIs are quite similar, I do not expect major changes though.

Python

Python 3.4 introduced the asyncio module, and since Python 3.5 you can use the keywords async and await. Because I was using async libraries and wanted to wrap my script in a web service, I decided to look at which web service frameworks are available for Python. I stumbled on this comparison. Based on its async capabilities, simple syntax, good performance and (claimed, I did not check) popularity, I decided to go with Sanic.
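As a minimal illustration of async and await, here is a sketch using only the standard library. The function names and delays are made up for this example, and asyncio.run requires Python 3.7 or later. Two coroutines that each wait on (simulated) I/O run concurrently on a single thread:

```python
import asyncio


async def fetch(name: str, delay: float) -> str:
    # Stand-in for slow I/O such as a browser action; await hands control
    # back to the event loop while this coroutine is waiting.
    await asyncio.sleep(delay)
    return f"{name} done"


async def main() -> list:
    # gather() runs both coroutines concurrently and returns the results
    # in the order the coroutines were passed in.
    return await asyncio.gather(fetch("first", 0.2), fetch("second", 0.2))


results = asyncio.run(main())
print(results)
```

Both waits overlap, so the whole run takes roughly as long as one wait rather than the sum of the two. This is the property that lets a single-process async web service handle several slow browser sessions at once.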

Sanic Framework

Another reason for me to go with Python is that it is very popular in the AI/ML area. If in the future I want to scrape a lot of data from websites and do smart things with that data, I can stay within Python and do not have to mix several languages.

Google Translate API

Of course Google provides a commercial translate API. If you are thinking about seriously implementing a translation service, definitely go with that one, since it is made to do translations and comes with SLAs. I decided to create a little Python REST service that uses the Google Translate website (a website scraper). For me this was a tryout of Playwright, so in this example I did not care about support, performance or SLAs. If you overuse the site, you will get CAPTCHAs, and I did not automate those away.

Getting things ready

I started out with a clean Ubuntu 20.04 environment. First I tried JupyterLab, but Playwright and JupyterLab did not seem to play well together, probably because JupyterLab itself also relies heavily on a browser. I decided to go with PyCharm instead. Among other things, PyCharm has code completion features which I like, and its interface is similar to IntelliJ and DataGrip, which I also use.

 #Python 3 was already installed. pip wasn't yet  
 sudo apt-get install python3-pip  
 sudo pip3 install playwright  
 sudo pip3 install lxml  
 #Webservice framework  
 sudo pip3 install sanic  
 sudo apt-get install libenchant1c2a  
 #This installs the browsers which are used by Playwright  
 python3 -m playwright install  
 #PyCharm  
 sudo snap install pycharm-community --classic  

The Google Translate API scraper


from playwright import async_playwright
from sanic import Sanic
from sanic import response

app = Sanic(name='Translate application')

@app.route("/translate")
async def doTranslate(request):
    async with async_playwright() as p:
        sl = request.args.get('sl')
        tl = request.args.get('tl')
        translate = request.args.get('translate')
        browser = await p.chromium.launch()  # headless=False
        context = await browser.newContext()
        page = await context.newPage()
        await page.goto('https://translate.google.com/?sl='+sl+'&tl='+tl+'&op=translate')
        textarea = await page.waitForSelector('//textarea')
        await textarea.fill(translate)
        waitforthis = await page.waitForSelector('div.Dwvecf',state='attached')
        result = await page.querySelector('span.VIiyi >> ../span/span/span')
        textresult = await result.textContent()
        await browser.close()
        return response.json({'translation':textresult})

if __name__ == '__main__':
    app.run(host="0.0.0.0", port=5000)

You can also find the code (more nicely highlighted) here.

How does it work?

The app.run command at the end starts an HTTP server on port 5000. @app.route makes the function doTranslate available at /translate. The async function doTranslate has a single parameter, the request, from which it reads the GET arguments: the source language (sl), the target language (tl) and the text to translate (translate). Next, a Chromium browser is started in headless mode. Headless is the default for Playwright, but during development it helps to disable headless mode so you can see what happens in the browser. The Google Translate site is opened and Playwright waits until a textarea appears, then fills it with the text to be translated. Playwright then waits for the translation to appear (the box ‘Translations of auto’ in the screenshot below). After the box has appeared, the result is selected: first I select span.VIiyi (a CSS selector) and within that span I select ../span/span/span (an XPath selector). Once the text of the result has been read, I close the browser and return the translation as JSON.
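The service concatenates sl and tl straight into the URL, so a missing argument (request.args.get then returns None) fails inside the string concatenation. One way to validate the arguments and build the URL could look like the sketch below. The helper name build_translate_url is hypothetical and not part of the original script:

```python
from urllib.parse import urlencode

GOOGLE_TRANSLATE = "https://translate.google.com/"


def build_translate_url(sl, tl):
    """Return the Google Translate page URL for a source and target language."""
    if not sl or not tl:
        # A missing GET argument arrives here as None.
        raise ValueError("both 'sl' and 'tl' are required")
    # urlencode escapes characters that are unsafe in a query string.
    query = urlencode({"sl": sl, "tl": tl, "op": "translate"})
    return GOOGLE_TRANSLATE + "?" + query


print(build_translate_url("nl", "en"))
# prints https://translate.google.com/?sl=nl&tl=en&op=translate
```

Inside the handler, the ValueError could be caught and turned into an HTTP 400 response instead of letting the request fail with a server error.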

Playwright at work to automate the Google Translate site

When you start the service, you will get something like

 [2021-01-24 09:16:38 +0100] [3278] [INFO] Goin' Fast @ http://0.0.0.0:5000  
 [2021-01-24 09:16:38 +0100] [3278] [INFO] Starting worker [3278]  

Next you can test the service. You can do this with 

 curl 'http://localhost:5000/translate?sl=nl&tl=en&translate=auto'  

It will translate the Dutch (sl=nl) ‘auto’ (translate=auto) to English (tl=en). The result will be car.

 {"translation":"car"}  
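The same call can be made from Python with only the standard library. This is a sketch: the function names are made up for illustration, and translate() only works while the service is running on localhost:5000. Unlike a URL typed by hand, urlencode also takes care of text containing spaces or non-ASCII characters:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def service_url(text, sl, tl, base="http://localhost:5000/translate"):
    # Encode the GET arguments so multi-word text is safe in the query string.
    return base + "?" + urlencode({"sl": sl, "tl": tl, "translate": text})


def parse_translation(body):
    # The service answers with JSON such as {"translation": "car"}.
    return json.loads(body)["translation"]


def translate(text, sl, tl, timeout=30):
    # Performs the HTTP GET; requires the service to be running locally.
    with urlopen(service_url(text, sl, tl), timeout=timeout) as resp:
        return parse_translation(resp.read().decode("utf-8"))


print(service_url("auto", "nl", "en"))
print(parse_translation('{"translation":"car"}'))
```

With the service started, translate("auto", "nl", "en") sends the same request as the curl command above.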

Performance

Of course performance tests come with a big disclaimer. I tried this on specific hardware, on a specific OS, etc. You will not be able to exactly reproduce these results. Also the test I did was relatively simple just to get an impression. I additionally logged the responses since the script I used did not include proper exception handling.

I ran wrk, an HTTP benchmarking tool, against the service. If you’re interested in automatically parsing wrk results in Python, read this.

 wrk -c2 -t2 -d30s --timeout=30s 'http://localhost:5000/translate?sl=nl&tl=en&translate=auto'

With this command line, wrk used 2 threads and kept 2 connections open at the same time (2 concurrent requests). I set the timeout per request to 30s; this timeout was never reached.

This gave me an average of about 2.5s per request. At 4 concurrent requests this became about 3s on average, and at 8 concurrent requests about 4.5s on average, with some errors starting to appear. A ‘normal’ API can of course do much better. I was a bit surprised, though, that Playwright could handle 8 headless browsers simultaneously on an 8 GB Ubuntu VM which also had PyCharm open at the same time (knowing what a memory hog Chrome can be). I was also surprised that Google didn’t start bothering me with CAPTCHAs yet.
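To put these latencies in perspective, the throughput can be estimated from them. This is a back-of-the-envelope sketch, not a measurement: it assumes every connection is kept continuously busy (which is how wrk drives load) and it ignores the failed requests at 8 connections:

```python
# (concurrent requests, average seconds per request) as reported above
measurements = [(2, 2.5), (4, 3.0), (8, 4.5)]


def estimated_throughput(concurrency, avg_latency_s):
    # Each busy connection completes 1 / avg_latency_s requests per second,
    # so the total is concurrency / avg_latency_s (Little's law).
    return concurrency / avg_latency_s


for concurrency, latency in measurements:
    rate = estimated_throughput(concurrency, latency)
    print(f"{concurrency} concurrent: ~{rate:.2f} requests/s")
```

Under these assumptions, going from 2 to 8 concurrent requests raises the estimated throughput from about 0.8 to about 1.8 requests per second. Four times the concurrency buys only a little over twice the throughput.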

Maybe this could have been done more efficiently in the script by using different tabs in the same browser or maybe even the same tab. I didn’t try that yet.
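A sketch of what sharing one browser could look like is given below. It is untested against Playwright itself, so it is written against a stand-in browser. The class SharedBrowser, the fake classes and max_pages are all made up for illustration. With recent Playwright for Python releases, launch would be something like p.chromium.launch and new_page() would open a tab; the pre-release API used in this post spells it newPage:

```python
import asyncio


class SharedBrowser:
    """Launch one browser lazily and share it between requests."""

    def __init__(self, launch, max_pages=4):
        self._launch = launch        # coroutine function that starts a browser
        self._browser = None
        self._lock = asyncio.Lock()  # prevents two requests launching at once
        self._slots = asyncio.Semaphore(max_pages)  # caps the open tabs

    async def get(self):
        async with self._lock:
            if self._browser is None:
                self._browser = await self._launch()
        return self._browser

    async def with_page(self, work):
        # Open a tab, run work(page), and always close the tab afterwards.
        async with self._slots:
            browser = await self.get()
            page = await browser.new_page()
            try:
                return await work(page)
            finally:
                await page.close()


# Stand-ins so the sketch runs without a real browser (hypothetical classes).
class FakePage:
    async def close(self):
        pass


class FakeBrowser:
    async def new_page(self):
        return FakePage()


launches = 0


async def fake_launch():
    global launches
    launches += 1
    await asyncio.sleep(0.01)  # pretend that starting a browser takes time
    return FakeBrowser()


async def demo():
    # Created inside the running event loop so the lock binds to that loop.
    shared = SharedBrowser(fake_launch, max_pages=2)

    async def work(page):
        await asyncio.sleep(0.01)  # pretend to translate something
        return "ok"

    return await asyncio.gather(*(shared.with_page(work) for _ in range(5)))


results = asyncio.run(demo())
print(results, "launches:", launches)
```

Five concurrent requests share a single launch, and the semaphore keeps at most two tabs open at a time. Whether this actually outperforms one browser per request on the Google Translate site is something I would still have to measure.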

Tips for development

Automatically generate scripts

You can use Playwright to automatically generate scripts from manual browser interactions, which can be a good starting point for a script. See for example here. The generated scripts, however, do not contain smart logic such as waiting for certain elements in the page to appear before selecting text from other elements.

Use browser developer tools

For web-developers, this is of course a given. In order to automate a browser using Playwright, you need to select elements in the web-page. An easy way to do this is by looking at the developer tools which are present in most web-browsers nowadays. Here you can easily browse the DOM (document object model, the model on which the browser bases what it shows you). This allows you to find specific elements of interest to manipulate using Playwright.

Chrome Developer Tools

Approximate human behavior

One of the dangers of creating screen scrapers is that if the site changes, your code might not work anymore. In my sample script I used specific identifiers like Dwvecf and VIiyi and queried for elements based on the site’s DOM. When you look at a website yourself, you select elements in a more visual way. The better you can approximate the way a human would interact with a website, the more stable your script will be. For example, selecting the first textarea on the site is more stable than expecting the result to be in span.VIiyi and in that element under span/span/span.
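The difference can be illustrated without a browser. The sketch below uses two made-up versions of a page in which only a generated class name changes (Xk92b is invented for this example). A lookup by tag keeps working, while a lookup by class name breaks:

```python
from html.parser import HTMLParser


class Finder(HTMLParser):
    """Record whether a tag (optionally with a given class) occurs in the HTML."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class
        self.found = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (self.css_class is None or self.css_class in classes):
            self.found = True


def matches(html, tag, css_class=None):
    finder = Finder(tag, css_class)
    finder.feed(html)
    return finder.found


# Two hypothetical versions of the page: only the generated class name differs.
before = '<textarea></textarea><span class="VIiyi">car</span>'
after = '<textarea></textarea><span class="Xk92b">car</span>'

for label, html in (("before", before), ("after", after)):
    print(label, matches(html, "textarea"), matches(html, "span", "VIiyi"))
```

The textarea lookup succeeds on both versions. The class-based lookup succeeds only on the first, which is the kind of breakage a site redesign or rebuild can cause.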

The right tool for the job

If an API is available and you can use it directly, that is of course preferable to a website scraper, since an API is made for automated interaction and a website is made for human interaction. You usually get much better performance and stability from an official API. Playwright includes an API to monitor and modify the HTTP and HTTPS requests made by the browser. This might help you identify back-end APIs, so you can check whether you can call them directly.

When you use a tool like Playwright to automate browser interaction in, for example, tests for custom-developed applications, you can get better stability, since you know what will change when the site is updated and can make sure the automation scripts keep working. When you use Playwright against an external website, you have less control. Google will not inform me when they change https://translate.google.com, and this script will most likely break because of it.
