In the past couple of months, our team has been working hard to build a migration engine in Python for backing up drives using encrypted mounts. Recently, I was asked a very interesting question about Python:
“I often find .pyc files being generated after I run Python scripts. What are .pyc files? Why are precompiled files generated when Python is an interpreter? Can the .pyc files be deleted?”
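Python is usually described as an interpreted language, but that is only part of the story. CPython, the reference implementation, first compiles Python source code into bytecode, which is the internal representation of a Python program, and then runs that bytecode on a bytecode interpreter. The bytecode of imported modules is also cached in .pyc files (older versions additionally wrote .pyo files for optimized bytecode; since Python 3.5 only .pyc is used), so that loading the same module is faster the second time, because recompilation from source to bytecode can be avoided. The .pyc files can be deleted safely: Python regenerates them from the source the next time the module is imported.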
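To see the cache in action, here is a minimal sketch that byte-compiles a throwaway module with the standard `py_compile` module, deletes the resulting .pyc file, and builds it again. The module name `demo_module.py` and the helper `compile_to_cache` are made up for illustration; the exact cache file name depends on your Python version.

```python
import importlib.util
import os
import py_compile
import tempfile


def compile_to_cache(source_text):
    """Write source_text to a temporary module and byte-compile it.

    Returns (path of the .py file, path of the cached bytecode file).
    """
    workdir = tempfile.mkdtemp()
    # "demo_module.py" is a hypothetical name used only for this example.
    source_path = os.path.join(workdir, "demo_module.py")
    with open(source_path, "w") as handle:
        handle.write(source_text)
    # py_compile.compile returns the path of the bytecode file it wrote.
    cached_path = py_compile.compile(source_path)
    return source_path, cached_path


source_path, cached_path = compile_to_cache("answer = 6 * 7\n")
print(cached_path)
# The cache location is the one the import system would use for this source file.
print(cached_path == importlib.util.cache_from_source(source_path))

# Deleting the cache is harmless: it can always be rebuilt from the source.
os.remove(cached_path)
rebuilt_path = py_compile.compile(source_path)
print(os.path.exists(rebuilt_path))
```

In normal use you never call `py_compile` yourself; the import system does the equivalent work the first time a module is imported.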
The Python interpreter actually has the structure of a classic compiler. When you invoke the “python” command, the raw source code is scanned for tokens, the tokens are parsed into a tree representing the logical structure of the program, and the tree is transformed into bytecode. Finally, the bytecode is executed by the virtual machine. The details of the process are shown in the figure below:
A tokenizer breaks input text into a series of tokens. The process of transforming the input text into tokens is known as “lexical analysis” or “tokenizing”. Tokens are identified based on the specific rules of the lexer. Methods used to identify tokens include regular expressions, specific sequences of characters known as flags, and specific separating characters called delimiters. Special characters, including punctuation characters, are commonly used by lexers to identify tokens because of their natural use in written and programming languages.
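The standard library ships a pure-Python tokenizer in the `tokenize` module, which is handy for looking at a token stream yourself. It is not the tokenizer CPython runs internally, but it follows the same lexical rules. The helper name `token_summary` and the sample statement are just for illustration.

```python
import io
import tokenize


def token_summary(source_text):
    """Return (token type name, token text) pairs for a piece of Python source."""
    readline = io.StringIO(source_text).readline
    return [
        (tokenize.tok_name[tok.type], tok.string)
        for tok in tokenize.generate_tokens(readline)
    ]


# Print one token per line: the token category followed by its text.
for kind, text in token_summary("total = price * 2\n"):
    print(f"{kind:10} {text!r}")
```

The statement is split into names, operators and a number, followed by bookkeeping tokens that mark the end of the line and the end of the input.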
In computing, a parser is the component of an interpreter or compiler that takes input text and builds a data structure, such as a parse tree, that gives a structural representation of the input, checking for correct syntax in the process. Python’s parser takes a token stream as input and, based on the rules declared in the Python grammar, produces an Abstract Syntax Tree (AST).
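The `ast` module exposes this stage directly, so a small sketch can show both the tree and the syntax check. The helper names `parse_to_tree` and `has_valid_syntax` are made up for this example, and the exact text printed by `ast.dump` varies between Python versions.

```python
import ast


def parse_to_tree(source_text):
    """Parse Python source into an AST; raises SyntaxError on invalid input."""
    return ast.parse(source_text)


def has_valid_syntax(source_text):
    """Return True if the source parses, False if the parser rejects it."""
    try:
        ast.parse(source_text)
    except SyntaxError:
        return False
    return True


tree = parse_to_tree("total = price * 2\n")
print(ast.dump(tree))

# The module body holds one assignment whose value is a binary operation.
assignment = tree.body[0]
print(type(assignment).__name__, type(assignment.value).__name__)

print(has_valid_syntax("total = price * 2\n"))
print(has_valid_syntax("total = = 2\n"))
```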
The next phase of compilation is code generation, which takes the AST constructed in the previous phase and produces a PyCodeObject as output. A PyCodeObject is an independent unit of executable code, containing all the data and code necessary for independent execution by the Python bytecode interpreter.
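From Python code, a PyCodeObject shows up as an ordinary `code` object, and the built-in `compile` function accepts an AST as input. The following is a rough sketch of that step; the file name `<example>` and the sample statement are placeholders, not anything CPython requires.

```python
import ast

# Build the AST first, then hand it to the code generator.
tree = ast.parse("total = price * 2\n")
code_object = compile(tree, filename="<example>", mode="exec")

print(type(code_object).__name__)
print(code_object.co_names)   # names referenced by the code
print(code_object.co_consts)  # constants used by the code
print(code_object.co_code)    # the raw bytecode, as bytes

# A code object is self-contained: give it a namespace and it can run.
namespace = {"price": 21}
exec(code_object, namespace)
print(namespace["total"])  # 42
```

The raw bytes in `co_code` differ between Python versions, which is one reason .pyc files are tied to the version that produced them.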
The execution of Python bytecode is handled by the bytecode interpreter. Python’s bytecode interpreter is a stack-based virtual machine. Executing bytecode manipulates a data stack: instructions push objects onto the stack and pop them off again.
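The `dis` module disassembles bytecode into readable instructions, which makes the stack behaviour visible. In this small sketch, `add_and_double` is a made-up function. No expected listing is shown because instruction names change between Python versions (for example, the addition is `BINARY_ADD` in older releases and `BINARY_OP` in newer ones).

```python
import dis


def add_and_double(a, b):
    return (a + b) * 2


# Print a human-readable listing of the function's bytecode.
dis.dis(add_and_double)

# The same instructions are available as objects for programmatic inspection.
opnames = [instruction.opname for instruction in dis.get_instructions(add_and_double)]
print(opnames)
```

Reading the listing top to bottom, the load instructions push the arguments onto the stack, the binary operations pop two values and push the result, and the return instruction pops the final value and hands it back to the caller.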