## Summary
This PR removes the `Expr::Invalid` variant from the AST. Instead, we'll
try to retain as much valid information as possible and use an empty
`Expr::Name` with `ExprContext::Invalid` as a replacement.
## Test Plan
- [x] All tests pass
- [x] No performance regression
## Summary
This PR updates the string flags to include an `Invalid` variant for any
string nodes that the parser deems invalid. This avoids dropping the
nodes; they are marked as invalid instead. The nodes will be empty
for now, but we can discuss whether to keep the raw source text. It's
not strictly required because the range can be used to recover the
same text.
It also adds a new `handle_implicitly_concatenated_strings` method, which
is similar to the existing `concatenated_strings` function. The reason to
have a separate method is to avoid dropping all strings if there's an
error, namely the concatenation of bytes and non-bytes literals. Now,
we need to decide which strings to retain. Currently, I've
kept it simple: a bytes literal is retained _only_ if all of the parts
are bytes; otherwise we'll have a string / f-string with invalid nodes
instead of a bytes literal.
This removes the need for having a `StringType::Invalid` variant.
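The retention rule described above can be sketched roughly as follows. This is an illustrative Python model rather than the parser's Rust code: `Part`, `concatenate`, and the string labels are hypothetical names, and flagging the bytes parts as invalid is an assumption about how "invalid nodes" are represented.

```python
from dataclasses import dataclass


@dataclass
class Part:
    kind: str  # "bytes", "str", or "fstring" (hypothetical labels)
    invalid: bool = False


def concatenate(parts):
    """Return (result_kind, parts) for a run of implicitly concatenated literals."""
    kinds = {part.kind for part in parts}
    if kinds == {"bytes"}:
        # Every literal is a bytes literal: keep a bytes literal as the result.
        return "bytes", list(parts)
    # Mixed or non-bytes input: the result is a string (or an f-string if any
    # part is one). Bytes parts are kept but flagged as invalid, not dropped.
    result_kind = "fstring" if "fstring" in kinds else "str"
    marked = [Part(part.kind, invalid=(part.kind == "bytes")) for part in parts]
    return result_kind, marked


kind, parts = concatenate([Part("str"), Part("bytes")])
print(kind, [part.invalid for part in parts])
```

The point of the sketch is only that no part is discarded: a mixed run still yields one node per literal, with the offending ones marked.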
## Test Plan
- [x] Existing test cases pass
- [x] No performance regression
This PR merges the different string parsing functions into a single
entry point function.
Previously there were two entry points, one for string or byte literal
and the other for f-strings. The reason for this separation is that our
old parser raised a hard syntax error if an f-string was used as a
pattern literal. However, it's actually a soft syntax error, as is evident
from the CPython parser, which raises it at runtime.
This function basically implements the following grammar:
```
strings: (string|fstring)+
```
It delegates to the list parsing logic for better error recovery.
- [x] All tests pass
- [x] No performance regression
This PR does the following around f-string parsing:
1. Removes the `FStringElement::Invalid` variant
2. Moves the parsing of f-string elements to use list parsing logic
3. Adds error recovery for f-string elements
- [x] All tests pass
- [x] No performance regression
## Summary
This PR removes the `skip_until` parser method. The main use case for it
was for error recovery which we want to isolate only in list parsing.
There are two references which are removed:
1. Parsing a list of match arguments in a class pattern. Take the
following code snippet as an example:
```python
match foo:
    case Foo(bar.z=1, baz):
        pass
```
This is a syntax error as the keyword argument pattern can only have an
identifier but here it's an attribute node. Now, to move on to the next
argument (`baz`), the parser would skip until the end of the argument to
recover. What we will do now is to parse the value as a pattern (per
spec) thus moving the parser ahead and add the node with an empty
identifier.
The above code will produce the following AST:
<details><summary><b>AST</b></summary>
<p>
```rs
Module(
    ModModule {
        range: 0..52,
        body: [
            Match(
                StmtMatch {
                    range: 0..51,
                    subject: Name(
                        ExprName {
                            range: 6..9,
                            id: "foo",
                            ctx: Load,
                        },
                    ),
                    cases: [
                        MatchCase {
                            range: 15..51,
                            pattern: MatchClass(
                                PatternMatchClass {
                                    range: 20..37,
                                    cls: Name(
                                        ExprName {
                                            range: 20..23,
                                            id: "Foo",
                                            ctx: Load,
                                        },
                                    ),
                                    arguments: PatternArguments {
                                        range: 24..37,
                                        patterns: [
                                            MatchAs(
                                                PatternMatchAs {
                                                    range: 33..36,
                                                    pattern: None,
                                                    name: Some(
                                                        Identifier {
                                                            id: "baz",
                                                            range: 33..36,
                                                        },
                                                    ),
                                                },
                                            ),
                                        ],
                                        keywords: [
                                            PatternKeyword {
                                                range: 24..31,
                                                attr: Identifier {
                                                    id: "",
                                                    range: 31..31,
                                                },
                                                pattern: MatchValue(
                                                    PatternMatchValue {
                                                        range: 30..31,
                                                        value: NumberLiteral(
                                                            ExprNumberLiteral {
                                                                range: 30..31,
                                                                value: Int(
                                                                    1,
                                                                ),
                                                            },
                                                        ),
                                                    },
                                                ),
                                            },
                                        ],
                                    },
                                },
                            ),
                            guard: None,
                            body: [
                                Pass(
                                    StmtPass {
                                        range: 47..51,
                                    },
                                ),
                            ],
                        },
                    ],
                },
            ),
        ],
    },
)
```
</p>
</details>
2. Parsing a list of parameters. Here, our list parsing method makes
sure to only call the parse element function when it's a valid list
element. A parameter can start either with a `Star`, `DoubleStar`, or
`Name` token which corresponds to the 3 `if` conditions. Thus, the
`else` block is not required as the list parsing will recover without
it.
## Summary
This PR improves error reporting around assignment nodes, mainly
the following:
1. Rename parse error variants:
   a. `AssignmentError` -> `InvalidAssignmentTarget`
   b. `NamedAssignmentError` -> `InvalidNamedAssignmentTarget`
   c. `AugAssignmentError` -> `InvalidAugmentedAssignmentTarget`
2. Add `InvalidDeleteTarget` for invalid `del` targets
   a. Add a helper function to check whether it's a valid delete target,
   similar to the other target check functions.
3. Fix: a named assignment target can only be a `Name` node
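For reference, each of the target kinds listed above has an invalid form that CPython itself rejects. The snippet below is a small sketch that checks this with the standard `ast` module (assuming Python 3.8 or later); the `rejects` helper and the sample sources are illustrative and not taken from the PR's test suite.

```python
import ast

# One invalid target per statement kind; the sources are made up for illustration.
INVALID_TARGETS = {
    "assignment": "x + 1 = 2",
    "named assignment": "(x.y := 1)",
    "augmented assignment": "(a, b) += 1",
    "delete": "del 1",
}


def rejects(source):
    """Return True if CPython refuses to parse `source`."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return False


for label, source in INVALID_TARGETS.items():
    print(f"{label}: {source!r} rejected={rejects(source)}")
```

Note that `x.y = 1` and `del x.y` are valid, while `(x.y := 1)` is not: a named expression accepts only a plain name as its target.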
## Test Plan
Various test cases locally. As mentioned in my previous PR, I want to
keep the testing part separate.
## Summary
This PR removes the deprecated parsing list functions and updates the
references to use the new functions.
There are now 4 functions to accommodate this pattern. They are divided
into 2 groups: one to parse a sequence of elements and the other to
parse a sequence of elements _separated_ by a comma. In each of the
groups, there are 2 functions: one collects and returns all the parsed
elements as a vector and the other delegates the collection part to the
user. This separation is achieved by using `Fn` and `FnMut` to allow
mutation in the latter case.
The error recovery context has been updated to accommodate the new
sequence kind. Currently, the terminator token kinds only contain the
necessary token to end the list and not necessarily the ones which might
help in error recovery. This will be updated as I go through the testing
phase. This phase is basically coming up with a bunch of invalid
programs to check how the parser behaves and how we can help in the
recovery phase.
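The shape of the four functions can be sketched as below. This is a simplified Python model over a flat token list, not the Rust implementation: the function names, the callback signature, and the recovery strategy (report and skip tokens that can neither start an element nor end the list) are all assumptions made for illustration. The callback-taking variants correspond to the case where the caller does the collecting.

```python
def parse_list(tokens, pos, is_element, is_terminator, on_element):
    """Parse a sequence of elements, handing each one to `on_element`."""
    errors = []
    while pos < len(tokens) and not is_terminator(tokens[pos]):
        if is_element(tokens[pos]):
            on_element(tokens[pos])
        else:
            errors.append(f"unexpected token {tokens[pos]!r}")
        pos += 1
    return pos, errors


def parse_list_into_vec(tokens, pos, is_element, is_terminator):
    """Like `parse_list`, but collect and return the elements."""
    elements = []
    pos, errors = parse_list(tokens, pos, is_element, is_terminator, elements.append)
    return elements, pos, errors


def parse_comma_separated_list(tokens, pos, is_element, is_terminator, on_element):
    """Parse elements separated by commas, handing each one to `on_element`."""
    errors = []
    while pos < len(tokens) and not is_terminator(tokens[pos]):
        if is_element(tokens[pos]):
            on_element(tokens[pos])
            pos += 1
            if pos < len(tokens) and tokens[pos] == ",":
                pos += 1  # consume the separator
            elif pos < len(tokens) and not is_terminator(tokens[pos]):
                errors.append(f"expected ',' before {tokens[pos]!r}")
        else:
            # Recovery: report the token and skip it, then keep parsing.
            errors.append(f"unexpected token {tokens[pos]!r}")
            pos += 1
    return pos, errors


def parse_comma_separated_list_into_vec(tokens, pos, is_element, is_terminator):
    """Like `parse_comma_separated_list`, but collect and return the elements."""
    elements = []
    pos, errors = parse_comma_separated_list(
        tokens, pos, is_element, is_terminator, elements.append
    )
    return elements, pos, errors


tokens = ["a", ",", "?", "b", "c", ")"]
elements, pos, errors = parse_comma_separated_list_into_vec(
    tokens, 0, str.isidentifier, lambda token: token == ")"
)
print(elements, pos, errors)
```

In this model the terminator check is what stops the loop, which is why the choice of terminator tokens matters for how far recovery can go.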
## Test Plan
Currently, my plan is to keep the testing part separate from the actual
update. This doesn't mean I'm not testing locally, but it's not
thorough. The main reason is to keep the diffs minimal; writing
test cases will require some effort, which I want to decouple from the
actual change. This is OK here as it's not getting merged into `main`
but into the parser PR.
Small quality-of-life improvement to rename the following methods:
1. `current_kind` -> `current_token_kind`
2. `current_range` -> `current_token_range`
It's a PR for visibility.
## Summary
This PR updates the fields in `Program` struct to be private and exposes
methods to get the values. The motivation behind this is to encapsulate
the internal representation of the parsed program which we could alter
in the future.
## Summary
This PR fixes one of the `FIXME` comments by asserting that the
parser is at one of the possible augmented assignment tokens when parsing
an augmented assignment statement.
## Test Plan
1. Add valid test cases for all the possible augmented assignment tokens
2. Add invalid test cases similar to assignment statement
## Summary
I used `codespell` and `gramma` to identify misspellings and grammar
errors throughout the codebase and fixed them. I tried not to make any
controversial changes, but feel free to revert as you see fit.
## Summary
Fix #10282
This PR updates the Python grammar to include the `*` character in
`*args` and `**kwargs` in the range of the `Parameter`.
```
def f(*args, **kwargs): pass
#      ~~~~    ~~~~~~  <-- range before the PR
#     ^^^^^  ^^^^^^^^  <-- range after
```
The invalid syntax `def f(*, **kwargs): ...` is also now correctly
reported.
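For comparison, CPython's own `ast` module reports the `arg` nodes for `*args` and `**kwargs` without the star prefix, and it also rejects a bare `*` followed directly by `**kwargs`. The following is a small check, assuming CPython 3.8 or later; the `span` and `rejects` helpers exist only for this illustration.

```python
import ast

SOURCE = "def f(*args, **kwargs): pass"

func = ast.parse(SOURCE).body[0]
vararg, kwarg = func.args.vararg, func.args.kwarg


def span(node):
    """Return the (start, end) column offsets of a node on its line."""
    return node.col_offset, node.end_col_offset


def rejects(source):
    """Return True if CPython refuses to parse `source`."""
    try:
        ast.parse(source)
    except SyntaxError:
        return True
    return False


# Slicing the source with each span shows exactly what the range covers.
for node in (vararg, kwarg):
    start, end = span(node)
    print(span(node), repr(SOURCE[start:end]))

print(rejects("def f(*, **kwargs): ..."))
```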
## Test Plan
Test cases were added to `function.rs`.
This PR modifies our AST so that nodes for string literals, bytes literals and f-strings all retain the following information:
- The quoting style used (double or single quotes)
- Whether the string is triple-quoted or not
- Whether the string is raw or not
This PR is a followup to #10256. Like with that PR, this PR does not, in itself, fix any bugs. However, it means that we will have the necessary information to preserve quoting style and rawness of strings in the `ExprGenerator` in a followup PR, which will allow us to provide a fix for https://github.com/astral-sh/ruff/issues/7799.
The information is recorded on the AST nodes using a bitflag field on each node, similarly to how we recorded the information on `Tok::String`, `Tok::FStringStart` and `Tok::FStringMiddle` tokens in #10298. Rather than reusing the bitflag I used for the tokens, however, I decided to create a custom bitflag for each AST node.
Using different bitflags for each node allows us to make invalid states unrepresentable: it is valid to set a `u` prefix on a string literal, but not on a bytes literal or an f-string. It also allows us to have better debug representations for each AST node modified in this PR.
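The idea of one flag type per node can be illustrated with Python's `enum.Flag`. This is only a sketch of the design, not the Rust definitions: the class and member names below are hypothetical stand-ins.

```python
from enum import Flag, auto


class StringLiteralFlags(Flag):
    """Flags for a plain string literal (hypothetical names)."""

    DOUBLE = auto()
    TRIPLE_QUOTED = auto()
    U_PREFIX = auto()
    R_PREFIX = auto()


class BytesLiteralFlags(Flag):
    """Flags for a bytes literal: there is deliberately no U_PREFIX member."""

    DOUBLE = auto()
    TRIPLE_QUOTED = auto()
    R_PREFIX = auto()


# A raw, triple-quoted bytes literal using single quotes, e.g. rb'''...'''
raw_triple = BytesLiteralFlags.TRIPLE_QUOTED | BytesLiteralFlags.R_PREFIX

print(BytesLiteralFlags.R_PREFIX in raw_triple)
print(BytesLiteralFlags.DOUBLE in raw_triple)
print(hasattr(BytesLiteralFlags, "U_PREFIX"))
```

Because the two flag types are distinct, a `u` prefix simply cannot be expressed on a bytes literal, and flags from different node types cannot be combined by accident.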
The expression types in our AST are called `ExprYield`, `ExprAwait`,
`ExprStringLiteral` etc., except `ExprNamedExpr`, `ExprIfExpr` and
`ExprGeneratorExpr`. This seems to align with [Python AST's
naming](https://docs.python.org/3/library/ast.html) but feels
inconsistent and excessive.
This PR removes the `Expr` suffix from `ExprNamedExpr`, `ExprIfExpr`,
and `ExprGeneratorExpr`.
## Summary
This PR fixes the `DebugText` implementation to use the expression range
instead of the parenthesized range.
Taking the following code snippet as an example:
```python
x = 1
print(f"{ ( x ) = }")
```
The output of running it would be:
```
( x ) = 1
```
Notice that the whitespace between the parentheses and the expression is
preserved as is.
Currently, we don't preserve this information in the AST which defeats
the purpose of `DebugText` as the main purpose of the struct is to
preserve whitespaces _around_ the expression.
This is also problematic when generating the code from the AST node, as
the generator then has no information about the parentheses or the
whitespace between them and the expression, which would lead to the
removal of the parentheses in the generated code.
I noticed this while working on the f-string formatting where the debug
text would be used to preserve the text surrounding the expression in
the presence of debug expression. The parentheses were being dropped
then which made me realize that the problem is instead in the parser.
## Test Plan
1. Add a test case for the parser
2. Add a test case for the generator
## Summary
This PR reduces the size of `Expr` from 80 to 64 bytes, by reducing the
sizes of...
- `ExprCall` from 72 to 56 bytes, by using boxed slices for `Arguments`.
- `ExprCompare` from 64 to 48 bytes, by using boxed slices for its
various vectors.
In testing, the parser gets a bit faster, and the linter benchmarks
improve quite a bit.
## Summary
Given:
```python
F"{"ڤ
```
We try to locate the "unclosed left brace" error by subtracting the
quote size from the lexer offset -- so we subtract 1 from the end of the
source, which puts us in the middle of a Unicode character. I don't
think we should try to adjust the offset in this way, since there can be
content _after_ the quote. For example, with the advent of PEP 701, this
string could reasonably be fixed as:
```python
F"{"ڤ"}"
```
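The offset problem is easy to reproduce outside the lexer. The sketch below works on the UTF-8 bytes of the unterminated source and shows that "end of source minus one" is not a character boundary; `is_char_boundary` is a hypothetical helper written for this example.

```python
SOURCE = 'F"{"ڤ'
encoded = SOURCE.encode("utf-8")

# What the old logic effectively computed: end of source minus the quote size.
offset = len(encoded) - 1


def is_char_boundary(data, index):
    """Return True if `index` does not point at a UTF-8 continuation byte."""
    return index == len(data) or (data[index] & 0xC0) != 0x80


print(len(encoded), offset, is_char_boundary(encoded, offset))
```

Any source that ends in a multi-byte character triggers the same situation, since the last byte of such a character is always a continuation byte.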
Closes https://github.com/astral-sh/ruff/issues/9379.
## Summary
This PR modifies our `Cargo.toml` files to use workspace dependencies
for _all_ dependencies, rather than the status quo of sporadically
trying to use workspace dependencies for those dependencies that are
used across multiple crates. I find the current situation more confusing
and harder to manage, since we have a mix of workspace and crate-local
dependencies, whereas this setup consistently uses the same approach for
all dependencies.
## Summary
I always found it odd that we had to pass this in, since it's really
higher-level context for the error. The awkwardness is further evidenced
by the fact that we pass in fake values everywhere (even outside of
tests). The source path isn't actually used to display the error; it's
only accessed elsewhere to _re-display_ the error in certain cases. This
PR instead passes the path directly in those cases.
## Summary
This helps a bit with (but does not close) the issues described in
https://github.com/astral-sh/ruff/issues/9311. E.g., now, we at least
see: `error: Failed to format main.py: source contains syntax errors:
invalid syntax. Got unexpected token '=' at byte offset 20`.
## Summary
This PR adds some helper structs to the linter paths to enable passing
in the pre-computed tokens and parsed source code during benchmarking,
to remove lexing and parsing from the overall linter benchmark
measurement. We already remove parsing for the formatter, and we have
separate benchmarks for the lexer and the parser, so this should make it
much easier to measure linter performance changes.
This sets `lto = "thin"` instead of using "fat" LTO, and sets
`codegen-units = 16`. These are the defaults for Cargo's `release`
profile, and I think it may give us faster iteration times, especially
when benchmarking. The point of this PR is to see what kind of impact
this has on benchmarks. It is expected that benchmarks may regress to
some extent.
I did some quick ad hoc experiments to quantify this change in compile
times. Namely, I ran:
    cargo build --profile release -p ruff_cli
Then I ran
    touch crates/ruff_python_formatter/src/expression/string/docstring.rs
(because that's where I've been working lately) and re-ran
    cargo build --profile release -p ruff_cli
This last command is what I timed, since it reflects how much time one
has to wait between making a change and getting a compiled artifact.
Here are my results:
* With the status quo `release` profile, build takes 77s
* With `release` but `lto = "thin"`, build takes 41s
* With `release`, but `lto = false`, build takes 19s
* With `release`, but `lto = false` **and** `codegen-units = 16`, build
takes 7s
* With `release`, but `lto = "thin"` **and** `codegen-units = 16`, build
takes 16s (I believe this is the default `release` configuration)
This PR represents the last option. It's not the fastest to compile, but
it's nearly a whole minute faster! The idea is that with `codegen-units
= 16`, we still make use of parallelism, but keep _some_ level of LTO on
to try and re-gain what we lose by increasing the number of codegen
units.
Rebase of #6365 authored by @davidszotten.
## Summary
This PR updates the AST structure for f-string elements.
The main **motivation** behind this change is to have a dedicated node
for the string part of an f-string. Previously, the existing
`ExprStringLiteral` node was used for this purpose which isn't exactly
correct. The `ExprStringLiteral` node should include the quotes as well
in the range but the f-string literal element doesn't include the quote
as it's a specific part within an f-string. For example,
```python
f"foo {x}"
# ^^^^
# This is the literal part of an f-string
```
The new `FStringElement` enum helps here: it represents
either the literal part or the expression part of an f-string.
### Rule Updates
This means that there'll be two nodes representing a string depending on
the context. One for a normal string literal while the other is a string
literal within an f-string. The AST checker is updated to accommodate
this change. The rules which work on string literal are updated to check
on the literal part of f-string as well.
#### Notes
1. The `Expr::is_literal_expr` method would check for
`ExprStringLiteral` and return true if so. Now that we don't
represent the literal part of an f-string using that node, the method's
behavior improves, as it is confined to actual expressions. We do have
the `FStringElement::is_literal` method.
2. We avoid checking if we're in a f-string context before adding to
`string_type_definitions` because the f-string literal is now a
dedicated node and not part of `Expr`.
3. Annotations cannot use f-string so we avoid changing any rules which
work on annotation and checks for `ExprStringLiteral`.
## Test Plan
- All references of `Expr::StringLiteral` were checked to see if any of
the rules require updating to account for the f-string literal element
node.
- New test cases are added for rules which check against the literal
part of an f-string.
- Check the ecosystem results and ensure it remains unchanged.
## Performance
There's a performance penalty in the parser. The reason for this remains
unknown as it seems that the generated assembly code is now different
for the `__reduce154` function. The reduce function body is just popping
the `ParenthesizedExpr` on top of the stack and pushing it with the new
location.
- The size of `FStringElement` enum is the same as `Expr` which is what
it replaces in `FString::format_spec`
- The size of `FStringExpressionElement` is the same as
`ExprFormattedValue` which is what it replaces
I tried reducing the `Expr` enum from 80 bytes to 72 bytes but it hardly
resulted in any performance gain. The difference can be seen here:
- Original profile: https://share.firefox.dev/3Taa7ES
- Profile after boxing some node fields:
https://share.firefox.dev/3GsNXpD
### Backtracking
I tried backtracking the changes to see if any of the isolated change
produced this regression. The problem here is that the overall change is
so small that there's only a single checkpoint where I can backtrack and
that checkpoint results in the same regression. This checkpoint is to
revert to using `Expr` for the `FString::format_spec` field. Beyond this
point, the change would revert to the original implementation.
## Review process
The review process is similar to #7927. The first set of commits update
the node structure, parser, and related AST files. Then, further commits
update the linter and formatter part to account for the AST change.
---------
Co-authored-by: David Szotten <davidszotten@gmail.com>
## Summary
Our `SoftKeywordTokenizer` only respected soft keywords in compound
statement positions -- for example, at the start of a logical line:
```python
type X = int
```
However, type aliases can also appear in simple statement positions,
like:
```python
class Class: type X = int
```
(Note that `match` and `case` are _not_ valid keywords in such
positions.)
This PR upgrades the tokenizer to track both kinds of valid positions.
Closes https://github.com/astral-sh/ruff/issues/8900.
Closes https://github.com/astral-sh/ruff/issues/8899.
## Test Plan
`cargo test`
## Summary
Given `with (a := b): pass`, we truncate the `WithItem` range by one on
both sides such that the parentheses are part of the statement, rather
than the item. However, for `with (a := b) as x: pass`, we want to avoid
this trick.
Closes https://github.com/astral-sh/ruff/issues/8913.