Cleaning Math Expressions of Unknown Syntax
Our goal is to convert as many math expressions as possible into a standardized postfix array. We want to accept regular math expressions that may not exactly follow correct syntax, Python/JavaScript and hopefully more languages, and LaTeX.
For maximum flexibility, we prefer to accept weird combinations of these syntaxes within the same expression with just one universal parser. We don't want the algorithm to be too slow, but the assumption is that we will be handling a reasonable number of short strings so the priority is simplicity and robustness over speed.
What to do first
To keep the postfixing algorithm as simple as possible, I will pass the input through several steps first to create a more canonical form. Many conversions can occur at any time and we want to perform them at the step that simplifies the overall process. If a straight-forward search and replace works we will do that first. Also anything that is too complex to work into the postfix algorithm will be done before postfixing.
The current process is:
- Properly deal with important whitespace.
- Remove whitespace.
- Search and replace lots of substrings.
- Complex parsing of LaTeX (and maybe more).
- Create Infix array.
- Create Postfix array.
Remove whitespace
Most whitespace (spaces and possibly tabs) in math expressions can be ignored so I want to get rid of any whitespace as early as possible to avoid dealing with it in later steps. Some whitespace is important so the first step is to carefully remove these complications.
For now, my list only contains a few examples. The strings in the first element of each 2 element array are replaced by the strings in the second element. Mostly some strings are replaced to avoid concatenations with variable names. The "%" can be used for both percentages and modular arithmetic, while the "!" can be a factorial or part of a not equal sign. Whitespace can help distinguish between the possiblities.
var whitespaceArray = [[" And ","&"], [" Or ","|"], [" In ","∈"], [" Not In ","∉"], [" Mod ","\\mod"], [" Perm ","\\perm"], [" Comb ","\\comb"], [" choose ","\\comb"], [" % ","\\mod"], ["! ","!_"]];
Each element is added to a trie for replacement. We automatically add " And ", " and ", and " AND " to the trie to handle the common cases the same. We do the same for the other words since we don't know what case to expect. Converting the input to all lowercase would also work here but for some expressions the case is important. The output strings include the double slashes to escape the second one. So "\\comb" is really "\comb".
Once we have completed this step we can eliminate the remaining spaces.
input = input.replace(/\s/g,"");
Replace Python, JavaScript, and more
Instead of trying to postfix "Math.sin(", "math.sin(", and "sin(" we will convert every sine function to "sin(" in this step. We want to properly convert any of the basic math functions of Python or JavaScript. Eventually we can add support for more languages. There are a relatively small number of functions and expressions to replace so we just add them all to a spreadsheet.
Lots of LaTeX expressions are easy to replace if they start with "\". Expressions like "\pi" will be replaced by "π", but "pi" without a leading slash will remain as "pi" because there could be more letters to follow. If someone types "\pit" they probably want "πt", but "pit" might just be the word pit.
I update this spreadsheet with all of the replacements that seem like good ideas and then generate a trie that can be easily searched. Strings that are case sensitive like "pi" versus "Pi" are only added with the inputted case, but other strings are added to the map in multiple cases.
Trie search and replace
Once we have a spreadsheet with a bunch of conversions, we will construct a trie. If you like regex you can probably use that, but a trie makes it simple to do what I want. A trie is a data structure that is a type of tree that can easily search a set of strings. A short explanation of tries can be found here and a longer explanation of tries is here.
For a quick visualization of the above trie, start typing a Python/JavaScript/LaTeX math expression in the box below. As you type the tree will be pruned to only show the strings that are still possible. Nodes with green backgrounds represent the end of a potential string. Click or hover on an ending node to see how that string will be replaced.
Input:
A trie efficiently searches a large set of keys by generating this large tree that can easily be pruned so that the fewest number of elements need to be checked at each step. For example code, see this repl that replaces Greek letters or this repl that produces the above interactive graphic.
To find every substring and not just substrings at the beginning, we need to repeat this search starting at each character. At each character we replace the longest substring that starts in that place.
function trieReplace(input) { var len = input.length; for (var i=0;i<input.length;i++){ //trieSearch() searches the trie for the substring starting at index i //root is the root of the trie to search var ts = trieSearch(input.substring(i),root); if (ts){ //ts is an 2 element array of the form [length(key),key] let out = trieMap[ts[1]]; //edit the string and continue the search at the correct index input = input.substring(0,i)+out+input.substring(i+ts[0]); i += out.length-1; } } return input; }
Assuming the root element, trieSearch function, and trieMap map are correctly generated the above function will replace every substring that needs to be replaced. A poorly constructed trie can lead to replacing too many strings, so any strings that require more context either need to be handled separately or while postfixing.
Final Product
Other expressions like LaTeX matrices will be handled in a future post, and this post covers the basics of postfixing. Several repls including this one have lots of the code to parse expressions. Visit Mathzetta to see how much of the output can be converted to LaTeX for display.
Or just input an expression below to see the first few steps in action.
Input:
Cleaned String:
Step | Input | Output |
---|