Wednesday, August 8, 2018

text processing - Limit grep output to short lines


I often use grep to find files having a certain entry like this:


grep -R 'MyClassName'

The good thing is that it returns the matching files and their contents and marks the found string in red. The bad thing is that I also have huge files where the entire text is written on one single line, so grep prints far too much when it finds a match in them. Is there a way to limit the output to, for instance, 5 words to the left and right of the match? Or maybe to 30 characters on either side?



grep itself only has line-based context options (-A, -B, -C). An alternative is suggested by this Super User post:



A workaround is to enable the option 'only-matching' and then to use
RegExp's power to grep a bit more than your text:


grep -o ".\{0,50\}WHAT_I_M_SEARCHING.\{0,50\}" ./filepath

Of course, if you use color highlighting, you can always grep again to
only color the real match:


grep -o ".\{0,50\}WHAT_I_M_SEARCHING.\{0,50\}"  ./filepath | grep "WHAT_I_M_SEARCHING"
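Applied to the recursive search from the question, the quoted workaround could look roughly like the sketch below. It assumes GNU grep; context_grep is a hypothetical helper name, not something from the SU post, and the pattern is interpolated as-is into an extended regular expression.

```shell
# Sketch, assuming GNU grep: print at most n characters of context on each
# side of every match, searching a directory recursively.
# context_grep is a hypothetical helper name used only for illustration.
context_grep() {
    pattern=$1      # treated as an extended regular expression
    dir=${2:-.}     # directory to search, defaults to the current one
    n=${3:-30}      # context characters on each side
    grep -R -o -E ".{0,$n}$pattern.{0,$n}" "$dir"
}
```

For example, context_grep MyClassName . 30 prints each hit as file:context. Because -o reports non-overlapping matches, a second occurrence that lies within n characters of the first shows up inside the first hit's context rather than as a hit of its own.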


As another alternative, I'd suggest folding the text and then grepping it, for example:


fold -sw 80 input.txt | grep ...

The -s option makes fold break lines at spaces, pushing a word to the next line instead of splitting it in the middle; -w 80 sets the maximum line width.
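Folding through a pipe loses the file name, so for several files a small wrapper might help. This is only a sketch: fold_grep is a hypothetical helper name, and --label is assumed to be available (it is a GNU grep option).

```shell
# Sketch: fold each file to 80 columns, then grep the folded text.
# fold_grep is a hypothetical helper name; --label requires GNU grep.
fold_grep() {
    pattern=$1
    shift
    for f in "$@"; do
        # --label makes grep report the hit under the real file name
        # instead of "(standard input)"; -H forces the name to be printed.
        fold -sw 80 "$f" | grep --label="$f" -H -- "$pattern"
    done
}
```

Usage would be along the lines of fold_grep MyClassName input.txt. One caveat: if the text has no spaces within 80 columns (minified code, for instance), fold still has to cut somewhere, and a match that straddles the cut will be missed.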


Or use some other way to split the input into lines based on the structure of your input. (The SU post, for example, dealt with JSON, so using jq etc. to pretty-print and grep ... or just using jq to do the filtering by itself ... would be better than either of the two alternatives given above.)




This GNU awk method might be faster:


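gawk -v n=50 -v RS='MyClassName' '
# every record after the first is preceded by a match: print saved context, match, next n chars
FNR > 1 { printf "%s: %s\n", FILENAME, p prt substr($0, 1, n) }
# remember the last n characters of this record and the text that terminated it
{ p = substr($0, length - n + 1); prt = RT }
' input.txt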


  • Tell awk to split records on the pattern we're interested in (-v RS=...) and pass the number of context characters (-v n=...).

  • Each record after the first (FNR > 1) is one where awk found a match for the pattern just before it.

  • So we print the n trailing characters of the previous record (p), the text that matched at the end of the previous record (prt), and the n leading characters of the current record.

    • We set p and prt after printing, so the values we set are used when the next record is processed.

    • RT is a GNUism; that's why this is GNU awk-specific.
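Where gawk isn't available, a different technique can get a similar result: instead of splitting records on the pattern, walk each line with match() and RSTART/RLENGTH, which are POSIX awk. The sketch below is an illustration under that assumption, not the method above; ctx_awk is a hypothetical helper name, and it has not been benchmarked against very large single-line files.

```shell
# Portable sketch (POSIX awk, no RT): print up to n characters on each side
# of every match of pat. ctx_awk is a hypothetical helper name.
ctx_awk() {
    pat=$1
    n=$2
    shift 2
    awk -v pat="$pat" -v n="$n" '
    {
        line = $0
        off = 0                          # characters of $0 already consumed
        while (match(line, pat)) {
            if (RLENGTH == 0) break      # guard against empty matches
            start = off + RSTART         # position of the match within $0
            from = start - n
            if (from < 1) from = 1
            # leading context + matched text + trailing context
            print FILENAME ": " substr($0, from, (start - from) + RLENGTH + n)
            off += RSTART + RLENGTH - 1
            line = substr(line, RSTART + RLENGTH)
        }
    }' "$@"
}
```

It is called as ctx_awk MyClassName 50 input.txt. Note that awk applies escape-sequence processing to values passed with -v, so a pattern containing backslashes needs them doubled.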



For recursive search, maybe:


find . -type f -exec gawk -v n=50 -v RS='MyClassName' 'FNR>1{printf "%s: %s\n", FILENAME, p prt substr($0, 1, n)} {p = substr($0, length-n+1); prt = RT}' {} +
