When automating tasks, you often reach a point where you have to remove newlines (line breaks) from documents or ASCII streams so that the output can be used as parameters for the next step. There are many ways to do this. Especially in Big Data streaming, you should also consider the performance of the tool or command. This article, “sed replace newline”, compares different tools with regard to usability and performance.
In the Linux world, “sed” is a very common tool for manipulating strings with regular expressions, but for removing line breaks the solution is not very intuitive. In most cases it is easier to just use the tr tool to remove the line breaks. If you need more options and functionality, you can use python for the job, directly on the command line.
I compared the following tools (regarding usability and performance):
- sed (unreadable, not intuitive, slow)
- xargs (easy, slow, but can only do one specific job)
- tr (easy, fast, but can only replace single character)
- python (easy to read, fast, can do every job)
- perl (easy, slow)
For testing and demonstration I use a document “example.txt” that contains 5 lines, each ending with a newline. For the performance comparison I use a big_example.txt with 1,000,000 lines:
$ cat example.txt
line 1
line 2
line 3
line 4
line 5
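The two test files can be created with a few lines of Python. This is only a sketch: the helper name write_numbered_lines is hypothetical, and the content of big_example.txt is not shown in this article, so the "line N" pattern is an assumption carried over from example.txt.

```python
def write_numbered_lines(path, count):
    """Write `count` lines of the form 'line N', each ending with a newline."""
    # newline="\n" keeps Unix line endings on every platform
    with open(path, "w", newline="\n") as handle:
        for number in range(1, count + 1):
            handle.write(f"line {number}\n")
```

Calling write_numbered_lines("example.txt", 5) and write_numbered_lines("big_example.txt", 1_000_000) should give files comparable to the ones used here.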
Replace newlines with sed
sed is indeed a very powerful tool to manipulate strings or streams (sed = Stream EDitor). The basic syntax is sed 's/SEARCH/REPLACE/g', which replaces SEARCH with REPLACE within each line of a stream. Removing a newline character ("\n", backslash n) does not work this way, though: sed reads its input line by line and strips the trailing newline before it applies the commands, so s/\n/ /g never sees a newline to replace. The following workaround does the job but is not readable or intuitive at all. It defines a label (:a), appends the next input line to the current one (N), jumps back to the label as long as the last line has not been reached ($!ba) and only then replaces all the newlines that were collected:
sed ':a;N;$!ba;s/\n/ /g'
# sed replace newline usage
$ cat example.txt | sed ':a;N;$!ba;s/\n/ /g'
line 1 line 2 line 3 line 4 line 5
sed performance when removing the line breaks of a file with 1 million rows:
# sed replace newline performance
$ time cat big_example.txt | sed ':a;N;$!ba;s/\n/ /g' >/dev/null
real 0m0.367s
user 0m0.281s
sys 0m0.079s
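To make the difference between the plain substitution and the workaround more tangible, here is a rough Python model of what sed does in both cases. It is a simplified sketch and not sed's actual implementation; the function names are made up for illustration.

```python
def substitute_per_line(text):
    """Model of plain sed 's/\\n/ /g': each line is handled without its newline."""
    out = []
    for line in text.splitlines():
        pattern_space = line                              # trailing newline already stripped
        pattern_space = pattern_space.replace("\n", " ")  # finds nothing to replace
        out.append(pattern_space + "\n")                  # newline is added back on output
    return "".join(out)


def substitute_after_joining(text):
    """Model of sed ':a;N;$!ba;s/\\n/ /g': collect all lines, then substitute once."""
    lines = text.splitlines()
    if not lines:
        return ""
    pattern_space = lines[0]
    for next_line in lines[1:]:
        pattern_space += "\n" + next_line             # N: append newline + next line
    pattern_space = pattern_space.replace("\n", " ")  # s/\n/ /g on the collected text
    return pattern_space + "\n"                       # one final newline on output


sample = "line 1\nline 2\nline 3\n"
print(repr(substitute_per_line(sample)))       # 'line 1\nline 2\nline 3\n'
print(repr(substitute_after_joining(sample)))  # 'line 1 line 2 line 3\n'
```

The model also shows why the workaround has to hold the complete input in memory before it can write anything.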
Replace linefeeds with xargs
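xargs is a standard Linux tool that builds and executes command lines from items read from standard input. If no command is given it runs echo, which prints the items separated by single spaces and thereby joins the lines. Taken from its man page: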
xargs reads items from the standard input, delimited by blanks (which can be protected with double or single quotes or a backslash) or newlines, and executes the command (default is /bin/echo) one or more times with any initial-arguments followed by items read from standard input. Blank lines on the standard input are ignored.
$ cat example.txt | xargs
line 1 line 2 line 3 line 4 line 5
$ time cat big_example.txt | xargs >/dev/null
real 0m0.678s
user 0m0.268s
sys 0m0.905s
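For simple input, the effect of xargs with its default echo command can be approximated in Python as below. This is a sketch under clear assumptions: it ignores the quote and backslash handling described in the man page, and it always produces a single output line, whereas the real xargs may split very long input across several echo calls and therefore several lines. The function name is hypothetical.

```python
def join_like_xargs(text):
    """Approximate `xargs` (default command echo) for input without quotes or backslashes."""
    items = text.split()            # split on blanks and newlines; blank lines disappear
    return " ".join(items) + "\n"   # echo prints the items space-separated plus one newline


print(repr(join_like_xargs("line 1\nline 2\n\nline 3\n")))  # 'line 1 line 2 line 3\n'
```

Note that runs of blanks collapse into a single space here, which is one reason why xargs is only suitable for this one specific job.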
Replace line breaks with tr
tr is perfect for translating, replacing or deleting single characters. Note that it also replaces the last newline (unlike xargs), which is why the shell prompt ends up on the same line as the output.
$ cat example.txt | tr '\n' ' '
line 1 line 2 line 3 line 4 line 5 $
$ time cat big_example.txt | tr '\n' ' ' >/dev/null
real 0m0.040s
user 0m0.013s
sys 0m0.026s
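The single-character mapping that tr performs corresponds to a byte translation table. A minimal Python sketch of the same idea, using the standard bytes.maketrans and bytes.translate (the names NEWLINE_TO_SPACE and translate_newlines are made up for this example):

```python
# Translation table that maps the newline byte to a space, like tr '\n' ' '
NEWLINE_TO_SPACE = bytes.maketrans(b"\n", b" ")


def translate_newlines(data: bytes) -> bytes:
    """Replace every newline byte with a space, including the last one."""
    return data.translate(NEWLINE_TO_SPACE)


print(translate_newlines(b"line 1\nline 2\n"))  # b'line 1 line 2 '
```

As with tr, the table maps one byte to exactly one other byte, so replacing a newline with a longer string such as ", " is not possible this way.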
Remove newlines with python directly from stdin – one liner
Of course you can also use the power of python directly from stdin / stdout to remove newlines. This seems a bit too much to type – but you can do pretty much everything with python (think of JSON or XML formatting etc.)
python -c "import sys; print(sys.stdin.read().replace('\n',' '))"
$ cat example.txt | python -c "\
import sys; print(sys.stdin.read().replace('\n',' '))"
line 1 line 2 line 3 line 4 line 5
$ time cat big_example.txt | python -c "\
import sys; print(sys.stdin.read().replace('\n',' '))" > /dev/null
real 0m0.076s
user 0m0.016s
sys 0m0.052s
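The one-liner reads the complete input into memory before it replaces anything. For very large streams a chunked variant may be preferable. The following is a sketch and was not part of the measurements above; the function name and the chunk size are arbitrary choices.

```python
import io


def replace_newlines(source, target, chunk_size=1 << 16):
    """Copy the binary stream `source` to `target`, turning newline bytes into spaces."""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty read signals the end of the stream
            break
        target.write(chunk.replace(b"\n", b" "))


# Small in-memory demonstration with a deliberately tiny chunk size
demo_in = io.BytesIO(b"line 1\nline 2\n")
demo_out = io.BytesIO()
replace_newlines(demo_in, demo_out, chunk_size=4)
print(demo_out.getvalue())  # b'line 1 line 2 '
```

To use it in a pipe, pass sys.stdin.buffer and sys.stdout.buffer as source and target. Unlike the print-based one-liner, this variant does not append a final newline.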
perl replace newline
Last but not least – you can also use perl to remove newlines from a file or a stream. Perl is also pretty powerful – but not as intuitive as python.
$ cat example.txt | perl -pe 's/\n/ /'
line 1 line 2 line 3 line 4 line 5 $
$ time cat big_example.txt | perl -pe 's/\n/ /' >/dev/null
real 0m0.517s
user 0m0.485s
sys 0m0.029s
Comparing the replace methods
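So the winner is “tr”. It is easy to use, short to write and, at 0.040s for the 1,000,000-line file, the fastest tool in this comparison. If you need to apply more complex logic than just removing line feeds, python is the runner-up: at 0.076s it is still fast, and it is fully flexible. sed (0.367s), perl (0.517s) and xargs (0.678s) were roughly 5 to 9 times slower than python.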