sed replace newline (or 5 ways to remove line breaks with sed, python, tr, perl, xargs)


When you automate tasks, you often reach a point where you have to remove newlines (line breaks) from documents or ASCII streams so that the output can be passed as parameters to the next step. There are several ways to do this, and especially in Big Data streaming you should also consider the performance of the tool or command. This article, “sed replace newline”, compares different tools in terms of usability and performance.

sed remove line break vs replace newline with other tools

In the Linux world, “sed” is a very common tool for manipulating strings with regular expressions, but when it comes to removing line breaks the solution is not very intuitive. In most cases it is easier to use the tr tool to remove the line breaks. If you need more options and functionality, you can use python for the job, directly on the command line.

 

So I compared the following tools with regard to usability and performance:

  • sed (unreadable, not intuitive, slow)
  • xargs (easy, slow, but can only do one specific job)
  • tr (easy, fast, but can only replace single characters)
  • python (easy to read, fast, can do every job)
  • perl (easy, slow)

For testing and demonstration I use a document “example.txt” that contains 5 lines, each ending with a newline. For the performance comparison I use big_example.txt with 1,000,000 lines:

$ cat example.txt

line 1
line 2
line 3
line 4
line 5
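
If you want to create comparable test files yourself, the following is a minimal sketch. It assumes that big_example.txt follows the same “line N” pattern as example.txt, which is my assumption for illustration; the helper names write_lines and make_example_file are made up for this sketch.

```python
import io


def write_lines(handle, line_count):
    """Write `line_count` lines of the form 'line N', each ending with a newline."""
    for number in range(1, line_count + 1):
        handle.write(f"line {number}\n")


def make_example_file(path, line_count):
    """Create a test file at `path` (hypothetical helper, not part of any tool)."""
    # newline="\n" keeps plain LF line endings on every platform
    with open(path, "w", newline="\n") as handle:
        write_lines(handle, line_count)


# Demo on an in-memory buffer, so nothing is written to disk here
buffer = io.StringIO()
write_lines(buffer, 5)
print(buffer.getvalue(), end="")
```

Calling make_example_file("example.txt", 5) and make_example_file("big_example.txt", 1_000_000) would then produce the two input files.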

Replace newlines with sed

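sed is indeed a very powerful tool for manipulating strings or streams (sed = Stream EDitor). The basic syntax is sed 's/SEARCH/REPLACE/g', which replaces every occurrence of SEARCH with REPLACE within each line of a stream. Removing a newline character ("\n", backslash n) does not work the way you might expect, though: sed reads its input one line at a time and strips the trailing newline before it applies the command, so the newline is not a normal character within the line but a divider between two lines, and s/\n/ /g finds nothing to match. The following workaround does the job, but it is neither readable nor intuitive. It defines a label (:a), appends the next input line to the current one (N), jumps back to the label until the last line has been read ($!ba), and only then replaces the newlines that now sit inside the collected text: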

sed ':a;N;$!ba;s/\n/ /g'

# sed replace newline usage
$ cat example.txt | sed ':a;N;$!ba;s/\n/ /g'
line 1 line 2 line 3 line 4 line 5
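
To make the sed behaviour less mysterious, here is a rough Python model of what happens. It is only a sketch of the idea, not how sed is implemented, and the function names sed_naive and sed_join are invented for this illustration.

```python
def sed_naive(text):
    # Model of: sed 's/\n/ /g'
    # Each line is processed after its newline has been stripped,
    # so the substitution never sees a newline and changes nothing.
    return "".join(line.replace("\n", " ") + "\n" for line in text.splitlines())


def sed_join(text, replacement=" "):
    # Model of: sed ':a;N;$!ba;s/\n/ /g'
    lines = text.splitlines()          # newlines are stripped while reading
    if not lines:
        return ""
    pattern_space = lines[0]           # first cycle: the buffer holds line 1
    for next_line in lines[1:]:        # N appends "\n" + next line, $!ba loops
        pattern_space += "\n" + next_line
    # Only now does the buffer contain newlines that s/\n/ /g can match.
    # The final newline is added back when the buffer is printed.
    return pattern_space.replace("\n", replacement) + "\n"


sample = "line 1\nline 2\nline 3\n"
print(repr(sed_naive(sample)))  # 'line 1\nline 2\nline 3\n' (unchanged)
print(repr(sed_join(sample)))   # 'line 1 line 2 line 3\n'
```

The model also shows why this approach needs the whole input in memory: the buffer grows until the last line has been appended.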

 

sed performance when removing the line breaks of a file with 1 million lines:

# sed replace newline performance
$ time cat big_example.txt | sed ':a;N;$!ba;s/\n/ /g' >/dev/null

real	0m0.367s
user	0m0.281s
sys	0m0.079s

Replace linefeeds with xargs

xargs is a standard Linux tool that builds and executes command lines from its standard input. Taken from its man page:

xargs reads items from the standard input, delimited by blanks (which can be protected with double or single quotes or a backslash) or newlines, and executes the command (default is /bin/echo) one or more times with any initial-arguments followed by items read from standard input. Blank lines on the standard input are ignored.

$ cat example.txt | xargs
line 1 line 2 line 3 line 4 line 5
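
The effect of plain xargs (with its default echo) on simple input can be approximated in a few lines of Python. This is a simplified sketch: it ignores the quote and backslash handling mentioned in the man page, and it assumes the whole input fits into one command line, which is not guaranteed for very large inputs. The function name xargs_echo is made up.

```python
def xargs_echo(text):
    # Split on blanks and newlines; empty lines simply disappear
    items = text.split()
    # echo joins its arguments with single spaces and ends with a newline
    return " ".join(items) + "\n"


print(repr(xargs_echo("line 1\nline 2\n\n   line 3\n")))
# 'line 1 line 2 line 3\n'
```

Note that runs of blanks inside a line are collapsed as well, so xargs changes more than just the newlines.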

xargs performance

$ time cat big_example.txt | xargs >/dev/null

real	0m0.678s
user	0m0.268s
sys	0m0.905s

Replace line breaks with tr

tr is perfect for translating, replacing or deleting single characters. Note that it also replaces the last newline (unlike xargs), which is why the shell prompt ends up on the same line as the output:

$ cat example.txt | tr '\n' ' '
line 1 line 2 line 3 line 4 line 5 $
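
For comparison, the closest Python counterpart to tr's character-for-character mapping is probably str.translate. The sketch below shows both the replace variant used above and a delete variant (comparable to tr -d); the function names are invented for this example.

```python
# Translation tables are built once and can be reused for every call
NEWLINE_TO_SPACE = str.maketrans("\n", " ")
DELETE_NEWLINES = str.maketrans("", "", "\n")


def tr_newlines(text):
    """Replace every newline with a space, including the last one."""
    return text.translate(NEWLINE_TO_SPACE)


def tr_delete_newlines(text):
    """Remove every newline without inserting anything."""
    return text.translate(DELETE_NEWLINES)


sample = "line 1\nline 2\nline 3\n"
print(repr(tr_newlines(sample)))         # 'line 1 line 2 line 3 '
print(repr(tr_delete_newlines(sample)))  # 'line 1line 2line 3'
```

As with tr, the final newline is treated like any other, so the result ends with a space (or with nothing) instead of a line break.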

tr performance

$ time cat big_example.txt | tr '\n' ' ' >/dev/null

real	0m0.040s
user	0m0.013s
sys	0m0.026s

Remove newlines with python directly from stdin – one liner

Of course you can also use the power of python directly on stdin / stdout to remove newlines. It is a bit more to type, but you can do pretty much everything with python (think of JSON or XML formatting etc.).

python -c "import sys; print(sys.stdin.read().replace('\n',' '))"

$ cat example.txt | python -c "\
import sys; print(sys.stdin.read().replace('\n',' '))"
line 1 line 2 line 3 line 4 line 5 
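
The one-liner reads the complete input into memory before replacing anything. For very large streams, a chunked variant may be preferable. The following is a minimal sketch of that idea; the function name replace_newlines and the default chunk size are my own choices for illustration.

```python
import io


def replace_newlines(source, sink, replacement=" ", chunk_size=1 << 16):
    """Copy `source` to `sink`, replacing each newline with `replacement`.

    Only `chunk_size` characters are held in memory at a time.
    """
    while True:
        chunk = source.read(chunk_size)
        if not chunk:          # an empty string signals end of input
            break
        sink.write(chunk.replace("\n", replacement))


# Demo with in-memory streams and a tiny chunk size to exercise the loop
source = io.StringIO("line 1\nline 2\nline 3\n")
sink = io.StringIO()
replace_newlines(source, sink, chunk_size=4)
print(repr(sink.getvalue()))  # 'line 1 line 2 line 3 '
```

In a script you would pass sys.stdin and sys.stdout instead of the in-memory streams. Because a newline is a single character, it can never be split across two chunks, so the chunk boundaries do not affect the result.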

Python Performance

$ time cat big_example.txt | python -c "\
import sys; print(sys.stdin.read().replace('\n',' '))" > /dev/null

real	0m0.076s
user	0m0.016s
sys	0m0.052s

perl replace newline

Last but not least – you can also use perl to remove newlines from a file or a stream. Perl is also pretty powerful – but not as intuitive as python.

$ cat example.txt | perl -pe 's/\n/ /'
line 1 line 2 line 3 line 4 line 5 $

Perl Performance

$ time cat big_example.txt | perl -pe 's/\n/ /' >/dev/null

real	0m0.517s
user	0m0.485s
sys	0m0.029s

comparing replace methods

So the winner is “tr”: it is easy to use, short to write and, at 0.040 s for 1,000,000 lines, the fastest tool in this comparison. If you need more complex logic than just removing line feeds, python is the runner-up: at 0.076 s it is still fast, and it is fully flexible.
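
Timings like these depend heavily on the machine, so it is worth repeating the measurement yourself. Below is a small sketch of a timing helper in Python. It is not the method used for the numbers above (those come from the shell's time command); it only measures wall-clock time including process start-up, and the name time_command is invented for this example.

```python
import subprocess
import sys
import time


def time_command(command, data, repeats=3):
    """Feed `data` to `command` on stdin; return (best seconds, stdout bytes)."""
    best = float("inf")
    output = b""
    for _ in range(repeats):
        start = time.perf_counter()
        result = subprocess.run(
            command, input=data, stdout=subprocess.PIPE, check=True
        )
        best = min(best, time.perf_counter() - start)
        output = result.stdout
    return best, output


# Small in-memory input; use a larger range for a meaningful benchmark
data = b"".join(b"line %d\n" % n for n in range(1, 1001))

# Stand-in command that works wherever this script runs. Other candidates,
# if installed: ["tr", "\\n", " "] or ["perl", "-pe", "s/\\n/ /"]
python_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write(sys.stdin.read().replace('\\n', ' '))",
]

seconds, output = time_command(python_cmd, data)
print(f"python: {seconds:.3f}s for {len(output)} bytes")
```

Taking the best of several runs reduces the influence of disk caching and other background noise on the result.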


One thought on “sed replace newline (or 5 ways to remove line breaks with sed, python, tr, perl, xargs)”

  • Mark

    Nice comparison.

    I always hate all those sed regex hieroglyph scripts. The python CLI trick is really nice – I will remember that one.

    Cheers,
    Mark