Remove CRLF

To stay compatible with the convention inherited from old teletypes, DOS/Windows uses two control characters to represent a line break: a Carriage Return (CR = 0x0D) followed by a Line Feed (LF = 0x0A). Unix-like systems instead use a single LF, which is enough to mark a new line and saves 1 byte per line.

Example:

$ cat bacon.txt
Bacon Ipsum:

Bacon ipsum dolor sit amet salami
pork belly tail tongue pancetta,
pork loin tri-tip drumstick bresaola shankle.
$ file bacon_windows.txt bacon_linux.txt
bacon_windows.txt: ASCII text, with CRLF line terminators
bacon_linux.txt:   ASCII text
$ hexdump -C bacon_linux.txt
00000000  42 61 63 6f 6e 20 49 70  73 75 6d 3a 0a 0a 42 61  |Bacon Ipsum:..Ba|
00000010  63 6f 6e 20 69 70 73 75  6d 20 64 6f 6c 6f 72 20  |con ipsum dolor |
00000020  73 69 74 20 61 6d 65 74  20 73 61 6c 61 6d 69 0a  |sit amet salami.|
00000030  70 6f 72 6b 20 62 65 6c  6c 79 20 74 61 69 6c 20  |pork belly tail |
00000040  74 6f 6e 67 75 65 20 70  61 6e 63 65 74 74 61 2c  |tongue pancetta,|
00000050  0a 70 6f 72 6b 20 6c 6f  69 6e 20 74 72 69 2d 74  |.pork loin tri-t|
00000060  69 70 20 64 72 75 6d 73  74 69 63 6b 20 62 72 65  |ip drumstick bre|
00000070  73 61 6f 6c 61 20 73 68  61 6e 6b 6c 65 2e 0a     |saola shankle..|
0000007f
$ hexdump -C bacon_windows.txt
00000000  42 61 63 6f 6e 20 49 70  73 75 6d 3a 0d 0a 0d 0a  |Bacon Ipsum:....|
00000010  42 61 63 6f 6e 20 69 70  73 75 6d 20 64 6f 6c 6f  |Bacon ipsum dolo|
00000020  72 20 73 69 74 20 61 6d  65 74 20 73 61 6c 61 6d  |r sit amet salam|
00000030  69 0d 0a 70 6f 72 6b 20  62 65 6c 6c 79 20 74 61  |i..pork belly ta|
00000040  69 6c 20 74 6f 6e 67 75  65 20 70 61 6e 63 65 74  |il tongue pancet|
00000050  74 61 2c 0d 0a 70 6f 72  6b 20 6c 6f 69 6e 20 74  |ta,..pork loin t|
00000060  72 69 2d 74 69 70 20 64  72 75 6d 73 74 69 63 6b  |ri-tip drumstick|
00000070  20 62 72 65 73 61 6f 6c  61 20 73 68 61 6e 6b 6c  | bresaola shankl|
00000080  65 2e 0d 0a                                       |e...|
00000084

Notice that the file created in Windows contains the sequence 0d 0a where the file created in Linux only shows 0a. It was clear that this would bring compatibility issues and headaches.
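
If file or hexdump is not at hand, the same check can be sketched with printf and grep. The helper name count_crlf_lines and the sample file names are made up for this illustration; the helper simply counts the lines that end in a carriage return.

```shell
#!/bin/sh
# Hypothetical helper: print how many lines of a file end in CR (0x0D).
count_crlf_lines() {
    cr=$(printf '\r')                  # a literal carriage return
    # grep -c prints the number of matching lines; it exits with 1 when
    # that number is 0, which is not an error for our purposes.
    grep -c "${cr}\$" "$1" || [ $? -eq 1 ]
}

# Build two tiny sample files (names chosen for this sketch only).
tmpdir=$(mktemp -d)
printf 'Bacon Ipsum:\r\n\r\npork belly\r\n' > "$tmpdir/sample_windows.txt"
printf 'Bacon Ipsum:\n\npork belly\n'       > "$tmpdir/sample_linux.txt"

count_crlf_lines "$tmpdir/sample_windows.txt"   # prints 3
count_crlf_lines "$tmpdir/sample_linux.txt"     # prints 0
```

A count greater than zero means at least part of the file uses CRLF; a count lower than the total number of lines would point to mixed line endings.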

Although good editors can work with both files (in older versions of Notepad, the lines of bacon_linux.txt all appear concatenated), files that should be identical are reported as different by tools like “diff”:

$ diff bacon_windows.txt bacon_linux.txt
1,5c1,5
< Bacon Ipsum:
<
< Bacon ipsum dolor sit amet salami
< pork belly tail tongue pancetta,
< pork loin tri-tip drumstick bresaola shankle.
---
> Bacon Ipsum:
>
> Bacon ipsum dolor sit amet salami
> pork belly tail tongue pancetta,
> pork loin tri-tip drumstick bresaola shankle.
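
When only the comparison matters, diff itself can be told to ignore the difference. The sketch below assumes GNU diff, whose --strip-trailing-cr option drops a trailing CR from each input line before comparing; the file and variable names are invented for the example.

```shell
#!/bin/sh
# Assumes GNU diff (--strip-trailing-cr is a GNU extension).
workdir=$(mktemp -d)
printf 'Bacon Ipsum:\r\n\r\npork belly tail\r\n' > "$workdir/a_windows.txt"
printf 'Bacon Ipsum:\n\npork belly tail\n'       > "$workdir/a_linux.txt"

# Plain diff sees a different byte sequence on every line.
if diff "$workdir/a_windows.txt" "$workdir/a_linux.txt" > /dev/null; then
    plain_result="identical"
else
    plain_result="different"
fi

# With trailing CRs ignored, the two files compare equal.
if diff --strip-trailing-cr "$workdir/a_windows.txt" "$workdir/a_linux.txt" > /dev/null; then
    relaxed_result="identical"
else
    relaxed_result="different"
fi

echo "plain diff:               $plain_result"
echo "with --strip-trailing-cr: $relaxed_result"
```

This only hides the problem for one comparison; the files on disk still have different line endings.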

Today I put some code written partly on Windows and partly on Linux under version control and chose to standardize everything on LF. Here is how I did the conversion automatically on Linux:

Vim

Two ways:

1

  • Display CRLF as ^M: :e ++ff=unix
  • Replace every ^M with a line break: :%s/\r/\r/g (in the pattern \r matches CR; in the replacement it inserts a line break)
  • Remove ^M only if it is at the end of the line: :%s/\r\+$//
  • Remove all ^M: :%s/^M//g

^M is typed as ^V^M (Ctrl+V Ctrl+M).

2

  • Convert the file format to unix: :setlocal fileformat=unix
  • Save: :w
  • Reload: :e

Convert multiple files

  • Assume DOS format: :set ffs=dos
  • List of files to be converted: :args *.c *.h
  • Change the format of each file in the list and save it: :argdo set ff=unix|w

Other useful commands: :set list and :set nobomb.

sed

sed, Stream EDitor, is one of the most useful command-line utilities for text processing.

It works line by line, and the LF that ends each line is not part of the text the expression sees. So if we know the file is in DOS format (CRLF = \r\n), deleting the last character of each line removes the CR and leaves just LF (\n):

sed -i 's/.$//' file.txt

But be careful: if the file is already in Unix format, you will end up deleting the last real character of each line.
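
One way around that pitfall is to match the carriage return explicitly, so that lines without it are left untouched. This is a minimal sketch, assuming a sed that accepts -i without an argument (GNU sed does); the function name dos_to_unix is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: strip one trailing CR per line, in place.
# Lines that do not end in CR are left exactly as they are,
# so running it on an already converted file is harmless.
dos_to_unix() {
    cr=$(printf '\r')              # literal CR, so we do not rely on sed's \r escape
    sed -i "s/${cr}\$//" "$@"
}

sample=$(mktemp)
printf 'salami\r\npancetta\r\n' > "$sample"

dos_to_unix "$sample"
dos_to_unix "$sample"              # second run changes nothing
od -c "$sample"                    # the dump shows \n but no \r
```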

Conversely, converting a file to DOS format can also be done with GNU sed:

sed -i 's/$/\r/' file.txt
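
Note that the command above appends a CR unconditionally, so running it on a file that already uses CRLF would leave two CRs per line. A possible variant, again assuming GNU sed's -i and using a made-up function name, removes an existing CR first and then adds exactly one:

```shell
#!/bin/sh
# Hypothetical helper: end every line with CRLF, in place.
# A CR already present at the end of a line is removed first,
# so lines never end up with a doubled CR.
unix_to_dos() {
    cr=$(printf '\r')
    sed -i -e "s/${cr}\$//" -e "s/\$/${cr}/" "$@"
}

mixed=$(mktemp)
printf 'tri-tip\nbresaola\r\n' > "$mixed"   # one LF line, one CRLF line

unix_to_dos "$mixed"
od -c "$mixed"                               # both lines now end in \r \n
```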

Tofrodos

If you don’t want to worry about checking if the file already follows the Unix standard, an alternative to sed is Tofrodos. If the file is already converted, it simply leaves it as it is, which is especially useful in scripts.

  1. Download and install Tofrodos, available on the AUR (or dos2unix from the community repository).

  2. Run the following command in the folder that will be version controlled (where ^M is Ctrl+V + Ctrl+M):

    grep -IUrl --color '^M' . | xargs -I file fromdos 'file'
    

This command converts CR+LF to LF in every text file in the current folder and its subfolders (-I makes grep skip binary files).

Replace fromdos with todos to do the opposite conversion (LF to CR+LF).
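
If neither Tofrodos nor dos2unix is installed, a similar batch conversion can be sketched with grep, xargs and sed alone. This is an illustration rather than a drop-in replacement: it assumes the GNU versions of these tools, the function name convert_tree_to_lf is hypothetical, and file names are passed NUL-separated so that spaces in them do not break the pipeline.

```shell
#!/bin/sh
# Hypothetical helper: convert CRLF to LF in every text file under a directory.
# Assumes GNU grep, xargs and sed (-Z, -r and -i as used here are GNU options).
convert_tree_to_lf() {
    cr=$(printf '\r')
    # grep: -I skips binary files, -r recurses, -l lists matching files,
    #       -Z ends each name with NUL instead of a newline.
    # xargs: -0 reads NUL-separated names, -r does nothing on empty input.
    grep -IrlZ "${cr}\$" "$1" | xargs -0 -r sed -i "s/${cr}\$//"
}

project=$(mktemp -d)
mkdir -p "$project/src dir"
printf 'int main(void)\r\n{\r\n}\r\n' > "$project/src dir/main.c"
printf 'already unix\n'               > "$project/notes.txt"

convert_tree_to_lf "$project"
od -c "$project/src dir/main.c"       # no \r left in the dump
```

Only files that contain a line ending in CR are handed to sed, so files already in Unix format are not rewritten.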


Julio Batista Silva
