
Split Unicode String and Specify Byte Offset with TensorFlow and Python
A Unicode string can be split into individual characters with the ‘unicode_split’ function, and the byte offset of each character can be obtained with the ‘unicode_decode_with_offsets’ function. Both functions belong to the ‘strings’ module of the ‘tensorflow’ package (tf.strings).
To begin, represent Unicode strings in Python and manipulate them using the Unicode equivalents of the standard string operations. The same Unicode-aware operations can also separate strings into tokens based on script detection.
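As a minimal sketch of what "representing a Unicode string" means, the plain-Python snippet below (no TensorFlow required) shows the same text as a str, as UTF-8 bytes, and as a list of code points. The variable names are illustrative assumptions; the sample text is the three-emoji string whose code points appear in the Output section.

```python
# Three emoji written as escapes: balloon, party popper, confetti ball
text = "\U0001F388\U0001F389\U0001F38A"

# The same text as UTF-8 encoded bytes
utf8_bytes = text.encode("utf-8")

# The same text as a list of Unicode code points
code_points = [ord(ch) for ch in text]

print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes in the UTF-8 encoding
print(code_points)
```

The character count and the byte count differ because each of these emoji needs four bytes in UTF-8, which is why byte offsets matter when characters are mapped back to the original string.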
We are using Google Colaboratory to run the code below. Google Colab (Colaboratory) runs Python code in the browser, requires zero configuration, and provides free access to GPUs (Graphics Processing Units). Colaboratory is built on top of Jupyter Notebook.
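import tensorflow as tf
# `thanks` was not defined in this snippet; the sample value below is an assumption
thanks = u'Thanks 😊'.encode('UTF-8')

print("Split unicode strings")
tf.strings.unicode_split(thanks, 'UTF-8').numpy()
# Decode to code points and also get the byte offset where each character starts
codepoints, offsets = tf.strings.unicode_decode_with_offsets(u"🎈🎉🎊", 'UTF-8')

print("Printing byte offset for characters")
for (codepoint, offset) in zip(codepoints.numpy(), offsets.numpy()):
    print("At byte offset {}: codepoint {}".format(offset, codepoint))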
Code credit: https://www.tensorflow.org/tutorials/load_data/unicode
Output
Split unicode strings
Printing byte offset for characters
At byte offset 0: codepoint 127880
At byte offset 4: codepoint 127881
At byte offset 8: codepoint 127882
Explanation
- The tf.strings.unicode_split operation splits a Unicode string into substrings of individual characters.
- It is often useful to align the character tensor generated by tf.strings.unicode_decode with the original string.
- For this purpose, the byte offset at which each character begins must be known.
- The tf.strings.unicode_decode_with_offsets method is similar to the unicode_decode method, except that it returns a second tensor containing the start offset of each character. In the output above, each emoji occupies four bytes in UTF-8, so the offsets are 0, 4 and 8.
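The meaning of the start offsets can be cross-checked without TensorFlow. The sketch below is a plain-Python illustration, not how TensorFlow implements the operation; the helper name utf8_offsets is hypothetical. It walks the string, records the byte position at which each character begins in the UTF-8 encoding, and then uses those offsets to slice the original bytes back into characters.

```python
def utf8_offsets(text):
    """Return (code_point, byte_offset) pairs for each character of text.

    Hypothetical helper for illustration only; offsets are measured in
    bytes of the UTF-8 encoding.
    """
    pairs = []
    offset = 0
    for ch in text:
        pairs.append((ord(ch), offset))
        # Advance by the number of bytes this character uses in UTF-8
        offset += len(ch.encode("utf-8"))
    return pairs


text = "\U0001F388\U0001F389\U0001F38A"  # the three emoji from the example
raw = text.encode("utf-8")

pairs = utf8_offsets(text)
for code_point, offset in pairs:
    print("At byte offset {}: codepoint {}".format(offset, code_point))

# Align characters with the original bytes: each character spans from its
# own start offset to the next character's start offset (or the end).
starts = [offset for _, offset in pairs]
ends = starts[1:] + [len(raw)]
chars = [raw[s:e].decode("utf-8") for s, e in zip(starts, ends)]
```

Slicing with consecutive offsets is the kind of alignment the explanation refers to: once the start of every character is known, any per-character result can be mapped back to a byte range in the original string.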